Securing Markdown user content with Mozilla Bleach

Markdown is a common choice for rich text formatting due to its readability and ease-of-use. Unlike a lot of markup, it aims to match natural text. It's even easy for beginner users, and there are WYSIWYG editors available.

We will be using the [Python Markdown](https://python-markdown.github.io/) library to convert Markdown to HTML. Markdown doesn't have a well-defined standard. The library aims to comply with what little is defined by the [Markdown syntax specification](https://daringfireball.net/projects/markdown/syntax), meaning that it is also often stricter than other parsers.

To convert Markdown to HTML:

from markdown import Markdown
md = Markdown(extensions=["fenced_code", "tables"], output_format="html5")
html = md.convert(source)

You can use another library to interpret Markdown, if you wish. The rest of the code only deals with the HTML output, so it is independent of the Markdown parser.

Avoid XSS attacks

When allowing user submitted content, it’s important to sanitise it to avoid Cross-Site Scripting attacks (XSS). If you don’t sanitise user input, then an attacker will be able to add HTML tags to run JavaScript when other users view your website. This can be used to steal login credentials, run bitcoin mining malware, or deface your website. So not ideal.
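To see why this matters, here is a minimal sketch showing that Python-Markdown leaves raw HTML in its output by default. The `untrusted` string is a made-up submission used purely for illustration.

```python
from markdown import Markdown

# Hypothetical user submission mixing Markdown with raw HTML
untrusted = "Hello **world**\n\n<script>alert('xss')</script>"

md = Markdown(output_format="html5")
html = md.convert(untrusted)

# Raw HTML is passed through untouched, so the script tag is still there
print(html)
```

If this HTML were served as-is, the script would run in every visitor's browser.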

Bleach, by Mozilla, is a library to sanitise untrusted HTML. It works based on a whitelist of tags and their attributes. I have based my list on the mdx_bleach extension, which you could use directly with the markdown library, but I prefer to use Bleach directly after generating the HTML, as dependencies have a tendency to break.

Another thing that Bleach does is safely linkify text - it can convert text resembling a URL into a link.
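As a rough illustration of linkifying on its own, the sketch below calls `bleach.linkify` on a plain string. The example text and URL are placeholders. With the default callbacks, Bleach also adds `rel="nofollow"` to the links it creates.

```python
import bleach

# Placeholder text containing a bare URL
text = "Docs are at https://example.com/docs for now."

# Wraps anything that looks like a URL in an <a> tag
linked = bleach.linkify(text)

print(linked)
```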

# List of allowed HTML tags
ALLOWED_TAGS = [
    "h1", "h2", "h3", "h4", "h5", "h6", "hr",
    "ul", "ol", "li", "p", "br",
    "pre", "code", "blockquote",
    "strong", "em", "a", "img", "b", "i",
    "table", "thead", "tbody", "tr", "th", "td",
]

# A map of HTML tags to allowed attributes
# If a tag isn't here, then no attributes are allowed
ALLOWED_ATTRIBUTES = {
    "h1": ["id"], "h2": ["id"], "h3": ["id"],  "h4": ["id"],
    "a": ["href", "title"],
    "img": ["src", "title", "alt"],
}

# Allowed protocols in links.
ALLOWED_PROTOCOLS = ["http", "https", "mailto"]

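from functools import partial

import bleach
from bleach.linkifier import LinkifyFilter
from bleach.sanitizer import Cleaner
from markdown import Markdown

md = Markdown(extensions=["fenced_code", "tables"], output_format="html5")


def render_markdown(source):
    # Clear parser state left over from the previous document,
    # then convert Markdown to (still untrusted) HTML
    md.reset()
    html = md.convert(source)

    # Sanitise against the whitelists above and linkify bare URLs
    cleaner = Cleaner(
        tags=ALLOWED_TAGS,
        attributes=ALLOWED_ATTRIBUTES,
        protocols=ALLOWED_PROTOCOLS,
        filters=[partial(LinkifyFilter, callbacks=bleach.linkifier.DEFAULT_CALLBACKS)])

    return cleaner.clean(html)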
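To get a feel for what the whitelists do, here is a small self-contained sketch using `bleach.clean` with a cut-down set of tags and attributes. The `dirty` input and the shortened lists are assumptions made for the example, not the full configuration above.

```python
import bleach

# Cut-down whitelists, for illustration only
tags = ["p", "a", "strong"]
attributes = {"a": ["href", "title"]}
protocols = ["http", "https", "mailto"]

# Made-up hostile input: an event handler, a script tag, and a javascript: link
dirty = (
    '<p onclick="steal()">Hi <strong>there</strong></p>'
    "<script>alert(1)</script>"
    '<a href="javascript:alert(1)">click</a>'
)

clean = bleach.clean(dirty, tags=tags, attributes=attributes, protocols=protocols)

# Allowed tags survive, the onclick attribute and javascript: href are dropped,
# and the script tag is escaped into harmless text
print(clean)
```

By default Bleach escapes disallowed tags rather than deleting them. Passing `strip=True` removes them instead.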

Supporting code highlighting

The CodeHilite extension for Python-Markdown uses Pygments to provide syntax highlighting. You can enable the extension by adding it to the extensions list.

md = Markdown(extensions=["fenced_code", "tables", "codehilite"], output_format="html5")

You will also need to provide the CSS for the Pygments style you choose. I ended up going with Darcula due to personal preference.
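If you would rather generate the stylesheet than download one, Pygments can emit the CSS for any style it ships with. The sketch below assumes the built-in `default` style and CodeHilite's default `codehilite` wrapper class. Adjust both to match your setup.

```python
from pygments.formatters import HtmlFormatter

# "default" is a built-in Pygments style; swap in whichever you prefer
formatter = HtmlFormatter(style="default")

# Scope every rule under .codehilite, the class CodeHilite puts on its wrapper
css = formatter.get_style_defs(".codehilite")

# Show the first few rules; in practice you would save this to a .css file
print(css[:200])
```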

Bleach will strip attributes that aren’t whitelisted, including the class names needed by code highlighting. It’s important that you don’t just allow any class attribute values to be used, as this would allow malicious users to use any CSS class and deface your website. Instead, we will provide a function to the ALLOWED_ATTRIBUTES dictionary, which will check whether the provided values are allowed.

ALLOWED_TAGS = [
    # ...

    "div", "span",
]

ALLOWED_CSS_CLASSES = [
    "highlight", "codehilite",
    "hll", "c", "err", "g", "k", "l", "n", "o", "x", "p", "ch", "cm", "cp", "cpf", "c1", "cs",
    "gd", "ge", "gr", "gh", "gi", "go", "gp", "gs", "gu", "gt", "kc", "kd", "kn", "kp", "kr",
    "kt", "ld", "m", "s", "na", "nb", "nc", "no", "nd", "ni", "ne", "nf", "nl", "nn", "nx",
    "py", "nt", "nv", "ow", "w", "mb", "mf", "mh", "mi", "mo", "sa", "sb", "sc", "dl", "sd",
    "s2", "se", "sh", "si", "sx", "sr", "s1", "ss", "bp", "fm", "vc", "vg", "vi", "vm", "il",
]

def allow_class(_tag, name, value):
    return name == "class" and value in ALLOWED_CSS_CLASSES

ALLOWED_ATTRIBUTES = {
    # etc
    "code": allow_class,
    "div": allow_class,
    "span": allow_class,
}

md = Markdown(extensions=["fenced_code", "tables", "codehilite"], output_format="html5")
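One thing to be aware of: `allow_class` compares the whole attribute value against the list, so a value holding several classes, such as `class="k n"`, would be rejected. If your highlighter ever emits more than one class per element, a variant along these lines may help. The name `allow_classes` and the shortened class set are made up for this sketch.

```python
# Shortened set for illustration; use the full list from above in practice
ALLOWED_CSS_CLASSES = {"highlight", "codehilite", "k", "n", "s"}


def allow_classes(_tag, name, value):
    """Allow a class attribute only if every class in it is whitelisted."""
    if name != "class":
        return False
    classes = value.split()
    # Reject empty values, and any value containing an unknown class
    return bool(classes) and all(c in ALLOWED_CSS_CLASSES for c in classes)


print(allow_classes("span", "class", "k"))           # True
print(allow_classes("span", "class", "k n"))         # True
print(allow_classes("span", "class", "k site-nav"))  # False
print(allow_classes("span", "style", "k"))           # False
```

It takes the same `(tag, name, value)` arguments, so it could be dropped into ALLOWED_ATTRIBUTES in place of `allow_class`.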

And there you are! You can now render untrusted user markdown safely, with code highlighting and linkify.

Hi, I'm Andrew Ward. I'm a software developer, an open source maintainer, and a graduate from the University of Bristol. I’m a core developer for Luanti, an open source voxel game engine.
