Securing Markdown user content with Mozilla Bleach


Markdown is a common choice for rich text formatting due to its readability and ease of use. Unlike most markup languages, it aims to read like natural text. It’s easy even for beginners, and there are WYSIWYG editors available.

We will be using the Python Markdown library to convert Markdown to HTML. Markdown doesn’t have a well-defined standard. The library aims to comply with what little is defined by the Markdown syntax specification, meaning that it is also often stricter than other parsers.

To convert Markdown to HTML:

from markdown import Markdown

# `source` is the Markdown text submitted by the user
md = Markdown(extensions=["fenced_code", "tables"], output_format="html5")
html = md.convert(source)

You can use another library to interpret Markdown, if you wish. The rest of the code only deals with the HTML output, so it is independent of the Markdown parser you choose.

Avoid XSS attacks

When allowing user-submitted content, it’s important to sanitise it to avoid Cross-Site Scripting (XSS) attacks. If you don’t sanitise user input, an attacker will be able to add HTML tags that run JavaScript when other users view your website. This can be used to steal login credentials, run bitcoin mining malware, or deface your website. So not ideal.
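To make the risk concrete, here is a minimal sketch of what happens when untrusted Markdown is converted with no sanitising step. The comment text is a made-up example; the point is only that Python-Markdown passes raw HTML through to the output.

```python
import markdown

# A hypothetical malicious comment: ordinary text followed by raw HTML
untrusted = "Nice post!\n\n<script>alert('xss')</script>"

# Convert with the default settings and no sanitising step
html = markdown.markdown(untrusted)
print(html)
```

The script tag survives conversion, so anything that renders this HTML in a browser would execute it.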

Bleach, by Mozilla, is a library for sanitising untrusted HTML. It works from a whitelist of tags and their attributes. I have based my list on the mdx_bleach extension, which you could use directly with the markdown library, but I prefer to call Bleach myself after generating the HTML, as dependencies have a tendency to break.

Another thing that Bleach does is safely linkify text: it can convert text resembling a URL into a link.
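As a quick illustration of linkifying on its own, the sketch below calls bleach.linkify on a plain string. The sample sentence is an assumption for the example; with the default callbacks, Bleach also adds rel="nofollow" to the links it creates.

```python
import bleach

# Plain text containing a bare URL (made-up example input)
text = "Docs are at http://example.com if you need them."

# linkify wraps URL-like text in an <a> tag using the default callbacks
linked = bleach.linkify(text)
print(linked)
```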

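from functools import partial

import bleach
from bleach.linkifier import LinkifyFilter
from bleach.sanitizer import Cleaner
from markdown import Markdown

# List of allowed HTML tags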
ALLOWED_TAGS = [
    "h1", "h2", "h3", "h4", "h5", "h6", "hr",
    "ul", "ol", "li", "p", "br",
    "pre", "code", "blockquote",
    "strong", "em", "a", "img", "b", "i",
    "table", "thead", "tbody", "tr", "th", "td",
]

# A map of HTML tags to allowed attributes
# If a tag isn't here, then no attributes are allowed
ALLOWED_ATTRIBUTES = {
    "h1": ["id"], "h2": ["id"], "h3": ["id"],  "h4": ["id"],
    "a": ["href", "title"],
    "img": ["src", "title", "alt"],
}

# Allowed protocols in links.
ALLOWED_PROTOCOLS = ["http", "https", "mailto"]

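# Enable the same extensions as before so tables and fenced code are rendered
md = Markdown(extensions=["fenced_code", "tables"], output_format="html5")


def render_markdown(source):
    # Reset parser state in case the shared instance was used before
    md.reset()
    html = md.convert(source)

    # Sanitise against the whitelist, then linkify bare URLs in the result
    cleaner = Cleaner(
        tags=ALLOWED_TAGS,
        attributes=ALLOWED_ATTRIBUTES,
        protocols=ALLOWED_PROTOCOLS,
        filters=[partial(LinkifyFilter, callbacks=bleach.linkifier.DEFAULT_CALLBACKS)],
    )

    return cleaner.clean(html)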
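To see what the whitelist does to hostile input, here is a small standalone sketch using bleach.clean with a deliberately tiny whitelist. The input string and the shortened tag set are assumptions for the example, not the full lists above.

```python
import bleach

# Made-up hostile HTML: an event handler attribute and a script tag
dirty = '<p onclick="steal()">Hi <b>there</b></p><script>alert(1)</script>'

clean = bleach.clean(
    dirty,
    tags={"p", "b"},                          # shortened whitelist for the example
    attributes={},                            # no attributes allowed on any tag
    protocols={"http", "https", "mailto"},
)
print(clean)
```

Allowed tags are kept, the onclick attribute is dropped, and the script tag is escaped into harmless text rather than left as markup.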

Supporting code highlighting

The CodeHilite extension for Python-Markdown uses Pygments to provide syntax highlighting. You can enable the extension by adding it to the extensions list.

md = Markdown(extensions=["fenced_code", "tables", "codehilite"], output_format="html5")

You will also need to provide the .css file for the Pygments style you choose. I ended up going with Darcula due to personal preference.
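If you would rather generate a stylesheet than download one, Pygments can emit the CSS itself. This is a sketch under two assumptions: it uses the built-in "default" style as a stand-in for whichever style you pick, and it scopes the rules to .codehilite, the wrapper class that CodeHilite uses by default.

```python
from pygments.formatters import HtmlFormatter

# "default" is a placeholder style name; swap in the style you prefer
formatter = HtmlFormatter(style="default")

# Scope every rule to the .codehilite wrapper so it only affects code blocks
css = formatter.get_style_defs(".codehilite")

# Show the first few rules; in practice you would write `css` to a .css file
print("\n".join(css.splitlines()[:5]))
```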

Bleach will strip attributes that aren’t whitelisted, including the class names needed by code highlighting. It’s important that you don’t just allow any class attribute value, as this would let malicious users apply any CSS class on your site and deface it. Instead, we will put a function in the ALLOWED_ATTRIBUTES dictionary. Bleach calls it with the tag, attribute name and attribute value, and keeps the attribute only if the function returns True.

ALLOWED_TAGS = [
    # ...

    "div", "span",
]

ALLOWED_CSS_CLASSES = [
    "highlight", "codehilite",
    "hll", "c", "err", "g", "k", "l", "n", "o", "x", "p", "ch", "cm", "cp", "cpf", "c1", "cs",
    "gd", "ge", "gr", "gh", "gi", "go", "gp", "gs", "gu", "gt", "kc", "kd", "kn", "kp", "kr",
    "kt", "ld", "m", "s", "na", "nb", "nc", "no", "nd", "ni", "ne", "nf", "nl", "nn", "nx",
    "py", "nt", "nv", "ow", "w", "mb", "mf", "mh", "mi", "mo", "sa", "sb", "sc", "dl", "sd",
    "s2", "se", "sh", "si", "sx", "sr", "s1", "ss", "bp", "fm", "vc", "vg", "vi", "vm", "il",
]

def allow_class(_tag, name, value):
    return name == "class" and value in ALLOWED_CSS_CLASSES

ALLOWED_ATTRIBUTES = {
    # etc
    "code": allow_class,
    "div": allow_class,
    "span": allow_class,
}

md = Markdown(extensions=["fenced_code", "tables", "codehilite"], output_format="html5")
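One possible variation on the class check: a class attribute can hold several space-separated names, so a stricter-but-more-flexible filter can test each name individually. The sketch below is an assumption-laden alternative, not part of the setup above; allow_classes is a hypothetical name and the class set is shortened for the example.

```python
# Shortened, hypothetical whitelist; a real one would list every Pygments class
ALLOWED_CSS_CLASSES = {"highlight", "codehilite", "k", "n", "s"}


def allow_classes(_tag, name, value):
    """Allow a class attribute only if every listed class is whitelisted."""
    if name != "class":
        return False
    # split() handles any run of whitespace between class names
    return all(cls in ALLOWED_CSS_CLASSES for cls in value.split())


print(allow_classes("span", "class", "k"))               # True
print(allow_classes("span", "class", "k n"))             # True
print(allow_classes("span", "class", "k admin-banner"))  # False
print(allow_classes("span", "style", "color: red"))      # False
```

It has the same (tag, name, value) signature as allow_class, so it could be dropped into ALLOWED_ATTRIBUTES in the same way.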

And there you are! You can now render untrusted user Markdown safely, with code highlighting and linkified URLs.