

HTML Purifier is a PHP library for securely filtering any HTML (user) input. I stumbled onto it while reading the book PHP-Sicherheit by Christopher Kunz, Stefan Esser and Peter Prochaska. The book describes ways to prevent security breaches in PHP, and one of the options it presents for preventing XSS vulnerabilities in user-supplied HTML is HTML Purifier.
HTML Purifier makes your input standards-compliant (you can decide which standard it should follow), lets you define a whitelist of allowed elements and attributes, and even lets you add your own elements or attributes if you need them. So in theory it could filter any XML-based markup exactly as you wish, and that is what makes it great.
The basic functionality is easy to use out of the box. Here is an example of how to filter content with a list of allowed HTML elements (instead of allowing all elements in XHTML 1.0 Transitional):
```php
/**
 * Create settings object for new instance of HTML Purifier
 */
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML', 'DefinitionID', 'new-filter-for-user-input');
$config->set('HTML', 'DefinitionRev', 1);
// No caching of this filter definition - remove later!
$config->set('Core', 'DefinitionCache', null);
$config->set('HTML', 'Allowed', 'a[href|title],strong,em,ol[type|start],ul[type],li,blockquote');

/**
 * Create HTMLPurifier object to use on our user input
 */
$purifier = new HTMLPurifier($config);

/**
 * Filter the user input
 */
$filtered_input = $purifier->purify($unfiltered_input);
```
This should already be sufficient for most user input filtering: only links, lists, quotes and bold/italic emphasis are allowed, and all other HTML elements are filtered out. As you can see, the definition of the whitelist is straightforward (line 10, the 'Allowed' setting): just compile a list of the allowed tags, separate them with commas and, if necessary, limit the allowed attributes of a tag in square brackets, separated by |'s. Line 9 (the 'DefinitionCache' setting) should be removed in production; it only prevents caching while testing the code.
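The three-argument set() calls in the listing match the HTML Purifier release this was written against. As far as I can tell from the 4.x documentation, newer releases expect a single dotted key instead, and the cache switch is called Cache.DefinitionImpl. Here is a minimal sketch of the same filter in that style. The Composer autoload path and the sample input string are assumptions of mine, not part of the original example:

```php
<?php
// Assumption: HTML Purifier 4.x installed via Composer (package ezyang/htmlpurifier).
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
// Testing only: disables the definition cache - remove in production.
$config->set('Cache.DefinitionImpl', null);
// Same whitelist as above: element[attribute|attribute], separated by commas.
$config->set('HTML.Allowed', 'a[href|title],strong,em,ol[type|start],ul[type],li,blockquote');

$purifier = new HTMLPurifier($config);

// Hypothetical user input: a non-whitelisted <p> and a <script> block.
$unfiltered_input = '<p>Hello <strong>world</strong><script>alert(1)</script></p>';
$filtered_input = $purifier->purify($unfiltered_input);

echo $filtered_input;
```

The whitelisted strong element should survive, while the script block should not appear in the output at all.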
You may want to spice things up sometimes and add your own elements and/or attributes to the mix so that users can, for example, embed videos with a custom tag (which can then be processed and converted to regular HTML by PHP). If you want to stick to a regular HTML doctype and just add minor things, the following code is an example of how it works:
```php
/**
 * Prepare HTML Purifier to handle the post text
 */
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML', 'Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML', 'DefinitionID', 'new-filter-for-user-input');
$config->set('HTML', 'DefinitionRev', 1);
// No caching of this filter definition - remove later!
$config->set('Core', 'DefinitionCache', null);
$config->set('HTML', 'Allowed', 'a[href|title],strong,em,ol[type|start],ul[type],li,blockquote,video');

/**
 * Get HTML definition we want to expand/tamper with
 */
$def =& $config->getHTMLDefinition(true);

/**
 * The * makes the href attribute of <a>-elements mandatory.
 * This means <a>'s without a href will be filtered out.
 * (Just an example on how to extend/modify existing elements)
 */
$def->addAttribute('a', 'href*', 'URI');

/**
 * Adds the <video> element to the HTML definition,
 * with mandatory attribute "src"
 */
$form =& $def->addElement(
    'video',  // name of the element
    'Inline', // inline element, just as ...
    // (the rest of this listing is missing)
```
With the above code the video element is now allowed in the user input, and additional attributes like “format”/“type” or “repeat” could be added to the definition if necessary. Adding proprietary elements and/or modifying existing elements is easy this way.
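Because the listing above breaks off in the middle of the addElement() call, here is a minimal sketch of how a complete version might look against a current 4.x release. It is not the original code. The 'Empty' content model, the 'Common' attribute collection, the video[src] whitelist entry, the Composer autoload path and the sample input are all assumptions of mine. It also uses the dotted configuration keys and maybeGetRawHTMLDefinition(), which is what the newer documentation describes:

```php
<?php
// Assumption: HTML Purifier 4.x installed via Composer (package ezyang/htmlpurifier).
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('HTML.DefinitionID', 'new-filter-for-user-input');
$config->set('HTML.DefinitionRev', 1);
// Testing only: disables the definition cache - remove in production.
$config->set('Cache.DefinitionImpl', null);
// Assumption: "src" is listed explicitly so the whitelist does not strip it.
$config->set('HTML.Allowed', 'a[href|title],strong,em,video[src]');

// Returns null if a cached definition already exists, so nothing is rebuilt.
if ($def = $config->maybeGetRawHTMLDefinition()) {
    // The * makes href mandatory on <a> elements.
    $def->addAttribute('a', 'href*', 'URI');

    // Hypothetical definition of the custom element.
    $def->addElement(
        'video',                  // name of the element
        'Inline',                 // content set it belongs to
        'Empty',                  // allowed children: none
        'Common',                 // attribute collection (id, class, ...)
        array('src*' => 'URI')    // attributes; * marks "src" as mandatory
    );
}

$purifier = new HTMLPurifier($config);

// Hypothetical input: one video with a source, one without.
$unfiltered_input = '<video src="http://example.com/clip.ogv"></video><video></video>';
$filtered_input = $purifier->purify($unfiltered_input);

echo $filtered_input;
```

With this definition, the video element that lacks the mandatory src attribute should be dropped, while the one with a valid URI is kept.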
More information about adding new elements and/or attributes to the document type can be found in the Customization documentation on the HTML Purifier website. It contains all the details, as this article barely scratches the surface.