diff options
Diffstat (limited to 'lib/htmlpurifier/docs/enduser-overview.txt')
-rw-r--r-- | lib/htmlpurifier/docs/enduser-overview.txt | 59 |
1 files changed, 0 insertions, 59 deletions
diff --git a/lib/htmlpurifier/docs/enduser-overview.txt b/lib/htmlpurifier/docs/enduser-overview.txt deleted file mode 100644 index fe7f8705d..000000000 --- a/lib/htmlpurifier/docs/enduser-overview.txt +++ /dev/null @@ -1,59 +0,0 @@ - -HTML Purifier - by Edward Z. Yang - -There are a number of ad hoc HTML filtering solutions out there on the web -(some examples including HTML_Safe, kses and SafeHtmlChecker.class.php) that -claim to filter HTML properly, preventing malicious JavaScript and layout -breaking HTML from getting through the parser. None of them, however, -demonstrates a thorough knowledge of neither the DTD that defines the HTML -nor the caveats of HTML that cannot be expressed by a DTD. Configurable -filters (such as kses or PHP's built-in striptags() function) have trouble -validating the contents of attributes and can be subject to security attacks -due to poor configuration. Other filters take the naive approach of -blacklisting known threats and tags, failing to account for the introduction -of new technologies, new tags, new attributes or quirky browser behavior. - -However, HTML Purifier takes a different approach, one that doesn't use -specification-ignorant regexes or narrow blacklists. HTML Purifier will -decompose the whole document into tokens, and rigorously process the tokens by: -removing non-whitelisted elements, transforming bad practice tags like <font> -into <span>, properly checking the nesting of tags and their children and -validating all attributes according to their RFCs. - -To my knowledge, there is nothing like this on the web yet. Not even MediaWiki, -which allows an amazingly diverse mix of HTML and wikitext in its documents, -gets all the nesting quirks right. Existing solutions hope that no JavaScript -will slip through, but either do not attempt to ensure that the resulting -output is valid XHTML or send the HTML through a draconic XML parser (and yet -still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a> -tags from being nested within each other). - -This document no longer is a detailed description of how HTMLPurifier works, -as those descriptions have been moved to the appropriate code. The first -draft was drawn up after two rough code sketches and the implementation of a -forgiving lexer. You may also be interested in the unit tests located in the -tests/ folder, which provide a living document on how exactly the filter deals -with malformed input. - -In summary (see corresponding classes for more details): - -1. Parse document into an array of tag and text tokens (Lexer) -2. Remove all elements not on whitelist and transform certain other elements - into acceptable forms (i.e. <font>) -3. Make document well formed while helpfully taking into account certain quirks, - such as the fact that <p> tags traditionally are closed by other block-level - elements. -4. Run through all nodes and check children for proper order (especially - important for tables). -5. Validate attributes according to more restrictive definitions based on the - RFCs. -6. Translate back into a string. (Generator) - -HTML Purifier is best suited for documents that require a rich array of -HTML tags. Things like blog comments are, in all likelihood, most appropriately -written in an extremely restrictive set of markup that doesn't require -all this functionality (or not written in HTML at all), although this may -be changing in the future with the addition of levels of filtering. - - vim: et sw=4 sts=4 |