author | friendica <info@friendica.com> | 2012-07-18 03:59:10 -0700 |
---|---|---|
committer | friendica <info@friendica.com> | 2012-07-18 03:59:10 -0700 |
commit | 22cf19e174bcee88b44968f2773d1bad2da2b54d (patch) | |
tree | f4e01db6f73754418438b020c2327e18c256653c /lib/htmlpurifier/docs | |
parent | 7a40f4354b32809af3d0cfd6e3af0eda02ab0e0a (diff) | |
bad sync with github windows client
Diffstat (limited to 'lib/htmlpurifier/docs')
46 files changed, 0 insertions, 7840 deletions
diff --git a/lib/htmlpurifier/docs/dev-advanced-api.html b/lib/htmlpurifier/docs/dev-advanced-api.html deleted file mode 100644 index 5b7aaa3c8..000000000 --- a/lib/htmlpurifier/docs/dev-advanced-api.html +++ /dev/null @@ -1,26 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Specification for HTML Purifier's advanced API for defining custom filtering behavior." /> -<link rel="stylesheet" type="text/css" href="style.css" /> - -<title>Advanced API - HTML Purifier</title> - -</head><body> - -<h1>Advanced API</h1> - -<div id="filing">Filed under Development</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p> - Please see <a href="enduser-customize.html">Customize!</a> -</p> - -</body></html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/dev-code-quality.txt b/lib/htmlpurifier/docs/dev-code-quality.txt deleted file mode 100644 index bceedebc4..000000000 --- a/lib/htmlpurifier/docs/dev-code-quality.txt +++ /dev/null @@ -1,29 +0,0 @@ - -Code Quality Issues - -Okay, face it. Programmers can get lazy, cut corners, or make mistakes. They -also can do quick prototypes, and then forget to rewrite them later. Well, -while I can't list mistakes in here, I can list prototype-like segments -of code that should be aggressively refactored. This does not list -optimization issues, that needs to be done after intense profiling. 
- -docs/examples/demo.php - ad hoc HTML/PHP soup to the extreme - -AttrDef - a lot of duplication, more generic classes need to be created; -a lot of strtolower() calls, no legit casing - Class - doesn't support Unicode characters (fringe); uses regular expressions - Lang - code duplication; premature optimization - Length - easily mistaken for CSSLength - URI - multiple regular expressions; missing validation for parts (?) - CSS - parser doesn't accept advanced CSS (fringe) - Number - constructor interface inconsistent with Integer -Strategy - FixNesting - cannot bubble nodes out of structures, duplicated checks - for special-case parent node - RemoveForeignElements - should be run in parallel with MakeWellFormed -URIScheme - needs to have callable generic checks - mailto - doesn't validate emails, doesn't validate querystring - news - doesn't validate opaque path - nntp - doesn't constrain path - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/dev-config-bcbreaks.txt b/lib/htmlpurifier/docs/dev-config-bcbreaks.txt deleted file mode 100644 index 29a58ca2f..000000000 --- a/lib/htmlpurifier/docs/dev-config-bcbreaks.txt +++ /dev/null @@ -1,79 +0,0 @@ - -Configuration Backwards-Compatibility Breaks - -In version 4.0.0, the configuration subsystem (composed of the outwards -facing Config class, as well as the ConfigSchema and ConfigSchema_Interchange -subsystems), was significantly revamped to make use of property lists. -While most of the changes are internal, some internal APIs were changed for the -sake of clarity. HTMLPurifier_Config was kept completely backwards compatible, -although some of the functions were retrofitted with an unambiguous alternate -syntax. Both of these changes are discussed in this document. - - - -1. Outwards Facing Changes --------------------------------------------------------------------------------- - -The HTMLPurifier_Config class now takes an alternate syntax. 
The general rule -is: - - If you passed $namespace, $directive, pass "$namespace.$directive" - instead. - -An example: - - $config->set('HTML', 'Allowed', 'p'); - -becomes: - - $config->set('HTML.Allowed', 'p'); - -New configuration options may have more than one namespace, they might -look something like %Filter.YouTube.Blacklist. While you could technically -set it with ('HTML', 'YouTube.Blacklist'), the logical extension -('HTML', 'YouTube', 'Blacklist') does not work. - -The old API will still work, but will emit E_USER_NOTICEs. - - - -2. Internal API Changes --------------------------------------------------------------------------------- - -Some overarching notes: we've completely eliminated the notion of namespace; -it's now an informal construct for organizing related configuration directives. - -Also, the validation routines for keys (formerly "$namespace.$directive") -have been completely relaxed. I don't think it really should be necessary. - -2.1 HTMLPurifier_ConfigSchema - -First off, if you're interfacing with this class, you really shouldn't. -HTMLPurifier_ConfigSchema_Builder_ConfigSchema is really the only class that -should ever be creating HTMLPurifier_ConfigSchema, and HTMLPurifier_Config the -only class that should be reading it. - -All namespace related methods were removed; they are completely unnecessary -now. Any $namespace, $name arguments must be replaced with $key (where -$key == "$namespace.$name"), including for addAlias(). - -The $info and $defaults member variables are no longer indexed as -[$namespace][$name]; they are now indexed as ["$namespace.$name"]. - -All deprecated methods were finally removed, after having yelled at you as -an E_USER_NOTICE for a while now. - -2.2 HTMLPurifier_ConfigSchema_Interchange - -Member variable $namespaces was removed. - -2.3 HTMLPurifier_ConfigSchema_Interchange_Id - -Member variable $namespace and $directive removed; member variable $key added. 
-Any method that took $namespace, $directive now takes $key. - -2.4 HTMLPurifier_ConfigSchema_Interchange_Namespace - -Removed. - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/dev-config-naming.txt b/lib/htmlpurifier/docs/dev-config-naming.txt deleted file mode 100644 index 66db5bce3..000000000 --- a/lib/htmlpurifier/docs/dev-config-naming.txt +++ /dev/null @@ -1,164 +0,0 @@ -Configuration naming - -HTML Purifier 4.0.0 features a new configuration naming system that -allows arbitrary nesting of namespaces. While there are certain cases -in which using two namespaces is obviously better (the canonical example -is where we were using AutoFormatParam to contain directives for AutoFormat -parameters), it is unclear whether or not a general migration to highly -namespaced directives is a good idea or not. - -== Case studies == - -=== Attr.* === - -We have a dead duck HTML.Attr.Name.UseCDATA which migrated before we decided -to think this out thoroughly. - -We currently have a large number of directives in the Attr.* namespace. -These directives tweak the behavior of some HTML attributes. They have -the properties: - -* While they apply to only one attribute at a time, the attribute can - span over multiple elements (not necessarily all attributes, either). - The information of which elements it impacts is either omitted or - informally stated (EnableID applies to all elements, DefaultImageAlt - applies to <img> tags, AllowedRev doesn't say but only applies to a tags). - -* There is a certain degree of clustering that could be applied, especially - to the ID directives. The clustering could be done with respect to - what element/attribute was used, i.e. 
- - *.id -> EnableID, IDBlacklistRegexp, IDBlacklist, IDPrefixLocal, IDPrefix - img.src -> DefaultInvalidImage - img.alt -> DefaultImageAlt, DefaultInvalidImageAlt - bdo.dir -> DefaultTextDir - a.rel -> AllowedRel - a.rev -> AllowedRev - a.target -> AllowedFrameTargets - a.name -> Name.UseCDATA - -* The directives often reference generic attribute types that were specified - in the DTD/specification. However, some of the behavior specifically relies - on the fact that other use cases of the attribute are not, at current, - supported by HTML Purifier. - - AllowedRel, AllowedRev -> heavily <a> specific; if <link> ends up being - allowed, we will also have to give users specificity there (we also - want to preserve generality) DTD %Linktypes, HTML5 distinguishes - between <link> and <a>/<area> - AllowedFrameTargets -> heavily <a> specific, but also used by <area> - and <form>. Transitional DTD %FrameTarget, not present in strict, - HTML5 calls them "browsing contexts" - Default*Image* -> as a default parameter, is almost entirely exlcusive - to <img> - EnableID -> global attribute - Name.UseCDATA -> heavily <a> specific, but has heavy other usage by - many things - -== AutoFormat.* == - -These have the fairly normal pluggable architecture that lends itself to -large amounts of namespaces (pluggability may be the key to figuring -out when gratuitous namespacing is good.) Properties: - -* Boolean directives are fair game for being namespaced: for example, - RemoveEmpty.RemoveNbsp triggers RemoveEmpty.RemoveNbsp.Exceptions, - the latter of which only makes sense when RemoveEmpty.RemoveNbsp - is set to true. (The same applies to RemoveNbsp too) - -The AutoFormat string is a bit long, but is the only bit of repeated -context. - -== Core.* == - -Core is the potpourri of directives, mostly regarding some minor behavioral -tweaks for HTML handling abilities. 
- - AggressivelyFixLt - ConvertDocumentToFragment - DirectLexLineNumberSyncInterval - LexerImpl - MaintainLineNumbers - Lexer - CollectErrors - Language - Error handling (Language is ostensibly a little more general, but - it's only used for error handling right now) - ColorKeywords - CSS and HTML - Encoding - EscapeNonASCIICharacters - Character encoding - EscapeInvalidChildren - EscapeInvalidTags - HiddenElements - RemoveInvalidImg - Lexing/Output - RemoveScriptContents - Deprecated - -== HTML.* == - - AllowedAttributes - AllowedElements - AllowedModules - Allowed - ForbiddenAttributes - ForbiddenElements - Element set tuning - BlockWrapper - Child def advanced twiddle - CoreModules - CustomDoctype - Advanced HTMLModuleManager twiddles - DefinitionID - DefinitionRev - Caching - Doctype - Parent - Strict - XHTML - Global environment - MaxImgLength - Attribute twiddle? (applies to two attributes) - Proprietary - SafeEmbed - SafeObject - Trusted - Extra functionality/tagsets - TidyAdd - TidyLevel - TidyRemove - Tidy - -== Output.* == - -These directly affect the output of Generator. These are all advanced -twiddles. 
- -== URI.* == - - AllowedSchemes - OverrideAllowedSchemes - Scheme tuning - Base - DefaultScheme - Host - Global environment - DefinitionID - DefinitionRev - Caching - DisableExternalResources - DisableExternal - DisableResources - Disable - Contextual/authority tuning - HostBlacklist - Authority tuning - MakeAbsolute - MungeResources - MungeSecretKey - Munge - Transformation behavior (munge can be grouped) - - diff --git a/lib/htmlpurifier/docs/dev-config-schema.html b/lib/htmlpurifier/docs/dev-config-schema.html deleted file mode 100644 index 07aecd35a..000000000 --- a/lib/htmlpurifier/docs/dev-config-schema.html +++ /dev/null @@ -1,412 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> - <head> - <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> - <meta name="description" content="Describes config schema framework in HTML Purifier." /> - <link rel="stylesheet" type="text/css" href="./style.css" /> - <title>Config Schema - HTML Purifier</title> - </head> - <body> - - <h1>Config Schema</h1> - - <div id="filing">Filed under Development</div> - <div id="index">Return to the <a href="index.html">index</a>.</div> - <div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - - <p> - HTML Purifier has a fairly complex system for configuration. Users - interact with a <code>HTMLPurifier_Config</code> object to - set configuration directives. The values they set are validated according - to a configuration schema, <code>HTMLPurifier_ConfigSchema</code>. - </p> - - <p> - The schema is mostly transparent to end-users, but if you're doing development - work for HTML Purifier and need to define a new configuration directive, - you'll need to interact with it. 
We'll also talk about how to define - userspace configuration directives at the very end. - </p> - - <h2>Write a directive file</h2> - - <p> - Directive files define configuration directives to be used by - HTML Purifier. They are placed in <code>library/HTMLPurifier/ConfigSchema/schema/</code> - in the form <code><em>Namespace</em>.<em>Directive</em>.txt</code> (I - couldn't think of a more descriptive file extension.) - Directive files are actually what we call <code>StringHash</code>es, - i.e. associative arrays represented in a string form reminiscent of - <a href="http://qa.php.net/write-test.php">PHPT</a> tests. Here's a - sample directive file, <code>Test.Sample.txt</code>: - </p> - - <pre>Test.Sample -TYPE: string/null -DEFAULT: NULL -ALLOWED: 'foo', 'bar' -VALUE-ALIASES: 'baz' => 'bar' -VERSION: 3.1.0 ---DESCRIPTION-- -This is a sample configuration directive for the purposes of the -<code>dev-config-schema.html<code> documentation. ---ALIASES-- -Test.Example</pre> - - <p> - Each of these segments has a specific meaning: - </p> - - <table class="table"> - <thead> - <tr> - <th>Key</th> - <th>Example</th> - <th>Description</th> - </tr> - </thead> - <tbody> - <tr> - <td>ID</td> - <td>Test.Sample</td> - <td>The name of the directive, in the form Namespace.Directive - (implicitly the first line)</td> - </tr> - <tr> - <td>TYPE</td> - <td>string/null</td> - <td>The type of variable this directive accepts. See below for - details. You can also add <code>/null</code> to the end of - any basic type to allow null values too.</td> - </tr> - <tr> - <td>DEFAULT</td> - <td>NULL</td> - <td>A parseable PHP expression of the default value.</td> - </tr> - <tr> - <td>DESCRIPTION</td> - <td>This is a...</td> - <td>An HTML description of what this directive does.</td> - </tr> - <tr> - <td>VERSION</td> - <td>3.1.0</td> - <td><em>Recommended</em>. The version of HTML Purifier this directive was added. 
- Directives that have been around since 1.0.0 don't have this, - but any new ones should.</td> - </tr> - <tr> - <td>ALIASES</td> - <td>Test.Example</td> - <td><em>Optional</em>. A comma separated list of aliases for this directive. - This is most useful for backwards compatibility and should - not be used otherwise.</td> - </tr> - <tr> - <td>ALLOWED</td> - <td>'foo', 'bar'</td> - <td><em>Optional</em>. Set of allowed value for a directive, - a comma separated list of parseable PHP expressions. This - is only allowed string, istring, text and itext TYPEs.</td> - </tr> - <tr> - <td>VALUE-ALIASES</td> - <td>'baz' => 'bar'</td> - <td><em>Optional</em>. Mapping of one value to another, and - should be a comma separated list of keypair duples. This - is only allowed string, istring, text and itext TYPEs.</td> - </tr> - <tr> - <td>DEPRECATED-VERSION</td> - <td>3.1.0</td> - <td><em>Not shown</em>. Indicates that the directive was - deprecated this version.</td> - </tr> - <tr> - <td>DEPRECATED-USE</td> - <td>Test.NewDirective</td> - <td><em>Not shown</em>. Indicates what new directive should be - used instead. Note that the directives will functionally be - different, although they should offer the same functionality. - If they are identical, use an alias instead.</td> - </tr> - <tr> - <td>EXTERNAL</td> - <td>CSSTidy</td> - <td><em>Not shown</em>. Indicates if there is an external library - the user will need to download and install to use this configuration - directive. As of right now, this is merely a Google-able name; future - versions may also provide links and instructions.</td> - </tr> - </tbody> - </table> - - <p> - Some notes on format and style: - </p> - - <ul> - <li> - Each of these keys can be expressed in the short format - (<code>KEY: Value</code>) or the long format - (<code>--KEY--</code> with value beneath). 
You must use the - long format if multiple lines are needed, or if a long format - has been used already (that's why <code>ALIASES</code> in our - example is in the long format); otherwise, it's user preference. - </li> - <li> - The HTML descriptions should be wrapped at about 80 columns; do - not rely on editor word-wrapping. - </li> - </ul> - - <p> - Also, as promised, here is the set of possible types: - </p> - - <table class="table"> - <thead> - <tr> - <th>Type</th> - <th>Example</th> - <th>Description</th> - </tr> - </thead> - <tbody> - <tr> - <td>string</td> - <td>'Foo'</td> - <td><a href="http://docs.php.net/manual/en/language.types.string.php">String</a> without newlines</td> - </tr> - <tr> - <td>istring</td> - <td>'foo'</td> - <td>Case insensitive ASCII string without newlines</td> - </tr> - <tr> - <td>text</td> - <td>"A<em>\n</em>b"</td> - <td>String with newlines</td> - </tr> - <tr> - <td>itext</td> - <td>"a<em>\n</em>b"</td> - <td>Case insensitive ASCII string without newlines</td> - </tr> - <tr> - <td>int</td> - <td>23</td> - <td>Integer</td> - </tr> - <tr> - <td>float</td> - <td>3.0</td> - <td>Floating point number</td> - </tr> - <tr> - <td>bool</td> - <td>true</td> - <td>Boolean</td> - </tr> - <tr> - <td>lookup</td> - <td>array('key' => true)</td> - <td>Lookup array, used with <code>isset($var[$key])</code></td> - </tr> - <tr> - <td>list</td> - <td>array('f', 'b')</td> - <td>List array, with ordered numerical indexes</td> - </tr> - <tr> - <td>hash</td> - <td>array('key' => 'val')</td> - <td>Associative array of keys to values</td> - </tr> - <tr> - <td>mixed</td> - <td>new stdclass</td> - <td>Any PHP variable is fine</td> - </tr> - </tbody> - </table> - - <p> - The examples represent what will be returned out of the configuration - object; users have a little bit of leeway when setting configuration - values (for example, a lookup value can be specified as a list; - HTML Purifier will flip it as necessary.) 
These types are defined - in <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/VarParser.php"> - library/HTMLPurifier/VarParser.php</a>. - </p> - - <p> - For more information on what values are allowed, and how they are parsed, - consult <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ConfigSchema/InterchangeBuilder.php"> - library/HTMLPurifier/ConfigSchema/InterchangeBuilder.php</a>, as well - as <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ConfigSchema/Interchange/Directive.php"> - library/HTMLPurifier/ConfigSchema/Interchange/Directive.php</a> for - the semantics of the parsed values. - </p> - - <h2>Refreshing the cache</h2> - - <p> - You may have noticed that your directive file isn't doing anything - yet. That's because it hasn't been added to the runtime - <code>HTMLPurifier_ConfigSchema</code> instance. Run - <code>maintenance/generate-schema-cache.php</code> to fix this. - If there were no errors, you're good to go! Don't forget to add - some unit tests for your functionality! - </p> - - <p> - If you ever make changes to your configuration directives, you - will need to run this script again. - </p> - <h2>Adding in-house schema definitions</h2> - - <p> - Placing stuff directly in HTML Purifier's source tree is generally not a - good idea, so HTML Purifier 4.0.0+ has some facilities in place to make your - life easier. - </p> - - <p> - The first is to pass an extra parameter to <code>maintenance/generate-schema-cache.php</code> - with the location of your directory (relative or absolute path will do). For example, - if I'm storing my custom definitions in <em>/var/htmlpurifier/myschema</em>, run: - <code>php maintenance/generate-schema-cache.php /var/htmlpurifier/myschema</code>. 
- </p> - - <p> - Alternatively, you can create a small loader PHP file in the HTML Purifier base - directory named <code>config-schema.php</code> (this is the same directory - you would place a <code>test-settings.php</code> file). In this file, add - the following line for each directory you want to load: - </p> - -<pre>$builder->buildDir($interchange, '/var/htmlpurifier/myschema');</pre> - - <p>You can even load a single file using:</p> - -<pre>$builder->buildFile($interchange, '/var/htmlpurifier/myschema/MyApp.Directive.txt');</pre> - - <p>Storing custom definitions that you don't plan on sending back upstream in - a separate directory is <em>definitely</em> a good idea! Additionally, picking - a good namespace can go a long way to saving you grief if you want to use - someone else's change, but they picked the same name, or if HTML Purifier - decides to add support for a configuration directive that has the same name.</p> - - <!-- TODO: how to name directives that rely on naming conventions --> - - <h2>Errors</h2> - - <p> - All directive files go through a rigorous validation process - through <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ConfigSchema/Validator.php"> - library/HTMLPurifier/ConfigSchema/Validator.php</a>, as well - as some basic checks during building. While - listing every error out here is out-of-scope for this document, we - can give some general tips for interpreting error messages. - There are two types of errors: builder errors and validation errors. - </p> - - <h3>Builder errors</h3> - - <blockquote> - <p> - <strong>Exception:</strong> Expected type string, got - integer in DEFAULT in directive hash 'Ns.Dir' - </p> - </blockquote> - - <p> - You can identify a builder error by the keyword "directive hash." - These are the easiest to deal with, because they directly correspond - with your directive file. 
Find the offending directive file (which - is the directive hash plus the .txt extension), find the - offending index ("in DEFAULT" means the DEFAULT key) and fix the error. - This particular error would occur if your default value is not the same - type as TYPE. - </p> - - <h3>Validation errors</h3> - - <blockquote> - <p> - <strong>Exception:</strong> Alias 3 in valueAliases in directive - 'Ns.Dir' must be a string - </p> - </blockquote> - - <p> - These are a little trickier, because we're not actually validating - your directive file, or even the direct string hash representation. - We're validating an Interchange object, and the error messages do - not mention any string hash keys. - </p> - - <p> - Nevertheless, it's not difficult to figure out what went wrong. - Read the "context" statements in reverse: - </p> - - <dl> - <dt>in directive 'Ns.Dir'</dt> - <dd>This means we need to look at the directive file <code>Ns.Dir.txt</code></dd> - <dt>in valueAliases</dt> - <dd>There's no key actually called this, but there's one that's close: - VALUE-ALIASES. Indeed, that's where to look.</dd> - <dt>Alias 3</dt> - <dd>The value alias that is equal to 3 is the culprit.</dd> - </dl> - - <p> - In this particular case, you're not allowed to alias integers values to - strings values. - </p> - - <p> - The most difficult part is translating the Interchange member variable (valueAliases) - into a directive file key (VALUE-ALIASES), but there's a one-to-one - correspondence currently. If the two formats diverge, any discrepancies - will be described in <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ConfigSchema/InterchangeBuilder.php"> - library/HTMLPurifier/ConfigSchema/InterchangeBuilder.php</a>. - </p> - - <h2>Internals</h2> - - <p> - Much of the configuration schema framework's codebase deals with - shuffling data from one format to another, and doing validation on this - data. 
- The keystone of all of this is the <code>HTMLPurifier_ConfigSchema_Interchange</code> - class, which represents the purest, parsed representation of the schema. - </p> - - <p> - Hand-writing this data is unwieldy, however, so we write directive files. - These directive files are parsed by <code>HTMLPurifier_StringHashParser</code> - into <code>HTMLPurifier_StringHash</code>es, which then - are run through <code>HTMLPurifier_ConfigSchema_InterchangeBuilder</code> - to construct the interchange object. - </p> - - <p> - From the interchange object, the data can be siphoned into other forms - using <code>HTMLPurifier_ConfigSchema_Builder</code> subclasses. - For example, <code>HTMLPurifier_ConfigSchema_Builder_ConfigSchema</code> - generates a runtime <code>HTMLPurifier_ConfigSchema</code> object, - which <code>HTMLPurifier_Config</code> uses to validate its incoming - data. There is also an XML serializer, which is used to build documentation. - </p> - - </body> -</html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/dev-flush.html b/lib/htmlpurifier/docs/dev-flush.html deleted file mode 100644 index 4a3a78351..000000000 --- a/lib/htmlpurifier/docs/dev-flush.html +++ /dev/null @@ -1,68 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> -<head> - <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> - <meta name="description" content="Discusses when to flush HTML Purifier's various caches." 
/> - <link rel="stylesheet" type="text/css" href="./style.css" /> - <title>Flushing the Purifier - HTML Purifier</title> -</head> -<body> - -<h1>Flushing the Purifier</h1> - -<div id="filing">Filed under Development</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p> - If you've been poking around the various folders in HTML Purifier, - you may have noticed the <code>maintenance</code> directory. Almost - all of these scripts are devoted to flushing out the various caches - HTML Purifier uses. Normal users don't have to worry about this: - regular library usage is transparent. However, when doing development - work on HTML Purifier, you may find you have to flush one of the - caches. -</p> - -<p> - As a general rule of thumb, run <code>flush.php</code> whenever you make - any <em>major</em> changes, or when tests start mysteriously failing. - In more detail, run this script if: -</p> - -<ul> - <li> - You added new source files to HTML Purifier's main library. - (see <code>generate-includes.php</code>) - </li> - <li> - You modified the configuration schema (see - <code>generate-schema-cache.php</code>). This usually means - adding or modifying files in <code>HTMLPurifier/ConfigSchema/schema/</code>, - although in rare cases modifying <code>HTMLPurifier/ConfigSchema.php</code> - will also require this. - </li> - <li> - You modified a Definition, or its subsystems. The most usual candidate - is <code>HTMLPurifier/HTMLDefinition.php</code>, which also encompasses - the files in <code>HTMLPurifier/HTMLModule/</code> as well as if you've - <a href="enduser-customize.html">customizing definitions</a> without - the cache disabled. (see <code>flush-generation-cache.php</code>) - </li> - <li> - You modified source files, and have been using the standalone - version from the full installation. 
(see <code>generate-standalone.php</code>) - </li> -</ul> - -<p> - You can check out the corresponding scripts for more information on what they - do. -</p> - -</body></html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/dev-includes.txt b/lib/htmlpurifier/docs/dev-includes.txt deleted file mode 100644 index d3382b593..000000000 --- a/lib/htmlpurifier/docs/dev-includes.txt +++ /dev/null @@ -1,281 +0,0 @@ - -INCLUDES, AUTOLOAD, BYTECODE CACHES and OPTIMIZATION - -The Problem ------------ - -HTML Purifier contains a number of extra components that are not used all -of the time, only if the user explicitly specifies that we should use -them. - -Some of these optional components are optionally included (Filter, -Language, Lexer, Printer), while others are included all the time -(Injector, URIFilter, HTMLModule, URIScheme). We will stipulate that these -are all developer specified: it is conceivable that certain Tokens are not -used, but this is user-dependent and should not be trusted. - -We should come up with a consistent way to handle these things and ensure -that we get the maximum performance when there is bytecode caches and -when there are not. Unfortunately, these two goals seem contrary to each -other. - -A peripheral issue is the performance of ConfigSchema, which has been -shown take a large, constant amount of initialization time, and is -intricately linked to the issue of includes due to its pervasive use -in our plugin architecture. - -Pros and Cons -------------- - -We will assume that user-based extensions will be included by them. 
- -Conditional includes: - Pros: - - User management is simplified; only a single directive needs to be set - - Only necessary code is included - Cons: - - Doesn't play nicely with opcode caches - - Adds complexity to standalone version - - Optional configuration directives are not exposed without a little - extra coaxing (not implemented yet) - -Include it all: - Pros: - - User management is still simple - - Plays nicely with opcode caches and standalone version - - All configuration directives are present - Cons: - - Lots of (how much?) extra code is included - - Classes that inherit from external libraries will cause compile - errors - -Build an include stub (Let's do this!): - Pros: - - Only necessary code is included - - Plays nicely with opcode caches and standalone version - - require (without once) can be used, see above - - Could further extend as a compilation to one file - Cons: - - Not implemented yet - - Requires user intervention and use of a command line script - - Standalone script must be chained to this - - More complex and compiled-language-like - - Requires a whole new class of system-wide configuration directives, - as configuration objects can be reused - - Determining what needs to be included can be complex (see above) - - No way of autodetecting dynamically instantiated classes - - Might be slow - -Include stubs -------------- - -This solution may be "just right" for users who are heavily oriented -towards performance. However, there are a number of picky implementation -details to work out beforehand. - -The number one concern is how to make the HTML Purifier files "work -out of the box", while still being able to easily get them into a form -that works with this setup. As the codebase stands right now, it would -be necessary to strip out all of the require_once calls. The only way -we could get rid of the require_once calls is to use __autoload or -use the stub for all cases (which might not be a bad idea). 
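Reviewer note: the autoload fallback mentioned in the paragraph above could look roughly like the sketch below. It is not code from HTML Purifier. The helper name `example_purifier_class_to_path`, the `library` base directory, and the underscore-to-directory mapping are assumptions made for illustration, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper (not part of the library): maps a class name such as
// HTMLPurifier_Foo_Bar to <base>/HTMLPurifier/Foo/Bar.php.
function example_purifier_class_to_path($class, $baseDir)
{
    // Only handle names with the library prefix; leave others to other loaders.
    if (strncmp($class, 'HTMLPurifier', 12) !== 0) {
        return null;
    }
    return rtrim($baseDir, '/') . '/' . str_replace('_', '/', $class) . '.php';
}

// Register the fallback. The base directory is an assumption.
spl_autoload_register(function ($class) {
    $path = example_purifier_class_to_path($class, __DIR__ . '/library');
    if ($path !== null && is_file($path)) {
        // Plain require is enough here: the loader runs only for classes
        // that have not been defined yet.
        require $path;
    }
});
```

With a loader like this registered, an include stub only has to list the files a performance-minded user wants loaded up front. Any class the stub misses is still resolved on first use, at the cost of a file lookup.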
- - Aside - ----- - An important thing to remember, however, is that these require_once's - are valuable data about what classes a file needs. Unfortunately, there's - no distinction between whether or not the file is needed all the time, - or whether or not it is one of our "optional" files. Thus, it is - effectively useless. - - Deprecated - ---------- - One of the things I'd like to do is have the code search for any classes - that are explicitly mentioned in the code. If a class isn't mentioned, I - get to assume that it is "optional," i.e. included via introspection. - The choice is either to use PHP's tokenizer or use regexps; regexps would - be faster but a tokenizer would be more correct. If this ends up being - unfeasible, adding dependency comments isn't a bad idea. (This could - even be done automatically by search/replacing require_once, although - we'd have to manually inspect the results for the optional requires.) - - NOTE: This ends up not being necessary, as we're going to make the user - figure out all the extra classes they need, and only include the core - which is predetermined. - -Using the autoload framework with include stubs works nicely with -introspective classes: instead of having to have require_once inside -the function, we can let autoload do the work; we simply need to -new $class or accept the object straight from the caller. Handling filters -becomes a simple matter of ticking off configuration directives, and -if ConfigSchema spits out errors, adding the necessary includes. We could -also use the autoload framework as a fallback, in case the user forgets -to make the include, but doesn't really care about performance. - - Insight - ------- - All of this talk is merely a natural extension of what our current - standalone functionality does. 
However, instead of having our code - perform the includes, or attempting to inline everything that possibly - could be used, we boot the issue to the user, making them include - everything or set up the fallback autoload handler. - -Configuration Schema --------------------- - -A common deficiency for all of the conditional include setups (including -the dynamically built include PHP stub) is that if one of these -conditionally included files includes a configuration directive, it -is not accessible to configdoc. A stopgap solution for this problem is -to have it piggy-back off of the data in the merge-library.php script -to figure out what extra files it needs to include, but if the file also -inherits classes that don't exist, we're in big trouble. - -I think it's high time we centralized the configuration documentation. -However, the type checking has been a great boon for the library, and -I'd like to keep that. The compromise is to use some other source, and -then parse it into the ConfigSchema internal format (sans all of those -nasty documentation strings which we really don't need at runtime) and -serialize that for future use. - -The next question is that of format. XML is very verbose, and the prospect -of setting defaults in it gives me the willies. However, this may be necessary. -Splitting up the file into manageable chunks may alleviate this trouble, -and we may even want to create our own format optimized for specifying -configuration. It might look like (based off the PHPT format, which is -nicely compact yet unambiguous and human-readable): - -Core.HiddenElements -TYPE: lookup -DEFAULT: array('script', 'style') // auto-converted during processing ---ALIASES-- -Core.InvisibleElements, Core.StupidElements ---DESCRIPTION-- -<p> - Blah blah -</p> - -The first line is the directive name, the lines after that prior to the -first --HEADER-- block are single-line values, and then after that -the multiline values are there.
No value is restricted to a particular -format: DEFAULT could very well be multiline if that would be easier. -This would make it insanely easy, also, to add arbitrary extra parameters, -like: - -VERSION: 3.0.0 -ALLOWED: 'none', 'light', 'medium', 'heavy' // this is wrapped in array() -EXTERNAL: CSSTidy // this would be documented somewhere else with a URL - -The final loss would be that you wouldn't know what file the directive -was used in; with some clever regexps it should be possible to -figure out where $config->get($ns, $d); occurs. Reflective calls to -the configuration object are mitigated by the fact that getBatch is -used, so we can simply talk about that in the namespace definition page. -This might be slow, but it would only happen when we are creating -the documentation for consumption, and is sugar. - -We can put this in a schema/ directory, outside of HTML Purifier. The serialized -data gets treated like entities.ser. - -The final thing that needs to be handled is user-defined configurations. -They can be added at runtime using ConfigSchema::registerDirectory() -which globs the directory and grabs all of the directives to be incorporated -in. Then, the result is saved. We may want to take advantage of the -DefinitionCache framework, although it is not altogether certain what -configuration directives would be used to generate our key (meta-directives!) - - Further thoughts - ---------------- - Our master configuration schema will only need to be updated once - every new version, so it's easily versionable. User-specified - schema files are far more volatile, but it's far too expensive - to check the filemtimes of all the files, so a DefinitionRev style - mechanism works better. However, we can uniquely identify the - schema based on the directories they loaded, so there's no need - for a DefinitionId until we give them full programmatic control.
- - These variables should be directly incorporated into ConfigSchema, - and ConfigSchema should handle serialization. Some refactoring will be - necessary for the DefinitionCache classes, as they are built with - Config in mind. If the user changes something, the cache file gets - rebuilt. If the version changes, the cache file gets rebuilt. Since - our unit tests flush the caches before we start, and the operation is - pretty fast, this will not negatively impact unit testing. - -One last thing: certain configuration directives require that files -get added. They may even be specified dynamically. It is not a good idea -for the HTMLPurifier_Config object to be used directly for such matters. -Instead, the userland code should explicitly perform the includes. We may -put in something like: - -REQUIRES: HTMLPurifier_Filter_ExtractStyleBlocks - -To indicate that if that class doesn't exist, and the user is attempting -to use the directive, we should fatally error out. The stub includes the core files, -and the user includes everything else. Any reflective things like new -$class would be required to tie in with the configuration. - -It would work very well with rarely used configuration options, but it -wouldn't be so good for "core" parts that can be disabled. In such cases -the core include file would need to be modified, and the only way -to properly do this is use the configuration object. Once again, our -ability to create cache keys saves the day again: we can create arbitrary -stub files for arbitrary configurations and include those. They could -even be the single file affairs. The only thing we'd need to include, -then, would be HTMLPurifier_Config! Then, the configuration object would -load the library. - - An aside... - ----------- - One questions, however, the wisdom of letting PHP files write other PHP - files. 
It seems like a recipe for disaster, or at least lots of headaches - in highly secured setups, where PHP does not have the ability to write - to its root. In such cases, we could use sticky bits or tell the user - to manually generate the file. - - The other troublesome bit is actually doing the calculations necessary. - For certain cases, it's simple (such as URIScheme), but for AttrDef - and HTMLModule the dependency trees are very complex in relation to - %HTML.Allowed and friends. I think that this idea should be shelved - and looked at at a later, less insane date. - -An interesting dilemma presents itself when a configuration form is offered -to the user. Normally, the configuration object is not accessible without -editing PHP code; this facility changes things. The sensible thing to do -is to stipulate that all classes required by the directives you allow must -be included. - -Unit testing ------------- - -Setting up the parsing and translation into our existing format would not -be difficult to do. It might represent a good time for us to rethink our -tests for these facilities; as creative as they are, they are often hacky -and require public visibility for things that ought to be protected. -This is especially applicable for our DefinitionCache tests. - -Migration ---------- - -Because we are not *adding* anything essentially new, it should be trivial -to write a script to take our existing data and dump it into the new format. -Well, not trivial, but fairly easy to accomplish. Primary implementation -difficulties would probably involve formatting the file nicely. - -Backwards-compatibility ------------------------ - -I expect that the ConfigSchema methods should stick around for a little bit, -but display E_USER_NOTICE warnings that they are deprecated. This will -require documentation! - -New stuff ---------- - -VERSION: Version number in which the directive was introduced -DEPRECATED-VERSION: If the directive was deprecated, when was it deprecated?
-DEPRECATED-USE: If the directive was deprecated, what should the user use now? -REQUIRES: What classes does this configuration directive require, but are - not part of the HTML Purifier core? - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/dev-naming.html b/lib/htmlpurifier/docs/dev-naming.html deleted file mode 100644 index cea4b006f..000000000 --- a/lib/htmlpurifier/docs/dev-naming.html +++ /dev/null @@ -1,83 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Defines class naming conventions in HTML Purifier." /> -<link rel="stylesheet" type="text/css" href="./style.css" /> - -<title>Naming Conventions - HTML Purifier</title> - -</head><body> - -<h1>Naming Conventions</h1> - -<div id="filing">Filed under Development</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p>The classes in this library follow a few naming conventions, which may -help you find the correct functionality more quickly. Here they are:</p> - -<dl> - -<dt>All classes occupy the HTMLPurifier pseudo-namespace.</dt> - <dd>This means that all classes are prefixed with HTMLPurifier_. As such, all - names under HTMLPurifier_ are reserved. 
I recommend that you use the name - HTMLPurifierX_YourName_ClassName, especially if you want to take advantage - of HTMLPurifier_ConfigDef.</dd> - -<dt>All classes correspond to their path if library/ was in the include path</dt> - <dd>HTMLPurifier_AttrDef is located at HTMLPurifier/AttrDef.php; replace - underscores with slashes and append .php and you'll have the location of - the class.</dd> - -<dt>Harness and Test are reserved class names for unit tests</dt> - <dd>The suffix <code>Test</code> indicates that the class is a subclass of UnitTestCase - (of the SimpleTest library) and is testable. "Harness" indicates a subclass - of UnitTestCase that is not meant to be run but to be extended into - concrete test cases and contains custom test methods (i.e. assert*())</dd> - -<dt>Class names do not necessarily represent inheritance hierarchies</dt> - <dd>While we try to reflect inheritance in naming to some extent, it is not - guaranteed (for instance, none of the classes inherit from HTMLPurifier, - the base class). However, all class files have the require_once - declarations to whichever classes they are tightly coupled to.</dd> - -<dt>Strategy has a meaning different from the Gang of Four pattern</dt> - <dd>In Design Patterns, the Gang of Four describes a Strategy object as - encapsulating an algorithm so that algorithms can be switched at run-time. While - our strategies are indeed algorithms, they are not meant to be substituted: - all must be present in order for proper functioning.</dd> - -<dt>Abbreviations are avoided</dt> - <dd>We try to avoid abbreviations as much as possible, but in some cases, - the abbreviated version is more readable than the full version. Here, we - list common abbreviations: - <ul> - <li>Attr to Attributes (note that it is plural, i.e.
<code>$attr = array()</code>)</li> - <li>Def to Definition</li> - <li><code>$ret</code> is the value to be returned in a function</li> - </ul> - </dd> - -<dt>Ambiguity concerning the definition of Def/Definition</dt> - <dd>While a definition normally defines the structure/acceptable values of - an entity, most of the definitions in this application also attempt - to validate and fix the value. I am unsure of a better name, as - "Validator" would exclude fixing the value, "Fixer" doesn't invoke - the proper image of "fixing" something, and "ValidatorFixer" is too long! - Some other suggestions were "Handler", "Reference", "Check", "Fix", - "Repair" and "Heal".</dd> - -<dt>Transform not Transformer</dt> - <dd>Transform is both a noun and a verb, and thus we define a "Transform" as - something that "transforms," leaving "Transformer" (which sounds like an - electrical device/robot toy).</dd> - -</dl> - -</body></html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/dev-optimization.html b/lib/htmlpurifier/docs/dev-optimization.html deleted file mode 100644 index 78f565813..000000000 --- a/lib/htmlpurifier/docs/dev-optimization.html +++ /dev/null @@ -1,33 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Discusses possible methods of optimizing HTML Purifier." 
/> -<link rel="stylesheet" type="text/css" href="./style.css" /> - -<title>Optimization - HTML Purifier</title> - -</head><body> - -<h1>Optimization</h1> - -<div id="filing">Filed under Development</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p>Here are some possible optimization techniques we can apply to code sections if -they turn out to be slow. Be sure not to prematurely optimize: if you get -that itch, put it here!</p> - -<ul> - <li>Make Tokens Flyweights (may prove problematic, probably not worth it)</li> - <li>Rewrite regexps into PHP code</li> - <li>Batch regexp validation (do as many per function call as possible)</li> - <li>Parallelize strategies</li> -</ul> - -</body></html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/dev-progress.html b/lib/htmlpurifier/docs/dev-progress.html deleted file mode 100644 index 105896ed6..000000000 --- a/lib/htmlpurifier/docs/dev-progress.html +++ /dev/null @@ -1,309 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Tables detailing HTML element and CSS property implementation coverage in HTML Purifier." 
/> -<link rel="stylesheet" type="text/css" href="./style.css" /> - -<title>Implementation Progress - HTML Purifier</title> - -<style type="text/css"> - -td {padding-right:1em;border-bottom:1px solid #000;padding-left:0.5em;} -th {text-align:left;padding-top:1.4em;font-size:13pt; - border-bottom:2px solid #000;background:#FFF;} -thead th {text-align:left;padding:0.1em;background-color:#EEE;} - -.impl-yes {background:#9D9;} -.impl-partial {background:#FFA;} -.impl-no {background:#CCC;} - -.danger {color:#600;} -.css1 {color:#060;} -.required {font-weight:bold;} -.feature {color:#999;} - -</style> - -</head><body> - -<h1>Implementation Progress</h1> - -<div id="filing">Filed under Development</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p> - <strong>Warning:</strong> This table is kept for historical purposes and - is not being actively updated. -</p> - -<h2>Key</h2> - -<table cellspacing="0"><tbody> -<tr><td class="impl-yes">Implemented</td></tr> -<tr><td class="impl-partial">Partially implemented</td></tr> -<tr><td class="impl-no">Not priority to implement</td></tr> -<tr><td class="danger">Dangerous attribute/property</td></tr> -<tr><td class="css1">Present in CSS1</td></tr> -<tr><td class="feature">Feature, requires extra work</td></tr> -</tbody></table> - -<h2>CSS</h2> - -<table cellspacing="0"> - -<thead> -<tr><th>Name</th><th>Notes</th></tr> -</thead> - -<!-- -<tr><td>-</td><td>-</td></tr> ---> - -<tbody> -<tr><th colspan="2">Standard</th></tr> -<tr class="css1 impl-yes"><td>background-color</td><td>COMPOSITE(<color>, transparent)</td></tr> -<tr class="css1 impl-yes"><td>background</td><td>SHORTHAND, currently alias for background-color</td></tr> -<tr class="css1 impl-yes"><td>border</td><td>SHORTHAND, MULTIPLE</td></tr> -<tr class="css1 impl-yes"><td>border-color</td><td>MULTIPLE</td></tr> -<tr class="css1 
impl-yes"><td>border-style</td><td>MULTIPLE</td></tr> -<tr class="css1 impl-yes"><td>border-width</td><td>MULTIPLE</td></tr> -<tr class="css1 impl-yes"><td>border-*</td><td>SHORTHAND</td></tr> -<tr class="impl-yes"><td>border-*-color</td><td>COMPOSITE(<color>, transparent)</td></tr> -<tr class="impl-yes"><td>border-*-style</td><td>ENUM(none, hidden, dotted, dashed, - solid, double, groove, ridge, inset, outset)</td></tr> -<tr class="css1 impl-yes"><td>border-*-width</td><td>COMPOSITE(<length>, thin, medium, thick)</td></tr> -<tr class="css1 impl-yes"><td>clear</td><td>ENUM(none, left, right, both)</td></tr> -<tr class="css1 impl-yes"><td>color</td><td><color></td></tr> -<tr class="css1 impl-yes"><td>float</td><td>ENUM(left, right, none), May require layout - precautions with clear</td></tr> -<tr class="css1 impl-yes"><td>font</td><td>SHORTHAND</td></tr> -<tr class="css1 impl-yes"><td>font-family</td><td>CSS validator may complain if fallback font - family not specified</td></tr> -<tr class="css1 impl-yes"><td>font-size</td><td>COMPOSITE(<absolute-size>, - <relative-size>, <length>, <percentage>)</td></tr> -<tr class="css1 impl-yes"><td>font-style</td><td>ENUM(normal, italic, oblique)</td></tr> -<tr class="css1 impl-yes"><td>font-variant</td><td>ENUM(normal, small-caps)</td></tr> -<tr class="css1 impl-yes"><td>font-weight</td><td>ENUM(normal, bold, bolder, lighter, - 100, 200, 300, 400, 500, 600, 700, 800, 900), maybe special code for - in-between integers</td></tr> -<tr class="css1 impl-yes"><td>letter-spacing</td><td>COMPOSITE(<length>, normal)</td></tr> -<tr class="css1 impl-yes"><td>line-height</td><td>COMPOSITE(<number>, - <length>, <percentage>, normal)</td></tr> -<tr class="css1 impl-yes"><td>list-style-position</td><td>ENUM(inside, outside), - Strange behavior in browsers</td></tr> -<tr class="css1 impl-yes"><td>list-style-type</td><td>ENUM(...), - Well-supported values are: disc, circle, square, - decimal, lower-roman, upper-roman, lower-alpha and 
upper-alpha. See also - CSS 3. Mostly IE lack of support.</td></tr> -<tr class="css1 impl-yes"><td>list-style</td><td>SHORTHAND</td></tr> -<tr class="css1 impl-yes"><td>margin</td><td>MULTIPLE</td></tr> -<tr class="css1 impl-yes"><td>margin-*</td><td>COMPOSITE(<length>, - <percentage>, auto)</td></tr> -<tr class="css1 impl-yes"><td>padding</td><td>MULTIPLE</td></tr> -<tr class="css1 impl-yes"><td>padding-*</td><td>COMPOSITE(<length>(positive), - <percentage>(positive))</td></tr> -<tr class="css1 impl-yes"><td>text-align</td><td>ENUM(left, right, - center, justify)</td></tr> -<tr class="css1 impl-yes"><td>text-decoration</td><td>No blink (argh my eyes), not - enum, can be combined (composite sorta): underline, overline, - line-through</td></tr> -<tr class="css1 impl-yes"><td>text-indent</td><td>COMPOSITE(<length>, - <percentage>)</td></tr> -<tr class="css1 impl-yes"><td>text-transform</td><td>ENUM(capitalize, uppercase, - lowercase, none)</td></tr> -<tr class="css1 impl-yes"><td>width</td><td>COMPOSITE(<length>, - <percentage>, auto), Interesting</td></tr> -<tr class="css1 impl-yes"><td>word-spacing</td><td>COMPOSITE(<length>, auto), - IE 5 no support</td></tr> -</tbody> - -<tbody> -<tr><th colspan="2">Table</th></tr> -<tr class="impl-yes"><td>border-collapse</td><td>ENUM(collapse, separate)</td></tr> -<tr class="impl-yes"><td>border-spacing</td><td>MULTIPLE</td></tr> -<tr class="impl-yes"><td>caption-side</td><td>ENUM(top, bottom)</td></tr> -<tr class="feature"><td>empty-cells</td><td>ENUM(show, hide), No IE support makes this useless, - possible fix with &nbsp;?
Unknown release milestone.</td></tr> -<tr class="impl-yes"><td>table-layout</td><td>ENUM(auto, fixed)</td></tr> -<tr class="impl-yes css1"><td>vertical-align</td><td>COMPOSITE(ENUM(baseline, sub, - super, top, text-top, middle, bottom, text-bottom), <percentage>, - <length>) Also applies to others with explicit height</td></tr> -</tbody> - -<tbody> -<tr><th colspan="2">Absolute positioning, unknown release milestone</th></tr> -<tr class="danger impl-no"><td>bottom</td><td rowspan="4">Dangerous, must be non-negative to even be considered, - but it's still possible to arbitrarily position by running over.</td></tr> -<tr class="danger impl-no"><td>left</td></tr> -<tr class="danger impl-no"><td>right</td></tr> -<tr class="danger impl-no"><td>top</td></tr> -<tr class="impl-no"><td>clip</td><td>-</td></tr> -<tr class="danger impl-no"><td>position</td><td>ENUM(static, relative, absolute, fixed) - relative not absolute?</td></tr> -<tr class="danger impl-no"><td>z-index</td><td>Dangerous</td></tr> -</tbody> - -<tbody> -<tr><th colspan="2">Unknown</th></tr> -<tr class="danger css1 impl-yes"><td>background-image</td><td>Dangerous</td></tr> -<tr class="css1 impl-yes"><td>background-attachment</td><td>ENUM(scroll, fixed), - Depends on background-image</td></tr> -<tr class="css1 impl-yes"><td>background-position</td><td>Depends on background-image</td></tr> -<tr class="danger impl-no"><td>cursor</td><td>Dangerous but fluffy</td></tr> -<tr class="danger impl-yes"><td>display</td><td>ENUM(...), Dangerous but interesting; - will not implement list-item, run-in (Opera only) or table (no IE); - inline-block has incomplete IE6 support and requires -moz-inline-box - for Mozilla. Unknown target milestone.</td></tr> -<tr class="css1 impl-yes"><td>height</td><td>Interesting, why use it? 
Unknown target milestone.</td></tr> -<tr class="danger css1 impl-yes"><td>list-style-image</td><td>Dangerous?</td></tr> -<tr class="impl-no"><td>max-height</td><td rowspan="4">No IE 5/6</td></tr> -<tr class="impl-no"><td>min-height</td></tr> -<tr class="impl-no"><td>max-width</td></tr> -<tr class="impl-no"><td>min-width</td></tr> -<tr class="impl-no"><td>orphans</td><td>No IE support</td></tr> -<tr class="impl-no"><td>widows</td><td>No IE support</td></tr> -<tr><td>overflow</td><td>ENUM, IE 5/6 almost (remove visible if set). Unknown target milestone.</td></tr> -<tr><td>page-break-after</td><td>ENUM(auto, always, avoid, left, right), - IE 5.5/6 and Opera. Unknown target milestone.</td></tr> -<tr><td>page-break-before</td><td>ENUM(auto, always, avoid, left, right), - Mostly supported. Unknown target milestone.</td></tr> -<tr><td>page-break-inside</td><td>ENUM(avoid, auto), Opera only. Unknown target milestone.</td></tr> -<tr class="impl-no"><td>quotes</td><td>May be dropped from CSS2, fairly useless for inline context</td></tr> -<tr class="danger impl-yes"><td>visibility</td><td>ENUM(visible, hidden, collapse), - Dangerous</td></tr> -<tr class="css1 feature impl-partial"><td>white-space</td><td>ENUM(normal, pre, nowrap, pre-wrap, - pre-line), Spotty implementation: - pre (no IE 5/6), <em>nowrap</em> (no IE 5, supported), - pre-wrap (only Opera), pre-line (no support). Fixable? 
Unknown target milestone.</td></tr> -</tbody> - -<tbody class="impl-no"> -<tr><th colspan="2">Aural</th></tr> -<tr><td>azimuth</td><td>-</td></tr> -<tr><td>cue</td><td>-</td></tr> -<tr><td>cue-after</td><td>-</td></tr> -<tr><td>cue-before</td><td>-</td></tr> -<tr><td>elevation</td><td>-</td></tr> -<tr><td>pause-after</td><td>-</td></tr> -<tr><td>pause-before</td><td>-</td></tr> -<tr><td>pause</td><td>-</td></tr> -<tr><td>pitch-range</td><td>-</td></tr> -<tr><td>pitch</td><td>-</td></tr> -<tr><td>play-during</td><td>-</td></tr> -<tr><td>richness</td><td>-</td></tr> -<tr><td>speak-header</td><td>Table related</td></tr> -<tr><td>speak-numeral</td><td>-</td></tr> -<tr><td>speak-punctuation</td><td>-</td></tr> -<tr><td>speak</td><td>-</td></tr> -<tr><td>speech-rate</td><td>-</td></tr> -<tr><td>stress</td><td>-</td></tr> -<tr><td>voice-family</td><td>-</td></tr> -<tr><td>volume</td><td>-</td></tr> -</tbody> - -<tbody class="impl-no"> -<tr><th colspan="2">Will not implement</th></tr> -<tr><td>content</td><td>Not applicable for inline styles</td></tr> -<tr><td>counter-increment</td><td>Needs content, Opera only</td></tr> -<tr><td>counter-reset</td><td>Needs content, Opera only</td></tr> -<tr><td>direction</td><td>No support</td></tr> -<tr><td>outline-color</td><td rowspan="4">IE Mac and Opera on outside, -Mozilla on inside and needs -moz-outline, no IE support.</td></tr> - <tr><td>outline-style</td></tr> - <tr><td>outline-width</td></tr> - <tr><td>outline</td></tr> -<tr><td>unicode-bidi</td><td>No support</td></tr> -</tbody> - -</table> - -<h2>Interesting Attributes</h2> - -<table cellspacing="0"> - -<thead> -<tr><th>Attribute</th><th>Tags</th><th>Notes</th></tr> -</thead> - -<!-- -<tr><th></th></tr> -<tbody> -<tr><td>-</td><td>-</td><td>-</td></tr> -</tbody> ---> - -<tbody> -<tr><th colspan="3">CSS</th></tr> -<tr class="impl-yes"><td>style</td><td>All</td><td>Parser is reasonably functional. 
Status here doesn't count individual properties.</td></tr> -</tbody> - -<tbody> -<tr><th colspan="3">Questionable</th></tr> -<tr class="impl-no"><td>accesskey</td><td>A</td><td>May interfere with main interface</td></tr> -<tr class="impl-no"><td>tabindex</td><td>A</td><td>May interfere with main interface</td></tr> -<tr class="impl-yes"><td>target</td><td>A</td><td>Config enabled, only useful for frame layouts, disallowed in strict</td></tr> -</tbody> - -<tbody> -<tr><th colspan="3">Miscellaneous</th></tr> -<tr><td>datetime</td><td>DEL, INS</td><td>No visible effect, ISO format</td></tr> -<tr class="impl-yes"><td>rel</td><td>A</td><td>Largely user-defined: nofollow, tag (see microformats)</td></tr> -<tr class="impl-yes"><td>rev</td><td>A</td><td>Largely user-defined: vote-*</td></tr> -<tr class="feature"><td>axis</td><td>TD, TH</td><td>W3C only: No browser implementation</td></tr> -<tr class="feature"><td>char</td><td>COL, COLGROUP, TBODY, TD, TFOOT, TH, THEAD, TR</td><td>W3C only: No browser implementation</td></tr> -<tr class="feature"><td>headers</td><td>TD, TH</td><td>W3C only: No browser implementation</td></tr> -<tr class="impl-yes"><td>scope</td><td>TD, TH</td><td>W3C only: No browser implementation</td></tr> -</tbody> - -<tbody class="impl-yes"> -<tr><th colspan="3">URI</th></tr> -<tr><td rowspan="2">cite</td><td>BLOCKQUOTE, Q</td><td>For attribution</td></tr> - <tr><td>DEL, INS</td><td>Link to explanation why it changed</td></tr> -<tr><td>href</td><td>A</td><td>-</td></tr> -<tr><td>longdesc</td><td>IMG</td><td>-</td></tr> -<tr class="required"><td>src</td><td>IMG</td><td>Required</td></tr> -</tbody> - -<tbody> -<tr><th colspan="3">Transform</th></tr> -<tr class="impl-yes"><td rowspan="5">align</td><td>CAPTION</td><td>'caption-side' for top/bottom, 'text-align' for left/right</td></tr> - <tr class="impl-yes"><td>IMG</td><td rowspan="3">See specimens/html-align-to-css.html</td></tr> - <tr class="impl-yes"><td>TABLE</td></tr> - <tr 
class="impl-yes"><td>HR</td></tr> - <tr class="impl-yes"><td>H1, H2, H3, H4, H5, H6, P</td><td>Equivalent style 'text-align'</td></tr> -<tr class="required impl-yes"><td>alt</td><td>IMG</td><td>Required, insert image filename if src is present or default invalid image text</td></tr> -<tr class="impl-yes"><td rowspan="3">bgcolor</td><td>TABLE</td><td>Superset style 'background-color'</td></tr> - <tr class="impl-yes"><td>TR</td><td>Superset style 'background-color'</td></tr> - <tr class="impl-yes"><td>TD, TH</td><td>Superset style 'background-color'</td></tr> -<tr class="impl-yes"><td>border</td><td>IMG</td><td>Equivalent style <code>border:[number]px solid</code></td></tr> -<tr class="impl-yes"><td>clear</td><td>BR</td><td>Near-equiv style 'clear', transform 'all' into 'both'</td></tr> -<tr class="impl-no"><td>compact</td><td>DL, OL, UL</td><td>Boolean, needs custom CSS class; rarely used anyway</td></tr> -<tr class="required impl-yes"><td>dir</td><td>BDO</td><td>Required, insert ltr (or configuration value) if none</td></tr> -<tr class="impl-yes"><td>height</td><td>TD, TH</td><td>Near-equiv style 'height', needs px suffix if original was in pixels</td></tr> -<tr class="impl-yes"><td>hspace</td><td>IMG</td><td>Near-equiv styles 'margin-top' and 'margin-bottom', needs px suffix</td></tr> -<tr class="impl-yes"><td>lang</td><td>*</td><td>Copy value to xml:lang</td></tr> -<tr class="impl-yes"><td rowspan="2">name</td><td>IMG</td><td>Turn into ID</td></tr> - <tr class="impl-yes"><td>A</td><td>Turn into ID</td></tr> -<tr class="impl-yes"><td>noshade</td><td>HR</td><td>Boolean, style 'border-style:solid;'</td></tr> -<tr class="impl-yes"><td>nowrap</td><td>TD, TH</td><td>Boolean, style 'white-space:nowrap;' (not compat with IE5)</td></tr> -<tr class="impl-yes"><td>size</td><td>HR</td><td>Near-equiv 'height', needs px suffix if original was pixels</td></tr> -<tr class="required impl-yes"><td>src</td><td>IMG</td><td>Required, insert blank or default img if not set</td></tr> 
-<tr class="impl-yes"><td>start</td><td>OL</td><td>Poorly supported 'counter-reset', allowed in loose, dropped in strict</td></tr> -<tr class="impl-yes"><td rowspan="3">type</td><td>LI</td><td rowspan="3">Equivalent style 'list-style-type', different allowed values though. (needs testing)</td></tr> - <tr class="impl-yes"><td>OL</td></tr> - <tr class="impl-yes"><td>UL</td></tr> -<tr class="impl-yes"><td>value</td><td>LI</td><td>Poorly supported 'counter-reset', allowed in loose, dropped in strict</td></tr> -<tr class="impl-yes"><td>vspace</td><td>IMG</td><td>Near-equiv styles 'margin-left' and 'margin-right', needs px suffix, see hspace</td></tr> -<tr class="impl-yes"><td rowspan="2">width</td><td>HR</td><td rowspan="2">Near-equiv style 'width', needs px suffix if original was pixels</td></tr> - <tr class="impl-yes"><td>TD, TH</td></tr> -</tbody> - -</table> - -</body></html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/dtd/xhtml1-transitional.dtd b/lib/htmlpurifier/docs/dtd/xhtml1-transitional.dtd deleted file mode 100644 index 628f27ac5..000000000 --- a/lib/htmlpurifier/docs/dtd/xhtml1-transitional.dtd +++ /dev/null @@ -1,1201 +0,0 @@ -<!-- - Extensible HTML version 1.0 Transitional DTD - - This is the same as HTML 4 Transitional except for - changes due to the differences between XML and SGML. - - Namespace = http://www.w3.org/1999/xhtml - - For further information, see: http://www.w3.org/TR/xhtml1 - - Copyright (c) 1998-2002 W3C (MIT, INRIA, Keio), - All Rights Reserved. 
- - This DTD module is identified by the PUBLIC and SYSTEM identifiers: - - PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" - SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" - - $Revision: 1.2 $ - $Date: 2002/08/01 18:37:55 $ - ---> - -<!--================ Character mnemonic entities =========================--> - -<!ENTITY % HTMLlat1 PUBLIC - "-//W3C//ENTITIES Latin 1 for XHTML//EN" - "xhtml-lat1.ent"> -%HTMLlat1; - -<!ENTITY % HTMLsymbol PUBLIC - "-//W3C//ENTITIES Symbols for XHTML//EN" - "xhtml-symbol.ent"> -%HTMLsymbol; - -<!ENTITY % HTMLspecial PUBLIC - "-//W3C//ENTITIES Special for XHTML//EN" - "xhtml-special.ent"> -%HTMLspecial; - -<!--================== Imported Names ====================================--> - -<!ENTITY % ContentType "CDATA"> - <!-- media type, as per [RFC2045] --> - -<!ENTITY % ContentTypes "CDATA"> - <!-- comma-separated list of media types, as per [RFC2045] --> - -<!ENTITY % Charset "CDATA"> - <!-- a character encoding, as per [RFC2045] --> - -<!ENTITY % Charsets "CDATA"> - <!-- a space separated list of character encodings, as per [RFC2045] --> - -<!ENTITY % LanguageCode "NMTOKEN"> - <!-- a language code, as per [RFC3066] --> - -<!ENTITY % Character "CDATA"> - <!-- a single character, as per section 2.2 of [XML] --> - -<!ENTITY % Number "CDATA"> - <!-- one or more digits --> - -<!ENTITY % LinkTypes "CDATA"> - <!-- space-separated list of link types --> - -<!ENTITY % MediaDesc "CDATA"> - <!-- single or comma-separated list of media descriptors --> - -<!ENTITY % URI "CDATA"> - <!-- a Uniform Resource Identifier, see [RFC2396] --> - -<!ENTITY % UriList "CDATA"> - <!-- a space separated list of Uniform Resource Identifiers --> - -<!ENTITY % Datetime "CDATA"> - <!-- date and time information. ISO date format --> - -<!ENTITY % Script "CDATA"> - <!-- script expression --> - -<!ENTITY % StyleSheet "CDATA"> - <!-- style sheet data --> - -<!ENTITY % Text "CDATA"> - <!-- used for titles etc. 
--> - -<!ENTITY % FrameTarget "NMTOKEN"> - <!-- render in this frame --> - -<!ENTITY % Length "CDATA"> - <!-- nn for pixels or nn% for percentage length --> - -<!ENTITY % MultiLength "CDATA"> - <!-- pixel, percentage, or relative --> - -<!ENTITY % Pixels "CDATA"> - <!-- integer representing length in pixels --> - -<!-- these are used for image maps --> - -<!ENTITY % Shape "(rect|circle|poly|default)"> - -<!ENTITY % Coords "CDATA"> - <!-- comma separated list of lengths --> - -<!-- used for object, applet, img, input and iframe --> -<!ENTITY % ImgAlign "(top|middle|bottom|left|right)"> - -<!-- a color using sRGB: #RRGGBB as Hex values --> -<!ENTITY % Color "CDATA"> - -<!-- There are also 16 widely known color names with their sRGB values: - - Black = #000000 Green = #008000 - Silver = #C0C0C0 Lime = #00FF00 - Gray = #808080 Olive = #808000 - White = #FFFFFF Yellow = #FFFF00 - Maroon = #800000 Navy = #000080 - Red = #FF0000 Blue = #0000FF - Purple = #800080 Teal = #008080 - Fuchsia= #FF00FF Aqua = #00FFFF ---> - -<!--=================== Generic Attributes ===============================--> - -<!-- core attributes common to most elements - id document-wide unique id - class space separated list of classes - style associated style info - title advisory title/amplification ---> -<!ENTITY % coreattrs - "id ID #IMPLIED - class CDATA #IMPLIED - style %StyleSheet; #IMPLIED - title %Text; #IMPLIED" - > - -<!-- internationalization attributes - lang language code (backwards compatible) - xml:lang language code (as per XML 1.0 spec) - dir direction for weak/neutral text ---> -<!ENTITY % i18n - "lang %LanguageCode; #IMPLIED - xml:lang %LanguageCode; #IMPLIED - dir (ltr|rtl) #IMPLIED" - > - -<!-- attributes for common UI events - onclick a pointer button was clicked - ondblclick a pointer button was double clicked - onmousedown a pointer button was pressed down - onmouseup a pointer button was released - onmousemove a pointer was moved onto the element - onmouseout a pointer was 
moved away from the element - onkeypress a key was pressed and released - onkeydown a key was pressed down - onkeyup a key was released ---> -<!ENTITY % events - "onclick %Script; #IMPLIED - ondblclick %Script; #IMPLIED - onmousedown %Script; #IMPLIED - onmouseup %Script; #IMPLIED - onmouseover %Script; #IMPLIED - onmousemove %Script; #IMPLIED - onmouseout %Script; #IMPLIED - onkeypress %Script; #IMPLIED - onkeydown %Script; #IMPLIED - onkeyup %Script; #IMPLIED" - > - -<!-- attributes for elements that can get the focus - accesskey accessibility key character - tabindex position in tabbing order - onfocus the element got the focus - onblur the element lost the focus ---> -<!ENTITY % focus - "accesskey %Character; #IMPLIED - tabindex %Number; #IMPLIED - onfocus %Script; #IMPLIED - onblur %Script; #IMPLIED" - > - -<!ENTITY % attrs "%coreattrs; %i18n; %events;"> - -<!-- text alignment for p, div, h1-h6. The default is - align="left" for ltr headings, "right" for rtl --> - -<!ENTITY % TextAlign "align (left|center|right|justify) #IMPLIED"> - -<!--=================== Text Elements ====================================--> - -<!ENTITY % special.extra - "object | applet | img | map | iframe"> - -<!ENTITY % special.basic - "br | span | bdo"> - -<!ENTITY % special - "%special.basic; | %special.extra;"> - -<!ENTITY % fontstyle.extra "big | small | font | basefont"> - -<!ENTITY % fontstyle.basic "tt | i | b | u - | s | strike "> - -<!ENTITY % fontstyle "%fontstyle.basic; | %fontstyle.extra;"> - -<!ENTITY % phrase.extra "sub | sup"> -<!ENTITY % phrase.basic "em | strong | dfn | code | q | - samp | kbd | var | cite | abbr | acronym"> - -<!ENTITY % phrase "%phrase.basic; | %phrase.extra;"> - -<!ENTITY % inline.forms "input | select | textarea | label | button"> - -<!-- these can occur at block or inline level --> -<!ENTITY % misc.inline "ins | del | script"> - -<!-- these can only occur at block level --> -<!ENTITY % misc "noscript | %misc.inline;"> - -<!ENTITY % inline "a | 
%special; | %fontstyle; | %phrase; | %inline.forms;"> - -<!-- %Inline; covers inline or "text-level" elements --> -<!ENTITY % Inline "(#PCDATA | %inline; | %misc.inline;)*"> - -<!--================== Block level elements ==============================--> - -<!ENTITY % heading "h1|h2|h3|h4|h5|h6"> -<!ENTITY % lists "ul | ol | dl | menu | dir"> -<!ENTITY % blocktext "pre | hr | blockquote | address | center | noframes"> - -<!ENTITY % block - "p | %heading; | div | %lists; | %blocktext; | isindex |fieldset | table"> - -<!-- %Flow; mixes block and inline and is used for list items etc. --> -<!ENTITY % Flow "(#PCDATA | %block; | form | %inline; | %misc;)*"> - -<!--================== Content models for exclusions =====================--> - -<!-- a elements use %Inline; excluding a --> - -<!ENTITY % a.content - "(#PCDATA | %special; | %fontstyle; | %phrase; | %inline.forms; | %misc.inline;)*"> - -<!-- pre uses %Inline excluding img, object, applet, big, small, - font, or basefont --> - -<!ENTITY % pre.content - "(#PCDATA | a | %special.basic; | %fontstyle.basic; | %phrase.basic; | - %inline.forms; | %misc.inline;)*"> - -<!-- form uses %Flow; excluding form --> - -<!ENTITY % form.content "(#PCDATA | %block; | %inline; | %misc;)*"> - -<!-- button uses %Flow; but excludes a, form, form controls, iframe --> - -<!ENTITY % button.content - "(#PCDATA | p | %heading; | div | %lists; | %blocktext; | - table | br | span | bdo | object | applet | img | map | - %fontstyle; | %phrase; | %misc;)*"> - -<!--================ Document Structure ==================================--> - -<!-- the namespace URI designates the document profile --> - -<!ELEMENT html (head, body)> -<!ATTLIST html - %i18n; - id ID #IMPLIED - xmlns %URI; #FIXED 'http://www.w3.org/1999/xhtml' - > - -<!--================ Document Head =======================================--> - -<!ENTITY % head.misc "(script|style|meta|link|object|isindex)*"> - -<!-- content model is %head.misc; combined with a single - title and an 
optional base element in any order --> - -<!ELEMENT head (%head.misc;, - ((title, %head.misc;, (base, %head.misc;)?) | - (base, %head.misc;, (title, %head.misc;))))> - -<!ATTLIST head - %i18n; - id ID #IMPLIED - profile %URI; #IMPLIED - > - -<!-- The title element is not considered part of the flow of text. - It should be displayed, for example as the page header or - window title. Exactly one title is required per document. - --> -<!ELEMENT title (#PCDATA)> -<!ATTLIST title - %i18n; - id ID #IMPLIED - > - -<!-- document base URI --> - -<!ELEMENT base EMPTY> -<!ATTLIST base - id ID #IMPLIED - href %URI; #IMPLIED - target %FrameTarget; #IMPLIED - > - -<!-- generic metainformation --> -<!ELEMENT meta EMPTY> -<!ATTLIST meta - %i18n; - id ID #IMPLIED - http-equiv CDATA #IMPLIED - name CDATA #IMPLIED - content CDATA #REQUIRED - scheme CDATA #IMPLIED - > - -<!-- - Relationship values can be used in principle: - - a) for document specific toolbars/menus when used - with the link element in document head e.g. - start, contents, previous, next, index, end, help - b) to link to a separate style sheet (rel="stylesheet") - c) to make a link to a script (rel="script") - d) by stylesheets to control how collections of - html nodes are rendered into printed documents - e) to make a link to a printable version of this document - e.g. 
a PostScript or PDF version (rel="alternate" media="print") ---> - -<!ELEMENT link EMPTY> -<!ATTLIST link - %attrs; - charset %Charset; #IMPLIED - href %URI; #IMPLIED - hreflang %LanguageCode; #IMPLIED - type %ContentType; #IMPLIED - rel %LinkTypes; #IMPLIED - rev %LinkTypes; #IMPLIED - media %MediaDesc; #IMPLIED - target %FrameTarget; #IMPLIED - > - -<!-- style info, which may include CDATA sections --> -<!ELEMENT style (#PCDATA)> -<!ATTLIST style - %i18n; - id ID #IMPLIED - type %ContentType; #REQUIRED - media %MediaDesc; #IMPLIED - title %Text; #IMPLIED - xml:space (preserve) #FIXED 'preserve' - > - -<!-- script statements, which may include CDATA sections --> -<!ELEMENT script (#PCDATA)> -<!ATTLIST script - id ID #IMPLIED - charset %Charset; #IMPLIED - type %ContentType; #REQUIRED - language CDATA #IMPLIED - src %URI; #IMPLIED - defer (defer) #IMPLIED - xml:space (preserve) #FIXED 'preserve' - > - -<!-- alternate content container for non script-based rendering --> - -<!ELEMENT noscript %Flow;> -<!ATTLIST noscript - %attrs; - > - -<!--======================= Frames =======================================--> - -<!-- inline subwindow --> - -<!ELEMENT iframe %Flow;> -<!ATTLIST iframe - %coreattrs; - longdesc %URI; #IMPLIED - name NMTOKEN #IMPLIED - src %URI; #IMPLIED - frameborder (1|0) "1" - marginwidth %Pixels; #IMPLIED - marginheight %Pixels; #IMPLIED - scrolling (yes|no|auto) "auto" - align %ImgAlign; #IMPLIED - height %Length; #IMPLIED - width %Length; #IMPLIED - > - -<!-- alternate content container for non frame-based rendering --> - -<!ELEMENT noframes %Flow;> -<!ATTLIST noframes - %attrs; - > - -<!--=================== Document Body ====================================--> - -<!ELEMENT body %Flow;> -<!ATTLIST body - %attrs; - onload %Script; #IMPLIED - onunload %Script; #IMPLIED - background %URI; #IMPLIED - bgcolor %Color; #IMPLIED - text %Color; #IMPLIED - link %Color; #IMPLIED - vlink %Color; #IMPLIED - alink %Color; #IMPLIED - > - -<!ELEMENT div 
%Flow;> <!-- generic language/style container --> -<!ATTLIST div - %attrs; - %TextAlign; - > - -<!--=================== Paragraphs =======================================--> - -<!ELEMENT p %Inline;> -<!ATTLIST p - %attrs; - %TextAlign; - > - -<!--=================== Headings =========================================--> - -<!-- - There are six levels of headings from h1 (the most important) - to h6 (the least important). ---> - -<!ELEMENT h1 %Inline;> -<!ATTLIST h1 - %attrs; - %TextAlign; - > - -<!ELEMENT h2 %Inline;> -<!ATTLIST h2 - %attrs; - %TextAlign; - > - -<!ELEMENT h3 %Inline;> -<!ATTLIST h3 - %attrs; - %TextAlign; - > - -<!ELEMENT h4 %Inline;> -<!ATTLIST h4 - %attrs; - %TextAlign; - > - -<!ELEMENT h5 %Inline;> -<!ATTLIST h5 - %attrs; - %TextAlign; - > - -<!ELEMENT h6 %Inline;> -<!ATTLIST h6 - %attrs; - %TextAlign; - > - -<!--=================== Lists ============================================--> - -<!-- Unordered list bullet styles --> - -<!ENTITY % ULStyle "(disc|square|circle)"> - -<!-- Unordered list --> - -<!ELEMENT ul (li)+> -<!ATTLIST ul - %attrs; - type %ULStyle; #IMPLIED - compact (compact) #IMPLIED - > - -<!-- Ordered list numbering style - - 1 arabic numbers 1, 2, 3, ... - a lower alpha a, b, c, ... - A upper alpha A, B, C, ... - i lower roman i, ii, iii, ... - I upper roman I, II, III, ... - - The style is applied to the sequence number which by default - is reset to 1 for the first list item in an ordered list. 
---> -<!ENTITY % OLStyle "CDATA"> - -<!-- Ordered (numbered) list --> - -<!ELEMENT ol (li)+> -<!ATTLIST ol - %attrs; - type %OLStyle; #IMPLIED - compact (compact) #IMPLIED - start %Number; #IMPLIED - > - -<!-- single column list (DEPRECATED) --> -<!ELEMENT menu (li)+> -<!ATTLIST menu - %attrs; - compact (compact) #IMPLIED - > - -<!-- multiple column list (DEPRECATED) --> -<!ELEMENT dir (li)+> -<!ATTLIST dir - %attrs; - compact (compact) #IMPLIED - > - -<!-- LIStyle is constrained to: "(%ULStyle;|%OLStyle;)" --> -<!ENTITY % LIStyle "CDATA"> - -<!-- list item --> - -<!ELEMENT li %Flow;> -<!ATTLIST li - %attrs; - type %LIStyle; #IMPLIED - value %Number; #IMPLIED - > - -<!-- definition lists - dt for term, dd for its definition --> - -<!ELEMENT dl (dt|dd)+> -<!ATTLIST dl - %attrs; - compact (compact) #IMPLIED - > - -<!ELEMENT dt %Inline;> -<!ATTLIST dt - %attrs; - > - -<!ELEMENT dd %Flow;> -<!ATTLIST dd - %attrs; - > - -<!--=================== Address ==========================================--> - -<!-- information on author --> - -<!ELEMENT address (#PCDATA | %inline; | %misc.inline; | p)*> -<!ATTLIST address - %attrs; - > - -<!--=================== Horizontal Rule ==================================--> - -<!ELEMENT hr EMPTY> -<!ATTLIST hr - %attrs; - align (left|center|right) #IMPLIED - noshade (noshade) #IMPLIED - size %Pixels; #IMPLIED - width %Length; #IMPLIED - > - -<!--=================== Preformatted Text ================================--> - -<!-- content is %Inline; excluding - "img|object|applet|big|small|sub|sup|font|basefont" --> - -<!ELEMENT pre %pre.content;> -<!ATTLIST pre - %attrs; - width %Number; #IMPLIED - xml:space (preserve) #FIXED 'preserve' - > - -<!--=================== Block-like Quotes ================================--> - -<!ELEMENT blockquote %Flow;> -<!ATTLIST blockquote - %attrs; - cite %URI; #IMPLIED - > - -<!--=================== Text alignment ===================================--> - -<!-- center content --> -<!ELEMENT center %Flow;> 
-<!ATTLIST center - %attrs; - > - -<!--=================== Inserted/Deleted Text ============================--> - -<!-- - ins/del are allowed in block and inline content, but its - inappropriate to include block content within an ins element - occurring in inline content. ---> -<!ELEMENT ins %Flow;> -<!ATTLIST ins - %attrs; - cite %URI; #IMPLIED - datetime %Datetime; #IMPLIED - > - -<!ELEMENT del %Flow;> -<!ATTLIST del - %attrs; - cite %URI; #IMPLIED - datetime %Datetime; #IMPLIED - > - -<!--================== The Anchor Element ================================--> - -<!-- content is %Inline; except that anchors shouldn't be nested --> - -<!ELEMENT a %a.content;> -<!ATTLIST a - %attrs; - %focus; - charset %Charset; #IMPLIED - type %ContentType; #IMPLIED - name NMTOKEN #IMPLIED - href %URI; #IMPLIED - hreflang %LanguageCode; #IMPLIED - rel %LinkTypes; #IMPLIED - rev %LinkTypes; #IMPLIED - shape %Shape; "rect" - coords %Coords; #IMPLIED - target %FrameTarget; #IMPLIED - > - -<!--===================== Inline Elements ================================--> - -<!ELEMENT span %Inline;> <!-- generic language/style container --> -<!ATTLIST span - %attrs; - > - -<!ELEMENT bdo %Inline;> <!-- I18N BiDi over-ride --> -<!ATTLIST bdo - %coreattrs; - %events; - lang %LanguageCode; #IMPLIED - xml:lang %LanguageCode; #IMPLIED - dir (ltr|rtl) #REQUIRED - > - -<!ELEMENT br EMPTY> <!-- forced line break --> -<!ATTLIST br - %coreattrs; - clear (left|all|right|none) "none" - > - -<!ELEMENT em %Inline;> <!-- emphasis --> -<!ATTLIST em %attrs;> - -<!ELEMENT strong %Inline;> <!-- strong emphasis --> -<!ATTLIST strong %attrs;> - -<!ELEMENT dfn %Inline;> <!-- definitional --> -<!ATTLIST dfn %attrs;> - -<!ELEMENT code %Inline;> <!-- program code --> -<!ATTLIST code %attrs;> - -<!ELEMENT samp %Inline;> <!-- sample --> -<!ATTLIST samp %attrs;> - -<!ELEMENT kbd %Inline;> <!-- something user would type --> -<!ATTLIST kbd %attrs;> - -<!ELEMENT var %Inline;> <!-- variable --> -<!ATTLIST var %attrs;> - 
-<!ELEMENT cite %Inline;> <!-- citation --> -<!ATTLIST cite %attrs;> - -<!ELEMENT abbr %Inline;> <!-- abbreviation --> -<!ATTLIST abbr %attrs;> - -<!ELEMENT acronym %Inline;> <!-- acronym --> -<!ATTLIST acronym %attrs;> - -<!ELEMENT q %Inline;> <!-- inlined quote --> -<!ATTLIST q - %attrs; - cite %URI; #IMPLIED - > - -<!ELEMENT sub %Inline;> <!-- subscript --> -<!ATTLIST sub %attrs;> - -<!ELEMENT sup %Inline;> <!-- superscript --> -<!ATTLIST sup %attrs;> - -<!ELEMENT tt %Inline;> <!-- fixed pitch font --> -<!ATTLIST tt %attrs;> - -<!ELEMENT i %Inline;> <!-- italic font --> -<!ATTLIST i %attrs;> - -<!ELEMENT b %Inline;> <!-- bold font --> -<!ATTLIST b %attrs;> - -<!ELEMENT big %Inline;> <!-- bigger font --> -<!ATTLIST big %attrs;> - -<!ELEMENT small %Inline;> <!-- smaller font --> -<!ATTLIST small %attrs;> - -<!ELEMENT u %Inline;> <!-- underline --> -<!ATTLIST u %attrs;> - -<!ELEMENT s %Inline;> <!-- strike-through --> -<!ATTLIST s %attrs;> - -<!ELEMENT strike %Inline;> <!-- strike-through --> -<!ATTLIST strike %attrs;> - -<!ELEMENT basefont EMPTY> <!-- base font size --> -<!ATTLIST basefont - id ID #IMPLIED - size CDATA #REQUIRED - color %Color; #IMPLIED - face CDATA #IMPLIED - > - -<!ELEMENT font %Inline;> <!-- local change to font --> -<!ATTLIST font - %coreattrs; - %i18n; - size CDATA #IMPLIED - color %Color; #IMPLIED - face CDATA #IMPLIED - > - -<!--==================== Object ======================================--> -<!-- - object is used to embed objects as part of HTML pages. - param elements should precede other content. Parameters - can also be expressed as attribute/value pairs on the - object element itself when brevity is desired. 
---> - -<!ELEMENT object (#PCDATA | param | %block; | form | %inline; | %misc;)*> -<!ATTLIST object - %attrs; - declare (declare) #IMPLIED - classid %URI; #IMPLIED - codebase %URI; #IMPLIED - data %URI; #IMPLIED - type %ContentType; #IMPLIED - codetype %ContentType; #IMPLIED - archive %UriList; #IMPLIED - standby %Text; #IMPLIED - height %Length; #IMPLIED - width %Length; #IMPLIED - usemap %URI; #IMPLIED - name NMTOKEN #IMPLIED - tabindex %Number; #IMPLIED - align %ImgAlign; #IMPLIED - border %Pixels; #IMPLIED - hspace %Pixels; #IMPLIED - vspace %Pixels; #IMPLIED - > - -<!-- - param is used to supply a named property value. - In XML it would seem natural to follow RDF and support an - abbreviated syntax where the param elements are replaced - by attribute value pairs on the object start tag. ---> -<!ELEMENT param EMPTY> -<!ATTLIST param - id ID #IMPLIED - name CDATA #REQUIRED - value CDATA #IMPLIED - valuetype (data|ref|object) "data" - type %ContentType; #IMPLIED - > - -<!--=================== Java applet ==================================--> -<!-- - One of code or object attributes must be present. - Place param elements before other content. ---> -<!ELEMENT applet (#PCDATA | param | %block; | form | %inline; | %misc;)*> -<!ATTLIST applet - %coreattrs; - codebase %URI; #IMPLIED - archive CDATA #IMPLIED - code CDATA #IMPLIED - object CDATA #IMPLIED - alt %Text; #IMPLIED - name NMTOKEN #IMPLIED - width %Length; #REQUIRED - height %Length; #REQUIRED - align %ImgAlign; #IMPLIED - hspace %Pixels; #IMPLIED - vspace %Pixels; #IMPLIED - > - -<!--=================== Images ===========================================--> - -<!-- - To avoid accessibility problems for people who aren't - able to see the image, you should provide a text - description using the alt and longdesc attributes. - In addition, avoid the use of server-side image maps. 
---> - -<!ELEMENT img EMPTY> -<!ATTLIST img - %attrs; - src %URI; #REQUIRED - alt %Text; #REQUIRED - name NMTOKEN #IMPLIED - longdesc %URI; #IMPLIED - height %Length; #IMPLIED - width %Length; #IMPLIED - usemap %URI; #IMPLIED - ismap (ismap) #IMPLIED - align %ImgAlign; #IMPLIED - border %Length; #IMPLIED - hspace %Pixels; #IMPLIED - vspace %Pixels; #IMPLIED - > - -<!-- usemap points to a map element which may be in this document - or an external document, although the latter is not widely supported --> - -<!--================== Client-side image maps ============================--> - -<!-- These can be placed in the same document or grouped in a - separate document although this isn't yet widely supported --> - -<!ELEMENT map ((%block; | form | %misc;)+ | area+)> -<!ATTLIST map - %i18n; - %events; - id ID #REQUIRED - class CDATA #IMPLIED - style %StyleSheet; #IMPLIED - title %Text; #IMPLIED - name CDATA #IMPLIED - > - -<!ELEMENT area EMPTY> -<!ATTLIST area - %attrs; - %focus; - shape %Shape; "rect" - coords %Coords; #IMPLIED - href %URI; #IMPLIED - nohref (nohref) #IMPLIED - alt %Text; #REQUIRED - target %FrameTarget; #IMPLIED - > - -<!--================ Forms ===============================================--> - -<!ELEMENT form %form.content;> <!-- forms shouldn't be nested --> - -<!ATTLIST form - %attrs; - action %URI; #REQUIRED - method (get|post) "get" - name NMTOKEN #IMPLIED - enctype %ContentType; "application/x-www-form-urlencoded" - onsubmit %Script; #IMPLIED - onreset %Script; #IMPLIED - accept %ContentTypes; #IMPLIED - accept-charset %Charsets; #IMPLIED - target %FrameTarget; #IMPLIED - > - -<!-- - Each label must not contain more than ONE field - Label elements shouldn't be nested. 
---> -<!ELEMENT label %Inline;> -<!ATTLIST label - %attrs; - for IDREF #IMPLIED - accesskey %Character; #IMPLIED - onfocus %Script; #IMPLIED - onblur %Script; #IMPLIED - > - -<!ENTITY % InputType - "(text | password | checkbox | - radio | submit | reset | - file | hidden | image | button)" - > - -<!-- the name attribute is required for all but submit & reset --> - -<!ELEMENT input EMPTY> <!-- form control --> -<!ATTLIST input - %attrs; - %focus; - type %InputType; "text" - name CDATA #IMPLIED - value CDATA #IMPLIED - checked (checked) #IMPLIED - disabled (disabled) #IMPLIED - readonly (readonly) #IMPLIED - size CDATA #IMPLIED - maxlength %Number; #IMPLIED - src %URI; #IMPLIED - alt CDATA #IMPLIED - usemap %URI; #IMPLIED - onselect %Script; #IMPLIED - onchange %Script; #IMPLIED - accept %ContentTypes; #IMPLIED - align %ImgAlign; #IMPLIED - > - -<!ELEMENT select (optgroup|option)+> <!-- option selector --> -<!ATTLIST select - %attrs; - name CDATA #IMPLIED - size %Number; #IMPLIED - multiple (multiple) #IMPLIED - disabled (disabled) #IMPLIED - tabindex %Number; #IMPLIED - onfocus %Script; #IMPLIED - onblur %Script; #IMPLIED - onchange %Script; #IMPLIED - > - -<!ELEMENT optgroup (option)+> <!-- option group --> -<!ATTLIST optgroup - %attrs; - disabled (disabled) #IMPLIED - label %Text; #REQUIRED - > - -<!ELEMENT option (#PCDATA)> <!-- selectable choice --> -<!ATTLIST option - %attrs; - selected (selected) #IMPLIED - disabled (disabled) #IMPLIED - label %Text; #IMPLIED - value CDATA #IMPLIED - > - -<!ELEMENT textarea (#PCDATA)> <!-- multi-line text field --> -<!ATTLIST textarea - %attrs; - %focus; - name CDATA #IMPLIED - rows %Number; #REQUIRED - cols %Number; #REQUIRED - disabled (disabled) #IMPLIED - readonly (readonly) #IMPLIED - onselect %Script; #IMPLIED - onchange %Script; #IMPLIED - > - -<!-- - The fieldset element is used to group form fields. - Only one legend element should occur in the content - and if present should only be preceded by whitespace. 
---> -<!ELEMENT fieldset (#PCDATA | legend | %block; | form | %inline; | %misc;)*> -<!ATTLIST fieldset - %attrs; - > - -<!ENTITY % LAlign "(top|bottom|left|right)"> - -<!ELEMENT legend %Inline;> <!-- fieldset label --> -<!ATTLIST legend - %attrs; - accesskey %Character; #IMPLIED - align %LAlign; #IMPLIED - > - -<!-- - Content is %Flow; excluding a, form, form controls, iframe ---> -<!ELEMENT button %button.content;> <!-- push button --> -<!ATTLIST button - %attrs; - %focus; - name CDATA #IMPLIED - value CDATA #IMPLIED - type (button|submit|reset) "submit" - disabled (disabled) #IMPLIED - > - -<!-- single-line text input control (DEPRECATED) --> -<!ELEMENT isindex EMPTY> -<!ATTLIST isindex - %coreattrs; - %i18n; - prompt %Text; #IMPLIED - > - -<!--======================= Tables =======================================--> - -<!-- Derived from IETF HTML table standard, see [RFC1942] --> - -<!-- - The border attribute sets the thickness of the frame around the - table. The default units are screen pixels. - - The frame attribute specifies which parts of the frame around - the table should be rendered. The values are not the same as - CALS to avoid a name clash with the valign attribute. ---> -<!ENTITY % TFrame "(void|above|below|hsides|lhs|rhs|vsides|box|border)"> - -<!-- - The rules attribute defines which rules to draw between cells: - - If rules is absent then assume: - "none" if border is absent or border="0" otherwise "all" ---> - -<!ENTITY % TRules "(none | groups | rows | cols | all)"> - -<!-- horizontal placement of table relative to document --> -<!ENTITY % TAlign "(left|center|right)"> - -<!-- horizontal alignment attributes for cell contents - - char alignment char, e.g. 
char=':' - charoff offset for alignment char ---> -<!ENTITY % cellhalign - "align (left|center|right|justify|char) #IMPLIED - char %Character; #IMPLIED - charoff %Length; #IMPLIED" - > - -<!-- vertical alignment attributes for cell contents --> -<!ENTITY % cellvalign - "valign (top|middle|bottom|baseline) #IMPLIED" - > - -<!ELEMENT table - (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))> -<!ELEMENT caption %Inline;> -<!ELEMENT thead (tr)+> -<!ELEMENT tfoot (tr)+> -<!ELEMENT tbody (tr)+> -<!ELEMENT colgroup (col)*> -<!ELEMENT col EMPTY> -<!ELEMENT tr (th|td)+> -<!ELEMENT th %Flow;> -<!ELEMENT td %Flow;> - -<!ATTLIST table - %attrs; - summary %Text; #IMPLIED - width %Length; #IMPLIED - border %Pixels; #IMPLIED - frame %TFrame; #IMPLIED - rules %TRules; #IMPLIED - cellspacing %Length; #IMPLIED - cellpadding %Length; #IMPLIED - align %TAlign; #IMPLIED - bgcolor %Color; #IMPLIED - > - -<!ENTITY % CAlign "(top|bottom|left|right)"> - -<!ATTLIST caption - %attrs; - align %CAlign; #IMPLIED - > - -<!-- -colgroup groups a set of col elements. It allows you to group -several semantically related columns together. ---> -<!ATTLIST colgroup - %attrs; - span %Number; "1" - width %MultiLength; #IMPLIED - %cellhalign; - %cellvalign; - > - -<!-- - col elements define the alignment properties for cells in - one or more columns. - - The width attribute specifies the width of the columns, e.g. - - width=64 width in screen pixels - width=0.5* relative width of 0.5 - - The span attribute causes the attributes of one - col element to apply to more than one column. ---> -<!ATTLIST col - %attrs; - span %Number; "1" - width %MultiLength; #IMPLIED - %cellhalign; - %cellvalign; - > - -<!-- - Use thead to duplicate headers when breaking table - across page boundaries, or for static headers when - tbody sections are rendered in scrolling panel. 
- - Use tfoot to duplicate footers when breaking table - across page boundaries, or for static footers when - tbody sections are rendered in scrolling panel. - - Use multiple tbody sections when rules are needed - between groups of table rows. ---> -<!ATTLIST thead - %attrs; - %cellhalign; - %cellvalign; - > - -<!ATTLIST tfoot - %attrs; - %cellhalign; - %cellvalign; - > - -<!ATTLIST tbody - %attrs; - %cellhalign; - %cellvalign; - > - -<!ATTLIST tr - %attrs; - %cellhalign; - %cellvalign; - bgcolor %Color; #IMPLIED - > - -<!-- Scope is simpler than headers attribute for common tables --> -<!ENTITY % Scope "(row|col|rowgroup|colgroup)"> - -<!-- th is for headers, td for data and for cells acting as both --> - -<!ATTLIST th - %attrs; - abbr %Text; #IMPLIED - axis CDATA #IMPLIED - headers IDREFS #IMPLIED - scope %Scope; #IMPLIED - rowspan %Number; "1" - colspan %Number; "1" - %cellhalign; - %cellvalign; - nowrap (nowrap) #IMPLIED - bgcolor %Color; #IMPLIED - width %Length; #IMPLIED - height %Length; #IMPLIED - > - -<!ATTLIST td - %attrs; - abbr %Text; #IMPLIED - axis CDATA #IMPLIED - headers IDREFS #IMPLIED - scope %Scope; #IMPLIED - rowspan %Number; "1" - colspan %Number; "1" - %cellhalign; - %cellvalign; - nowrap (nowrap) #IMPLIED - bgcolor %Color; #IMPLIED - width %Length; #IMPLIED - height %Length; #IMPLIED - > - diff --git a/lib/htmlpurifier/docs/enduser-customize.html b/lib/htmlpurifier/docs/enduser-customize.html deleted file mode 100644 index 7e1ffa260..000000000 --- a/lib/htmlpurifier/docs/enduser-customize.html +++ /dev/null @@ -1,850 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Tutorial for customizing HTML Purifier's tag and attribute sets." 
/> -<link rel="stylesheet" type="text/css" href="style.css" /> - -<title>Customize - HTML Purifier</title> - -</head><body> - -<h1 class="subtitled">Customize!</h1> -<div class="subtitle">HTML Purifier is a Swiss-Army Knife</div> - -<div id="filing">Filed under End-User</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p> - HTML Purifier has this quirk where if you try to allow certain elements or - attributes, HTML Purifier will tell you that it's not supported, and that - you should go to the forums to find out how to implement it. Well, this - document is how to implement elements and attributes which HTML Purifier - doesn't support out of the box. -</p> - -<h2>Is it necessary?</h2> - -<p> - Before we even write any code, it is paramount to consider whether or - not the code we're writing is necessary or not. HTML Purifier, by default, - contains a large set of elements and attributes: large enough so that - <em>any</em> element or attribute in XHTML 1.0 or 1.1 (and its HTML variants) - that can be safely used by the general public is implemented. -</p> - -<p> - So what needs to be implemented? (Feel free to skip this section if - you know what you want). -</p> - -<h3>XHTML 1.0</h3> - -<p> - All of the modules listed below are based off of the - <a href="http://www.w3.org/TR/2001/REC-xhtml-modularization-20010410/abstract_modules.html#sec_5.2.">modularization of - XHTML</a>, which, while technically for XHTML 1.1, is quite a useful - resource. -</p> - -<ul> - <li>Structure</li> - <li>Frames</li> - <li>Applets (deprecated)</li> - <li>Forms</li> - <li>Image maps</li> - <li>Objects</li> - <li>Frames</li> - <li>Events</li> - <li>Meta-information</li> - <li>Style sheets</li> - <li>Link (not hypertext)</li> - <li>Base</li> - <li>Name</li> -</ul> - -<p> - If you don't recognize it, you probably don't need it. 
But the curious - can look all of these modules up in the above-mentioned document. Note - that inline scripting comes packaged with HTML Purifier (more on this - later). -</p> - -<h3>XHTML 1.1</h3> - -<p> - As of HTMLPurifier 2.1.0, we have implemented the - <a href="http://www.w3.org/TR/2001/REC-ruby-20010531/">Ruby module</a>, - which defines a set of tags - for publishing short annotations for text, used mostly in Japanese - and Chinese school texts, but applicable for positioning any text (not - limited to translations) above or below other corresponding text. -</p> - -<h3>HTML 5</h3> - -<p> - <a href="http://www.whatwg.org/specs/web-apps/current-work/">HTML 5</a> - is a fork of HTML 4.01 by WHATWG, who believed that XHTML 2.0 was headed - in the wrong direction. It too is a working draft, and may change - drastically before publication, but it should be noted that the - <code>canvas</code> tag has been implemented by many browser vendors. -</p> - -<h3>Proprietary</h3> - -<p> - There are a number of proprietary tags still in the wild. Many of them - have been documented in <a href="ref-proprietary-tags.txt">ref-proprietary-tags.txt</a>, - but there is currently no implementation for any of them. -</p> - -<h3>Extensions</h3> - -<p> - There are also a number of other XML languages out there that can - be embedded in HTML documents: two of the most popular are MathML and - SVG, and I frequently get requests to implement these. But they are - expansive, comprehensive specifications, and it would take far too long - to implement them <em>correctly</em> (most systems I've seen go as far - as whitelisting tags and no further; come on, what about nesting!) -</p> - -<p> - Word of warning: HTML Purifier is currently <em>not</em> namespace - aware. -</p> - -<h2>Giving back</h2> - -<p> - As you may imagine from the details above (don't be abashed if you didn't - read it all: a glance over would have done), there's quite a bit that - HTML Purifier doesn't implement. 
Recent architectural changes have - allowed HTML Purifier to implement elements and attributes that are not - safe! Don't worry, they won't be activated unless you set %HTML.Trusted - to true, but they certainly help out users who need to put, say, forms - on their page and don't want to go through the trouble of reading this - and implementing it themselves. -</p> - -<p> - So any of the above that you implement for your own application could - help out some other poor sap on the other side of the globe. Help us - out, and send back code so that it can be hammered into a module and - released with the core. Any code would be greatly appreciated! -</p> - -<h2>And now...</h2> - -<p> - Enough philosophical talk, time for some code: -</p> - -<pre>$config = HTMLPurifier_Config::createDefault(); -$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); -$config->set('HTML.DefinitionRev', 1); -if ($def = $config->maybeGetRawHTMLDefinition()) { - // our code will go here -}</pre> - -<p> - Assuming that HTML Purifier has already been properly loaded (hint: - include <code>HTMLPurifier.auto.php</code>), this code will set up - the environment that you need to start customizing the HTML definition. - What's going on? -</p> - -<ul> - <li> - The first three lines are regular configuration code: - <ul> - <li> - %HTML.DefinitionID is set to a unique identifier for your - custom HTML definition. This prevents it from clobbering - other custom definitions on the same installation. - </li> - <li> - %HTML.DefinitionRev is a revision integer of your HTML - definition. Because HTML definitions are cached, you'll need - to increment this whenever you make a change in order to flush - the cache. - </li> - </ul> - </li> - <li> - The fourth line retrieves a raw <code>HTMLPurifier_HTMLDefinition</code> - object that we will be tweaking. 
Interestingly enough, we have - placed it in an if block: this is because - <code>maybeGetRawHTMLDefinition</code>, as its name suggests, may - return a NULL, in which case we should skip doing any - initialization. This, in fact, will correspond to when our fully - customized object is already in the cache. - </li> -</ul> - -<h2>Turn off caching</h2> - -<p> - To make development easier, we're going to temporarily turn off - definition caching: -</p> - -<pre>$config = HTMLPurifier_Config::createDefault(); -$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); -$config->set('HTML.DefinitionRev', 1); -<strong>$config->set('Cache.DefinitionImpl', null); // TODO: remove this later!</strong> -$def = $config->getHTMLDefinition(true);</pre> - -<p> - A few things should be mentioned about the caching mechanism before - we move on. For performance reasons, HTML Purifier caches generated - <code>HTMLPurifier_Definition</code> objects in serialized files - stored (by default) in <code>library/HTMLPurifier/DefinitionCache/Serializer</code>. - A lot of processing is done in order to create these objects, so it - makes little sense to repeat the same processing over and over again - whenever HTML Purifier is called. -</p> - -<p> - In order to identify a cache entry, HTML Purifier uses three variables: - the library's version number, the value of %HTML.DefinitionRev and - a serial of relevant configuration. Whenever any of these changes, - a new HTML definition is generated. Notice that there is no way - for the definition object to track changes to customizations: here, it - is up to you to supply appropriate information to DefinitionID and - DefinitionRev. -</p> - -<h2 id="addAttribute">Add an attribute</h2> - -<p> - For this example, we're going to implement the <code>target</code> attribute found - on <code>a</code> elements. 
To implement an attribute, we have to - ask a few questions: -</p> - -<ol> - <li>What element is it found on?</li> - <li>What is its name?</li> - <li>Is it required or optional?</li> - <li>What are valid values for it?</li> -</ol> - -<p> - The first three are easy: the element is <code>a</code>, the attribute - is <code>target</code>, and it is not a required attribute. (If it - were required, we'd need to append an asterisk to the attribute name; - you'll see an example of this in the addElement() example.) -</p> - -<p> - The last question is a little trickier. - Let's allow the special values: _blank, _self, _target and _top. - The form of this is called an <strong>enumeration</strong>, a list of - valid values, although only one can be used at a time. To translate - this into code form, we write: -</p> - -<pre>$config = HTMLPurifier_Config::createDefault(); -$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); -$config->set('HTML.DefinitionRev', 1); -$config->set('Cache.DefinitionImpl', null); // remove this later! -$def = $config->getHTMLDefinition(true); -<strong>$def->addAttribute('a', 'target', 'Enum#_blank,_self,_target,_top');</strong></pre> - -<p> - The <code>Enum#_blank,_self,_target,_top</code> does all the magic. - The string is split into two parts, separated by a hash mark (#): -</p> - -<ol> - <li>The first part is the name of what we call an <code>AttrDef</code></li> - <li>The second part is the parameter of the above-mentioned <code>AttrDef</code></li> -</ol> - -<p> - If that sounds vague and generic, it's because it is! HTML Purifier defines - an assortment of different attribute types one can use, and each of these - has its own specialized parameter format.
Here are some of the more useful - ones: -</p> - -<table class="table"> - <thead> - <tr> - <th>Type</th> - <th>Format</th> - <th>Description</th> - </tr> - </thead> - <tbody> - <tr> - <th>Enum</th> - <td><em>[s:]</em>value1,value2,...</td> - <td> - Attribute with a number of valid values, one of which may be used. When - s: is present, the enumeration is case sensitive. - </td> - </tr> - <tr> - <th>Bool</th> - <td>attribute_name</td> - <td> - Boolean attribute, with only one valid value: the name - of the attribute. - </td> - </tr> - <tr> - <th>CDATA</th> - <td></td> - <td> - Attribute of arbitrary text. Can also be referred to as <strong>Text</strong> - (the specification makes a semantic distinction between the two). - </td> - </tr> - <tr> - <th>ID</th> - <td></td> - <td> - Attribute that specifies a unique ID - </td> - </tr> - <tr> - <th>Pixels</th> - <td></td> - <td> - Attribute that specifies an integer pixel length - </td> - </tr> - <tr> - <th>Length</th> - <td></td> - <td> - Attribute that specifies a pixel or percentage length - </td> - </tr> - <tr> - <th>NMTOKENS</th> - <td></td> - <td> - Attribute that specifies a number of name tokens, example: the - <code>class</code> attribute - </td> - </tr> - <tr> - <th>URI</th> - <td></td> - <td> - Attribute that specifies a URI, example: the <code>href</code> - attribute - </td> - </tr> - <tr> - <th>Number</th> - <td></td> - <td> - Attribute that specifies an positive integer number - </td> - </tr> - </tbody> -</table> - -<p> - For a complete list, consult - <a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/AttrTypes.php"><code>library/HTMLPurifier/AttrTypes.php</code></a>; - more information on attributes that accept parameters can be found on their - respective includes in - <a href="http://repo.or.cz/w/htmlpurifier.git?a=tree;hb=HEAD;f=library/HTMLPurifier/AttrDef"><code>library/HTMLPurifier/AttrDef</code></a>. 
-</p> - -<p> - Sometimes, the restrictive list in AttrTypes just doesn't cut it. Don't - sweat: you can also use a fully instantiated object as the value. The - equivalent, verbose form of the above example is: -</p> - -<pre>$config = HTMLPurifier_Config::createDefault(); -$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); -$config->set('HTML.DefinitionRev', 1); -$config->set('Cache.DefinitionImpl', null); // remove this later! -$def = $config->getHTMLDefinition(true); -<strong>$def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum( - array('_blank','_self','_target','_top') -));</strong></pre> - -<p> - Trust me, you'll learn to love the shorthand. -</p> - -<h2>Add an element</h2> - -<p> - Adding attributes is really small-fry stuff, though, and it was possible - to add them (albeit a bit more wordily) prior to 2.0. The real gem of - the Advanced API is adding elements. There are five questions to - ask when adding a new element: -</p> - -<ol> - <li>What is the element's name?</li> - <li>What content set does this element belong to?</li> - <li>What are the allowed children of this element?</li> - <li>What attributes does the element allow that are general?</li> - <li>What attributes does the element allow that are specific to this element?</li> -</ol> - -<p> - It's a mouthful, and you'll be slightly lost if you're not familiar with - the HTML specification, so let's explain them step by step. -</p> - -<h3>Content set</h3> - -<p> - The HTML specification defines two major content sets: Inline - and Block. Each of these - content sets contains a list of elements: Inline contains things like - <code>span</code> and <code>b</code> while Block contains things like - <code>div</code> and <code>blockquote</code>. -</p> - -<p> - These content sets amount to a macro mechanism for HTML definition. Most - elements in HTML are organized into one of these two sets, and most - elements in HTML allow elements from one of these sets.
If we had - to write each element verbatim into each other element's allowed - children, we would have ridiculously large lists; instead we use - content sets to compactify the declaration. -</p> - -<p> - Practically speaking, there are several useful values you can use here: -</p> - -<table class="table"> - <thead> - <tr> - <th>Content set</th> - <th>Description</th> - </tr> - </thead> - <tbody> - <tr> - <th>Inline</th> - <td>Character level elements, text</td> - </tr> - <tr> - <th>Block</th> - <td>Block-like elements, like paragraphs and lists</td> - </tr> - <tr> - <th><em>false</em></th> - <td> - Any element that doesn't fit into the mold, for example <code>li</code> - or <code>tr</code> - </td> - </tr> - </tbody> -</table> - -<p> - By specifying a valid value here, all other elements that use that - content set will also allow your element, without you having to do - anything. If you specify <em>false</em>, you'll have to register - your element manually. -</p> - -<h3>Allowed children</h3> - -<p> - Allowed children defines the elements that this element can contain. - The allowed values may range from none to a complex regexp depending on - your element. -</p> - -<p> - If you've ever taken a look at the HTML DTD's before, you may have - noticed declarations like this: -</p> - -<pre><!ELEMENT LI - O (%flow;)* -- list item --></pre> - -<p> - The <code>(%flow;)*</code> indicates the allowed children of the - <code>li</code> tag: <code>li</code> allows any number of flow - elements as its children. (The <code>- O</code> allows the closing tag to be - omitted, though in XML this is not allowed.) In HTML Purifier, - we'd write it like <code>Flow</code> (here's where the content sets - we were discussing earlier come into play). 
There are three shorthand - content models you can specify: -</p> - -<table class="table"> - <thead> - <tr> - <th>Content model</th> - <th>Description</th> - </tr> - </thead> - <tbody> - <tr> - <th>Empty</th> - <td>No children allowed, like <code>br</code> or <code>hr</code></td> - </tr> - <tr> - <th>Inline</th> - <td>Any number of inline elements and text, like <code>span</code></td> - </tr> - <tr> - <th>Flow</th> - <td>Any number of inline elements, block elements and text, like <code>div</code></td> - </tr> - </tbody> -</table> - -<p> - This covers 90% of all the cases out there, but what about elements that - break the mold like <code>ul</code>? This guy requires at least one - child, and the only valid children for it are <code>li</code>. The - content model is: <code>Required: li</code>. There are two parts: the - first type determines what <code>ChildDef</code> will be used to validate - content models. The most common values are: -</p> - -<table class="table"> - <thead> - <tr> - <th>Type</th> - <th>Description</th> - </tr> - </thead> - <tbody> - <tr> - <th>Required</th> - <td>Children must be one or more of the valid elements</td> - </tr> - <tr> - <th>Optional</th> - <td>Children can be any number of the valid elements</td> - </tr> - <tr> - <th>Custom</th> - <td>Children must follow the DTD-style regex</td> - </tr> - </tbody> -</table> - -<p> - You can also implement your own <code>ChildDef</code>: this was done - for a few special cases in HTML Purifier such as <code>Chameleon</code> - (for <code>ins</code> and <code>del</code>), <code>StrictBlockquote</code> - and <code>Table</code>. -</p> - -<p> - The second part specifies either valid elements or a regular expression. - Valid elements are separated with horizontal bars (|), i.e. - "<code>a | b | c</code>". Use #PCDATA to represent plain text. 
- Regular expressions are based on the DTD style: -</p> - -<ul> - <li>Parentheses () are used for grouping</li> - <li>Commas (,) separate elements that should come one after another</li> - <li>Horizontal bars (|) indicate one or the other elements should be used</li> - <li>Plus signs (+) are used for a one or more match</li> - <li>Asterisks (*) are used for a zero or more match</li> - <li>Question marks (?) are used for a zero or one match</li> -</ul> - -<p> - For example, "<code>a, b?, (c | d), e+, f*</code>" means "In this order, - one <code>a</code> element, at most one <code>b</code> element, - one <code>c</code> or <code>d</code> element (but not both), one or more - <code>e</code> elements, and any number of <code>f</code> elements." - Regex veterans should be able to jump right in, and those not so savvy - can always copy-paste W3C's content model definitions into HTML Purifier - and hope for the best. -</p> - -<p> - A word of warning: while the regex format is extremely flexible on - the developer's side, it is - quite unforgiving on the user's side. If the user input does not <em>exactly</em> - match the specification, the entire contents of the element will - be nuked. This is why there are specific content model types like - Optional and Required: while they could be implemented as <code>Custom: - (valid | elements)*</code>, the dedicated classes contain special recovery - measures that make sure as much of the user's original content as possible gets - through. HTML Purifier's core, as a rule, does not use Custom. -</p> - -<p> - One final note: you can also use Content Sets inside your valid elements - lists or regular expressions.
In fact, the three shorthand content models - mentioned above are just that: abbreviations: -</p> - -<table class="table"> - <thead> - <tr> - <th>Content model</th> - <th>Implementation</th> - </tr> - </thead> - <tbody> - <tr> - <th>Inline</th> - <td>Optional: Inline | #PCDATA</td> - </tr> - <tr> - <th>Flow</th> - <td>Optional: Flow | #PCDATA</td> - </tr> - </tbody> -</table> - -<p> - When the definition is compiled, Inline will be replaced with a - horizontal-bar separated list of inline elements. Also, notice that - it does not contain text: you have to specify that yourself. -</p> - -<h3>Common attributes</h3> - -<p> - Congratulations: you have just gotten over the proverbial hump (Allowed - children). Common attributes is much simpler, and boils down to - one question: does your element have the <code>id</code>, <code>style</code>, - <code>class</code>, <code>title</code> and <code>lang</code> attributes? - If so, you'll want to specify the <code>Common</code> attribute collection, - which contains these five attributes that are found on almost every - HTML element in the specification. -</p> - -<p> - There are a few more collections, but they're really edge cases: -</p> - -<table class="table"> - <thead> - <tr> - <th>Collection</th> - <th>Attributes</th> - </tr> - </thead> - <tbody> - <tr> - <th>I18N</th> - <td><code>lang</code>, possibly <code>xml:lang</code></td> - </tr> - <tr> - <th>Core</th> - <td><code>style</code>, <code>class</code>, <code>id</code> and <code>title</code></td> - </tr> - </tbody> -</table> - -<p> - Common is a combination of the above-mentioned collections. -</p> - -<p class="aside"> - Readers familiar with the modularization may have noticed that the Core - attribute collection differs from that specified by the <a - href="http://www.w3.org/TR/xhtml-modularization/abstract_modules.html#s_commonatts">abstract - modules of the XHTML Modularization 1.1</a>. 
We believe this section - to be in error, as <code>br</code> permits the use of the <code>style</code> - attribute even though it uses the <code>Core</code> collection, and - the DTD and XML Schemas supplied by W3C support our interpretation. -</p> - -<h3>Attributes</h3> - -<p> - If you didn't read the <a href="#addAttribute">earlier section on - adding attributes</a>, read it now. The last parameter is simply - an array mapping attribute names to attribute implementations, in the exact - same format as <code>addAttribute()</code>. -</p> - -<h3>Putting it all together</h3> - -<p> - We're going to implement <code>form</code>. Before we embark, let's - grab a reference implementation from over at the - <a href="http://www.w3.org/TR/html4/sgml/loosedtd.html">transitional DTD</a>: -</p> - -<pre><!ELEMENT FORM - - (%flow;)* -(FORM) -- interactive form --> -<!ATTLIST FORM - %attrs; -- %coreattrs, %i18n, %events -- - action %URI; #REQUIRED -- server-side form handler -- - method (GET|POST) GET -- HTTP method used to submit the form-- - enctype %ContentType; "application/x-www-form-urlencoded" - accept %ContentTypes; #IMPLIED -- list of MIME types for file upload -- - name CDATA #IMPLIED -- name of form for scripting -- - onsubmit %Script; #IMPLIED -- the form was submitted -- - onreset %Script; #IMPLIED -- the form was reset -- - target %FrameTarget; #IMPLIED -- render in this frame -- - accept-charset %Charsets; #IMPLIED -- list of supported charsets -- - ></pre> - -<p> - Juicy! With just this, we can answer four of our five questions: -</p> - -<ol> - <li>What is the element's name? <strong>form</strong></li> - <li>What content set does this element belong to? <strong>Block</strong> - (this needs a little sleuthing; I find the easiest way is to search - the DTD for <code>FORM</code> and determine which set it is in.)</li> - <li>What are the allowed children of this element?
<strong>Any - number of flow elements, but no nested <code>form</code>s</strong></li> - <li>What attributes does the element allow that are general? <strong>Common</strong></li> - <li>What attributes does the element allow that are specific to this element? <strong>A whole bunch, see ATTLIST; - we're going to do the vital ones: <code>action</code>, <code>method</code> and <code>name</code></strong></li> -</ol> - -<p> - Time for some code: -</p> - -<pre>$config = HTMLPurifier_Config::createDefault(); -$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); -$config->set('HTML.DefinitionRev', 1); -$config->set('Cache.DefinitionImpl', null); // remove this later! -$def = $config->getHTMLDefinition(true); -$def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum( - array('_blank','_self','_target','_top') -)); -<strong>$form = $def->addElement( - 'form', // name - 'Block', // content set - 'Flow', // allowed children - 'Common', // attribute collection - array( // attributes - 'action*' => 'URI', - 'method' => 'Enum#get,post', - 'name' => 'ID' - ) -); -$form->excludes = array('form' => true);</strong></pre> - -<p> - Each of the parameters corresponds to one of the questions we asked. - Notice that we added an asterisk to the end of the <code>action</code> - attribute to indicate that it is required. If someone specifies a - <code>form</code> without that attribute, the tag will be axed. - Also, the extra line at the end is an extra declaration that - prevents forms from being nested within each other. -</p> - -<p> - And that's all there is to it! Implementing the rest of the form - module is left as an exercise for the user; to see more examples - check the <a href="http://repo.or.cz/w/htmlpurifier.git?a=tree;hb=HEAD;f=library/HTMLPurifier/HTMLModule"><code>library/HTMLPurifier/HTMLModule/</code></a> directory - in your local HTML Purifier installation.
-</p> - -<h2>And beyond...</h2> - -<p> - Perceptive users may have realized that, to a certain extent, we - have simply re-implemented the facilities of XML Schema or the - Document Type Definition. What you are seeing here, however, is - not just an XML Schema or Document Type Definition: it is a fully - expressive method of specifying the definition of HTML that is - a portable superset of the capabilities of the two above-mentioned schema - languages. What makes HTMLDefinition so powerful is the fact that - if we don't have an implementation for a content model or an attribute - definition, you can supply it yourself by writing a PHP class. -</p> - -<p> - There are many facets of HTMLDefinition beyond the Advanced API I have - walked you through today. To find out more about these, you can - check out these source files: -</p> - -<ul> - <li><a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/HTMLModule.php"><code>library/HTMLPurifier/HTMLModule.php</code></a></li> - <li><a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/ElementDef.php"><code>library/HTMLPurifier/ElementDef.php</code></a></li> -</ul> - -<h2 id="optimized">Notes for HTML Purifier 4.2.0 and earlier</h2> - -<p> - Previously, this tutorial gave some incorrect template code for - editing raw definitions, and that template code will now produce the - error <q>Due to a documentation error in previous version of HTML - Purifier...</q> Here is how to mechanically transform old-style - code into new-style code. -</p> - -<p> - First, identify all code that edits the raw definition object, and - put it together. Ensure none of this code must be run on every - request; if some sub-part needs to always be run, move it outside - this block. Here is an example, with the raw definition - object code bolded.
-</p> - -<pre>$config = HTMLPurifier_Config::createDefault(); -$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); -$config->set('HTML.DefinitionRev', 1); -$def = $config->getHTMLDefinition(true); -<strong>$def->addAttribute('a', 'target', 'Enum#_blank,_self,_target,_top');</strong> -$purifier = new HTMLPurifier($config);</pre> - -<p> - Next, replace the raw definition retrieval with a - maybeGetRawHTMLDefinition method call inside an if conditional, and - place the editing code inside that if block. -</p> - -<pre>$config = HTMLPurifier_Config::createDefault(); -$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial'); -$config->set('HTML.DefinitionRev', 1); -<strong>if ($def = $config->maybeGetRawHTMLDefinition()) { - $def->addAttribute('a', 'target', 'Enum#_blank,_self,_target,_top'); -}</strong> -$purifier = new HTMLPurifier($config);</pre> - -<p> - And you're done! Alternatively, if you're OK with never caching - your definition, the following will still work and not emit warnings. -</p> - -<pre>$config = HTMLPurifier_Config::createDefault(); -$def = $config->getHTMLDefinition(true); -$def->addAttribute('a', 'target', 'Enum#_blank,_self,_target,_top'); -$purifier = new HTMLPurifier($config);</pre> - -<p> - A slightly less efficient version of this is what - old versions of HTML Purifier were doing. -</p> - -<p> - <em>Technical notes:</em> ajh pointed out <a - href="http://htmlpurifier.org/phorum/read.php?5,5164,5169#msg-5169">in a forum topic</a> that - HTML Purifier appeared to be repeatedly writing to the cache even - when a cache entry already existed. Investigation led to the - discovery of the following infelicity: caching of customized - definitions didn't actually work! The problem was that even though - a cache file would be written out at the end of the process, there - was no way for HTML Purifier to say, <q>Actually, I've already got a - copy of your work, no need to reconfigure your - customizations</q>.
This required the API to change: placing - all of the customizations to the raw definition object in a - conditional which could be skipped. -</p> - -</body></html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/enduser-id.html b/lib/htmlpurifier/docs/enduser-id.html deleted file mode 100644 index 53d2da248..000000000 --- a/lib/htmlpurifier/docs/enduser-id.html +++ /dev/null @@ -1,148 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Explains various methods for allowing IDs in documents safely in HTML Purifier." /> -<link rel="stylesheet" type="text/css" href="./style.css" /> - -<title>IDs - HTML Purifier</title> - -</head><body> - -<h1 class="subtitled">IDs</h1> -<div class="subtitle">What they are, why you should(n't) wear them, and how to deal with it</div> - -<div id="filing">Filed under End-User</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p>Prior to HTML Purifier 1.2.0, this library blithely accepted user input that -looked like this:</p> - -<pre><a id="fragment">Anchor</a></pre> - -<p>...presenting an attractive vector for those that would destroy standards -compliance: simply set the ID to one that is already used elsewhere in the -document and voila: validation breaks. 
There was a half-hearted attempt to -prevent this by allowing users to blacklist IDs, but I suspect that no one -really bothered, and thus, with the release of 1.2.0, IDs are now <em>removed</em> -by default.</p> - -<p>IDs, however, are quite useful to have, so if users start -complaining about broken anchors you'll probably want to turn them back on -with %Attr.EnableID. But before you go mucking around with the config -object, it's probably worth taking some precautions to keep your page -validating. Why?</p> - -<ol> - <li>Standards-compliant pages are good</li> - <li>Duplicated IDs interfere with anchors. If there are two id="foobar"s in a - document, which spot does a browser presented with the fragment #foobar go - to? Most browsers opt for the first appearing ID, making it impossible - to reference the second section. Similarly, duplicated IDs can hijack - client-side scripting that relies on the IDs of elements.</li> -</ol> - -<p>You have (currently) four ways of dealing with the problem.</p> - - - -<h2 class="subtitled">Blacklisting IDs</h2> -<div class="subsubtitle">Good for pages with a single content source and stable templates</div> - -<p>In keeping with the -<acronym title="Keep It Simple, Stupid">KISS</acronym> principle, let us -deal with the most obvious solution: preventing users from using any IDs that -appear elsewhere in the document. The method is simple:</p> - -<pre>$config->set('Attr.EnableID', true); -$config->set('Attr.IDBlacklist', array( - 'list', 'of', 'attribute', 'values', 'that', 'are', 'forbidden' -));</pre> - -<p>That being said, there are some notable drawbacks. First of all, you have to -know precisely which IDs are being used by the HTML surrounding the user code. -This is easier said than done: quite often the page designer and the system -coder work separately, so the designer has to constantly be talking with the -coder whenever he decides to add a new anchor.
Miss one and you open yourself -to possible standards-compliance issues.</p> - -<p>Furthermore, this position becomes untenable when a single web page must hold -multiple portions of user-submitted content. Since there's obviously no way -to find out beforehand what IDs users will use, the blacklist is helpless. -And since HTML Purifier validates each segment separately, perhaps doing -so at different times, it would be extremely difficult to dynamically update -the blacklist in between runs.</p> - -<p>Finally, simply destroying the ID is extremely un-userfriendly behavior: after -all, the user might have simply specified a duplicate ID by accident.</p> - -<p>Thus, we get to our second method.</p> - - - -<h2 class="subtitled">Namespacing IDs</h2> -<div class="subsubtitle">Lazy developer's way, but needs user education</div> - -<p>This method, too, is quite simple: add a prefix to all user IDs. With this -code:</p> - -<pre>$config->set('Attr.EnableID', true); -$config->set('Attr.IDPrefix', 'user_');</pre> - -<p>...this:</p> - -<pre><a id="foobar">Anchor!</a></pre> - -<p>...turns into:</p> - -<pre><a id="user_foobar">Anchor!</a></pre> - -<p>As long as you don't have any IDs that start with user_, collisions are -guaranteed not to happen. The drawback is obvious: if a user submits -id="foobar", they probably expect to be able to reference their page with -#foobar. You'll have to tell them, "No, that doesn't work, you have to add -user_ to the beginning."</p> - -<p>And yes, things get hairier. Even with a nice prefix, we still have done -nothing about multiple HTML Purifier outputs on one page. Thus, we have -a second configuration value to piggy-back off of: %Attr.IDPrefixLocal:</p> - -<pre>$config->set('Attr.IDPrefixLocal', 'comment' . $id . '_');</pre> - -<p>This new directive does nothing but append to the regular IDPrefix, but is -special in that it is volatile: its value is determined at run-time and -cannot possibly be cordoned into, say, a .ini config file.
What to -put into the directive is up to you, but I would recommend the ID number -the text has been assigned in the database. Whatever you pick, however, it -has to be unique and stable for the text you are validating. Note, however, -that we require that %Attr.IDPrefix be set before you use this directive.</p> - -<p>And also remember: the user has to know what this prefix is too!</p> - - - -<h2>Abstinence</h2> - -<p>You may not want to bother. That's okay too, just don't enable IDs.</p> - -<p>Personally, I would take this road whenever user-submitted content might -be shown together on one page. Why a blog comment would need to use -anchors is beyond me.</p> - - - -<h2>Denial</h2> - -<p>To revert to pre-1.2.0 behavior, simply:</p> - -<pre>$config->set('Attr.EnableID', true);</pre> - -<p>Don't come crying to me when your page mysteriously stops validating, though.</p> - -</body> -</html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/enduser-overview.txt b/lib/htmlpurifier/docs/enduser-overview.txt deleted file mode 100644 index fe7f8705d..000000000 --- a/lib/htmlpurifier/docs/enduser-overview.txt +++ /dev/null @@ -1,59 +0,0 @@ - -HTML Purifier - by Edward Z. Yang - -There are a number of ad hoc HTML filtering solutions out there on the web -(some examples including HTML_Safe, kses and SafeHtmlChecker.class.php) that -claim to filter HTML properly, preventing malicious JavaScript and layout -breaking HTML from getting through the parser. None of them, however, -demonstrates a thorough knowledge of either the DTD that defines the HTML -or the caveats of HTML that cannot be expressed by a DTD. Configurable -filters (such as kses or PHP's built-in strip_tags() function) have trouble -validating the contents of attributes and can be subject to security attacks -due to poor configuration.
Other filters take the naive approach of -blacklisting known threats and tags, failing to account for the introduction -of new technologies, new tags, new attributes or quirky browser behavior. - -However, HTML Purifier takes a different approach, one that doesn't use -specification-ignorant regexes or narrow blacklists. HTML Purifier will -decompose the whole document into tokens, and rigorously process the tokens by: -removing non-whitelisted elements, transforming bad practice tags like <font> -into <span>, properly checking the nesting of tags and their children and -validating all attributes according to their RFCs. - -To my knowledge, there is nothing like this on the web yet. Not even MediaWiki, -which allows an amazingly diverse mix of HTML and wikitext in its documents, -gets all the nesting quirks right. Existing solutions hope that no JavaScript -will slip through, but either do not attempt to ensure that the resulting -output is valid XHTML or send the HTML through a draconic XML parser (and yet -still get the nesting wrong: SafeHtmlChecker.class.php does not prevent <a> -tags from being nested within each other). - -This document no longer is a detailed description of how HTMLPurifier works, -as those descriptions have been moved to the appropriate code. The first -draft was drawn up after two rough code sketches and the implementation of a -forgiving lexer. You may also be interested in the unit tests located in the -tests/ folder, which provide a living document on how exactly the filter deals -with malformed input. - -In summary (see corresponding classes for more details): - -1. Parse document into an array of tag and text tokens (Lexer) -2. Remove all elements not on whitelist and transform certain other elements - into acceptable forms (i.e. <font>) -3. Make document well formed while helpfully taking into account certain quirks, - such as the fact that <p> tags traditionally are closed by other block-level - elements. -4. 
Run through all nodes and check children for proper order (especially - important for tables). -5. Validate attributes according to more restrictive definitions based on the - RFCs. -6. Translate back into a string. (Generator) - -HTML Purifier is best suited for documents that require a rich array of -HTML tags. Things like blog comments are, in all likelihood, most appropriately -written in an extremely restrictive set of markup that doesn't require -all this functionality (or not written in HTML at all), although this may -be changing in the future with the addition of levels of filtering. - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/enduser-security.txt b/lib/htmlpurifier/docs/enduser-security.txt deleted file mode 100644 index 518f092bd..000000000 --- a/lib/htmlpurifier/docs/enduser-security.txt +++ /dev/null @@ -1,18 +0,0 @@ - -Security - -Like anything that claims to afford security, HTML_Purifier can be circumvented -through negligence of people. This class will do its job: no more, no less, -and it's up to you to provide it the proper information and proper context -to be effective. Things to remember: - -1. Character Encoding: see enduser-utf8.html for more info. - -2. IDs: see enduser-id.html for more info - -3. URIs: see enduser-uri-filter.html - -4. CSS: document pending -Explain which CSS styles we blocked and why. 
- - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/enduser-slow.html b/lib/htmlpurifier/docs/enduser-slow.html deleted file mode 100644 index f0ea02de1..000000000 --- a/lib/htmlpurifier/docs/enduser-slow.html +++ /dev/null @@ -1,120 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Explains how to speed up HTML Purifier through caching or inbound filtering." /> -<link rel="stylesheet" type="text/css" href="./style.css" /> - -<title>Speeding up HTML Purifier - HTML Purifier</title> - -</head><body> - -<h1 class="subtitled">Speeding up HTML Purifier</h1> -<div class="subtitle">...also known as the HELP ME LIBRARY IS TOO SLOW MY PAGE TAKE TOO LONG page</div> - -<div id="filing">Filed under End-User</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p>HTML Purifier is a very powerful library. But with power comes great -responsibility, in the form of longer execution times. Remember, this -library isn't lightly grazing over submitted HTML: it's deconstructing -the whole thing, rigorously checking the parts, and then putting it back -together. </p> - -<p>So, if it so turns out that HTML Purifier is kinda too slow for outbound -filtering, you've got a few options: </p> - -<h2>Inbound filtering</h2> - -<p>Perform filtering of HTML when it's submitted by the user. Since the -user is already submitting something, an extra half a second tacked on -to the load time probably isn't going to be that huge of a problem. -Then, displaying the content is simply a matter of outputting it -directly from your database/filesystem. 
The trouble with this method is -that your user loses the original text, and when doing edits, will be -handling the filtered text. While this may be a good thing, especially -if you're using a WYSIWYG editor, it can also result in data-loss if a -user makes a typo. </p> - -<p>Example (non-functional):</p> - -<pre><?php - /** - * FORM SUBMISSION PAGE - * display_error($message) : displays nice error page with message - * display_success() : displays a nice success page - * display_form() : displays the HTML submission form - * database_insert($html) : inserts data into database as new row - */ - if (!empty($_POST)) { - require_once '/path/to/library/HTMLPurifier.auto.php'; - require_once 'HTMLPurifier.func.php'; - $dirty_html = isset($_POST['html']) ? $_POST['html'] : false; - if (!$dirty_html) { - display_error('You must write some HTML!'); - } - $html = HTMLPurifier($dirty_html); - database_insert($html); - display_success(); - // notice that $dirty_html is *not* saved - } else { - display_form(); - } -?></pre> - -<h2>Caching the filtered output</h2> - -<p>Accept the submitted text and put it unaltered into the database, but -then also generate a filtered version and stash that in the database. -Serve the filtered version to readers, and the unaltered version to -editors. If need be, you can invalidate the cache and have the cached -filtered version be regenerated on the first page view. Pros? Full data -retention. Cons? It's more complicated, and opens other editors up to -XSS if they are using a WYSIWYG editor (to fix that, they'd have to be -able to get their hands on the *really* original text served in -plaintext mode). 
</p> - -<p>Example (non-functional):</p> - -<pre><?php - /** - * VIEW PAGE - * display_error($message) : displays nice error page with message - * cache_get($id) : retrieves HTML from fast cache (db or file) - * cache_insert($id, $html) : inserts good HTML into cache system - * database_get($id) : retrieves raw HTML from database - */ - $id = isset($_GET['id']) ? (int) $_GET['id'] : false; - if (!$id) { - display_error('Must specify ID.'); - exit; - } - $html = cache_get($id); // filesystem or database - if ($html === false) { - // cache didn't have the HTML, generate it - $raw_html = database_get($id); - require_once '/path/to/library/HTMLPurifier.auto.php'; - require_once 'HTMLPurifier.func.php'; - $html = HTMLPurifier($raw_html); - cache_insert($id, $html); - } - echo $html; -?></pre> - -<h2>Summary</h2> - -<p>In short, inbound filtering is the simple option and caching is the -robust option (albeit with bigger storage requirements). </p> - -<p>There is a third option, independent of the two we've discussed: profile -and optimize HTMLPurifier yourself. Be sure to report back your results -if you decide to do that! Especially if you port HTML Purifier to C++. -<tt>;-)</tt></p> - -</body> -</html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/enduser-tidy.html b/lib/htmlpurifier/docs/enduser-tidy.html deleted file mode 100644 index a243f7fc2..000000000 --- a/lib/htmlpurifier/docs/enduser-tidy.html +++ /dev/null @@ -1,231 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Tutorial for tweaking HTML Purifier's Tidy-like behavior." 
/> -<link rel="stylesheet" type="text/css" href="style.css" /> - -<title>Tidy - HTML Purifier</title> - -</head><body> - -<h1>Tidy</h1> - -<div id="filing">Filed under Development</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p>You've probably heard of HTML Tidy, Dave Raggett's little piece -of software that cleans up poorly written HTML. Let me say it straight -out:</p> - -<p class="emphasis">This ain't HTML Tidy!</p> - -<p>Rather, Tidy stands for a cool set of Tidy-inspired features in HTML Purifier -that allows users to submit deprecated elements and attributes and get -valid strict markup back. For example:</p> - -<pre><center>Centered</center></pre> - -<p>...becomes:</p> - -<pre><div style="text-align:center;">Centered</div></pre> - -<p>...when this particular fix is run on the HTML. This tutorial will give -you the lowdown of what exactly HTML Purifier will do when Tidy -is on, and how to fine-tune this behavior. Once again, <strong>you do -not need Tidy installed on your PHP to use these features!</strong></p> - -<h2>What does it do?</h2> - -<p>Tidy will do several things to your HTML:</p> - -<ul> - <li>Convert deprecated elements and attributes to standards-compliant - alternatives</li> - <li>Enforce XHTML compatibility guidelines and other best practices</li> - <li>Preserve data that would normally be removed as per W3C</li> -</ul> - -<h2>What are levels?</h2> - -<p>Levels describe how aggressive the Tidy module should be when -cleaning up HTML. There are four levels to pick: none, light, medium -and heavy. Each of these levels has a well-defined set of behavior -associated with it, although it may change depending on your doctype.</p> - -<dl> - <dt>light</dt> - <dd>This is the <strong>lenient</strong> level. 
If a tag or attribute - is about to be removed because it isn't supported by the - doctype, Tidy will step in and change into an alternative that - is supported.</dd> - <dt>medium</dt> - <dd>This is the <strong>correctional</strong> level. At this level, - all the functions of light are performed, as well as some extra, - non-essential best practices enforcement. Changes made on this - level are very benign and are unlikely to cause problems.</dd> - <dt>heavy</dt> - <dd>This is the <strong>aggressive</strong> level. If a tag or - attribute is deprecated, it will be converted into a non-deprecated - version, no ifs ands or buts.</dd> -</dl> - -<p>By default, Tidy operates on the <strong>medium</strong> level. You can -change the level of cleaning by setting the %HTML.TidyLevel configuration -directive:</p> - -<pre>$config->set('HTML.TidyLevel', 'heavy'); // burn baby burn!</pre> - -<h2>Is the light level really light?</h2> - -<p>It depends on what doctype you're using. If your documents are HTML -4.01 <em>Transitional</em>, HTML Purifier will be lazy -and won't clean up your <code>center</code> -or <code>font</code> tags. But if you're using HTML 4.01 <em>Strict</em>, -HTML Purifier has no choice: it has to convert them, or they will -be nuked out of existence. So while light on Transitional will result -in little to no changes, light on Strict will still result in quite -a lot of fixes.</p> - -<p>This is different behavior from 1.6 or before, where deprecated -tags in transitional documents would -always be cleaned up regardless. This is also better behavior.</p> - -<h2>My pages look different!</h2> - -<p>HTML Purifier is tasked with converting deprecated tags and -attributes to standards-compliant alternatives, which usually -need copious amounts of CSS. It's also not foolproof: sometimes -things do get lost in the translation. 
This is why when HTML Purifier -can get away with not doing cleaning, it won't; this is why -the default value is <strong>medium</strong> and not heavy.</p> - -<p>Fortunately, only a few attributes have problems with the switch -over. They are described below:</p> - -<table class="table"> - <thead><tr> - <th>Element@Attr</th> - <th>Changes</th> - </tr></thead> - <tbody> - <tr> - <td>caption@align</td> - <td>Firefox supports stuffing the caption on the - left and right side of the table, a feature that - Internet Explorer, understandably, does not have. - When align equals right or left, the text will simply - be aligned on the left or right side.</td> - </tr> - <tr> - <td>img@align</td> - <td>The implementation for align bottom is good, but not - perfect. There are a few pixel differences.</td> - </tr> - <tr> - <td>br@clear</td> - <td>Clear both gets a little wonky in Internet Explorer. Haven't - really been able to figure out why.</td> - </tr> - <tr> - <td>hr@noshade</td> - <td>All browsers implement this slightly differently: we've - chosen to make noshade horizontal rules gray.</td> - </tr> - </tbody> -</table> - -<p>There are a few more minor, although irritating, bugs. -Some older browsers support deprecated attributes, -but not CSS. Transformed elements and attributes will look unstyled -to said browsers. Also, CSS precedence is slightly different for -inline styles versus presentational markup. In increasing precedence:</p> - -<ol> - <li>Presentational attributes</li> - <li>External style sheets</li> - <li>Inline styling</li> -</ol> - -<p>This means that styling that may have been masked by external CSS -declarations will start showing up (a good thing, perhaps). Finally, -if you've turned off the style attribute, almost all of -these transformations will not work. 
Sorry mates.</p> - -<p>You can review the rendering before and after of these transformations -by consulting the <a -href="http://htmlpurifier.org/live/smoketests/attrTransform.php">attrTransform.php -smoketest</a>.</p> - -<h2>I like the general idea, but the specifics bug me!</h2> - -<p>So you want HTML Purifier to clean up your HTML, but you're not -so happy about the br@clear implementation. That's perfectly fine! -HTML Purifier will make accommodations:</p> - -<pre>$config->set('HTML.Doctype', 'XHTML 1.0 Transitional'); -$config->set('HTML.TidyLevel', 'heavy'); // all changes, minus... -<strong>$config->set('HTML.TidyRemove', 'br@clear');</strong></pre> - -<p>That third line does the magic, removing the br@clear fix -from the module, ensuring that <code><br clear="both" /></code> -will pass through unharmed. The reverse is possible too:</p> - -<pre>$config->set('HTML.Doctype', 'XHTML 1.0 Transitional'); -$config->set('HTML.TidyLevel', 'none'); // no changes, plus... -<strong>$config->set('HTML.TidyAdd', 'p@align');</strong></pre> - -<p>In this case, all transformations are shut off, except for the p@align -one, which you found handy.</p> - -<p>To find out what the names of fixes you want to turn on or off are, -you'll have to consult the source code, specifically the files in -<code>HTMLPurifier/HTMLModule/Tidy/</code>. 
There is, however, a -general syntax:</p> - -<table class="table"> - <thead> - <tr> - <th>Name</th> - <th>Example</th> - <th>Interpretation</th> - </tr> - </thead> - <tbody> - <tr> - <td>element</td> - <td>font</td> - <td>Tag transform for <em>element</em></td> - </tr> - <tr> - <td>element@attr</td> - <td>br@clear</td> - <td>Attribute transform for <em>attr</em> on <em>element</em></td> - </tr> - <tr> - <td>@attr</td> - <td>@lang</td> - <td>Global attribute transform for <em>attr</em></td> - </tr> - <tr> - <td>e#content_model_type</td> - <td>blockquote#content_model_type</td> - <td>Change of child processing implementation for <em>e</em></td> - </tr> - </tbody> -</table> - -<h2>So... what's the lowdown?</h2> - -<p>The lowdown is, quite frankly, HTML Purifier's default settings are -probably good enough. The next step is to bump the level up to heavy, -and if that still doesn't satisfy your appetite, do some fine-tuning. -Other than that, don't worry about it: this all works silently and -effectively in the background.</p> - -</body></html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/enduser-uri-filter.html b/lib/htmlpurifier/docs/enduser-uri-filter.html deleted file mode 100644 index d1b3354a3..000000000 --- a/lib/htmlpurifier/docs/enduser-uri-filter.html +++ /dev/null @@ -1,204 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Tutorial for creating custom URI filters." 
/> -<link rel="stylesheet" type="text/css" href="style.css" /> - -<title>URI Filters - HTML Purifier</title> - -</head><body> - -<h1>URI Filters</h1> - -<div id="filing">Filed under End-User</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p> - This is a quick and dirty document to get you on your way to writing - custom URI filters for your own URL filtering needs. Why would you - want to write a URI filter? If you need URIs your users put into - HTML to magically change into a different URI, this is - exactly what you need! -</p> - -<h2>Creating the class</h2> - -<p> - Any URI filter you make will be a subclass of <code>HTMLPurifier_URIFilter</code>. - The scaffolding is thus: -</p> - -<pre>class HTMLPurifier_URIFilter_<strong>NameOfFilter</strong> extends HTMLPurifier_URIFilter -{ - public $name = '<strong>NameOfFilter</strong>'; - public function prepare($config) {} - public function filter(&$uri, $config, $context) {} -}</pre> - -<p> - Fill in the variable <code>$name</code> with the name of your filter, and - take a look at the two methods. <code>prepare()</code> is an initialization - method that is called only once, before any filtering has been done of the - HTML. Use it to perform any costly setup work that only needs to be done - once. <code>filter()</code> is the guts and innards of our filter: - it takes the URI and does whatever needs to be done to it. -</p> - -<p> - If you've worked with HTML Purifier, you'll recognize the <code>$config</code> - and <code>$context</code> parameters. On the other hand, <code>$uri</code> - is something unique to this section of the application: it's a - <code>HTMLPurifier_URI</code> object. 
The interface is thus: -</p> - -<pre>class HTMLPurifier_URI -{ - public $scheme, $userinfo, $host, $port, $path, $query, $fragment; - public function HTMLPurifier_URI($scheme, $userinfo, $host, $port, $path, $query, $fragment); - public function toString(); - public function copy(); - public function getSchemeObj($config, $context); - public function validate($config, $context); -}</pre> - -<p> - The first three methods are fairly self-explanatory: you have a constructor, - a serializer, and a cloner. Generally, you won't be using them when - you are manipulating the URI objects themselves. - <code>getSchemeObj()</code> is a special purpose method that returns - a <code>HTMLPurifier_URIScheme</code> object corresponding to the specific - URI at hand. <code>validate()</code> performs general-purpose validation - on the internal components of a URI. Once again, you don't need to - worry about these: they've already been handled for you. -</p> - -<h2>URI format</h2> - -<p> - As a URIFilter, we're interested in the member variables of the URI object. 
-</p> - -<table class="quick"><tbody> - <tr><th>Scheme</th> <td>The protocol for identifying (and possibly locating) a resource (http, ftp, https)</td></tr> - <tr><th>Userinfo</th> <td>User information such as a username (bob)</td></tr> - <tr><th>Host</th> <td>Domain name or IP address of the server (example.com, 127.0.0.1)</td></tr> - <tr><th>Port</th> <td>Network port number for the server (80, 12345)</td></tr> - <tr><th>Path</th> <td>Data that identifies the resource, possibly hierarchical (/path/to, ed@example.com)</td></tr> - <tr><th>Query</th> <td>String of information to be interpreted by the resource (?q=search-term)</td></tr> - <tr><th>Fragment</th> <td>Additional information for the resource after retrieval (#bookmark)</td></tr> -</tbody></table> - -<p> - Because the URI is presented to us in this form, and not - <code>http://bob@example.com:8080/foo.php?q=string#hash</code>, it saves us - a lot of trouble in having to parse the URI every time we want to filter - it. For the record, the above URI has the following components: -</p> - -<table class="quick"><tbody> - <tr><th>Scheme</th> <td>http</td></tr> - <tr><th>Userinfo</th> <td>bob</td></tr> - <tr><th>Host</th> <td>example.com</td></tr> - <tr><th>Port</th> <td>8080</td></tr> - <tr><th>Path</th> <td>/foo.php</td></tr> - <tr><th>Query</th> <td>q=string</td></tr> - <tr><th>Fragment</th> <td>hash</td></tr> -</tbody></table> - -<p> - Note that there is no question mark or octothorpe in the query or - fragment: these get removed during parsing. -</p> - -<p> - With this information, you can get straight to implementing your - <code>filter()</code> method. But one more thing... -</p> - -<h2>Return value: Boolean, not URI</h2> - -<p> - You may have noticed that the URI is being passed in by reference. - This means that whatever changes you make to it, those changes will - be reflected in the URI object the caller had. 
<strong>Do not - return the URI object: it is unnecessary and will cause bugs.</strong> - Instead, return a boolean value, true if the filtering was successful, - or false if the URI is beyond repair and needs to be axed. -</p> - -<p> - Let's suppose I wanted to write a filter that converted links with a - custom <code>image</code> scheme to its corresponding real path on - our website: -</p> - -<pre>class HTMLPurifier_URIFilter_TransformImageScheme extends HTMLPurifier_URIFilter -{ - public $name = 'TransformImageScheme'; - public function filter(&$uri, $config, $context) { - if ($uri->scheme !== 'image') return true; - $img_name = $uri->path; - // Overwrite the previous URI object - $uri = new HTMLPurifier_URI('http', null, null, null, '/img/' . $img_name . '.png', null, null); - return true; - } -}</pre> - -<p> - Notice I did not <code>return $uri;</code>. This filter would turn - <code>image:Foo</code> into <code>/img/Foo.png</code>. -</p> - -<h2>Activating your filter</h2> - -<p> - Having a filter is all well and good, but you need to tell HTML Purifier - to use it. Fortunately, this part's simple: -</p> - -<pre>$uri = $config->getDefinition('URI'); -$uri->addFilter(new HTMLPurifier_URIFilter_<strong>NameOfFilter</strong>(), $config);</pre> - -<p> - After adding a filter, you won't be able to set configuration directives. - Structure your code accordingly. -</p> - -<!-- XXX: link to new documentation system --> - -<h2>Post-filter</h2> - -<p> - Remember our TransformImageScheme filter? That filter acted before we had - performed scheme validation; otherwise, the URI would have been filtered - out when it was discovered that there was no image scheme. Well, a post-filter - is run after scheme specific validation, so it's ideal for bulk - post-processing of URIs, including munging. To specify a URI as a post-filter, - set the <code>$post</code> member variable to TRUE. 
-</p> - -<pre>class HTMLPurifier_URIFilter_MyPostFilter extends HTMLPurifier_URIFilter -{ - public $name = 'MyPostFilter'; - public $post = true; - // ... extra code here -} -</pre> - -<h2>Examples</h2> - -<p> - Check the - <a href="http://repo.or.cz/w/htmlpurifier.git?a=tree;hb=HEAD;f=library/HTMLPurifier/URIFilter">URIFilter</a> - directory for more implementation examples, and see <a href="proposal-new-directives.txt">the - new directives proposal document</a> for ideas on what could be implemented - as a filter. -</p> - -</body></html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/enduser-utf8.html b/lib/htmlpurifier/docs/enduser-utf8.html deleted file mode 100644 index 9b01a302a..000000000 --- a/lib/htmlpurifier/docs/enduser-utf8.html +++ /dev/null @@ -1,1060 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Describes the rationale for using UTF-8, the ramifications otherwise, and how to make the switch." /> -<link rel="stylesheet" type="text/css" href="./style.css" /> -<style type="text/css"> - .minor td {font-style:italic;} -</style> - -<title>UTF-8: The Secret of Character Encoding - HTML Purifier</title> - -<!-- Note to users: this document, though professing to be UTF-8, attempts -to use only ASCII characters, because most webservers are configured -to send HTML as ISO-8859-1. So I will, many times, go against my -own advice for sake of portability. 
--> - -</head><body> - -<h1>UTF-8: The Secret of Character Encoding</h1> - -<div id="filing">Filed under End-User</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p>Character encoding and character sets are not that -difficult to understand, but so many people blithely stumble -through the worlds of programming without knowing what to actually -do about it, or say "Ah, it's a job for those <em>internationalization</em> -experts." No, it is not! This document will walk you through -determining the encoding of your system and how you should handle -this information. It will stay away from excessive discussion on -the internals of character encoding.</p> - -<p>This document is not designed to be read in its entirety: it will -slowly introduce concepts that build on each other: you need not get to -the bottom to have learned something new. However, I strongly -recommend you read all the way to <strong>Why UTF-8?</strong>, because at least -at that point you'd have made a conscious decision not to migrate, -which can be a rewarding (but difficult) task.</p> - -<blockquote class="aside"> -<div class="label">Asides</div> - <p>Text in this formatting is an <strong>aside</strong>, - interesting tidbits for the curious but not strictly necessary material to - do the tutorial. 
If you read this text, you'll come out - with a greater understanding of the underlying issues.</p> -</blockquote> - -<h2>Table of Contents</h2> - -<ol id="toc"> - <li><a href="#findcharset">Finding the real encoding</a></li> - <li><a href="#findmetacharset">Finding the embedded encoding</a></li> - <li><a href="#fixcharset">Fixing the encoding</a><ol> - <li><a href="#fixcharset-none">No embedded encoding</a></li> - <li><a href="#fixcharset-diff">Embedded encoding disagrees</a></li> - <li><a href="#fixcharset-server">Changing the server encoding</a><ol> - <li><a href="#fixcharset-server-php">PHP header() function</a></li> - <li><a href="#fixcharset-server-phpini">PHP ini directive</a></li> - <li><a href="#fixcharset-server-nophp">Non-PHP</a></li> - <li><a href="#fixcharset-server-htaccess">.htaccess</a></li> - <li><a href="#fixcharset-server-ext">File extensions</a></li> - </ol></li> - <li><a href="#fixcharset-xml">XML</a></li> - <li><a href="#fixcharset-internals">Inside the process</a></li> - </ol></li> - <li><a href="#whyutf8">Why UTF-8?</a><ol> - <li><a href="#whyutf8-i18n">Internationalization</a></li> - <li><a href="#whyutf8-user">User-friendly</a></li> - <li><a href="#whyutf8-forms">Forms</a><ol> - <li><a href="#whyutf8-forms-urlencoded">application/x-www-form-urlencoded</a></li> - <li><a href="#whyutf8-forms-multipart">multipart/form-data</a></li> - </ol></li> - <li><a href="#whyutf8-support">Well supported</a></li> - <li><a href="#whyutf8-htmlpurifier">HTML Purifiers</a></li> - </ol></li> - <li><a href="#migrate">Migrate to UTF-8</a><ol> - <li><a href="#migrate-db">Configuring your database</a><ol> - <li><a href="#migrate-db-legit">Legit method</a></li> - <li><a href="#migrate-db-binary">Binary</a></li> - </ol></li> - <li><a href="#migrate-editor">Text editor</a></li> - <li><a href="#migrate-bom">Byte Order Mark (headers already sent!)</a></li> - <li><a href="#migrate-fonts">Fonts</a><ol> - <li><a href="#migrate-fonts-obscure">Obscure scripts</a></li> - 
<li><a href="#migrate-fonts-occasional">Occasional use</a></li> - </ol></li> - <li><a href="#migrate-variablewidth">Dealing with variable width in functions</a></li> - </ol></li> - <li><a href="#externallinks">Further Reading</a></li> -</ol> - -<h2 id="findcharset">Finding the real encoding</h2> - -<p>In the beginning, there was ASCII, and things were simple. But they -weren't good, for no one could write in Cyrillic or Thai. So there -exploded a proliferation of character encodings to remedy the problem -by extending the characters ASCII could express. This ridiculously -simplified version of the history of character encodings shows us that -there are now many character encodings floating around.</p> - -<blockquote class="aside"> - <p>A <strong>character encoding</strong> tells the computer how to - interpret raw zeroes and ones into real characters. It - usually does this by pairing numbers with characters.</p> - <p>There are many different types of character encodings floating - around, but the ones we deal most frequently with are ASCII, - 8-bit encodings, and Unicode-based encodings.</p> - <ul> - <li><strong>ASCII</strong> is a 7-bit encoding based on the - English alphabet.</li> - <li><strong>8-bit encodings</strong> are extensions to ASCII - that add a potpourri of useful, non-standard characters - like é and æ. They can only add 128 characters, - so usually only support one script at a time. When you - see a page on the web, chances are it's encoded in one - of these encodings.</li> - <li><strong>Unicode-based encodings</strong> implement the - Unicode standard and include UTF-8, UTF-16 and UTF-32/UCS-4. - They go beyond 8-bits and support almost - every language in the world. UTF-8 is gaining traction - as the dominant international encoding of the web.</li> - </ul> -</blockquote> - -<p>The first step of our journey is to find out what the encoding of -your website is. 
The most reliable way is to ask your -browser:</p> - -<dl> - <dt>Mozilla Firefox</dt> - <dd>Tools > Page Info: Encoding</dd> - <dt>Internet Explorer</dt> - <dd>View > Encoding: bulleted item is unofficial name</dd> -</dl> - -<p>Internet Explorer won't give you the MIME (i.e. useful/real) name of the -character encoding, so you'll have to look it up using their description. -Some common ones:</p> - -<table class="table"> - <thead><tr> - <th>IE's Description</th> - <th>MIME Name</th> - </tr></thead> - <tbody> - <tr><th colspan="2">Windows</th></tr> - <tr><td>Arabic (Windows)</td><td>Windows-1256</td></tr> - <tr><td>Baltic (Windows)</td><td>Windows-1257</td></tr> - <tr><td>Central European (Windows)</td><td>Windows-1250</td></tr> - <tr><td>Cyrillic (Windows)</td><td>Windows-1251</td></tr> - <tr><td>Greek (Windows)</td><td>Windows-1253</td></tr> - <tr><td>Hebrew (Windows)</td><td>Windows-1255</td></tr> - <tr><td>Thai (Windows)</td><td>TIS-620</td></tr> - <tr><td>Turkish (Windows)</td><td>Windows-1254</td></tr> - <tr><td>Vietnamese (Windows)</td><td>Windows-1258</td></tr> - <tr><td>Western European (Windows)</td><td>Windows-1252</td></tr> - </tbody> - <tbody> - <tr><th colspan="2">ISO</th></tr> - <tr><td>Arabic (ISO)</td><td>ISO-8859-6</td></tr> - <tr><td>Baltic (ISO)</td><td>ISO-8859-4</td></tr> - <tr><td>Central European (ISO)</td><td>ISO-8859-2</td></tr> - <tr><td>Cyrillic (ISO)</td><td>ISO-8859-5</td></tr> - <tr class="minor"><td>Estonian (ISO)</td><td>ISO-8859-13</td></tr> - <tr class="minor"><td>Greek (ISO)</td><td>ISO-8859-7</td></tr> - <tr><td>Hebrew (ISO-Logical)</td><td>ISO-8859-8-I</td></tr> - <tr><td>Hebrew (ISO-Visual)</td><td>ISO-8859-8</td></tr> - <tr class="minor"><td>Latin 9 (ISO)</td><td>ISO-8859-15</td></tr> - <tr class="minor"><td>Turkish (ISO)</td><td>ISO-8859-9</td></tr> - <tr><td>Western European (ISO)</td><td>ISO-8859-1</td></tr> - </tbody> - <tbody> - <tr><th colspan="2">Other</th></tr> - <tr><td>Chinese Simplified 
(GB18030)</td><td>GB18030</td></tr> - <tr><td>Chinese Simplified (GB2312)</td><td>GB2312</td></tr> - <tr><td>Chinese Simplified (HZ)</td><td>HZ</td></tr> - <tr><td>Chinese Traditional (Big5)</td><td>Big5</td></tr> - <tr><td>Japanese (Shift-JIS)</td><td>Shift_JIS</td></tr> - <tr><td>Japanese (EUC)</td><td>EUC-JP</td></tr> - <tr><td>Korean</td><td>EUC-KR</td></tr> - <tr><td>Unicode (UTF-8)</td><td>UTF-8</td></tr> - </tbody> -</table> - -<p>Internet Explorer does not recognize some of the more obscure -character encodings, and having to look up the real names with a table -is a pain, so I recommend using Mozilla Firefox to find out your -character encoding.</p> - -<h2 id="findmetacharset">Finding the embedded encoding</h2> - -<p>At this point, you may be asking, "Didn't we already find out our -encoding?" Well, as it turns out, there are multiple places where -a web developer can specify a character encoding, and one such place -is in a <code>META</code> tag:</p> - -<pre><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></pre> - -<p>You'll find this in the <code>HEAD</code> section of an HTML document. -The text to the right of <code>charset=</code> is the "claimed" -encoding: the HTML claims to be this encoding, but whether or not this -is actually the case depends on other factors. For now, take note -if your <code>META</code> tag claims that either:</p> - -<ol> - <li>The character encoding is the same as the one reported by the - browser,</li> - <li>The character encoding is different from the browser's, or</li> - <li>There is no <code>META</code> tag at all! (horror, horror!)</li> -</ol> - -<h2 id="fixcharset">Fixing the encoding</h2> - -<p class="aside">The advice given here is for pages being served as -vanilla <code>text/html</code>. 
Different practices must be used -for <code>application/xml</code> or <code>application/xhtml+xml</code>, see -<a href="http://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020430/">W3C's -document on XHTML media types</a> for more information.</p> - -<p>If your <code>META</code> encoding and your real encoding match, -savvy! You can skip this section. If they don't...</p> - -<h3 id="fixcharset-none">No embedded encoding</h3> - -<p>If this is the case, you'll want to add in the appropriate -<code>META</code> tag to your website. It's as simple as copy-pasting -the code snippet above and replacing UTF-8 with whatever is the MIME name -of your real encoding.</p> - -<blockquote class="aside"> - <p>For all those skeptics out there, there is a very good reason - why the character encoding should be explicitly stated. When the - browser isn't told what the character encoding of a text is, it - has to guess: and sometimes the guess is wrong. Hackers can manipulate - this guess in order to slip XSS past filters and then fool the - browser into executing it as active code. A great example of this - is the <a href="http://shiflett.org/archive/177">Google UTF-7 - exploit</a>.</p> - <p>You might be able to get away with not specifying a character - encoding with the <code>META</code> tag as long as your webserver - sends the right Content-Type header, but why risk it? Besides, if - the user downloads the HTML file, there is no longer any webserver - to define the character encoding.</p> -</blockquote> - -<h3 id="fixcharset-diff">Embedded encoding disagrees</h3> - -<p>This is an extremely common mistake: another source is telling -the browser what the -character encoding is and is overriding the embedded encoding. This -source usually is the Content-Type HTTP header that the webserver (i.e. -Apache) sends. 
A usual Content-Type header sent with a page might -look like this:</p> - -<pre>Content-Type: text/html; charset=ISO-8859-1</pre> - -<p>Notice how there is a charset parameter: this is the webserver's -way of telling a browser what the character encoding is, much like -the <code>META</code> tags we touched upon previously.</p> - -<blockquote class="aside"><p>In fact, the <code>META</code> tag is -designed as a substitute for the HTTP header for contexts where -sending headers is impossible (such as locally stored files without -a webserver). Thus the name <code>http-equiv</code> (HTTP equivalent). -</p></blockquote> - -<p>There are two ways to go about fixing this: changing the <code>META</code> -tag to match the HTTP header, or changing the HTTP header to match -the <code>META</code> tag. How do we know which to do? It depends -on the website's content: after all, headers and tags are only ways of -describing the actual characters on the web page.</p> - -<p>If your website:</p> - -<dl> - <dt>...only uses ASCII characters,</dt> - <dd>Either way is fine, but I recommend switching both to - UTF-8 (more on this later).</dd> - <dt>...uses special characters, and they display - properly,</dt> - <dd>Change the embedded encoding to the server encoding.</dd> - <dt>...uses special characters, but users often complain that - they come out garbled,</dt> - <dd>Change the server encoding to the embedded encoding.</dd> -</dl> - -<p>Changing a META tag is easy: just swap out the old encoding -for the new. Changing the server (HTTP header) encoding, however, -is slightly more difficult.</p> - -<h3 id="fixcharset-server">Changing the server encoding</h3> - -<h4 id="fixcharset-server-php">PHP header() function</h4> - -<p>The simplest way to handle this problem is to send the encoding -yourself, via your programming language. 
Since you're using HTML -Purifier, I'll assume PHP, although it's not too difficult to do -similar things in -<a href="http://www.w3.org/International/O-HTTP-charset#scripting">other -languages</a>. The appropriate code is:</p> - -<pre><a href="http://php.net/function.header">header</a>('Content-Type:text/html; charset=UTF-8');</pre> - -<p>...replacing UTF-8 with whatever your embedded encoding is. -This code must come before any output, so be careful about -stray whitespace in your application (i.e., any whitespace before -output excluding whitespace within <?php ?> tags).</p> - -<h4 id="fixcharset-server-phpini">PHP ini directive</h4> - -<p>PHP also has a neat little ini directive that can save you a -header call: <code><a href="http://php.net/ini.core#ini.default-charset">default_charset</a></code>. Using this code:</p> - -<pre><a href="http://php.net/function.ini_set">ini_set</a>('default_charset', 'UTF-8');</pre> - -<p>...will also do the trick. If PHP is running as an Apache module (and -not as FastCGI, consult -<a href="http://php.net/phpinfo">phpinfo</a>() for details), you can even use htaccess to apply this property -across many PHP files:</p> - -<pre><a href="http://php.net/configuration.changes#configuration.changes.apache">php_value</a> default_charset "UTF-8"</pre> - -<blockquote class="aside"><p>As with all INI directives, this can -also go in your php.ini file. Some hosting providers allow you to customize -your own php.ini file, ask your support for details. Use:</p> -<pre>default_charset = "utf-8"</pre></blockquote> - -<h4 id="fixcharset-server-nophp">Non-PHP</h4> - -<p>You may, for whatever reason, need to set the character encoding -on non-PHP files, usually plain ol' HTML files. 
Doing this -is more of a hit-or-miss process: depending on the software being -used as a webserver and the configuration of that software, certain -techniques may work, or may not work.</p> - -<h4 id="fixcharset-server-htaccess">.htaccess</h4> - -<p>On Apache, you can use an .htaccess file to change the character -encoding. I'll defer to -<a href="http://www.w3.org/International/questions/qa-htaccess-charset">W3C</a> -for the in-depth explanation, but it boils down to creating a file -named .htaccess with the contents:</p> - -<pre><a href="http://httpd.apache.org/docs/1.3/mod/mod_mime.html#addcharset">AddCharset</a> UTF-8 .html</pre> - -<p>...where UTF-8 is replaced with the character encoding you want to -use and .html is the file extension that this will be applied to. This -character encoding will then be set for any file directly in, -or in the subdirectories of, the directory you place this file in.</p> - -<p>If you're feeling particularly courageous, you can use:</p> - -<pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> UTF-8</pre> - -<p>...which changes the character set Apache adds to any document that -doesn't have any Content-Type parameters. This directive, which the -default configuration file sets to iso-8859-1 for security -reasons, is probably why your headers mismatch -with the <code>META</code> tag. If you would prefer Apache not to be -butting in on your character encodings, you can tell it not -to send anything at all:</p> - -<pre><a href="http://httpd.apache.org/docs/1.3/mod/core.html#adddefaultcharset">AddDefaultCharset</a> Off</pre> - -<p>...making your internal charset declaration (usually the <code>META</code> tags) -the sole source of character encoding -information.
In these cases, it is <em>especially</em> important to make -sure you have valid <code>META</code> tags on your pages and all the -text before them is ASCII.</p> - -<blockquote class="aside"><p>These directives can also be -placed in httpd.conf file for Apache, but -in most shared hosting situations you won't be able to edit this file. -</p></blockquote> - -<h4 id="fixcharset-server-ext">File extensions</h4> - -<p>If you're not allowed to use .htaccess files, you can often -piggy-back off of Apache's default AddCharset declarations to get -your files in the proper extension. Here are Apache's default -character set declarations:</p> - -<table class="table"> - <thead><tr> - <th>Charset</th> - <th>File extension(s)</th> - </tr></thead> - <tbody> - <tr><td>ISO-8859-1</td><td>.iso8859-1 .latin1</td></tr> - <tr><td>ISO-8859-2</td><td>.iso8859-2 .latin2 .cen</td></tr> - <tr><td>ISO-8859-3</td><td>.iso8859-3 .latin3</td></tr> - <tr><td>ISO-8859-4</td><td>.iso8859-4 .latin4</td></tr> - <tr><td>ISO-8859-5</td><td>.iso8859-5 .latin5 .cyr .iso-ru</td></tr> - <tr><td>ISO-8859-6</td><td>.iso8859-6 .latin6 .arb</td></tr> - <tr><td>ISO-8859-7</td><td>.iso8859-7 .latin7 .grk</td></tr> - <tr><td>ISO-8859-8</td><td>.iso8859-8 .latin8 .heb</td></tr> - <tr><td>ISO-8859-9</td><td>.iso8859-9 .latin9 .trk</td></tr> - <tr><td>ISO-2022-JP</td><td>.iso2022-jp .jis</td></tr> - <tr><td>ISO-2022-KR</td><td>.iso2022-kr .kis</td></tr> - <tr><td>ISO-2022-CN</td><td>.iso2022-cn .cis</td></tr> - <tr><td>Big5</td><td>.Big5 .big5 .b5</td></tr> - <tr><td>WINDOWS-1251</td><td>.cp-1251 .win-1251</td></tr> - <tr><td>CP866</td><td>.cp866</td></tr> - <tr><td>KOI8-r</td><td>.koi8-r .koi8-ru</td></tr> - <tr><td>KOI8-ru</td><td>.koi8-uk .ua</td></tr> - <tr><td>ISO-10646-UCS-2</td><td>.ucs2</td></tr> - <tr><td>ISO-10646-UCS-4</td><td>.ucs4</td></tr> - <tr><td>UTF-8</td><td>.utf8</td></tr> - <tr><td>GB2312</td><td>.gb2312 .gb </td></tr> - <tr><td>utf-7</td><td>.utf7</td></tr> - 
<tr><td>EUC-TW</td><td>.euc-tw</td></tr> - <tr><td>EUC-JP</td><td>.euc-jp</td></tr> - <tr><td>EUC-KR</td><td>.euc-kr</td></tr> - <tr><td>shift_jis</td><td>.sjis</td></tr> - </tbody> -</table> - -<p>So, for example, a file named <code>page.utf8.html</code> or -<code>page.html.utf8</code> will probably be sent with the UTF-8 charset -attached, the difference being that if there is an -<code>AddCharset charset .html</code> declaration, it will override -the .utf8 extension in <code>page.utf8.html</code> (precedence moves -from right to left). By default, Apache has no such declaration.</p> - -<h4 id="fixcharset-server-iis">Microsoft IIS</h4> - -<p>If anyone can contribute information on how to configure Microsoft -IIS to change character encodings, I'd be grateful.</p> - -<h3 id="fixcharset-xml">XML</h3> - -<p><code>META</code> tags are the most common source of embedded -encodings, but they can also come from somewhere else: XML -Declarations. They look like:</p> - -<pre><?xml version="1.0" encoding="UTF-8"?></pre> - -<p>...and are most often found in XML documents (including XHTML).</p> - -<p>For XHTML, this XML Declaration theoretically -overrides the <code>META</code> tag. In reality, this happens only when the -XHTML is actually served as legit XML and not HTML, which is almost always -never due to Internet Explorer's lack of support for -<code>application/xhtml+xml</code> (even though doing so is often -argued to be <a href="http://www.hixie.ch/advocacy/xhtml">good -practice</a> and is required by the XHTML 1.1 specification).</p> - -<p>For XML, however, this XML Declaration is extremely important. -Since most webservers are not configured to send charsets for .xml files, -this is the only thing a parser has to go on. 
Furthermore, the default -for XML files is UTF-8, which often butts heads with the more common -ISO-8859-1 encoding (you see this in garbled RSS feeds).</p> - -<p>In short, if you use XHTML and have gone through the -trouble of adding the XML Declaration, make sure it jibes -with your <code>META</code> tags (which should only be present -if served in text/html) and HTTP headers.</p> - -<h3 id="fixcharset-internals">Inside the process</h3> - -<p>This section is not required reading, -but may answer some of your questions on what's going on in all -this character encoding hocus pocus. If you're interested in -moving on to the next phase, skip this section.</p> - -<p>A logical question that follows all of our wheeling and dealing -with multiple sources of character encodings is "Why are there -so many options?" To answer this question, we have to turn -back to our definition of character encodings: they allow a program -to interpret bytes into human-readable characters.</p> - -<p>Thus, a chicken-egg problem: a character encoding -is necessary to interpret the -text of a document. A <code>META</code> tag is in the text of a document. -The <code>META</code> tag gives the character encoding. How can we -determine the contents of a <code>META</code> tag, inside the text, -if we don't know its character encoding? And how do we figure out -the character encoding, if we don't know the contents of the -<code>META</code> tag?</p> - -<p>Fortunately for us, the characters we need to write the -<code>META</code> tag are in ASCII, which is pretty much universal -across every character encoding that is in common use today.
So, -all the web-browser has to do is parse all the way down until -it gets to the Content-Type tag, extract the character encoding, -then re-parse the document according to this new information.</p> - -<p>Obviously this is complicated, so browsers prefer the simpler -and more efficient solution: get the character encoding from -somewhere other than the document itself, i.e. the HTTP headers, -much to the chagrin of HTML authors who can't set these headers.</p> - -<h2 id="whyutf8">Why UTF-8?</h2> - -<p>So, you've gone through all the trouble of ensuring that your -server and embedded character encodings all line up properly and are -present. Good job: at -this point, you could quit and rest easy knowing that your pages -are not vulnerable to character encoding style XSS attacks. -However, just as having a character encoding is better than -having no character encoding at all, having UTF-8 as your -character encoding is better than having some other random -character encoding, and the next step is to convert to UTF-8. -But why?</p> - -<h3 id="whyutf8-i18n">Internationalization</h3> - -<p>Many software projects, at one point or another, suddenly realize -that they should be supporting more than one language. Even regular -usage in one language sometimes requires the occasional special character -that, unsurprisingly, is not available in your character set. Sometimes -developers get around this by adding support for multiple encodings: when -using Chinese, use Big5, when using Japanese, use Shift-JIS, when -using Greek, etc. Other times, they use character references with great -zeal.</p> - -<p>UTF-8, however, obviates the need for any of these complicated -measures. After getting the system to use UTF-8 and adjusting for -sources that are outside the hand of the browser (more on this later), -UTF-8 just works.
You can use it for any language, even many languages -at once; you don't have to worry about managing multiple encodings, -and you don't have to use those user-unfriendly entities.</p> - -<h3 id="whyutf8-user">User-friendly</h3> - -<p>Websites encoded in Latin-1 (ISO-8859-1) which occasionally need -a special character outside of their scope often will use a character -entity reference to achieve the desired effect. For instance, θ can be -written <code>&theta;</code>, regardless of the character encoding's -support of Greek letters.</p> - -<p>This works nicely for limited use of special characters, but -say you wanted this sentence of Chinese text: 激光, -這兩個字是甚麼意思. -The ampersand-encoded version would look like this:</p> - -<pre>&#28608;&#20809;, &#36889;&#20841;&#20491;&#23383;&#26159;&#29978;&#40636;&#24847;&#24605;</pre> - -<p>Extremely inconvenient for those of us who actually know what -character entities are, totally unintelligible to poor users who don't! -Even the slightly more user-friendly, "intelligible" character -entities like <code>&theta;</code> will leave users who are -uninterested in learning HTML scratching their heads. On the other -hand, if they see θ in an edit box, they'll know that it's a -special character, and treat it accordingly, even if they don't know -how to write that character themselves.</p> - -<blockquote class="aside"><p>Wikipedia is a great case study for -an application that originally used ISO-8859-1 but switched to UTF-8 -when it became far too cumbersome to support foreign languages. Bots -will now actually go through articles and convert character entities -to their corresponding real characters for the sake of user-friendliness -and searchability. See -<a href="http://meta.wikimedia.org/wiki/Help:Special_characters">Meta's -page on special characters</a> for more details.
-</p></blockquote> - -<h3 id="whyutf8-forms">Forms</h3> - -<p>While we're on the tack of users, how do non-UTF-8 web forms deal -with characters that are outside of their character set? Rather than -discuss what UTF-8 does right, we're going to show what could go wrong -if you didn't use UTF-8 and people tried to use characters outside -of your character encoding.</p> - -<p>The troubles are large, extensive, and extremely difficult to fix (or, -at least, difficult enough that if you had the time and resources to invest -in doing the fix, you would be probably better off migrating to UTF-8). -There are two types of form submission: <code>application/x-www-form-urlencoded</code> -which is used for GET and by default for POST, and <code>multipart/form-data</code> -which may be used by POST, and is required when you want to upload -files.</p> - -<p>The following is a summarization of notes from -<a href="http://web.archive.org/web/20060427015200/ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html"> -<code>FORM</code> submission and i18n</a>. That document contains lots -of useful information, but is written in a rambly manner, so -here I try to get right to the point. (Note: the original has -disappeared off the web, so I am linking to the Web Archive copy.)</p> - -<h4 id="whyutf8-forms-urlencoded"><code>application/x-www-form-urlencoded</code></h4> - -<p>This is the Content-Type that GET requests must use, and POST requests -use by default. It involves the ubiquitous percent encoding format that -looks something like: <code>%C3%86</code>. There is no official way of -determining the character encoding of such a request, since the percent -encoding operates on a byte level, so it is usually assumed that it -is the same as the encoding the page containing the form was submitted -in. (<a href="http://tools.ietf.org/html/rfc3986#section-2.5">RFC 3986</a> -recommends that textual identifiers be translated to UTF-8; however, browser -compliance is spotty.) 
You'll run into very few problems -if you only use characters in the character encoding you chose.</p> - -<p>However, once you start adding characters outside of your encoding -(and this is a lot more common than you may think: take curly -"smart" quotes from Microsoft as an example), -all manner of strange things start to happen. Depending on the -browser in use, it might:</p> - -<ul> - <li>Replace the unsupported characters with useless question marks,</li> - <li>Attempt to fix the characters (example: smart quotes to regular quotes),</li> - <li>Replace the character with a character entity reference, or</li> - <li>Send it anyway as a different character encoding mixed in - with the original encoding (usually Windows-1252 rather than - iso-8859-1 or UTF-8 interspersed in 8-bit)</li> -</ul> - -<p>To properly guard against these behaviors, you'd have to sniff out -the browser agent, compile a database of different behaviors, and -take appropriate conversion action against the string (disregarding -a spate of extremely mysterious, random and devastating bugs Internet -Explorer manifests every once in a while). Or you could -use UTF-8 and rest easy knowing that none of this could possibly happen -since UTF-8 supports every character.</p> - -<h4 id="whyutf8-forms-multipart"><code>multipart/form-data</code></h4> - -<p>Multipart form submission takes away a lot of the ambiguity -that percent-encoding had: the server now can explicitly ask for -certain encodings, and the client can explicitly tell the server -during the form submission what encoding the fields are in.</p> - -<p>There are two ways you can go with this functionality: leave it -unset and have the browser send in the same encoding as the page, -or set it to UTF-8 and then do another conversion server-side.
-Each method has deficiencies, especially the former.</p> - -<p>If you tell the browser to send the form in the same encoding as -the page, you still have the trouble of what to do with characters -that are outside of the character encoding's range. The behavior, once -again, varies: Firefox 2.0 converts them to character entity references -while Internet Explorer 7.0 mangles them beyond intelligibility. For -serious internationalization purposes, this is not an option.</p> - -<p>The other possibility is to set the form's <code>accept-charset</code> attribute to UTF-8, which -raises the question: Why aren't you using UTF-8 for everything then? -This route is more palatable, but there's a notable caveat: your data -will come in as UTF-8, so you will have to explicitly convert it into -your favored local character encoding.</p> - -<p>I object to this approach on ideological grounds: you're -digging yourself deeper into -the hole when you could have been converting to UTF-8 -instead. And, of course, you can't use this method for GET requests.</p> - -<h3 id="whyutf8-support">Well supported</h3> - -<p>Almost every modern browser in the wild today has full UTF-8 and Unicode -support: the number of troublesome cases can be counted on the -fingers of one hand, and these browsers usually have trouble with -other character encodings too. Problems users usually encounter stem -from the lack of appropriate fonts to display the characters (once -again, this applies to all character encodings and HTML entities) or -Internet Explorer's lack of intelligent font picking (which can be -worked around).</p> - -<p>We will go into more detail about how to deal with edge cases in -the browser world in the Migration section, but rest assured that -converting to UTF-8, if done correctly, will not result in users -hounding you about broken pages.</p> - -<h3 id="whyutf8-htmlpurifier">HTML Purifier</h3> - -<p>And finally, we get to HTML Purifier.
HTML Purifier is built to -deal with UTF-8: any indications otherwise are the result of an -encoder that converts text from your preferred encoding to UTF-8, and -back again. HTML Purifier never touches anything else, and leaves -it up to the module iconv to do the dirty work.</p> - -<p>This approach, however, is not perfect. iconv is blithely unaware -of HTML character entities. HTML Purifier, in order to -protect against sophisticated escaping schemes, normalizes all character -and numeric entity references before processing the text. This leads to -one important ramification:</p> - -<p><strong>Any character that is not supported by the target character -set, regardless of whether or not it is in the form of a character -entity reference or a raw character, will be silently ignored.</strong></p> - -<p>Example of this principle at work: say you have <code>&theta;</code> -in your HTML, but the output is in Latin-1 (which, understandably, -does not understand Greek), the following process will occur (assuming you've -set the encoding correctly using %Core.Encoding):</p> - -<ul> - <li>The <code>Encoder</code> will transform the text from ISO 8859-1 to UTF-8 - (note that theta is preserved here since it doesn't actually use - any non-ASCII characters): <code>&theta;</code></li> - <li>The <code>EntityParser</code> will transform all named and numeric - character entities to their corresponding raw UTF-8 equivalents: - <code>θ</code></li> - <li>HTML Purifier processes the code: <code>θ</code></li> - <li>The <code>Encoder</code> now transforms the text back from UTF-8 - to ISO 8859-1. Since Greek is not supported by ISO 8859-1, it - will be either ignored or replaced with a question mark: - <code>?</code></li> -</ul> - -<p>This behaviour is quite unsatisfactory. It is a deal-breaker for -international applications, and it can be mildly annoying for the provincial -soul who occasionally needs a special character. 
Since 1.4.0, HTML -Purifier has provided a slightly more palatable workaround using -%Core.EscapeNonASCIICharacters. The process now looks like:</p> - -<ul> - <li>The <code>Encoder</code> transforms encoding to UTF-8: <code>&theta;</code></li> - <li>The <code>EntityParser</code> transforms entities: <code>θ</code></li> - <li>HTML Purifier processes the code: <code>θ</code></li> - <li>The <code>Encoder</code> replaces all non-ASCII characters - with numeric entity references: <code>&#952;</code></li> - <li>For good measure, <code>Encoder</code> transforms encoding back to - original (which is strictly unnecessary for 99% of encodings - out there): <code>&#952;</code> (remember, it's all ASCII!)</li> -</ul> - -<p>...which means that this is only good for an occasional foray into -the land of Unicode characters, and is totally unacceptable for Chinese -or Japanese texts. The even bigger kicker is that, supposing the -input encoding was actually ISO-8859-7, which <em>does</em> support -theta, the character would get converted into a character entity reference -anyway! (The Encoder does not discriminate.)</p> - -<p>The current functionality is about where HTML Purifier will be for -the rest of eternity. HTML Purifier could attempt to preserve the original -form of the character references so that they could be substituted back in, but the -DOM extension kills them off irreversibly. HTML Purifier could also attempt -to be smart and only convert non-ASCII characters that weren't supported -by the target encoding, but that would require reimplementing iconv -with HTML awareness, something I will not do.</p> - -<p>So there: either it's UTF-8 or crippled international support. Your pick! (And I'm -not being sarcastic here: some people couldn't care less about other languages.)</p> - -<h2 id="migrate">Migrate to UTF-8</h2> - -<p>So, you've decided to bite the bullet, and want to migrate to UTF-8.
-Note that this is not for the faint-hearted, and you should expect -the process to take longer than you think it will take.</p> - -<p>The general idea is that you convert all existing text to UTF-8, -and then you set all the headers and META tags we discussed earlier -to UTF-8. There are many ways of going about this: you could -write a conversion script that runs through the database and re-encodes -everything as UTF-8 or you could do the conversion on the fly when someone -reads the page. The details depend on your system, but I will cover -some of the more subtle points of migration that may trip you up.</p> - -<h3 id="migrate-db">Configuring your database</h3> - -<p>Most modern databases, the most prominent open-source ones being MySQL -4.1+ and PostgreSQL, support character encodings. If you're switching -to UTF-8, logically speaking, you'd want to make sure your database -knows about the change too. There are some caveats though:</p> - -<h4 id="migrate-db-legit">Legit method</h4> - -<p>Standardization in terms of SQL syntax for specifying character -encodings is notoriously spotty. Refer to your respective database's -documentation on how to do this properly.</p> - -<p>For <a href="http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html">MySQL</a>, <code>ALTER</code> will magically perform the -character encoding conversion for you. However, you have -to make sure that the text inside the column is what it says it is: -if you have put Shift-JIS in an ISO 8859-1 column, MySQL will irreversibly mangle -the text when you try to convert it to UTF-8. You'll have to convert -it to a binary field, convert it to a Shift-JIS field (the real encoding), -and then finally to UTF-8.
Many a website had pages irreversibly mangled -because they didn't realize that they'd been deluding themselves about -the character encoding all along; don't become the next victim.</p> - -<p>For <a href="http://www.postgresql.org/docs/8.2/static/multibyte.html">PostgreSQL</a>, there appears to be no direct way to change the -encoding of a database (as of 8.2). You will have to dump the data, and then reimport -it into a new table. Make sure that your client encoding is set properly: -this is how PostgreSQL knows to perform an encoding conversion.</p> - -<p>Many times, you will be also asked about the "collation" of -the new column. Collation is how a DBMS sorts text, like ordering -B, C and A into A, B and C (the problem gets surprisingly complicated -when you get to languages like Thai and Japanese). If in doubt, -going with the default setting is usually a safe bet.</p> - -<p>Once the conversion is all said and done, you still have to remember -to set the client encoding (your encoding) properly on each database -connection using <code>SET NAMES</code> (which is standard SQL and is -usually supported).</p> - -<h4 id="migrate-db-binary">Binary</h4> - -<p>Due to the aforementioned compatibility issues, a more interoperable -way of storing UTF-8 text is to stuff it in a binary datatype. -<code>CHAR</code> becomes <code>BINARY</code>, <code>VARCHAR</code> becomes -<code>VARBINARY</code> and <code>TEXT</code> becomes <code>BLOB</code>. 
-Doing so can save you some huge headaches:</p> - -<ul> - <li>The syntax for binary data types is very portable,</li> - <li>MySQL 4.0 has <em>no</em> support for character encodings, so - if you want to support it you <em>have</em> to use binary,</li> - <li>MySQL, as of 5.1, has no support for four-byte UTF-8 characters, - which represent characters beyond the Basic Multilingual - Plane, and</li> - <li>You will never have to worry about your DBMS being too smart - and attempting to convert your text when you don't want it to.</li> -</ul> - -<p>MediaWiki, a very prominent international application, uses binary fields -for storing its data because of point three.</p> - -<p>There are drawbacks, of course:</p> - -<ul> - <li>Database tools like phpMyAdmin won't be able to offer you inline - text editing, since the column is declared as binary,</li> - <li>It's not semantically correct: it's really text, not binary - (lying to the database),</li> - <li>Unless you use the not-very-portable wizardry mentioned above, - you have to change the encoding yourself (usually, you'd do - it on the fly), and</li> - <li>You will not have collation.</li> -</ul> - -<p>Choose based on your circumstances.</p> - -<h3 id="migrate-editor">Text editor</h3> - -<p>For more flat-file oriented systems, you will often be tasked with -converting reams of existing text and HTML files into UTF-8, as well as -making sure that all new files uploaded are properly encoded. Once again, -I can only point vaguely in the right direction for converting your -existing files: make sure you back up, make sure you use -<a href="http://php.net/ref.iconv">iconv</a>(), and -make sure you know what the original character encoding of the files -is (or are, depending on the tidiness of your system).</p> - -<p>However, I can proffer more specific advice on the subject of -text editors. Many text editors have notoriously spotty Unicode support.
-To find out how your editor is doing, you can check out <a -href="http://www.alanwood.net/unicode/utilities_editors.html">this list</a> -or <a href="http://en.wikipedia.org/wiki/Comparison_of_text_editors#Encoding_support">Wikipedia's list.</a> -I personally use Notepad++, which works like a charm when it comes to UTF-8. -Usually, you will have to <strong>explicitly</strong> tell the editor through some dialogue -(usually Save as or Format) what encoding you want it to use. An editor -will often offer "Unicode" as a method of saving, which is -ambiguous. Make sure you know whether or not they really mean UTF-8 -or UTF-16 (which is another flavor of Unicode).</p> - -<p>The two things to look out for are whether or not the editor -supports <strong>font mixing</strong> (multiple -fonts in one document) and whether or not it adds a <strong>BOM</strong>. -Font mixing is important because fonts rarely have support for every -language known to mankind: in order to be flexible, an editor must -be able to take a little from here and a little from there, otherwise -all your Chinese characters will come as nice boxes. We'll discuss -BOM below.</p> - -<h3 id="migrate-bom">Byte Order Mark (headers already sent!)</h3> - -<p>The BOM, or <a href="http://en.wikipedia.org/wiki/Byte_Order_Mark">Byte -Order Mark</a>, is a magical, invisible character placed at -the beginning of UTF-8 files to tell people what the encoding is and -what the endianness of the text is. It is also unnecessary.</p> - -<p>Because it's invisible, it often -catches people by surprise when it starts doing things it shouldn't -be doing. For example, this PHP file:</p> - -<pre><strong>BOM</strong><?php -header('Location: index.php'); -?></pre> - -<p>...will fail with the all too familiar <strong>Headers already sent</strong> -PHP error. And because the BOM is invisible, this culprit will go unnoticed. 
-My suggestion is to only use ASCII in PHP pages, but if you must use non-ASCII characters, make -sure the page is saved WITHOUT the BOM.</p> - -<blockquote class="aside"> - <p>The headers the error is referring to are <strong>HTTP headers</strong>, - which are sent to the browser before any HTML to tell it various - information. The moment any regular text (and yes, a BOM counts as - ordinary text) is output, the headers must be sent, and you are - not allowed to send any more. Thus, the error.</p> -</blockquote> - -<p>If you are reading in text files to insert into the middle of another -page, it is strongly advised (but not strictly necessary) that you strip out the UTF-8 byte -sequence for the BOM <code>"\xEF\xBB\xBF"</code> before inserting the text, -via:</p> - -<pre>$text = str_replace("\xEF\xBB\xBF", '', $text);</pre> - -<h3 id="migrate-fonts">Fonts</h3> - -<p>Generally speaking, people who are having trouble with fonts fall -into two categories:</p> - -<ul> -<li>Those who want to -use an extremely obscure language for which there is very little -support even among native speakers of the language, and</li> -<li>Those where the primary language of the text is -well-supported but there are occasional characters -that aren't supported.</li> -</ul> - -<p>Yes, there's always a chance that an English user happens across -a Sinhalese website and doesn't have the right font. But an English user -who happens not to have the right fonts probably has no business reading Sinhalese -anyway. So we'll deal with the other two edge cases.</p> - -<h4 id="migrate-fonts-obscure">Obscure scripts</h4> - -<p>If you run a Bengali website, you may get comments from users who -would like to read your website but get heaps of question marks or -other meaningless characters. Fixing this problem requires the -installation of a font or language pack which is often highly -dependent on what the language is.
<a href="http://bn.wikipedia.org/wiki/%E0%A6%89%E0%A6%87%E0%A6%95%E0%A6%BF%E0%A6%AA%E0%A7%87%E0%A6%A1%E0%A6%BF%E0%A6%AF%E0%A6%BC%E0%A6%BE:Bangla_script_display_and_input_help">Here is an example</a> -of such a help file for the Bengali language; I am sure there are -others out there too. You just have to point users to the appropriate -help file.</p> - -<h4 id="migrate-fonts-occasional">Occasional use</h4> - -<p>A prime example of when you'll see some very obscure Unicode -characters embedded in what otherwise would be very bland ASCII are -letters of the -<a href="http://en.wikipedia.org/wiki/International_Phonetic_Alphabet">International -Phonetic Alphabet (IPA)</a>, used to designate pronunciations in a very standard -manner (you probably see them all the time in your dictionary). Your -average font probably won't have support for all of the IPA characters -like ʘ (bilabial click) or ʒ (voiced postalveolar fricative). -So what's a poor browser to do? Font mix! Smart browsers like Mozilla Firefox -and Internet Explorer 7 will borrow glyphs from other fonts in order -to make sure that all the characters display properly.</p> - -<p>But what happens when the browser isn't smart and happens to be the -most widely used browser in the entire world? Microsoft IE 6 -is not smart enough to borrow from other fonts when a character isn't -present, so more often than not you'll be slapped with a nice big �. -To get things to work, MSIE 6 needs a little nudge. You could configure it -to use a different font to render the text, but you can achieve the same -effect by selectively changing the font for blocks of special characters -to known good Unicode fonts.</p> - -<p>Fortunately, the folks over at Wikipedia have already done all the -heavy lifting for you. 
Get the CSS from the horse's mouth here: -<a href="http://en.wikipedia.org/wiki/MediaWiki:Common.css">Common.css</a>, -and search for ".IPA". There is also a smattering of -other classes you can use for other purposes; check out -<a href="http://meta.wikimedia.org/wiki/Help:Special_characters#Displaying_Special_Characters">this page</a> -for more details. For you lazy ones, this should work:</p> - -<pre>.Unicode { - font-family: Code2000, "TITUS Cyberbit Basic", "Doulos SIL", - "Chrysanthi Unicode", "Bitstream Cyberbit", - "Bitstream CyberBase", Thryomanes, Gentium, GentiumAlt, - "Lucida Grande", "Arial Unicode MS", "Microsoft Sans Serif", - "Lucida Sans Unicode"; - font-family /**/:inherit; /* resets fonts for everyone but IE6 */ -}</pre> - -<p>The standard usage goes along the lines of <code><span class="Unicode">Crazy -Unicode stuff here</span></code>. Characters in the -<a href="http://en.wikipedia.org/wiki/Windows_Glyph_List_4">Windows Glyph List</a> -usually don't need to be fixed, but for anything else you probably -want to play it safe. Unless, of course, you don't care about IE6 -users.</p> - -<h3 id="migrate-variablewidth">Dealing with variable width in functions</h3> - -<p>When people claim that PHP6 will solve all our Unicode problems, they're -misinformed. It will not fix any of the aforementioned troubles. It will, -however, fix the problem we are about to discuss: processing UTF-8 text -in PHP.</p> - -<p>PHP (as of PHP5) is blithely unaware of the existence of UTF-8 (with a few -notable exceptions). Sometimes, this will cause problems, other times, -this won't. So far, we've avoided discussing the architecture of -UTF-8, so, we must first ask, what is UTF-8? Yes, it supports Unicode, -and yes, it is variable width. 
Other traits:</p> - -<ul> - <li>Every character's byte sequence is unique and will never be found - inside the byte sequence of another character,</li> - <li>UTF-8 may use up to four bytes to encode a character,</li> - <li>UTF-8 text must be checked for well-formedness,</li> - <li>Pure ASCII is also valid UTF-8, and</li> - <li>Binary sorting will sort UTF-8 in the same order as Unicode.</li> -</ul> - -<p>Each of these traits affects different domains of text processing -in different ways. It is beyond the scope of this document to explain -what precisely these implications are. PHPWact provides -a very good <a href="http://www.phpwact.org/php/i18n/utf-8">reference document</a> -on what to expect from each function, although coverage is spotty in -some areas. Their more general notes on -<a href="http://www.phpwact.org/php/i18n/charsets">character sets</a> -are also worth looking at for information on UTF-8. Some rules of thumb -when dealing with Unicode text:</p> - -<ul> - <li>Do not EVER use functions that:<ul> - <li>...convert case (strtolower, strtoupper, ucfirst, ucwords)</li> - <li>...claim to be case-insensitive (str_ireplace, stristr, strcasecmp)</li> - </ul></li> - <li>Think twice before using functions that:<ul> - <li>...count characters (strlen will return bytes, not characters; - str_split and wordwrap may corrupt)</li> - <li>...convert characters to entity references (UTF-8 doesn't need entities)</li> - <li>...do very complex string processing (*printf)</li> - </ul></li> -</ul> - -<p>Note: this list applies to UTF-8 encoded text only: if you have -a string that you are 100% sure is ASCII, be my guest and use -<code>strtolower</code> (HTML Purifier uses this function.)</p> - -<p>Regardless, always think in bytes, not characters. 
If you use strpos() -to find the position of a character, it will be in bytes, but this -usually won't matter since substr() also operates with byte indices!</p> - -<p>You'll also need to make sure your UTF-8 is well-formed and will -probably need replacements for some of these functions. I recommend -using Harry Fuecks' <a href="http://phputf8.sourceforge.net/">PHP -UTF-8</a> library, rather than using mbstring directly. HTML Purifier -also defines a few useful UTF-8 compatible functions: check out -<code>Encoder.php</code> in the <code>/library/HTMLPurifier/</code> -directory.</p> - -<h2 id="externallinks">Further Reading</h2> - -<p>Well, that's it. Hopefully this document has served as a very -practical springboard into knowledge of how UTF-8 works. You may have -decided that you don't want to migrate yet: that's fine, just know -what will happen to your output and what bug reports you may receive.</p> - -<p>Many other developers have already discussed the subject of Unicode, -UTF-8 and internationalization, and I would like to defer to them for -a more in-depth look into character sets and encodings.</p> - -<ul> - <li><a href="http://www.joelonsoftware.com/articles/Unicode.html"> - The Absolute Minimum Every Software Developer Absolutely, - Positively Must Know About Unicode and Character Sets - (No Excuses!)</a> by Joel Spolsky, provides a <em>very</em> - good high-level look at Unicode and character sets in general.</li> - <li><a href="http://en.wikipedia.org/wiki/UTF-8">UTF-8 on Wikipedia</a>, - provides a lot of useful details into the innards of UTF-8, although - it may be a little off-putting to people who don't know much - about Unicode to begin with.</li> -</ul> - -</body> -</html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/enduser-youtube.html b/lib/htmlpurifier/docs/enduser-youtube.html deleted file mode 100644 index 87a36b9aa..000000000 --- a/lib/htmlpurifier/docs/enduser-youtube.html +++ /dev/null @@ -1,153 +0,0 @@ -<?xml 
version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Explains how to safely allow the embedding of flash from trusted sites in HTML Purifier." /> -<link rel="stylesheet" type="text/css" href="./style.css" /> - -<title>Embedding YouTube Videos - HTML Purifier</title> - -</head><body> - -<h1 class="subtitled">Embedding YouTube Videos</h1> -<div class="subtitle">...as well as other dangerous active content</div> - -<div id="filing">Filed under End-User</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p>Clients like their YouTube videos. It gives them a warm fuzzy feeling when -they see a neat little embedded video player on their websites that can play -the latest clips from their documentary "Fido and the Bones of Spring". -All joking aside, the ability to embed YouTube videos or other active -content in their pages is something that a lot of people like.</p> - -<p>This is a <em>bad</em> idea. The moment you embed anything untrusted, -you will definitely be slammed by a manner of nasties that can be -embedded in things from your run of the mill Flash movie to -<a href="http://blog.spywareguide.com/2006/12/myspace_phish_attack_leads_use.html">Quicktime movies</a>. -Even <code>img</code> tags, which HTML Purifier allows by default, can be -dangerous. Be distrustful of anything that tells a browser to load content -from another website automatically.</p> - -<p>Luckily for us, however, whitelisting saves the day. Sure, letting users -include any old random flash file could be dangerous, but if it's -from a specific website, it probably is okay. 
If no amount of pleading will -convince the people upstairs that they should just settle for linking -to their movies, you may find this technique very useful.</p> - -<h2>Looking in</h2> - -<p>Below is custom code that allows users to embed -YouTube videos. This is not favoritism: this trick can easily be adapted for -other forms of embeddable content.</p> - -<p>Usually, websites like YouTube give us boilerplate code that you can insert -into your documents. YouTube's code goes like this:</p> - -<pre> -<object width="425" height="350"> - <param name="movie" value="http://www.youtube.com/v/AyPzM5WK8ys" /> - <param name="wmode" value="transparent" /> - <embed src="http://www.youtube.com/v/AyPzM5WK8ys" - type="application/x-shockwave-flash" - wmode="transparent" width="425" height="350" /> -</object> -</pre> - -<p>There are two things to note about this code:</p> - -<ol> - <li><code><embed></code> is not recognized by W3C, so if you want - standards-compliant code, you'll have to get rid of it.</li> - <li>The code is exactly the same for all instances, except for the - identifier <tt>AyPzM5WK8ys</tt> which tells us which movie file - to retrieve.</li> -</ol> - -<p>What point 2 means is that if we have code like <code><span -class="youtube-embed">AyPzM5WK8ys</span></code> your -application can reconstruct the full object from this small snippet that -passes through HTML Purifier <em>unharmed</em>. -<a href="http://repo.or.cz/w/htmlpurifier.git?a=blob;hb=HEAD;f=library/HTMLPurifier/Filter/YouTube.php">Show me the code!</a></p> - -<p>And the corresponding usage:</p> - -<pre><?php - $config->set('Filter.YouTube', true); -?></pre> - -<p>There is a bit going on in the two code snippets, so let's explain.</p> - -<ol> - <li>This is a Filter object, which intercepts the HTML that is - coming into and out of the purifier. You can add as many - filter objects as you like. 
<code>preFilter()</code> - processes the code before it gets purified, and <code>postFilter()</code> - processes the code afterwards. So, we'll use <code>preFilter()</code> to - replace the object tag with a <code>span</code>, and <code>postFilter()</code> - to restore it.</li> - <li>The first preg_replace call replaces any YouTube code users may have - embedded into the benign span tag. Span is used because it is inline, - and objects are inline too. We are very careful to be extremely - restrictive on what goes inside the span tag, because if errant code - gets in there it could get messy.</li> - <li>The HTML is then purified as usual.</li> - <li>Then, another preg_replace replaces the span tag with a fully fledged - object. Note that the embed is removed, and, in its place, a data - attribute was added to the object. This makes the tag standards - compliant! It also breaks Internet Explorer, so we add in a bit of - conditional comments with the old embed code to make it work again. - It's all quite convoluted but works.</li> -</ol> - -<h2>Warning</h2> - -<p>There are a number of possible problems with the code above, depending -on how you look at it.</p> - -<h3>Cannot change width and height</h3> - -<p>The width and height of the final YouTube movie cannot be adjusted. This -is because I am lazy. If you really insist on letting users change the size -of the movie, what you need to do is package up the attributes inside the -span tag (along with the movie ID). It gets complicated though: a malicious -user can specify an outrageously large height and width and attempt to crash -the user's operating system/browser. You need to cap it either by limiting -the number of digits allowed in the regex or by using a callback to check the -number.</p> - -<h3>Trusts media's host's security</h3> - -<p>By allowing this code onto our website, we are trusting that YouTube has -tech-savvy enough people not to allow their users to inject malicious -code into the Flash files. 
An exploit on YouTube means an exploit on your -site. Even though YouTube is run by the reputable Google, it -<a href="http://ha.ckers.org/blog/20061213/google-xss-vuln/">doesn't</a> -mean they are -<a href="http://ha.ckers.org/blog/20061208/xss-in-googles-orkut/">invulnerable.</a> -You're putting a certain measure of the job on an external provider (just as -you have by entrusting your user input to HTML Purifier), and -it is important that you are cognizant of the risk.</p> - -<h3>Poorly written adaptations compromise security</h3> - -<p>This should go without saying, but if you're going to adapt this code -for Google Video or the like, make sure you do it <em>right</em>. It's -extremely easy to allow a character too many in <code>postFilter()</code> and -suddenly you're introducing XSS into HTML Purifier's XSS free output. HTML -Purifier may be well written, but it cannot guard against vulnerabilities -introduced after it has finished.</p> - -<h2>Help out!</h2> - -<p>If you write a filter for your favorite video destination (or anything -like that, for that matter), send it over and it might get included -with the core!</p> - -</body> -</html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/entities/xhtml-lat1.ent b/lib/htmlpurifier/docs/entities/xhtml-lat1.ent deleted file mode 100644 index ffee223eb..000000000 --- a/lib/htmlpurifier/docs/entities/xhtml-lat1.ent +++ /dev/null @@ -1,196 +0,0 @@ -<!-- Portions (C) International Organization for Standardization 1986 - Permission to copy in any form is granted for use with - conforming SGML systems and applications as defined in - ISO 8879, provided this notice is included in all copies. ---> -<!-- Character entity set. 
Typical invocation: - <!ENTITY % HTMLlat1 PUBLIC - "-//W3C//ENTITIES Latin 1 for XHTML//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent"> - %HTMLlat1; ---> - -<!ENTITY nbsp " "> <!-- no-break space = non-breaking space, - U+00A0 ISOnum --> -<!ENTITY iexcl "¡"> <!-- inverted exclamation mark, U+00A1 ISOnum --> -<!ENTITY cent "¢"> <!-- cent sign, U+00A2 ISOnum --> -<!ENTITY pound "£"> <!-- pound sign, U+00A3 ISOnum --> -<!ENTITY curren "¤"> <!-- currency sign, U+00A4 ISOnum --> -<!ENTITY yen "¥"> <!-- yen sign = yuan sign, U+00A5 ISOnum --> -<!ENTITY brvbar "¦"> <!-- broken bar = broken vertical bar, - U+00A6 ISOnum --> -<!ENTITY sect "§"> <!-- section sign, U+00A7 ISOnum --> -<!ENTITY uml "¨"> <!-- diaeresis = spacing diaeresis, - U+00A8 ISOdia --> -<!ENTITY copy "©"> <!-- copyright sign, U+00A9 ISOnum --> -<!ENTITY ordf "ª"> <!-- feminine ordinal indicator, U+00AA ISOnum --> -<!ENTITY laquo "«"> <!-- left-pointing double angle quotation mark - = left pointing guillemet, U+00AB ISOnum --> -<!ENTITY not "¬"> <!-- not sign = angled dash, - U+00AC ISOnum --> -<!ENTITY shy "­"> <!-- soft hyphen = discretionary hyphen, - U+00AD ISOnum --> -<!ENTITY reg "®"> <!-- registered sign = registered trade mark sign, - U+00AE ISOnum --> -<!ENTITY macr "¯"> <!-- macron = spacing macron = overline - = APL overbar, U+00AF ISOdia --> -<!ENTITY deg "°"> <!-- degree sign, U+00B0 ISOnum --> -<!ENTITY plusmn "±"> <!-- plus-minus sign = plus-or-minus sign, - U+00B1 ISOnum --> -<!ENTITY sup2 "²"> <!-- superscript two = superscript digit two - = squared, U+00B2 ISOnum --> -<!ENTITY sup3 "³"> <!-- superscript three = superscript digit three - = cubed, U+00B3 ISOnum --> -<!ENTITY acute "´"> <!-- acute accent = spacing acute, - U+00B4 ISOdia --> -<!ENTITY micro "µ"> <!-- micro sign, U+00B5 ISOnum --> -<!ENTITY para "¶"> <!-- pilcrow sign = paragraph sign, - U+00B6 ISOnum --> -<!ENTITY middot "·"> <!-- middle dot = Georgian comma - = Greek middle dot, U+00B7 ISOnum --> -<!ENTITY cedil "¸"> 
<!-- cedilla = spacing cedilla, U+00B8 ISOdia --> -<!ENTITY sup1 "¹"> <!-- superscript one = superscript digit one, - U+00B9 ISOnum --> -<!ENTITY ordm "º"> <!-- masculine ordinal indicator, - U+00BA ISOnum --> -<!ENTITY raquo "»"> <!-- right-pointing double angle quotation mark - = right pointing guillemet, U+00BB ISOnum --> -<!ENTITY frac14 "¼"> <!-- vulgar fraction one quarter - = fraction one quarter, U+00BC ISOnum --> -<!ENTITY frac12 "½"> <!-- vulgar fraction one half - = fraction one half, U+00BD ISOnum --> -<!ENTITY frac34 "¾"> <!-- vulgar fraction three quarters - = fraction three quarters, U+00BE ISOnum --> -<!ENTITY iquest "¿"> <!-- inverted question mark - = turned question mark, U+00BF ISOnum --> -<!ENTITY Agrave "À"> <!-- latin capital letter A with grave - = latin capital letter A grave, - U+00C0 ISOlat1 --> -<!ENTITY Aacute "Á"> <!-- latin capital letter A with acute, - U+00C1 ISOlat1 --> -<!ENTITY Acirc "Â"> <!-- latin capital letter A with circumflex, - U+00C2 ISOlat1 --> -<!ENTITY Atilde "Ã"> <!-- latin capital letter A with tilde, - U+00C3 ISOlat1 --> -<!ENTITY Auml "Ä"> <!-- latin capital letter A with diaeresis, - U+00C4 ISOlat1 --> -<!ENTITY Aring "Å"> <!-- latin capital letter A with ring above - = latin capital letter A ring, - U+00C5 ISOlat1 --> -<!ENTITY AElig "Æ"> <!-- latin capital letter AE - = latin capital ligature AE, - U+00C6 ISOlat1 --> -<!ENTITY Ccedil "Ç"> <!-- latin capital letter C with cedilla, - U+00C7 ISOlat1 --> -<!ENTITY Egrave "È"> <!-- latin capital letter E with grave, - U+00C8 ISOlat1 --> -<!ENTITY Eacute "É"> <!-- latin capital letter E with acute, - U+00C9 ISOlat1 --> -<!ENTITY Ecirc "Ê"> <!-- latin capital letter E with circumflex, - U+00CA ISOlat1 --> -<!ENTITY Euml "Ë"> <!-- latin capital letter E with diaeresis, - U+00CB ISOlat1 --> -<!ENTITY Igrave "Ì"> <!-- latin capital letter I with grave, - U+00CC ISOlat1 --> -<!ENTITY Iacute "Í"> <!-- latin capital letter I with acute, - U+00CD ISOlat1 --> -<!ENTITY Icirc 
"Î"> <!-- latin capital letter I with circumflex, - U+00CE ISOlat1 --> -<!ENTITY Iuml "Ï"> <!-- latin capital letter I with diaeresis, - U+00CF ISOlat1 --> -<!ENTITY ETH "Ð"> <!-- latin capital letter ETH, U+00D0 ISOlat1 --> -<!ENTITY Ntilde "Ñ"> <!-- latin capital letter N with tilde, - U+00D1 ISOlat1 --> -<!ENTITY Ograve "Ò"> <!-- latin capital letter O with grave, - U+00D2 ISOlat1 --> -<!ENTITY Oacute "Ó"> <!-- latin capital letter O with acute, - U+00D3 ISOlat1 --> -<!ENTITY Ocirc "Ô"> <!-- latin capital letter O with circumflex, - U+00D4 ISOlat1 --> -<!ENTITY Otilde "Õ"> <!-- latin capital letter O with tilde, - U+00D5 ISOlat1 --> -<!ENTITY Ouml "Ö"> <!-- latin capital letter O with diaeresis, - U+00D6 ISOlat1 --> -<!ENTITY times "×"> <!-- multiplication sign, U+00D7 ISOnum --> -<!ENTITY Oslash "Ø"> <!-- latin capital letter O with stroke - = latin capital letter O slash, - U+00D8 ISOlat1 --> -<!ENTITY Ugrave "Ù"> <!-- latin capital letter U with grave, - U+00D9 ISOlat1 --> -<!ENTITY Uacute "Ú"> <!-- latin capital letter U with acute, - U+00DA ISOlat1 --> -<!ENTITY Ucirc "Û"> <!-- latin capital letter U with circumflex, - U+00DB ISOlat1 --> -<!ENTITY Uuml "Ü"> <!-- latin capital letter U with diaeresis, - U+00DC ISOlat1 --> -<!ENTITY Yacute "Ý"> <!-- latin capital letter Y with acute, - U+00DD ISOlat1 --> -<!ENTITY THORN "Þ"> <!-- latin capital letter THORN, - U+00DE ISOlat1 --> -<!ENTITY szlig "ß"> <!-- latin small letter sharp s = ess-zed, - U+00DF ISOlat1 --> -<!ENTITY agrave "à"> <!-- latin small letter a with grave - = latin small letter a grave, - U+00E0 ISOlat1 --> -<!ENTITY aacute "á"> <!-- latin small letter a with acute, - U+00E1 ISOlat1 --> -<!ENTITY acirc "â"> <!-- latin small letter a with circumflex, - U+00E2 ISOlat1 --> -<!ENTITY atilde "ã"> <!-- latin small letter a with tilde, - U+00E3 ISOlat1 --> -<!ENTITY auml "ä"> <!-- latin small letter a with diaeresis, - U+00E4 ISOlat1 --> -<!ENTITY aring "å"> <!-- latin small letter a with ring above - 
= latin small letter a ring, - U+00E5 ISOlat1 --> -<!ENTITY aelig "æ"> <!-- latin small letter ae - = latin small ligature ae, U+00E6 ISOlat1 --> -<!ENTITY ccedil "ç"> <!-- latin small letter c with cedilla, - U+00E7 ISOlat1 --> -<!ENTITY egrave "è"> <!-- latin small letter e with grave, - U+00E8 ISOlat1 --> -<!ENTITY eacute "é"> <!-- latin small letter e with acute, - U+00E9 ISOlat1 --> -<!ENTITY ecirc "ê"> <!-- latin small letter e with circumflex, - U+00EA ISOlat1 --> -<!ENTITY euml "ë"> <!-- latin small letter e with diaeresis, - U+00EB ISOlat1 --> -<!ENTITY igrave "ì"> <!-- latin small letter i with grave, - U+00EC ISOlat1 --> -<!ENTITY iacute "í"> <!-- latin small letter i with acute, - U+00ED ISOlat1 --> -<!ENTITY icirc "î"> <!-- latin small letter i with circumflex, - U+00EE ISOlat1 --> -<!ENTITY iuml "ï"> <!-- latin small letter i with diaeresis, - U+00EF ISOlat1 --> -<!ENTITY eth "ð"> <!-- latin small letter eth, U+00F0 ISOlat1 --> -<!ENTITY ntilde "ñ"> <!-- latin small letter n with tilde, - U+00F1 ISOlat1 --> -<!ENTITY ograve "ò"> <!-- latin small letter o with grave, - U+00F2 ISOlat1 --> -<!ENTITY oacute "ó"> <!-- latin small letter o with acute, - U+00F3 ISOlat1 --> -<!ENTITY ocirc "ô"> <!-- latin small letter o with circumflex, - U+00F4 ISOlat1 --> -<!ENTITY otilde "õ"> <!-- latin small letter o with tilde, - U+00F5 ISOlat1 --> -<!ENTITY ouml "ö"> <!-- latin small letter o with diaeresis, - U+00F6 ISOlat1 --> -<!ENTITY divide "÷"> <!-- division sign, U+00F7 ISOnum --> -<!ENTITY oslash "ø"> <!-- latin small letter o with stroke, - = latin small letter o slash, - U+00F8 ISOlat1 --> -<!ENTITY ugrave "ù"> <!-- latin small letter u with grave, - U+00F9 ISOlat1 --> -<!ENTITY uacute "ú"> <!-- latin small letter u with acute, - U+00FA ISOlat1 --> -<!ENTITY ucirc "û"> <!-- latin small letter u with circumflex, - U+00FB ISOlat1 --> -<!ENTITY uuml "ü"> <!-- latin small letter u with diaeresis, - U+00FC ISOlat1 --> -<!ENTITY yacute "ý"> <!-- latin small letter y 
with acute, - U+00FD ISOlat1 --> -<!ENTITY thorn "þ"> <!-- latin small letter thorn, - U+00FE ISOlat1 --> -<!ENTITY yuml "ÿ"> <!-- latin small letter y with diaeresis, - U+00FF ISOlat1 --> diff --git a/lib/htmlpurifier/docs/entities/xhtml-special.ent b/lib/htmlpurifier/docs/entities/xhtml-special.ent deleted file mode 100644 index ca358b2fe..000000000 --- a/lib/htmlpurifier/docs/entities/xhtml-special.ent +++ /dev/null @@ -1,80 +0,0 @@ -<!-- Special characters for XHTML --> - -<!-- Character entity set. Typical invocation: - <!ENTITY % HTMLspecial PUBLIC - "-//W3C//ENTITIES Special for XHTML//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent"> - %HTMLspecial; ---> - -<!-- Portions (C) International Organization for Standardization 1986: - Permission to copy in any form is granted for use with - conforming SGML systems and applications as defined in - ISO 8879, provided this notice is included in all copies. ---> - -<!-- Relevant ISO entity set is given unless names are newly introduced. - New names (i.e., not in ISO 8879 list) do not clash with any - existing ISO 8879 entity names. ISO 10646 character numbers - are given for each character, in hex. values are decimal - conversions of the ISO 10646 values and refer to the document - character set. Names are Unicode names. 
---> - -<!-- C0 Controls and Basic Latin --> -<!ENTITY quot """> <!-- quotation mark, U+0022 ISOnum --> -<!ENTITY amp "&#38;"> <!-- ampersand, U+0026 ISOnum --> -<!ENTITY lt "&#60;"> <!-- less-than sign, U+003C ISOnum --> -<!ENTITY gt ">"> <!-- greater-than sign, U+003E ISOnum --> -<!ENTITY apos "'"> <!-- apostrophe = APL quote, U+0027 ISOnum --> - -<!-- Latin Extended-A --> -<!ENTITY OElig "Œ"> <!-- latin capital ligature OE, - U+0152 ISOlat2 --> -<!ENTITY oelig "œ"> <!-- latin small ligature oe, U+0153 ISOlat2 --> -<!-- ligature is a misnomer, this is a separate character in some languages --> -<!ENTITY Scaron "Š"> <!-- latin capital letter S with caron, - U+0160 ISOlat2 --> -<!ENTITY scaron "š"> <!-- latin small letter s with caron, - U+0161 ISOlat2 --> -<!ENTITY Yuml "Ÿ"> <!-- latin capital letter Y with diaeresis, - U+0178 ISOlat2 --> - -<!-- Spacing Modifier Letters --> -<!ENTITY circ "ˆ"> <!-- modifier letter circumflex accent, - U+02C6 ISOpub --> -<!ENTITY tilde "˜"> <!-- small tilde, U+02DC ISOdia --> - -<!-- General Punctuation --> -<!ENTITY ensp " "> <!-- en space, U+2002 ISOpub --> -<!ENTITY emsp " "> <!-- em space, U+2003 ISOpub --> -<!ENTITY thinsp " "> <!-- thin space, U+2009 ISOpub --> -<!ENTITY zwnj "‌"> <!-- zero width non-joiner, - U+200C NEW RFC 2070 --> -<!ENTITY zwj "‍"> <!-- zero width joiner, U+200D NEW RFC 2070 --> -<!ENTITY lrm "‎"> <!-- left-to-right mark, U+200E NEW RFC 2070 --> -<!ENTITY rlm "‏"> <!-- right-to-left mark, U+200F NEW RFC 2070 --> -<!ENTITY ndash "–"> <!-- en dash, U+2013 ISOpub --> -<!ENTITY mdash "—"> <!-- em dash, U+2014 ISOpub --> -<!ENTITY lsquo "‘"> <!-- left single quotation mark, - U+2018 ISOnum --> -<!ENTITY rsquo "’"> <!-- right single quotation mark, - U+2019 ISOnum --> -<!ENTITY sbquo "‚"> <!-- single low-9 quotation mark, U+201A NEW --> -<!ENTITY ldquo "“"> <!-- left double quotation mark, - U+201C ISOnum --> -<!ENTITY rdquo "”"> <!-- right double quotation mark, - U+201D ISOnum --> -<!ENTITY bdquo "„"> <!-- 
double low-9 quotation mark, U+201E NEW --> -<!ENTITY dagger "†"> <!-- dagger, U+2020 ISOpub --> -<!ENTITY Dagger "‡"> <!-- double dagger, U+2021 ISOpub --> -<!ENTITY permil "‰"> <!-- per mille sign, U+2030 ISOtech --> -<!ENTITY lsaquo "‹"> <!-- single left-pointing angle quotation mark, - U+2039 ISO proposed --> -<!-- lsaquo is proposed but not yet ISO standardized --> -<!ENTITY rsaquo "›"> <!-- single right-pointing angle quotation mark, - U+203A ISO proposed --> -<!-- rsaquo is proposed but not yet ISO standardized --> - -<!-- Currency Symbols --> -<!ENTITY euro "€"> <!-- euro sign, U+20AC NEW --> diff --git a/lib/htmlpurifier/docs/entities/xhtml-symbol.ent b/lib/htmlpurifier/docs/entities/xhtml-symbol.ent deleted file mode 100644 index 63c2abfa6..000000000 --- a/lib/htmlpurifier/docs/entities/xhtml-symbol.ent +++ /dev/null @@ -1,237 +0,0 @@ -<!-- Mathematical, Greek and Symbolic characters for XHTML --> - -<!-- Character entity set. Typical invocation: - <!ENTITY % HTMLsymbol PUBLIC - "-//W3C//ENTITIES Symbols for XHTML//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent"> - %HTMLsymbol; ---> - -<!-- Portions (C) International Organization for Standardization 1986: - Permission to copy in any form is granted for use with - conforming SGML systems and applications as defined in - ISO 8879, provided this notice is included in all copies. ---> - -<!-- Relevant ISO entity set is given unless names are newly introduced. - New names (i.e., not in ISO 8879 list) do not clash with any - existing ISO 8879 entity names. ISO 10646 character numbers - are given for each character, in hex. values are decimal - conversions of the ISO 10646 values and refer to the document - character set. Names are Unicode names. 
---> - -<!-- Latin Extended-B --> -<!ENTITY fnof "ƒ"> <!-- latin small letter f with hook = function - = florin, U+0192 ISOtech --> - -<!-- Greek --> -<!ENTITY Alpha "Α"> <!-- greek capital letter alpha, U+0391 --> -<!ENTITY Beta "Β"> <!-- greek capital letter beta, U+0392 --> -<!ENTITY Gamma "Γ"> <!-- greek capital letter gamma, - U+0393 ISOgrk3 --> -<!ENTITY Delta "Δ"> <!-- greek capital letter delta, - U+0394 ISOgrk3 --> -<!ENTITY Epsilon "Ε"> <!-- greek capital letter epsilon, U+0395 --> -<!ENTITY Zeta "Ζ"> <!-- greek capital letter zeta, U+0396 --> -<!ENTITY Eta "Η"> <!-- greek capital letter eta, U+0397 --> -<!ENTITY Theta "Θ"> <!-- greek capital letter theta, - U+0398 ISOgrk3 --> -<!ENTITY Iota "Ι"> <!-- greek capital letter iota, U+0399 --> -<!ENTITY Kappa "Κ"> <!-- greek capital letter kappa, U+039A --> -<!ENTITY Lambda "Λ"> <!-- greek capital letter lamda, - U+039B ISOgrk3 --> -<!ENTITY Mu "Μ"> <!-- greek capital letter mu, U+039C --> -<!ENTITY Nu "Ν"> <!-- greek capital letter nu, U+039D --> -<!ENTITY Xi "Ξ"> <!-- greek capital letter xi, U+039E ISOgrk3 --> -<!ENTITY Omicron "Ο"> <!-- greek capital letter omicron, U+039F --> -<!ENTITY Pi "Π"> <!-- greek capital letter pi, U+03A0 ISOgrk3 --> -<!ENTITY Rho "Ρ"> <!-- greek capital letter rho, U+03A1 --> -<!-- there is no Sigmaf, and no U+03A2 character either --> -<!ENTITY Sigma "Σ"> <!-- greek capital letter sigma, - U+03A3 ISOgrk3 --> -<!ENTITY Tau "Τ"> <!-- greek capital letter tau, U+03A4 --> -<!ENTITY Upsilon "Υ"> <!-- greek capital letter upsilon, - U+03A5 ISOgrk3 --> -<!ENTITY Phi "Φ"> <!-- greek capital letter phi, - U+03A6 ISOgrk3 --> -<!ENTITY Chi "Χ"> <!-- greek capital letter chi, U+03A7 --> -<!ENTITY Psi "Ψ"> <!-- greek capital letter psi, - U+03A8 ISOgrk3 --> -<!ENTITY Omega "Ω"> <!-- greek capital letter omega, - U+03A9 ISOgrk3 --> - -<!ENTITY alpha "α"> <!-- greek small letter alpha, - U+03B1 ISOgrk3 --> -<!ENTITY beta "β"> <!-- greek small letter beta, U+03B2 ISOgrk3 --> -<!ENTITY gamma 
"γ"> <!-- greek small letter gamma, - U+03B3 ISOgrk3 --> -<!ENTITY delta "δ"> <!-- greek small letter delta, - U+03B4 ISOgrk3 --> -<!ENTITY epsilon "ε"> <!-- greek small letter epsilon, - U+03B5 ISOgrk3 --> -<!ENTITY zeta "ζ"> <!-- greek small letter zeta, U+03B6 ISOgrk3 --> -<!ENTITY eta "η"> <!-- greek small letter eta, U+03B7 ISOgrk3 --> -<!ENTITY theta "θ"> <!-- greek small letter theta, - U+03B8 ISOgrk3 --> -<!ENTITY iota "ι"> <!-- greek small letter iota, U+03B9 ISOgrk3 --> -<!ENTITY kappa "κ"> <!-- greek small letter kappa, - U+03BA ISOgrk3 --> -<!ENTITY lambda "λ"> <!-- greek small letter lamda, - U+03BB ISOgrk3 --> -<!ENTITY mu "μ"> <!-- greek small letter mu, U+03BC ISOgrk3 --> -<!ENTITY nu "ν"> <!-- greek small letter nu, U+03BD ISOgrk3 --> -<!ENTITY xi "ξ"> <!-- greek small letter xi, U+03BE ISOgrk3 --> -<!ENTITY omicron "ο"> <!-- greek small letter omicron, U+03BF NEW --> -<!ENTITY pi "π"> <!-- greek small letter pi, U+03C0 ISOgrk3 --> -<!ENTITY rho "ρ"> <!-- greek small letter rho, U+03C1 ISOgrk3 --> -<!ENTITY sigmaf "ς"> <!-- greek small letter final sigma, - U+03C2 ISOgrk3 --> -<!ENTITY sigma "σ"> <!-- greek small letter sigma, - U+03C3 ISOgrk3 --> -<!ENTITY tau "τ"> <!-- greek small letter tau, U+03C4 ISOgrk3 --> -<!ENTITY upsilon "υ"> <!-- greek small letter upsilon, - U+03C5 ISOgrk3 --> -<!ENTITY phi "φ"> <!-- greek small letter phi, U+03C6 ISOgrk3 --> -<!ENTITY chi "χ"> <!-- greek small letter chi, U+03C7 ISOgrk3 --> -<!ENTITY psi "ψ"> <!-- greek small letter psi, U+03C8 ISOgrk3 --> -<!ENTITY omega "ω"> <!-- greek small letter omega, - U+03C9 ISOgrk3 --> -<!ENTITY thetasym "ϑ"> <!-- greek theta symbol, - U+03D1 NEW --> -<!ENTITY upsih "ϒ"> <!-- greek upsilon with hook symbol, - U+03D2 NEW --> -<!ENTITY piv "ϖ"> <!-- greek pi symbol, U+03D6 ISOgrk3 --> - -<!-- General Punctuation --> -<!ENTITY bull "•"> <!-- bullet = black small circle, - U+2022 ISOpub --> -<!-- bullet is NOT the same as bullet operator, U+2219 --> -<!ENTITY hellip "…"> <!-- 
horizontal ellipsis = three dot leader, - U+2026 ISOpub --> -<!ENTITY prime "′"> <!-- prime = minutes = feet, U+2032 ISOtech --> -<!ENTITY Prime "″"> <!-- double prime = seconds = inches, - U+2033 ISOtech --> -<!ENTITY oline "‾"> <!-- overline = spacing overscore, - U+203E NEW --> -<!ENTITY frasl "⁄"> <!-- fraction slash, U+2044 NEW --> - -<!-- Letterlike Symbols --> -<!ENTITY weierp "℘"> <!-- script capital P = power set - = Weierstrass p, U+2118 ISOamso --> -<!ENTITY image "ℑ"> <!-- black-letter capital I = imaginary part, - U+2111 ISOamso --> -<!ENTITY real "ℜ"> <!-- black-letter capital R = real part symbol, - U+211C ISOamso --> -<!ENTITY trade "™"> <!-- trade mark sign, U+2122 ISOnum --> -<!ENTITY alefsym "ℵ"> <!-- alef symbol = first transfinite cardinal, - U+2135 NEW --> -<!-- alef symbol is NOT the same as hebrew letter alef, - U+05D0 although the same glyph could be used to depict both characters --> - -<!-- Arrows --> -<!ENTITY larr "←"> <!-- leftwards arrow, U+2190 ISOnum --> -<!ENTITY uarr "↑"> <!-- upwards arrow, U+2191 ISOnum--> -<!ENTITY rarr "→"> <!-- rightwards arrow, U+2192 ISOnum --> -<!ENTITY darr "↓"> <!-- downwards arrow, U+2193 ISOnum --> -<!ENTITY harr "↔"> <!-- left right arrow, U+2194 ISOamsa --> -<!ENTITY crarr "↵"> <!-- downwards arrow with corner leftwards - = carriage return, U+21B5 NEW --> -<!ENTITY lArr "⇐"> <!-- leftwards double arrow, U+21D0 ISOtech --> -<!-- Unicode does not say that lArr is the same as the 'is implied by' arrow - but also does not have any other character for that function. 
So lArr can - be used for 'is implied by' as ISOtech suggests --> -<!ENTITY uArr "⇑"> <!-- upwards double arrow, U+21D1 ISOamsa --> -<!ENTITY rArr "⇒"> <!-- rightwards double arrow, - U+21D2 ISOtech --> -<!-- Unicode does not say this is the 'implies' character but does not have - another character with this function so rArr can be used for 'implies' - as ISOtech suggests --> -<!ENTITY dArr "⇓"> <!-- downwards double arrow, U+21D3 ISOamsa --> -<!ENTITY hArr "⇔"> <!-- left right double arrow, - U+21D4 ISOamsa --> - -<!-- Mathematical Operators --> -<!ENTITY forall "∀"> <!-- for all, U+2200 ISOtech --> -<!ENTITY part "∂"> <!-- partial differential, U+2202 ISOtech --> -<!ENTITY exist "∃"> <!-- there exists, U+2203 ISOtech --> -<!ENTITY empty "∅"> <!-- empty set = null set, U+2205 ISOamso --> -<!ENTITY nabla "∇"> <!-- nabla = backward difference, - U+2207 ISOtech --> -<!ENTITY isin "∈"> <!-- element of, U+2208 ISOtech --> -<!ENTITY notin "∉"> <!-- not an element of, U+2209 ISOtech --> -<!ENTITY ni "∋"> <!-- contains as member, U+220B ISOtech --> -<!ENTITY prod "∏"> <!-- n-ary product = product sign, - U+220F ISOamsb --> -<!-- prod is NOT the same character as U+03A0 'greek capital letter pi' though - the same glyph might be used for both --> -<!ENTITY sum "∑"> <!-- n-ary summation, U+2211 ISOamsb --> -<!-- sum is NOT the same character as U+03A3 'greek capital letter sigma' - though the same glyph might be used for both --> -<!ENTITY minus "−"> <!-- minus sign, U+2212 ISOtech --> -<!ENTITY lowast "∗"> <!-- asterisk operator, U+2217 ISOtech --> -<!ENTITY radic "√"> <!-- square root = radical sign, - U+221A ISOtech --> -<!ENTITY prop "∝"> <!-- proportional to, U+221D ISOtech --> -<!ENTITY infin "∞"> <!-- infinity, U+221E ISOtech --> -<!ENTITY ang "∠"> <!-- angle, U+2220 ISOamso --> -<!ENTITY and "∧"> <!-- logical and = wedge, U+2227 ISOtech --> -<!ENTITY or "∨"> <!-- logical or = vee, U+2228 ISOtech --> -<!ENTITY cap "∩"> <!-- intersection = cap, U+2229 ISOtech --> 
-<!ENTITY cup "∪"> <!-- union = cup, U+222A ISOtech --> -<!ENTITY int "∫"> <!-- integral, U+222B ISOtech --> -<!ENTITY there4 "∴"> <!-- therefore, U+2234 ISOtech --> -<!ENTITY sim "∼"> <!-- tilde operator = varies with = similar to, - U+223C ISOtech --> -<!-- tilde operator is NOT the same character as the tilde, U+007E, - although the same glyph might be used to represent both --> -<!ENTITY cong "≅"> <!-- approximately equal to, U+2245 ISOtech --> -<!ENTITY asymp "≈"> <!-- almost equal to = asymptotic to, - U+2248 ISOamsr --> -<!ENTITY ne "≠"> <!-- not equal to, U+2260 ISOtech --> -<!ENTITY equiv "≡"> <!-- identical to, U+2261 ISOtech --> -<!ENTITY le "≤"> <!-- less-than or equal to, U+2264 ISOtech --> -<!ENTITY ge "≥"> <!-- greater-than or equal to, - U+2265 ISOtech --> -<!ENTITY sub "⊂"> <!-- subset of, U+2282 ISOtech --> -<!ENTITY sup "⊃"> <!-- superset of, U+2283 ISOtech --> -<!ENTITY nsub "⊄"> <!-- not a subset of, U+2284 ISOamsn --> -<!ENTITY sube "⊆"> <!-- subset of or equal to, U+2286 ISOtech --> -<!ENTITY supe "⊇"> <!-- superset of or equal to, - U+2287 ISOtech --> -<!ENTITY oplus "⊕"> <!-- circled plus = direct sum, - U+2295 ISOamsb --> -<!ENTITY otimes "⊗"> <!-- circled times = vector product, - U+2297 ISOamsb --> -<!ENTITY perp "⊥"> <!-- up tack = orthogonal to = perpendicular, - U+22A5 ISOtech --> -<!ENTITY sdot "⋅"> <!-- dot operator, U+22C5 ISOamsb --> -<!-- dot operator is NOT the same character as U+00B7 middle dot --> - -<!-- Miscellaneous Technical --> -<!ENTITY lceil "⌈"> <!-- left ceiling = APL upstile, - U+2308 ISOamsc --> -<!ENTITY rceil "⌉"> <!-- right ceiling, U+2309 ISOamsc --> -<!ENTITY lfloor "⌊"> <!-- left floor = APL downstile, - U+230A ISOamsc --> -<!ENTITY rfloor "⌋"> <!-- right floor, U+230B ISOamsc --> -<!ENTITY lang "〈"> <!-- left-pointing angle bracket = bra, - U+2329 ISOtech --> -<!-- lang is NOT the same character as U+003C 'less than sign' - or U+2039 'single left-pointing angle quotation mark' --> -<!ENTITY rang "〉"> <!-- 
right-pointing angle bracket = ket, - U+232A ISOtech --> -<!-- rang is NOT the same character as U+003E 'greater than sign' - or U+203A 'single right-pointing angle quotation mark' --> - -<!-- Geometric Shapes --> -<!ENTITY loz "◊"> <!-- lozenge, U+25CA ISOpub --> - -<!-- Miscellaneous Symbols --> -<!ENTITY spades "♠"> <!-- black spade suit, U+2660 ISOpub --> -<!-- black here seems to mean filled as opposed to hollow --> -<!ENTITY clubs "♣"> <!-- black club suit = shamrock, - U+2663 ISOpub --> -<!ENTITY hearts "♥"> <!-- black heart suit = valentine, - U+2665 ISOpub --> -<!ENTITY diams "♦"> <!-- black diamond suit, U+2666 ISOpub --> diff --git a/lib/htmlpurifier/docs/examples/basic.php b/lib/htmlpurifier/docs/examples/basic.php deleted file mode 100644 index b51096d2d..000000000 --- a/lib/htmlpurifier/docs/examples/basic.php +++ /dev/null @@ -1,23 +0,0 @@ -<?php - -// This file demonstrates basic usage of HTMLPurifier. - -// replace this with the path to the HTML Purifier library -require_once '../../library/HTMLPurifier.auto.php'; - -$config = HTMLPurifier_Config::createDefault(); - -// configuration goes here: -$config->set('Core.Encoding', 'UTF-8'); // replace with your encoding -$config->set('HTML.Doctype', 'XHTML 1.0 Transitional'); // replace with your doctype - -$purifier = new HTMLPurifier($config); - -// untrusted input HTML -$html = '<b>Simple and short'; - -$pure_html = $purifier->purify($html); - -echo '<pre>' . htmlspecialchars($pure_html) . 
'</pre>'; - -// vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/fixquotes.htc b/lib/htmlpurifier/docs/fixquotes.htc deleted file mode 100644 index 80dda2dc2..000000000 --- a/lib/htmlpurifier/docs/fixquotes.htc +++ /dev/null @@ -1,9 +0,0 @@ -<public:attach event="oncontentready" onevent="init();" /> -<script> -function init() { - element.innerHTML = '“'+element.innerHTML+'”'; -} -</script> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/index.html b/lib/htmlpurifier/docs/index.html deleted file mode 100644 index 3c4ecc716..000000000 --- a/lib/htmlpurifier/docs/index.html +++ /dev/null @@ -1,188 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Index to all HTML Purifier documentation." /> -<link rel="stylesheet" type="text/css" href="./style.css" /> - -<title>Documentation - HTML Purifier</title> - -</head> -<body> - -<h1>Documentation</h1> - -<p><strong><a href="http://htmlpurifier.org/">HTML Purifier</a></strong> has documentation for all types of people. 
-Here is an index of all of them.</p> - -<h2>End-user</h2> -<p>End-user documentation that contains articles, tutorials and useful -information for casual developers using HTML Purifier.</p> - -<dl> - -<dt><a href="enduser-id.html">IDs</a></dt> -<dd>Explains various methods for allowing IDs in documents safely.</dd> - -<dt><a href="enduser-youtube.html">Embedding YouTube videos</a></dt> -<dd>Explains how to safely allow the embedding of flash from trusted sites.</dd> - -<dt><a href="enduser-slow.html">Speeding up HTML Purifier</a></dt> -<dd>Explains how to speed up HTML Purifier through caching or inbound filtering.</dd> - -<dt><a href="enduser-utf8.html">UTF-8: The Secret of Character Encoding</a></dt> -<dd>Describes the rationale for using UTF-8, the ramifications otherwise, and how to make the switch.</dd> - -<dt><a href="enduser-tidy.html">Tidy</a></dt> -<dd>Tutorial for tweaking HTML Purifier's Tidy-like behavior.</dd> - -<dt><a href="enduser-customize.html">Customize</a></dt> -<dd>Tutorial for customizing HTML Purifier's tag and attribute sets.</dd> - -<dt><a href="enduser-uri-filter.html">URI Filters</a></dt> -<dd>Tutorial for creating custom URI filters.</dd> - -</dl> - -<h2>Development</h2> -<p>Developer documentation detailing code issues, roadmaps and project -conventions.</p> - -<dl> - -<dt><a href="dev-progress.html">Implementation Progress</a></dt> -<dd>Tables detailing HTML element and CSS property implementation coverage.</dd> - -<dt><a href="dev-naming.html">Naming Conventions</a></dt> -<dd>Defines class naming conventions.</dd> - -<dt><a href="dev-optimization.html">Optimization</a></dt> -<dd>Discusses possible methods of optimizing HTML Purifier.</dd> - -<dt><a href="dev-flush.html">Flushing the Purifier</a></dt> -<dd>Discusses when to flush HTML Purifier's various caches.</dd> - -<dt><a href="dev-advanced-api.html">Advanced API</a></dt> -<dd>Specification for HTML Purifier's advanced API for defining -custom filtering behavior.</dd> - -<dt><a 
href="dev-config-schema.html">Config Schema</a></dt> -<dd>Describes config schema framework in HTML Purifier.</dd> - -</dl> - -<h2>Proposals</h2> -<p>Proposed features, as well as the associated rambling to get a clear -objective in place before attempted implementation.</p> - -<dl> -<dt><a href="proposal-colors.html">Colors</a></dt> -<dd>Proposal to allow for color constraints.</dd> -</dl> - -<h2>Reference</h2> -<p>Miscellaneous essays, research pieces and other reference type material -that may not directly discuss HTML Purifier.</p> - -<dl> -<dt><a href="ref-devnetwork.html">DevNetwork Credits</a></dt> -<dd>Credits and links to DevNetwork forum topics.</dd> -</dl> - -<h2>Internal memos</h2> - -<p>Plaintext documents that are more for use by active developers of -the code. They may be upgraded to HTML files or stay as TXT scratchpads.</p> - -<table class="table"> - -<thead><tr> - <th style="width:10%">Type</th> - <th style="width:20%">Name</th> - <th>Description</th> -</tr></thead> - -<tbody> - -<tr> - <td>End-user</td> - <td><a href="enduser-overview.txt">Overview</a></td> - <td>High level overview of the general control flow (mostly obsolete).</td> -</tr> - -<tr> - <td>End-user</td> - <td><a href="enduser-security.txt">Security</a></td> - <td>Common security issues that may still arise (half-baked).</td> -</tr> - -<tr> - <td>Development</td> - <td><a href="dev-config-bcbreaks.txt">Config BC Breaks</a></td> - <td>Backwards-incompatible changes in HTML Purifier 4.0.0</td> -</tr> - -<tr> - <td>Development</td> - <td><a href="dev-code-quality.txt">Code Quality Issues</a></td> - <td>Enumerates code quality issues and places that need to be refactored.</td> -</tr> - -<tr> - <td>Proposal</td> - <td><a href="proposal-filter-levels.txt">Filter levels</a></td> - <td>Outlines details of projected configurable level of filtering.</td> -</tr> - -<tr> - <td>Proposal</td> - <td><a href="proposal-language.txt">Language</a></td> - <td>Specification of I18N for error messages 
derived from MediaWiki (half-baked).</td> -</tr> - -<tr> - <td>Proposal</td> - <td><a href="proposal-new-directives.txt">New directives</a></td> - <td>Assorted configuration options that could be implemented.</td> -</tr> - -<tr> - <td>Proposal</td> - <td><a href="proposal-css-extraction.txt">CSS extraction</a></td> - <td>Taking the inline CSS out of documents and into <code>style</code>.</td> -</tr> - -<tr> - <td>Reference</td> - <td><a href="ref-content-models.txt">Handling Content Model Changes</a></td> - <td>Discusses how to tidy up content model changes using custom ChildDef classes.</td> -</tr> - -<tr> - <td>Reference</td> - <td><a href="ref-proprietary-tags.txt">Proprietary tags</a></td> - <td>List of vendor-specific tags we may want to transform to W3C compliant markup.</td> -</tr> - -<tr> - <td>Reference</td> - <td><a href="ref-html-modularization.txt">Modularization of HTMLDefinition</a></td> - <td>Provides a high-level overview of the concepts behind HTMLModules.</td> -</tr> - -<tr> - <td>Reference</td> - <td><a href="ref-whatwg.txt">WHATWG</a></td> - <td>How WHATWG plays into what we need to do.</td> -</tr> - -</tbody> - -</table> - -</body> -</html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/proposal-colors.html b/lib/htmlpurifier/docs/proposal-colors.html deleted file mode 100644 index 657633882..000000000 --- a/lib/htmlpurifier/docs/proposal-colors.html +++ /dev/null @@ -1,49 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Proposal to allow for color constraints in HTML Purifier." 
/> -<link rel="stylesheet" type="text/css" href="./style.css" /> - -<title>Proposal: Colors - HTML Purifier</title> - -</head><body> - -<h1 class="subtitled">Colors</h1> -<div class="subtitle">Hammering some sense into those color-blind newbies</div> - -<div id="filing">Filed under Proposals</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p>Your website probably has a color-scheme. -<span style="color:#090; background:#FFF;">Green on white</span>, -<span style="color:#A0F; background:#FF0;">purple on yellow</span>, -whatever. When you give users the ability to style their content, you may -want them to keep in line with your styling. If you're website is all -about light colors, you don't want a user to come in and vandalize your -page with a deep maroon.</p> - -<p>This is an extremely silly feature proposal, but I'm writing it down anyway.</p> - -<p>What if the user could constrain the colors specified in inline styles? You -are only allowed to use these shades of dark green for text and these shades -of light yellow for the background. At the very least, you could ensure -that we did not have pale yellow on white text.</p> - -<h2>Implementation issues</h2> - -<ol> -<li>Requires the color attribute definition to know, currently, what the text -and background colors are. 
This becomes difficult when classes are thrown -into the mix.</li> -<li>The user still has to define the permissible colors, how does one do -something like that?</li> -</ol> - -</body> -</html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/proposal-config.txt b/lib/htmlpurifier/docs/proposal-config.txt deleted file mode 100644 index 4e031c586..000000000 --- a/lib/htmlpurifier/docs/proposal-config.txt +++ /dev/null @@ -1,23 +0,0 @@ - -Configuration - -Configuration is documented on a per-use case: if a class uses a certain -value from the configuration object, it has to define its name and what the -value is used for. This means decentralized configuration declarations that -are nevertheless error checking and a centralized configuration object. - -Directives are divided into namespaces, indicating the major portion of -functionality they cover (although there may be overlaps). Please consult -the documentation in ConfigDef for more information on these namespaces. - -Since configuration is dependant on context, internal classes require a -configuration object to be passed as a parameter. (They also require a -Context object). A majority of classes do not need the config object, -but for those who do, it is a lifesaver. - -Definition objects are complex datatypes influenced by their respective -directive namespaces (HTMLDefinition with HTML and CSSDefinition with CSS). -If any of these directives is updated, HTML Purifier forces the definition -to be regenerated. - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/proposal-css-extraction.txt b/lib/htmlpurifier/docs/proposal-css-extraction.txt deleted file mode 100644 index 9933c96b8..000000000 --- a/lib/htmlpurifier/docs/proposal-css-extraction.txt +++ /dev/null @@ -1,34 +0,0 @@ - -Extracting inline CSS from HTML Purifier - voodoofied: Assigning semantics to elements - -Sander Tekelenburg brought to my attention the poor programming style of -inline CSS in HTML documents. 
In an ideal world, we wouldn't be using inline -CSS at all: everything would be assigned using semantic class attributes -from an external stylesheet. - -With ExtractStyleBlocks and CSSTidy, this is now possible (when allowed, users -can specify a style element which gets extracted from the user-submitted HTML, which -the application can place in the head of the HTML document). But there still -is the issue of inline CSS that refuses to go away. - -The basic idea behind this feature is assign every element a unique identifier, -and then move all of the CSS data to a style-sheet. This HTML: - -<div style="text-align:center">Big <span style="color:red;">things</span>!</div> - -into - -<div id="hp-12345">Big <span id="hp-12346">things</span>!</div> - -and a stylesheet that is: - -#hp-12345 {text-align:center;} -#hp-12346 {color:red;} - -Beyond that, HTML Purifier can magically merge common CSS values together, -and a whole manner of other heuristic things. HTML Purifier should also -make it easy for an admin to re-style the HTML semantically. Speed is not -an issue. Also, better WYSIWYG editors are needed. - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/proposal-errors.txt b/lib/htmlpurifier/docs/proposal-errors.txt deleted file mode 100644 index 87cb2ac19..000000000 --- a/lib/htmlpurifier/docs/proposal-errors.txt +++ /dev/null @@ -1,211 +0,0 @@ -Considerations for ErrorCollection - -Presently, HTML Purifier takes a code-execution centric approach to handling -errors. Errors are organized and grouped according to which segment of the -code triggers them, not necessarily the portion of the input document that -triggered the error. This means that errors are pseudo-sorted by category, -rather than location in the document. - -One easy way to "fix" this problem would be to re-sort according to line number. -However, the "category" style information we derive from naively following -program execution is still useful. 
After all, each of the strategies which -can report errors still process the document mostly linearly. Furthermore, -not only do they process linearly, but the way they pass off operations to -sub-systems mirrors that of the document. For example, AttrValidator will -linearly proceed through elements, and on each element will use AttrDef to -validate those contents. From there, the attribute might have more -sub-components, which have execution passed off accordingly. - -In fact, each strategy handles a very specific class of "error." - -RemoveForeignElements - element tokens -MakeWellFormed - element token ordering -FixNesting - element token ordering -ValidateAttributes - attributes of elements - -The crucial point is that while we care about the hierarchy governing these -different errors, we *don't* care about any other information about what actually -happens to the elements. This brings up another point: if HTML Purifier fixes -something, this is not really a notice/warning/error; it's really a suggestion -of a way to fix the aforementioned defects. - -In short, the refactoring to take this into account kinda sucks. - -Errors should not be recorded in order that they are reported. Instead, they -should be bound to the line (and preferably element) in which they were found. -This means we need some way to uniquely identify every element in the document, -which doesn't presently exist. An easy way of adding this would be to track -line columns. An important ramification of this is that we *must* use the -DirectLex implementation. - - 1. Implement column numbers for DirectLex [DONE!] - 2. Disable error collection when not using DirectLex [DONE!] - -Next, we need to re-orient all of the error declarations to place CurrentToken -at utmost important. Since this is passed via Context, it's not always clear -if that's available. ErrorCollector should complain HARD if it isn't available. -There are some locations when we don't have a token available. 
These include: - - * Lexing - this can actually have a row and column, but NOT correspond to - a token - * End of document errors - bump this to the end - -Actually, we *don't* have to complain if CurrentToken isn't available; we just -set it as a document-wide error. And actually, nothing needs to be done here. - -Something interesting to consider is whether or not we care about the locations -of attributes and CSS properties, i.e. the sub-objects that compose these things. -In terms of consistency, at the very least attributes should have column/line -numbers attached to them. However, this may be overkill, as attributes are -uniquely identifiable. You could go even further, with CSS, but they are also -uniquely identifiable. - -Bottom-line is, however, this information must be available, in form of the -CurrentAttribute and CurrentCssProperty (theoretical) context variables, and -it must be used to organize the errors that the sub-processes may throw. -There is also a hierarchy of sorts that may make merging this into one context -variable more sense, if it hadn't been for HTML's reasonably rigid structure. -A CSS property will never contain an HTML attribute. So we won't ever get -recursive relations, and having multiple depths won't ever make sense. Leave -this be. - -We already have this information, and consequently, using start and end is -*unnecessary*, so long as the context variables are set appropriately. We don't -care if an error was thrown by an attribute transform or an attribute definition; -to the end user these are the same (for a developer, they are different, but -they're better off with a stack trace (which we should add support for) in such -cases). - - 3. Remove start()/end() code. Don't get rid of recursion, though [DONE] - 4. Setup ErrorCollector to use context information to setup hierarchies. - This may require a different internal format. Use objects if it gets - complex. 
[DONE] - - ASIDE - More on this topic: since we are now binding errors to lines - and columns, a particular error can have three relationships to that - specific location: - - 1. The token at that location directly - RemoveForeignElements - AttrValidator (transforms) - MakeWellFormed - 2. A "component" of that token (i.e. attribute) - AttrValidator (removals) - 3. A modification to that node (i.e. contents from start to end - token) as a whole - FixNesting - - This needs to be marked accordingly. In the presentation, it might - make sense keep (3) separate, have (2) a sublist of (1). (1) can - be a closing tag, in which case (3) makes no sense at all, OR it - should be related with its opening tag (this may not necessarily - be possible before MakeWellFormed is run). - - So, the line and column counts as our identifier, so: - - $errors[$line][$col] = ... - - Then, we need to identify case 1, 2 or 3. They are identified as - such: - - 1. Need some sort of semaphore in RemoveForeignElements, etc. - 2. If CurrentAttr/CurrentCssProperty is non-null - 3. Default (FixNesting, MakeWellFormed) - - One consideration about (1) is that it usually is actually a - (3) modification, but we have no way of knowing about that because - of various optimizations. However, they can probably be treated - the same. The other difficulty is that (3) is never a line and - column; rather, it is a range (i.e. a duple) and telling the user - the very start of the range may confuse them. For example, - - <b>Foo<div>bar</div></b> - ^ ^ - - The node being operated on is <b>, so the error would be assigned - to the first caret, with a "node reorganized" error. Then, the - ChildDef would have submitted its own suggestions and errors with - regard to what's going in the internals. So I suppose this is - ok. :-) - - Now, the structure of the earlier mentioned ... 
would be something - like this: - - object { - type = (token|attr|property), - value, // appropriate for type - errors => array(), - sub-errors = [recursive], - } - - This helps us keep things agnostic. It is also sufficiently complex - enough to warrant an object. - -So, more wanking about the object format is in order. The way HTML Purifier is -currently setup, the only possible hierarchy is: - - token -> attr -> css property - -These relations do not exist all of the time; a comment or end token would not -ever have any attributes, and non-style attributes would never have CSS properties -associated with them. - -I believe that it is worth supporting multiple paths. At some point, we might -have a hierarchy like: - - * -> syntax - -> token -> attr -> css property - -> url - -> css stylesheet <style> - -et cetera. Now, one of the practical implications of this is that every "node" -on our tree is well-defined, so in theory it should be possible to either 1. -create a separate class for each error struct, or 2. embed this information -directly into HTML Purifier's token stream. Embedding the information in the -token stream is not a terribly good idea, since tokens can be removed, etc. -So that leaves us with 1... and if we use a generic interface we can cut down -on a lot of code we might need. So let's leave it like this. - -~~~~ - -Then we setup suggestions. - - 5. Setup a separate error class which tells the user any modifications - HTML Purifier made. - -Some information about this: - -Our current paradigm is to tell the user what HTML Purifier did to the HTML. -This is the most natural mode of operation, since that's what HTML Purifier -is all about; it was not meant to be a validator. - -However, most other people have experience dealing with a validator. 
In cases -where HTML Purifier unambiguously does the right thing, simply giving the user -the correct version isn't a bad idea, but problems arise when: - -- The user has such bad HTML we do something odd, when we should have just - flagged the HTML as an error. Such examples are when we do things like - remove text from directly inside a <table> tag. It was probably meant to - be in a <td> tag or be outside the table, but we're not smart enough to - realize this so we just remove it. In such a case, we should tell the user - that there was foreign data in the table, but then we shouldn't "demand" - the user remove the data; it's more of a "here's a possible way of - rectifying the problem" - -- Giving line context for input is hard enough, but feasible; giving output - line context will be extremely difficult due to shifting lines; we'd probably - have to track what the tokens are and then find the appropriate out context - and it's not guaranteed to work etc etc etc. - -```````````` - -Don't forget to spruce up output. - - 6. Output needs to automatically give line and column numbers, basically - "at line" on steroids. Look at W3C's output; it's ok. [PARTIALLY DONE] - - - We need a standard CSS to apply (check demo.css for some starting - styling; some buttons would also be hip) - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/proposal-filter-levels.txt b/lib/htmlpurifier/docs/proposal-filter-levels.txt deleted file mode 100644 index b78b898b4..000000000 --- a/lib/htmlpurifier/docs/proposal-filter-levels.txt +++ /dev/null @@ -1,137 +0,0 @@ - -Filter Levels - When one size *does not* fit all - -It makes little sense to constrain users to one set of HTML elements and -attributes and tell them that they are not allowed to mold this in -any fashion. Many users demand to be able to custom-select which elements -and attributes they want. 
This is fine: because HTML Purifier keeps close
-track of what elements are safe to use, there is no way for them to
-accidently allow an XSS-able tag.
-
-However, combing through the HTML spec to make your own whitelist can
-be a daunting task. HTML Purifier ought to offer pre-canned filter levels
-that amateur users can select based on what they think is their use-case.
-
-Here are some fuzzy levels you could set:
-
-1. Comments - Wordpress recommends a, abbr, acronym, b, blockquote, cite,
-   code, em, i, strike, strong; however, you could get away with only a, em and
-   p; also having blockquote and pre tags would be helpful.
-2. BBCode - Emulate the usual tagset for forums: b, i, img, a, blockquote,
-   pre, div, span and h[2-6] (the last three are for specially formatted
-   posts, div and span require associated classes or inline styling enabled
-   to be useful)
-3. Pages - As permissive as possible without allowing XSS. No protection
-   against bad design sense, unfortunantely. Suitable for wiki and page
-   environments. (probably what we have now)
-4. Lint - Accept everything in the spec, a Tidy wannabe. (This probably won't
-   get implemented as it would require routines for things like <object>
-   and friends to be implemented, which is a lot of work for not a lot of
-   benefit)
-
-One final note: when you start axing tags that are more commonly used, you
-run the risk of accidentally destroying user data, especially if the data
-is incoming from a WYSIWYG editor that hasn't been synced accordingly. This may
-make forbidden element to text transformations desirable (for example, images).
-
-
-
-== Element Risk Analysis ==
-
-Although none of the currently supported elements presents a security
-threat per-say, some can cause problems for page layouts or be
-extremely complicated.
-
-Legend:
-    [danger level] - regular tags / uncommon tags ~ deprecated tags
-    [danger level]* - rare tags
-
-1 - blockquote, code, em, i, p, tt / strong, sub, sup
-1* - abbr, acronym, bdo, cite, dfn, kbd, q, samp
-2 - b, br, del, div, pre, span / ins, s, strike ~ u
-3 - h2, h3, h4, h5, h6 ~ center
-4 - h1, big ~ font
-5 - a
-7 - area, map
-
-These are special use tags, they should be enabled on a blanket basis.
-
-Lists - dd, dl, dt, li, ol, ul ~ menu, dir
-Tables - caption, table, td, th, tr / col, colgroup, tbody, tfoot, thead
-
-Forms - fieldset, form, input, lable, legend, optgroup, option, select, textarea
-XSS - noscript, object, script ~ applet
-Meta - base, basefont, body, head, html, link, meta, style, title
-Frames - frame, frameset, iframe
-
-And tag specific notes:
-
-a - general problems involving linkspam
-b - too much bold is bad, typographically speaking bold is discouraged
-br - often misused
-center - CSS, usually no legit use
-del - only useful in editing context
-div - little meaning in certain contexts i.e. blog comment
-h1 - usually no legit use, as header is already set by application
-h* - not needed in blog comments
-hr - usually not necessary in blog comments
-img - could be extremely undesirable if linking to external pics (CSRF, goatse)
-pre - could use formatting, only useful in code contexts
-q - very little support
-s - transform into span with styling or del?
-small - technically presentational
-span - depends on attribute allowances
-sub, sup - specialized
-u - little legit use, prefer class with text-decoration
-
-Based on the riskiness of the items, we may want to offer %HTML.DisableImages
-attribute and put URI filtering higher up on the priority list.
-
-
-== Attribute Risk Analysis ==
-
-We actually have a suprisingly small assortment of allowed attributes (the
-rest are deprecated in strict, and thus we opted not to allow them, even
-though our output is XHTML Transitional by default.)
- -Required URI - img.alt, img.src, a.href -Medium risk - *.class, *.dir -High risk - img.height, img.width, *.id, *.style - -Table - colgroup/col.span, td/th.rowspan, td/th.colspan -Uncommon - *.title, *.lang, *.xml:lang -Rare - td/th.abbr, table.summary, {table}.charoff -Rare URI - del.cite, ins.cite, blockquote.cite, q.cite, img.longdesc -Presentational - {table}.align, {table}.valign, table.frame, table.rules, - table.border -Partially presentational - table.cellpadding, table.cellspacing, - table.width, col.width, colgroup.width - - -== CSS Risk Analysis == - -Currently, there is no support for fine-grained "allowed CSS" specification, -mainly because I'm lazy, partially because no one has asked for it. However, -this will be added eventually. - -There are certain CSS elements that are extremely useful inline, but then -as you get to more presentation oriented styling it may not always be -appropriate to inline them. - -Useful - clear, float, border-collapse, caption-side - -These CSS properties can break layouts if used improperly. We have excluded -any CSS properties that are not currently implemented (such as position). - -Dangerous, can go outside container - float -Easy to abuse - font-size, font-family (font), width -Colored - background-color (background), border-color (border), color - (see proposal-colors.html) -Dramatic - border, list-style-position (list-style), margin, padding, - text-align, text-indent, text-transform, vertical-align, line-height - -Dramatic elements substantially change the look of text in ways that should -probably have been reserved to other areas. - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/proposal-language.txt b/lib/htmlpurifier/docs/proposal-language.txt deleted file mode 100644 index 149701cd3..000000000 --- a/lib/htmlpurifier/docs/proposal-language.txt +++ /dev/null @@ -1,64 +0,0 @@ -We are going to model our I18N/L10N off of MediaWiki's system. 
Theirs is -obviously quite complicated, so we're going to simplify it a bit for our needs. - -== Caching == - -MediaWiki has lots of caching mechanisms built in, which make the code somewhat -more difficult to understand. Before doing any loading, MediaWiki will check -the following places to see if we can be lazy: - -1. $mLocalisationCache[$code] - just a variable where it may have been stashed -2. serialized/$code.ser - compiled serialized language file -3. Memcached version of file (with expiration checking) - -Expiration checking consists of ensuring all dependencies have filemtimes -that match the ones bundled with the cached copy. Similar checking could be -implemented for serialized versions, as it seems that they are not updated -until manually recompiled. - -== Behavior == - -Things that are localizable: - -- Weekdays (and abbrev) -- Months (and abbrev) -- Bookstores -- Skin names -- Date preferences / Custom date format -- Default date format -- Default user option overrides --+ Language names -- Timezones --+ Character encoding conversion via iconv -- UpperLowerCase first (needs casemaps for some) -- UpperLowerCase -- Uppercase words -- Uppercase word breaks -- Case folding -- Strip punctuation for MySQL search -- Get first character --+ Alternate encoding --+ Recoding for edit (and then recode input) --+ RTL --+ Direction mark character depending on RTL --?
Arrow depending on RTL -- Languages where italics cannot be used --+ Number formatting (commafy, transform digits, transform separators) -- Truncate (multibyte) -- Grammar conversions for inflected languages -- Plural transformations -- Formatting expiry times -- Segmenting for diffs (Chinese) -- Convert to variants of language -- Language specific user preference options -- Link trails [[foo]]bar --+ Language code (RFC 3066) - -Neat functionality: - -- I18N sprintfDate -- Roman numeral formatting - -Items marked with a + likely need to be addressed by HTML Purifier - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/proposal-new-directives.txt b/lib/htmlpurifier/docs/proposal-new-directives.txt deleted file mode 100644 index f54ee2d8d..000000000 --- a/lib/htmlpurifier/docs/proposal-new-directives.txt +++ /dev/null @@ -1,44 +0,0 @@ - -Configuration Ideas - -Here are some theoretical configuration ideas that we could implement some -time. Note the naming convention: %Namespace.Directive. If you want one -implemented, give us a ring, and we'll move it up the priority chain. - -%Attr.RewriteFragments - if there's %Attr.IDPrefix we may want to transparently - rewrite the URLs we parse too. However, we can only do it when it's a pure - anchor link, so it's not foolproof - -%Attr.ClassBlacklist, -%Attr.ClassWhitelist, -%Attr.ClassPolicy - determines what classes are allowed. When - %Attr.ClassPolicy is set to Blacklist, only allow those not in - %Attr.ClassBlacklist. When it's Whitelist, only allow those in - %Attr.ClassWhitelist. - -%Attr.MaxWidth, -%Attr.MaxHeight - caps for width and height related checks. 
- (the hack in Pixels for an image crashing attack could be replaced by this) - -%URI.AddRelNofollow - will add rel="nofollow" to all links, preventing the - spread of ill-gotten pagerank - -%URI.HostBlacklistRegex - regexes; hosts matching any of them are disallowed -%URI.HostWhitelist - domain names that are excluded from the host blacklist -%URI.HostPolicy - determines whether to reject all hosts and then whitelist, - or to allow all hosts and then apply specific blacklists, with the whitelist - intervening. 'DenyAll' or 'AllowAll' (default) - -%URI.DisableIPHosts - URIs that have IP addresses for hosts are disallowed. - Be sure to also grab unusual encodings (dword, hex and octal), which may - currently be caught by regular DNS -%URI.DisableIDN - Disallow raw internationalized domain names. Punycode - will still be permitted. - -%URI.ConvertUnusualIPHosts - transform dword/hex/octal IP addresses to the - regular form -%URI.ConvertAbsoluteDNS - Remove extra dots after host names that trigger - absolute DNS. While this is actually the preferred method according to - the RFC, most people opt to use a domain name relative to . (root). - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/proposal-plists.txt b/lib/htmlpurifier/docs/proposal-plists.txt deleted file mode 100644 index eef8ade61..000000000 --- a/lib/htmlpurifier/docs/proposal-plists.txt +++ /dev/null @@ -1,218 +0,0 @@ -THE UNIVERSAL DESIGN PATTERN: PROPERTIES -Steve Yegge - -Implementation: - get(name) - put(name, value) - has(name) - remove(name) - iteration, with filtering [this will be our namespaces] - parent - -Representations: - - Keys are strings - - It's nice to not need to quote keys (if we formulate our own language, - consider this) - - Property not present representation (key missing) - - Frequent removal/re-add may have null help. If null is valid, use - another value.
(PHP semantics are weird here) - -Data structures: - - LinkedHashMap is wonderful (O(1) access and maintains order) - - Using a special property that points to the parent is usual - - Multiple inheritance possible, need rules for which to lookup first - - Iterative inheritance is best - - Consider performance! - -Deletion - - Tricky problem with inheritance - - Distinguish between "not found" and "look in my parent for the property" - [Maybe HTML Purifier won't allow deletion] - -Read/write asymmetry (it's correct!) - -Read-only plists - - Allow ability to freeze [this is what we have already] - - Don't overuse it - -Performance: - - Intern strings (PHP does this already) - - Don't be case-insensitive - - If all properties in a plist are known a-priori, you can use a "perfect" - hash function. Often overkill. - - Copy-on-read caching "plundering" reduces lookup, but uses memory and can - grow stale. Use as last resort. - - Refactoring to fields. Watch for API compatibility, system complexity, - and lack of flexibility. 
- - Refrigerator: external data-structure to hold plists - -Transient properties: - [Don't need to worry about this] - - Use a separate plist for transient properties - - Non-numeric override; numeric should ADD - - Deletion: removeTransientProperty() and transientlyRemoveProperty() - -Persistence: - - XML/JSON are good - - Text-based is good for readability, maintainability and bootstrapping - - Compressed binary format for network transport [not necessary] - - RDBMS or XML database - -Querying: [not relevant] - - XML database is nice for XPath/XQuery - - jQuery for JSON - - Just load it all into a program - -Backfills/Data integrity: - - Use usual methods - - Lazy backfill is a nice hack - -Type systems: - - Flags: ReadOnly, Permanent, DontEnum - - Typed properties aren't that useful [It's also Not-PHP] - - Separate meta-list of directive properties IS useful - - Duck typing is useful for systems designed fully around properties pattern - -Trade-off: - + Flexibility - + Extensibility - + Unit-testing/prototype-speed - - Performance - - Data integrity - - Navigability/Query-ability - - Reversibility (hard to go back) - -HTML Purifier - -We are not happy with our current system of defining configuration directives, -because it has become clear that things will get a lot nicer if we allow -multiple namespaces, and there are some features that naturally lend themselves -to inheritance, which we do not really support well. - -One of the considered implementation changes would be to go from a structure -like: - -array( - 'Namespace' => array( - 'Directive' => 'val1', - 'Directive2' => 'val2', - ) -) - -to: - -array( - 'Namespace.Directive' => 'val1', - 'Namespace.Directive2' => 'val2', -) - -The second (flat) implementation takes more memory, however, and it makes it a bit -complicated to grab all values from a namespace. - -The alternate implementation choice is to allow nested plists.
This keeps -iteration easy, but is problematic for inheritance (it would be difficult -to distinguish a plist from an array) and retrieval (when specifying multiple -namespaces we would need some multiple de-referencing). - ----- - -We can bite the performance hit, and just do iteration with filter -(the strncmp call should be relatively cheap). Then, users should be able -to optimize doing something like: - -$config = HTMLPurifier_Config::createDefault(); -if (!file_exists('config.php')) { - // set up $config - $config->save('config.php'); -} else { - $config->load('config.php'); -} - -Or maybe memcache, or something. This means that "// set up $config" must -not have any dynamic parts, or the user has to invalidate the cache when -they do update it. We have to think about this a little more carefully; the -file call might be more expensive. - ----- - -This might get expensive, however, when we actually care about iterating -over the configuration and want the actual values. So what about nesting the -lists? - -"ns.sub.directive" => values['ns']['sub']['directive'] - -We can distinguish between plists and arrays by using ArrayObjects for the -plists, and regular arrays for the arrays? Alternatively, use ArrayObjects -for the arrays, and regular arrays for the plists. - ----- - -Implementation demands, and what has caused them: - -1. DefinitionCache, the HTML, CSS and URI namespaces have caches attached to them - Results: - - getBatchSerial() - - getBatch() : in general, the ability to traverse just a namespace - -2. AutoFormat/Filter, this is a plugin architecture, directives not hard-coded - - getBatch() - -3. Configuration form - - Namespaces used to organize directives - -Other than that, we have a pure plist. PERHAPS we should maintain separate things -for these different demands. 
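To make the "iteration with filter" idea above concrete, here is a minimal sketch. It is written in Python rather than PHP and is not HTML Purifier's code; the name get_batch only echoes the getBatch() mentioned above, and its signature and the sample keys are assumptions made for illustration.

```python
def get_batch(directives, namespace):
    """Return the directives that live under `namespace`, keyed by short name.

    `directives` is a flat mapping such as {'Namespace.Directive': 'val1'}.
    The prefix test plays the role of the strncmp() call mentioned above.
    """
    prefix = namespace + "."  # the trailing dot keeps 'Namespace2.*' out
    return {
        key[len(prefix):]: value
        for key, value in directives.items()
        if key.startswith(prefix)
    }


# Hypothetical sample data, reusing the placeholder names from the text.
config = {
    "Namespace.Directive": "val1",
    "Namespace.Directive2": "val2",
    "Namespace2.Directive": "other",
}
batch = get_batch(config, "Namespace")
```

Every call walks the whole mapping, which is the performance hit the notes propose to accept.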
- -Issue 2: Directives for configuring the plugins are regular plists, but -when enabling them, while it's "plist-ish", what you're really doing is adding -them to an array of "autoformatters"/"filters" to enable. We can set up -magic BC as well as in the new interface, but there should also be an -add('AutoFormat', 'AutoParagraph'); which does the right thing. - -One thing to consider is whether or not inheritance rules will apply to these. -I'd say yes. That means that they're still plisty; in fact, the underlying -implementation will probably be a plist. However, they will get their OWN -plists, and will NOT support nesting. - -Issue 1: Our current implementation is generally not efficient; md5(serialize($foo)) -is pretty expensive. So, I don't think there will be any problems if it -gets "less" efficient, as long as we give users a properly fast alternative; -DefinitionRev gives us a way to do this, by simply telling the user they must -update it whenever they update Configuration directives as well. (There are -obvious BC concerns here). - -In such a case, we simply iterate over our plist (performing full retrievals -for each value), grab the entries we care about, and then serialize and hash. -It's going to be slow either way, due to the ability of plists to inherit. -If we ksort(), we don't have to traverse the entire array; however, the -cost of a ksort() call may not be worth it. - -At this point, last time, I started worrying about the performance implications -of allowing inheritance, and wondering whether or not I wanted to squash -the plist. At first blush, our code might be under the assumption that -accessing properties is cheap; but actually we prefer to copy out the value -into a member variable if it's going to be used many times. With this in mind, -I don't think CPU consumption from a few nested function calls is going to -be a problem. We *are* going to enforce a function-only interface.
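For illustration only, a function-only plist with parent inheritance might look roughly like the sketch below. It is Python, not PHP, and not the implementation HTML Purifier ended up with; the class name PropertyList, the squash() helper and the sample keys are all hypothetical.

```python
_MISSING = object()  # sentinel, so that None stays a legal stored value


class PropertyList:
    """Property list with single inheritance via a parent pointer."""

    def __init__(self, parent=None):
        self._data = {}
        self._parent = parent

    def put(self, name, value):
        self._data[name] = value

    def get(self, name, default=None):
        # Iterative (not recursive) walk up the parent chain.
        plist = self
        while plist is not None:
            if name in plist._data:
                return plist._data[name]
            plist = plist._parent
        return default

    def has(self, name):
        return self.get(name, _MISSING) is not _MISSING

    def remove(self, name):
        # Removes the local value only; an inherited value shows through.
        self._data.pop(name, None)

    def squash(self):
        """Flatten the parent chain into one dict; nearer plists win."""
        chain = []
        plist = self
        while plist is not None:
            chain.append(plist._data)
            plist = plist._parent
        merged = {}
        for data in reversed(chain):  # oldest ancestor first
            merged.update(data)
        return merged


defaults = PropertyList()
defaults.put("Namespace.Directive", "val1")
defaults.put("Namespace.Directive2", "val2")

custom = PropertyList(parent=defaults)
custom.put("Namespace.Directive2", "override")
```

Note that remove() here does not solve the deletion problem raised in the notes: it cannot distinguish "not found" from "look in my parent for the property".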
- -The next issue at hand is how we're going to manage the "special" plists, -which should still be able to be inherited. Basically, it means that multiple -plists would be attached to the configuration object, which is not the -best for memory performance. The alternative is to keep them all in one -big plist, and then eat the one-time cost of traversing the entire plist -to grab the appropriate values. - -I think at this point we can write the generic interface, and then set up separate -plists if that ends up being necessary for performance (it probably won't.) Now -let's code our generic plist implementation. - ----- - -Iterating over the plist presents some problems. The way we've chosen to solve -this is to squash all of the parents. - ----- - -But I don't need iteration. - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/ref-content-models.txt b/lib/htmlpurifier/docs/ref-content-models.txt deleted file mode 100644 index 19f84d526..000000000 --- a/lib/htmlpurifier/docs/ref-content-models.txt +++ /dev/null @@ -1,50 +0,0 @@ - -Handling Content Model Changes - - -1. Context - -The distinction between Transitional and Strict document types is somewhat -of an anomaly in the lineage of XHTML document types (following 1.0, -doctypes do not have flavors: instead, modularization is used to let -document authors vary their elements). This transition is usually quite -straightforward, as W3C usually deprecates attributes or elements, which -are quite easily handled using tag and attribute transforms. - -However, for three elements, <blockquote>, <body> and <address>, W3C elected -to also change the content model. <blockquote> and <body> originally -accepted both inline and block elements, but in the strict doctype they -only allow block elements. With <address>, the situation is inverted: -<p> tags are now forbidden from appearing within this tag. - - -2.
Current situation - -Currently, HTML Purifier treats <blockquote> specially during Tidy mode -using a custom ChildDef class StrictBlockquote. StrictBlockquote -operates similarly to Required, except that when it encounters an inline -element, it will wrap it in a block tag (as specified by -%HTML.BlockWrapper, the default is <p>). The naming suggests it can -only be used for <blockquote>s, although it may be possible to -genericize it to work on other cases of this nature (this would be of -little practical application, as no other element in XHTML 1.1 or earlier -has a block-only content model). - -Tidy currently contains no custom, lenient implementation for <address>. -If one were to be written, it would likely operate on the principle that, -when a <p> tag is encountered, it would be replaced with a -leading and trailing <br /> tag (the contents of <p>, being inline, are -not an issue). There is no prior work with this sort of operation. - - -3. Outside applicability - -There are a number of other elements that contain restrictive content -models, such as <ul> or <span> (the latter is restrictive in that it -does not allow block elements). In the former case, an errant node -is eliminated completely; in the latter case, the text of the node -is preserved (as the parent node does allow PCDATA). Custom -content model implementations probably are not the best way of handling -these cases; node bubbling should be implemented instead. - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/ref-css-length.txt b/lib/htmlpurifier/docs/ref-css-length.txt deleted file mode 100644 index aa40559e3..000000000 --- a/lib/htmlpurifier/docs/ref-css-length.txt +++ /dev/null @@ -1,30 +0,0 @@ - -CSS Length Reference - To bound, or not to bound, that is the question - -It's quite a reasonable request, really, and it's already been implemented -for HTML. That is, length bounding.
It makes little sense to let users -define text blocks that have a font-size of 63,360 inches (that's a mile, -by the way) or a width forty-fold that of the parent container. - -But it's a little more complicated than that. There are multiple units -one can use, and we have to do a little unit conversion to get things working. -Here's what we have: - -Absolute: - 1 in = 2.54 cm - 1 cm = 10 mm - 1 pt = 1/72 in - 1 pc = 12 pt - -Relative: - 1 em ~= 10.0667 px - 1 ex ~= 0.5 em, though Mozilla Firefox says 1 ex = 6px - 1 px ~= 1 pt - -Watch out: font-sizes can also be nested to get successively larger -(although I do not relish having to keep track of context font-sizes, -this may be necessary, especially for some of the more advanced features -for preventing things like white on white). - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/ref-devnetwork.html b/lib/htmlpurifier/docs/ref-devnetwork.html deleted file mode 100644 index 2e9d142e5..000000000 --- a/lib/htmlpurifier/docs/ref-devnetwork.html +++ /dev/null @@ -1,47 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> -<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head> -<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> -<meta name="description" content="Credits and links to DevNetwork forum topics on HTML Purifier."
/> -<link rel="stylesheet" type="text/css" href="./style.css" /> - -<title>DevNetwork Credits - HTML Purifier</title> - -</head> -<body> - -<h1>DevNetwork Credits</h1> - -<div id="filing">Filed under Reference</div> -<div id="index">Return to the <a href="index.html">index</a>.</div> -<div id="home"><a href="http://htmlpurifier.org/">HTML Purifier</a> End-User Documentation</div> - -<p>Many thanks to the DevNetwork community for answering questions, -theorizing about design, and offering encouragement during -the development of this library in these forum threads:</p> - -<ul> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=52905">HTMLPurifier PHP Library hompeage</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=53056">How much of CSS to implement?</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=53083">Parsing URL only according to URI : Security Risk?</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=53096">Gimme a name : URI and friends</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=53415">How to document configuration directives</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=53479">IPv6</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=53539">http and ftp versus news and mailto</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=53579">HTMLPurifier - Take your best shot</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=53664">Need help optimizing a block of code</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=53861">Non-SGML characters</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=54283">Wordpress makes me cry</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=54478">Parameter Object vs. Parameter Array vs. 
Parameter Functions</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=54521">Convert encoding where output cannot represent characters</a></li> - <li><a href="http://forums.devnetwork.net/viewtopic.php?t=56411">Reporting errors in a document without line numbers</a></li> -</ul> - -<p>...as well as any I may have forgotten.</p> - -</body> -</html> - -<!-- vim: et sw=4 sts=4 ---> diff --git a/lib/htmlpurifier/docs/ref-html-modularization.txt b/lib/htmlpurifier/docs/ref-html-modularization.txt deleted file mode 100644 index d26d30ada..000000000 --- a/lib/htmlpurifier/docs/ref-html-modularization.txt +++ /dev/null @@ -1,166 +0,0 @@ - -The Modularization of HTMLDefinition in HTML Purifier - -WARNING: This document was drafted before the implementation of this - system, and some implementation details may have evolved over time. - -HTML Purifier uses the modularization of XHTML -<http://www.w3.org/TR/xhtml-modularization/> to organize the internals -of HTMLDefinition into a more manageable and extensible fashion. Rather -than have one super-object, HTMLDefinition is split into HTMLModules, -each of which are responsible for defining elements, their attributes, -and other properties (for a more indepth coverage, see -/library/HTMLPurifier/HTMLModule.php's docblock comments). These modules -are managed by HTMLModuleManager. - -Modules that we don't support but could support are: - - * 5.6. Table Modules - o 5.6.1. Basic Tables Module [?] - * 5.8. Client-side Image Map Module [?] - * 5.9. Server-side Image Map Module [?] - * 5.12. Target Module [?] - * 5.21. Name Identification Module [deprecated] - -These modules would be implemented as "unsafe": - - * 5.2. Core Modules - o 5.2.1. Structure Module - * 5.3. Applet Module - * 5.5. Forms Modules - o 5.5.1. Basic Forms Module - o 5.5.2. Forms Module - * 5.10. Object Module - * 5.11. Frames Module - * 5.13. Iframe Module - * 5.14. Intrinsic Events Module - * 5.15. Metainformation Module - * 5.16. 
Scripting Module - * 5.17. Style Sheet Module - * 5.19. Link Module - * 5.20. Base Module - -We will not be using W3C's XML Schemas or DTDs directly due to the lack -of robust tools for handling them (the main problem is that all the -current parsers are usually PHP 5 only and solely-validating, not -correcting). - -This system may be generalized and ported over for CSS. - -== General Use-Case == - -The outwards API of HTMLDefinition has been largely preserved, not -only for backwards-compatibility but also by design. Instead, -HTMLDefinition can be retrieved "raw", in which case it loads a structure -that closely resembles the modules of XHTML 1.1. This structure is very -dynamic, making it easy to make cascading changes to global content -sets or remove elements in bulk. - -However, once HTML Purifier needs the actual definition, it retrieves -a finalized version of HTMLDefinition. The finalized definition involves -processing the modules into a form that is optimized for multiple -calls. This final version is immutable and, even if editable, would -be extremely hard to change. - -So, some code taking advantage of the XHTML modularization may look -like this: - -<?php - $config = HTMLPurifier_Config::createDefault(); - $def =& $config->getHTMLDefinition(true); // reference to raw - $def->addElement('marquee', 'Block', 'Flow', 'Common'); - $purifier = new HTMLPurifier($config); - $purifier->purify($html); // now the definition is finalized -?> - -== Inclusions == - -One of the nice features of HTMLDefinition is that piggy-backing off -of global attribute and content sets is extremely easy to do. - -=== Attributes === - -HTMLModule->elements[$element]->attr stores attribute information for the -specific attributes of $element. This is quite close to the final -API that HTML Purifier interfaces with, but there's an important -extra feature: attr may also contain an array at member index zero.
- -<?php - HTMLModule->elements[$element]->attr[0] = array('AttrSet'); -?> - -Rather than map the attribute key 0 to an array (which should be -an AttrDef), it defines a number of attribute collections that should -be merged into this element's attribute array. - -Furthermore, the value of an attribute key, attribute value pair need -not be a fully fledged AttrDef object. It can also be a string, which -signifies an AttrDef that is looked up from a centralized registry -AttrTypes. This allows more concise attribute definitions that look -more like W3C's declarations, as well as offering a centralized point -for modifying the behavior of one attribute type. And, of course, the -old method of manually instantiating an AttrDef still works. - -=== Attribute Collections === - -Attribute collections are stored and processed in the AttrCollections -object, which is responsible for performing the inclusions signified -by the 0 index. These attribute collections, too, are mutable, by -using HTMLModule->attr_collections. You may add new attributes -to a collection or define an entirely new collection for your module's -use. Inclusions can also be cumulative. - -Attribute collections allow us to get rid of so-called "global attributes" -(which actually aren't so global). - -=== Content Models and ChildDef === - -An implementation of the above-mentioned attributes and attribute -collections was applied to the ChildDef system. HTML Purifier uses -a proprietary system called ChildDef for performance and flexibility -reasons, but this does not line up very well with W3C's notion of -regexps for defining the allowed children of an element. - -HTMLPurifier->elements[$element]->content_model and -HTMLPurifier->elements[$element]->content_model_type store information -about the final ChildDef that will be stored in -HTMLPurifier->elements[$element]->child (we use a different variable -because the two forms are sufficiently different).
- -$content_model is an abstract, string representation of the internal -state of ChildDef, while $content_model_type is a string identifier -of which ChildDef subclass to instantiate. $content_model is processed -by substituting all content set identifiers (capitalized element names) -with their contents. It is then parsed and passed into the appropriate -ChildDef class, as defined by ContentSets->getChildDef() or the -custom fallback HTMLModule->getChildDef() for custom child definitions -not in the core. - -You'll need to use these facilities if you plan on referencing a content -set like "Inline" or "Block", and using them is recommended even if you're -not, due to their conciseness. - -A few notes on $content_model: its structure can be as complicated -as you want, but the pipe symbol (|) is reserved for defining possible -choices, due to the content sets implementation. For example, a content -model that looks like: - -"Inline -> Block -> a" - -...when the Inline content set is defined as "span | b" and the Block -content set is defined as "div | blockquote", will expand into: - -"span | b -> div | blockquote -> a" - -The custom HTMLModule->getChildDef() function will need to be able to -then feed this information to ChildDef in a usable manner. - -=== Content Sets === - -Content sets can be altered using HTMLModule->content_sets, an associative -array of content set names to content set contents. If the content set -already exists, your values are appended on to it (great for, say, -registering the font tag as an inline element); otherwise it is -created. They are substituted into content_model.
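The substitution and append behavior described above can be sketched as follows. This is a Python illustration under stated assumptions, not HTML Purifier's PHP code: the function names expand_content_model and register_content_set are hypothetical, and identifiers are assumed to be plain alphanumeric tokens.

```python
import re


def expand_content_model(content_model, content_sets):
    """Replace each content set identifier with that set's contents.

    Tokens that are not known content sets (plain element names such
    as 'a') are left untouched. Substituted text is not re-scanned.
    """
    def substitute(match):
        name = match.group(0)
        return content_sets.get(name, name)

    return re.sub(r"[A-Za-z][A-Za-z0-9]*", substitute, content_model)


def register_content_set(content_sets, name, elements):
    """Append to an existing content set, or create it if missing."""
    if name in content_sets:
        content_sets[name] = content_sets[name] + " | " + elements
    else:
        content_sets[name] = elements


# The example from the text above.
content_sets = {"Inline": "span | b", "Block": "div | blockquote"}
expanded = expand_content_model("Inline -> Block -> a", content_sets)

# Registering the font tag as an inline element appends to the set.
register_content_set(content_sets, "Inline", "font")
```

Because sets are joined with the pipe symbol, a custom content model cannot use "|" for anything other than choices, which matches the reservation noted above.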
- - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/ref-proprietary-tags.txt b/lib/htmlpurifier/docs/ref-proprietary-tags.txt deleted file mode 100644 index 5849eb04d..000000000 --- a/lib/htmlpurifier/docs/ref-proprietary-tags.txt +++ /dev/null @@ -1,26 +0,0 @@ - -Proprietary Tags - <nobr> and friends - -Here are some proprietary tags that W3C does not define but occasionally show -up in the wild. We have only included tags that would make sense in an -HTML Purifier context. - -<align>, block element that aligns (extremely rare) -<blackface>, inline that double-bolds text (extremely rare) -<comment>, hidden comment for IE and WebTV -<multicol cols=number gutter=pixels width=pixels>, multiple columns -<nobr>, no linebreaks -<spacer align=* type="vertical|horizontal|block">, whitespace in doc, - use width/height for block and size for vertical/horizontal (attributes) - (extremely rare) -<wbr>, potential word break point: allows linebreaks. Only works in <nobr> - -<listing>, monospace pre-variant (extremely rare) -<plaintext>, escapes all tags to the end of document -<xmp>, monospace, replace with pre - -These should be put into their own Tidy module, not loaded by default(?). These -all qualify as "lenient" transforms. - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/ref-whatwg.txt b/lib/htmlpurifier/docs/ref-whatwg.txt deleted file mode 100644 index 4bb4984f2..000000000 --- a/lib/htmlpurifier/docs/ref-whatwg.txt +++ /dev/null @@ -1,26 +0,0 @@ - -Web Hypertext Application Technology Working Group - WHATWG - -== HTML 5 == - -URL: http://www.whatwg.org/specs/web-apps/current-work/ - -HTML 5 defines a kaboodle of new elements and attributes, as well as -some well-defined, "quirks mode" HTML parsing. Although WHATWG professes -to be targeted towards web applications, many of their semantic additions -would be quite useful in regular documents. Eventually, HTML -Purifier will need to audit their lists and figure out what changes need -to be made. 
This process is complicated by the fact that the WHATWG -doesn't buy into W3C's modularization of XHTML 1.1: we may need -to remodularize HTML 5 (probably done by section name). No sense in -committing ourselves till the spec stabilizes, though. - -Of more immediate interest, however, is the well-defined parsing -behavior that HTML 5 adds. While I have little interest in writing -another DirectLex parser, other parsers like ph5p -<http://jero.net/lab/ph5p/> can be adapted to DOMLex to support much more -flexible HTML parsing (a cool feature I've seen is how they resolve -<b>bold<i>both</b>italic</i>). - - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/specimens/LICENSE b/lib/htmlpurifier/docs/specimens/LICENSE deleted file mode 100644 index 0bfad771e..000000000 --- a/lib/htmlpurifier/docs/specimens/LICENSE +++ /dev/null @@ -1,10 +0,0 @@ -Licensing of Specimens - -Some files in this directory have different licenses: - -windows-live-mail-desktop-beta.html - donated by laacz, public domain -img.png - LGPL, from <http://commons.wikimedia.org/wiki/Image:Pastille_chrome.png> - -All other files are by me, and are licensed under LGPL.
- - vim: et sw=4 sts=4 diff --git a/lib/htmlpurifier/docs/specimens/html-align-to-css.html b/lib/htmlpurifier/docs/specimens/html-align-to-css.html deleted file mode 100644 index 0adf76aaa..000000000 --- a/lib/htmlpurifier/docs/specimens/html-align-to-css.html +++ /dev/null @@ -1,165 +0,0 @@ -<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" - "http://www.w3.org/TR/html4/loose.dtd"> -<html> -<head> -<title>HTML align attribute to CSS - HTML Purifier Specimen</title> -<style type="text/css"> -div.container {position:relative;height:110px;} -div.container.legend .test {text-align:center;line-height:100px;} -div.test {width:100px;height:100px;border:1px solid black; -position:absolute;top:10px;} -div.test.html {left:10px;} -div.test.css {left:140px;} -table {background:#F00;} -img {border:1px solid #000;} -hr {width:50px;} -div.segment {width:250px; float:left; margin-top:1em;} -</style> -</head> -<body> - -<h1>HTML align attribute to CSS</h1> - -<p>Inspect source for methodology.</p> - -<div class="container legend"> -<div class="test html"> - HTML -</div> -<div class="test css"> - CSS -</div> -</div> - -<div class="segment"> - -<h2>table.align</h2> - -<h3>left</h3> -<div class="container"> -<div class="test html"> - a<table align="left"><tr><td>O</td></tr></table>a -</div> -<div class="test css"> - a<table style="float:left;"><tr><td>O</td></tr></table>a -</div> -</div> - -<h3>center</h3> -<div class="container"> -<div class="test html"> - a<table align="center"><tr><td>O</td></tr></table>a -</div> -<div class="test css"> - a<table style="margin-left:auto; margin-right:auto;"><tr><td>O</td></tr></table>a -</div> -</div> - -<h3>right</h3> -<div class="container"> -<div class="test html"> - a<table align="right"><tr><td>O</td></tr></table>a -</div> -<div class="test css"> - a<table style="float:right;"><tr><td>O</td></tr></table>a -</div> -</div> - -</div> - -<!-- ################################################################## --> - -<div 
class="segment"> -<h2>img.align</h2> -<h3>left</h3> -<div class="container"> -<div class="test html"> - a<img src="img.png" align="left">a -</div> -<div class="test css"> - a<img src="img.png" style="float:left;">a -</div> -</div> - -<h3>right</h3> -<div class="container"> -<div class="test html"> - a<img src="img.png" align="right">a -</div> -<div class="test css"> - a<img src="img.png" style="float:right;">a -</div> -</div> - -<h3>bottom</h3> -<div class="container"> -<div class="test html"> - a<img src="img.png" align="bottom">a -</div> -<div class="test css"> - a<img src="img.png" style="vertical-align:baseline;">a -</div> -</div> - -<h3>middle</h3> -<div class="container"> -<div class="test html"> - a<img src="img.png" align="middle">a -</div> -<div class="test css"> - a<img src="img.png" style="vertical-align:middle;">a -</div> -</div> - -<h3>top</h3> -<div class="container"> -<div class="test html"> - a<img src="img.png" align="top">a -</div> -<div class="test css"> - a<img src="img.png" style="vertical-align:top;">a -</div> -</div> - -</div> - -<!-- ################################################################## --> - -<div class="segment"> - -<h2>hr.align</h2> - -<h3>left</h3> -<div class="container"> -<div class="test html"> - <hr align="left" /> -</div> -<div class="test css"> - <hr style="margin-right:auto; margin-left:0; text-align:left;" /> -</div> -</div> - -<h3>center</h3> -<div class="container"> -<div class="test html"> - <hr align="center" /> -</div> -<div class="test css"> - <hr style="margin-right:auto; margin-left:auto; text-align:center;" /> -</div> -</div> - -<h3>right</h3> -<div class="container"> -<div class="test html"> - <hr align="right" /> -</div> -<div class="test css"> - <hr style="margin-right:0; margin-left:auto; text-align:right;" /> -</div> -</div> - -</div> - -</body> -</html> diff --git a/lib/htmlpurifier/docs/specimens/img.png b/lib/htmlpurifier/docs/specimens/img.png Binary files differdeleted file mode 100644 index 
a755bcb5e..000000000 --- a/lib/htmlpurifier/docs/specimens/img.png +++ /dev/null diff --git a/lib/htmlpurifier/docs/specimens/jochem-blok-word.html b/lib/htmlpurifier/docs/specimens/jochem-blok-word.html deleted file mode 100644 index 1cc08f888..000000000 --- a/lib/htmlpurifier/docs/specimens/jochem-blok-word.html +++ /dev/null @@ -1,129 +0,0 @@ -<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"> - -<head> -<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii"> -<meta name=Generator content="Microsoft Word 12 (filtered medium)"> -<!--[if !mso]> -<style> -v\:* {behavior:url(#default#VML);} -o\:* {behavior:url(#default#VML);} -w\:* {behavior:url(#default#VML);} -..shape {behavior:url(#default#VML);} -</style> -<![endif]--> -<style> -<!-- - /* Font Definitions */ - @font-face - {font-family:"Cambria Math"; - panose-1:2 4 5 3 5 4 6 3 2 4;} -@font-face - {font-family:Calibri; - panose-1:2 15 5 2 2 2 4 3 2 4;} -@font-face - {font-family:Tahoma; - panose-1:2 11 6 4 3 5 4 4 2 4;} -@font-face - {font-family:Verdana; - panose-1:2 11 6 4 3 5 4 4 2 4;} - /* Style Definitions */ - p.MsoNormal, li.MsoNormal, div.MsoNormal - {margin:0cm; - margin-bottom:.0001pt; - font-size:10.0pt; - font-family:"Verdana","sans-serif";} -a:link, span.MsoHyperlink - {mso-style-priority:99; - color:blue; - text-decoration:underline;} -a:visited, span.MsoHyperlinkFollowed - {mso-style-priority:99; - color:purple; - text-decoration:underline;} -p.MsoAcetate, li.MsoAcetate, div.MsoAcetate - {mso-style-priority:99; - mso-style-link:"Balloon Text Char"; - margin:0cm; - margin-bottom:.0001pt; - font-size:8.0pt; - font-family:"Tahoma","sans-serif";} -span.EmailStyle17 - {mso-style-type:personal-compose; - font-family:"Verdana","sans-serif"; - color:windowtext;} -span.BalloonTextChar - 
{mso-style-name:"Balloon Text Char"; - mso-style-priority:99; - mso-style-link:"Balloon Text"; - font-family:"Tahoma","sans-serif";} -..MsoChpDefault - {mso-style-type:export-only;} -@page Section1 - {size:612.0pt 792.0pt; - margin:70.85pt 70.85pt 70.85pt 70.85pt;} -div.Section1 - {page:Section1;} ---> -</style> -<!--[if gte mso 9]><xml> - <o:shapedefaults v:ext="edit" spidmax="2050" /> -</xml><![endif]--><!--[if gte mso 9]><xml> - <o:shapelayout v:ext="edit"> - <o:idmap v:ext="edit" data="1" /> - </o:shapelayout></xml><![endif]--> -</head> - -<body lang=NL link=blue vlink=purple> - -<div class=Section1> - -<p class=MsoNormal><img width=1277 height=994 id="Picture_x0020_1" -src="cid:image001.png@01C8CBDF.5D1BAEE0"><o:p></o:p></p> - -<p class=MsoNormal><o:p> </o:p></p> - -<p class=MsoNormal><b>Name<o:p></o:p></b></p> - -<p class=MsoNormal>E-mail : <a href="mailto:mail@example.com"><span -style='color:windowtext'>mail@example.com</span></a><o:p></o:p></p> - -<p class=MsoNormal><o:p> </o:p></p> - -<p class=MsoNormal><b>Company<o:p></o:p></b></p> - -<p class=MsoNormal>Address 1<o:p></o:p></p> - -<p class=MsoNormal>Address 2<o:p></o:p></p> - -<p class=MsoNormal><o:p> </o:p></p> - -<p class=MsoNormal>Telefoon : +xx xx xxx xxx xx <span style='color:black'><o:p></o:p></span></p> - -<p class=MsoNormal><span lang=EN-US style='color:black'>Fax : +xx xx xxx xx xx<o:p></o:p></span></p> - -<p class=MsoNormal><span lang=EN-US style='color:black'>Internet : </span><span -style='color:black'><a href="http://www.example.com/"><span lang=EN-US -style='color:black'>http://www.example.com</span></a></span><span -lang=EN-US style='color:black'><o:p></o:p></span></p> - -<p class=MsoNormal><span lang=EN-US style='color:black'>Kamer van koophandel -xxxxxxxxx<o:p></o:p></span></p> - -<p class=MsoNormal><span lang=EN-US style='color:black'><o:p> </o:p></span></p> - -<p class=MsoNormal><span lang=EN-US style='font-size:7.5pt;color:black'>Op deze -e-mail is een disclaimer van toepassing, ga 
naar </span><span lang=EN-US -style='font-size:7.5pt'><a -href="http://www.example.com/disclaimer"><span -style='color:black'>www.example.com/disclaimer</span></a><br> -<span style='color:black'>A disclaimer is applicable to this email, please -refer to </span><a href="http://www.example.com/disclaimer"><span -style='color:black'>www.example.com/disclaimer</span></a><o:p></o:p></span></p> - -<p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p> - -</div> - -</body> - -</html> diff --git a/lib/htmlpurifier/docs/specimens/windows-live-mail-desktop-beta.html b/lib/htmlpurifier/docs/specimens/windows-live-mail-desktop-beta.html deleted file mode 100644 index 735b4bd95..000000000 --- a/lib/htmlpurifier/docs/specimens/windows-live-mail-desktop-beta.html +++ /dev/null @@ -1,74 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> -<HTML ChildAreas="4" xmlns:canvas><HEAD> -<META http-equiv=Content-Type content=text/html;charset=windows-1257> -<STYLE></STYLE> - -<META content="MSHTML 6.00.6000.16414" name=GENERATOR></HEAD> -<BODY id=MailContainerBody -style="PADDING-RIGHT: 10px; PADDING-LEFT: 10px; FONT-SIZE: 10pt; COLOR: #000000; PADDING-TOP: 15px; FONT-FAMILY: Arial" -bgColor=#ff6600 leftMargin=0 background="" topMargin=0 -name="Compose message area" acc_role="text" CanvasTabStop="false"> -<DIV -style="BORDER-TOP: #dddddd 1px solid; FONT-SIZE: 10pt; WIDTH: 100%; MARGIN-RIGHT: 10px; PADDING-TOP: 5px; BORDER-BOTTOM: #dddddd 1px solid; FONT-FAMILY: Verdana; HEIGHT: 25px; BACKGROUND-COLOR: #ffffff"><NOBR><SPAN -title="View a slideshow of the pictures in this e-mail message." 
-style="PADDING-RIGHT: 20px"><A style="COLOR: #0088e4" -href="http://g.msn.com/5meen_us/171?path=/photomail/{6fc0065f-ffdd-4ca6-9a4c-cc5a93dc122f}&image=47D7B182CFEFB10!127&imagehi=47D7B182CFEFB10!125&CID=323550092004883216">Play -slideshow </A></SPAN><SPAN style="COLOR: #909090"><SPAN>|</SPAN><SPAN -style="PADDING-LEFT: 20px"> Download the highest quality version of a picture by -clicking the + above it </SPAN></SPAN></NOBR></DIV> -<DIV -style="PADDING-RIGHT: 5px; PADDING-LEFT: 7px; PADDING-BOTTOM: 2px; WIDTH: 100%; PADDING-TOP: 2px"> -<OL> - <LI><IMG title="Angry smile emoticon" - style="FLOAT: none; MARGIN: 0px; POSITION: static" tabIndex=-1 - alt="Angry smile emoticon" src="cid:49F0C856199E4D688D2D740680733D74@wc" - MSNNonUserImageOrEmoticon="true">Un ka <FONT style="BACKGROUND-COLOR: #800000" - color=#cc99ff><STRONG>Tev</STRONG></FONT> iet, un ko tu dari? - <LI>Aha!</LI></OL> - -<UL> - <LI>Buletets - <LI> - <DIV align=justify><A title=http://laacz.lv/blog/ - href="http://laacz.lv/blog/">http://laacz.lv/blog/</A> un <A - title=http://google.com/ href="http://google.com/">gugle</A></DIV> - <LI>Sarakstucitis</LI></UL></DIV><SPAN><SPAN xmlns:canvas="canvas-namespace-id" -layoutEmptyTextWellFont="Tahoma"><SPAN -style="MARGIN-BOTTOM: 15px; OVERFLOW: visible; HEIGHT: 16px"></SPAN><SPAN -style="MARGIN-BOTTOM: 25px; VERTICAL-ALIGN: top; OVERFLOW: visible; MARGIN-RIGHT: 25px; HEIGHT: 234px"> -<TABLE style="DISPLAY: inline"> - <TBODY> - <TR> - - <TD> - <DIV - style="FONT-WEIGHT: bold; FONT-SIZE: 12pt; FONT-FAMILY: arial; TEXT-ALIGN: center"><A - id=HiresARef - title="Click here to view or download a high resolution version of this picture" - style="COLOR: #0088e4; TEXT-DECORATION: none" - href="http://byfiles.storage.msn.com/x1pMvt0I80jTgT6DuaCpEMbprX3nk3jNv_vjigxV_EYVSMyM_PKgEvDEUtuNhQC-F-23mTTcKyqx6eGaeK2e_wMJ0ikwpDdFntk4SY7pfJUv2g2Ck6R2S2vAA?download">+</A></DIV> - <DIV - title="Click here to view the full image using the online photo viewer." 
- style="DISPLAY: inline; OVERFLOW: hidden; WIDTH: 140px; HEIGHT: 140px"><A - href="http://g.msn.com/5meen_us/171?path=/photomail/{6fc0065f-ffdd-4ca6-9a4c-cc5a93dc122f}&image=47D7B182CFEFB10!127&imagehi=47D7B182CFEFB10!125&CID=323550092004883216" - border="0"><IMG - style="MARGIN-TOP: 15px; DISPLAY: inline-block; MARGIN-LEFT: 0px" - height=109 src="cid:006A71303B80404E9FB6184E55D6A446@wc" width=140 - border=0></A></DIV></TD></TR> - <TR> - <TD> - <DIV - style="FONT-SIZE: 10pt; WIDTH: 140px; FONT-FAMILY: verdana; TEXT-ALIGN: center"><EM><STRONG>This - <U>is </U></STRONG><U>tit</U>le</EM> fo<STRONG>r <FONT - face="Arial Black">t<FONT color=#800000 size=7>h<U>i</U></FONT>s - </FONT>picture</STRONG></DIV></TD></TR></TBODY></TABLE></SPAN></SPAN></SPAN> - -<DIV -style="PADDING-RIGHT: 5px; PADDING-LEFT: 7px; PADDING-BOTTOM: 2px; WIDTH: 100%; PADDING-TOP: 2px; HEIGHT: 50px"> -<DIV> </DIV></DIV> -<DIV -style="BORDER-TOP: #dddddd 1px solid; FONT-SIZE: 10pt; MARGIN-BOTTOM: 10px; WIDTH: 100%; COLOR: #909090; MARGIN-RIGHT: 10px; PADDING-TOP: 9px; FONT-FAMILY: Verdana; HEIGHT: 42px; BACKGROUND-COLOR: #ffffff"><NOBR><SPAN -title="Join Windows Live to share photos using Windows Live Photo E-mail.">Online -pictures are available for 30 days. <A style="COLOR: #0088e4" -href="http://g.msn.com/5meen_us/175">Get Windows Live Mail desktop to create -your own photo e-mails. 
</A></SPAN></NOBR></DIV></BODY></HTML> diff --git a/lib/htmlpurifier/docs/style.css b/lib/htmlpurifier/docs/style.css deleted file mode 100644 index bd79c8a00..000000000 --- a/lib/htmlpurifier/docs/style.css +++ /dev/null @@ -1,76 +0,0 @@ -html {font-size:1em; font-family:serif; } -body {margin-left:4em; margin-right:4em; } - -dt {font-weight:bold; } -pre {margin-left:2em; } -pre, code, tt {font-family:monospace; font-size:1em; } - -h1 {text-align:center; font-family:Garamond, serif; - font-variant:small-caps;} -h2 {border-bottom:1px solid #CCC; font-family:sans-serif; font-weight:normal; - font-size:1.3em;} -h3 {font-family:sans-serif; font-size:1.1em; font-weight:bold; } -h4 {font-family:sans-serif; font-size:0.9em; font-weight:bold; } - -/* For witty quips */ -.subtitled {margin-bottom:0em;} -.subtitle , .subsubtitle {font-size:.8em; margin-bottom:1em; - font-style:italic; margin-top:-.2em;text-align:center;} -.subsubtitle {text-align:left;margin-left:2em;} - -/* Used for special "See also" links. */ -.reference {font-style:italic;margin-left:2em;} - -/* Marks off asides, discussions on why something is the way it is */ -.aside {margin-left:2em; font-family:sans-serif; font-size:0.9em; } -blockquote .label {font-weight:bold; font-size:1em; margin:0 0 .1em; - border-bottom:1px solid #CCC;} -.emphasis {font-weight:bold; text-align:center; font-size:1.3em;} - -/* A regular table */ -.table {border-collapse:collapse; border-bottom:2px solid #888; margin-left:2em; } -.table thead th {margin:0; background:#888; color:#FFF; } -.table thead th:first-child {-moz-border-radius-topleft:1em;} -.table tbody td {border-bottom:1px solid #CCC; padding-right:0.6em;padding-left:0.6em;} - -/* A quick table*/ -table.quick tbody th {text-align:right; padding-right:1em;} - -/* Category of the file */ -#filing {font-weight:bold; font-size:smaller; } - -/* Contains, without exception, Return to index. 
*/ -#index {font-size:smaller; } - -#home {font-size:smaller;} - -/* Contains, without exception, $Id$, for SVN version info. */ -#version {text-align:right; font-style:italic; margin:2em 0;} - -#toc ol ol {list-style-type:lower-roman;} -#toc ol {list-style-type:decimal;} -#toc {list-style-type:upper-alpha;} - -q { - behavior: url(fixquotes.htc); /* IE fix */ - quotes: '\201C' '\201D' '\2018' '\2019'; -} -q:before { - content: open-quote; -} -q:after { - content: close-quote; -} - -/* Marks off implementation details interesting only to the person writing - the class described in the spec. */ -.technical {margin-left:2em; } -.technical:before {content:"Technical note: "; font-weight:bold; color:#061; } - -/* Marks off sections that are lacking. */ -.fixme {margin-left:2em; } -.fixme:before {content:"Fix me: "; font-weight:bold; color:#C00; } - -#applicability {margin: 1em 5%; font-style:italic;} - -/* vim: et sw=4 sts=4 */ |