From 7a40f4354b32809af3d0cfd6e3af0eda02ab0e0a Mon Sep 17 00:00:00 2001 From: friendica Date: Sat, 12 May 2012 17:57:41 -0700 Subject: some important stuff we'll need --- lib/htmlpurifier/docs/enduser-customize.html | 850 +++++++++++++++++++++++++++ 1 file changed, 850 insertions(+) create mode 100644 lib/htmlpurifier/docs/enduser-customize.html (limited to 'lib/htmlpurifier/docs/enduser-customize.html') diff --git a/lib/htmlpurifier/docs/enduser-customize.html b/lib/htmlpurifier/docs/enduser-customize.html new file mode 100644 index 000000000..7e1ffa260 --- /dev/null +++ b/lib/htmlpurifier/docs/enduser-customize.html @@ -0,0 +1,850 @@ + + + + + + + +Customize - HTML Purifier + + + +

Customize!

+
HTML Purifier is a Swiss-Army Knife
+ +
Filed under End-User
+
Return to the index.
+
HTML Purifier End-User Documentation
+ +

+ HTML Purifier has this quirk where if you try to allow certain elements or + attributes, HTML Purifier will tell you that it's not supported, and that + you should go to the forums to find out how to implement it. Well, this + document is how to implement elements and attributes which HTML Purifier + doesn't support out of the box. +

+ +

Is it necessary?

+ +

+ Before we even write any code, it is paramount to consider whether + the code we're writing is necessary. HTML Purifier, by default, + contains a large set of elements and attributes: large enough so that + any element or attribute in XHTML 1.0 or 1.1 (and its HTML variants) + that can be safely used by the general public is implemented. +

+ +

+ So what needs to be implemented? (Feel free to skip this section if + you know what you want). +

+ +

XHTML 1.0

+ +

+ All of the modules listed below are based off of the + modularization of + XHTML, which, while technically for XHTML 1.1, is quite a useful + resource. +

+ + + +

+ If you don't recognize it, you probably don't need it. But the curious + can look all of these modules up in the above-mentioned document. Note + that inline scripting comes packaged with HTML Purifier (more on this + later). +

+ +

XHTML 1.1

+ +

+ As of HTMLPurifier 2.1.0, we have implemented the + Ruby module, + which defines a set of tags + for publishing short annotations for text, used mostly in Japanese + and Chinese school texts, but applicable for positioning any text (not + limited to translations) above or below other corresponding text. +

+ +

HTML 5

+ +

+ HTML 5 + is a fork of HTML 4.01 by WHATWG, who believed that XHTML 2.0 was headed + in the wrong direction. It too is a working draft, and may change + drastically before publication, but it should be noted that the + canvas tag has been implemented by many browser vendors. +

+ +

Proprietary

+ +

+ There are a number of proprietary tags still in the wild. Many of them + have been documented in ref-proprietary-tags.txt, + but there is currently no implementation for any of them. +

+ +

Extensions

+ +

+ There are also a number of other XML languages out there that can + be embedded in HTML documents: two of the most popular are MathML and + SVG, and I frequently get requests to implement these. But they are + expansive, comprehensive specifications, and it would take far too long + to implement them correctly (most systems I've seen go as far + as whitelisting tags and no further; come on, what about nesting!) +

+ +

+ Word of warning: HTML Purifier is currently not namespace + aware. +

+ +

Giving back

+ +

+ As you may imagine from the details above (don't be abashed if you didn't + read it all: a glance over would have done), there's quite a bit that + HTML Purifier doesn't implement. Recent architectural changes have + allowed HTML Purifier to implement elements and attributes that are not + safe! Don't worry, they won't be activated unless you set %HTML.Trusted + to true, but they certainly help out users who need to put, say, forms + on their page and don't want to go through the trouble of reading this + and implementing it themselves. +

+ +

+ So any of the above that you implement for your own application could + help out some other poor sap on the other side of the globe. Help us + out, and send back code so that it can be hammered into a module and + released with the core. Any code would be greatly appreciated! +

+ +

And now...

+ +

+ Enough philosophical talk, time for some code: +

+ +
$config = HTMLPurifier_Config::createDefault();
+$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
+$config->set('HTML.DefinitionRev', 1);
+if ($def = $config->maybeGetRawHTMLDefinition()) {
+    // our code will go here
+}
+ +

+ Assuming that HTML Purifier has already been properly loaded (hint: + include HTMLPurifier.auto.php), this code will set up + the environment that you need to start customizing the HTML definition. + What's going on? +

+ + + +

Turn off caching

+ +

+ To make development easier, we're going to temporarily turn off + definition caching: +

+ +
$config = HTMLPurifier_Config::createDefault();
+$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
+$config->set('HTML.DefinitionRev', 1);
+$config->set('Cache.DefinitionImpl', null); // TODO: remove this later!
+$def = $config->getHTMLDefinition(true);
+ +

+ A few things should be mentioned about the caching mechanism before + we move on. For performance reasons, HTML Purifier caches generated + HTMLPurifier_Definition objects in serialized files + stored (by default) in library/HTMLPurifier/DefinitionCache/Serializer. + A lot of processing is done in order to create these objects, so it + makes little sense to repeat the same processing over and over again + whenever HTML Purifier is called. +

+ +

+ In order to identify a cache entry, HTML Purifier uses three variables: + the library's version number, the value of %HTML.DefinitionRev and + a serial of relevant configuration. Whenever any of these changes, + a new HTML definition is generated. Notice that there is no way + for the definition object to track changes to customizations: here, it + is up to you to supply appropriate information to DefinitionID and + DefinitionRev. +
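The dependence of a cache entry on those three inputs can be sketched as follows. This is a minimal illustration, not HTML Purifier's actual implementation: the function name `definitionCacheKey()`, the `sha1(serialize(...))` step, and the version string are all assumptions made for the example.

```php
<?php
// Hypothetical helper, NOT part of HTML Purifier. It only illustrates how a
// cache key can depend on the library version, the relevant configuration,
// and the user-supplied revision number.
function definitionCacheKey(string $version, array $relevantConfig, int $definitionRev): string
{
    // Sort by directive name so the order in which directives were set is irrelevant.
    ksort($relevantConfig);
    $configSerial = sha1(serialize($relevantConfig));
    return $version . ',' . $configSerial . ',' . $definitionRev;
}

$relevantConfig = array(
    'HTML.DefinitionID' => 'enduser-customize.html tutorial',
    'HTML.Doctype'      => 'XHTML 1.0 Transitional',
);
$before = definitionCacheKey('4.4.0', $relevantConfig, 1);
// DefinitionRev bumped by hand after editing the customizations:
$after = definitionCacheKey('4.4.0', $relevantConfig, 2);
var_dump($before === $after); // the keys differ, so the old entry is not reused
```

Because customizations made on the raw definition object are not part of the configuration, bumping the revision is the only signal in such a scheme that they have changed.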

+ +

Add an attribute

+ +

+ For this example, we're going to implement the target attribute found + on a elements. To implement an attribute, we have to + ask a few questions: +

+ +
    +
  1. What element is it found on?
  2. What is its name?
  3. Is it required or optional?
  4. What are valid values for it?
+ +

+ The first three are easy: the element is a, the attribute + is target, and it is not a required attribute. (If it + were required, we'd need to append an asterisk to the attribute name; + you'll see an example of this in the addElement() example.) +

+ +

+ The last question is a little trickier. + Let's allow the special values: _blank, _self, _parent and _top. + The form of this is called an enumeration, a list of + valid values, although only one can be used at a time. To translate + this into code form, we write: +

+ +
$config = HTMLPurifier_Config::createDefault();
+$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
+$config->set('HTML.DefinitionRev', 1);
+$config->set('Cache.DefinitionImpl', null); // remove this later!
+$def = $config->getHTMLDefinition(true);
+$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');
+ +

+ The Enum#_blank,_self,_parent,_top does all the magic. + The string is split into two parts, separated by a hash mark (#): +

+ +
    +
  1. The first part is the name of what we call an AttrDef
  2. The second part is the parameter of the above-mentioned AttrDef
+ +

+ If that sounds vague and generic, it's because it is! HTML Purifier defines + an assortment of different attribute types one can use, and each of these + has their own specialized parameter format. Here are some of the more useful + ones: +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Type       Format                   Description
Enum       [s:]value1,value2,...    Attribute with a number of valid values, one of which may be used. When s: is present, the enumeration is case sensitive.
Bool       attribute_name           Boolean attribute, with only one valid value: the name of the attribute.
CDATA      (none)                   Attribute of arbitrary text. Can also be referred to as Text (the specification makes a semantic distinction between the two).
ID         (none)                   Attribute that specifies a unique ID
Pixels     (none)                   Attribute that specifies an integer pixel length
Length     (none)                   Attribute that specifies a pixel or percentage length
NMTOKENS   (none)                   Attribute that specifies a number of name tokens, example: the class attribute
URI        (none)                   Attribute that specifies a URI, example: the href attribute
Number     (none)                   Attribute that specifies a positive integer number
+ +

+ For a complete list, consult + library/HTMLPurifier/AttrTypes.php; + more information on attributes that accept parameters can be found on their + respective includes in + library/HTMLPurifier/AttrDef. +
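The two-part shorthand and the Enum parameter format from the table above can be sketched like this. Both functions are hypothetical stand-ins written for illustration; the real parsing happens inside HTML Purifier and may differ in detail.

```php
<?php
// Illustrative parser for the "Name#parameter" shorthand (not library code).
function splitAttrShorthand(string $shorthand): array
{
    // A limit of 2 keeps any later '#' characters inside the parameter.
    $parts = explode('#', $shorthand, 2);
    $name = $parts[0];
    $parameter = isset($parts[1]) ? $parts[1] : null;
    return array($name, $parameter);
}

// Interprets an Enum parameter of the form "[s:]value1,value2,...".
function parseEnumParameter(string $parameter): array
{
    $caseSensitive = false;
    if (strncmp($parameter, 's:', 2) === 0) {
        $caseSensitive = true;
        $parameter = substr($parameter, 2); // drop the "s:" prefix
    }
    return array(
        'case_sensitive' => $caseSensitive,
        'values' => explode(',', $parameter),
    );
}

list($name, $parameter) = splitAttrShorthand('Enum#_blank,_self,_top');
print_r(parseEnumParameter($parameter));
```

Types that take no parameter, such as URI, simply have no hash mark, so the second part comes back as null in this sketch.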

+ +

+ Sometimes, the restrictive list in AttrTypes just doesn't cut it. Don't + sweat: you can also use a fully instantiated object as the value. The + equivalent, verbose form of the above example is: +

+ +
$config = HTMLPurifier_Config::createDefault();
+$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
+$config->set('HTML.DefinitionRev', 1);
+$config->set('Cache.DefinitionImpl', null); // remove this later!
+$def = $config->getHTMLDefinition(true);
+$def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum(
+  array('_blank','_self','_parent','_top')
+));
+ +

+ Trust me, you'll learn to love the shorthand. +

+ +

Add an element

+ +

+ Adding attributes is really small-fry stuff, though, and it was possible + to add them (albeit a bit more wordy) prior to 2.0. The real gem of + the Advanced API is adding elements. There are five questions to + ask when adding a new element: +

+ +
    +
  1. What is the element's name?
  2. What content set does this element belong to?
  3. What are the allowed children of this element?
  4. What attributes does the element allow that are general?
  5. What attributes does the element allow that are specific to this element?
+ +

+ It's a mouthful, and you'll be slightly lost if you're not familiar with + the HTML specification, so let's explain them step by step. +

+ +

Content set

+ +

+ The HTML specification defines two major content sets: Inline + and Block. Each of these + content sets contains a list of elements: Inline contains things like + span and b while Block contains things like + div and blockquote. +

+ +

+ These content sets amount to a macro mechanism for HTML definition. Most + elements in HTML are organized into one of these two sets, and most + elements in HTML allow elements from one of these sets. If we had + to write each element verbatim into each other element's allowed + children, we would have ridiculously large lists; instead we use + content sets to compactify the declaration. +

+ +

+ Practically speaking, there are several useful values you can use here: +

+ + + + + + + + + + + + + + + + + + + + + + +
Content set   Description
Inline        Character level elements, text
Block         Block-like elements, like paragraphs and lists
false         Any element that doesn't fit into the mold, for example li or tr
+ +

+ By specifying a valid value here, all other elements that use that + content set will also allow your element, without you having to do + anything. If you specify false, you'll have to register + your element manually. +

+ +

Allowed children

+ +

+ Allowed children defines the elements that this element can contain. + The allowed values may range from none to a complex regexp depending on + your element. +

+ +

+ If you've ever taken a look at the HTML DTDs before, you may have + noticed declarations like this: +

+ +
<!ELEMENT LI - O (%flow;)*             -- list item -->
+ +

+ The (%flow;)* indicates the allowed children of the + li tag: li allows any number of flow + elements as its children. (The - O allows the closing tag to be + omitted, though in XML this is not allowed.) In HTML Purifier, + we'd write it like Flow (here's where the content sets + we were discussing earlier come into play). There are three shorthand + content models you can specify: +

+ + + + + + + + + + + + + + + + + + + + + + +
Content model   Description
Empty           No children allowed, like br or hr
Inline          Any number of inline elements and text, like span
Flow            Any number of inline elements, block elements and text, like div
+ +

+ This covers 90% of all the cases out there, but what about elements that + break the mold like ul? This guy requires at least one + child, and the only valid children for it are li. The + content model is: Required: li. There are two parts: the + first, the type, determines which ChildDef will be used to validate + content models. The most common values are: +

+ + + + + + + + + + + + + + + + + + + + + + +
Type       Description
Required   Children must be one or more of the valid elements
Optional   Children can be any number of the valid elements
Custom     Children must follow the DTD-style regex
+ +

+ You can also implement your own ChildDef: this was done + for a few special cases in HTML Purifier such as Chameleon + (for ins and del), StrictBlockquote + and Table. +

+ +

+ The second part specifies either valid elements or a regular expression. + Valid elements are separated with horizontal bars (|), for example + "a | b | c". Use #PCDATA to represent plain text. + Regular expressions follow DTD syntax: +

+ + + +

+ For example, "a, b?, (c | d), e+, f*" means "In this order, + one a element, at most one b element, + one c or d element (but not both), one or more + e elements, and any number of f elements." + Regex veterans should be able to jump right in, and those not so savvy + can always copy-paste W3C's content model definitions into HTML Purifier + and hope for the best. +
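To make the DTD-style syntax concrete, here is a toy translation of such a content model into a PCRE pattern, matched against a list of child element names. It is a sketch under simplifying assumptions (children are plain names, no text nodes) and is not HTML Purifier's Custom ChildDef; `compileContentModel()` and `childrenMatch()` are hypothetical names.

```php
<?php
// Toy translation of a DTD-style content model into a PCRE pattern.
function compileContentModel(string $model): string
{
    $model = preg_replace('/\s+/', '', $model);
    $body = preg_replace_callback(
        '/[#\w:.-]+|,/',
        function (array $m): string {
            if ($m[0] === ',') {
                return ''; // sequencing is implicit in a regular expression
            }
            // Each element name consumes itself plus one trailing comma.
            return '(?:' . preg_quote($m[0], '/') . ',)';
        },
        $model
    );
    // Operators ?, +, *, | and parentheses pass through unchanged.
    return '/^(?:' . $body . ')$/';
}

function childrenMatch(array $childNames, string $model): bool
{
    // Serialize the children as "a,b,c," so every name ends with a comma.
    $list = '';
    foreach ($childNames as $name) {
        $list .= $name . ',';
    }
    return preg_match(compileContentModel($model), $list) === 1;
}

$model = 'a, b?, (c | d), e+, f*';
var_dump(childrenMatch(array('a', 'c', 'e'), $model));      // bool(true)
var_dump(childrenMatch(array('a', 'c', 'd', 'e'), $model)); // bool(false): c and d together
```

Terminating every name with a comma keeps a short name such as a from matching the start of a longer one.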

+ +

+ A word of warning: while the regex format is extremely flexible on + the developer's side, it is + quite unforgiving on the user's side. If the user input does not exactly + match the specification, the entire contents of the element will + be nuked. This is why there are specific content model types like + Optional and Required: while they could be implemented as Custom: + (valid | elements)*, the custom classes contain special recovery + measures that make sure as much of the user's original content gets + through. HTML Purifier's core, as a rule, does not use Custom. +
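The difference between all-or-nothing matching and the recovery behaviour described above can be sketched as follows. Both helpers are hypothetical and operate on plain lists of child element names; HTML Purifier's own ChildDef classes handle more cases than this.

```php
<?php
// Lenient, "Optional"-style check: children that are not allowed are dropped
// one by one and everything else is kept.
function keepAllowedChildren(array $childNames, array $allowedNames): array
{
    $kept = array();
    foreach ($childNames as $name) {
        if (in_array($name, $allowedNames, true)) {
            $kept[] = $name;
        }
    }
    return $kept;
}

// All-or-nothing variant: a single stray child discards the whole list,
// which mirrors the "nuked" outcome described for Custom.
function keepAllOrNothing(array $childNames, array $allowedNames): array
{
    foreach ($childNames as $name) {
        if (!in_array($name, $allowedNames, true)) {
            return array();
        }
    }
    return $childNames;
}

$children = array('li', 'script', 'li');
print_r(keepAllowedChildren($children, array('li'))); // both li children survive
print_r(keepAllOrNothing($children, array('li')));    // nothing survives
```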

+ +

+ One final note: you can also use Content Sets inside your valid elements + lists or regular expressions. In fact, two of the three shorthand content models + mentioned above are just that, abbreviations: +

+ + + + + + + + + + + + + + + + + + +
Content model   Implementation
Inline          Optional: Inline | #PCDATA
Flow            Optional: Flow | #PCDATA
+ +

+ When the definition is compiled, Inline will be replaced with a + horizontal-bar separated list of inline elements. Also, notice that + it does not contain text: you have to specify that yourself. +
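The macro-like expansion can be pictured with a small sketch. The function `expandContentSets()` is hypothetical, and the Inline set used here is shortened to two elements purely for illustration.

```php
<?php
// Replaces each content set name in a content model with a
// horizontal-bar separated list of its member elements (illustration only).
function expandContentSets(string $contentModel, array $contentSets): string
{
    foreach ($contentSets as $setName => $elements) {
        $replacement = implode(' | ', $elements);
        // \b keeps "Inline" from matching inside a longer name.
        $contentModel = preg_replace(
            '/\b' . preg_quote($setName, '/') . '\b/',
            $replacement,
            $contentModel
        );
    }
    return $contentModel;
}

$sets = array('Inline' => array('span', 'b')); // assumed, heavily shortened set
echo expandContentSets('Optional: Inline | #PCDATA', $sets), "\n";
```

Note that #PCDATA is left untouched by the expansion: text has to be listed explicitly in the content model.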

+ +

Common attributes

+ +

+ Congratulations: you have just gotten over the proverbial hump (Allowed + children). Common attributes is much simpler, and boils down to + one question: does your element have the id, style, + class, title and lang attributes? + If so, you'll want to specify the Common attribute collection, + which contains these five attributes that are found on almost every + HTML element in the specification. +

+ +

+ There are a few more collections, but they're really edge cases: +

+ + + + + + + + + + + + + + + + + + +
Collection   Attributes
I18N         lang, possibly xml:lang
Core         style, class, id and title
+ +

+ Common is a combination of the above-mentioned collections. +

+ +

+ Readers familiar with the modularization may have noticed that the Core + attribute collection differs from that specified by the abstract + modules of the XHTML Modularization 1.1. We believe this section + to be in error, as br permits the use of the style + attribute even though it uses the Core collection, and + the DTD and XML Schemas supplied by W3C support our interpretation. +

+ +

Attributes

+ +

+ If you didn't read the earlier section on + adding attributes, read it now. The last parameter is simply + an array of attribute names to attribute implementations, in the exact + same format as addAttribute(). +

+ +

Putting it all together

+ +

+ We're going to implement form. Before we embark, let's + grab a reference implementation from over at the + transitional DTD: +

+ +
<!ELEMENT FORM - - (%flow;)* -(FORM)   -- interactive form -->
+<!ATTLIST FORM
+  %attrs;                              -- %coreattrs, %i18n, %events --
+  action      %URI;          #REQUIRED -- server-side form handler --
+  method      (GET|POST)     GET       -- HTTP method used to submit the form--
+  enctype     %ContentType;  "application/x-www-form-urlencoded"
+  accept      %ContentTypes; #IMPLIED  -- list of MIME types for file upload --
+  name        CDATA          #IMPLIED  -- name of form for scripting --
+  onsubmit    %Script;       #IMPLIED  -- the form was submitted --
+  onreset     %Script;       #IMPLIED  -- the form was reset --
+  target      %FrameTarget;  #IMPLIED  -- render in this frame --
+  accept-charset %Charsets;  #IMPLIED  -- list of supported charsets --
+  >
+ +

+ Juicy! With just this, we can answer four of our five questions: +

+ +
    +
  1. What is the element's name? form
  2. What content set does this element belong to? Block (this needs a little sleuthing; I find the easiest way is to search the DTD for FORM and determine which set it is in.)
  3. What are the allowed children of this element? Zero or more flow elements, but no nested forms
  4. What attributes does the element allow that are general? Common
  5. What attributes does the element allow that are specific to this element? A whole bunch, see ATTLIST; we're going to do the vital ones: action, method and name
+ +

+ Time for some code: +

+ +
$config = HTMLPurifier_Config::createDefault();
+$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
+$config->set('HTML.DefinitionRev', 1);
+$config->set('Cache.DefinitionImpl', null); // remove this later!
+$def = $config->getHTMLDefinition(true);
+$def->addAttribute('a', 'target', new HTMLPurifier_AttrDef_Enum(
+  array('_blank','_self','_parent','_top')
+));
+$form = $def->addElement(
+  'form',   // name
+  'Block',  // content set
+  'Flow', // allowed children
+  'Common', // attribute collection
+  array( // attributes
+    'action*' => 'URI',
+    'method' => 'Enum#get,post',
+    'name' => 'ID'
+  )
+);
+$form->excludes = array('form' => true);
+ +

+ Each of the parameters corresponds to one of the questions we asked. + Notice that we added an asterisk to the end of the action + attribute to indicate that it is required. If someone specifies a + form without that attribute, the tag will be axed. + Also, the extra line at the end is a special extra declaration that + prevents forms from being nested within each other. +
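How the trailing asterisk could be interpreted is sketched below. The two helpers are hypothetical and only model the convention described above (asterisk means required, missing required attribute means the tag is dropped); they are not the library's code.

```php
<?php
// Separates the trailing asterisk from attribute names and records which
// attributes are required (illustration only).
function splitRequiredAttributes(array $attributes): array
{
    $definitions = array();
    $required = array();
    foreach ($attributes as $name => $definition) {
        if (substr($name, -1) === '*') {
            $name = substr($name, 0, -1); // 'action*' becomes 'action'
            $required[] = $name;
        }
        $definitions[$name] = $definition;
    }
    return array('definitions' => $definitions, 'required' => $required);
}

// A tag is kept only when every required attribute is present.
function hasRequiredAttributes(array $tagAttributes, array $required): bool
{
    foreach ($required as $name) {
        if (!array_key_exists($name, $tagAttributes)) {
            return false;
        }
    }
    return true;
}

$spec = splitRequiredAttributes(array('action*' => 'URI', 'name' => 'ID'));
var_dump(hasRequiredAttributes(array('action' => '/submit'), $spec['required'])); // bool(true)
var_dump(hasRequiredAttributes(array('name' => 'login'), $spec['required']));     // bool(false)
```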

+ +

+ And that's all there is to it! Implementing the rest of the form + module is left as an exercise to the user; to see more examples + check the library/HTMLPurifier/HTMLModule/ directory + in your local HTML Purifier installation. +

+ +

And beyond...

+ +

+ Perceptive users may have realized that, to a certain extent, we + have simply re-implemented the facilities of XML Schema or the + Document Type Definition. What you are seeing here, however, is + not just an XML Schema or Document Type Definition: it is a fully + expressive method of specifying the definition of HTML that is + a portable superset of the capabilities of the two above-mentioned schema + languages. What makes HTMLDefinition so powerful is the fact that + if we don't have an implementation for a content model or an attribute + definition, you can supply it yourself by writing a PHP class. +

+ +

+ There are many facets of HTMLDefinition beyond the Advanced API I have + walked you through today. To find out more about these, you can + check out these source files: +

+ + + +

Notes for HTML Purifier 4.2.0 and earlier

+ +

+ Previously, this tutorial gave some incorrect template code for + editing raw definitions, and that template code will now produce the + error Due to a documentation error in previous version of HTML + Purifier... Here is how to mechanically transform old-style + code into new-style code. +

+ +

+ First, identify all code that edits the raw definition object, and + put it together. Ensure none of this code must be run on every + request; if some sub-part needs to always be run, move it outside + this block. Here is an example below, with the raw definition + object code bolded. +

+ +
$config = HTMLPurifier_Config::createDefault();
+$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
+$config->set('HTML.DefinitionRev', 1);
+$def = $config->getHTMLDefinition(true);
+$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');
+$purifier = new HTMLPurifier($config);
+ +

+ Next, replace the raw definition retrieval with a + maybeGetRawHTMLDefinition method call inside an if conditional, and + place the editing code inside that if block. +

+ +
$config = HTMLPurifier_Config::createDefault();
+$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
+$config->set('HTML.DefinitionRev', 1);
+if ($def = $config->maybeGetRawHTMLDefinition()) {
+    $def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');
+}
+$purifier = new HTMLPurifier($config);
+ +

+ And you're done! Alternatively, if you're OK with not ever caching + your code, the following will still work and not emit warnings. +

+ +
$config = HTMLPurifier_Config::createDefault();
+$def = $config->getHTMLDefinition(true);
+$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');
+$purifier = new HTMLPurifier($config);
+ +

+ A slightly less efficient version of this was what was going on with + old versions of HTML Purifier. +

+ +

+ Technical notes: ajh pointed out in a forum topic that + HTML Purifier appeared to be repeatedly writing to the cache even + when a cache entry already existed. Investigation led to the + discovery of the following infelicity: caching of customized + definitions didn't actually work! The problem was that even though + a cache file would be written out at the end of the process, there + was no way for HTML Purifier to say, "Actually, I've already got a + copy of your work, no need to reconfigure your + customizations." This required the API to change: placing + all of the customizations to the raw definition object in a + conditional which could be skipped. +
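The reasoning behind the conditional can be illustrated with a toy model. The class `ToyDefinitionStore` and its methods are hypothetical and do not mirror HTML Purifier's internals; the sketch only shows why returning nothing on a cache hit lets the caller skip its customization code.

```php
<?php
// Toy model of the "maybe get" pattern (illustration only).
class ToyDefinitionStore
{
    private $cache = array();   // cache key => finished definition
    private $pending = array(); // cache key => definition still being customized

    // Returns an editable definition only when nothing is cached for $key.
    public function maybeGetRaw(string $key)
    {
        if (isset($this->cache[$key])) {
            return null; // cached copy exists: the caller skips its customizations
        }
        $this->pending[$key] = new ArrayObject();
        return $this->pending[$key];
    }

    // Finalizes a pending definition (if any) and returns the cached one.
    public function get(string $key)
    {
        if (!isset($this->cache[$key])) {
            $this->cache[$key] = isset($this->pending[$key])
                ? $this->pending[$key]
                : new ArrayObject();
            unset($this->pending[$key]);
        }
        return $this->cache[$key];
    }
}

$store = new ToyDefinitionStore();
$customizationRuns = 0;
for ($request = 0; $request < 3; $request++) {
    if ($def = $store->maybeGetRaw('tutorial,1')) {
        $def['a.target'] = 'Enum#_blank,_self,_top';
        $customizationRuns++;
    }
    $store->get('tutorial,1');
}
echo $customizationRuns, "\n"; // 1: customizations ran on the first request only
```

With an unconditional getter, the customization code would run on all three simulated requests, which is the repeated work the API change was meant to avoid.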

+ + + + -- cgit v1.2.3