INCLUDES, AUTOLOAD, BYTECODE CACHES and OPTIMIZATION

The Problem
-----------

HTML Purifier contains a number of extra components that are not used all
of the time, only if the user explicitly specifies that we should use
them.

Some of these optional components are optionally included (Filter,
Language, Lexer, Printer), while others are included all the time
(Injector, URIFilter, HTMLModule, URIScheme). We will stipulate that these
are all developer specified: it is conceivable that certain Tokens are not
used, but this is user-dependent and should not be trusted.

We should come up with a consistent way to handle these things and ensure
that we get the maximum performance both when there are bytecode caches
and when there are not. Unfortunately, these two goals seem contrary to
each other.

A peripheral issue is the performance of ConfigSchema, which has been
shown to take a large, constant amount of initialization time, and is
intricately linked to the issue of includes due to its pervasive use
in our plugin architecture.

Pros and Cons
-------------

We will assume that user-based extensions will be included by the users
themselves.

Conditional includes:
  Pros:
    - User management is simplified; only a single directive needs to be set
    - Only necessary code is included
  Cons:
    - Doesn't play nicely with opcode caches
    - Adds complexity to standalone version
    - Optional configuration directives are not exposed without a little
      extra coaxing (not implemented yet)

Include it all:
  Pros:
    - User management is still simple
    - Plays nicely with opcode caches and standalone version
    - All configuration directives are present
  Cons:
    - Lots of (how much?) extra code is included
    - Classes that inherit from external libraries will cause compile
      errors

Build an include stub (Let's do this!):
  Pros:
    - Only necessary code is included
    - Plays nicely with opcode caches and standalone version
    - require (without once) can be used, see above
    - Could further extend as a compilation to one file
  Cons:
    - Not implemented yet
    - Requires user intervention and use of a command line script
    - Standalone script must be chained to this
    - More complex and compiled-language-like
    - Requires a whole new class of system-wide configuration directives,
      as configuration objects can be reused
    - Determining what needs to be included can be complex (see above)
    - No way of autodetecting dynamically instantiated classes
    - Might be slow

Include stubs
-------------

This solution may be "just right" for users who are heavily oriented
towards performance. However, there are a number of picky implementation
details to work out beforehand.

The number one concern is how to make the HTML Purifier files "work
out of the box", while still being able to easily get them into a form
that works with this setup. As the codebase stands right now, it would
be necessary to strip out all of the require_once calls. The only way
we could get rid of the require_once calls is to use __autoload or
use the stub for all cases (which might not be a bad idea).
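
As a rough illustration of what generating such a stub could involve, here
is a minimal sketch. It assumes the PEAR-style naming convention, where
underscores in a class name map to directory separators; the function name
build_include_stub() and the example class list are hypothetical, and the
emitted paths are relative to wherever the library sits on the include path.

```php
<?php
// Hypothetical sketch: build the text of an include stub from a class list.
// Assumes PEAR-style naming: HTMLPurifier_Foo_Bar lives in HTMLPurifier/Foo/Bar.php.

function build_include_stub(array $classes): string
{
    $lines = array('<?php', '// Generated stub: regenerate rather than edit.');
    foreach (array_unique($classes) as $class) {
        $path = str_replace('_', '/', $class) . '.php';
        // Plain require (no _once): the stub is the only place files are
        // loaded, and the class list was deduplicated above.
        $lines[] = 'require ' . var_export($path, true) . ';';
    }
    return implode("\n", $lines) . "\n";
}

$stub = build_include_stub(array(
    'HTMLPurifier',
    'HTMLPurifier_Config',
    'HTMLPurifier_Filter_ExtractStyleBlocks', // an optional class chosen by the user
));
echo $stub;
```

A command line script could write this string to a file once, so that
nothing has to be computed per request.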

    Aside
    -----
    An important thing to remember, however, is that these require_once's
    are valuable data about what classes a file needs. Unfortunately, there's
    no distinction between whether or not the file is needed all the time,
    or whether or not it is one of our "optional" files. Thus, it is
    effectively useless.

    Deprecated
    ----------
    One of the things I'd like to do is have the code search for any classes
    that are explicitly mentioned in the code. If a class isn't mentioned, I
    get to assume that it is "optional," i.e. included via introspection.
    The choice is either to use PHP's tokenizer or use regexps; regexps would
    be faster but a tokenizer would be more correct. If this ends up being
    unfeasible, adding dependency comments isn't a bad idea. (This could
    even be done automatically by search/replacing require_once, although
    we'd have to manually inspect the results for the optional requires.)

    NOTE: This ends up not being necessary, as we're going to make the user
    figure out all the extra classes they need, and only include the core
    which is predetermined.

Using the autoload framework with include stubs works nicely with
introspective classes: instead of having to have require_once inside
the function, we can let autoload do the work; we simply need to
new $class or accept the object straight from the caller. Handling filters
becomes a simple matter of ticking off configuration directives, and
if ConfigSchema spits out errors, adding the necessary includes. We could
also use the autoload framework as a fallback, in case the user forgets
to make the include, but doesn't really care about performance.
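
A fallback of this kind might look like the sketch below. It swaps in
spl_autoload_register() for the legacy __autoload() hook (which was removed
in PHP 8), and it assumes PEAR-style class-to-path mapping; the function
names and the $baseDir parameter are hypothetical.

```php
<?php
// Hypothetical fallback autoloader; $baseDir is an assumed library root.
// Uses spl_autoload_register() instead of the legacy __autoload() hook.

function register_fallback_autoload(string $baseDir): void
{
    spl_autoload_register(function ($class) use ($baseDir) {
        // Only handle our own classes; leave everything else to other loaders.
        if (strncmp($class, 'HTMLPurifier', 12) !== 0) {
            return;
        }
        $file = $baseDir . '/' . str_replace('_', '/', $class) . '.php';
        if (is_file($file)) {
            require $file;
        }
    });
}

// Introspective instantiation: no require_once is needed at the call site.
function make_filter(string $name)
{
    $class = 'HTMLPurifier_Filter_' . $name;
    return new $class();
}
```

Users who load everything through a stub never trigger the handler, so it
costs them nothing; users who forget an include pay the autoload price only
for the classes they missed.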

    Insight
    -------
    All of this talk is merely a natural extension of what our current
    standalone functionality does. However, instead of having our code
    perform the includes, or attempting to inline everything that possibly
    could be used, we boot the issue to the user, making them include
    everything or set up the fallback autoload handler.


Configuration Schema
--------------------

A common deficiency for all of the conditional include setups (including
the dynamically built include PHP stub) is that if one of these
conditionally included files includes a configuration directive, it
is not accessible to configdoc. A stopgap solution for this problem is
to have it piggy-back off of the data in the merge-library.php script
to figure out what extra files it needs to include, but if the file also
inherits classes that don't exist, we're in big trouble.

I think it's high time we centralized the configuration documentation.
However, the type checking has been a great boon for the library, and
I'd like to keep that. The compromise is to use some other source, and
then parse it into the ConfigSchema internal format (sans all of those
nasty documentation strings which we really don't need at runtime) and
serialize that for future use.

The next question is that of format. XML is very verbose, and the prospect
of setting defaults in it gives me the willies. However, this may be necessary.
Splitting up the file into manageable chunks may alleviate this trouble,
and we may even want to create our own format optimized for specifying
configuration. It might look like (based off the PHPT format, which is
nicely compact yet unambiguous and human-readable):

Core.HiddenElements
TYPE:    lookup
DEFAULT: array('script', 'style') // auto-converted during processing
--ALIASES--
Core.InvisibleElements, Core.StupidElements
--DESCRIPTION--
<p>
  Blah blah
</p>

The first line is the directive name, the lines after that prior to the
first --HEADER-- block are single-line values, and then after that
the multiline values are there. No value is restricted to a particular
format: DEFAULT could very well be multiline if that would be easier.
This would make it insanely easy, also, to add arbitrary extra parameters,
like:

VERSION:  3.0.0
ALLOWED:  'none', 'light', 'medium', 'heavy' // this is wrapped in array()
EXTERNAL: CSSTidy // this would be documented somewhere else with a URL
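
To show that the layout described above (a name line, single-line
KEY: value pairs, then --HEADER-- blocks) is cheap to read, here is a
minimal parser sketch. The function name parse_directive() and the 'ID' key
used for the directive name are hypothetical, and values are returned as raw
strings: type conversion and comment stripping are assumed to happen in a
later step.

```php
<?php
// Hypothetical parser for the directive format sketched above.
// Values stay raw strings; conversion is left to later processing.

function parse_directive(string $text): array
{
    $lines  = preg_split('/\r\n|\r|\n/', trim($text));
    $result = array('ID' => trim(array_shift($lines)));
    $block  = null; // name of the multiline block being read, if any

    foreach ($lines as $line) {
        if (preg_match('/^--([A-Z][A-Z-]*)--$/', trim($line), $m)) {
            // A --HEADER-- line starts a multiline value.
            $block = $m[1];
            $result[$block] = '';
        } elseif ($block !== null) {
            $result[$block] .= $line . "\n";
        } elseif (preg_match('/^([A-Z][A-Z-]*):\s*(.*)$/', $line, $m)) {
            // KEY: value pairs before the first header are single-line.
            $result[$m[1]] = $m[2];
        }
    }
    // Drop trailing newlines left over from multiline values.
    foreach ($result as $key => $value) {
        $result[$key] = rtrim($value, "\n");
    }
    return $result;
}

$example = <<<'EOT'
Core.HiddenElements
TYPE:    lookup
DEFAULT: array('script', 'style')
--ALIASES--
Core.InvisibleElements, Core.StupidElements
--DESCRIPTION--
Blah blah
  more blah
EOT;

$directive = parse_directive($example);
```

Because unknown keys are simply stored, extra parameters such as VERSION or
EXTERNAL need no parser changes.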

The final loss would be that you wouldn't know what file the directive
was used in; with some clever regexps it should be possible to
figure out where $config->get($ns, $d); occurs. Reflective calls to
the configuration object are mitigated by the fact that getBatch is
used, so we can simply talk about that in the namespace definition page.
This might be slow, but it would only happen when we are creating
the documentation for consumption, and is sugar.
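
The regexp approach could be sketched roughly as follows. The function name
find_directive_uses() is hypothetical, and the pattern only catches calls
whose two arguments are string literals, so calls that pass variables are
missed by design.

```php
<?php
// Hypothetical scanner: finds literal $config->get('Namespace', 'Directive')
// calls in a PHP source string. Calls with non-literal arguments are skipped.

function find_directive_uses(string $source): array
{
    $pattern = '/\$config->get\(\s*([\'"])(\w+)\1\s*,\s*([\'"])(\w+)\3\s*\)/';
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);
    $found = array();
    foreach ($matches as $m) {
        // Groups 2 and 4 hold the namespace and the directive name.
        $found[] = $m[2] . '.' . $m[4];
    }
    return array_values(array_unique($found));
}
```

Running this over each library file with file_get_contents() would give a
per-file list of directives for the documentation build.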

We can put this in a schema/ directory, outside of HTML Purifier. The serialized
data gets treated like entities.ser.

The final thing that needs to be handled is user defined configurations.
They can be added at runtime using ConfigSchema::registerDirectory()
which globs the directory and grabs all of the directives to be incorporated
in. Then, the result is saved. We may want to take advantage of the
DefinitionCache framework, although it is not altogether certain what
configuration directives would be used to generate our key (meta-directives!).

    Further thoughts
    ----------------
    Our master configuration schema will only need to be updated once
    every new version, so it's easily versionable. User specified
    schema files are far more volatile, but it's far too expensive
    to check the filemtimes of all the files, so a DefinitionRev style
    mechanism works better. However, we can uniquely identify the
    schema based on the directories they loaded, so there's no need
    for a DefinitionId until we give them full programmatic control.

    These variables should be directly incorporated into ConfigSchema,
    and ConfigSchema should handle serialization. Some refactoring will be
    necessary for the DefinitionCache classes, as they are built with
    Config in mind. If the user changes something, the cache file gets
    rebuilt. If the version changes, the cache file gets rebuilt. Since
    our unit tests flush the caches before we start, and the operation is
    pretty fast, this will not negatively impact unit testing.
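
The rebuild rules above could be sketched like this. All names here are
hypothetical, and two details are assumptions rather than decisions made in
this document: the directory list is sorted so that registration order does
not change the key, and the cache is a plain serialized file.

```php
<?php
// Hypothetical sketch of schema caching keyed on library version, a
// user-bumped revision number, and the set of loaded directories.

function schema_cache_key(string $version, int $revision, array $dirs): string
{
    sort($dirs); // assumption: registration order should not change the key
    return $version . ',' . $revision . ',' . md5(implode("\n", $dirs));
}

function load_schema(string $cacheDir, string $key, callable $build): array
{
    $file = $cacheDir . '/schema-' . $key . '.ser';
    if (is_file($file)) {
        $schema = unserialize(file_get_contents($file));
        if (is_array($schema)) {
            return $schema;
        }
    }
    // Cache miss (new version, bumped revision, or different directories):
    // rebuild the schema and store it for next time.
    $schema = $build();
    file_put_contents($file, serialize($schema));
    return $schema;
}
```

No filemtime checks are involved: a user who edits a schema file is expected
to bump the revision number, which changes the key and forces a rebuild.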

One last thing: certain configuration directives require that files
get added. They may even be specified dynamically. It is not a good idea
for the HTMLPurifier_Config object to be used directly for such matters.
Instead, the userland code should explicitly perform the includes. We may
put in something like:

REQUIRES: HTMLPurifier_Filter_ExtractStyleBlocks

To indicate that if that class doesn't exist, and the user is attempting
to use the directive, we should fatally error out. The stub includes the core files,
and the user includes everything else. Any reflective things like new
$class would be required to tie in with the configuration.
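
A check for REQUIRES might be sketched as below. The function name
check_requires() and the array shape of $schema are hypothetical. The text
above calls for a fatal error; the sketch throws an exception instead so
that it is easy to exercise, and trigger_error() with E_USER_ERROR could be
substituted.

```php
<?php
// Hypothetical check for REQUIRES entries. An exception stands in for the
// fatal error described in the text.

function check_requires(array $schema, array $directivesInUse): void
{
    foreach ($directivesInUse as $directive) {
        $required = isset($schema[$directive]['REQUIRES'])
            ? (array) $schema[$directive]['REQUIRES']
            : array();
        foreach ($required as $class) {
            // class_exists() also gives any registered autoloader a chance.
            if (!class_exists($class)) {
                throw new RuntimeException(
                    "Directive $directive requires class $class; include it first"
                );
            }
        }
    }
}
```

Only directives the user actually sets are checked, so unused optional
classes never need to be present.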

It would work very well with rarely used configuration options, but it
wouldn't be so good for "core" parts that can be disabled. In such cases
the core include file would need to be modified, and the only way
to properly do this is to use the configuration object. Once again, our
ability to create cache keys saves the day: we can create arbitrary
stub files for arbitrary configurations and include those. They could
even be the single file affairs. The only thing we'd need to include,
then, would be HTMLPurifier_Config! Then, the configuration object would
load the library.

    An aside...
    -----------
    One questions, however, the wisdom of letting PHP files write other PHP
    files. It seems like a recipe for disaster, or at least lots of headaches
    in highly secured setups, where PHP does not have the ability to write
    to its root. In such cases, we could use sticky bits or tell the user
    to manually generate the file.

    The other troublesome bit is actually doing the calculations necessary.
    For certain cases, it's simple (such as URIScheme), but for AttrDef
    and HTMLModule the dependency trees are very complex in relation to
    %HTML.Allowed and friends. I think that this idea should be shelved
    and looked at again at a later, less insane date.

An interesting dilemma presents itself when a configuration form is offered
to the user. Normally, the configuration object is not accessible without
editing PHP code; this facility changes things. The sensible thing to do
is stipulate that all classes required by the directives you allow must
be included.

Unit testing
------------

Setting up the parsing and translation into our existing format would not
be difficult to do. It might represent a good time for us to rethink our
tests for these facilities; as creative as they are, they are often hacky
and require public visibility for things that ought to be protected.
This is especially applicable for our DefinitionCache tests.

Migration
---------

Because we are not *adding* anything essentially new, it should be trivial
to write a script to take our existing data and dump it into the new format.
Well, not trivial, but fairly easy to accomplish. Primary implementation
difficulties would probably involve formatting the file nicely.

Backwards-compatibility
-----------------------

I expect that the ConfigSchema methods should stick around for a little bit,
but display E_USER_NOTICE warnings that they are deprecated. This will
require documentation!
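
A deprecated entry point of this kind could be sketched as follows. The
class LegacySchemaShim and its define() method are purely illustrative and
are not the actual ConfigSchema API; the point is only that the old call
keeps working while raising E_USER_NOTICE.

```php
<?php
// Hypothetical shim: an old-style static definition method that still
// works but raises E_USER_NOTICE. Class and method names are illustrative.

class LegacySchemaShim
{
    public static $directives = array();

    public static function define($namespace, $name, $default)
    {
        trigger_error(
            __METHOD__ . ' is deprecated; use a schema file instead',
            E_USER_NOTICE
        );
        // Keep the old behavior so existing callers do not break.
        self::$directives["$namespace.$name"] = $default;
    }
}
```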

New stuff
---------

VERSION: Version in which the directive was introduced
DEPRECATED-VERSION: If the directive was deprecated, when was it deprecated?
DEPRECATED-USE: If the directive was deprecated, what should the user use now?
REQUIRES: What classes does this configuration directive require, but are
    not part of the HTML Purifier core?

    vim: et sw=4 sts=4