diff options
Diffstat (limited to 'lib/htmlpurifier/docs/proposal-errors.txt')
-rw-r--r-- | lib/htmlpurifier/docs/proposal-errors.txt | 211 |
1 files changed, 0 insertions, 211 deletions
diff --git a/lib/htmlpurifier/docs/proposal-errors.txt b/lib/htmlpurifier/docs/proposal-errors.txt deleted file mode 100644 index 87cb2ac19..000000000 --- a/lib/htmlpurifier/docs/proposal-errors.txt +++ /dev/null @@ -1,211 +0,0 @@ -Considerations for ErrorCollection - -Presently, HTML Purifier takes a code-execution centric approach to handling -errors. Errors are organized and grouped according to which segment of the -code triggers them, not necessarily the portion of the input document that -triggered the error. This means that errors are pseudo-sorted by category, -rather than location in the document. - -One easy way to "fix" this problem would be to re-sort according to line number. -However, the "category" style information we derive from naively following -program execution is still useful. After all, each of the strategies which -can report errors still process the document mostly linearly. Furthermore, -not only do they process linearly, but the way they pass off operations to -sub-systems mirrors that of the document. For example, AttrValidator will -linearly proceed through elements, and on each element will use AttrDef to -validate those contents. From there, the attribute might have more -sub-components, which have execution passed off accordingly. - -In fact, each strategy handles a very specific class of "error." - -RemoveForeignElements - element tokens -MakeWellFormed - element token ordering -FixNesting - element token ordering -ValidateAttributes - attributes of elements - -The crucial point is that while we care about the hierarchy governing these -different errors, we *don't* care about any other information about what actually -happens to the elements. This brings up another point: if HTML Purifier fixes -something, this is not really a notice/warning/error; it's really a suggestion -of a way to fix the aforementioned defects. - -In short, the refactoring to take this into account kinda sucks. - -Errors should not be recorded in order that they are reported. Instead, they -should be bound to the line (and preferably element) in which they were found. -This means we need some way to uniquely identify every element in the document, -which doesn't presently exist. An easy way of adding this would be to track -line columns. An important ramification of this is that we *must* use the -DirectLex implementation. - - 1. Implement column numbers for DirectLex [DONE!] - 2. Disable error collection when not using DirectLex [DONE!] - -Next, we need to re-orient all of the error declarations to place CurrentToken -at utmost important. Since this is passed via Context, it's not always clear -if that's available. ErrorCollector should complain HARD if it isn't available. -There are some locations when we don't have a token available. These include: - - * Lexing - this can actually have a row and column, but NOT correspond to - a token - * End of document errors - bump this to the end - -Actually, we *don't* have to complain if CurrentToken isn't available; we just -set it as a document-wide error. And actually, nothing needs to be done here. - -Something interesting to consider is whether or not we care about the locations -of attributes and CSS properties, i.e. the sub-objects that compose these things. -In terms of consistency, at the very least attributes should have column/line -numbers attached to them. However, this may be overkill, as attributes are -uniquely identifiable. You could go even further, with CSS, but they are also -uniquely identifiable. - -Bottom-line is, however, this information must be available, in form of the -CurrentAttribute and CurrentCssProperty (theoretical) context variables, and -it must be used to organize the errors that the sub-processes may throw. -There is also a hierarchy of sorts that may make merging this into one context -variable more sense, if it hadn't been for HTML's reasonably rigid structure. -A CSS property will never contain an HTML attribute. So we won't ever get -recursive relations, and having multiple depths won't ever make sense. Leave -this be. - -We already have this information, and consequently, using start and end is -*unnecessary*, so long as the context variables are set appropriately. We don't -care if an error was thrown by an attribute transform or an attribute definition; -to the end user these are the same (for a developer, they are different, but -they're better off with a stack trace (which we should add support for) in such -cases). - - 3. Remove start()/end() code. Don't get rid of recursion, though [DONE] - 4. Setup ErrorCollector to use context information to setup hierarchies. - This may require a different internal format. Use objects if it gets - complex. [DONE] - - ASIDE - More on this topic: since we are now binding errors to lines - and columns, a particular error can have three relationships to that - specific location: - - 1. The token at that location directly - RemoveForeignElements - AttrValidator (transforms) - MakeWellFormed - 2. A "component" of that token (i.e. attribute) - AttrValidator (removals) - 3. A modification to that node (i.e. contents from start to end - token) as a whole - FixNesting - - This needs to be marked accordingly. In the presentation, it might - make sense keep (3) separate, have (2) a sublist of (1). (1) can - be a closing tag, in which case (3) makes no sense at all, OR it - should be related with its opening tag (this may not necessarily - be possible before MakeWellFormed is run). - - So, the line and column counts as our identifier, so: - - $errors[$line][$col] = ... - - Then, we need to identify case 1, 2 or 3. They are identified as - such: - - 1. Need some sort of semaphore in RemoveForeignElements, etc. - 2. If CurrentAttr/CurrentCssProperty is non-null - 3. Default (FixNesting, MakeWellFormed) - - One consideration about (1) is that it usually is actually a - (3) modification, but we have no way of knowing about that because - of various optimizations. However, they can probably be treated - the same. The other difficulty is that (3) is never a line and - column; rather, it is a range (i.e. a duple) and telling the user - the very start of the range may confuse them. For example, - - <b>Foo<div>bar</div></b> - ^ ^ - - The node being operated on is <b>, so the error would be assigned - to the first caret, with a "node reorganized" error. Then, the - ChildDef would have submitted its own suggestions and errors with - regard to what's going in the internals. So I suppose this is - ok. :-) - - Now, the structure of the earlier mentioned ... would be something - like this: - - object { - type = (token|attr|property), - value, // appropriate for type - errors => array(), - sub-errors = [recursive], - } - - This helps us keep things agnostic. It is also sufficiently complex - enough to warrant an object. - -So, more wanking about the object format is in order. The way HTML Purifier is -currently setup, the only possible hierarchy is: - - token -> attr -> css property - -These relations do not exist all of the time; a comment or end token would not -ever have any attributes, and non-style attributes would never have CSS properties -associated with them. - -I believe that it is worth supporting multiple paths. At some point, we might -have a hierarchy like: - - * -> syntax - -> token -> attr -> css property - -> url - -> css stylesheet <style> - -et cetera. Now, one of the practical implications of this is that every "node" -on our tree is well-defined, so in theory it should be possible to either 1. -create a separate class for each error struct, or 2. embed this information -directly into HTML Purifier's token stream. Embedding the information in the -token stream is not a terribly good idea, since tokens can be removed, etc. -So that leaves us with 1... and if we use a generic interface we can cut down -on a lot of code we might need. So let's leave it like this. - -~~~~ - -Then we setup suggestions. - - 5. Setup a separate error class which tells the user any modifications - HTML Purifier made. - -Some information about this: - -Our current paradigm is to tell the user what HTML Purifier did to the HTML. -This is the most natural mode of operation, since that's what HTML Purifier -is all about; it was not meant to be a validator. - -However, most other people have experience dealing with a validator. In cases -where HTML Purifier unambiguously does the right thing, simply giving the user -the correct version isn't a bad idea, but problems arise when: - -- The user has such bad HTML we do something odd, when we should have just - flagged the HTML as an error. Such examples are when we do things like - remove text from directly inside a <table> tag. It was probably meant to - be in a <td> tag or be outside the table, but we're not smart enough to - realize this so we just remove it. In such a case, we should tell the user - that there was foreign data in the table, but then we shouldn't "demand" - the user remove the data; it's more of a "here's a possible way of - rectifying the problem" - -- Giving line context for input is hard enough, but feasible; giving output - line context will be extremely difficult due to shifting lines; we'd probably - have to track what the tokens are and then find the appropriate out context - and it's not guaranteed to work etc etc etc. - -```````````` - -Don't forget to spruce up output. - - 6. Output needs to automatically give line and column numbers, basically - "at line" on steroids. Look at W3C's output; it's ok. [PARTIALLY DONE] - - - We need a standard CSS to apply (check demo.css for some starting - styling; some buttons would also be hip) - - vim: et sw=4 sts=4 |