diff options
Diffstat (limited to 'vendor/sabre/dav/docs/rfc5051.txt')
-rw-r--r-- | vendor/sabre/dav/docs/rfc5051.txt | 395 |
1 files changed, 395 insertions, 0 deletions
diff --git a/vendor/sabre/dav/docs/rfc5051.txt b/vendor/sabre/dav/docs/rfc5051.txt new file mode 100644 index 000000000..0a4479cad --- /dev/null +++ b/vendor/sabre/dav/docs/rfc5051.txt @@ -0,0 +1,395 @@ + + + + + + +Network Working Group M. Crispin +Request for Comments: 5051 University of Washington +Category: Standards Track October 2007 + + + i;unicode-casemap - Simple Unicode Collation Algorithm + +Status of This Memo + + This document specifies an Internet standards track protocol for the + Internet community, and requests discussion and suggestions for + improvements. Please refer to the current edition of the "Internet + Official Protocol Standards" (STD 1) for the standardization state + and status of this protocol. Distribution of this memo is unlimited. + +Abstract + + This document describes "i;unicode-casemap", a simple case- + insensitive collation for Unicode strings. It provides equality, + substring, and ordering operations. + +1. Introduction + + The "i;ascii-casemap" collation described in [COMPARATOR] is quite + simple to implement and provides case-independent comparisons for the + 26 Latin alphabetics. It is specified as the default and/or baseline + comparator in some application protocols, e.g., [IMAP-SORT]. + + However, the "i;ascii-casemap" collation does not produce + satisfactory results with non-ASCII characters. It is possible, with + a modest extension, to provide a more sophisticated collation with + greater multilingual applicability than "i;ascii-casemap". This + extension provides case-independent comparisons for a much greater + number of characters. It also collates characters with diacriticals + with the non-diacritical character forms. + + This collation, "i;unicode-casemap", is intended to be an alternative + to, and preferred over, "i;ascii-casemap". It does not replace the + "i;basic" collation described in [BASIC]. + +2. Unicode Casemap Collation Description + + The "i;unicode-casemap" collation is a simple collation which is + case-insensitive in its treatment of characters. It provides + equality, substring, and ordering operations. The validity test + operation returns "valid" for any input. + + + + + +Crispin Standards Track [Page 1] + +RFC 5051 i;unicode-casemap October 2007 + + + This collation allows strings in arbitrary (and mixed) character + sets, as long as the character set for each string is identified and + it is possible to convert the string to Unicode. Strings which have + an unidentified character set and/or cannot be converted to Unicode + are not rejected, but are treated as binary. + + Each input string is prepared by converting it to a "titlecased + canonicalized UTF-8" string according to the following steps, using + UnicodeData.txt ([UNICODE-DATA]): + + (1) A Unicode codepoint is obtained from the input string. + + (a) If the input string is in a known charset that can be + converted to Unicode, a sequence in the string's charset + is read and checked for validity according to the rules of + that charset. If the sequence is valid, it is converted + to a Unicode codepoint. Note that for input strings in + UTF-8, the UTF-8 sequence must be valid according to the + rules of [UTF-8]; e.g., overlong UTF-8 sequences are + invalid. + + (b) If the input string is in an unknown charset, or an + invalid sequence occurs in step (1)(a), conversion ceases. + No further preparation is performed, and any partial + preparation results are discarded. The original string is + used unchanged with the i;octet comparator. + + (2) The following steps, using UnicodeData.txt ([UNICODE-DATA]), + are performed on the resulting codepoint from step (1)(a). + + (a) If the codepoint has a titlecase property in + UnicodeData.txt (this is normally the same as the + uppercase property), the codepoint is converted to the + codepoints in the titlecase property. + + (b) If the resulting codepoint from (2)(a) has a decomposition + property of any type in UnicodeData.txt, the codepoint is + converted to the codepoints in the decomposition property. + This step is recursively applied to each of the resulting + codepoints until no more decomposition is possible + (effectively Normalization Form KD). + + Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON) + has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D + WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a + decomposition property of U+0044 (LATIN CAPITAL LETTER D) + U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a + decomposition property of U+007A (LATIN SMALL LETTER Z) U+030c + + + +Crispin Standards Track [Page 2] + +RFC 5051 i;unicode-casemap October 2007 + + + (COMBINING CARON). Neither U+0044, U+007A, nor U+030C have + any decomposition properties. Therefore, U+01C4 is converted + to U+0044 U+007A U+030C by this step. + + (3) The resulting codepoint(s) from step (2) is/are appended, in + UTF-8 format, to the "titlecased canonicalized UTF-8" string. + + (4) Repeat from step (1) until there is no more data in the input + string. + + Following the above preparation process on each string, the equality, + ordering, and substring operations are as for i;octet. + + It is permitted to use an alternative implementation of the above + preparation process if it produces the same results. For example, it + may be more convenient for an implementation to convert all input + strings to a sequence of UTF-16 or UTF-32 values prior to performing + any of the step (2) actions. Similarly, if all input strings are (or + are convertible to) Unicode, it may be possible to use UTF-32 as an + alternative to UTF-8 in step (3). + + Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3), + because UTF-16 surrogates will cause i;octet to collate codepoints + U+E0000 through U+FFFF after non-BMP codepoints. + + This collation is not locale sensitive. Consequently, care should be + taken when using OS-supplied functions to implement this collation. + Functions such as strcasecmp and toupper are sometimes locale + sensitive and may inconsistently casemap letters. + + The i;unicode-casemap collation is well suited to use with many + Internet protocols and computer languages. Use with natural language + is often inappropriate; even though the collation apparently supports + languages such as Swahili and English, in real-world use it tends to + mis-sort a number of types of string: + + o people and place names containing scripts that are not collated + according to "alphabetical order". + o words with characters that have diacriticals. However, + i;unicode-casemap generally does a better job than i;ascii-casemap + for most (but not all) languages. For example, German umlaut + letters will sort correctly, but some Scandinavian letters will + not. + o names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike + in English), + o strings containing other non-letter symbols; e.g., euro and pound + sterling symbols, quotation marks other than '"', dashes/hyphens, + etc. + + + +Crispin Standards Track [Page 3] + +RFC 5051 i;unicode-casemap October 2007 + + +3. Unicode Casemap Collation Registration + + <?xml version='1.0'?> + <!DOCTYPE collation SYSTEM 'collationreg.dtd'> + <collation rfc="5051" scope="global" intendedUse="common"> + <identifier>i;unicode-casemap</identifier> + <title>Unicode Casemap</title> + <operations>equality order substring</operations> + <specification>RFC 5051</specification> + <owner>IETF</owner> + <submitter>mrc@cac.washington.edu</submitter> + </collation> + +4. Security Considerations + + The security considerations for [UTF-8], [STRINGPREP], and [UNICODE- + SECURITY] apply and are normative to this specification. + + The results from this comparator will vary depending upon the + implementation for several reasons. Implementations MUST consider + whether these possibilities are a problem for their use case: + + 1) New characters added in Unicode may have decomposition or + titlecase properties that will not be known to an implementation + based upon an older revision of Unicode. This impacts step (2). + + 2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that + does not require normalization of out-of-order diacriticals. + However, an implementation MAY use an NFKD library routine that + does such normalization. This impacts step (2)(b) and possibly + also step (1)(a), and is an issue only with ill-formed UTF-8 + input. + + 3) The set of charsets handled in step (1)(a) is open-ended. UTF-8 + (and, by extension, US-ASCII) are the only mandatory-to-implement + charsets. This impacts step (1)(a). + + Implementations SHOULD, as far as feasible, support all the + charsets they are likely to encounter in the input data, in order + to avoid poor collation caused by the fall through to the (1)(b) + rule. + + 4) Other charsets may have revisions which add new characters that + are not known to an implementation based upon an older revision. + This impacts step (1)(a) and possibly also step (1)(b). + + + + + + +Crispin Standards Track [Page 4] + +RFC 5051 i;unicode-casemap October 2007 + + + An attacker may create input that is ill-formed or in an unknown + charset, with the intention of impacting the results of this + comparator or exploiting other parts of the system which process this + input in different ways. Note, however, that even well-formed data + in a known charset can impact the result of this comparator in + unexpected ways. For example, an attacker can substitute U+0041 + (LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or + U+0410 (CYRILLIC CAPITAL LETTER A) in the intention of causing a + non-match of strings which visually appear the same and/or causing + the string to appear elsewhere in a sort. + +5. IANA Considerations + + The i;unicode-casemap collation defined in section 2 has been added + to the registry of collations defined in [COMPARATOR]. + +6. Normative References + + [COMPARATOR] Newman, C., Duerst, M., and A. Gulbrandsen, + "Internet Application Protocol Collation + Registry", RFC 4790, February 2007. + + [STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of + Internationalized Strings ("stringprep")", RFC + 3454, December 2002. + + [UTF-8] Yergeau, F., "UTF-8, a transformation format of + ISO 10646", STD 63, RFC 3629, November 2003. + + [UNICODE-DATA] <http://www.unicode.org/Public/UNIDATA/ + UnicodeData.txt> + + Although the UnicodeData.txt file referenced + here is part of the Unicode standard, it is + subject to change as new characters are added + to Unicode and errors are corrected in Unicode + revisions. As a result, it may be less stable + than might otherwise be implied by the + standards status of this specification. + + [UNICODE-SECURITY] Davis, M. and M. Suignard, "Unicode Security + Considerations", February 2006, + <http://www.unicode.org/reports/tr36/>. + + + + + + + + +Crispin Standards Track [Page 5] + +RFC 5051 i;unicode-casemap October 2007 + + +7. Informative References + + [BASIC] Newman, C., Duerst, M., and A. Gulbrandsen, + "i;basic - the Unicode Collation Algorithm", + Work in Progress, March 2007. + + [IMAP-SORT] Crispin, M. and K. Murchison, "Internet Message + Access Protocol - SORT and THREAD Extensions", + Work in Progress, September 2007. + +Author's Address + + Mark R. Crispin + Networks and Distributed Computing + University of Washington + 4545 15th Avenue NE + Seattle, WA 98105-4527 + + Phone: +1 (206) 543-5762 + EMail: MRC@CAC.Washington.EDU + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Crispin Standards Track [Page 6] + +RFC 5051 i;unicode-casemap October 2007 + + +Full Copyright Statement + + Copyright (C) The IETF Trust (2007). + + This document is subject to the rights, licenses and restrictions + contained in BCP 78, and except as set forth therein, the authors + retain all their rights. + + This document and the information contained herein are provided on an + "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS + OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND + THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS + OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF + THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED + WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. + +Intellectual Property + + The IETF takes no position regarding the validity or scope of any + Intellectual Property Rights or other rights that might be claimed to + pertain to the implementation or use of the technology described in + this document or the extent to which any license under such rights + might or might not be available; nor does it represent that it has + made any independent effort to identify any such rights. Information + on the procedures with respect to rights in RFC documents can be + found in BCP 78 and BCP 79. + + Copies of IPR disclosures made to the IETF Secretariat and any + assurances of licenses to be made available, or the result of an + attempt made to obtain a general license or permission for the use of + such proprietary rights by implementers or users of this + specification can be obtained from the IETF on-line IPR repository at + http://www.ietf.org/ipr. + + The IETF invites any interested party to bring to its attention any + copyrights, patents or patent applications, or other proprietary + rights that may cover technology that may be required to implement + this standard. Please address the information to the IETF at + ietf-ipr@ietf.org. + + + + + + + + + + + + +Crispin Standards Track [Page 7] + |