
Improve Unicode documentation, fix typos.

git-svn-id: file:///fltk/svn/fltk/branches/branch-1.3@11548 ea41ed52-d2ee-0310-a9c1-e6b18d33e121
Albrecht Schlosser 9 years ago
commit 1bc1f910e0
  1. 43
      documentation/src/unicode.dox

@@ -12,8 +12,9 @@ the current state of Unicode support.
 \section unicode_about About Unicode, ISO 10646 and UTF-8
 The summary of Unicode, ISO 10646 and UTF-8 given below is
-deliberately brief, and provides just enough information for
+deliberately brief and provides just enough information for
 the rest of this chapter.
 For further information, please see:
 - http://www.unicode.org
 - http://www.iso.org
@@ -21,11 +22,12 @@ For further information, please see:
 - http://www.cl.cam.ac.uk/~mgk25/unicode.html
 - http://www.apps.ietf.org/rfc/rfc3629.html
 \par The Unicode Standard
 The Unicode Standard was originally developed by a consortium of mainly
 US computer manufacturers and developers of multi-lingual software.
-It has now become a defacto standard for character encoding,
+It has now become a defacto standard for character encoding
 and is supported by most of the major computing companies in the world.
 Before Unicode, many different systems, on different platforms,
@@ -40,7 +42,8 @@ and typographic publishing systems, such as algorithms for sorting and
 comparing text, composite character and text rendering, right-to-left
 and bi-directional text handling.
-<i>There are currently no plans to add this extra functionality to FLTK.</i>
+\note There are currently no plans to add this extra functionality to FLTK.
 \par ISO 10646
@@ -57,8 +60,8 @@ which contains the characters required for almost all known languages.
 The standard also defines three different implementation levels specifying
 how these characters can be combined.
-<i>There are currently no plans for handling the different implementation
-levels or the combining characters in FLTK.</i>
+\note There are currently no plans for handling the different implementation
+levels or the combining characters in FLTK.
 In UCS, characters have a unique numerical code and an official name,
 and are usually shown using 'U+' and the code in hexadecimal,
@@ -67,15 +70,15 @@ The UCS characters U+0000 to U+007F correspond to US-ASCII,
 and U+0000 to U+00FF correspond to ISO 8859-1 (Latin1).
 ISO 10646 was originally designed to handle a 31-bit character set
-from U+00000000 to U+7FFFFFFF, but the current idea is that 21-bits
+from U+00000000 to U+7FFFFFFF, but the current idea is that 21 bits
 will be sufficient for all future needs, giving characters up to
 U+10FFFF. The complete character set is sub-divided into \e planes.
 <i>Plane 0</i>, also known as the <b>Basic Multilingual Plane</b>
 (BMP), ranges from U+0000 to U+FFFD and consists of the most commonly
 used characters from previous encoding standards. Other planes
 contain characters for specialist applications.
-\todo
-Do we need this info about planes?
+\todo Do we need this info about planes?
 The UCS also defines various methods of encoding characters as
 a sequence of bytes.
@@ -87,7 +90,7 @@ but this is even more wasteful for ASCII or Latin1.
 \par UTF-8
-The Unicode standard defines various UCS Transformation Formats.
+The Unicode standard defines various UCS Transformation Formats (UTF).
 UTF-16 and UTF-32 are based on units of two and four bytes.
 UCS characters requiring more than 16 bits are encoded using
 "surrogate pairs" in UTF-16.
@@ -100,9 +103,11 @@ making the transformation to Unicode quick and easy.
 All UCS characters above U+007F are encoded as a sequence of
 several bytes. The top bits of the first byte are set to show
 the length of the byte sequence, and subseqent bytes are
-always in the range 0x80 to 0x8F. This combination provides
+always in the range 0x80 to 0xBF. This combination provides
 some level of synchronisation and error detection.
+\par
 <table summary="Unicode character byte sequences" align="center">
 <tr>
 <td>Unicode range</td>
@@ -134,6 +139,8 @@ some level of synchronisation and error detection.
 </tr>
 </table>
+\par
 Moving from ASCII encoding to Unicode will allow all new FLTK
 applications to be easily internationalized and used all over
 the world. By choosing UTF-8 encoding, FLTK remains largely
@@ -176,12 +183,12 @@ the following limitations:
 - FLTK will only handle single characters, so composed characters
 consisting of a base character and floating accent characters
-will be treated as multiple characters;
+will be treated as multiple characters.
 - FLTK will only compare or sort strings on a byte by byte basis
-and not on a general Unicode character basis;
+and not on a general Unicode character basis.
-- FLTK will not handle right-to-left or bi-directional text;
+- FLTK will not handle right-to-left or bi-directional text.
 \todo
 Verify 16/24 bit Unicode limit for different character sets?
@@ -189,7 +196,7 @@ the following limitations:
 appears to handle a wider set. What about illegal characters?
 See comments in %fl_utf8fromwc() and %fl_utf8toUtf16().
-\section unicode_illegals Illegal Unicode and UTF-8 sequences
+\section unicode_illegals Illegal Unicode and UTF-8 Sequences
 Three pre-processor variables are defined in the source code [1] that
 determine how %fl_utf8decode() handles illegal UTF-8 sequences:
@@ -240,7 +247,7 @@ of the sequence. Trailing bytes in a UTF-8 sequence will return -1.
 Please see the individual function description for further details
 about error handling and return values.
-\section unicode_fltk_calls FLTK Unicode and UTF-8 functions
+\section unicode_fltk_calls FLTK Unicode and UTF-8 Functions
 This section currently provides a brief overview of the functions.
 For more details, consult the main text for each function via its link.
@@ -348,8 +355,8 @@ or ISO-8859-1 characters below 0xFF are replaced with '?'.
 \par
 Both functions return the number of bytes that would be written, not
 counting the null terminator.
-\p destlen provides a means of limiting the number of bytes written,
-so setting \p destlen to zero is a means of measuring how much storage
+\p dstlen provides a means of limiting the number of bytes written,
+so setting \p dstlen to zero is a means of measuring how much storage
 would be needed before doing the real conversion.
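
The measuring idiom described in the hunk above (call once with dstlen set to zero, then allocate and convert) can be sketched as follows. This is a self-contained illustration, not FLTK code: latin1_to_utf8_sketch() and latin1_to_utf8_alloc_sketch() are hypothetical names, and the converter only mimics the contract stated in the documentation text (return the number of bytes a complete conversion would write, not counting the null terminator, and never store more than dstlen bytes). The real FLTK functions may differ in detail.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a conversion routine with the contract
 * described above; NOT an FLTK function.
 * Converts ISO 8859-1 bytes to UTF-8. Returns the number of bytes a
 * complete conversion needs, excluding the null terminator. Stores at
 * most dstlen bytes (terminator included); dst may be NULL if dstlen is 0. */
static unsigned latin1_to_utf8_sketch(char *dst, unsigned dstlen,
                                      const char *src, unsigned srclen)
{
    unsigned needed = 0;       /* size of the complete result */
    unsigned written = 0;      /* bytes actually stored in dst */
    int full = (dstlen == 0);  /* nonzero once output must stop */
    unsigned i;

    for (i = 0; i < srclen; i++) {
        unsigned char c = (unsigned char)src[i];
        unsigned n = (c < 0x80) ? 1u : 2u;  /* Latin1 needs 1 or 2 bytes */

        /* Only store a character if it fits whole, leaving room for '\0'. */
        if (!full && written + n < dstlen) {
            if (n == 1) {
                dst[written] = (char)c;
            } else {
                dst[written]     = (char)(0xC0 | (c >> 6));
                dst[written + 1] = (char)(0x80 | (c & 0x3F));
            }
            written += n;
        } else {
            full = 1;  /* keep counting, but write nothing further */
        }
        needed += n;
    }
    if (dstlen > 0) dst[written] = '\0';
    return needed;
}

/* Two-pass use: measure with dstlen == 0, allocate, then convert.
 * The caller releases the result with free(). */
static char *latin1_to_utf8_alloc_sketch(const char *src, unsigned srclen)
{
    unsigned needed = latin1_to_utf8_sketch(NULL, 0, src, srclen);
    char *dst = (char *)malloc(needed + 1u);  /* +1 for the terminator */
    if (dst) latin1_to_utf8_sketch(dst, needed + 1u, src, srclen);
    return dst;
}
```

Because the return value ignores truncation, a caller can also detect a too-small buffer by comparing the result against dstlen after a single call.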
@@ -455,7 +462,7 @@ converts the strings to lower case Unicode as part of the comparison.
 \p %flt_utf_strncasecmp() only compares the first \p n characters [bytes?]
-\section unicode_system_calls FLTK Unicode versions of system calls
+\section unicode_system_calls FLTK Unicode Versions of System Calls
 - int fl_access(const char* f, int mode)
 \b OksiD
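
The UTF-8 paragraph corrected in this commit states that the top bits of the first byte give the sequence length and that all following bytes lie in the range 0x80 to 0xBF. A minimal encoder makes that layout concrete. The names utf8_encode_sketch() and utf8_is_continuation() are hypothetical and not part of FLTK; the sketch assumes the 21-bit limit (U+10FFFF) mentioned in the text and rejects UTF-16 surrogate values.

```c
#include <stddef.h>

/* Hypothetical helper, not an FLTK function: encode one code point as
 * UTF-8. Stores 1 to 4 bytes in buf and returns the length, or returns 0
 * for values above U+10FFFF and for UTF-16 surrogates (U+D800..U+DFFF). */
static size_t utf8_encode_sketch(unsigned long ucs, unsigned char buf[4])
{
    if (ucs < 0x80UL) {                 /* 0xxxxxxx: plain US-ASCII */
        buf[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800UL) {                /* 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (ucs >> 6));
        buf[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs < 0x10000UL) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (ucs >= 0xD800UL && ucs <= 0xDFFFUL) return 0;
        buf[0] = (unsigned char)(0xE0 | (ucs >> 12));
        buf[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFFUL) {            /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xF0 | (ucs >> 18));
        buf[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;
}

/* Every byte after the first matches 10xxxxxx, that is 0x80..0xBF. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

For example, U+20AC (the euro sign) encodes as the three bytes E2 82 AC. Each continuation byte carries six payload bits, which is why the upper bound is 0xBF and not 0x8F.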
