Text Conversion Functions
This text describes the functions rtl_convertTextToUnicode() and
rtl_convertUnicodeToText(), the meaning of all the accompanying
RTL_TEXTTOUNICODE_FLAGS_XXX, RTL_TEXTTOUNICODE_INFO_XXX,
RTL_UNICODETOTEXT_FLAGS_XXX, and RTL_UNICODETOTEXT_INFO_XXX flags, and the
conversion context conventions.
Conversion Context
It is valid to pass a null pointer instead of an rtl_TextToUnicodeContext or
rtl_UnicodeToTextContext to the conversion functions. In that case, the
functions behave as if they received an initial context, as obtained by
rtl_createTextToUnicodeContext(), rtl_resetTextToUnicodeContext(),
rtl_createUnicodeToTextContext(), or rtl_resetUnicodeToTextContext(), and
simply do not return any context information (which is effectively lost). This
implies that you should always specify the FLAGS_FLUSH flag when using a null
context, for otherwise it is not possible in general to find out whether the
input buffer has been completely converted.
Handling of Undefined Codes
An undefined code is any of the following:
- A code from the source encoding that is valid (see “Handling of Invalid
Codes”), but not (yet) assigned a character. Examples are 0xA5 in ISO 8859-3,
0xA2A1 in EUC-CN, and 0x167F in Unicode.
- A code from the source encoding that is assigned a character that cannot be
mapped to the destination encoding. Examples are 0x0100 in Unicode, which
cannot be mapped to ISO 8859-1; and 0xA698 in HangulTalk, which cannot be
mapped to Unicode.
- A code from the source encoding that is reserved for private use, and thus
cannot be mapped to the destination encoding. (Even if the destination encoding
also has private-use codes, a higher-level protocol would be needed to map
between these private-use areas.)
In the text-to-Unicode direction, the conversion functions distinguish between
single-byte and multi-byte undefined codes (0xA5 in ISO 8859-3 and 0x80 in
GB-18030 are single-byte undefined codes, while 0xA2A1 in EUC-CN and
0xFE39FE39 in GB-18030 are multi-byte undefined codes).
When encountering an undefined code, the conversion functions allow any of the following behaviours (which are mutually exclusive):
FLAGS_UNDEFINED_ERROR
FLAGS_MBUNDEFINED_ERROR
- Read past the undefined code in the input buffer, set both the INFO_UNDEFINED
(or INFO_MBUNDEFINED) flag and the INFO_ERROR flag, and immediately quit the
conversion (ignoring any FLAGS_FLUSH flag).
FLAGS_UNDEFINED_IGNORE
FLAGS_MBUNDEFINED_IGNORE
- Read past the undefined code in the input buffer, set the INFO_UNDEFINED or
INFO_MBUNDEFINED flag, and continue with the conversion.
FLAGS_UNDEFINED_MAPTOPRIVATE
- If there is not enough space left in the output buffer, act as described
under “Handling of Destination Buffer Exhaustion”. Otherwise, read past the
undefined code in the input buffer, set the INFO_UNDEFINED flag, write U+F1xx
into the output buffer (where xx is the single-byte undefined code), and
continue with the conversion.
FLAGS_UNDEFINED_0
- If there is not enough space left in the output buffer, act as described
under “Handling of Destination Buffer Exhaustion”. Otherwise, read past the
undefined code in the input buffer, set the INFO_UNDEFINED flag, write an
(appropriately encoded) ASCII NUL character (0x00) into the output buffer, and
continue with the conversion.
FLAGS_UNDEFINED_QUESTIONMARK
- If there is not enough space left in the output buffer, act as described
under “Handling of Destination Buffer Exhaustion”. Otherwise, read past the
undefined code in the input buffer, set the INFO_UNDEFINED flag, write an
(appropriately encoded) ASCII “?” character (0x3F) into the output buffer, and
continue with the conversion.
FLAGS_UNDEFINED_UNDERLINE
- If there is not enough space left in the output buffer, act as described
under “Handling of Destination Buffer Exhaustion”. Otherwise, read past the
undefined code in the input buffer, set the INFO_UNDEFINED flag, write an
(appropriately encoded) ASCII “_” character (0x5F) into the output buffer, and
continue with the conversion.
FLAGS_UNDEFINED_DEFAULT
- If there is not enough space left in the output buffer, act as described
under “Handling of Destination Buffer Exhaustion”. Otherwise, read past the
undefined code in the input buffer, set the INFO_UNDEFINED flag, write some
output-encoding–specific character (currently U+FFFD for Unicode and “?” for
all other encodings) into the output buffer, and continue with the conversion.
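The single-byte behaviours listed above can be summarized in a small, self-contained C sketch. It does not call the real conversion functions: the UndefinedPolicy and UndefinedAction enums and the map_undefined_byte() helper are hypothetical names introduced only for illustration, and the INFO flag bookkeeping and the output-buffer space check are left out.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the FLAGS_UNDEFINED_XXX choices
   (these are NOT the real constants from the API). */
typedef enum {
    UNDEF_ERROR,
    UNDEF_IGNORE,
    UNDEF_MAPTOPRIVATE,
    UNDEF_0,
    UNDEF_QUESTIONMARK,
    UNDEF_UNDERLINE,
    UNDEF_DEFAULT
} UndefinedPolicy;

/* What the caller should do after one undefined byte has been read. */
typedef enum { ACT_STOP, ACT_SKIP, ACT_WRITE } UndefinedAction;

/* Decide how to handle one undefined single-byte code when converting to
   Unicode. On ACT_WRITE, *out receives the UTF-16 unit to write. */
UndefinedAction map_undefined_byte(UndefinedPolicy policy, uint8_t code,
                                   uint16_t *out)
{
    switch (policy) {
    case UNDEF_ERROR:        return ACT_STOP;   /* quit the conversion */
    case UNDEF_IGNORE:       return ACT_SKIP;   /* write nothing, go on */
    case UNDEF_MAPTOPRIVATE: *out = (uint16_t)(0xF100u | code); return ACT_WRITE;
    case UNDEF_0:            *out = 0x0000; return ACT_WRITE;
    case UNDEF_QUESTIONMARK: *out = 0x003F; return ACT_WRITE;
    case UNDEF_UNDERLINE:    *out = 0x005F; return ACT_WRITE;
    case UNDEF_DEFAULT:      *out = 0xFFFD; return ACT_WRITE;
    }
    return ACT_STOP; /* not reached for valid policies */
}
```

Under this sketch, the undefined byte 0xA5 from ISO 8859-3 combined with the map-to-private policy yields U+F1A5.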
In the Unicode-to-text direction, the conversion functions also allow any of
the following extra flags (of which an arbitrary number can be specified). In
all cases, the usual checks for an exhausted output buffer are made, and
otherwise the INFO_UNDEFINED flag is set.
FLAGS_UNDEFINED_REPLACE
- Some Unicode characters that have no direct mapping to the destination
encoding are mapped to similar single characters in the destination encoding.
For example, U+00A0 (NO-BREAK SPACE) could be mapped to 0x20 (SPACE) in ASCII.
Expect this to be poorly supported by the current implementation.
FLAGS_UNDEFINED_REPLACESTR
- Some Unicode characters that have no direct mapping to the destination
encoding are mapped to similar strings of characters in the destination
encoding. For example, U+00A9 (COPYRIGHT SIGN) could be mapped to the
three-character string “(C)” in ASCII. Expect this to be poorly supported by
the current implementation.
FLAGS_PRIVATE_MAPTO0
- Private-use characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and
U+100000–U+10FFFD) are mapped to an (appropriately encoded) ASCII NUL character
(0x00) in the output buffer.
FLAGS_NONSPACING_IGNORE
- Certain non-spacing characters, like U+200B (ZERO WIDTH SPACE) and U+FEFF
(ZERO WIDTH NO-BREAK SPACE), are ignored. Expect some uncertainty in the
current implementation as to which characters are affected.
FLAGS_CONTROL_IGNORE
- Control characters (U+0000–U+001F and U+007F–U+009F) are ignored.
FLAGS_PRIVATE_IGNORE
- Private-use characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and
U+100000–U+10FFFD) are ignored.
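The character ranges named for the control and private-use flags can be written down directly. The following is a minimal C sketch under stated assumptions: is_private_use(), is_control(), and drop_ignored() are hypothetical helpers (not part of the API), code points are held as 32-bit values rather than UTF-16 units, and the INFO_UNDEFINED bookkeeping is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Private-use ranges as listed for FLAGS_PRIVATE_MAPTO0 / FLAGS_PRIVATE_IGNORE. */
bool is_private_use(uint32_t cp)
{
    return (cp >= 0xE000 && cp <= 0xF8FF)
        || (cp >= 0xF0000 && cp <= 0xFFFFD)
        || (cp >= 0x100000 && cp <= 0x10FFFD);
}

/* Control ranges as listed for FLAGS_CONTROL_IGNORE. */
bool is_control(uint32_t cp)
{
    return cp <= 0x001F || (cp >= 0x007F && cp <= 0x009F);
}

/* Copy code points from in to out, dropping those selected by the two
   "ignore" switches. out must have room for n entries.
   Returns the number of code points kept. */
size_t drop_ignored(const uint32_t *in, size_t n, uint32_t *out,
                    bool ignore_control, bool ignore_private)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        if (ignore_control && is_control(in[i])) continue;
        if (ignore_private && is_private_use(in[i])) continue;
        out[kept++] = in[i];
    }
    return kept;
}
```

Because the two switches are independent, the sketch mirrors the statement that an arbitrary number of these flags can be combined.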
There is also a FLAGS_NOCOMPOSITE flag; I am not sure what it should be used
for.
Handling of Invalid Codes
An invalid code is a string of one or more units in the input buffer that is not valid according to the input encoding:
- It is not valid because it may never appear in the input encoding (e.g., 0x80
in ASCII, or 0xFF in GB-18030).
- It is not valid because it is only the prefix of a valid string of units,
with further units missing (e.g., the single high-surrogate 0xD800 in Unicode,
with a following low-surrogate missing, or 0xA1 in EUC-CN, with a second byte
in the range 0xA1–0xFE missing).
Invalid codes of the second category (that are potentially prefixes of valid
strings) are handled specially at the end of the input buffer. If the
FLAGS_FLUSH flag is specified, they are handled like all other invalid codes.
Otherwise, the INFO_SRCBUFFERTOSMALL flag is set to indicate that the input
buffer possibly ended in the middle of an input character (and the prefix is
either not yet read, or is stored in the conversion context, or is partly read
and partly stored in the conversion context).
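The end-of-buffer rule can be illustrated with a deliberately simplified, self-contained C sketch for an EUC-CN-like encoding. The names TailStatus and classify_tail() are hypothetical; the sketch assumes that every byte in 0xA1–0xFE starts a two-byte character, does not validate the second byte, and treats every other byte as a single unit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    TAIL_COMPLETE,   /* buffer ends on a character boundary */
    TAIL_NEED_MORE,  /* dangling prefix, no flush: like INFO_SRCBUFFERTOSMALL */
    TAIL_INVALID     /* dangling prefix with flush: an ordinary invalid code */
} TailStatus;

static bool in_a1_fe(uint8_t b) { return b >= 0xA1 && b <= 0xFE; }

/* Walk the buffer and report how it ends. */
TailStatus classify_tail(const uint8_t *buf, size_t len, bool flush)
{
    size_t i = 0;
    while (i < len) {
        if (in_a1_fe(buf[i])) {
            if (i + 1 == len) {
                /* Lead byte is the last byte: only a prefix is present. */
                return flush ? TAIL_INVALID : TAIL_NEED_MORE;
            }
            i += 2; /* simplification: the second byte is not validated */
        } else {
            i += 1;
        }
    }
    return TAIL_COMPLETE;
}
```

In this model, a buffer that ends in the lone byte 0xA1 asks for more input when no flush is requested, and is reported as invalid once the caller signals that no more input will follow.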
When encountering an invalid code (other than the special cases at the end of the input buffer), the conversion functions allow any of the following behaviours (which are mutually exclusive):
FLAGS_INVALID_ERROR
- Read past the invalid code in the input buffer, set both the INFO_INVALID
flag and the INFO_ERROR flag, and immediately quit the conversion (ignoring any
FLAGS_FLUSH flag).
FLAGS_INVALID_IGNORE
- Read past the invalid code in the input buffer, set the INFO_INVALID flag,
and continue with the conversion.
FLAGS_INVALID_0
- If there is not enough space left in the output buffer, act as described
under “Handling of Destination Buffer Exhaustion”. Otherwise, read past the
invalid code in the input buffer, set the INFO_INVALID flag, write an
(appropriately encoded) ASCII NUL character (0x00) into the output buffer, and
continue with the conversion.
FLAGS_INVALID_QUESTIONMARK
- If there is not enough space left in the output buffer, act as described
under “Handling of Destination Buffer Exhaustion”. Otherwise, read past the
invalid code in the input buffer, set the INFO_INVALID flag, write an
(appropriately encoded) ASCII “?” character (0x3F) into the output buffer, and
continue with the conversion.
FLAGS_INVALID_UNDERLINE
- If there is not enough space left in the output buffer, act as described
under “Handling of Destination Buffer Exhaustion”. Otherwise, read past the
invalid code in the input buffer, set the INFO_INVALID flag, write an
(appropriately encoded) ASCII “_” character (0x5F) into the output buffer, and
continue with the conversion.
FLAGS_INVALID_DEFAULT
- If there is not enough space left in the output buffer, act as described
under “Handling of Destination Buffer Exhaustion”. Otherwise, read past the
invalid code in the input buffer, set the INFO_INVALID flag, write some
output-encoding–specific character (currently U+FFFD for Unicode and “?” for
all other encodings) into the output buffer, and continue with the conversion.
Handling of Destination Buffer Exhaustion
If, in the course of conversion, there is not enough space left in the output
buffer (either for a normal character mapping or for a special mapping of
undefined or invalid codes), the INFO_DESTBUFFERTOSMALL flag is set, and the
conversion stops immediately (ignoring any FLAGS_FLUSH flag). It is unspecified
whether the input units that would overflow the output buffer are already read
(and stored in the conversion context) or not, but the number of processed
input buffer units returned by the conversion function will be correct in
either case.
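Since the reported number of processed input units is reliable, a caller can resume after buffer exhaustion by advancing the input by that count and calling again with fresh output space. The following C sketch shows that loop against a toy converter, not the real functions: toy_convert(), convert_in_chunks(), and the TOY_INFO_DESTBUFFERTOSMALL value are assumptions made for illustration, and the toy converter merely widens each byte to one 16-bit unit.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_INFO_DESTBUFFERTOSMALL 0x1u /* hypothetical value, not the real constant */

/* Toy converter: widens bytes to 16-bit units until dest is full.
   Returns the units written; *src_cvt receives the bytes consumed. */
size_t toy_convert(const unsigned char *src, size_t nsrc,
                   uint16_t *dest, size_t ndest,
                   unsigned *info, size_t *src_cvt)
{
    size_t i = 0;
    *info = 0;
    while (i < nsrc) {
        if (i == ndest) {
            *info |= TOY_INFO_DESTBUFFERTOSMALL; /* input left, no room */
            break;
        }
        dest[i] = src[i];
        ++i;
    }
    *src_cvt = i;
    return i;
}

/* Convert src in calls that offer at most `chunk` output units each,
   resuming after every exhaustion. Returns the total units written
   to out, which has capacity out_cap. */
size_t convert_in_chunks(const unsigned char *src, size_t nsrc,
                         uint16_t *out, size_t out_cap, size_t chunk)
{
    size_t total = 0;
    while (nsrc > 0 && total < out_cap && chunk > 0) {
        size_t room = out_cap - total;
        unsigned info;
        size_t used;
        size_t written = toy_convert(src, nsrc, out + total,
                                     room < chunk ? room : chunk,
                                     &info, &used);
        total += written;
        src += used;   /* advance by the reported number of consumed units */
        nsrc -= used;
        if (!(info & TOY_INFO_DESTBUFFERTOSMALL)) break; /* all input done */
    }
    return total;
}
```

With the real functions, the same loop would normally also pass a conversion context, so that any input units already read but not yet written survive from one call to the next.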
Author: Stephan Bergmann (Last modification $Date: 2004/12/08 14:22:01 $). Copyright 2001 OpenOffice.org Foundation. All Rights Reserved.