Text Conversion Functions

This text describes the functions rtl_convertTextToUnicode() and rtl_convertUnicodeToText(), the meaning of all the accompanying RTL_TEXTTOUNICODE_FLAGS_XXX, RTL_TEXTTOUNICODE_INFO_XXX, RTL_UNICODETOTEXT_FLAGS_XXX and RTL_UNICODETOTEXT_INFO_XXX flags, and the conversion context conventions.

Conversion Context

It is valid to pass a null pointer instead of an rtl_TextToUnicodeContext or rtl_UnicodeToTextContext to the conversion functions. In that case, the functions behave as if they had received an initial context, as obtained by rtl_createTextToUnicodeContext(), rtl_resetTextToUnicodeContext(), rtl_createUnicodeToTextContext(), or rtl_resetUnicodeToTextContext(), and simply do not return any context information (which is effectively lost). This implies that you should always specify the FLAGS_FLUSH flag when using a null context; otherwise it is, in general, not possible to find out whether the input buffer has been completely converted.

Handling of Undefined Codes

An undefined code is any of the following:

A code from the source encoding that is valid (see “invalid code”), but not (yet) assigned a character. Examples are 0xA5 in ISO 8859-3, 0xA2A1 in EUC-CN, and 0x167F in Unicode.
A code from the source encoding that is assigned a character that cannot be mapped to the destination encoding. Examples are 0x0100 in Unicode, which cannot be mapped to ISO 8859-1; and 0xA698 in HangulTalk, which cannot be mapped to Unicode.
A code from the source encoding that is reserved for private use, and thus cannot be mapped to the destination encoding. (Even if the destination encoding also has private-use codes, a higher-level protocol would be needed to map between these private-use areas.)

In the text-to-Unicode direction, the conversion functions distinguish between single-byte and multi-byte undefined codes: 0xA5 in ISO 8859-3 and 0x80 in GB-18030 are single-byte undefined codes, while 0xA2A1 in EUC-CN and 0xFE39FE39 in GB-18030 are multi-byte undefined codes.

When encountering an undefined code, the conversion functions allow any of the following behaviours (which are mutually exclusive):

FLAGS_UNDEFINED_ERROR, FLAGS_MBUNDEFINED_ERROR: Read past the undefined code in the input buffer, set the INFO_UNDEFINED or INFO_MBUNDEFINED flag (whichever applies) together with the INFO_ERROR flag, and immediately quit the conversion (ignoring any FLAGS_FLUSH flag).
FLAGS_UNDEFINED_IGNORE, FLAGS_MBUNDEFINED_IGNORE: Read past the undefined code in the input buffer, set the INFO_UNDEFINED or INFO_MBUNDEFINED flag (whichever applies), and continue with the conversion.
FLAGS_UNDEFINED_MAPTOPRIVATE: If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write U+F1xx into the output buffer (where xx is the single-byte undefined code), and continue with the conversion.
FLAGS_UNDEFINED_0: If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII NUL character (0x00) into the output buffer, and continue with the conversion.
FLAGS_UNDEFINED_QUESTIONMARK: If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII “?” character (0x3F) into the output buffer, and continue with the conversion.
FLAGS_UNDEFINED_UNDERLINE: If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write an (appropriately encoded) ASCII “_” character (0x5F) into the output buffer, and continue with the conversion.
FLAGS_UNDEFINED_DEFAULT: If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the undefined code in the input buffer, set the INFO_UNDEFINED flag, write some output-encoding–specific character (currently U+FFFD for Unicode and “?” for all other encodings) into the output buffer, and continue with the conversion.
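
The single-byte behaviours listed above can be summarised as a small decision function. The following is only a sketch: the names ToyUndefinedPolicy, ToyAction, and toy_handle_undefined are hypothetical and are not part of the rtl API, the real flags are bit masks rather than an enum, and the check for an exhausted output buffer is left to the caller. The sketch assumes a Unicode (UTF-16) destination, so the default character is U+FFFD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy names mirroring the FLAGS_UNDEFINED_XXX behaviours;
   these are NOT the real RTL_TEXTTOUNICODE_FLAGS_XXX constants. */
typedef enum {
    TOY_UNDEFINED_ERROR,
    TOY_UNDEFINED_IGNORE,
    TOY_UNDEFINED_MAPTOPRIVATE,
    TOY_UNDEFINED_0,
    TOY_UNDEFINED_QUESTIONMARK,
    TOY_UNDEFINED_UNDERLINE,
    TOY_UNDEFINED_DEFAULT
} ToyUndefinedPolicy;

/* What the caller should do after reading past the undefined code. */
typedef enum { TOY_STOP, TOY_SKIP, TOY_WRITE } ToyAction;

/* Decide how to treat a single-byte undefined code when converting to
   Unicode. On TOY_WRITE, *unit receives the UTF-16 code unit to emit. */
ToyAction toy_handle_undefined(uint8_t code, ToyUndefinedPolicy policy,
                               uint16_t *unit)
{
    assert(unit != NULL);
    switch (policy) {
    case TOY_UNDEFINED_ERROR:
        return TOY_STOP;                       /* quit the conversion */
    case TOY_UNDEFINED_IGNORE:
        return TOY_SKIP;                       /* write nothing */
    case TOY_UNDEFINED_MAPTOPRIVATE:
        *unit = (uint16_t)(0xF100u | code);    /* U+F1xx */
        return TOY_WRITE;
    case TOY_UNDEFINED_0:
        *unit = 0x0000;                        /* NUL */
        return TOY_WRITE;
    case TOY_UNDEFINED_QUESTIONMARK:
        *unit = 0x003F;                        /* "?" */
        return TOY_WRITE;
    case TOY_UNDEFINED_UNDERLINE:
        *unit = 0x005F;                        /* "_" */
        return TOY_WRITE;
    case TOY_UNDEFINED_DEFAULT:
        *unit = 0xFFFD;                        /* default for Unicode output */
        return TOY_WRITE;
    }
    return TOY_STOP;                           /* unknown policy: be strict */
}
```

For the undefined code 0xA5 of ISO 8859-3, calling toy_handle_undefined(0xA5, TOY_UNDEFINED_MAPTOPRIVATE, &unit) returns TOY_WRITE and stores 0xF1A5 in unit.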

In the Unicode-to-text direction, the conversion functions also allow the following extra flags, any combination of which can be specified. In all cases, the usual checks for an exhausted output buffer are made; if the buffer is not exhausted, the INFO_UNDEFINED flag is set.

FLAGS_UNDEFINED_REPLACE: Some Unicode characters that have no direct mapping to the destination encoding are mapped to similar single characters in the destination encoding. For example, U+00A0 (NO-BREAK SPACE) could be mapped to 0x20 (SPACE) in ASCII. Expect this to be poorly supported by the current implementation.
FLAGS_UNDEFINED_REPLACESTR: Some Unicode characters that have no direct mapping to the destination encoding are mapped to similar strings of characters in the destination encoding. For example, U+00A9 (COPYRIGHT SIGN) could be mapped to the three-character string “(C)” in ASCII. Expect this to be poorly supported by the current implementation.
FLAGS_PRIVATE_MAPTO0: Private-use characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and U+100000–U+10FFFD) are mapped to an (appropriately encoded) ASCII NUL character (0x00) in the output buffer.
FLAGS_NONSPACING_IGNORE: Certain non-spacing characters, like U+200B (ZERO WIDTH SPACE) and U+FEFF (ZERO WIDTH NO-BREAK SPACE), are ignored. Expect some uncertainty in the current implementation as to which characters are affected.
FLAGS_CONTROL_IGNORE: Control characters (U+0000–U+001F and U+007F–U+009F) are ignored.
FLAGS_PRIVATE_IGNORE: Private-use characters (U+E000–U+F8FF, U+F0000–U+FFFFD, and U+100000–U+10FFFD) are ignored.
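
The character ranges named in the list above are easy to get wrong by one, so here is a sketch of the range checks for FLAGS_CONTROL_IGNORE, FLAGS_PRIVATE_IGNORE, and FLAGS_PRIVATE_MAPTO0. All names (toy_is_private_use, toy_is_control, toy_classify, the TOY_FLAG_XXX values) are hypothetical and not part of the rtl API. The source does not say what happens when both FLAGS_PRIVATE_IGNORE and FLAGS_PRIVATE_MAPTO0 are given; the sketch assumes, arbitrarily, that ignoring wins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; NOT the real RTL_UNICODETOTEXT_FLAGS_XXX values. */
#define TOY_FLAG_CONTROL_IGNORE 0x1u
#define TOY_FLAG_PRIVATE_IGNORE 0x2u
#define TOY_FLAG_PRIVATE_MAPTO0 0x4u

typedef enum { TOY_KEEP, TOY_DROP, TOY_WRITE_NUL } ToyDecision;

/* U+E000-U+F8FF, U+F0000-U+FFFFD, and U+100000-U+10FFFD. */
bool toy_is_private_use(uint32_t cp)
{
    return (cp >= 0xE000 && cp <= 0xF8FF)
        || (cp >= 0xF0000 && cp <= 0xFFFFD)
        || (cp >= 0x100000 && cp <= 0x10FFFD);
}

/* U+0000-U+001F and U+007F-U+009F. */
bool toy_is_control(uint32_t cp)
{
    return cp <= 0x001F || (cp >= 0x007F && cp <= 0x009F);
}

/* Decide what the extra flags do to one code point. TOY_KEEP means the
   flags do not apply and normal conversion continues. */
ToyDecision toy_classify(uint32_t cp, unsigned flags)
{
    assert(cp <= 0x10FFFF);   /* outside this range there is no Unicode */
    if ((flags & TOY_FLAG_CONTROL_IGNORE) && toy_is_control(cp))
        return TOY_DROP;
    if (toy_is_private_use(cp)) {
        if (flags & TOY_FLAG_PRIVATE_IGNORE)   /* assumption: IGNORE wins */
            return TOY_DROP;
        if (flags & TOY_FLAG_PRIVATE_MAPTO0)
            return TOY_WRITE_NUL;
    }
    return TOY_KEEP;
}
```

Note that U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are noncharacters rather than private-use characters, which is why the upper bounds end in FFFD.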

There is also a FLAGS_NOCOMPOSITE flag; I am not sure what it is meant to be used for.

Handling of Invalid Codes

An invalid code is a string of one or more units in the input buffer that is not valid according to the input encoding:

It is not valid because it may never appear in the input encoding (e.g., 0x80 in ASCII, or 0xFF in GB-18030).
It is not valid because it is only the prefix of a valid string of units, with further units missing (e.g., the single high-surrogate 0xD800 in Unicode, with a following low-surrogate missing, or 0xA1 in EUC-CN, with a second byte in the range 0xA1–0xFE missing).

Invalid codes of the second category (that are potentially prefixes of valid strings) are handled specially at the end of the input buffer. If the FLAGS_FLUSH flag is specified, they are handled like all other invalid codes. Otherwise, the INFO_SRCBUFFERTOSMALL flag is set to indicate that the input buffer possibly ended in the middle of an input character (and the prefix is either not yet read, or is stored in the conversion context, or is partly read and partly stored in the conversion context).
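
To make the interplay between a truncated prefix, the FLAGS_FLUSH flag, and the conversion context more concrete, here is a toy decoder for UTF-16 surrogate pairs. It is a sketch under stated assumptions, not the rtl implementation: ToyContext, toy_utf16_decode, and the TOY_XXX constants are hypothetical; the toy always stores a dangling high surrogate in the context (the source leaves open whether the prefix is stored or left unread); and invalid codes are simply ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for FLAGS_FLUSH, INFO_INVALID and
   INFO_SRCBUFFERTOSMALL; NOT the real RTL_XXX constants. */
#define TOY_FLAG_FLUSH            0x1u
#define TOY_INFO_INVALID          0x1u
#define TOY_INFO_SRCBUFFERTOSMALL 0x2u

/* Toy conversion context: a high surrogate still waiting for its partner
   (0 means the initial state). */
typedef struct { uint16_t pending_high; } ToyContext;

/* Decode UTF-16 code units into code points. ctx may be NULL. dst must
   have room for nsrc code points. Returns the number of code points
   written. Invalid codes are skipped (the "ignore" behaviour). */
size_t toy_utf16_decode(ToyContext *ctx, const uint16_t *src, size_t nsrc,
                        uint32_t *dst, unsigned flags, unsigned *info)
{
    ToyContext initial = { 0 };
    ToyContext *st = ctx ? ctx : &initial;  /* null context: initial state */
    size_t out = 0;

    assert(info != NULL);
    *info = 0;
    for (size_t i = 0; i < nsrc; ++i) {
        uint16_t u = src[i];
        if (st->pending_high != 0) {
            uint16_t high = st->pending_high;
            st->pending_high = 0;
            if (u >= 0xDC00 && u <= 0xDFFF) {   /* pair completed */
                dst[out++] = 0x10000u + ((uint32_t)(high - 0xD800) << 10)
                           + (uint32_t)(u - 0xDC00);
                continue;
            }
            *info |= TOY_INFO_INVALID;          /* lone high surrogate */
        }
        if (u >= 0xD800 && u <= 0xDBFF)
            st->pending_high = u;               /* possible prefix */
        else if (u >= 0xDC00 && u <= 0xDFFF)
            *info |= TOY_INFO_INVALID;          /* lone low surrogate */
        else
            dst[out++] = u;
    }
    if (st->pending_high != 0) {                /* prefix at end of input */
        if (flags & TOY_FLAG_FLUSH) {
            *info |= TOY_INFO_INVALID;          /* treated like any invalid code */
            st->pending_high = 0;
        } else {
            *info |= TOY_INFO_SRCBUFFERTOSMALL; /* more input expected */
        }
    }
    return out;
}
```

Feeding the units 0x0041, 0xD83D and then, in a second call, 0xDE00 through the same ToyContext yields U+0041 from the first call (with TOY_INFO_SRCBUFFERTOSMALL set) and U+1F600 from the second. With a null context the stored high surrogate is lost between the calls, so the second call sees only a lone low surrogate and reports it as invalid. This is the reason for always flushing when no context is used.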

When encountering an invalid code (other than the special cases at the end of the input buffer), the conversion functions allow any of the following behaviours (which are mutually exclusive):

FLAGS_INVALID_ERROR: Read past the invalid code in the input buffer, set both the INFO_INVALID and the INFO_ERROR flags, and immediately quit the conversion (ignoring any FLAGS_FLUSH flag).
FLAGS_INVALID_IGNORE: Read past the invalid code in the input buffer, set the INFO_INVALID flag, and continue with the conversion.
FLAGS_INVALID_0: If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the invalid code in the input buffer, set the INFO_INVALID flag, write an (appropriately encoded) ASCII NUL character (0x00) into the output buffer, and continue with the conversion.
FLAGS_INVALID_QUESTIONMARK: If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the invalid code in the input buffer, set the INFO_INVALID flag, write an (appropriately encoded) ASCII “?” character (0x3F) into the output buffer, and continue with the conversion.
FLAGS_INVALID_UNDERLINE: If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the invalid code in the input buffer, set the INFO_INVALID flag, write an (appropriately encoded) ASCII “_” character (0x5F) into the output buffer, and continue with the conversion.
FLAGS_INVALID_DEFAULT: If there is not enough space left in the output buffer, act accordingly. Otherwise, read past the invalid code in the input buffer, set the INFO_INVALID flag, write some output-encoding–specific character (currently U+FFFD for Unicode and “?” for all other encodings) into the output buffer, and continue with the conversion.
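
The order of the steps in the list above (check the output buffer first, then read past the code, set the flag, and write the replacement) can be sketched with a toy ASCII-to-UTF-16 converter in which every byte from 0x80 upwards is an invalid code. The names ToyInvalidPolicy, toy_ascii_to_utf16, and the TOY_INFO_XXX constants are hypothetical; the real functions take bit-mask flags and a converter handle, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the INFO_XXX flags; NOT the real constants. */
#define TOY_INFO_ERROR             0x1u
#define TOY_INFO_INVALID           0x2u
#define TOY_INFO_DESTBUFFERTOSMALL 0x4u

/* Hypothetical policies mirroring the FLAGS_INVALID_XXX behaviours. */
typedef enum {
    TOY_INVALID_ERROR,
    TOY_INVALID_IGNORE,
    TOY_INVALID_0,
    TOY_INVALID_QUESTIONMARK,
    TOY_INVALID_UNDERLINE,
    TOY_INVALID_DEFAULT
} ToyInvalidPolicy;

/* Convert ASCII bytes to UTF-16 units; bytes >= 0x80 are invalid codes.
   Returns the number of units written; *consumed receives the number of
   input bytes read. */
size_t toy_ascii_to_utf16(const unsigned char *src, size_t nsrc,
                          uint16_t *dst, size_t ndst,
                          ToyInvalidPolicy policy,
                          unsigned *info, size_t *consumed)
{
    size_t in = 0, out = 0;

    assert(info != NULL && consumed != NULL);
    *info = 0;
    while (in < nsrc) {
        int invalid = src[in] >= 0x80;
        int write = 1;
        uint16_t unit = src[in];

        if (invalid) {
            switch (policy) {
            case TOY_INVALID_ERROR:
                ++in;                          /* read past the code */
                *info |= TOY_INFO_INVALID | TOY_INFO_ERROR;
                *consumed = in;
                return out;                    /* quit immediately */
            case TOY_INVALID_IGNORE:       write = 0;     break;
            case TOY_INVALID_0:            unit = 0x0000; break;
            case TOY_INVALID_QUESTIONMARK: unit = 0x003F; break;
            case TOY_INVALID_UNDERLINE:    unit = 0x005F; break;
            case TOY_INVALID_DEFAULT:      unit = 0xFFFD; break;
            }
        }
        if (write && out == ndst) {            /* no room: stop before reading */
            *info |= TOY_INFO_DESTBUFFERTOSMALL;
            break;
        }
        if (invalid)
            *info |= TOY_INFO_INVALID;
        if (write)
            dst[out++] = unit;
        ++in;
    }
    *consumed = in;
    return out;
}
```

For the three input bytes 'a', 0x80, 'b', the question-mark policy produces "a?b" with TOY_INFO_INVALID set, while the error policy stops after writing "a", reports two bytes as read, and sets both TOY_INFO_INVALID and TOY_INFO_ERROR.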

Handling of Destination Buffer Exhaustion

If, in the course of conversion, there is not enough space left in the output buffer (either for a normal character mapping or for a special mapping of undefined or invalid codes), the INFO_DESTBUFFERTOSMALL flag is set, and the conversion stops immediately (ignoring any FLAGS_FLUSH flag). It is unspecified whether the input units that would overflow the output buffer have already been read (and stored in the conversion context) or not, but the number of processed input buffer units returned by the conversion function will be correct in either case.
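
Because the number of processed input units is reliable, a caller can convert through a small output buffer and resume where the previous call stopped. The following sketch shows such a loop around a toy ISO 8859-1 to UTF-16 converter. The names toy_latin1_to_utf16, toy_convert_all, and TOY_INFO_DESTBUFFERTOSMALL are hypothetical, and the toy keeps no state, so no conversion context is needed; with the real functions the same context would have to be passed to every call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for INFO_DESTBUFFERTOSMALL; NOT the real constant. */
#define TOY_INFO_DESTBUFFERTOSMALL 0x1u

/* Toy ISO 8859-1 to UTF-16 conversion: every byte maps to the code unit
   of the same value. Returns the number of units written; *consumed
   receives the number of input bytes read. */
size_t toy_latin1_to_utf16(const unsigned char *src, size_t nsrc,
                           uint16_t *dst, size_t ndst,
                           unsigned *info, size_t *consumed)
{
    size_t n = 0;

    assert(info != NULL && consumed != NULL);
    *info = 0;
    while (n < nsrc) {
        if (n == ndst) {                       /* output buffer exhausted */
            *info |= TOY_INFO_DESTBUFFERTOSMALL;
            break;
        }
        dst[n] = src[n];
        ++n;
    }
    *consumed = n;
    return n;
}

/* Convert all of src through a four-unit scratch buffer, appending the
   results to out (capacity nout). Returns the total number of units. */
size_t toy_convert_all(const unsigned char *src, size_t nsrc,
                       uint16_t *out, size_t nout)
{
    uint16_t chunk[4];
    size_t total = 0;

    while (nsrc > 0) {
        unsigned info;
        size_t consumed;
        size_t written = toy_latin1_to_utf16(src, nsrc, chunk, 4,
                                             &info, &consumed);
        assert(total + written <= nout);
        for (size_t i = 0; i < written; ++i)
            out[total++] = chunk[i];
        src += consumed;                       /* resume after the bytes read */
        nsrc -= consumed;
        if (!(info & TOY_INFO_DESTBUFFERTOSMALL))
            break;                             /* everything was converted */
    }
    return total;
}
```

The loop relies only on the returned count of consumed input, not on any guess about how much of the input fits into the scratch buffer.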