Wide_Wide_Text_IO - GNAT Reference Manual

Next: Stream_IO, Previous: Wide_Text_IO, Up: The Implementation of Standard I/O

8.7 Wide_Wide_Text_IO

Wide_Wide_Text_IO is similar in most respects to Text_IO, except that both input and output files may contain special sequences that represent wide wide character values. The encoding scheme for a given file may be specified using a FORM parameter:

     WCEM=x

as part of the FORM string (WCEM = wide character encoding method), where x is one of the following characters

`h': Hex ESC encoding
`u': Upper half encoding
`s': Shift-JIS encoding
`e': EUC Encoding
`8': UTF-8 encoding
`b': Brackets encoding

The encoding methods match those that can be used in a source program, but there is no requirement that the encoding method used for the source program be the same as the encoding method used for files, and different files may use different encoding methods.

The default encoding method for the standard files, and for opened files for which no WCEM parameter is given in the FORM string matches the wide character encoding specified for the main program (the default being brackets encoding if no coding method was specified with -gnatW).

UTF-8 Coding

A wide character is represented using UCS Transformation Format 8 (UTF-8) as defined in Annex R of ISO 10646-1/Am.2. Depending on the character value, the representation is a one, two, three, or four byte sequence:

          16#000000#-16#00007f#: 2#0xxxxxxx#
          16#000080#-16#0007ff#: 2#110xxxxx# 2#10xxxxxx#
          16#000800#-16#00ffff#: 2#1110xxxx# 2#10xxxxxx# 2#10xxxxxx#
          16#010000#-16#10ffff#: 2#11110xxx# 2#10xxxxxx# 2#10xxxxxx# 2#10xxxxxx#

where the xxx bits correspond to the left-padded bits of the 21-bit character value. Note that all lower half ASCII characters are represented as ASCII bytes and all upper half characters and other wide characters are represented as sequences of upper-half characters.

Brackets Coding

In this encoding, a wide wide character is represented by the following eight character sequence if is in wide character range

          [ " a b c d " ]

and by the following ten character sequence if not

          [ " a b c d e f " ]

where a, b, c, d, e, and f are the four or six hexadecimal characters (using uppercase letters) of the wide wide character code. For example, ["01A345"] is used to represent the wide wide character with code 16#01A345#.

This scheme is compatible with use of the full Wide_Wide_Character set. On input, brackets coding can also be used for upper half characters, e.g. ["C1"] for lower case a. However, on output, brackets notation is only used for wide characters with a code greater than 16#FF#.

If is also possible to use the other Wide_Character encoding methods, such as Shift-JIS, but the other schemes cannot support the full range of wide wide characters. An attempt to output a character that cannot be represented using the encoding scheme for the file causes Constraint_Error to be raised. An invalid wide character sequence on input also causes Constraint_Error to be raised.