Wide_Text_IO - GNAT Reference Manual

8.6 Wide_Text_IO

Wide_Text_IO is similar in most respects to Text_IO, except that both input and output files may contain special sequences that represent wide character values. The encoding scheme for a given file may be specified using a FORM parameter:

     WCEM=x

as part of the FORM string (WCEM = wide character encoding method), where x is one of the following characters:

`h': Hex ESC encoding
`u': Upper half encoding
`s': Shift-JIS encoding
`e': EUC encoding
`8': UTF-8 encoding
`b': Brackets encoding

The encoding methods match those that can be used in a source program, but there is no requirement that the encoding method used for the source program be the same as the encoding method used for files, and different files may use different encoding methods.

The default encoding method for the standard files, and for opened files for which no WCEM parameter is given in the FORM string, matches the wide character encoding specified for the main program (the default being brackets encoding if no coding method was specified with -gnatW).
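As an illustration of the FORM parameter, the following minimal sketch creates a file with UTF-8 encoding selected for that file only, then reads it back using the same encoding method. The file name and the names File_Name, Alpha and Buffer are assumptions made for this example; only Create, Open, Put_Line, Get_Line, Close and Delete from Ada.Wide_Text_IO are used.

```ada
with Ada.Text_IO;
with Ada.Wide_Text_IO; use Ada.Wide_Text_IO;

procedure WCEM_Demo is
   --  Hypothetical file name, chosen only for this sketch.
   File_Name : constant String := "wcem_demo.txt";

   --  Greek small letter alpha, a character outside the Latin-1 range.
   Alpha  : constant Wide_Character := Wide_Character'Val (16#03B1#);

   File   : File_Type;
   Buffer : Wide_String (1 .. 80);
   Last   : Natural;
begin
   --  Select UTF-8 for this file only, through the FORM string.
   Create (File, Out_File, File_Name, Form => "WCEM=8");
   Put_Line (File, "alpha = " & Alpha);
   Close (File);

   --  Read the file back with the same encoding method, then remove it.
   Open (File, In_File, File_Name, Form => "WCEM=8");
   Get_Line (File, Buffer, Last);
   Delete (File);

   if Last > 0 and then Buffer (Last) = Alpha then
      Ada.Text_IO.Put_Line ("round trip preserved the wide character");
   end if;
end WCEM_Demo;
```

Because the encoding is chosen per file, a second file opened in the same program could use a different WCEM value without affecting this one.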

Hex Coding

In this encoding, a wide character is represented by a five character sequence:

          ESC a b c d

where a, b, c, d are the four hexadecimal characters (using upper case letters) of the wide character code. For example, ESC A345 is used to represent the wide character with code 16#A345#. This scheme is compatible with use of the full Wide_Character set.
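The five character sequence can be sketched as follows. Hex_ESC_Encode is a hypothetical helper written for this illustration and is not part of the GNAT run-time; it simply builds ESC followed by the four upper case hexadecimal digits of the code.

```ada
with Ada.Characters.Latin_1;
with Ada.Text_IO;

procedure Hex_ESC_Demo is

   --  Illustrative helper (not part of GNAT): returns the five
   --  character Hex ESC form of a wide character.
   function Hex_ESC_Encode (C : Wide_Character) return String is
      Digit : constant String := "0123456789ABCDEF";
      Code  : constant Natural := Wide_Character'Pos (C);
   begin
      return Ada.Characters.Latin_1.ESC
        & Digit (Code / 4096 + 1)
        & Digit (Code / 256 mod 16 + 1)
        & Digit (Code / 16 mod 16 + 1)
        & Digit (Code mod 16 + 1);
   end Hex_ESC_Encode;

   Encoded : constant String :=
     Hex_ESC_Encode (Wide_Character'Val (16#A345#));
begin
   --  Skip the leading ESC so that only printable characters are shown.
   Ada.Text_IO.Put_Line
     ("ESC " & Encoded (Encoded'First + 1 .. Encoded'Last));
end Hex_ESC_Demo;
```

For the code 16#A345# used in the text, this prints ESC A345.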

Upper Half Coding

The wide character with encoding 16#abcd#, where the upper bit is on (i.e., a is in the range 8-F), is represented as two bytes 16#ab# and 16#cd#. The second byte may never be a format control character, but is not required to be in the upper half. This method can also be used for Shift-JIS or EUC where the internal coding matches the external coding.
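The two byte form can be sketched as below. Upper_Half_Encode is a hypothetical helper, not a GNAT routine, and it covers only characters whose upper bit is on; raising Constraint_Error for any other code is a choice made for this sketch rather than a description of how the run-time treats such characters.

```ada
with Ada.Text_IO;

procedure Upper_Half_Demo is

   --  Illustrative helper (not part of GNAT): splits 16#abcd# into the
   --  bytes 16#ab# and 16#cd#. Rejecting codes whose upper bit is off
   --  is an assumption of this sketch.
   function Upper_Half_Encode (C : Wide_Character) return String is
      Code : constant Natural := Wide_Character'Pos (C);
   begin
      if Code < 16#8000# then
         raise Constraint_Error;
      end if;

      return Character'Val (Code / 256) & Character'Val (Code mod 256);
   end Upper_Half_Encode;

   Pair : constant String :=
     Upper_Half_Encode (Wide_Character'Val (16#A345#));
begin
   --  Show the two byte values in decimal.
   for B of Pair loop
      Ada.Text_IO.Put (Natural'Image (Character'Pos (B)));
   end loop;
   Ada.Text_IO.New_Line;
end Upper_Half_Demo;
```

For 16#A345# the two bytes are 16#A3# and 16#45# (163 and 69 in decimal); the second byte is an ordinary lower half character, which the scheme permits.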

Shift JIS Coding

A wide character is represented by a two character sequence 16#ab# and 16#cd#, with the restrictions described above for upper half encoding. The internal character code is the corresponding JIS character according to the standard algorithm for Shift-JIS conversion. Only characters defined in the JIS code set table can be used with this encoding method.

EUC Coding

A wide character is represented by a two character sequence 16#ab# and 16#cd#, with both characters being in the upper half. The internal character code is the corresponding JIS character according to the EUC encoding algorithm. Only characters defined in the JIS code set table can be used with this encoding method.

UTF-8 Coding

A wide character is represented using UCS Transformation Format 8 (UTF-8) as defined in Annex R of ISO 10646-1/Am.2. Depending on the character value, the representation is a one, two, or three byte sequence:

          16#0000#-16#007f#: 2#0xxxxxxx#
          16#0080#-16#07ff#: 2#110xxxxx# 2#10xxxxxx#
          16#0800#-16#ffff#: 2#1110xxxx# 2#10xxxxxx# 2#10xxxxxx#

where the xxx bits correspond to the left-padded bits of the 16-bit character value. Note that all lower half ASCII characters are represented as ASCII bytes and all upper half characters and other wide characters are represented as sequences of upper-half bytes. (The full UTF-8 scheme allows for encoding 31-bit characters as 6-byte sequences, but in this implementation, all UTF-8 sequences of four or more bytes in length will raise Constraint_Error, as will all invalid UTF-8 sequences.)
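The three ranges in the table can be sketched as a small encoder. UTF8_Encode and Show are hypothetical names introduced for this illustration; the code follows the bit layouts given above and is not the GNAT run-time implementation.

```ada
with Ada.Text_IO;

procedure UTF8_Demo is

   --  Illustrative helper (not part of GNAT): encodes one wide
   --  character as a one, two, or three byte UTF-8 sequence.
   function UTF8_Encode (C : Wide_Character) return String is
      Code : constant Natural := Wide_Character'Pos (C);
   begin
      if Code <= 16#7F# then
         --  2#0xxxxxxx#
         return (1 => Character'Val (Code));

      elsif Code <= 16#7FF# then
         --  2#110xxxxx# 2#10xxxxxx#
         return Character'Val (16#C0# + Code / 64)
           & Character'Val (16#80# + Code mod 64);

      else
         --  2#1110xxxx# 2#10xxxxxx# 2#10xxxxxx#
         return Character'Val (16#E0# + Code / 4096)
           & Character'Val (16#80# + Code / 64 mod 64)
           & Character'Val (16#80# + Code mod 64);
      end if;
   end UTF8_Encode;

   --  Prints the decimal value of each byte of the encoding.
   procedure Show (C : Wide_Character) is
      Bytes : constant String := UTF8_Encode (C);
   begin
      for B of Bytes loop
         Ada.Text_IO.Put (Natural'Image (Character'Pos (B)));
      end loop;
      Ada.Text_IO.New_Line;
   end Show;

begin
   Show ('A');                            --  one byte
   Show (Wide_Character'Val (16#03B1#));  --  two bytes
   Show (Wide_Character'Val (16#20AC#));  --  three bytes
end UTF8_Demo;
```

The three calls produce the byte values 65, then 206 177 (16#CE# 16#B1#), then 226 130 172 (16#E2# 16#82# 16#AC#). Every byte of the multi-byte forms is in the upper half, as noted above.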

Brackets Coding

In this encoding, a wide character is represented by the following eight character sequence:

          [ " a b c d " ]

where a, b, c, d are the four hexadecimal characters (using upper case letters) of the wide character code. For example, ["A345"] is used to represent the wide character with code 16#A345#. This scheme is compatible with use of the full Wide_Character set. On input, brackets coding can also be used for upper half characters, e.g. ["C1"] for the Latin-1 upper case A with acute accent. However, on output, brackets notation is only used for wide characters with a code greater than 16#FF#.
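The output rule can be sketched as follows: codes up to 16#FF# are emitted as a single character, and larger codes take the eight character brackets form. Brackets_Encode is a hypothetical helper written for this illustration, not a GNAT routine.

```ada
with Ada.Text_IO;

procedure Brackets_Demo is

   --  Illustrative helper (not part of GNAT): returns the output form
   --  of one wide character under brackets encoding.
   function Brackets_Encode (C : Wide_Character) return String is
      Digit : constant String := "0123456789ABCDEF";
      Code  : constant Natural := Wide_Character'Pos (C);
   begin
      --  Codes up to 16#FF# are output as a single character.
      if Code <= 16#FF# then
         return (1 => Character'Val (Code));
      end if;

      --  A doubled quote inside an Ada string literal stands for one
      --  quote character, so "[""" is the two characters [ and ".
      return "["""
        & Digit (Code / 4096 + 1)
        & Digit (Code / 256 mod 16 + 1)
        & Digit (Code / 16 mod 16 + 1)
        & Digit (Code mod 16 + 1)
        & """]";
   end Brackets_Encode;

   --  16#03C0# is Greek small letter pi.
   Text : constant Wide_String := "pi = " & Wide_Character'Val (16#03C0#);
begin
   for C of Text loop
      Ada.Text_IO.Put (Brackets_Encode (C));
   end loop;
   Ada.Text_IO.New_Line;
end Brackets_Demo;
```

This prints pi = ["03C0"], with the Latin-1 characters passed through unchanged.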

Note that brackets coding is not normally used in the context of Wide_Text_IO or Wide_Wide_Text_IO, since it is really just designed as a portable way of encoding source files. In the context of Wide_Text_IO or Wide_Wide_Text_IO, it can only be used if the file does not contain any instance of the left bracket character other than to encode wide character values using the brackets encoding method. In practice it is expected that some standard wide character encoding method such as UTF-8 will be used for text input-output.

If brackets notation is used, then any occurrence of a left bracket in the input file which is not the start of a valid wide character sequence will cause Constraint_Error to be raised. It is possible to encode a left bracket as ["5B"] and Wide_Text_IO and Wide_Wide_Text_IO input will interpret this as a left bracket.

However, when a left bracket is output, it will be output as a left bracket and not as ["5B"]. We make this decision because for normal use of Wide_Text_IO for outputting messages, it is unpleasant to clobber left brackets. For example, if we write:

             Put_Line ("Start of output [first run]");

we really do not want to have the left bracket in this message clobbered so that the output reads:

             Start of output ["5B"]first run]

In practice brackets encoding is reasonably useful for normal Put_Line use since we won't get confused between left brackets and wide character sequences in the output. But for input, or when files are written out and read back in, it makes more sense to use one of the standard encoding methods such as UTF-8.

For the coding schemes other than UTF-8, Hex, or Brackets encoding, not all wide character values can be represented. An attempt to output a character that cannot be represented using the encoding scheme for the file causes Constraint_Error to be raised. An invalid wide character sequence on input also causes Constraint_Error to be raised.
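A program that reads files of uncertain origin may want to guard its input calls. The following minimal sketch writes a line containing a plain left bracket with Ada.Text_IO, then reads it with Wide_Text_IO under brackets encoding and handles Constraint_Error. The file name and the package renamings TIO and WTIO are assumptions made for this example, and no particular output is claimed here.

```ada
with Ada.Text_IO;
with Ada.Wide_Text_IO;

procedure Bad_Input_Demo is
   package TIO  renames Ada.Text_IO;
   package WTIO renames Ada.Wide_Text_IO;

   --  Hypothetical file name, chosen only for this sketch.
   File_Name : constant String := "bad_input.txt";

   Raw    : TIO.File_Type;
   File   : WTIO.File_Type;
   Buffer : Wide_String (1 .. 80);
   Last   : Natural;
begin
   --  Write a line whose left bracket does not start a wide character
   --  sequence.
   TIO.Create (Raw, TIO.Out_File, File_Name);
   TIO.Put_Line (Raw, "Start of output [first run]");
   TIO.Close (Raw);

   --  Read it back as wide text using brackets encoding.
   WTIO.Open (File, WTIO.In_File, File_Name, Form => "WCEM=b");

   begin
      WTIO.Get_Line (File, Buffer, Last);
      TIO.Put_Line ("line accepted");
   exception
      when Constraint_Error =>
         TIO.Put_Line ("invalid wide character sequence in input");
   end;

   --  Delete closes the file and removes it.
   WTIO.Delete (File);
end Bad_Input_Demo;
```

Keeping the handler local to the block around Get_Line lets the program report the problem and still close and delete the file afterwards.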