GNAT allows wide character codes to appear in character and string literals, and also optionally in identifiers, by means of the following possible encoding schemes:
Hex Coding
In this encoding, a wide character is represented by the following five-character sequence:
ESC a b c d
where a, b, c, d are the four hexadecimal characters (using uppercase letters) of the wide character code. For example, ESC A345 is used to represent the wide character with code 16#A345#.
This scheme is compatible with use of the full Wide_Character set.
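The five-character sequence can be illustrated with a minimal Python sketch. The function names `hex_encode` and `hex_decode` are hypothetical helpers written for this illustration and are not part of GNAT; the sketch assumes that ESC is the ASCII escape character (code 16#1B#).

```python
ESC = "\x1b"  # assumption: ESC is the ASCII escape character, code 16#1B#
HEX_DIGITS = "0123456789ABCDEF"  # the scheme uses uppercase letters only


def hex_encode(code: int) -> str:
    """Return ESC followed by the four uppercase hex digits of a 16-bit code."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code must fit in 16 bits")
    return ESC + format(code, "04X")


def hex_decode(seq: str) -> int:
    """Return the 16-bit code for a five-character ESC a b c d sequence."""
    if len(seq) != 5 or seq[0] != ESC:
        raise ValueError("expected ESC followed by four characters")
    digits = seq[1:]
    if any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError("expected four uppercase hexadecimal digits")
    return int(digits, 16)
```

For instance, `hex_encode(0xA345)` yields ESC followed by the characters `A345`, matching the example in the text.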
Upper-Half Coding
The wide character with encoding 16#abcd#, where the upper bit is on (in other words, ‘a’ is in the range 8-F), is represented as two bytes, 16#ab# and 16#cd#. The second byte cannot be a format control character, but is not required to be in the upper half. This method can also be used for Shift-JIS or EUC, where the internal coding matches the external coding.
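As a rough illustration of the two-byte split described above, the following Python sketch separates a 16-bit code into its high and low bytes and joins them again. The names `upper_half_encode` and `upper_half_decode` are hypothetical, and the sketch deliberately does not check the restriction on format control characters in the second byte.

```python
def upper_half_encode(code: int) -> bytes:
    """Split a 16-bit code whose upper bit is set into bytes 16#ab# and 16#cd#."""
    if not 0x8000 <= code <= 0xFFFF:
        raise ValueError("upper bit of the 16-bit code must be set")
    return bytes([code >> 8, code & 0xFF])


def upper_half_decode(b1: int, b2: int) -> int:
    """Rebuild the 16-bit code from its two bytes.

    Note: the rule that the second byte cannot be a format control
    character is not enforced in this sketch.
    """
    if not 0x80 <= b1 <= 0xFF or not 0 <= b2 <= 0xFF:
        raise ValueError("first byte must be in the upper half")
    return (b1 << 8) | b2
```

Here the code 16#A345# becomes the bytes 16#A3# and 16#45#; the second byte is in the lower half, which the scheme permits.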
Shift JIS Coding
A wide character is represented by a two-character sequence, 16#ab# and 16#cd#, with the restrictions described above for upper-half encoding. The internal character code is the corresponding JIS character according to the standard algorithm for Shift-JIS conversion. Only characters defined in the JIS code set table can be used with this encoding method.
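The text does not spell out the conversion, so the following Python sketch shows one commonly cited form of the Shift-JIS to JIS arithmetic. That GNAT applies exactly these steps is an assumption, and `shift_jis_to_jis` is a hypothetical name used only for this illustration. The sketch does not verify that the result is a defined entry in the JIS code set table.

```python
def shift_jis_to_jis(s1: int, s2: int) -> int:
    """Convert a Shift-JIS byte pair to a 16-bit JIS code (illustrative only)."""
    # Lead bytes in the second Shift-JIS range are folded onto the first range.
    if s1 >= 0xE0:
        s1 -= 0x40

    if s2 >= 0x9F:
        # Trail byte belongs to the even JIS row of the pair.
        j1 = 2 * s1 - 0xE0
        j2 = s2 - 0x7E
    else:
        # Trail byte belongs to the odd JIS row; skip the unused value 16#7F#.
        if s2 >= 0x7F:
            s2 -= 1
        j1 = 2 * s1 - 0xE1
        j2 = s2 - 0x1F

    return (j1 << 8) | j2
```

With this arithmetic, the Shift-JIS pair 16#82# 16#A0# (hiragana "a") maps to the JIS code 16#2422#.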
EUC Coding
A wide character is represented by a two-character sequence, 16#ab# and 16#cd#, with both characters being in the upper half. The internal character code is the corresponding JIS character according to the EUC encoding algorithm. Only characters defined in the JIS code set table can be used with this encoding method.
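For EUC, the usual relationship to JIS is that each byte carries the JIS value with the top bit set. The Python sketch below assumes that this is the "EUC encoding algorithm" the text refers to; `euc_to_jis` is a hypothetical helper, and no check is made against the JIS code set table.

```python
def euc_to_jis(e1: int, e2: int) -> int:
    """Convert an EUC byte pair to a 16-bit JIS code by clearing the top bits."""
    if not (0x80 <= e1 <= 0xFF and 0x80 <= e2 <= 0xFF):
        raise ValueError("both EUC bytes must be in the upper half")
    return ((e1 & 0x7F) << 8) | (e2 & 0x7F)
```

Under this assumption, the EUC pair 16#A4# 16#A2# corresponds to the JIS code 16#2422#.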
UTF-8 Coding
A wide character is represented using UCS Transformation Format 8 (UTF-8) as defined in Annex R of ISO 10646-1/Am.2. Depending on the character value, the representation is a one-, two-, or three-byte sequence:
16#0000#-16#007f#: 2#0xxxxxxx#
16#0080#-16#07ff#: 2#110xxxxx# 2#10xxxxxx#
16#0800#-16#ffff#: 2#1110xxxx# 2#10xxxxxx# 2#10xxxxxx#
where the xxx bits correspond to the left-padded bits of the 16-bit character value. Note that all lower half ASCII characters are represented as ASCII bytes and all upper half characters and other wide characters are represented as sequences of upper-half characters. (The full UTF-8 scheme allows for encoding 31-bit characters as 6-byte sequences, and the use of these sequences is documented in the following section on wide wide characters.)
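The three bit patterns for 16-bit values can be written out directly. The following Python sketch is an illustration only; `utf8_encode_16` is a hypothetical name, and the function handles just the one-, two-, and three-byte forms described here, not the longer sequences used for wide wide characters.

```python
def utf8_encode_16(code: int) -> bytes:
    """Encode a 16-bit character value as a 1-, 2-, or 3-byte UTF-8 sequence."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code must fit in 16 bits")
    if code <= 0x7F:
        # 2#0xxxxxxx#
        return bytes([code])
    if code <= 0x7FF:
        # 2#110xxxxx# 2#10xxxxxx#
        return bytes([0xC0 | (code >> 6), 0x80 | (code & 0x3F)])
    # 2#1110xxxx# 2#10xxxxxx# 2#10xxxxxx#
    return bytes([
        0xE0 | (code >> 12),
        0x80 | ((code >> 6) & 0x3F),
        0x80 | (code & 0x3F),
    ])
```

Every byte of a multi-byte result has its top bit set, which is why upper half and other wide characters always appear as sequences of upper-half bytes.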
Brackets Coding
In this encoding, a wide character is represented by the following eight-character sequence:
[ " a b c d " ]
where a, b, c, d are the four hexadecimal characters (using uppercase letters) of the wide character code. For example, ["A345"] is used to represent the wide character with code 16#A345#. It is also possible (though not required) to use the Brackets coding for upper half characters. For example, the code 16#A3# can be represented as ["A3"].
This scheme is compatible with use of the full Wide_Character set, and is also the method used for wide character encoding in some standard ACATS (Ada Conformity Assessment Test Suite) distributions.
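A small Python sketch of the brackets notation follows. The names `brackets_encode` and `brackets_decode` are hypothetical and not part of GNAT. Choosing the two-digit form for codes in the range 16#80#-16#FF# is a decision made for this illustration, since the text says only that the short form is possible for upper half characters.

```python
import re

# Two or four uppercase hex digits between [" and "].
_BRACKETS = re.compile(r'\["([0-9A-F]{2}|[0-9A-F]{4})"\]')


def brackets_encode(code: int) -> str:
    """Return the brackets form of a 16-bit code.

    Assumption for this sketch: upper half codes (16#80#-16#FF#) use the
    optional two-digit form; every other code uses four digits.
    """
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code must fit in 16 bits")
    width = 2 if 0x80 <= code <= 0xFF else 4
    return '["' + format(code, "0{}X".format(width)) + '"]'


def brackets_decode(seq: str) -> int:
    """Return the character code for a complete brackets sequence."""
    match = _BRACKETS.fullmatch(seq)
    if match is None:
        raise ValueError("not a brackets-encoded character")
    return int(match.group(1), 16)
```

Because the sequence uses only ordinary ASCII characters, it survives tools and transports that are not 8-bit clean, which is one plausible reason it suits test suite distributions.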