Character Set Control - GNAT User's Guide for Native Platforms

4.3.11 Character Set Control

-gnati`c'

Normally GNAT recognizes the Latin-1 character set in source program identifiers, as described in the Ada Reference Manual. This switch causes GNAT to recognize alternate character sets in identifiers. c is a single character indicating the character set, as follows:

`1'

ISO 8859-1 (Latin-1) identifiers

`2'

ISO 8859-2 (Latin-2) letters allowed in identifiers

`3'

ISO 8859-3 (Latin-3) letters allowed in identifiers

`4'

ISO 8859-4 (Latin-4) letters allowed in identifiers

`5'

ISO 8859-5 (Cyrillic) letters allowed in identifiers

`9'

ISO 8859-15 (Latin-9) letters allowed in identifiers

`p'

IBM PC letters (code page 437) allowed in identifiers

`8'

IBM PC letters (code page 850) allowed in identifiers

`f'

Full upper-half codes allowed in identifiers

`n'

No upper-half codes allowed in identifiers

`w'

Wide-character codes (that is, codes greater than 255) allowed in identifiers

See Foreign Language Representation for full details on the implementation of these character sets.
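
As a rough illustration of why the selected set matters, the following sketch maps the code-page selectors above to Python codec names and asks whether a given upper-half code is a letter in that set. The names `GNATI_CODECS` and `is_upper_half_letter` are hypothetical, and the letter test relies on Unicode's classification via `str.isalpha`, which may not match GNAT's own tables exactly. The selectors `f', `n' and `w' do not name a single code page and are deliberately left out.

```python
# Hypothetical mapping from the -gnati selector to a Python codec name.
# Only selectors that name one concrete 8-bit code page are covered.
GNATI_CODECS = {
    "1": "iso8859_1",
    "2": "iso8859_2",
    "3": "iso8859_3",
    "4": "iso8859_4",
    "5": "iso8859_5",
    "9": "iso8859_15",
    "p": "cp437",
    "8": "cp850",
}


def is_upper_half_letter(byte_value: int, selector: str) -> bool:
    """Approximate whether an upper-half code is a letter in the selected set.

    Assumption: Unicode's idea of a letter stands in for GNAT's tables.
    """
    if not 0x80 <= byte_value <= 0xFF:
        raise ValueError("expected an upper-half code (16#80#..16#FF#)")
    try:
        ch = bytes([byte_value]).decode(GNATI_CODECS[selector])
    except UnicodeDecodeError:
        return False  # position is unassigned in this code page
    return ch.isalpha()
```

For example, code 16#B1# is a letter under selector `2' (Latin-2) but a plus-minus sign under selector `1' (Latin-1), so the same source bytes can be a legal identifier under one setting and not under the other.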

-gnatW`e'

Specify the method of encoding for wide characters. e is one of the following:

`h'

Hex encoding (brackets encoding also recognized)

`u'

Upper half encoding (brackets encoding also recognized)

`s'

Shift/JIS encoding (brackets encoding also recognized)

`e'

EUC encoding (brackets encoding also recognized)

`8'

UTF-8 encoding (brackets encoding also recognized)

`b'

Brackets encoding only (default value)

For full details on these encoding methods see Wide_Character Encodings. Note that brackets encoding is always accepted, even if one of the other options is specified, so for example `-gnatW8' specifies that both brackets and UTF-8 encodings will be recognized. The units that are with'ed directly or indirectly will be scanned using the specified representation scheme, so if one of the non-brackets schemes is used, it must be used consistently throughout the program. However, since brackets encoding is always recognized, it may be conveniently used in standard libraries, allowing these libraries to be used with any of the available encoding schemes.

Note that brackets encoding only applies to program text. Within comments, brackets are considered to be normal graphic characters, and bracket sequences are never recognized as wide characters.
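
The distinction between program text and comments can be sketched as follows. This is a simplified model, not GNAT's scanner: it assumes the four-hex-digit form of brackets notation, for example ["03C0"], as described in Wide_Character Encodings, and it treats the first `--` on a line as the start of a comment, so a string literal that itself contains `--` would be handled incorrectly. The function name `decode_brackets` is hypothetical.

```python
import re

# Assumed notation: ["XXXX"] with exactly four hex digits.
# The alternative "--" marks the start of an Ada comment.
_TOKEN = re.compile(r'\["([0-9A-Fa-f]{4})"\]|--')


def decode_brackets(line: str) -> str:
    """Decode brackets sequences in program text; leave comments untouched."""
    out = []
    pos = 0
    for match in _TOKEN.finditer(line):
        if match.group(0) == "--":
            break  # the rest of the line is a comment: copy it verbatim
        out.append(line[pos:match.start()])
        out.append(chr(int(match.group(1), 16)))
        pos = match.end()
    out.append(line[pos:])
    return "".join(out)
```

Applied to a line whose identifier and trailing comment both contain ["03C0"], only the occurrence in the identifier is turned into the corresponding wide character. The occurrence in the comment stays as ordinary graphic characters.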

If no `-gnatW?' switch is present, the default representation is normally brackets encoding only. However, if the first three bytes of the file are 16#EF# 16#BB# 16#BF# (the standard byte order mark, or BOM, for UTF-8), these three bytes are skipped and the default representation for the file is set to UTF-8.
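
The BOM rule amounts to a prefix check on the raw file contents. A minimal sketch, assuming the file has already been read into memory as bytes, with `default_encoding` as a hypothetical helper name:

```python
from typing import Tuple

UTF8_BOM = b"\xef\xbb\xbf"  # 16#EF# 16#BB# 16#BF#


def default_encoding(source: bytes) -> Tuple[str, bytes]:
    """Return (default encoding, bytes left to scan) when no -gnatW switch is given."""
    if source.startswith(UTF8_BOM):
        # The BOM is skipped and the file defaults to UTF-8.
        return "utf-8", source[len(UTF8_BOM):]
    # No BOM: brackets encoding only.
    return "brackets", source
```

The labels "utf-8" and "brackets" are illustrative strings only and are not values used by the compiler.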

Note that the wide character representation that is specified (explicitly or by default) for the main program also acts as the default encoding used for Wide_Text_IO files if not specifically overridden by a WCEM form parameter.

When no `-gnatW?' switch is specified, characters (other than wide characters represented using brackets notation) are treated as 8-bit Latin-1 codes. The codes recognized are the Latin-1 graphic characters and the ASCII format effectors (CR, LF, HT, VT). Other lower half control characters in the range 16#00#..16#1F# are not accepted in program text or in comments. Upper half control characters (16#80#..16#9F#) are rejected in program text, but allowed and ignored in comments. Note in particular that the Next Line (NEL) character, whose encoding is 16#85#, is not recognized as an end of line in this default mode. If your source program contains instances of the NEL character used as a line terminator, you must use UTF-8 encoding for the whole source program. In default mode, all lines must be terminated by a standard end of line sequence (CR, CR/LF, or LF).
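
The default-mode rules can be summarized as a small classification function. The sketch below encodes only what the paragraph above states. The function name `accepted_in_default_mode` is hypothetical, and the handling of 16#7F# (DEL), which the text does not mention, is an assumption.

```python
FORMAT_EFFECTORS = {0x09, 0x0A, 0x0B, 0x0D}  # HT, LF, VT, CR


def accepted_in_default_mode(code: int, in_comment: bool) -> bool:
    """Say whether an 8-bit code is accepted when no -gnatW switch is given."""
    if not 0x00 <= code <= 0xFF:
        raise ValueError("expected an 8-bit code")
    if code in FORMAT_EFFECTORS:
        return True
    if code <= 0x1F:
        return False  # other lower half controls: rejected everywhere
    if code == 0x7F:
        return False  # assumption: DEL is not a graphic character
    if 0x80 <= code <= 0x9F:
        return in_comment  # upper half controls: comments only
    return True  # Latin-1 graphic character
```

Under this model NEL (16#85#) is rejected in program text and tolerated in a comment, but in neither case does it end the line.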

Note that the convention of simply accepting all upper half characters in comments means that programs that use standard ASCII for program text but UTF-8 encoding for comments are accepted in default mode, provided that the comments are ended by an appropriate line terminator (CR, CR/LF, or LF). This is a common mode for many programs with foreign language comments.
