Earlier in this section we looked at the encoding scheme used for the multibyte characters that are needed to represent Asian-language ideograms. We noted that because single-byte characters can be intermixed with multibyte characters, the sequence of bytes needed to encode an ideogram must be self-identifying: regardless of the supplementary code set used, each byte of a multibyte character will have the high-order bit set. In this way, any byte of a multibyte character can always be distinguished from a member of the primary, 7-bit US ASCII code set, whose high-order bit is not set (or "0"). If code sets 2 or 3 are used, each multibyte character will also be preceded by a shift byte; that is, if code set 1 were dedicated to a single-byte character set, either of code sets 2 or 3 could be used to represent multibyte characters. Given some set of these encodings, then any program interested in the next character will be able to determine whether the next byte represents a single-byte character or the first byte of a multibyte character. If the latter, then the program will have to retrieve bytes until the character is complete.
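The byte-level test described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `euc_widths` structure and the `euc_char_len` function are hypothetical names, the per-code-set widths are supplied by the caller because they depend on the locale, and the shift bytes are taken to be the conventional EUC values SS2 = 0x8e and SS3 = 0x8f.

```c
#include <stddef.h>

#define SS2 0x8e  /* single-shift 2: precedes a code set 2 character */
#define SS3 0x8f  /* single-shift 3: precedes a code set 3 character */

/* Hypothetical description of a locale: bytes per character in each
 * supplementary code set, not counting the shift byte. */
struct euc_widths {
    int cs1;
    int cs2;
    int cs3;
};

/* Return the number of bytes occupied by the character that starts at
 * s[0] (shift byte included), 0 at the terminating null byte, or -1 if
 * a byte that should belong to a multibyte character lacks the
 * high-order bit. */
int euc_char_len(const unsigned char *s, const struct euc_widths *w)
{
    int first;  /* index of the first byte after any shift byte */
    int total;  /* total length including any shift byte */
    int i;

    if (s[0] == '\0')
        return 0;
    if ((s[0] & 0x80) == 0)
        return 1;               /* code set 0: 7-bit US ASCII */

    if (s[0] == SS2) {
        first = 1;
        total = w->cs2 + 1;
    } else if (s[0] == SS3) {
        first = 1;
        total = w->cs3 + 1;
    } else {
        first = 0;
        total = w->cs1;
    }

    /* Every remaining byte must have the high-order bit set. A null
     * byte fails this test, so the scan never runs past the end of a
     * null-terminated string. */
    for (i = first; i < total; i++)
        if ((s[i] & 0x80) == 0)
            return -1;
    return total;
}
```

A caller would step through a buffer by adding each positive return value to its pointer and treating -1 as a malformed sequence.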
Some of the inconvenience of handling multibyte characters would be eliminated, of course, if all characters were a uniform number of bytes. ANSI C provides the implementation-defined integral type wchar_t to let you manipulate variable-width characters as uniformly sized data objects called wide characters. Since there can be thousands or tens of thousands of ideograms in an Asian-language set, programs should use a 32-bit integral value to hold all members. wchar_t is defined in the headers <stdlib.h> and <wchar.h> as a typedef of a 32-bit signed integer.
Implementations provide appropriate libraries with functions that you can use to manage multibyte and wide characters. We will look at these functions below.
For each wide character there is a corresponding EUC representation and vice versa; the wide character that corresponds to a regular single-byte character has the same numeric value as its single-byte value, including the null character. There is no guarantee that the value of the macro EOF can be stored in a wchar_t, just as EOF might not be representable as a char.
EUC and corresponding 32-bit wide-character representation
| Code set | EUC code representation | Wide-character representation |
|---|---|---|
| 0 | 0xxxxxxx | 0000000000000000000000000xxxxxxx |
| 1 | 1xxxxxxx | 0011000000000000000000000xxxxxxx |
| | 1xxxxxxx1xxxxxxx | 001100000000000000xxxxxxxxxxxxxx |
| | 1xxxxxxx1xxxxxxx1xxxxxxx | 00110000000xxxxxxxxxxxxxxxxxxxxx |
| 2 | SS2 1xxxxxxx | 0001000000000000000000000xxxxxxx |
| | SS2 1xxxxxxx1xxxxxxx | 000100000000000000xxxxxxxxxxxxxx |
| | SS2 1xxxxxxx1xxxxxxx1xxxxxxx | 00010000000xxxxxxxxxxxxxxxxxxxxx |
| 3 | SS3 1xxxxxxx | 0010000000000000000000000xxxxxxx |
| | SS3 1xxxxxxx1xxxxxxx | 001000000000000000xxxxxxxxxxxxxx |
| | SS3 1xxxxxxx1xxxxxxx1xxxxxxx | 00100000000xxxxxxxxxxxxxxxxxxxxx |
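To make the bit layout in the table concrete, here is a rough sketch of the packing it describes: the low seven bits of each non-shift byte are concatenated, and the top four bits record the code set. The function name `euc_to_wide` and the `WC_CS*` macros are hypothetical, SS2 and SS3 are assumed to be 0x8e and 0x8f, and an unsigned long stands in for wchar_t so the bit arithmetic stays portable. In real programs the library conversion functions should do this work.

```c
#include <stddef.h>

#define SS2 0x8e  /* assumed value of the code set 2 shift byte */
#define SS3 0x8f  /* assumed value of the code set 3 shift byte */

/* Code set tags from the top four bits of the table entries. */
#define WC_CS1 0x30000000UL  /* 0011... */
#define WC_CS2 0x10000000UL  /* 0001... */
#define WC_CS3 0x20000000UL  /* 0010... */

/* Pack the n bytes of one complete, well-formed EUC character
 * (including any shift byte) into the 32-bit layout shown in the
 * table. No validation is done; at most three non-shift bytes are
 * expected. */
unsigned long euc_to_wide(const unsigned char *s, size_t n)
{
    unsigned long tag = WC_CS1;
    unsigned long value = 0;
    size_t i = 0;

    /* Code set 0: the wide value equals the single-byte value. */
    if (n == 1 && (s[0] & 0x80) == 0)
        return s[0];

    if (s[0] == SS2) {
        tag = WC_CS2;
        i = 1;              /* skip the shift byte */
    } else if (s[0] == SS3) {
        tag = WC_CS3;
        i = 1;
    }

    /* Drop the high-order bit of each byte and append the other 7. */
    for (; i < n; i++)
        value = (value << 7) | (unsigned long)(s[i] & 0x7f);

    return tag | value;
}
```

For instance, the two-byte code set 1 sequence 0xb0 0xa1 contributes the 7-bit groups 0x30 and 0x21, which pack to 0x1821 under the 0011 tag.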
Most of the functions provided let you convert multibyte characters into wide characters and back again. Before we turn to the functions, we should note that most application programs will not need to convert multibyte characters to wide characters in the first place. Programs such as diff, for example, will read in and write out multibyte characters, needing only to check for an exact byte-for-byte match. More complicated programs such as grep, that use regular expression pattern matching, may need to understand multibyte characters, but only the common set of functions that manages the regular expression needs this knowledge. The program grep itself requires no other special multibyte character handling. Finally, note that except for libc, the libraries described below are archives, not shared objects. They cannot be dynamically linked with your program.
ANSI C provides five library functions that manage multibyte and wide characters:
mblen length of next multibyte character
mbtowc convert multibyte character to wide character
wctomb convert wide character to multibyte character
mbstowcs convert multibyte character string to wide character string
wcstombs convert wide character string to multibyte character string
The first three functions are described on the mbchar(3C) manual page, the last two on the mbstring(3C) page.
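As an illustration of how the single-character functions are typically used, the following sketch walks a multibyte string with mbtowc and stores one wide character per step. The helper name `mb_to_wide` is hypothetical. In a real program you would normally call setlocale first so that the conversion follows the user's EUC locale; without that call the default "C" locale applies.

```c
#include <stddef.h>
#include <stdlib.h>

/* Convert the multibyte string s into out, which has room for max
 * wide characters including the terminator. Returns the number of
 * wide characters stored (not counting the terminator), or
 * (size_t)-1 if max is zero or an invalid sequence is found. */
size_t mb_to_wide(const char *s, wchar_t *out, size_t max)
{
    size_t count = 0;

    if (max == 0)
        return (size_t)-1;

    mbtowc(NULL, NULL, 0);      /* reset any shift state */

    while (*s != '\0' && count + 1 < max) {
        /* Examine at most MB_CUR_MAX bytes; n is the number of
         * bytes consumed by this character. */
        int n = mbtowc(&out[count], s, MB_CUR_MAX);
        if (n < 0)
            return (size_t)-1;
        s += n;
        count++;
    }
    out[count] = L'\0';
    return count;
}
```

mbstowcs performs the same conversion in a single call; the explicit loop is useful mainly when each character has to be examined as it is converted.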
Since most programs will convert between multibyte and wide characters just before or after performing I/O, libc provides routines that let you manage the conversion within the I/O function itself as if the input or output stream were wide characters instead of multibyte characters. fgetwc, for instance, reads bytes from a stream until a complete EUC character has been seen and returns it in its wide-character representation. fgetws does the same thing for strings; fputwc and fputws are the corresponding write versions. Of course, these routines and others are functionally similar to the Intro(3S) functions; they differ only in their handling of EUC representations. See their manual pages for details. Here is a look at how you can expect the functions to work.
Given the following declarations

    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    wchar_t s1[BUFSIZ]; /* declare array s1 to store wide characters */
    char s2[BUFSIZ];    /* declare array s2 of characters for EUC representation */

a multibyte string can be input into s1 using

fgetws:

    fgetws(s1, BUFSIZ, stdin); /* read EUC string from stdin and convert to process code string in s1 */

fgets and mbstowcs:

    fgets(s2, BUFSIZ, stdin);  /* read EUC string from stdin into s2 */
    mbstowcs(s1, s2, BUFSIZ);  /* convert EUC string in s2 to process code string in s1 */

the %S conversion specifier for scanf:

    scanf("%S", s1);           /* read EUC string from stdin and convert to process code string in s1 */

the %s conversion specifier for scanf and mbstowcs:

    scanf("%s", s2);           /* read EUC string from stdin into s2 */
    mbstowcs(s1, s2, BUFSIZ);  /* convert EUC string in s2 to process code string in s1 */
You can use fputws, wcstombs, and the %S conversion specifier for printf (see fprintf(3S)) in the same way for output.
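For the output direction, a minimal sketch of the wcstombs route might look like the following. The helper names `wide_to_mb` and `put_wide` are hypothetical. The main point is that wcstombs does not write a terminating null byte when the destination fills up, so the helper reserves one byte and terminates the buffer itself.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert the wide string ws into the size-byte buffer dst as a
 * null-terminated multibyte string, truncating if necessary.
 * Returns the number of bytes stored (not counting the terminator),
 * or (size_t)-1 if size is zero or a character cannot be converted. */
size_t wide_to_mb(char *dst, size_t size, const wchar_t *ws)
{
    size_t n;

    if (size == 0)
        return (size_t)-1;
    n = wcstombs(dst, ws, size - 1);    /* leave room for '\0' */
    if (n == (size_t)-1)
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}

/* Write the wide string ws to fp in its multibyte representation.
 * Returns a nonnegative value on success and EOF on failure. */
int put_wide(const wchar_t *ws, FILE *fp)
{
    char buf[BUFSIZ];

    if (wide_to_mb(buf, sizeof buf, ws) == (size_t)-1)
        return EOF;
    return fputs(buf, fp);
}
```

Calling fputws(ws, fp) directly has the same effect without the intermediate buffer; the explicit conversion is mostly useful when the multibyte bytes are needed for something other than stream output.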
Single- and multibyte character classification and conversion functions are provided in libc. You can use these routines to test 7-bit US ASCII characters, for instance, in their wide-character representations, or to determine whether multibyte characters are ideograms, phonograms, or the like. See the wctype(3C) and wconv(3C) manual pages for details.
As noted, these routines are declared in the <wchar.h> header.
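A small sketch of classification and conversion on wide characters follows. It uses iswalpha and towupper, which current ISO C declares in <wctype.h>; the system described here may declare its own routines elsewhere, so treat the include lines as an assumption. The function names `count_wide_alpha` and `wide_upper` are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Count the characters in ws that the current locale classifies as
 * alphabetic. */
size_t count_wide_alpha(const wchar_t *ws)
{
    size_t n = 0;

    for (; *ws != L'\0'; ws++)
        if (iswalpha((wint_t)*ws))
            n++;
    return n;
}

/* Convert ws to uppercase in place; characters without an uppercase
 * form are left unchanged. */
void wide_upper(wchar_t *ws)
{
    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towupper((wint_t)*ws);
}
```

Because the tests operate on wide characters, the same loop handles 7-bit US ASCII characters and multibyte characters alike once the input has been converted.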
32-bit versions of certain UNIX System V Release 4 (SVR4) curses functions are provided in libocurses and declared in <ocurses.h>. Check the curses(3ocurses) manual page for some of the things you need to look out for in using these functions.
The POSIX curses library (libcurses) supports the wide character functions specified in the POSIX standard. See Intro(3curses).
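To give even more flexibility to the programmer in an Asian environment, ANSI C provides 32-bit wide character constants and wide string literals. These have the same form as their non-wide versions except that they are immediately prefixed by the letter L: 'x' and '¥' are regular character constants, while L'x' and L'¥' are wide character constants; "abc¥xyz" is a regular string literal, while L"abc¥xyz" is a wide string literal.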
Note that multibyte characters are valid in both the regular and wide versions. The sequence of bytes necessary to produce the ideogram ¥ is encoding-specific, but if it consists of more than one byte, the value of the character constant '¥' is implementation-defined, just as the value of 'ab' is implementation-defined. A regular string literal contains exactly the bytes (except for escape sequences) specified between the quotes, including the bytes of each specified multibyte character.
When the compilation system encounters a wide character constant or wide string literal, each multibyte character is converted (as if by calling the mbtowc function) into a wide character. Thus the type of L'¥' is wchar_t and the type of L"abc¥xyz" is array of wchar_t with length eight. (Just as with regular string literals, each wide string literal has an extra zero-valued element appended, but in these cases it is a wchar_t with value zero.)

Just as regular string literals can be used as a short-hand method for character array initialization, wide string literals can be used to initialize wchar_t arrays:
    wchar_t *wp = L"a¥z";
    wchar_t x[] = L"a¥z";
    wchar_t y[] = {L'a', L'¥', L'z', 0};
    wchar_t z[] = {'a', L'¥', 'z', '\0'};

In the above example, the three arrays x, y and z, as well as the array pointed to by wp, have the same length and all are initialized with identical values.
Adjacent wide string literals will be concatenated, just as with regular string literals. Adjacent regular and wide string literals produce undefined behavior.
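The points about length, initialization, and concatenation can be checked with a short sketch. It uses only 7-bit US ASCII characters so that the result does not depend on the source file's encoding; the array and function names are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Three ways to build the same wide string. */
static const wchar_t from_literal[] = L"abcxyz";
static const wchar_t from_elements[] = {
    L'a', L'b', L'c', L'x', L'y', L'z', 0
};
static const wchar_t from_adjacent[] = L"abc" L"xyz"; /* concatenated */

/* Number of wchar_t elements in from_literal, counting the
 * zero-valued element appended by the compiler. */
size_t wide_literal_elems(void)
{
    return sizeof from_literal / sizeof from_literal[0];
}

/* Return 1 if all three arrays have the same size and contents,
 * 0 otherwise. */
int wide_literals_agree(void)
{
    return sizeof from_literal == sizeof from_elements
        && sizeof from_literal == sizeof from_adjacent
        && wcscmp(from_literal, from_elements) == 0
        && wcscmp(from_literal, from_adjacent) == 0;
}
```

Six characters plus the appended zero-valued element give seven elements, in the same way that L"abc¥xyz" above has length eight.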