Once more unto the breach, dear friends.
--- William Shakespeare, Henry V, III:1
Once upon a time -- last month, to be exact -- we had
been jarred into talking about internationalization (often
abbreviated I18N) by some work that Copeland is doing at Softway
Systems, in preparing their Interix System for Unix branding. We
were explaining the steps you must take to enable your anglophone
software to handle characters other than ASCII and languages
other than English. In some ways, it's a cookbook problem,
because the patterns repeat themselves. On the other hand, we
may need to consider our software a little more closely because
some algorithms appropriate for a character set with 128
elements just won't work for one with 6400 ideographs. This is
an application of the Hawker Observation: you need to seriously
rethink the algorithm you're using every time you increase your
data set by two orders of magnitude.
To review quickly: in most cases you'll read data into
your program with each character represented by one or more
bytes. You'll convert the multibyte characters into the standard
C wchar_t data type using the mbtowc()
interface. After that, you'll often be able to process the
wchar_ts using similar algorithms to your existing ones,
but using different interfaces.
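As a minimal sketch of that first step, assuming the program has already called setlocale(LC_ALL, "") so that mbtowc() knows the current encoding, a loop like the following turns a multibyte string into wchar_ts one character at a time. The name decode_chars and its error convention are ours for illustration, not part of any standard.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert the multibyte string src into wide
 * characters in dst, which has room for dstlen elements including
 * the terminator. Returns the number of characters converted, or
 * (size_t)-1 if src holds an invalid sequence or dst is too small.
 */
size_t decode_chars(wchar_t *dst, size_t dstlen, const char *src)
{
    size_t count = 0;

    if (dstlen == 0)
        return (size_t)-1;

    mbtowc(NULL, NULL, 0);              /* reset any shift state */
    while (*src != '\0') {
        int len;

        if (count + 1 >= dstlen)        /* keep room for the terminator */
            return (size_t)-1;
        len = mbtowc(&dst[count], src, MB_CUR_MAX);
        if (len < 0)                    /* invalid multibyte sequence */
            return (size_t)-1;
        src += len;                     /* skip the bytes just consumed */
        count++;
    }
    dst[count] = L'\0';
    return count;
}
```

When there is nothing to do per character, mbstowcs() converts the whole string in one call; the explicit loop is shown because it is the natural place to hang character-by-character processing.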
More Character Strings.
When we left you at the battlements at the end of last
month's column, we had just finished talking about collating
sequences. After that complicated problem, the remaining string-processing interfaces are fairly straightforward.
For example, in the i18n environment -- as long as we
are sure we have a single-byte character and have called
setlocale() to set up the language specifics -- we can
still use the normal macros from <ctype.h>. In
other words, islower(ü) returns true even though
ü isn't in the ASCII range. But we also have wide-character versions of those same macros defined in
<wctype.h>: if we have a wide-character
wc containing that same u-umlaut, iswlower(wc)
will be true. Even towlower() and towupper()
do the expected thing.
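To make the wide-character classification calls concrete, here is a small sketch that upper-cases a wide string in place. The function name upcase_wide is hypothetical, and which characters count as lower case depends entirely on the locale that setlocale() has installed.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: upper-case every lower-case character in the
 * wide string s, according to the current locale. Returns the number
 * of characters that were changed.
 */
size_t upcase_wide(wchar_t *s)
{
    size_t changed = 0;

    for (; *s != L'\0'; s++) {
        if (iswlower((wint_t)*s)) {
            *s = (wchar_t)towupper((wint_t)*s);
            changed++;
        }
    }
    return changed;
}
```

In a locale that knows about u-umlaut, the same loop handles it without any special case; in the default "C" locale only the ASCII letters are affected.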
That still leaves us with the interfaces for handling
full strings rather than single characters. Again, things work
in a reasonable fashion. There are wide-character equivalents
for the major string functions defined in
<string.h>. For example, wcscpy(a,b)
copies the wchar_t *b into a, and returns the
pointer to a.
This means that if you have a code fragment for assembling a string such as:
char *result, *whole, *fraction;
strcpy(result, whole);
strcat(result, ".");
strcat(result, fraction);
you can now use the analogous:
wchar_t *result, *whole, *fraction;
wchar_t wradix[10];
char *radix;
wcscpy(result, whole);
radix = localeconv()->decimal_point;
mbstowcs(wradix, radix, 10);
wcscat(result, wradix);
wcscat(result, fraction);
Notice that the code is not completely the same. We
can't assume that a foreign language uses the same marker for
decimal point as we do in English. That locale-specific radix
character is available from the locale as a multi-byte string in
the lconv structure whose pointer is returned by
localeconv(). Notice that we're careful about
converting the multi-byte string to a wide character string
before we append it to our result. (Yes, if this was in the
inner loop of our program we'd prepare the wide-character version
of the radix outside the loop.)
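The fragment above leaves out storage and error checking. One way to flesh it out, offered as a sketch rather than the only way, is a helper that checks both the radix conversion and the size of the destination. The name join_decimal and its return convention are our own assumptions.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

#define RADIX_MAX 10    /* assumed upper bound on the radix length */

/*
 * Hypothetical helper: build "whole<radix>fraction" in result, which
 * has room for size wide characters. The radix comes from the current
 * locale. Returns 0 on success, -1 if the radix cannot be converted
 * or the result would not fit.
 */
int join_decimal(wchar_t *result, size_t size,
                 const wchar_t *whole, const wchar_t *fraction)
{
    wchar_t wradix[RADIX_MAX];
    size_t n = mbstowcs(wradix, localeconv()->decimal_point, RADIX_MAX);

    /* n == RADIX_MAX would mean the terminator did not fit */
    if (n == (size_t)-1 || n >= RADIX_MAX)
        return -1;
    if (wcslen(whole) + n + wcslen(fraction) + 1 > size)
        return -1;

    wcscpy(result, whole);
    wcscat(result, wradix);
    wcscat(result, fraction);
    return 0;
}
```

In the default "C" locale the radix is a period, so joining L"3" and L"14" gives L"3.14"; a locale whose decimal point is a comma would give L"3,14" from the same call.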
Input and Output.
It's all well-and-good to have strings, even strings
with European or Asian characters in them, but how can we get
them in and out of our programs?
Again, by analogy, we have %lc and %ls specifiers to printf() and scanf(). This lets us print the string we assembled above with a line such as
printf("%d %ls\n", n++, result);
Notice that %ls (or the equivalent %S) converts
the wide-character string to its multi-byte form before writing
it. In other words, the multi-byte form is the normal external
one, and the wide-character version is used for internal
processing only. If you think about it, that makes perfect
sense: The mapping from multi-byte to wide-character is
implementation dependent and hence may not be portable between
systems. For example, while Solaris uses a 32-bit wide-character
representation, Interix uses 16 bits, so that if Haemer writes a
document in Japanese wide-characters on his Sun server, Copeland
will be unable to read it from his Interix machine. However, we
can easily exchange the document in the shift-JIS codeset, which
is one of the common Japanese multi-byte representations.
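The conversion to external form can be seen by formatting into a char buffer rather than straight to a file. The following sketch, with the hypothetical name format_numbered, uses snprintf() with %ls so that the wide string is turned into multibyte characters on the way out.

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Hypothetical helper: write "n text" into buf as multibyte
 * characters, converting the wide string with %ls. Returns the number
 * of bytes written, or -1 if the conversion failed or the output did
 * not fit in size bytes.
 */
int format_numbered(char *buf, size_t size, int n, const wchar_t *text)
{
    int len = snprintf(buf, size, "%d %ls", n, text);

    if (len < 0 || (size_t)len >= size)
        return -1;
    return len;
}
```

The bytes left in buf are in whatever multibyte encoding the current locale uses, which is the form we would write to a file or hand to another system.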
But how do we get Japanese (or Chinese, or Korean)
characters into the computer in the first place? That varies
from system to system, and is not covered by any standard.
However, in general, there is something called an input method
editor, or IME. In rough outline, an IME provides a way to enter
a word, which can be translated into an ideograph. In the case
of Japanese, we generally have a keyboard with two shift keys.
One shifts from lower to upper case, and the other shifts the
keyboard into kana mode, which allows the keys to type hiragana
or katakana, the Japanese phonetic characters. When we enter a
word in hiragana, we are then presented with the possible
Japanese kanji characters for that word -- remember that there is
a one-to-many mapping in Japanese both for words into kanji (the
name ``Yoshihara'' may have several different renditions in
kanji) and kanji into words (a given kanji may have several
different readings). After we've chosen the correct kanji
rendition, it is entered into the file. Again, there is no
standard way to do this, so your mileage will certainly vary.
There is one last vital consideration for output. The
order of words in a sentence differs from language to language;
for example, ``yellow flower'' becomes ``la flor amarilla'' en
Español. The differences between English and German are
equally dramatic. To solve this problem, the standard
printf() lets the format string name its arguments by position,
so they can be consumed in any order. For example, if we have
printf(fmt, month, date, who); we can use an
fmt of %s %d is %s's birthday in English to
produce ``April 26 is James's birthday'' and %3$s's
geburtstag ist %2$d. %1$s in German to generate ``James's
geburtstag ist 26. April.'' Notice that qualifiers like
2$ allow us to reorder the parameters to account for
differing word order in different languages.
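Here is the birthday example as a compilable sketch. The wrapper name format_birthday is hypothetical, and the format strings are the two quoted above; in a real program they would come from a trusted source such as a message catalog, never from user input.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format a birthday announcement using fmt,
 * which may pick its arguments by position (%1$s is month, %2$d is
 * date, %3$s is who). Returns the length written, or -1 on error or
 * truncation.
 */
int format_birthday(char *buf, size_t size, const char *fmt,
                    const char *month, int date, const char *who)
{
    int len = snprintf(buf, size, fmt, month, date, who);

    if (len < 0 || (size_t)len >= size)
        return -1;
    return len;
}
```

The argument list never changes; only the format string does. Note that a single format must use either positional specifiers throughout or none at all, so the two styles cannot be mixed within one string.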
Message Catalogs.
Allowing for different format strings is very nice, but
how do we provide those in the program? We use the mechanism of
message catalogs, which we'll only cover in outline here.
Catalogs are a collection of text strings that your program loads at run time, rather than having them stored in the executable itself. Generally, we find the catalogs by using the NLSPATH environment variable, which tells the catopen() interface where to look for the catalog, based on current settings for environment variables like LANG. Our program then looks up individual strings with the catgets() interface. A typical use is something like:
printf( catgets(cat,msg_set,msg_id,"%s %d"),
mon, day );
In other words, we look up a particular message in catalog
cat, and use %s %d if we can't find it.
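A slightly fuller sketch shows the catalog being opened and closed around the lookup. The set and message numbers and the function name are invented for illustration; a real program would take them from the header generated alongside its catalog, and would normally open the catalog once at startup rather than on every call.

```c
#include <nl_types.h>
#include <stdio.h>

#define MSG_SET  1      /* hypothetical set number */
#define MSG_DATE 1      /* hypothetical message number */

/*
 * Hypothetical helper: format "mon day" using a format string looked
 * up in the named catalog, falling back to "%s %d" when the catalog
 * or the message is missing. Returns the length written or -1.
 */
int format_date_message(char *buf, size_t size, const char *catalog_name,
                        const char *mon, int day)
{
    nl_catd cat = catopen(catalog_name, NL_CAT_LOCALE);
    /* catgets() hands back the default string if the lookup fails */
    const char *fmt = catgets(cat, MSG_SET, MSG_DATE, "%s %d");
    int len = snprintf(buf, size, fmt, mon, day);

    /* fmt may point into the catalog, so close only after using it */
    if (cat != (nl_catd)-1)
        catclose(cat);
    if (len < 0 || (size_t)len >= size)
        return -1;
    return len;
}
```

Because the fallback is compiled in, the program still produces sensible English output on a system where no catalog has been installed.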
In many existing programs, you'll find an amazing
variety of syntactic sugar to hide the complicated
catgets() call. The most common is the
_("message") syntax used in many GNU utilities.
One last warning about text strings: You should remember that plurals are different from language to language, so the familiar English fragment,
printf("%d error%s", n, (n==1)?"":"s");
won't work.
Translation of text strings is a complicated process.
Whole organizations whose ostensible purpose is
internationalization actually spend most of their time providing
translation services.
Time.
How to get time values sensibly printed is a real
problem, given the different names for months and days of the
week, different cultural requirements for format of the date, and
different numbering schemes. These problems are all subsumed by
the strftime() interface. It takes a buffer pointer, a
size, a format and a pointer to a tm structure, as
returned (for example) by localtime(), and generates a
formatted string into the buffer, returning the length of the
result.
The format specifiers for strftime are more numerous and every bit as complicated as those for printf; however, many of the specifiers will be familiar if you've used alternate formats from the date command. For example,
strftime(buf, SZ, "%A %d %B %Y", tm);
produces ``Thursday 07 February 1985.'' The important point is that if I have LANG set to some locale other than POSIX, I can equally easily generate a string like ``Donnerstag 07 Februar 1985.''
One of the complications for strftime is that to meet the X/Open specifications it must also handle dates based on eras, the best-known example of which is the Japanese Imperial date. In Japan, the year we think of as 1985 was ``Shouwa 60'' or the sixtieth year of Hirohito's reign. To produce a date like this, we say
strftime(buf, SZ, "%A %d %B %EY", tm);
which in a Japanese locale results in ``mokuyoubi 07 2gatsu 60 shouwanen,'' in the appropriate kanji characters. In a locale without information for era dating, the normal Gregorian year is supplied.
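Both calls can be wrapped in one small function for experimenting. This is only a sketch: the name format_long_date and the use_era flag are ours, and what comes out depends on the locale installed by setlocale().

```c
#include <time.h>

/*
 * Hypothetical helper: format tm as "weekday day month year" in the
 * current locale. When use_era is nonzero the year is requested in
 * the locale's era notation (%EY), which falls back to the ordinary
 * year in locales that define no eras.
 */
size_t format_long_date(char *buf, size_t size,
                        const struct tm *tm, int use_era)
{
    const char *fmt = use_era ? "%A %d %B %EY" : "%A %d %B %Y";

    /* strftime() returns 0 if the result does not fit in size bytes */
    return strftime(buf, size, fmt, tm);
}
```

In the default "C" locale both variants give the same English string, since that locale carries no era data; the difference only shows up after setlocale() selects a locale, such as a Japanese one, that does.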
For the inverse problem -- I have a string and I need
the tm structure it represents -- we have the
strptime() interface. It takes as arguments a buffer
and a format specifier similar to strftime's, and fills
in the given tm with the information it is able to glean
about the date from the string.
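A minimal sketch of the inverse direction follows. The wrapper name parse_long_date is hypothetical, and the _XOPEN_SOURCE definition is an assumption about what your system needs in order to declare strptime(); some systems declare it without any such macro.

```c
#define _XOPEN_SOURCE 700   /* assumed necessary to expose strptime() */
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: parse text such as "Thursday 07 February 1985"
 * (names in the current locale) into *out. Returns 0 on success, or
 * -1 if the text does not match the format or has trailing junk.
 */
int parse_long_date(const char *text, struct tm *out)
{
    const char *end;

    memset(out, 0, sizeof *out);    /* strptime() sets only what it parses */
    end = strptime(text, "%A %d %B %Y", out);
    if (end == NULL || *end != '\0')
        return -1;
    return 0;
}
```

Clearing the structure first matters because strptime() only touches the fields named in the format; anything else would otherwise be left holding garbage.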
Even more amazing is the getdate() interface.
While it isn't supplied in every system -- it's a relatively
recent addition to the standards -- getdate allows us to
check a variety of different date format strings so that we need
not exactly know the format of the date we're trying to parse
a priori. Getdate does this magic by referring
to a file of possible date formats, and parsing the given buffer
based on the first format it finds to match. Different
applications can use different files of date formats, based on
the DATEMSK environment variable.
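As a rough sketch of how the pieces fit together, the following writes a two-line template file, points DATEMSK at it, and lets getdate() pick the matching format. The function names and the particular templates are our own assumptions, and a real application would ship its template file rather than create it at run time.

```c
#define _XOPEN_SOURCE 700   /* assumed necessary to expose getdate() */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: write one strptime-style template per line. */
int write_templates(const char *path)
{
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    fputs("%A %d %B %Y\n", fp);     /* Thursday 07 February 1985 */
    fputs("%Y-%m-%d\n", fp);        /* 1985-02-07 */
    return fclose(fp) == 0 ? 0 : -1;
}

/*
 * Hypothetical helper: parse text against the templates in
 * template_path. Returns 0 and fills *out on success, -1 if DATEMSK
 * could not be set, or the positive getdate_err code on failure.
 */
int parse_any_date(const char *template_path, const char *text,
                   struct tm *out)
{
    struct tm *tm;

    if (setenv("DATEMSK", template_path, 1) != 0)
        return -1;
    tm = getdate(text);             /* result lives in static storage */
    if (tm == NULL)
        return getdate_err;
    *out = *tm;
    return 0;
}
```

Fields that the matching template does not mention, such as the time of day for a date-only template, are filled in by getdate() from the current time, so it is worth checking which fields you actually rely on.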
Numbers and Money.
We've talked about the interfaces for character class,
collation, strings, messages, and time. We haven't yet talked
about how we use the LC_NUMERIC and LC_MONETARY
locale categories. Their data is relatively sparse, and slightly
overlapping.
The LC_NUMERIC category tells us what
characters to use for the decimal point and thousands separator
in the current locale. This allows us to change 1,789.456 in the
US to 1 789,456 in France. Our old friend printf
already understands the decimal point, but to make it use the
thousands separator, we need to use the %' modifier.
The number we show above is rendered with a format like
%'.3f.
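A short sketch of the grouping flag in use follows; the wrapper name format_grouped is hypothetical. The apostrophe flag is an X/Open extension rather than ISO C, so a strictly pedantic compiler may warn about it.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format value with three decimal places and
 * the current locale's thousands grouping. Returns the length
 * written, or -1 on error or truncation.
 */
int format_grouped(char *buf, size_t size, double value)
{
    int len = snprintf(buf, size, "%'.3f", value);

    if (len < 0 || (size_t)len >= size)
        return -1;
    return len;
}
```

In the default "C" locale there is no thousands separator, so 1789.456 comes out ungrouped; grouping only appears once setlocale() has selected a locale that defines a separator.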
Similarly, monetary quantities suffer from a large
number of cultural variations. Many of them are handled in the
strfmon() interface. Like strftime(), it takes
a buffer and size, a format and arguments, and returns the length
of the character data placed in the buffer. The format specifier
allows us to decide whether to use the local currency sign (like
a dollar sign) or the international currency name (like ``USD''),
how many digits of pence or cents or pfennigs to include past the
decimal point, whether to include thousands separators, what kind
of fill characters to use for that check-like look, and whether
to use a minus sign or parentheses when our checkbook balance looks
like the government's.
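A minimal sketch of a strfmon() call, choosing between the national format (%n) and the international one (%i), might look like the following. The wrapper name format_money and its flag are our own inventions.

```c
#include <monetary.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format amount as money in the current locale.
 * A nonzero international selects the %i form (currency name such as
 * USD); otherwise %n is used (local currency sign). Returns the
 * number of bytes placed in buf, or -1 on failure.
 */
ssize_t format_money(char *buf, size_t size, double amount,
                     int international)
{
    const char *fmt = international ? "%i" : "%n";

    return strfmon(buf, size, fmt, amount);
}
```

The exact text depends on the locale's LC_MONETARY data: the "C" locale defines no currency symbol at all, so the interesting output only appears after setlocale() selects a real national locale.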
While numeric and monetary items use similar data, they
use orthogonal locale categories because money may have a
different format than other numbers. Also, strfmon()
doesn't use all the information provided in the locale. To get
the other data -- for example, whether the minus sign precedes
the currency symbol or follows it -- we can use the data returned
by the localeconv() interface.
Localeconv, which we mentioned briefly above,
returns a pointer to an lconv structure which contains
all of the data in the LC_NUMERIC and
LC_MONETARY categories. Unless you are in a situation
where you absolutely need the raw locale data though, you're
better off using the provided interfaces, which are (usually)
more general, and don't depend on the underlying data formats.
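For the rare case where the raw data is needed, here is a sketch that reads the n_sign_posn member of the lconv structure and describes where the sign goes for negative monetary amounts. The function names and the wording of the descriptions are ours; the numeric codes 0 through 4 are the ones defined for that member.

```c
#include <locale.h>

/* Hypothetical helper: describe an lconv sign-position code. */
const char *describe_sign_position(char posn)
{
    switch (posn) {
    case 0:  return "parentheses surround the quantity and currency symbol";
    case 1:  return "sign precedes the quantity and currency symbol";
    case 2:  return "sign follows the quantity and currency symbol";
    case 3:  return "sign immediately precedes the currency symbol";
    case 4:  return "sign immediately follows the currency symbol";
    default: return "not specified by this locale";
    }
}

/* Hypothetical helper: look up the rule for negative amounts. */
const char *negative_sign_placement(void)
{
    return describe_sign_position(localeconv()->n_sign_posn);
}
```

The "C" locale leaves this member unspecified (it holds CHAR_MAX), which is why the default branch exists; the structure returned by localeconv() may be overwritten by later calls, so copy out anything you need to keep.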
Finishing Up.
We've spent two months quickly reviewing
internationalization. I18N is a large can of worms, and we've
just shown you the top layer of worms. Think of this as a
checklist rather than a tutorial. For the complete story,
nothing will substitute for reading the standards and the manual
pages. The locale chapter in X/Open's System Interface
Definitions (see http://www.opengroup.org/publications/
and click on ``Common Access to the Unix Documentation'' for an
on-line version) is particularly useful.
As usual, we have no clue what we'll discuss next time, because there's no telling what problems we'll find interesting in the next month. But until then, happy trails.