I’ve spent a lot of time this week concentrating on internationalisation (shortened to i18n in coder speak) for the fluxus scratchpad editor. I didn’t have much luck finding comprehensible help online, so most of this was done from memory of porting Evolva to Japanese years ago. I thought a blog post about it might be good for anyone treading this path in the future.
The gnome character map was very useful, as was this website which includes a bit more information.
Firstly, the history of how text is dealt with is a rather shameful mess of assumptions and out-of-date compromises. With higher level programming languages this is almost a solved problem, but in C++ it’s not at all. It would also possibly be less of an issue if we could use an editor widget from a normal windowing toolkit, but we can’t, as we need to render text in OpenGL with all sorts of whizzbang livecoding zooming effects.
It’s an often told story: first there was ascii, with its assumptions that everyone spoke English and that 8 bits (actually 7) would be enough for all the characters needed in text. Then came a whole host of ascii extensions for different parts of the world, using the 128 values left over at the top of the 8-bit range for special characters, while the Japanese, for example, had their own system entirely, called Shift_JIS.
Then in an effort to sort out this mess, lots of committees were formed and Unicode was the result – with the following aim: “Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”
In order to do this, unicode needs to support around 107,000 characters, covering all the languages ever spoken (including exotic scripts such as ancient Egyptian hieroglyphics, Canadian Aboriginal syllabics and Byzantine music scripts), with lots of room for new ones too. To give each of these its own number, 8 bits are obviously not enough, and neither are 16, which only give 65,536 values. The unicode code space runs up to 0x10FFFF, which needs 21 bits, so in practice a fixed-size character ends up taking 32 bits.
The problem is that ascii is so prevalent, and 32 bits per character is a lot for embedded devices, so unicode comprises multiple encoding formats. utf-32 is the easy but memory-eating one, where all characters are the same size, and then there are utf-16 and utf-8, which encode characters at variable lengths to save space. utf-8 has the interesting extra property of being compatible with ascii: it keeps the single-byte ascii codes as they are and uses the upper 128 byte values (those with the top bit set) as lead and continuation bytes that make up multibyte chars. For this reason, utf-8 has become the text encoding of the internet, and therefore a standard for plain text files pretty much everywhere.
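To make the byte layout a bit more concrete, here is a minimal sketch of how the first byte of a utf-8 sequence tells you how many bytes the character occupies. The function name utf8_sequence_length is made up for illustration and is not taken from the fluxus source.

```cpp
// Hypothetical helper: number of bytes in the utf-8 sequence that starts
// with the given lead byte. Returns 0 if the byte cannot start a sequence
// (it is a continuation byte, or a value that never appears in valid utf-8).
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;                  // 0xxxxxxx: plain ascii
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // 110xxxxx (0xC0/0xC1 would be overlong)
    if (lead >= 0xE0 && lead <= 0xEF) return 3; // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // 11110xxx (capped at U+10FFFF)
    return 0;                                   // 10xxxxxx continuation or invalid
}
```

Any byte below 0x80 is a complete ascii character on its own, which is why an ascii file is already a valid utf-8 file.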
In the fluxus scratchpad, I decided to use utf-32 for all internal strings, using the std::wstring class, and to encode from and to utf-8 at all points of input and output. The reason for this is that fairly simple operations like finding the length of a string, or moving back a single character, are complicated with utf-8: you need to detect the multibyte characters at each point. Using utf-32 it’s much simpler, but you also need to be able to read and write utf-8/ascii files.
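As a rough illustration of why these operations are fiddly on utf-8, here is a sketch of counting characters and stepping back one character in a utf-8 std::string. The helper names are hypothetical, and the code assumes the input is already valid utf-8. With a utf-32 string the same operations are just size() and pos - 1.

```cpp
#include <cstddef>
#include <string>

// utf-8 continuation bytes always look like 10xxxxxx.
inline bool is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: count characters (code points) in valid utf-8 by
// counting every byte that is not a continuation byte.
std::size_t utf8_length(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) ++n;
    }
    return n;
}

// Hypothetical helper: given a byte index sitting on a character boundary,
// return the byte index where the previous character starts.
std::size_t utf8_prev(const std::string& s, std::size_t pos)
{
    if (pos == 0) return 0;
    --pos;
    // Skip backwards over continuation bytes until we hit a lead byte.
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}
```

For the Portuguese word "olá" stored as utf-8, size() reports 4 bytes while utf8_length reports 3 characters, which is exactly the kind of mismatch a cursor in an editor has to cope with.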
Ok, that’s the strategy, and you’d think there would be lots of code online to do conversions between these formats – or even standard libraries in C++ to do this. No. The only code I found was buggy (a ‘>’ instead of a ‘>>’ which took me too long to find) but some working code is here. In those functions string is assumed to contain ascii or utf-8 and wstrings are assumed to be utf-32 (even this is going to have to change on the windows version where wstrings are 16bit, sigh).
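For anyone who wants a starting point, here is a minimal sketch of the two conversions. It is not the code linked above. The function names are my own, and it uses std::u32string (C++11) rather than std::wstring so that the 32 bit assumption holds on every platform. Malformed or truncated input is replaced with U+FFFD, and overlong forms and surrogates are not rejected, so treat it as an illustration rather than a hardened decoder.

```cpp
#include <cstddef>
#include <string>

const char32_t kReplacement = 0xFFFD; // unicode replacement character

// Decode utf-8 into utf-32, one code point per element.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        int extra = 0; // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out += kReplacement; ++i; continue; } // stray byte
        ++i;
        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }
            unsigned char c = static_cast<unsigned char>(in[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F); // each continuation byte adds 6 bits
            ++i;
        }
        out += ok ? cp : kReplacement;
    }
    return out;
}

// Encode utf-32 as utf-8. Note the shifts are '>>', not '>'.
std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF) cp = kReplacement; // outside the unicode range
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The idea would be to call utf8_to_utf32 on everything coming in (files, interpreter output) and utf32_to_utf8 on everything going out, so the editor itself only ever sees fixed-size characters.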
So, a big search replace for string to wstring in the scratchpad code with conversions in the loading, saving and entry and exit points for the PLT Scheme interpreter, and all was good…
Except keyboard input. I’m still trying to get to the bottom of this and find out what is an artifact of using GLUT and what comes from the underlying operating system, but at the moment it looks like the mac sends utf-8 keyboard codes, so there are multiple calls to the glutKeyboardFunc callback per keypress for multibyte chars. On Linux it seems like just the first utf-8 byte gets through, and I can’t find where the others get stashed. The minimal information online seems to indicate that this is stretching the abilities of the GLUT toolkit somewhat, so I’ve given up for now at least.
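If the keyboard callback really does deliver one utf-8 byte per call, as it appears to on the mac, one possible approach is to accumulate the bytes until a whole character has arrived. The class below is a hypothetical sketch of that idea, not something in fluxus, and it does not help with the Linux case where the later bytes never seem to arrive.

```cpp
// Hypothetical helper: collects utf-8 bytes delivered one at a time and
// reports a utf-32 character once a full sequence has been seen.
class Utf8KeyAccumulator
{
public:
    // Feed one byte. Returns true and sets 'out' when a character is complete.
    bool feed(unsigned char byte, char32_t& out)
    {
        if (m_remaining == 0) {
            if (byte < 0x80) { out = byte; return true; } // plain ascii
            if ((byte & 0xE0) == 0xC0)      { m_value = byte & 0x1F; m_remaining = 1; }
            else if ((byte & 0xF0) == 0xE0) { m_value = byte & 0x0F; m_remaining = 2; }
            else if ((byte & 0xF8) == 0xF0) { m_value = byte & 0x07; m_remaining = 3; }
            // Anything else is a stray continuation or invalid byte: ignore it.
            return false;
        }
        if ((byte & 0xC0) != 0x80) {
            // Expected a continuation byte. Drop the partial character and
            // treat this byte as the start of a new one.
            m_remaining = 0;
            return feed(byte, out);
        }
        m_value = (m_value << 6) | (byte & 0x3F);
        if (--m_remaining == 0) { out = m_value; return true; }
        return false;
    }

private:
    char32_t m_value = 0; // bits gathered so far
    int m_remaining = 0;  // continuation bytes still expected
};
```

The function registered with glutKeyboardFunc receives the key as an unsigned char, so it could pass that straight to feed() and only insert into the editor buffer when feed() returns true.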
The last thing to do was switch the fluxus editor font from the minimal Bitstream Vera Sans Mono to the more characterful DejaVu Sans Mono, which, other than having many more glyphs, looks exactly the same.
Portuguese help in fluxus – from utf-8 in the code doc comments, to a utf-8 helpstrings file, read in by the scheme interpreter, converted by the scratchpad to utf-32 and rendered by fluxus’s OpenGL glyph renderer.