Thursday, August 15, 2013

Open Dylan streams and character encodings

As I look at extending Open Dylan to support Unicode, it is apparent that I need to step back from the lexer and compiler and add support for character set transcoders in the io library. Some of this can be done even before we fully support 32-bit characters:
  1. ISO-8859-1 (but not Windows CP1252) can be trivially converted to an (8-bit) <character>.
  2. UTF-8 encoded values between U+0000 and U+00FF can also be decoded into an (8-bit) <character>, since these are identical to ISO-8859-1.
  3. UTF-16 and UTF-32 values between U+0000 and U+00FF are also readable.
In other words, it is possible to create a transcoding protocol and integrate it into the streams module. This could be done with a <wrapper-stream>, but the more I think about it the more I think it should be added to <file-stream> directly. A wrapper stream could then be used around a <byte-string-stream> whenever you wanted to apply a transcoder to a byte string.
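
To make the first two cases concrete, here is a quick sketch (Python rather than Dylan, purely illustrative) showing that ISO-8859-1 bytes and UTF-8 sequences for code points U+0000 through U+00FF decode to the same 8-bit values:

  # Illustrative only: code points U+0000..U+00FF fit in an 8-bit
  # <character> whether the bytes arrived as ISO-8859-1 or UTF-8.

  def decode_latin1(data: bytes) -> list:
      # In ISO-8859-1 every byte value *is* the code point.
      return list(data)

  def decode_utf8_8bit(data: bytes) -> list:
      # Decode UTF-8 and check that every code point fits in 8 bits.
      points = [ord(c) for c in data.decode("utf-8")]
      assert all(cp <= 0xFF for cp in points)
      return points

  text = "façade"   # every code point is <= U+00FF
  assert decode_latin1(text.encode("latin-1")) == \
         decode_utf8_8bit(text.encode("utf-8")) == \
         [ord(c) for c in text]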

One problem is that <file-stream> is a subclass of <positional-stream>: in the presence of a transcoder the position is the logical character position in the stream, not a byte offset. For multi-byte encodings like UTF-8 this means changing the position is not necessarily a constant-time operation.
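
To see why, here is a rough Python illustration (again, not Dylan) of what seeking to the Nth character costs in UTF-8: you have to scan lead bytes from the start (or from some remembered index point), so it is O(n) rather than O(1):

  def byte_offset_of_char(data: bytes, n: int) -> int:
      # Return the byte offset of the n-th character (0-based) in UTF-8 data.
      # Continuation bytes look like 10xxxxxx, so count only lead bytes.
      chars_seen = 0
      for offset, byte in enumerate(data):
          if byte & 0xC0 != 0x80:        # not a continuation byte
              if chars_seen == n:
                  return offset
              chars_seen += 1
      raise IndexError(n)

  data = "naïve".encode("utf-8")             # 6e 61 c3 af 76 65
  assert byte_offset_of_char(data, 2) == 2   # 'ï' starts at byte 2
  assert byte_offset_of_char(data, 3) == 4   # 'v' is pushed out to byte 4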

There is more to think about, especially how this will tie into a Dylan that defines <character> as a 32-bit Unicode value. For example, make(<file-stream>) takes an element-type: parameter which is documented to accept <byte-character> or <unicode-character>. Later on this will default to <character>.

To be continued...

Tuesday, August 13, 2013

A URL Encoding Horror Story

There once was a web application that needed to create a link that included the name of a journal. Its name was "Ûžno-Rossijskij muzykal’nyj al’manah". But the link the application created didn't work. In fact, it failed spectacularly to create a valid link. This is what it made:

Ûžno%2DRossijskij%20muzykal%2019nyj%20al%2019manah

(Insert music from the shower scene in Psycho)

There are two things wrong.

The first is that the initial two characters in the name are not encoded at all. According to RFC 3986 (and the ABNF definitions it imports from RFC 2234) it's illegal for those characters to appear in a URI: they must be percent-encoded as UTF-8. The first part of the name should look like

%C3%9B%C5%BEno-Rossijskij…

The second is related to the quotation marks in the second and third words. This is U+2019 RIGHT SINGLE QUOTATION MARK. Notice how al’manah appears in the generated link:

al%2019manah

Whatever created this URL appears to have put the actual hexadecimal code-point for the quote into the encoded string! But what you end up with after decoding is "al 19manah" which is obviously wrong. What could lead to this?

The ECMAScript standard, ECMA-262, provides a function escape that generates a percent encoded version of its argument. This function diverges from RFC 3986 in two ways:
  1. A character whose code-point is 0xFF or less is encoded as a two digit escape sequence: 0xDB becomes %DB.
  2. A character whose code-point is greater than 0xFF is encoded using a four digit escape sequence of the form %uXXXX. For example, U+2019 becomes %u2019.
Encoded with escape the title is

%DB%u017Eno-Rossijskij%20muzykal%u2019nyj%20al%u2019manah
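
For what it's worth, the legacy behavior is easy to reproduce. Here is a rough Python model of escape (an approximation for illustration, not the normative ECMA-262 algorithm) that produces the string above:

  SAFE = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
             "abcdefghijklmnopqrstuvwxyz"
             "0123456789@*_+-./")

  def ecmascript_escape(s: str) -> str:
      # %XX for code points up to 0xFF, %uXXXX above that,
      # and a small set of characters passed through untouched.
      out = []
      for ch in s:
          cp = ord(ch)
          if ch in SAFE:
              out.append(ch)
          elif cp <= 0xFF:
              out.append("%%%02X" % cp)
          else:
              out.append("%%u%04X" % cp)
      return "".join(out)

  title = "Ûžno-Rossijskij muzykal\u2019nyj al\u2019manah"
  print(ecmascript_escape(title))
  # %DB%u017Eno-Rossijskij%20muzykal%u2019nyj%20al%u2019manah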

Unfortunately the W3C has rejected this syntax, and it isn't portable.

Perhaps whoever wrote the code to URL-encode the title was thinking of the ECMAScript function but misremembered the syntax? If that were the case I would have expected the ž to appear as %017E, but it doesn't.

I really don't know what happened to this poor title. Here is what it should look like:

%C3%9B%C5%BEno-Rossijskij%20muzykal%E2%80%99nyj%20al%E2%80%99manah
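
For comparison, any standards-conforming URL-encoding routine produces exactly that. In Python, for example, urllib.parse.quote percent-encodes the UTF-8 bytes:

  from urllib.parse import quote

  title = "Ûžno-Rossijskij muzykal\u2019nyj al\u2019manah"
  print(quote(title))
  # %C3%9B%C5%BEno-Rossijskij%20muzykal%E2%80%99nyj%20al%E2%80%99manah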

These problems are endemic: any Unicode string that contains characters outside of US-ASCII gets similar treatment by the application. It is very sad.

Wednesday, August 7, 2013

Fun with Cross-Locale CJK Searching

The Japanese word for "economy" is 経済 (けいざい, keizai). In Mandarin the word is pronounced jīngjì and written 经济 in Simplified Chinese and 經濟 in Traditional Chinese; the traditional forms are also the ones used in Korea. This is the same word written with different character variants.


The Japanese used the Traditional Chinese form through the end of World War II. The Tōyō kanji list — promulgated at the end of 1946 — introduced 1,850 official "simplified" forms including the two used to write keizai. Many of these were forms already in use which were simply blessed for standard use. In 1956 the People's Republic of China introduced the first official list of simplified characters, which was followed up two decades later with an extended list.

The crucial point here is that the simplified forms used by the PRC are different from those selected by the Japanese Ministry of Education.

In my experience, there is an expectation from Asian librarians that searching for any of these variants will find documents written with any other variant. This makes sense: many academic libraries contain works in several languages and all should be discoverable. This is similar to a medical researcher wanting to find articles containing anaesthesia and anesthesia.

The ability to search across languages doesn't end with kanji/hanzi variation, though. It is common in Korea for academic books to have their titles written in hanja (ideographs):

韓國 政治 經濟學事典

Our friend 經濟 (경제, gyeongje) appears in the third 'word'. Most Korean searches are written in hangul, so this title will not be found if someone searches for 경제 even though it is probably relevant to the user's information need.

This is just a small example of the problems I encounter with the work I do at EBSCO.

Friday, August 2, 2013

Lithuanian Mojibake

At EBSCO I deal with a lot of data that comes to us from around the world, in encodings that may or may not be correctly identified, and that often get munged in intriguing ways. Here's one example.

The title for a Lithuanian work was being displayed incorrectly on the website, so I went digging into the database to see what was actually being stored. This is what I saw:

Auka^Atojo mokslo kokyb^W^A

These were the byte values:

41 75 6b 61 01 74 6f 6a 6f 20 6d 6f 6b 73 6c 6f 20 6b 6f 6b 79 62 17 01

The correct title is

Aukštojo mokslo kokybė

The diacritized characters are being corrupted. But how?
  • The character 'š' is U+0161 (ISO-8859-13 0xF0)
  • The character 'ė' is U+0117 (ISO-8859-13 0xEB)

Somehow the UTF-16LE bytes for these two characters were being inserted into the byte-stream at some point, whether in the original data feed from the provider or by one of our processes. Regardless, the mind reels at how this could happen.
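
The give-away is that the stray bytes are exactly the UTF-16LE encodings of the two missing letters. A quick Python check (illustrative only; the actual pipeline that produced this is unknown to me) confirms it and patches this particular title:

  corrupt = bytes.fromhex("41756b6101746f6a6f206d6f6b736c6f206b6f6b79621701")

  # The two missing letters, as UTF-16LE byte pairs:
  assert "š".encode("utf-16-le") == b"\x61\x01"   # displays as 'a' + ^A
  assert "ė".encode("utf-16-le") == b"\x17\x01"   # displays as ^W + ^A

  # A band-aid for this one title, not a general fix:
  repaired = (corrupt.replace(b"\x61\x01", "š".encode("utf-8"))
                     .replace(b"\x17\x01", "ė".encode("utf-8")))
  print(repaired.decode("utf-8"))                 # Aukštojo mokslo kokybė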


Wednesday, July 31, 2013

Combining Forms

On Character Representation in Dylan

In response to my previous post, Dustin Voss added a great comment describing his view on character and string representation in Dylan. This is my response.

Dustin proposes a distinction between a <code-point> and a <character>.

A <code-point> contains the integer value for a character in the codespace. On the assumption that Unicode is the character set used in Dylan moving forward, and that we will adopt UTF-32 as the internal encoding, <code-point> is equivalent to my proposed <character> and <unicode-character>.

Dustin's <character> represents a grapheme cluster: the user's perceived notion of a character, which may actually consist of multiple <code-point>s. A grapheme cluster is a high-level concept, and while it is crucial in many text-processing activities, IMHO it is too high-level to be included at the language level. Grapheme clusters are a feature built on the functionality provided by the language.

How do we work with a grapheme cluster: what is <character>'s literal representation and what is its integer value?

as(<integer>, 'ñ') ⇒ ?

In this case the grapheme cluster (letter n + combining tilde) has a precomposed form (0xF1), but it doesn't have to be stored that way. It could just as easily be represented by "\<006E>\<0303>", which won't even lex as a character literal. We would have to ensure that strings are always kept normalized.
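
To make the normalization problem concrete (using Python's unicodedata here as a stand-in for whatever the Dylan runtime would provide):

  import unicodedata

  decomposed  = "n\u0303"    # LATIN SMALL LETTER N + COMBINING TILDE
  precomposed = "\u00F1"     # LATIN SMALL LETTER N WITH TILDE

  assert decomposed != precomposed                                 # two code points vs. one
  assert unicodedata.normalize("NFC", decomposed) == precomposed   # same grapheme cluster
  print(len(decomposed), len(precomposed))                         # 2 1

With grapheme-cluster characters, equality, hashing, and literal syntax all hinge on keeping every string in one normalization form.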

If a <string> is a sequence of grapheme-cluster <character>s then the iteration protocol becomes more complex. We still need a way of splitting a <character> into its constituent code points (or the protocol needs to convert to code points on the fly), and the normalization concerns come back.

I believe the language should provide the basic functionality for dealing with characters, where we define Unicode to be the language's character set, and no more. Support for higher-level operations belongs in a library: perhaps implemented natively, or through the FFI to ICU.

We would define a <grapheme-cluster> which would be implemented in terms of <character>, and a <grapheme-string> that returns clusters. These could be efficiently implemented using something like Common Lisp's displaced arrays. When writing code that needs to know about clusters you can get that view into the <string>, but other uses (which I would argue are the majority) would just see individual characters.

In my opinion characters and strings were under-specified in Dylan, just as they are in Common Lisp and pre-C++11 C++. This is why the approach taken by Java and C# ("the character set is Unicode, the internal representation is UTF-16") is an appealing model to use. Dylan is young enough (or, more to the point, has a small enough user base) that we can make proclamations and define the direction moving forward. For most users, if we do it right, the impact will be minimal.

Monday, July 29, 2013

OpenDylan and Unicode, a Journal

I recently started looking at what will be involved to add Unicode support to Open Dylan, and have decided to keep an online journal of the progress. I make no promises that these are complete or will be at all understandable to anyone else.

Internal character/string encoding: UTF-32. CCL and SBCL do this. By default wchar_t in GCC is a signed 32-bit value (__WCHAR_MAX__ == 0x7fffffff), which will hold the largest valid Unicode code point, 0x10FFFF. This avoids problems with surrogate pairs. In 2013, complaints about the memory consumption of wide strings are silly.

  • <character> and <unicode-character> will be identical.
  • <byte-character> will contain an unsigned 8-bit value. Interpretation as ASCII, ISO-8859-n, etc. is left undefined.
UTF-8 will be the defined source encoding for Dylan. Other encodings will not be supported. I've started a DEP specifying this and clarifying the portions of the DRM related to characters and their encoding, but it isn't done so I haven't released it even to the hackers list.

Need to write a UTF-8 to UTF-32 transcoder for use by the lexer.
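
The decoding loop itself is small. A minimal sketch in Python (assuming well-formed input; the real transcoder also needs to reject over-long sequences, surrogates, truncated input, and values above 0x10FFFF):

  def utf8_to_utf32(data: bytes) -> list:
      # Decode UTF-8 bytes into a list of code points (UTF-32 values).
      out = []
      i = 0
      while i < len(data):
          b = data[i]
          if b < 0x80:                  # 1 byte:  0xxxxxxx
              cp, n = b, 1
          elif b < 0xE0:                # 2 bytes: 110xxxxx
              cp, n = b & 0x1F, 2
          elif b < 0xF0:                # 3 bytes: 1110xxxx
              cp, n = b & 0x0F, 3
          else:                         # 4 bytes: 11110xxx
              cp, n = b & 0x07, 4
          for j in range(1, n):
              cp = (cp << 6) | (data[i + j] & 0x3F)
          out.append(cp)
          i += n
      return out

  assert utf8_to_utf32("木".encode("utf-8")) == [0x6728]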

Lexer needs to be updated:
  • Break the 1-byte == 1-character assumption: a single character literal may require multiple octets: '木' is 27 e6 9c a8 27 (see the quick byte check after this list).
  • Handle character code escapes greater than 255: '\<6728>'
  • Symbols? #"木" vs. 木: 
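
A quick check of the byte sequence mentioned in the first item (Python, just to confirm the octets the lexer will actually see for that literal):

  literal = "'木'"                          # source text of the character literal
  print(literal.encode("utf-8").hex(" "))   # 27 e6 9c a8 27
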
Need to review the lexical grammar and clarify wrt Unicode.

Create unit tests defining the expected behavior:
  • as(<integer>, 'a') ⇒ 97
  • as(<integer>, '木') ⇒ 26408
  • as(<integer>, '\<6728>') ⇒ 26408
  • as(<character>, 0x61) ⇒ 'a'
  • as(<character>, 0x6728) ⇒ '木' or '\<6728>'
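
The expected integer values are easy to sanity-check against another Unicode-aware runtime, e.g. in Python:

  assert ord("a") == 97
  assert ord("木") == 0x6728 == 26408
  assert chr(0x6728) == "木"
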
Will need to figure out how we print non-ASCII in an interactive environment: do we transcode to the locale? To UTF-8? To something else?

as-uppercase and as-lowercase (and their in-place versions) will implement the simple Unicode case mappings. Will need to write a utility to build a compressed table from UnicodeData.txt.
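
A sketch of the extraction step (Python; the compression scheme itself is a separate question) pulling the simple case mappings out of UnicodeData.txt:

  def simple_case_mappings(path="UnicodeData.txt"):
      # Yield (code point, simple uppercase, simple lowercase) triples.
      # Fields 12 and 13 of each semicolon-separated record hold the
      # simple mappings; an empty field means the character maps to itself.
      with open(path, encoding="utf-8") as f:
          for line in f:
              fields = line.rstrip("\n").split(";")
              cp = int(fields[0], 16)
              upper = int(fields[12], 16) if fields[12] else cp
              lower = int(fields[13], 16) if fields[13] else cp
              if upper != cp or lower != cp:
                  yield cp, upper, lower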

Need to determine how code generation is handled for characters and strings, and modify each backend appropriately.

Find all touch-points for characters.

Literal string creation and representation.

FFI: look at Java and SBCL for inspiration on how to handle string transcoding to/from the Dylan side.

More to come…