So, first, the appearance that we can't handle Unicode rankles. RLG has been involved in the development of Unicode over the last decade, and before that with other character encoding schemes. I have much respect for my colleagues here, and i hate to embarrass them.
So, we do have a known bug: when you see a "Related Subject" or "Related Author" with non-ASCII characters in it, the JavaScript creating the link chokes on the Unicode. We're working on a fix for that. Other than that, this system is all UTF-8. That isn't to say it's perfect: there are interpretation issues on the client side, and i think they're even worse than the simple rendering variants across browsers.
Last week, i could not cut and paste 毛 澤東 with my Mac. They displayed just fine in Firebird, but no luck cutting and pasting. This worked just fine on my Win 2000 box at work (Mozilla 1.0.2). I've upgraded my Mac to 10.3 over the weekend. I can now cut and paste those characters into the RedLightGreen search box and have a perfectly happy search.
So, now to reconsider the diacritics. I spoke with our local Unicode expert, Joan Aliprand (the first named author of The Unicode Standard, Version 4.0, and here since RLG's involvement with the Unicode Consortium began in 1994). Joan pointed out that there are (at least) two ways to express many characters with diacritics: the character may be a decomposed sequence of two code points (the base letter followed by a combining mark), or it may be a single precomposed code point. We chose the decomposed form for our translation.
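A small illustration of the two forms Joan describes, sketched in Python (my example, not anything from our systems):

```python
import unicodedata

# "Küng" with a precomposed u-umlaut: one code point, U+00FC
precomposed = "K\u00fcng"
# "Küng" with a decomposed u-umlaut: "u" (U+0075) + combining diaeresis (U+0308)
decomposed = "Ku\u0308ng"

print(len(precomposed))           # 4 code points
print(len(decomposed))            # 5 code points
print(precomposed == decomposed)  # False -- a literal match fails

# Unicode normalization converts between the two forms
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Both strings render identically on screen, which is exactly why the mismatch is so maddening to diagnose.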
We chose to configure the MindServer to be completely character-set and language agnostic -- we have too many languages and character sets in our data to lock them down. The MindServer, then, will match UTF-8 character to character -- it's not smart enough to decompose or recompose characters with diacritics.
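To see why literal character-to-character matching can't bridge the two forms, look at the actual UTF-8 bytes (again my illustration; i'm not showing MindServer internals):

```python
# The same visible character, two different UTF-8 byte sequences
precomposed = "\u00fc"   # ü as a single precomposed code point
decomposed = "u\u0308"   # "u" plus a combining diaeresis

print(precomposed.encode("utf-8").hex())  # c3bc   (2 bytes)
print(decomposed.encode("utf-8").hex())   # 75cc88 (3 bytes)
```

A matcher that compares bytes or code points without normalizing first has no way to know these are the "same" letter.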
When Porst tried his Jänich search, it's possible that the umlaut-a was a single precomposed character (although, when i search "janich," i find that Wissenschaftstheorie als Wissenschaftskritik has a German record where Jürgen has an umlaut-u but Janich has no diacritics). My diacritic sample search is for Hans Küng.
On my 10.2 Mac i had lousy results, which contradicted my testing on my Win 2000 box. Today i can repeat the search. First, because i'm resisting learning how to type anything but what's on my American keyboard, i search for "kung." When i view the details, i see "Ku[]ng", where the [] is the box character used to indicate "yeah, that's a proper character, but i can't display it." (It displays just fine on the Win 2000 box.) It's clear that this is a decomposed diacritic character -- the combining diacritic is the part that's not displaying. I paste that into the search box, and the u-umlaut displays ("Küng"). *sigh* The search works just fine.
Now, i want to demonstrate the frustration due to the two different Unicode representations. I suspect that the "Küng" i mentioned earlier has a single precomposed u-umlaut character. I cut and paste it from the blog -- voila, no results. I replicate this pattern on the PC, as well.
So, what to do? RLG's practice, developed over the past twenty-some years, was to normalize Latin-script text strings in indexes, removing diacritics. (Our ugly, punctuation-stripped title-cluster labels demonstrate the normalization that was in use.) RedLightGreen built upon that normalization effort, using both the original encodings and the normalized indexes to create the MindServer record indexes. Should we index again, adding a duplicate access point with the diacritics represented the other way ("disk is cheap")? Or should we expect the underlying tools (DB2, MindServer) to recognize that the precomposed and decomposed forms are synonyms?
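For what it's worth, the diacritic-stripping half of that old normalization practice is easy to sketch: decompose first, then drop the combining marks. (This is my toy version, not RLG's actual normalization code, which handles far more than this.)

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove diacritics from Latin-script text for index normalization."""
    # NFD decomposition turns every precomposed character into
    # base letter + combining mark(s)...
    decomposed = unicodedata.normalize("NFD", text)
    # ...then we drop everything in category Mn (nonspacing combining marks).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Either input form normalizes to the same index key:
print(strip_diacritics("K\u00fcng"))   # Kung (from precomposed ü)
print(strip_diacritics("Ku\u0308ng"))  # Kung (from decomposed u + diaeresis)
```

The nice property is that it doesn't matter which form the cataloger (or the searcher's clipboard) supplied -- both collapse to the same key, which is exactly the synonym behavior we'd otherwise be asking DB2 or the MindServer to provide.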
As we migrate the whole Union Catalog, we'll be figuring it out -- and maybe it's already in a spec somewhere. Eventually, it will percolate into RedLightGreen. Until then, we've a frustrating situation with respect to diacritics.