Tuna Breath: October 2003 Archives

October 31, 2003

Upgrade & Promotion

We've added a number of bug fixes and a few labeling revisions to RedLightGreen this evening. During the "overnight" outage we're also fixing the most perplexing bug -- the one where you can't search on a single word unless it can be separated into a root and suffix(es).

We're going to be promoting at our partner institutions next week -- i expect we'll begin to see real undergraduate usage.

We are also still suffering from the DB2 JDBC bug that crashes WebSphere -- waiting for a DB2 fixpack on that....

Happy Halloween!

Posted by judielaine at 06:07 PM | Comments (0) | TrackBack

October 28, 2003

Google survey

Around the one hundredth Google result on a search on "RedLightGreen" was this comment

October 16, 2003 Not limited to government documents, but still: RLG and OCLC seem to be moving toward having limited versions of their union catalogs available for free online. So far, I have been utterly unimpressed with RLG's RedLightGreen; for example, a search for the LC subject heading "Women" just now turned up no results. Perhaps I am misunderstanding something about it, because that just can't be right

Finally! Something other than an example of the telephone game (where i'm left wondering where Barbara Quint got some of her facts). Unfortunately the Libronaut has no comments or trackback links.

First, I'll wager The Libronaut was searching via the "more search options" instead of the keyword box. If you search on "women" in the keyword box,

However, we have determined there is a bug in the fielded searching due to the MindServer's stemming implementation. We tested the system on a 10% database that did not include stemming, and the fielded search worked just fine. The full database was delivered a little late; that, plus some other crises i've alluded to here, meant we failed to do as thorough a QA as we would have liked. Recommind is working like mad to fix it.

As proof of it being a bizarre stemming bug, use the "more search options" to search on the subject "womens" instead of "women" and plenty of results will be provided.

My hair isn't grey yet....

This post was inspired by the Quarter Life Crisis review. Unfortunately (a) posting a comment to the blog or getting a track back link fails and (b) we don't yet support what s/he wants to do.

I will try to remember to let you know when we get the static URL working. Currently, we do not support linking into the system. We are aware that it is a significant flaw.

Posted by judielaine at 03:53 PM | Comments (1) | TrackBack

Unicode client woes

So, first, the appearance that we can't handle UNICODE rankles. RLG has been involved in the development of UNICODE over the last decade, and before that with other character encoding schemes. I have much respect for my colleagues here, and i hate to embarrass them.

So, we do have a known bug: when you see a "Related Subject" or "Related Author" with non-ASCII characters in it, the JavaScript creating the link chokes on the UNICODE. We're working on a fix for that. Other than that, this system is all UTF-8. Which isn't to say it's perfect, but there are interpretation issues on the client side -- and i think they're even worse than the simple rendering variants across browsers.

Last week, i could not cut and paste 毛澤東 with my Mac. They displayed just fine in Firebird, but no luck cutting and pasting. This worked just fine on my Win 2000 box at work (Mozilla 1.0.2). I've upgraded my Mac to 10.3 over the weekend. I can now cut and paste those characters into the RedLightGreen search box and have a perfectly happy search.

So, now to reconsider the diacritics. I spoke with our local UNICODE expert, Joan Aliprand (the first named author of The Unicode Standard, Version 4.0; see here for RLG's involvement with the UNICODE Consortium in 1994). Joan pointed out that there are (at least) two ways to express many characters with diacritics -- the character may be a composite of two codes (a base letter plus a combining mark) or it can be the single distinct code for the character. We chose composite characters for our translation.

We chose to configure the MindServer to be completely character set and language agnostic -- we have too many languages and character sets in our data to lock them down. The MindServer, then, will match UTF-8 character to character -- it's not smart enough to decompose or recompose characters with diacritics.
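To make the two encodings concrete, here is a minimal Python sketch using the standard unicodedata module. It only illustrates what character-to-character matching does with the two forms; it is not how the MindServer is implemented, and the helper names naive_match and normalized_match are made up for the example.

```python
import unicodedata

# "Küng" with a single precomposed u-umlaut (U+00FC).
composed = "K\u00fcng"
# "Küng" as a plain "u" followed by a combining diaeresis (U+0308).
decomposed = "Ku\u0308ng"


def naive_match(query: str, indexed: str) -> bool:
    """Hypothetical stand-in for matching UTF-8 character to character."""
    return query.encode("utf-8") == indexed.encode("utf-8")


def normalized_match(query: str, indexed: str) -> bool:
    """Compare after bringing both strings to one normalization form (NFC)."""
    return (unicodedata.normalize("NFC", query)
            == unicodedata.normalize("NFC", indexed))


print(len(composed), len(decomposed))          # 4 5
print(naive_match(composed, decomposed))       # False
print(normalized_match(composed, decomposed))  # True
```

Both strings render as the same four letters on a cooperative client, but the second carries five code points, so a matcher that never normalizes treats them as different words.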

When Porst tried his Jänich search, it's possible that the umlaut-a was a single character (although, when i search "janich," i find that Wissenschaftstheorie als Wissenschaftskritik has a German record where Jürgen has an umlaut-u but Janich has no diacritics). My diacritic sample search is for Hans Küng.

On my 10.2 Mac i had lousy results, which contradicted my testing on my Win 2000 box. Today i can repeat the search. First, because i'm resisting learning how to type anything but what's on my American keyboard, i search for "kung." When i view the details, i see "Ku[]ng" where the [] is the box character used to indicate "yeah, that's a proper character but i can't display it." (It displays just fine on the Win 2000 box.) It's clear that this is a composed diacritic character -- the diacritic is the character that's not displaying. I paste that into the search box, and the u-umlaut displays ("Küng"). *sigh* The search works just fine.

Now, i want to demonstrate the frustration due to the two different UNICODE encodings. I suspect that the "Küng" i searched for earlier had a single u-umlaut character. I cut and paste from the blog -- voila, no results. I replicate this pattern on the PC, as well.

So, what to do? RLG's practice, developed over the past twenty-some years, was to normalize Latin character set text strings in indices, removing diacritics. (Our ugly, punctuation-stripped title-cluster labels demonstrate the normalization that was in use.) RedLightGreen built upon that normalization effort, using both the original encodings and the normalized indexes to create the records the MindServer indexes. Should we index again, adding a duplicate access point with the diacritics represented the other way ("disk is cheap")? Should we expect the underlying tools (DB2, MindServer) to recognize that the composed and single character diacritics are synonyms?
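The two options can be sketched in a few lines of Python. This is an illustration under my own assumptions, not RLG's actual normalization rules: strip_diacritics and access_points are hypothetical names, and "remove diacritics" is approximated here as "decompose, then drop the combining marks."

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def access_points(text: str) -> set:
    """Hypothetical 'disk is cheap' option: index every representation."""
    return {
        unicodedata.normalize("NFC", text),  # single-character diacritics
        unicodedata.normalize("NFD", text),  # base letter + combining mark
        strip_diacritics(text),              # diacritics removed
    }


single = "K\u00fcng"     # precomposed u-umlaut
combined = "Ku\u0308ng"  # u + combining diaeresis

print(strip_diacritics(single) == strip_diacritics(combined) == "Kung")  # True
print(access_points(single) == access_points(combined))                  # True
```

Either way the point is the same: whichever form the user's client happens to paste, it lands on an entry that exists in the index.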

As we migrate the whole Union Catalog, we'll be figuring it out -- and maybe it's already in a spec somewhere. Eventually, it will percolate into RedLightGreen. Until then, we've a frustrating situation with respect to diacritics.

Posted by judielaine at 11:37 AM | Comments (0) | TrackBack

October 24, 2003

Quarter Life Crisis's review -- part II -- LC Subject Headings

In Quarter Life Crisis's section Refining, Porst writes:

The catalogue also supports refining searches, which is a very good idea and can be very helpful. Unfortunately the Subjects provided seem very random. When searching for Vector Bundles Why can I choose Vector Bundles – Congresses but it’s not related to Vector Bundles? Why can’t I simply limit my search results to a broad area like Mathematics, say? For example in this search, all the top results are maths related. Still the top Subjects to come up are completely unrelated.

The "Refine Search" section is based primarily on the LC Subject Headings. It's tempting to believe these are hierarchical. They're not. They're *linear*. They are designed for *DRAWERS*. ANALOG. Little cards.

(deep breath -- sorry for the rant)

We spent MONTHS trying to use Recommind's automated classification to create something more intuitive. We tried the fragments of the subject headings, we tried using Dewey and LC Class Numbers. (Designed for collocating *books* on *shelves.* Physical object. One to one relations. More ANALOG.) It didn't really work. We had problems with granularity, problems with antique classifiers (where books on contemporary Iraqi politics have Dewey class numbers that put them "under" Archaeology -- which makes sense if you want your books on Mesopotamia and Ancient Persia to be near Iran and Iraq), and problems generating labels.

In the end, we do a rapid analysis of all the subject headings associated with the first hundred works that are returned. Catalogers gave subject headings of Vector Bundles – Congresses with the understanding that they'd be right behind the books on subject cards for Vector Bundles. They knew that the researcher intent on discovery would keep flipping back through the cards. No point in creating two different cards for the user to flip by, was there? And there was certainly no point in labeling this book "Mathematics" -- only books that address the whole broad subject area would get a subject heading like that.

...rambling about an example use and some bugs and forthcoming fixes...

I don't know much about vector bundles, so i'll switch to something i do: halo nuclei. The disambiguation problem is clear to me in this case. There are books about the structures of galaxies -- halo galaxies and the nuclei -- the thick centers -- of galaxies. Not my interest. I want halo nuclei -- stable nuclei that are believed to have a density distribution that falls off much less rapidly than the lighter isotopes of the same element. (Usually the next lighter nuclide is not stable.) Many of these elements are important in nucleosynthesis, though (which occurs in stars).

These "related" subjects, then, are the frequency ranked subjects assigned to the titles in the result set. If i'm looking for halo nuclei in the context of nucleosynthesis, i can limit by astrophysics and then check the refined list of subjects. I notice a bug i thought we'd gotten rid of -- astrophysics remains as a "refine search by" subject even though i've already limited on it. I check all the remaining subjects to see if i can limit on nucleosynthesis -- i can.

We're planning on changing the "show more subjects" listing to an alphabetic list. The current list puts the most frequently occurring at the top. Without numbers, it's hard to understand the ordering. [We don't display any numbers because we do the analysis across the first one hundred results -- so if twenty-five of the first one hundred results have "astrophysics" as a subject, i might get forty results when i select that limit.]

Posted by judielaine at 10:14 PM | Comments (0) | TrackBack

Quarter Life Crisis's review -- part I -- diacritical comments

I was warned that starting a blog while managing a project might be a bit demanding -- and i haven't been as attentive to this as i have been to the project itself. So it goes.

I just had someone send the link to Sven-S. Porst's blog to me. It's the first substantial critique of the system I've seen outside our feedback mail, so i'll begin addressing some of the points brought up there.

Regrettably, our system is session-bound. We will likely restructure the session handling and system URLs once we know where the pilot system is headed, but so much of the query handling is in temporary relational database tables that links to result lists and so on may not be supported anytime soon. The most critical issue is getting session-free links to editions. So, unfortunately, no one can see the carefully linked examples in Quarter Life Crisis. My regrets.

Umlauts. Oh, ack. Actually, this is very easy to have all muddled unless you force all your users to use the same platforms and same clients. On my work PC running some version of Mozilla (not Firebird), cutting and pasting umlauts and Chinese scripts works. I know i've tested using "Küng" and the Chinese character for Mao. I also know that, for some reason, i can't cut and paste those Chinese characters on my OS 10.2 Mac running Mozilla Firebird -- even though they display beautifully. I just now tested using "Küng" -- *sigh* -- "no results." Off to feedback for this problem.

Merged Records. Also known as our FRBR-like title clustering. This is flawed, and is going to be very hard to get better without losing much of the positive aspects altogether. Porst points to examples of bad/sloppy cataloging (my guess, without holding the book in my hand). I'm trying to replicate Porst's search -- I believe it was "do carmo manfredo differential." I note these two results -- and have indicated in bold the distinguishing factor:

2. Differential Geometry Of Curves And Surfaces, by Manfredo Perdig Ao Do Carmo
7. Differential Geometry Of Curves And Surfaces, by Manfredo Perdigao Do Carmo

If i go to RLG's Eureka and search on ti=Differential Geometry Of Curves And Surfaces and pn=Carmo, i find three records. All three REALLY are the same edition: "Englewood Cliffs, N.J. : Prentice-Hall, c1976."

The largest cluster of records (and do go read about RLG's clustering at this point) gives a dimension of 24 cm and displays the tilde a as "~a." I will not digress to give a lecture about the age of the union catalog and some of these records, about the fact they're in EBCDIC with some character codings developed in house because UNICODE did not exist yet. See here instead. Suffice it to say that, over a quarter century, catalogers have struggled to key in non-ASCII data with varied success.

So the largest cluster has 39 holdings. The next largest has 2. I am struggling to see what the difference is. I remain mystified about the purpose of LCCNs (despite innumerable lectures) -- the LCCNs differ between the two records, and, yup, the LCCN is used to distinguish between editions. Someone dropped the 0 in LCCN: 75-22094. The third cluster has only 1 holding. Here the name is spelled with a "ã" and the dimension of the book is given as 23 cm. The spelling of the name does not affect the RLG clustering -- but the difference in the physical description kept this one edition from clustering with the 39. The correct expression of the tilde-a, however, was normalized correctly as an "a" whereas the incorrect expression of the tilde-a in the largest cluster was normalized to a space-a. Thus, the two appear to have two different authors and did not form the same title cluster.
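Here is a small Python sketch of how that plays out. It is my own approximation, not our actual normalization or clustering code: normalize_label and title_cluster_key are hypothetical names, and i'm assuming the normalization amounts to "strip diacritics, turn punctuation into spaces, lowercase."

```python
import re
import unicodedata


def normalize_label(text: str) -> str:
    """Strip diacritics, turn punctuation into spaces, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything that is not a letter or digit, including a stray "~",
    # becomes a space.
    spaced = re.sub(r"[^0-9A-Za-z]+", " ", no_marks)
    return " ".join(spaced.lower().split())


def title_cluster_key(record: dict) -> tuple:
    """Hypothetical work-level key: normalized title plus normalized author."""
    return (normalize_label(record["title"]), normalize_label(record["author"]))


title = "Differential geometry of curves and surfaces"
keyed_with_tilde = {"title": title, "author": "Manfredo Perdig~ao do Carmo"}
keyed_correctly = {"title": title, "author": "Manfredo Perdig\u00e3o do Carmo"}

print(normalize_label(keyed_with_tilde["author"]))  # manfredo perdig ao do carmo
print(normalize_label(keyed_correctly["author"]))   # manfredo perdigao do carmo
print(title_cluster_key(keyed_with_tilde) == title_cluster_key(keyed_correctly))
# False
```

The titles normalize identically; the only difference between the keys is the space that the "~" left behind, and that is enough to keep the two records in separate title clusters.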

Well, it's nice to fob all that off on poor cataloging, bad data, GIGO. There are places we can improve on the title clustering, even with good data. The first result if you search on Inferno can point that out.

As far as the German translation "missing" -- it's not in our Union Catalog. Unfortunately, if it was, it would likely not cluster into either of those titles because it is unlikely a cataloger would have established a uniform title for the work. I am NOT a cataloger, but i believe that the uniform title for this book should be the original Portuguese title. (Caveat later.) Thus every cataloged edition should have the title page title (in English or German) in the 245 field, and the Portuguese title in the uniform title field. However, catalogers have to make effort judgements. I am not surprised that for a mathematical text the uniform title has not been used. It is not used, in general. I do regret that the German translation isn't present. I wonder about the original Portuguese -- since the note on the English edition says that this is a translation "of a book and a set of notes, both published originally in Portuguese," whether the published book and separate(?) notes are among the five Portuguese works RedLightGreen displays for this search. (And if the English translation is of two separately and previously published works, the rules for uniform titles get a little out of control. And i won't begin with 7XX added entries.)

Posted by judielaine at 06:15 PM | Comments (1) | TrackBack

October 20, 2003

Run-away

Went on vacation, then got sick. (Sigh, airlines.) While i was out, our President sent the following 'round:

Folks, I love these snarkhunting guys. Their most recent online newsletter is lots of fun. Check it out.
http://www.snarkhunting.com/

One of the things on their website is this taxonomy of search engine names that they created. Check it out. I think RedLightGreen is an Evocative, Level 2.
http://www.igorinternational.com/webportaltaxonomy.html

Cheers, Jim

I rather think RedLightGreen scores higher in the Evocative level -- it's better than Magellan. (I used the Magellan crawler when i was at the Internet Archive, and i was always forgetting the name.) And TunaBreath, well, no one is about to forget that name! (Um, maybe they should.)

Posted by judielaine at 06:00 PM | Comments (0) | TrackBack

October 03, 2003

User Studies

I should point out Günter Waibel's article Letting Users Show the Way in RLG. I personally would never refer to ranking algorithms with the phrase, "setting in motion a higher intelligence," but otherwise the article does a great job describing how early user tests with paper wireframes helped us sort out features.

Posted by judielaine at 07:00 PM | Comments (0) | TrackBack

"Total" Number Of Results

As i was working on tuning the results we get from the MindServer, i became a little worried about how misleading the statement that there were only 99 results on the search for Shakespeare would be.

We were able to fix part of the problem by tuning the relevancy calculation -- we choose to assume that if you're searching you're more likely searching for "aboutness" than anything else. So we've ranked subjects and titles as more important than authorship. (We already knew that this system "loses" at known item searching -- we choose to optimize for our strength.)

The problem remained that we were never going to return more than 500 results -- and that often those 500 editions would compress down to fewer results, since more than one edition of a work will often be returned. A search on widgets might produce 294 resulting works, and a search on whatsits might produce 459. There may be many, many more things on widgets in the Union Catalog, including your uncle's dissertation on Widgets -- but The Classic Text on Widgets, which went through twenty editions and is held by every major institution, overwhelmed the 500-edition result set.

So, the user who goes through all 294 results on widgets to see how his uncle's dissertation was ranked is terribly disappointed. It's not mentioned at all. And then the user searches on widgets and his uncle's name, and there's the dissertation....

I suggested we change the word "total" to something else -- "recommended" ("recomminded") or "relevant," perhaps. The discussion went on and one of the Information Architects suggested we remove the number altogether. Good enough, i figured, and we had a reviewer lined up who would tell us if this was going to be the usability error others immediately insisted it would be.

Well, who needs a reviewer. At every institution Merrilee visited, the "where's the number of results?" question boomed. It's back, it's back -- without the word that bothered me: "total."

Now remains the question: how do people use this number? It behaves unlike other numbers, mainly because if your search term is at all common, we're retrieving the top 500 editions, identifying them by works (thus reducing the 500 by however many editions of the same work were returned), and displaying that number.

442 tea
451 tea coffee
449 +tea +coffee
457 tea or coffee
444 tea not coffee
444 tea not (coffee or cafe)
440 tea not (coffee cafe)
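A sketch of where a number like that comes from, in Python and under my own simplifying assumptions (the work_id field and the displayed_count function are hypothetical, and the ranked list is made up): cap the ranked editions at 500, then count the distinct works among them.

```python
def displayed_count(ranked_editions, cap=500):
    """Count distinct works among the top `cap` ranked editions."""
    top = ranked_editions[:cap]
    return len({edition["work_id"] for edition in top})


# Made-up ranking: twenty editions of one classic text at the top,
# 579 single-edition works, and the uncle's dissertation ranked last.
ranked = (
    [{"work_id": "classic-text"} for _ in range(20)]
    + [{"work_id": f"work-{i}"} for i in range(579)]
    + [{"work_id": "uncle-dissertation"}]
)

print(len(ranked))              # 600
print(displayed_count(ranked))  # 481

shown = {edition["work_id"] for edition in ranked[:500]}
print("uncle-dissertation" in shown)  # False
```

Because the number depends on how many of the top 500 editions happen to share a work, narrowing or broadening the query can nudge it in either direction, which is roughly what the tea and coffee figures show.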

I doubt that we'll get questions about this weird behavior -- which is why i bring it up here.

Then again, i have been impressed by the intelligently critical feedback from users so far. I might just be surprised. (Meanwhile, i'm trying to remember what i bet my boss with respect to feedback from "users.")

Posted by judielaine at 06:35 PM | Comments (0) | TrackBack