This reflects our experience in building the RedLightGreen DB2 database and our current work in a similar but far more comprehensive Union Catalog database.
Dave Richards, chief technology officer for The Research Libraries Group Inc., of Mountain View, Calif., said his organization uses XML to store bibliographic data, which requires a great deal of auxiliary table construction to search and access records. Richards said the native XML support improvements for DB2 will allow digital records to be stored in a preparsed form, enabling streamlined and more efficient searches. "We wanted to be able to support queries that just were based on information in the e-records that had not been indexed. The way we have to do that at the moment is not terribly efficient," said Richards. "[Native XML support] is going to enable us to store things more compactly and access them easier ... and make it easy for us to be able to ingest and then export data in XML when we're able to migrate to that version of DB2." Richards said RLG is implementing a 1.5TB database featuring 140 million records representing books, serials, maps, films and music scores.
Merrilee wrote to our pilot partners today, and i'm cribbing here.
I hope you are having a great summer. We are looking forward to another academic year with RedLightGreen, and we'll be in touch soon about revving up for the next academic term. In the meantime, here's some summer reading. Nancy Pressman-Levy (Princeton) wrote a great piece on RedLightGreen that appeared in August RLG Focus. It's very entertaining, in addition to having great search examples. You can check it out at:
http://www.rlg.org/en/page.php?Page_ID=17921#article4
RLG has a new publication called TopShelf (this mostly goes to RLG Member Reps -- we hope they are passing relevant information on to you all as well). In the July TopShelf, there is a short piece that will give you a peek at the future of RedLightGreen. What's most important to note is that RedLightGreen is no longer a pilot -- it's now a service, and we'll maintain this service free of charge into the future. We hope this will provide some assurance that your partnership with us in the earliest phase of the project has not been in vain.
http://www.rlg.org/en/page.php?Page_ID=19301#redlg
Enjoy the remainder of August,
Merrilee
I should just give her an author account here, shouldn't i.
I'm still not completely moved into my new home, and we had some family events this summer that kept me from much attention to anything beyond the critical path. I'm still not sure when i can bring my attention back to this blog.
Oh, well. I'll just continue to tease and block blog spammers.
Someone at work pointed out a March 2004 Cornell PowerPoint presentation by Anne Kenney. She showcases the NLM (National Library of Medicine) site as "the most popular library site on the web" before showing the RedLightGreen interface.
NLM also uses the MindServer software from Recommind.
From an announcement to the RedLightGreen partners:
The February RLG Focus has just been issued, and there's an article on
RedLightGreen that features some discussion of the Coffee Tour, and also
summarizes "what we've learned" in the last few months.
http://www.rlg.org/r-focus/i66.html
RedLightGreen gets a brief mention by EContentMag.com when they name RLG as one of the top 100 EContent companies. Regrettably, the "Union Catalog on the Web" concept name has been shortened in press blurbs to the "Union Catalog" project, so RedLightGreen becomes synonymous with the Union Catalog. No.
I do like how they said, "(note that eponymous acronym)," about RedLightGreen.
Someone at work sent this notice around to us on 10/29/03 02:25 PM:
Interesting RedLightGreen tidbit of the day. As of the last time I checked, there have been 62 different sessions which searched for the term "memes". And it isn't one of the links on the front page.
We'll be working on that -- it's important!
Around the one hundredth Google result on a search on "RedLightGreen" was this comment
October 16, 2003 Not limited to government documents, but still: RLG and OCLC seem to be moving toward having limited versions of their union catalogs available for free online. So far, I have been utterly unimpressed with RLG's RedLightGreen; for example, a search for the LC subject heading "Women" just now turned up no results. Perhaps I am misunderstanding something about it, because that just can't be right
Finally! Something other than an example of the telephone game (where i'm left wondering where Barbara Quint got some of her facts). Unfortunately the Libronaut has no comments or trackback links.
First, I'll wager The Libronaut was searching via the "more search options" instead of the keyword box. If you search on "women" in the keyword box,
However, we have determined there is a bug in the fielded searching due to the MindServer's stemming implementation. We tested the system on a 10% database that did not include stemming, and the fielded search worked just fine. The full database was delivered a little late; that, plus some other crises i've alluded to here, meant we failed to do as thorough a QA as we would have liked. Recommind is working like mad to fix it.
As proof of it being a bizarre stemming bug, use the "more search options" to search on the subject "womens" instead of "women" and plenty of results will be provided.
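For the curious, here is a toy sketch of one way a symptom like this can arise. It is emphatically not MindServer's code, and i don't know the real cause; the stemmer, the irregular-plural table, and the index below are all made up for illustration. The assumption is that the fielded index holds the headings unstemmed while the fielded query path stems the user's term first.

```python
# Hypothetical irregular-plural table; not taken from any real stemmer.
IRREGULAR = {"women": "woman", "men": "man", "children": "child"}


def toy_stem(term):
    """Toy stemmer: irregular plurals first, then strip one trailing 's'."""
    term = term.lower()
    if term in IRREGULAR:
        return IRREGULAR[term]
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term


# Assumption: the subject field was indexed WITHOUT stemming.
SUBJECT_INDEX = {"women": ["rec1", "rec2"]}


def fielded_search(term):
    """Stem the query term, then look it up in the unstemmed index."""
    return SUBJECT_INDEX.get(toy_stem(term), [])


print(fielded_search("women"))   # stemmed to "woman", which is not in the index
print(fielded_search("womens"))  # stemmed to "women", which is
```

In this sketch the "correct" query misses and the misspelled one hits, simply because the query and the index disagree about whether terms are stemmed.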
My hair isn't grey yet....
This post was inspired by the Quarter Life Crisis review. Unfortunately (a) posting a comment to the blog or getting a track back link fails and (b) we don't yet support what s/he wants to do.
I will try to remember to let you know when we get the static URL working. Currently, we do not support linking into the system. We are aware that it is a significant flaw.
In Quarter Life Crisis's section Refining, Prost writes:
The catalogue also supports refining searches, which is a very good idea and can be very helpful. Unfortunately the Subjects provided seem very random. When searching for Vector Bundles Why can I choose Vector Bundles – Congresses but it’s not related to Vector Bundles? Why can’t I simply limit my search results to a broad area like Mathematics, say? For example in this search, all the top results are maths related. Still the top Subjects to come up are completely unrelated.
The "Refine Search" section is based primarily on the LC Subject headings. It's tempting to believe these are hierarchical. They're not. They're *linear*. They are designed for *DRAWERS*. ANALOG. Little cards.
(deep breath -- sorry for the rant)
We spent MONTHS trying to use Recommind's automated classification to create something more intuitive. We tried the fragments of the subject headings, we tried using Dewey and LC Class Numbers. (Designed for collocating *books* on *shelves.* Physical object. One to one relations. More ANALOG.) It didn't really work. We had problems with granularity, problems with antique classifiers (where books on contemporary Iraqi politics have Dewey class numbers that put them "under" Archaeology -- which makes sense if you want your books on Mesopotamia and Ancient Persia to be near Iran and Iraq), and problems generating labels.
In the end, we do a rapid analysis of all the subject headings associated with the first hundred works that are returned. Catalogers gave subject headings of Vector Bundles – Congresses with the understanding that they'd be right behind the books on subject cards for Vector Bundles. They knew that the researcher intent on discovery would keep flipping back through the cards. No point in creating two different cards for the user to flip by, was there? And there was certainly no point in labeling this book "Mathematics" -- only books that address the whole broad subject area would get a subject heading like that.
...rambling about an example use and some bugs and forthcoming fixes...
I don't know much about vector bundles, so i'll switch to something i do: halo nuclei. The disambiguation problem is clear to me in this case. There are books about the structures of galaxies -- halo galaxies and the nuclei -- the thick centers -- of galaxies. Not my interest. I want halo nuclei -- stable nuclei that are believed to have a density distribution that falls off much less rapidly than the lighter isotopes of the same element. (Usually the next lighter nuclide is not stable.) Many of these elements are important in nucleosynthesis, though (which occurs in stars).
These "related" subjects, then, are the frequency ranked subjects assigned to the titles in the result set. If i'm looking for halo nuclei in the context of nucleosynthesis, i can limit by astrophysics and then check the refined list of subjects. I notice a bug i thought we'd gotten rid of -- astrophysics remains as "refine search by" subject. I check all the remaining subjects to see if i can limit on nucleosynthesis -- i can.
We're planning on changing the "show more subjects" listing to an alphabetic list. The current list puts the most frequently occurring at the top. Without numbers, it's hard to understand the ordering. [We don't display any numbers because we do the analysis across the first one hundred results -- so if twenty-five of the first one hundred results have "astrophysics" as a subject, i might get forty results when i select that limit.]
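The frequency ranking described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the record layout (a dict with a "subjects" list), the function name, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def related_subjects(results, sample_size=100, top_n=10):
    """Rank subject headings by how many of the first `sample_size`
    results carry them (hypothetical record layout)."""
    counts = Counter()
    for record in results[:sample_size]:
        # Count each heading at most once per record.
        counts.update(set(record["subjects"]))
    # Most frequent first; alphabetical among ties (an assumed rule).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [heading for heading, _ in ranked[:top_n]]


results = [
    {"subjects": ["Astrophysics", "Nucleosynthesis"]},
    {"subjects": ["Astrophysics"]},
    {"subjects": ["Galaxies"]},
]
print(related_subjects(results))
```

Because only the first `sample_size` records are counted, the tallies say nothing reliable about the full result set, which is why showing the numbers next to each subject would mislead.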
I was warned that starting a blog while managing a project might be a bit demanding -- and i haven't been as attentive to this as i have been to the project itself. So it goes.
I just had someone send the link to Sven-S. Porst's blog to me. It's the first substantial critique of the system I've seen outside our feedback mail, so i'll begin addressing some of the points brought up there.
Regrettably, our system is session-bound. We will likely restructure the session handling and system URLs once we know where the pilot system is headed, but so much of the query handling is in temporary relational database tables that links to result lists and so on may not be supported anytime soon. The most critical issue is getting session-free links to editions. So, unfortunately, no one can see the carefully linked examples in Quarter Life Crisis. My regrets.
Umlauts. Oh, ack. Actually, this is very easy to have all muddled unless you force all your users to use the same platforms and same clients. On my work PC running some version of Mozilla (not Firebird), cutting and pasting umlauts and Chinese scripts works. I know i've tested using "Küng" and the Chinese character for Mao. I also know, for some reason, i can't cut and paste those Chinese characters on my OS 10.2 Mac running Firebird Mozilla -- even though they display beautifully. I just now tested using "Küng" -- *sigh* -- "no results." Off to feedback for this problem.
Merged Records. Also known as our FRBR-like title clustering. This is flawed, and is going to be very hard to get better without losing many of the positive aspects altogether. Porst points to examples of bad/sloppy cataloging (my guess, without holding the book in my hand). I'm trying to replicate Porst's search -- I believe it was "do carmo manfredo differential." I note these two results -- and have indicated in bold the distinguishing factor:
2. Differential Geometry Of Curves And Surfaces, by Manfredo Perdig Ao Do Carmo
7. Differential Geometry Of Curves And Surfaces, by Manfredo Perdigao Do Carmo
If i go to RLG's Eureka and search on ti=Differential Geometry Of Curves And Surfaces and pn=Carmo, i find three records. All three REALLY are the same edition: "Englewood Cliffs, N.J. : Prentice-Hall, c1976."
The largest cluster of records (and do go read about RLG's clustering at this point) gives a dimension of 24 cm and displays the tilde a as "~a." I will not digress to give a lecture about the age of the union catalog and some of these records, about the fact they're in EBCDIC with some character codings developed in house because Unicode did not exist yet. See here instead. Suffice it to say that, over a quarter century, catalogers have struggled to key in non-ASCII data with varied success.
So the largest cluster has 39 holdings. The next largest has 2. I am struggling to see what the difference is. I remain mystified about the purpose of LCCNs (despite innumerable lectures) -- the LCCNs differ between the two records, and, yup, the LCCN is used to distinguish between editions. Someone dropped the 0 in LCCN: 75-22094. The third cluster has only 1 holding. Here the name is spelled with a "ă" and the dimension of the book is given as 23 cm. The spelling of the name does not affect the RLG clustering -- but the difference in the physical description kept this one edition from clustering with the 39. The correct expression of the tilde-a, however, was normalized correctly as an "a" whereas the incorrect expression of the tilde-a in the largest cluster was normalized to a space-a. Thus, the two appear to have two different authors and did not form the same title cluster.
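To make the space-a problem concrete, here is a minimal sketch of how a name normalizer can split one author in two. It assumes Unicode input and a made-up rule (strip diacritics, turn punctuation into spaces); our actual records are EBCDIC and the real normalization differs, so treat the function and its rules as hypothetical.

```python
import unicodedata


def normalize_name(name):
    """Hypothetical normalizer: drop diacritics, treat punctuation as a
    word break, lowercase, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    spaced = "".join(ch if ch.isalnum() else " " for ch in no_marks)
    return " ".join(spaced.lower().split())


# "\u00e3" is a properly encoded tilde-a; "~a" is the keyed-in workaround.
good = normalize_name("Manfredo Perdig\u00e3o do Carmo")
bad = normalize_name("Manfredo Perdig~ao do Carmo")
print(good)
print(bad)
print(good == bad)
```

The properly encoded name collapses to "perdigao" while the workaround becomes "perdig ao", so any clustering key built from the normalized author string treats them as two different people.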
Well, it's nice to fob all that off on poor cataloging, bad data, GIGO. There are places we can improve on the title clustering, even with good data. The first result if you search on Inferno can point that out.
As far as the German translation "missing" -- it's not in our Union Catalog. Unfortunately, if it was, it would likely not cluster into either of those titles because it is unlikely a cataloger would have established a uniform title for the work. I am NOT a cataloger, but i believe that the uniform title for this book should be the original Portuguese title. (Caveat later.) Thus every cataloged edition should have the title page title (in English or German) in the 245 field, and the Portuguese title in the uniform title field. However, catalogers have to make effort judgements. I am not surprised that for a mathematical text the uniform title has not been used. It is not used, in general. I do regret that the German translation isn't present. I wonder about the original Portuguese -- since the note on the English edition says that this is a translation "of a book and a set of notes, both published originally in Portuguese," whether the published book and separate(?) notes are among the five Portuguese works RedLightGreen displays for this search. (And if the English translation is of two separately and previously published works, the rules for uniform titles get a little out of control. And i won't begin with 7XX added entries.)
We had our first flash crowd recently. Hava Kagle sent a late afternoon message out to San Jose's library school students and alumni. Wham! The good news was we discovered an artificial limit to how many folks could be on the system. It wasn't until 12:45 am the next day that there was no one on the system, and "we" could restart DB2 to up the limit. (I was not part of that "we," but i will be tonight as we institute other overnight maintenance to improve tuning.) It does remain a limit, though, so we'll need to evaluate it as time goes on.
Elsewhile at RLG, Karen Smith-Yoshimura notes that September 12 marked the 20th anniversary of the implementation of the CJK script enhancements to RLIN.
I am amused by trying to read logs in OS X's terminal where apparently someone is testing the Unicode support of RedLightGreen.
[2004-08-18 I deleted two very similar comments but left a third, because it at least has nothing to do with cheap drugs -- it's quasi topical. I'm closing comments. Feel free to contact me by email if you're the author of the search msg. And i don't have Windows XP.]
I've been vanity surfing for RedLightGreen instead of working on my scripts for log analysis this morning. There's only one addition to the Google results. We also were notified of both Columbia & Swarthmore adding it to their resource pages.
http://www.libr.org/Juice/issues/vol6/LJ_6.21.html
http://www.swarthmore.edu/library/mt/archives/cat_new_databases.html
http://www.columbia.edu/cu/lweb/news/spotlight/2003/2003-09-24.redlightgreen.html
http://www.columbia.edu/cu/libraries/indexes/redlightgreen.html
I've noticed a few mentions[1] that RedLightGreen is modeled on Amazon as well as Google. I know this got started in our own information about the system, in which we stated RedLightGreen would be more like Amazon or Google than an OPAC. (Perhaps i should have read the document with an eye for smaller details than I did.) We investigated some features of Amazon that we would have been willing to include in the system -- specifically the reviews. While the Amazon reviews may be wonderful for selling books to the consumer or the self-directed reader, college students looking for books don't want to know what other college students think. They expressed distrust.
They would like the additional bookseller information -- the image of the book cover, the publisher's blurb, and professional reviews. Depending on the results of the RedLightGreen pilot study, those features may get pulled into the dataset.
Amazon makes acquiring the found item as easy as possible; we'd like to emulate that. I've been a little surprised at the cynicism and distrust of students, though -- the inclusion of bookseller links on the results page produced negative feedback. This doesn't make RedLightGreen unlike an OPAC, though. I suspect university librarians make as much use of the web interface to their systems to get library materials into the hands of their students as fast as possible. RedLightGreen's efficiency will depend a great deal on the local library's ability to put the item in the hands of the user.
I suspect that Amazon and OPACs have a strength more in common with each other than with the pilot version of RedLightGreen, and that is the speed with which a known item search can be completed. We've biased the RedLightGreen interface for the user who is not a specialist, who is looking for works about a topic, and who has their own language to describe that topic. In doing so, i suspect that searches for known but uncommon (few editions) works will be supported better by Amazon or an OPAC than RedLightGreen. On the other hand, if you know you're interested in the caste system in India, a wide net cast into Amazon is going to pull up too many titles not of interest.
I want to record what's been said on the web about RedLightGreen before we launch. In part, it will be interesting to see if the characterization of certain features changes when people get their hands on the engine. Will the FRBR buzz go away? Where did the "like Amazon" comparison come from and will it go away? Will the Googleness remain praised? Will anyone notice the vocabulary expansion that comes about from linking Subject Authority records in the indexed terms or from Recommind's analysis of relatedness?
2003-07-20: Google displays 47 results of about 113. (Many are the Shelf Life citation repeated over and over.)