The current Recommind engine, when it has as much data in memory as this one has, has a little problem with garbage collection. The simple model, garbage collect frequently and often, essentially creates tiny outages every minute or so. Last night we restarted the engine in order to switch over to garbage collection every night. Recommind's new version will obviate the problem, but for the next few weeks we're choosing the overnight outage over the frequent micro-outages.
From top: Memory: 24G real, 370M free, 28G swap in use, 40G swap free
So, we also ran a backup of the DB2 database last night. The previous backup, it turns out, was significantly smaller. That may be because we haven't deleted some of the test tables we made for determining whether clustered indexing would improve performance. I doubt we've had *that* many folks register.
I am aware of this size differential because last night's backup failed, hanging the system from 3 am to 5:45 am. That's when my spouse tapped me and said, "Did you get up when your 3 am alarm went off?" The system then hung for another handful of minutes as i tracked down the phone number of the DBA. He made more space and began the backup all over again. I would have liked to postpone until midnight tonight, but apparently we would have had to block all registrations all day at best. The full backup took about three hours and created two parts each 199.98 GB in size.
So we're back up now.
I spent the time looking at the resolved addresses of the denied parties. We had sent out a message to all the RLG Member representatives yesterday at 4 pm. I had a nice sense of how global the membership is as i watched denials from Oxford to Hawaii.
If anyone sees Murphy, please tell him to throw the book at someone else for a while.
We had our first flash crowd recently. Hava Kagle sent a late afternoon message out to San Jose's library school students and alumni. Wham! The good news was we discovered an artificial limit to how many folks could be on the system. It wasn't until 12:45 am the next day that there was no one on the system, and "we" could restart DB2 to up the limit. (I was not part of that "we," but i will be tonight as we institute other overnight maintenance to improve tuning.) It does remain a limit, though, so we'll need to evaluate it as time goes on.
Elsewhile at RLG, Karen Smith-Yoshimura notes that September 12 marked the 20th anniversary of the implementation of the CJK script enhancements to RLIN.
I am amused by trying to read logs in OS X's terminal where apparently someone is testing the Unicode support of RedLightGreen.
[2004-08-18 I deleted two very similar comments but left a third, because it at least has nothing to do with cheap drugs -- it's quasi topical. I'm closing comments. Feel free to contact me by email if you're the author of the search msg. And i don't have Windows XP.]
I've been vanity surfing for RedLightGreen instead of working on my scripts for log analysis this morning. There's only one addition to the Google results. We also were notified of both Columbia & Swarthmore adding it to their resource pages.
http://www.libr.org/Juice/issues/vol6/LJ_6.21.html
http://www.swarthmore.edu/library/mt/archives/cat_new_databases.html
http://www.columbia.edu/cu/lweb/news/spotlight/2003/2003-09-24.redlightgreen.html
http://www.columbia.edu/cu/libraries/indexes/redlightgreen.html
Just a brief note to mention we've launched. http://www.redlightgreen.com
The performance isn't as snappy as we want, and we've got a list of possible problems and solutions: IO saturation on the fiber channel connecting the DB2 database and the raid storage, garbage collection in the java virtual machine that runs the search engine, some improvements in the design of some of the data tables.... In the mean time we may put an interstitial "Please wait" message in.
Just a favor -- if you register, please note in your "Major" or field of study "tunabreath" or "blogger" or "information science." We've got a pretty strict privacy policy in place, so this is one of the few items that will be recorded. It'd just be nice to know.
Thanks!
We are fast approaching launch, and in some strange mixing of metaphors, i find myself suffering bouts of vertigo.
Our information architects felt we needed a number of sample searches to help orient new users to RedLightGreen -- give a sense of the breadth of content, the types of searches that are supported. Pam Dewey, from Member Services, suggested that we arrange it as different sample paper topics, and she looked up assignments on-line. Once we were able to get the full database to search, Joe Zeeman began vetting the assignments.
We had declined to use Recommind's stop words when indexing the documents. Over half of RLG's Union Catalog are records for items not in English. While the records would be coded for the language of the item, much of the cataloging is in English. Still, even were we to decide that we would use the 008 language fields to catalog the transcribed MARC fields and treat the rest as English, we would still apply English stop words to non-English fields. Not only are uniform titles not necessarily in the language of the item *or* in the cataloging language, but we also have catalog records from libraries that catalog in languages other than English. (For example, the Swiss National Library is now a member and catalogs in three languages.) Since using the Recommind relationship analysis to provide cross language access was a goal, we didn't want to sabotage it by applying stop words across the board.
With these stop words indexed, the Recommind engine gives the same results for a search on "caste in India" as "caste India." The very common articles and prepositions don't take a prominent position in the word vectors that determine relatedness. The same results, though, take an order of magnitude longer to return.
My initial response was to blow that off. Who searches with whole phrases? Well, it seems around half the folks with whom i talked search whole phrases. They know the articles and prepositions don't affect the results; they expect them to be stripped away, Google-like, so they type the whole phrase.
I run to the opposite extreme. I'll often just use "francisco" with other terms when trying to find San Francisco related results. After hall conferences, a quick call to Recommind indicated we could turn on the stop words now, leaving the words indexed in the system. Enough RLG staff seemed to expect the words to be stripped out, and the delay in the search did seem significant. "Go ahead," i said, "and we'll review the stop word list later."
The stop word list was extensive. Recommind's large legal clientele seems obvious when noting the stop words of "accordingly," "consequently," "furthermore," "howbeit," and especially "thereafter," "thereby," "therefore," "therein," "thereof," "thereto," and "thereupon." The amount of correspondence indexed reveals itself in excluding a number of salutations.
"Death Becomes Her" became a search for "death" with this stop word list. I rapidly revised down the 576 stop words to 58. Twenty-five of those were the letters of the alphabet except for X. "Generation X" is, apparently, a well used subject heading.
We've got the shorter stop word list in place now; i hope no one expects to find the results of "Who, What, Where, and Why In Having A Will."
Constant review of the duration of searches and the frequency certain words are used is going to be required for us to tune that short list. We’ll likely implement the stop word list at the application layer, so that we can allow a user to require a stop word as well.
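Stripping stop words at the application layer, with a way for the user to require one, might look something like the minimal sketch below. Everything in it is an assumption for illustration: the `+` prefix for "require this word," the function name, and the few words added to the letter list. It is not the actual 58-word list or the RedLightGreen code.

```python
import string

# Hypothetical short stop list: every single letter except "x" (kept so that
# "Generation X" still works), plus a few illustrative words. This is NOT the
# real 58-word list.
STOP_WORDS = (set(string.ascii_lowercase) - {"x"}) | {"the", "in", "of", "and"}


def strip_stop_words(query):
    """Return the lowercased terms that would be passed on to the engine.

    A term prefixed with "+" (an assumed syntax) is always kept, even when
    it appears in the stop list.
    """
    kept = []
    for raw in query.lower().split():
        if raw.startswith("+"):
            term = raw[1:]
            if term:  # ignore a bare "+"
                kept.append(term)
        elif raw not in STOP_WORDS:
            kept.append(raw)
    return kept
```

With this sketch, "caste in India" and "caste India" reduce to the same two terms before the engine ever sees them, while "+the who" keeps "the" because the user asked for it.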
I was relaxing by taking a moment to be indignant over Verisign's disgusting move, and saw this article on The Register: Google - the only archive we'll ever need?
Someone thought Google could be considered an archive (for longer than a month or so)? Someone thought Google safeguarded privacy (in a broad society wide sense)? Ouch.
Two comments seem relevant to the RedLightGreen enterprise, however:
"The implications of Google have real implications for mass social procedure, on how we enquire," said Byfield. "It's so much bigger than terrifying - it's Interesting." and "I'm a librarian, and I like Google," said Steve Cisler from the floor. "But I appreciate the point being made that there are different information domains. There is a whole lot of information that's not on the Internet and may possibly be offline."
...may possibly be offline and residing in those book thingies in a library somewhere. Hopefully, RedLightGreen will begin to be a bridge between that changed way of inquiring and those "possibly offline" resources.
Merrilee Proffitt gave her presentation today to the RLG staff. I'm not sure how i feel about how positively folks responded. They seem to think this engine, which will only be promoted at the pilot institutions, will be swooped on and used by the whole country by October. This, after months of struggling with how to reach students at the pilot institutions, seems just surreal. Do students really want to find *books*? (We think students just want to find journal articles.)
...back to work
Merrilee Proffitt had to announce to our partners a potential slide of one week due to data loss. This was pretty embarrassing, and we already had a librarian comment back that weekly backups are too far apart. Well, yes. So, in one of my less fun compositional efforts, i wrote the following to be shared.
The RedLightGreen project has hit a series of cascading decisions that meant moving faster than with our usual caution. Unfortunately, we had to switch to the latest version of Websphere 5 in order to use the latest version of DB2. This was a surprise, as IBM notes that the DB2 clients are one version forward- and back-compatible. However, in the step between versions 7 and 8, the behavior for communicating between 32 and 64 bit architectures has been added/changed and is not back-compatible. This meant rapidly deploying to a version of Websphere that supports the version 8 client -- and other applications in our usual development environment were not ready for this promotion. We installed this new version of Websphere on a machine usually used for network monitoring. It will be both the development and production machine while the other applications are migrated to the new version. I did not remember to ask our operations staff to back up this machine as if it were a development machine. Thus, the conditions were set for our data loss.
I reply because I want to assure you that in normal operating and development situations RLG does conform to standard industry practices. In this situation, we were doing too many things at once. Even so, I'm proud of the work our folks have done in the strain of upgrading to two new application systems. I was warned at the onset of the database upgrade that there would be hiccups. Nonetheless, the performance enhancements of going to the next version were well worth the struggle.
Given a search on "Heidi," we don't return the Heidi Chronicles.
If the problem is that there are 1000 authors named Heidi and no way for the Recommind engine to prefer one author to another (meaning if you were looking for an author named Heidi you'd be missing 500+ of them), should we decide to err on the side of titles and make titles slightly more relevant than authors? We can't perform the miracle of turning up that author you vaguely remember as having the name Heidi, no matter how hard we try -- we can't handle results larger than 500 in a timely manner. (Now, remember that her book was about biscuits and we can help.) So, we should decrease the relevancy value of the author fields.
The DB2 tuning week pointed out that we're just asking way too much. I think of the way we're presenting data as doing the three or four searches an expert might do when using Eureka as a discovery tool all at once. Of course, an expert knows which way to narrow their search -- we have to narrow in multiple dimensions, so it's thirty or forty searches all at once.
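As a toy illustration of weighting titles slightly above authors: the sketch below is not how the Recommind engine computes relevance (that relies on word vectors), and the weights, field names, and sample records are all hypothetical. It only shows that a small difference in field weight is enough to lift a title match above a crowd of author matches.

```python
# Hypothetical field weights: titles count slightly more than authors.
FIELD_WEIGHTS = {"title": 1.0, "author": 0.8}


def score(record, term):
    """Sum the weights of every field that contains the term as a whole word."""
    term = term.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = record.get(field, "").lower().split()
        if term in words:
            total += weight
    return total


def rank(records, term):
    """Return the matching records, highest score first."""
    scored = [(score(record, term), record) for record in records]
    hits = [(s, record) for s, record in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in hits]


# Made-up records for illustration only.
records = [
    {"title": "Biscuits", "author": "Heidi Example"},
    {"title": "The Heidi Chronicles", "author": "Author A"},
    {"title": "Unrelated", "author": "Author B"},
]
top = rank(records, "heidi")[0]["title"]
```

Here `top` is the title match, and the record that matches neither field drops out entirely.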
We can normalize all the subjects to improve performance. Currently we do group by's and unique's on textually normalized subject strings (diacritics, case, most punctuation stripped away). We'll likely assign each unique subject string a key and replace the subject string by the key in the index table. This seems really stupid when you're aware how many unique subjects exist. However, the data manipulations of long ints will be much faster than the manipulations of the excessively long strings we allow for subjects.
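The key assignment described above can be sketched roughly as follows. This is a minimal illustration in Python, not the DB2 implementation: the function names are made up, and it strips all ASCII punctuation, which is an assumption (the real normalization strips only "most" punctuation).

```python
import string
import unicodedata


def normalize_subject(subject):
    """Strip diacritics, case, and ASCII punctuation from a subject string."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", subject)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: remove every ASCII punctuation character.
    no_punct = no_marks.translate(str.maketrans("", "", string.punctuation))
    # Lowercase and collapse runs of whitespace.
    return " ".join(no_punct.lower().split())


def assign_subject_keys(subjects):
    """Map each unique normalized subject string to a small integer key."""
    keys = {}
    for subject in subjects:
        normalized = normalize_subject(subject)
        if normalized not in keys:
            keys[normalized] = len(keys) + 1
    return keys
```

The index table would then store the integer key in place of the string, so grouping and uniqueness checks compare integers rather than long subject strings.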
But to build that -- how long? So, we might launch without much subject analysis at all. No disambiguation. No representative subject for each work. (I am annoyed by how novels often get "Large Print Edition" subjects as they rarely have subjects so the large print edition's subject floats to the top.)
I'm not happy about dropping these features -- they seem necessary for the engine to be useful. Yet, for launch....
Today is training on the Recommind MindServer.
We've spent the past month upgrading to DB2 vs 8. Version 8 of DB2 didn't talk to the DB2 vs 7 client, despite many assertions from IBM that "the client and database are one version forward and back compatible." Oh -- you've gone to 64 bit vs 8 -- no, that won't work. Then we had to upgrade the client, but the client wasn't compatible with the version of the Websphere application server we were running. We needed to upgrade that.
Then there was a set of poorly installed patches to Solaris which filled up a root drive, bringing down our development system.
Last week Recommind delivered their database, but we were having a hard time keeping the full database up with version 8. I think we're finally beginning to start testing the Recommind engine plus our full database of books.
I have occasional bouts where i forget to breathe. I was looking for Hans Küng's Does God Exist?. (It came to mind because i am going to alter our disintegrating edition à la Tom Phillips.) "Kung god" -- amusing results, not what i'm looking for. "Hans Kung" -- look at all the works with Hans Küng as author, but can't find my title. "Does god exist" -- It seemed that we only had one work with the title "Does God Exist?" when there should have been -- and i go examine Eureka -- ten or eleven.
Ah-ah! I'm searching the test database, with only 5% of the books records. So, back to the full database. "kung god" -- this time, i get the note that there are 285 results, but none are displayed. "kung god exist" turns up
1.Existiert Gott, by Hans Kung
6 editions published between 1979 and 1991 in 2 languages.
Primary Subject: God
[score=1.0, rank count=88, order=5.47733681447821]
Well, yay.
But before i can think about the frustrating implication of non-English language uniform titles for American undergraduates, i have to search on my name, so my SQL expert can try to figure out why "kung god" never displays a result.
"judith" turns up many results about horses, particularly books i remember from my childhood, like Misty of Chincoteague. Since i grew up with horses, this seems oddly appropriate, but a little too close to mind-reading to be believed. I expect it's part of a caching problem we've seen all day.
The invisible results from "kung god" have to do with the CJK aggregator existing in our "work" data -- something that is used internal to RLG to facilitate CJK searching.
Meanwhile, another search on "judith" turns up The Books of Common Prayer, Confessions of a saint, Cicero on the nature of gods, and "Kusumanjali, by Udayanacarya" in the first five results.
I could get a swelled head over this!