We are fast approaching launch, and in some strange mixing of metaphors, i find myself suffering bouts of vertigo.
Our information architects felt we needed a number of sample searches to help orient new users to RedLightGreen -- to give a sense of the breadth of content and the types of searches that are supported. Pam Dewey, from Member Services, suggested that we arrange them as different sample paper topics, and she looked up assignments on-line. Once we were able to get the full database to search, Joe Zeeman began vetting the assignments.
We had declined to use Recommind's stop words when indexing the documents. Over half of the records in RLG's Union Catalog are for items not in English. While the records would be coded for the language of the item, much of the cataloging is in English. Still, even were we to decide that we would use the 008 language fields to catalog the transcribed MARC fields and treat the rest as English, we would still apply English stop words to non-English fields. Not only are uniform titles not necessarily in the language of the item *or* in the cataloging language, but we also have catalog records from libraries that catalog in languages other than English. (For example, the Swiss National Library is now a member and catalogs in three languages.) Since using the Recommind relationship analysis to provide cross-language access was a goal, we didn't want to sabotage it by applying stop words across the board.
With these stop words indexed, the Recommind engine gives the same results for a search on "caste in India" as "caste India." The very common articles and prepositions don't take a prominent position in the word vectors that determine relatedness. The same results, though, take an order of magnitude longer to return.
My initial response was to blow that off. Who searches with whole phrases? Well, it seems around half the folks with whom i talked search whole phrases. They know the articles and prepositions don't affect the results; they expect them to be stripped away, Google-like, so they type the whole phrase.
I run to the opposite extreme. I'll often just use "francisco" with other terms when trying to find San Francisco-related results. After some hallway conferences, a quick call to Recommind indicated we could turn on the stop words now, leaving the words indexed in the system. Enough RLG staff seemed to expect the words to be stripped out, and the delay in the search did seem significant. "Go ahead," i said, "and we'll review the stop word list later."
The stop word list was extensive. Recommind's large legal clientele seems obvious when noting the stop words of "accordingly," "consequently," "furthermore," "howbeit," and especially "thereafter," "thereby," "therefore," "therein," "thereof," "thereto," and "thereupon." The amount of correspondence indexed reveals itself in excluding a number of salutations.
"Death Becomes Her" became a search for "death" with this stop word list. I rapidly revised the 576 stop words down to 58. Twenty-five of those were the letters of the alphabet except for X. "Generation X" is, apparently, a well-used subject heading.
We've got the shorter stop word list in place now; i hope no one expects to find the results of "Who, What, Where, and Why In Having A Will."
Constant review of how long searches take and how often certain words are used is going to be required for us to tune that short list. We’ll likely implement the stop word list at the application layer, so that we can allow a user to require a stop word as well.
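To make the application-layer idea concrete, here is a rough sketch in Python of what stripping stop words from a query might look like before it is handed to the search engine. Everything in it is an assumption for illustration: the handful of words in `STOP_WORDS` is not our actual list of 58, the function name `strip_stop_words` is made up, and the leading `+` for "keep this word even though it is a stop word" is just one possible syntax, not something we have settled on.

```python
import string

# Hypothetical short stop list (NOT the real list of 58): a few common
# English words plus every single letter except "x", so that a search
# like "Generation X" keeps its X.
STOP_WORDS = {"an", "and", "in", "of", "the", "to"} | (
    set(string.ascii_lowercase) - {"x"}
)


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Drop stop words from a query string before it reaches the engine.

    A token written with a leading "+" (an assumed syntax) is always
    kept, with the "+" removed, even if it is on the stop list.
    """
    kept = []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            # The user explicitly required this word: keep it as typed.
            kept.append(token[1:])
        elif token.lower() not in stop_words:
            kept.append(token)
    return " ".join(kept)
```

With this sketch, `strip_stop_words("caste in India")` gives `"caste India"`, while `strip_stop_words("caste +in India")` leaves the "in" in place. It splits on whitespace only, so punctuation stuck to a word (the commas in "Who, What, Where") would need handling of its own.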