Friday morning sessions
Friday, June 22nd, 2007
Agreeing to Disagree: Search Engines and their Public Interfaces, presented by Frank McCown with Michael L. Nelson
This paper described an experiment to see whether the results from the search APIs offered by Google, MSN, and Yahoo differed from those given by the web user interface (WUI). There are some fascinating results in this five-month study, which is available online. Broad conclusions include that the MSN search API produces the most stable results and the ones most similar to the web interface; that the API indexes may be smaller than those behind the web interface but are just as fresh; and that there are big changes in the top-ranked results between API and WUI with Google, not so much with Yahoo. Also noted: Google’s backlink results seem pretty stale, while those from MSN and Yahoo are fresher.
A good question was how the researcher justified violating the terms of service by screen scraping the web interface. Frank described how he tried to get permission but could not reach anyone who could give (or refuse) it. He described trying to do no harm, limiting use to below that allowed by the API, and so on. The larger question of the very closed nature of these search engines (indeed, the research was driven by the lack of information about the resource) and their increasing impact in the academy was not directly addressed.
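To make the API-versus-WUI comparison concrete, here is a minimal sketch of one way two ranked result lists for the same query could be compared. It is not necessarily the measure used in the paper; the function names and the example URLs are invented for illustration.

```python
def overlap_at_k(api_results, wui_results, k=10):
    """Fraction of the top-k URLs that appear in both lists (order ignored)."""
    top_api = set(api_results[:k])
    top_wui = set(wui_results[:k])
    if not top_api and not top_wui:
        return 1.0  # two empty result lists agree trivially
    return len(top_api & top_wui) / max(len(top_api), len(top_wui))


def rank_shifts(api_results, wui_results, k=10):
    """Map each URL in the WUI top-k to (WUI rank, API rank or None)."""
    api_rank = {url: i + 1 for i, url in enumerate(api_results)}
    return {
        url: (i + 1, api_rank.get(url))
        for i, url in enumerate(wui_results[:k])
    }


# Hypothetical result lists for a single query.
api = ["a.example/1", "a.example/2", "a.example/3", "a.example/4"]
wui = ["a.example/2", "a.example/1", "a.example/5", "a.example/3"]

print(overlap_at_k(api, wui, k=3))
print(rank_shifts(api, wui, k=3))
```

Running a comparison like this daily over a fixed query set would give a rough picture of how stable each interface is, and how far the two drift apart over time.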
Static Reformulation: A User Study of Static Hypertext for Query-Based Reformulation, presented by Michael Huggett, co-author Joel Lanir
An interesting experiment on how to most effectively follow the “scent of information,” comparing keyword search to a browse of a cluster of computationally similar (based on keywords) documents. It was a very controlled and constructed experiment. The broad takeaway for me was the distinction in tasks: when browse and when search is more effective in retrieving articles. The trick here, it seems, is in constructing the browse of similar items. I’ve never found Google’s “similar” to support that; in bibliographic data, classes and subject headings are manually constructed to provide that browse network. It’d be interesting to see the study repeated using that body of “similarity.” (I note that the Recommind engine essentially built collections of similar bibliographic records and that fed into the result sets.)
Later in the afternoon there was
“Effects of Structure and Interaction Style on Distinct Search Tasks”
presented by Robert Capra, coauthors Gary Marchionini, Jung Sun Oh, Fred Stutzman and Yan Zhang
This compared the hand-crafted, high information density front page of the Bureau of Labor Statistics (I suspect that Edward Tufte would approve) to two faceted browse interfaces. This test had a couple of different tasks, but didn’t find a particular improvement between the two different interfaces. Admittedly, though, all three were browse tests. Users noted how they missed the search interface and were frustrated.
A Rich OPAC User Interface with AJAX, Jesse Prabawa Gozali and Min-Yen Kan, presented by Min-Yen Kan
My work with RedLightGreen gave some depth to my admiration for this presentation of an OPAC interface using AJAX with a MySQL or Lucene backend, which is intended to replace an Innovative interface (for discovery, presumably). Goals were to create a way to compare different detailed results and dynamically change sort order. A particularly elegant feature is how users can determine whether a particular item is listed in different search results (tabs for the search history). The interface can be reviewed at http://opac.comp.nus.edu.sg. I ponder scalability, of course, and the difference between the RedLightGreen Google-like ranking and the traditional sort choices.
The Large Scale Collections session was rather engaging for me. (“A New Generation of Textual Corpora: Mining Corpora from Very Large Collections” Gordon Stewart, Gregory Crane** and Alison Babeu; “Subject Metadata Enrichment using Statistical Topic Models” David Newman**, Kat Hagedorn, Chaitanya Chemudugunta and Padhraic Smyth; and “Organizing the OCA: Learning faceted subjects from a library of digital books” David Mimno** and Andrew McCallum; where **presenter) Greg Crane (of Perseus Digital Library), in a similar vein to his talk in the panel on Thursday, spoke about the very rich tools needed for scholarly work in textual corpora. I reflect on a conversation I had with a colleague in RLG Programs, the Monday before I left, about what tools are needed on top of these large collections of text (Google, Open Content Alliance, Gutenberg). Greg has a long wish list! Of particular interest to me are the edition comparison and management needs, but Greg brings up an idea (I’ve heard this before; does it go back to Vannevar Bush? Or Greg at a previous JCDL?) that books should talk to other books. A book should link out to a concordance, and a phrase that is referred to in other works should link to those references (ah, for trackbacks to Shakespeare). Greg frames this in the great humanities question of “how do we understand human expression” and notes that Western culture has perhaps done a good job understanding Western culture, but not broader global human expression. (A hint, yesterday, that issues of “Homeland Security” might be better addressed by better cultural literacy.)
The next two talks had to do with statistical assignment of items to topics or classifications, a similar process to the latent semantic indexing (LSI) we applied to the union catalog with Recommind. The labeling of the topics still seems to be manual (which isn’t surprising, but one can always hope for miracles). What was particularly interesting was how, to support parallelizing the process, David Mimno described in “Organizing the OCA: Learning faceted subjects from a library of digital books” applying the classification on a page-by-page basis (generating in-book classifications) and then classifying those across a much larger corpus. That page-by-page classification, though, then assigns each word to different classes. In a sense, each word on the page is classified, and in turn disambiguated from different meanings of the same term. It’s clear that there’s a springboard here to Greg Crane’s wish list.
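To illustrate what it means for each word on a page to receive its own topic, here is a minimal sketch of a generic collapsed Gibbs sampler for a plain LDA topic model, with pages treated as the documents. This is not the specific model from the paper; the function name, the parameter values and the toy pages are all assumptions made for the example.

```python
import random


def gibbs_lda(pages, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Assign a topic to every word occurrence on every page (plain LDA)."""
    rng = random.Random(seed)
    vocab_size = len({w for page in pages for w in page})

    # z[p][i] is the topic of the i-th word on page p; start at random.
    z = [[rng.randrange(num_topics) for _ in page] for page in pages]
    page_topic = [[0] * num_topics for _ in pages]      # words per topic, per page
    topic_word = [dict() for _ in range(num_topics)]    # word counts per topic
    topic_total = [0] * num_topics                      # total words per topic

    for p, page in enumerate(pages):
        for i, w in enumerate(page):
            t = z[p][i]
            page_topic[p][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1

    for _ in range(iterations):
        for p, page in enumerate(pages):
            for i, w in enumerate(page):
                # Remove this word's current assignment from the counts.
                t = z[p][i]
                page_topic[p][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight each topic by how common it is on this page and
                # how strongly it is associated with this word elsewhere.
                weights = [
                    (page_topic[p][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[p][i] = t
                page_topic[p][t] += 1
                topic_word[t][w] = topic_word[t].get(w, 0) + 1
                topic_total[t] += 1

    return z, page_topic


# Toy "pages"; the word "bank" appears in two different senses.
pages = [
    ["bank", "river", "water", "bank", "shore"],
    ["bank", "money", "loan", "bank", "interest"],
    ["river", "water", "shore", "fish"],
    ["money", "loan", "interest", "credit"],
]
assignments, page_topic_counts = gibbs_lda(pages, num_topics=2)
for page, topics in zip(pages, assignments):
    print(list(zip(page, topics)))
```

Because the sampler weighs both the page’s topic mix and the word’s associations across the collection, the same term (here “bank”) may end up with different topics on different pages, which is the disambiguation effect noted above. The page-level counts could then be summed per book, which is one possible reading of the page-then-book process described in the talk.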