Tuna Breath: July 2003 Archives

July 31, 2003

Shelf Life, No. 117 (July 31 2003) ISSN 1538-4284

Today's Shelf Life has three summaries that seem to build on a theme.

In one summary, Rita Vine of Workingfaster.com states that researchers need a toolbox of starting places for searches.

The next summary points to emerging subject-specific search engines. The example given is CiteSeer. What's not mentioned is that the software that drives CiteSeer, ResearchIndex, is available for developing new indexes. I remember Kurt battling with NEC lawyers to get a decent license for the code -- there isn't an obvious link that makes it clear it's available, but it is. (At the last JCDL there were several talks about making the extraction of metadata/features in the crawled articles more accurate.) Thus, a proliferation of ResearchIndex-driven search engines leads to Rita Vine's multiplicity of rich starting places.

Finally, there's an article about how all these tools will need to implement methods for visualizing the result sets. I suspect that these tools are going to be for a fee. "Most users" (in particular, the students we surveyed when developing RedLightGreen) want the Google "I'm Feeling Lucky" result. They did their search, now tell them what they want to know. I have a cat who is having a biopsy for a lump on his gum -- i want a few websites about feline mouth cancers. I don't want to see a huge widget that shows the ratio of word choices like oncology vs cancer, feline vs cat, but if i was trying to figure out the best treatment for my cat, I may very well be interested in subscribing to a resource that would help me quickly sort out the many results in the tail of the popularity distribution.

Posted by judielaine at 07:38 AM | Comments (0) | TrackBack

July 29, 2003

Is RedLightGreen like Amazon?

I've noticed a few mentions[1] that RedLightGreen is modeled on Amazon as well as Google. I know this got started in our own information about the system, in which we stated RedLightGreen would be more like Amazon or Google than an OPAC. (Perhaps i should have read the document with an eye for smaller details than I did.) We investigated some features of Amazon that we would have been willing to include in the system -- specifically the reviews. While the Amazon reviews may be wonderful for selling books to the consumer or the self-directed reader, college students looking for books don't want to know what other college students think. They expressed distrust.

They would like the additional bookseller information -- the image of the book cover, the publisher's blurb, and professional reviews. Depending on the results of the RedLightGreen pilot study, those features may get pulled into the dataset.

Amazon makes acquiring the found item as easy as possible; we'd like to emulate that. I've been a little surprised at the cynicism and distrust of students, though -- the inclusion of bookseller links on the results page produced negative feedback. This doesn't make RedLightGreen unlike an OPAC, though. I suspect university librarians make as much use of the web interface to their systems to get library materials into the hands of their students as fast as possible. RedLightGreen's efficiency will depend a great deal on the local library's ability to put the item in the hands of the user.

I suspect that Amazon and OPACs have a strength more in common with each other than with the pilot version of RedLightGreen, and that is the speed with which a known item search can be completed. We've biased the RedLightGreen interface for the user who is not a specialist, who is looking for works about a topic, and who has their own language to describe that topic. In doing so, i suspect that searches for known but uncommon (few editions) works will be supported better by Amazon or an OPAC than RedLightGreen. On the other hand, if you know you're interested in the caste system in India, a wide net cast into Amazon is going to pull up too many titles not of interest.

[1] The outline of Steven Bell's talk. I assume he pulled the preview screen shots from Merrilee Proffitt's CNI Briefing PowerPoint.

Posted by judielaine at 02:46 PM | Comments (0) | TrackBack

July 28, 2003

RSS of book lists, continued

Catalogablog, which noticed this blog's existence today, has a fleeting mention of RSS as a hip tool. David suggests a feed of "most circulated books, books with most holds." I suppose the RedLightGreen system could provide a feed of the most common books in the "Your List" feature, but... other than being cool and hip, what would this do? More likely to be of interest for folks who want a sense of trends would be the top searches -- a RedLightGreen zeitgeist (pushing our emulation of Google too far).

I don't see any of these ideas really helping users discover books that inform their passions (or the field of study they said they'd stick out to make their parents happy).

I do find myself thinking of RSS as a "better than email" way of reaching users with timely information. I don't think we'd be able to come up with this for the pilot, but i'm beginning to imagine a catalog feed that combines general system notices ("Our Privacy statement has changed.") with localized notices (for all users in these zip codes, note the addition of this new library connection) with personalized notices ("Other RedLightGreen lists that have entries like yours also have the following entries:.....").
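To make the combined feed a little more concrete, here is a minimal sketch of how the three kinds of notices might be merged into one RSS 2.0 document. Everything in it is an assumption for illustration: the function name, the idea of keying local notices by zip-code prefix, the example.org link, and the use of the RSS category element to mark each notice's kind. It is not a description of anything RedLightGreen actually does.

```python
import xml.etree.ElementTree as ET


def build_notice_feed(user_zip, system_notices, local_notices, personal_notices):
    """Return an RSS 2.0 string combining three kinds of notices.

    local_notices maps a zip-code prefix to a list of notice strings.
    All names and the feed layout here are hypothetical.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Catalog notices"
    ET.SubElement(channel, "link").text = "http://example.org/notices"
    ET.SubElement(channel, "description").text = "System, local, and personal notices"

    # System notices go to everyone.
    selected = [("system", text) for text in system_notices]
    # Local notices go only to users whose zip code starts with the prefix.
    for prefix, texts in local_notices.items():
        if user_zip.startswith(prefix):
            selected.extend(("local", text) for text in texts)
    # Personal notices are assumed to be computed elsewhere for this user.
    selected.extend(("personal", text) for text in personal_notices)

    for kind, text in selected:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text
        ET.SubElement(item, "category").text = kind
    return ET.tostring(rss, encoding="unicode")
```

A newsreader would only ever see the finished XML; the filtering by zip prefix happens on the server, so the feed URL would have to be per-user.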

Posted by judielaine at 05:35 PM | TrackBack

RSS feeds for RedLightGreen

I've pondered the use of an RSS feed in the RedLightGreen environment. I can imagine someone wanting to "syndicate" the RedLightGreen "Your List" of records in a courseware interface or a syllabus on the web -- really just pulling the more or less static list into an HTML presentation. That seems to be a good enough justification. Having the list as a subscription in one's newsreader might be preferred by some users to having a bookmark in a web browser; that's a slightly less compelling motivation. I imagine a group working on a project together, sharing their reading lists -- but how often would these lists change? I know that there will be clever uses for it if we make a feed available, so I'll probably urge a trial at some point during the pilot.

I find myself more ambivalent after finding Dan Gillmor's note that "Amazon.com Syndicated Content" is now available. Amazon writes, "Selected categories, subcategories and search results in Amazon.com stores now have RSS feeds associated with them, delivering a headline-view of the top 10 bestsellers in that category or set of search results." It is, perhaps, that there are no little orange XML blocks anywhere to be seen that frustrates me. And perhaps i should celebrate: if i can't simply subscribe to the wish list of all my friends at Amazon, it means there's room for RedLightGreen to grow there.

Posted by judielaine at 04:14 PM | Comments (0) | TrackBack

Search by Email

I've been forwarded a news clipping from the BBC by a co-worker. "A Search Engine for the World's Poor," is the title, and it relates, "Researchers at MIT are designing a search engine geared to the needs of computer users in the world's disadvantaged countries, most of whom have only sporadic access to the Web at what are often less-than-optimal bandwidths." The clipping is about the TEK project; my co-worker suggests that we would do the same -- develop an e-mail interface to our search engines. It seems this could also be useful for complicated searches for which we haven't optimized the system. There are "data mining"-like queries i can imagine being of use to a researcher -- not RLG's primary audience of librarians, but academics who are interested in, say, the rate of new editions for classic authors. We have the data the researcher may be interested in, but we haven't optimized the system to retrieve it. By running the query in the background over some length of time and returning it by e-mail, a complicated query is possible.

Posted by judielaine at 03:20 PM | Comments (1) | TrackBack

July 26, 2003

Blogchalk

This is my new blogchalk:
United States, California, San Francisco, , English, , Judith, Female, 31-35.

Posted by judielaine at 08:37 AM | Comments (0)

July 25, 2003

Off topic -- internet archiving

I've just skimmed over the presentations from the conference, Preserving the Web (Kerkira, May 22-24, 2003). Nothing struck me as new. Still missing was any mention of archiving domain name registrations and IP block assignments. Between these two items, one can get a sense of who is behind a website -- the publisher and printer, if you will.

It's a soapbox of mine from my days at the Internet Archive. I just had to give it a kick.

Posted by judielaine at 11:25 AM | Comments (0) | TrackBack

July 24, 2003

Securing Privacy

I've been working on drafting our policy for keeping users' behavior private and secure while allowing us to learn a great deal from their clickstream. Assuming we have a moderately large number of registered users, we can go a good distance in ensuring privacy, particularly for those users who don't stand out from the crowd or for those who don't volunteer any information about themselves.

The last group is obviously fairly anonymous -- but is it because they're lazy or because they value their personal information?

The Apache log is a weakness and a strength. We'll keep a standard Apache log which will capture IP values along with a time stamp. We'll run that through a standard log analysis package, looking for initial referrers and observing the IPs in aggregate, and then destroy the logs. This is a strength, in that we can offload the IP address logging to Apache, not the application. While the log exists, though, the timestamp and IP address linkage is present, so the IP could be correlated to the clickstream logs.
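As a rough sketch of the "summarize, then destroy" step: the code below assumes the log is in Apache's Combined Log Format, keeps only a count of distinct IPs and a tally of outside referrer hosts, and then deletes the file. The function names and the own_host parameter are mine, and a real log analysis package would do far more; this only shows that the aggregate can outlive the raw IP and timestamp pairs.

```python
import os
import re
from collections import Counter
from urllib.parse import urlparse

# Combined Log Format: host ident user [time] "request" status bytes "referrer" "agent"
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "([^"]*)" "[^"]*"')


def summarize(lines, own_host=None):
    """Reduce log lines to aggregates that carry no per-request IP or time."""
    ips = set()
    referrers = Counter()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        ip, referrer = match.groups()
        ips.add(ip)
        host = urlparse(referrer).netloc
        # Count only referrers from outside our own site ("-" has no host).
        if host and host != own_host:
            referrers[host] += 1
    return {"distinct_ips": len(ips), "referrer_hosts": dict(referrers)}


def summarize_and_destroy(path, own_host=None):
    """Summarize the log at path, then delete it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        summary = summarize(handle, own_host)
    os.remove(path)
    return summary
```

Note that os.remove only unlinks the file; it makes no promise about backups or about blocks left on disk, which is part of why the window while the log exists remains the weak point.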

But those users who do stand out from the crowd -- the one registered user with a 32789 zip code (picking on Winter Park, Fl, because it's far from our pilot institutions, yet i know someone who will try the system from there), the smart aleck who gives a major of hemp basket smoking -- they've lost it. As i struggle to decide whether we should only record the first three zip digits, possibly crippling an analysis of how far someone will range for a book, i wonder to what lengths anyone else goes to truly secure my privacy. "We're not going to track you as an individual, just as part of the demographics and average use" many privacy policies read, but nothing in that means that the data isn't there were it to be requested.
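The trade-off in truncating zip codes can be checked against the registration data itself. This is a minimal sketch under my own assumptions: the threshold k, the function names, and the idea of flagging any value shared by fewer than k users are illustrative, not our policy.

```python
from collections import Counter


def coarsen_zip(zip_code, digits=3):
    """Keep only the leading digits of a zip code."""
    return zip_code[:digits]


def rare_values(values, k=5):
    """Return the values shared by fewer than k users (hypothetical threshold)."""
    counts = Counter(values)
    return {value for value, n in counts.items() if n < k}


# Hypothetical registrations: a cluster near the pilot sites and one loner.
zips = ["94041"] * 5 + ["94043"] * 3 + ["32789"]

# Full zip codes: two values identify small groups.
print(rare_values(zips))
# Three-digit prefixes: the local users blend together, the loner still stands out.
print(rare_values([coarsen_zip(z) for z in zips]))
```

In this made-up sample, truncation hides the small local group but does nothing for the single distant user, which is the uncomfortable part of the problem.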

Posted by judielaine at 06:29 PM | Comments (0) | TrackBack

July 23, 2003

Response to RedLightGreen Before the Launch

I want to record what's been said on the web about RedLightGreen before we launch. In part, it will be interesting to see if the characterization of certain features changes when people get their hands on the engine. Will the FRBR buzz go away? Where did the "like Amazon" comparison come from and will it go away? Will the Googleness remain praised? Will anyone notice the vocabulary expansion that comes about from linking Subject Authority records in the indexed terms or from Recommind's analysis of relatedness?

2003-07-20: Google displays 47 results of about 113. (Many are the Shelf Life citation repeated over and over.)

In Library Journal, Not Your Mother's Union Catalog, Roy Tennant:
- We're offered up as an example of "FRBRizing." We're FRBR informed, but it's not strict FRBR. It's a very simple algorithm in abstract -- all the records with the same author and the same uniform title are grouped together. For those records without a uniform title, we match against all the 245 titles present along with uniform titles. This will exclude some things considered part of the work, but it also pulls in related works. It also falls prey to bad cataloging practice from time to time.
- There seems to be a premise that one can't offer up a service with heterogeneous XML or data formats. While I do hope a successful RedLightGreen project will fold in OAI harvested records, I don't believe it depends on the data being in MARC or not-MARC.
- Data mining: we had to pull back from some of the automatic categorization methods. While Recommind's engine showed promise, and we spent much time experimenting, the labeling issue was problematic.
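The grouping described in the first bullet above can be sketched in a few lines. This is one simplified reading of it, not our production code: each record is keyed on its author plus its uniform title, falling back to the 245 title when no uniform title is present. The field names, the dictionary record shape, and the normalization are all assumptions.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences still match."""
    return " ".join(text.lower().split())


def cluster_records(records):
    """Group record ids by (author, uniform title or else 245 title)."""
    clusters = defaultdict(list)
    for record in records:
        # Records lacking a uniform title fall back to their 245 title, so
        # they can land in the same cluster as a matching uniform title.
        title = record.get("uniform_title") or record["title_245"]
        key = (normalize(record["author"]), normalize(title))
        clusters[key].append(record["id"])
    return dict(clusters)
```

Even this toy version shows both failure modes: a record with no uniform title and an unusual 245 title is left out of its work, and a miskeyed uniform title pulls a record into the wrong cluster.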
Some simple cheers for the Googleness of our approach:
- Raymond Yee
- Dylan Tweney (Entry 349 appears gone)
- "Citator Algorithm for Library Content" -- from Dylan Tweney's mention
Notes of our existence:

Posted by judielaine at 12:00 PM | Comments (0) | TrackBack

July 22, 2003

Testing our rankings

One of the many issues we've struggled with as we've designed this system is how to deal with the fact that we can't provide the users with the thing itself. We do have a measure of how widely cataloged items are, and we promote title-clusters that are widely held above items that may have a higher search-term relevancy but are less widely held. That widely-heldness, we believe, is an indication of how important and respected the work is, in a strong analogy to Google's link measures. In fact, since it costs money to add a book to a library collection, we expect that widely-heldness does measure authority more than a link score -- how often do libraries include books just so that they can be refuted, compared to a hyperlink to something an author intends to refute?
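To illustrate how a widely held title-cluster can outrank one with higher search-term relevancy, here is a minimal sketch. The scoring formula is purely an assumption for illustration (relevancy multiplied by the logarithm of the holdings count); it is not the formula RedLightGreen uses, and the field names are hypothetical.

```python
import math


def combined_score(relevancy, holdings):
    """Hypothetical blend: damp the holdings count so it boosts but does not swamp."""
    return relevancy * math.log1p(holdings)


def rank_clusters(clusters):
    """Sort title-clusters from highest to lowest combined score."""
    return sorted(
        clusters,
        key=lambda c: combined_score(c["relevancy"], c["holdings"]),
        reverse=True,
    )


results = rank_clusters([
    {"id": "obscure", "relevancy": 0.9, "holdings": 3},
    {"id": "widely_held", "relevancy": 0.6, "holdings": 3597},
])
print([c["id"] for c in results])
```

With these made-up numbers the widely held cluster comes first despite its lower relevancy, which is the behavior described above.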

But what if our ranking is always putting things in front of users that they can't get?

We proposed yesterday to use the actual RedLightGreen searches to identify the sample of highly ranked items that we would then test for in our pilot partner's catalog. To facilitate that test, I will log the top five results for each search: the title-cluster ID, the search-term relevancy, and the widely-heldness score.
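The logging itself could be as plain as one tab-separated line per result. This is a sketch under my own assumptions about the field names and the log layout; it assumes the results arrive already ranked.

```python
import csv
import io


def log_top_results(handle, query, results, limit=5):
    """Write one tab-separated line per top result: query, rank, ID, and both scores."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for rank, result in enumerate(results[:limit], start=1):
        writer.writerow([
            query,
            rank,
            result["cluster_id"],
            result["relevancy"],
            result["held_score"],
        ])


# Usage with an in-memory buffer standing in for the log file.
buffer = io.StringIO()
ranked = [{"cluster_id": f"c{i}", "relevancy": 0.5, "held_score": 10 + i} for i in range(7)]
log_top_results(buffer, "denmark", ranked)
print(buffer.getvalue())
```

Keeping the rank in each line means the later test can ask not only whether a title-cluster appeared but how high it appeared.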

The test program would then be a perl script with a list of title-cluster IDs as input. The program would query our database for the title of the title-cluster (and perhaps all the identifying numbers for the editions in that title-cluster), and then search the partners' catalogs for those items. Actually, it seems that that process might be best separated into two scripts: one to derive the search terms from our database, another to do the searches.
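The two-stage split might look something like the following. It is sketched in Python rather than perl, against a made-up SQLite schema (tables title_cluster and edition are hypothetical), and the catalog search is left as a function the caller supplies, since how a partner's catalog is actually queried is not something I can assume here.

```python
import sqlite3


def derive_search_terms(conn, cluster_ids):
    """Stage one: look up a title and edition identifiers for each cluster ID."""
    terms = []
    for cluster_id in cluster_ids:
        row = conn.execute(
            "SELECT title FROM title_cluster WHERE id = ?", (cluster_id,)
        ).fetchone()
        if row is None:
            continue  # unknown cluster ID, nothing to search for
        identifiers = [
            r[0]
            for r in conn.execute(
                "SELECT identifier FROM edition WHERE cluster_id = ? ORDER BY identifier",
                (cluster_id,),
            )
        ]
        terms.append({"cluster_id": cluster_id, "title": row[0], "identifiers": identifiers})
    return terms


def check_catalog(terms, search_catalog):
    """Stage two: report, per cluster, whether the supplied search finds anything."""
    return {term["cluster_id"]: bool(search_catalog(term)) for term in terms}
```

Separating the stages means the derived search terms can be saved once and replayed against each partner's catalog without touching our database again.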

By having the top five results for each search recorded in the log, I will also be able to test how often a classic title is a result. Hamlet, for example, has 1455 editions with a widely-heldness measure of 3597. In our current sample data set, if you search on Denmark, Hamlet is the second result, the first result if you want an English language item.

Posted by judielaine at 10:38 AM | Comments (0) | TrackBack

July 21, 2003

Amazon to Provide Book Text Searching

This article about Amazon negotiating to provide searches of nonfiction book text makes me clench my teeth. The undergraduates we are targeting know that the web doesn't have the authoritative information they need, but they are impatient and want their information delivered to their dorm room (suddenly the citrus orange and green logo of Kozmo* comes to mind). That audience will be thrilled with this service. While there are certainly enough students who are happy to practice scholarship, there are also the students who just count those citations. Who cares if Amazon only shows a brief snippet? It's simple to collect far more citations. It's tempting to point out this audience won't buy the books, but Amazon won't care. The students will make an order for CDs or DVDs and perhaps even a whole Target order that sufficiently covers the possible search costs Amazon might incur.

* Ah, I wish Kozmo had been on the list of sites the LC asked the Archive to specifically crawl. We would have made an effort to get behind that log-in with a covered zip code and would have saved the list of ice cream, soda, paperbacks, etc, for posterity before the wake began.

Posted by judielaine at 07:05 AM

July 20, 2003

Welcome to Tuna Breath

I've been leading the project known as RedLightGreen since I arrived at RLG in April of 2001. We started off with internal brainstorming sessions, discussing what may be of interest in the Union Catalog to a web audience who did not have access to a library that contributes records. These sessions also introduced me to my co-workers and were the beginning of my education in cataloging.

Since then we've continued on this "marmite" project -- we have bibliographic data, how can we package it to satisfy a need that doesn't undermine the current uses of the data? We recognized that we were all book-loving, data-loving folk -- how representative were our interests, anyway? We'd love to drill down deep into the data, twist projections of it around, come up with a project like mapping the publication of books against a timeline to watch the spread of the press around the globe. That might be a novelty for a large number of folks and fascinating to a few academics, but it wouldn't help us towards creating a site that would become self-sustainable past the grant period.

We've developed RedLightGreen to test the waters of self-sustainability. When we launch at the end of the summer, we will begin observing the use of the system in as much detail as possible in the hopes of improving it and coming to a better understanding of the information seeking behavior of undergraduates.

I'll be writing about the data and analysis here, as well as other design choices as I confront them in the final implementation of the project.

I hope you enjoy reading.

Posted by judielaine at 09:36 AM | Comments (0)