January 15, 2004

Sorting by Date vs Limit to Recent

I was just asked if there was a way to sort the results by date:

No, there is no way to sort by date. We are about to release a function on the advanced search to limit results to works with recent publications. As this interface is not designed to support known item searching, limiting to works with recent publication seems to be the main need for undergraduates to discover the key academic resources for their topic.

One of our design choices to support "performance" is to return only the "most relevant" (to the keyword) results, up to a fixed number of editions, which collapse to a varying number of title clusters. I fear that sorting by date might create a sense that the result set was complete.

To construct an example, i'll search on "george fox speller," because i know there was a speller written by this founder of the Quaker sect that isn't exactly kept in print. I find four of our title clusters for these. Careful examination will reveal the slight differences in the titles that prevented all fourteen editions from clustering together in one title cluster.

1. Instructions For Right Spelling And Plain Directions For Reading And Writing True English With Several Delightful Things Very Useful And Necessary Both For Young And Old To Read And Learn, by George Fox
6 editions published between 1673 and 1743 in English.

2. Instructions For Rightspelling And Plain Directions For Reading And Writing True English With Several Delightful Things Very Useful And Necessary Both For Young And Old To Read And Learn, by George Fox
5 editions published between 1683 and 1702 in English.

3. Instructions For Right Spelling And Plain Directions For Reading And Writing True English With Several Other Things Very Useful And Necessary Both For Young And Old To Read And Learn, by George Fox
2 editions published in 1769 in English.

4. Instructions For The Right Spelling And Plain Directions For Reading And Writing True English With Several Delightful Things Very Useful And Necessary Both For Young And Old To Read And Lear, by George Fox
1 edition published in 1743 in English.
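
To see why small title differences keep these editions apart, here is a minimal sketch. It assumes clustering depends on an exact match of a normalized title; the `cluster_key` function and its normalization (lowercase, strip punctuation, collapse spaces) are hypothetical and are not the actual clustering rule used by the system.

```python
import re


def cluster_key(title: str) -> str:
    """Hypothetical clustering key: lowercase, drop punctuation, collapse spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return " ".join(cleaned.split())


# Shared remainder of the four titles listed above; only two words vary.
TAIL = (
    "And Plain Directions For Reading And Writing True English With Several "
    "{things} Things Very Useful And Necessary Both For Young And Old "
    "To Read And {last}"
)

titles = [
    "Instructions For Right Spelling " + TAIL.format(things="Delightful", last="Learn"),
    "Instructions For Rightspelling " + TAIL.format(things="Delightful", last="Learn"),
    "Instructions For Right Spelling " + TAIL.format(things="Other", last="Learn"),
    "Instructions For The Right Spelling " + TAIL.format(things="Delightful", last="Lear"),
]

# Each title yields a different key, so an exact-match rule makes four clusters.
keys = {cluster_key(t) for t in titles}
print(len(keys))
```

Under this assumed rule, a single missing space ("Rightspelling") or a truncated last word ("Lear") is enough to produce a separate cluster.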

Now, if i search on "George Fox," and limit to the author George Fox by selecting the Author name on the left hand side, i have a list of 144 titles and only one of the speller titles shows up. We're happy with a result like this because, assuming our use case, a non-specialist who wanted to learn more about this founder of Quakerism, the first ten results here are excellent, except for an opera by someone else named George Fox.

I can't prevent[*] anyone from inferring that they've discovered all the titles by George Fox (and, in fact, there is at least one instance of this title) but functions like sorting by date and such are best handled by Eureka. There we've made design choices which support the needs of the serious scholar.

[*] Maybe a little yellow caution icon right next to the number of results when we believe there are relevant results not yet returned. I wonder if i can talk folks into adding that. The help text could simply say, "This search was so general that we were not able to retrieve all of the results. Try your search again using the + to require a specific word or quotes to ..." Or maybe only obsessive people who go to the end of a result list need to see this -- so it's at the end of results that are too large....

Honestly, users don't seem to go much past the first page....

Posted by judielaine at 03:48 PM | Comments (0) | TrackBack

December 18, 2003

Prayer and Hilda Doolittle

Search on "Hilda Doolittle" and The Book of Common Prayer is the third result. Why? A quick search in Eureka turns up:

Author: Church of England.
Title: [Book of common prayer]
The book of common prayer and administration of the sacraments and other rites and ceremonies of the church according to the use of the Church of England : together with the Psalter of Psalms of David, pointed as they are to be sung or said in churches : and the form and manner of making, ordaining, and consecrating of bishops, priests, and deacons.

Subjects: H. D. (Hilda Doolittle), 1886-1961--Ownership.

There it is. MARC and FRBR, not designed for each other.

Posted by judielaine at 01:39 PM | Comments (0) | TrackBack

October 03, 2003

"Total" Number Of Results

As i was working on tuning the results we get from the MindServer, i became a little worried about how misleading the statement that there were only 99 results on the search for Shakespeare would be.

We were able to fix part of the problem by tuning the relevancy calculation -- we choose to assume that if you're searching you're more likely searching for "aboutness" than anything else. So we've ranked subjects and titles as more important than authorship. (We already knew that this system "looses" at known item searching -- we choose to optimize for our strength.)

The problem remained that we were never going to return more than 500 results -- and that the 500 editions would often compress down to fewer works, as more than one edition of a work is often returned as a result. A search on widgets might produce 294 resulting works, a search on whatsits might produce 459 -- but there may be many, many more things on widgets in the Union Catalog, including your uncle's dissertation on Widgets. The dissertation never appears because The Classic Text on Widgets, which went through twenty editions and is held by every major institution, overwhelmed the 500 edition results.
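
The effect can be sketched in a few lines. This is an illustration under stated assumptions, not the system's code: the function `collapse_to_works`, the `(edition_id, work_id)` pairs, and the synthetic data are all hypothetical; only the cap of 500 editions comes from the description above.

```python
from collections import OrderedDict

MAX_EDITIONS = 500  # the cap on returned editions described above


def collapse_to_works(ranked_editions, cap=MAX_EDITIONS):
    """Group the top `cap` editions by work, keeping first-appearance order.

    ranked_editions: list of (edition_id, work_id) pairs in relevance order.
    """
    works = OrderedDict()
    for edition_id, work_id in ranked_editions[:cap]:
        works.setdefault(work_id, []).append(edition_id)
    return works


# Synthetic ranking: a classic with 20 editions, 490 single-edition works,
# and a dissertation that ranks just below all of them.
ranked = [(f"classic-ed{i}", "classic") for i in range(20)]
ranked += [(f"other-ed{i}", f"other-{i}") for i in range(490)]
ranked.append(("dissertation-ed0", "dissertation"))

works = collapse_to_works(ranked)
print(len(works))               # number of works the user would see
print("dissertation" in works)  # whether the dissertation made the cut
```

In this synthetic case the 500 editions collapse to 481 works, and the dissertation falls outside the cap even though it matches the search.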

So, the user who goes through all 294 results on widgets to see how his uncle's dissertation was ranked is terribly disappointed. It's not mentioned at all. And then the user searches on widgets and his uncle's name, and there's the dissertation....

I suggested we change the word "total" to something else -- "recommended" ("recomminded") or "relevant," perhaps. The discussion went on and one of the Information Architects suggested we remove the number altogether. Good enough, i figured, and we had a reviewer lined up who would tell us if this was going to be the usability error others immediately insisted it would be.

Well, who needs a reviewer. At every institution Merrilee visited, the "where's the number of results?" question boomed. It's back, it's back -- without the word that bothered me: "total."

Now remains the question: how do people use this number? It behaves unlike other numbers, mainly because if your search term is at all common, we're retrieving the top 500 editions, identifying them by works (thus reducing the 500 by however many editions of the same work were returned), and displaying that number.

442 tea
451 tea coffee
449 +tea +coffee
457 tea or coffee
444 tea not coffee
444 tea not (coffee or cafe)
440 tea not (coffee cafe)

I doubt that we'll get questions about this weird behavior -- which is why i bring it up here.

Then again, i have been impressed by the intelligently critical feedback from users so far. I might just be surprised. (Meanwhile, i'm trying to remember what i bet my boss with respect to feedback from "users.")

Posted by judielaine at 06:35 PM | Comments (0) | TrackBack

August 04, 2003

Minoan Crete and John Donne

"A search on Minoan Crete returns as the 59th result a sermon by John Donne. Examination of the editions reveals nothing about Crete. What's strange is that the results farther down the list seem mostly relevant to the search term again."

My first thought was that some widely published sermon with, say, an imprint by "Minoan Press" or some such, was triggering this result. But no. It's a 1626 imprint. We have three "editions" in the RLG Union Catalog -- but they're likely all the same edition. And it's a total of ten holdings (mostly in microform).

Author: Donne, John, 1572-1631.
Title: [Sermon Preached To The Kings Mtie At Whitehall 24 Febr 1625]
A sermon, preached to the Kings Mtie. at Whitehall, 24. Febr. 1625 / by Iohn Donne. And now by his Maiesties commandment published.
Publisher: London : Printed for T. Jones, 1626.

It turns out that the publisher is transcribed in one of the entries as Iones.

After much poking around, the best guess i have for the linkage is that there are works in which Iones, the Protohellenic tribe, and Minoan and/or Crete are linked. I don't see it in the actual data -- the other two relevant works that come up in the search "iones minoan crete" don't seem to have the word iones associated with the record itself. It is possible that, as we use the Subject Authority file to supplement the record with additional access points, there may be an Iones lurking there.

This is one more strike against using the publisher data in the keyword search. Right now we have it ranked as very low -- we should take it out altogether. Yet, as we indexed the publication data, those terms go into the modeling and training done by the Recommind Mind Server. So perhaps the publication data will still produce noise in the results: items that lack the search term but that the model thinks are related due to the presence of some term in a publication field.

I hope oddities like these will be overwhelmed by the mass of the data -- our sample database only includes 5% of the book records. Yet seeing all the early modern English spellings in all the sermons and such, I should prepare myself for a plethora of oddities once the full system is in place.

iones minoan crete #1 of 3 -- the other two definitely have to do with ancient Greece.
minoan #73
iones minoan #1 of 83
iones crete #1 of 6
crete #267
donne crete #1
saint Pauls crete #3 of 11 (One of the titles lists the sermon as delivered at Saint Pauls)
saint crete #8 of 16
Iohn Iones #3 of 36 -- lots of sermons, nothing on Crete
Maiesties #37 of 360 -- lots of sermons and very distracting titles.
isaiah crete -- three results, no Donne
iones -- many results, Donne isn't one of them

The Donne records include:
Record ID:ILNGAEY1101B-B
Record ID:CUBGGLAD184553073-B
Record ID:PASG544243-B

Posted by judielaine at 12:25 PM | Comments (1) | TrackBack

July 22, 2003

Testing our rankings

One of the many issues we've struggled with as we've designed this system is how to deal with the fact that we can't provide the users with the thing itself. We do have a measure of how widely cataloged items are, and we promote title-clusters that are widely held above items that may have a higher search-term relevancy but are less widely held. That widely-heldness, we believe, is an indication of how important and respected the work is, in a strong analogy to Google's link measures. In fact, since it costs money to add a book to a library collection, we expect that widely-heldness does measure authority more than a link score -- how often do libraries include books just so that they can be refuted, compared to a hyperlink to something an author intends to refute?

But what if our ranking is always putting things in front of users that they can't get?

We proposed yesterday to use the actual RedLightGreen searches to identify the sample of highly ranked items that we would then test for in our pilot partner's catalog. To facilitate that test, I will log the top five results for each search: the title-cluster ID, the search-term relevancy, and the widely-heldness score.

The test program would then be a perl script with a list of title-cluster IDs as input. The program would query our database for the title of the title-cluster (and perhaps all the identifying numbers for the editions in that title-cluster), and then search the partners' catalogs for those items. Actually, it seems that that process might be best separated into two scripts: one to derive the search terms from our database, another to do the searches.
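
The two-script split might look roughly like the sketch below. It is written in Python rather than perl purely for illustration, and every name in it is an assumption: `ClusterQuery`, `derive_queries`, `check_holdings`, the dictionary standing in for our database, and the callable standing in for a partner catalog search are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ClusterQuery:
    cluster_id: str
    title: str
    edition_ids: List[str]


def derive_queries(cluster_ids: List[str], lookup: Dict[str, dict]) -> List[ClusterQuery]:
    """Stage 1: turn title-cluster IDs into search terms.

    `lookup` stands in for a query against our database.
    """
    queries = []
    for cid in cluster_ids:
        record = lookup.get(cid)
        if record is None:
            continue  # ID not in the database; skip it
        queries.append(
            ClusterQuery(cid, record["title"], list(record.get("edition_ids", [])))
        )
    return queries


def check_holdings(
    queries: List[ClusterQuery], search_catalog: Callable[[str], bool]
) -> Dict[str, bool]:
    """Stage 2: ask a partner catalog whether each title is found."""
    return {q.cluster_id: search_catalog(q.title) for q in queries}


# Made-up data standing in for the database and one partner catalog.
fake_db = {
    "tc-1": {"title": "Hamlet", "edition_ids": ["e1", "e2"]},
    "tc-2": {"title": "A Sermon", "edition_ids": ["e3"]},
}
partner_titles = {"hamlet"}

queries = derive_queries(["tc-1", "tc-2", "tc-missing"], fake_db)
results = check_holdings(queries, lambda title: title.lower() in partner_titles)
print(results)
```

Keeping the stages separate means the derived search terms can be saved once and rerun against each partner catalog without touching our database again.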

By having the top five results for each search recorded in the log, I will also be able to test how often a classic title is a result. Hamlet, for example, has 1455 editions with a widely-heldness measure of 3597. In our current sample data set, if you search on Denmark, Hamlet is the second result, and the first result if you want an English language item.

Posted by judielaine at 10:38 AM | Comments (0) | TrackBack