September 08, 2003

Crunch

The DB2 tuning week pointed out that we're just asking way too much. The way we present data amounts to doing, all at once, the three or four searches an expert might do when using Eureka as a discovery tool. Of course, an expert knows which way to narrow their search -- we have to narrow in multiple dimensions, so it's thirty or forty searches all at once.

We can normalize all the subjects to improve performance. Currently we do GROUP BYs and DISTINCTs on textually normalized subject strings (diacritics, case, and most punctuation stripped away). We'll likely assign each unique subject string an integer key and replace the subject string with the key in the index table. Keying them may seem really stupid once you know how many unique subjects exist, but manipulating long ints will be much faster than manipulating the excessively long strings we allow for subjects.
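
A minimal sketch of what that keying might look like -- my own illustration, not our actual DB2 schema or code; the normalization rules and function names here are assumptions:

```python
import unicodedata


def normalize_subject(subject: str) -> str:
    """Textually normalize a subject string: strip diacritics,
    fold case, and drop most punctuation."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", subject)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Fold case and keep only letters, digits, and spaces.
    # (Which punctuation to keep is a judgment call; this drops it all.)
    cleaned = "".join(
        c for c in stripped.casefold() if c.isalnum() or c.isspace()
    )
    return " ".join(cleaned.split())


def assign_subject_keys(subjects):
    """Map each unique normalized subject string to a small integer
    key, so index tables can store and compare ints instead of
    long strings."""
    keys: dict[str, int] = {}
    for s in subjects:
        norm = normalize_subject(s)
        if norm not in keys:
            keys[norm] = len(keys)
    return keys


if __name__ == "__main__":
    sample = ["Détective stories", "Detective Stories.", "Large print books"]
    print(assign_subject_keys(sample))
    # {'detective stories': 0, 'large print books': 1}
```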

But to build that -- how long? So we might launch without much subject analysis at all. No disambiguation. No representative subject for each work. (I am annoyed by how novels often end up with "Large Print Edition" subjects: novels rarely have subject headings of their own, so the large print edition's subject floats to the top.)
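
To show what the representative-subject step we're deferring would do, a toy sketch -- my own logic, not the planned implementation, and the form-heading exclusion list is an assumption (the real exclusions would come from cataloging rules):

```python
from collections import Counter

# Form/format headings that shouldn't win "representative subject"
# for a work; "large print books" is the offender described above.
FORM_HEADINGS = {"large print books", "large type books"}


def representative_subject(edition_subjects):
    """Pick one subject to stand for a work, given the (already
    normalized) subject lists of its editions. Prefer the most
    frequent topical heading; fall back to a form heading only
    if nothing else exists."""
    counts = Counter(s for subjects in edition_subjects for s in subjects)
    topical = {s: n for s, n in counts.items() if s not in FORM_HEADINGS}
    pool = topical or counts
    return max(pool, key=pool.get) if pool else None


if __name__ == "__main__":
    editions = [
        [],                      # trade hardcover, no subjects
        [],                      # paperback, no subjects
        ["large print books"],   # large print edition
    ]
    print(representative_subject(editions))  # 'large print books' (only choice)

    editions[0] = ["detective and mystery stories"]
    print(representative_subject(editions))  # 'detective and mystery stories'
```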

I'm not happy about dropping these features -- they seem necessary for the engine to be useful. Yet, for launch....

Today is training on the Recommind MindServer.

Posted by judielaine at September 8, 2003 07:35 AM