JCDL 2009: Tuesday paper session 2
I entered Paper Session 2 a little late. This session interested me because of my use of Recommind’s Probabilistic Latent Semantic Analysis (PLSA) classifier engine when working on RedLightGreen. I seem to have missed some shift in the area, with the first paper discussing the Random Forest method. The second paper compared their work to Hierarchical Agglomerative Clustering (HAC), K-way spectral clustering (vector), and Support Vector Machine (SVM). The fourth discussed latent Dirichlet allocation.
The most interesting paper, it seemed, was Whetting the Appetite of Scientists: Producing Summaries Tailored to the Citation Context, probably because everyone wants the tool on their desks right now.
Paper Session 2
Disambiguating Authors in Academic Publications using Random Forests.
Pucktada Treeratpituk and C. Lee Giles [Presented by Treeratpituk, i believe]
From the abstract, i gather the point is to disambiguate between authors with similar names. The target data set is MEDLINE, but the main gist seems to be to compare the SVM method to the Random Forest method. From the abstract, “Our experiments on the Medline database show that the random forest model outperforms other previously proposed techniques such as those using support-vector machines (SVM).” Random forest is presented as more scalable and more accurate than SVM, and more accurate than other methods.
Using Web Information for Author Name Disambiguation
Denilson Alves Pereira, Berthier Ribeiro-Neto, Nivio Ziviani, Alberto H. F. Laender, Marcos André Gonçalves and Anderson A. Ferreira [Presented by Pereira.]
This addresses a similar issue as the first: “the existence of multiple authors with the same name (polysemes) or different name variations for the same author (synonyms).” The novelty of this research seems to be the additional analysis of web search results, finding authors’ home pages, to disambiguate the polysemes and correlate the synonyms, referred to (regrettably) as WAD – Web Author Disambiguation. “Our results indicate large gains in the quality of disambiguation when compared to unsupervised methods, and it is statistically tied with supervised learning methods which requires human labeling and expensive training time.”
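To make the polyseme/synonym distinction concrete, here is a minimal sketch of why name strings alone are not enough. It is not the WAD method or the random forest model from either paper: the `blocking_key` normalisation, the names, and the affiliations are all invented for illustration. It only shows the first, naive step of grouping records whose names reduce to the same "last name plus first initial" key.

```python
from collections import defaultdict


def blocking_key(name):
    """Reduce 'First Middle Last' to 'last f' (a hypothetical normalisation)."""
    parts = name.replace(".", " ").lower().split()
    return f"{parts[-1]} {parts[0][0]}"


def group_by_key(records):
    """Group (name, affiliation) records that share a blocking key."""
    blocks = defaultdict(list)
    for name, affiliation in records:
        blocks[blocking_key(name)].append((name, affiliation))
    return dict(blocks)


# Invented records, not taken from MEDLINE or any real data set.
records = [
    ("Jane A. Smith", "University A"),
    ("J. Smith", "University A"),
    ("John Smith", "University B"),
    ("Wei Chen", "University C"),
]

blocks = group_by_key(records)
for key, members in blocks.items():
    print(key, "->", [name for name, _ in members])
```

The "smith j" block ends up holding both a likely synonym pair (Jane A. Smith and J. Smith) and a likely polyseme (John Smith). Splitting a block like that correctly is the part that needs the extra evidence these papers are after, such as home pages found by web search.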
Whetting the Appetite of Scientists: Producing Summaries Tailored to the Citation Context
Stephen Wan, Cecile Paris and Robert Dale [Paris presenting]
This is an interesting issue: given a paper with a claim followed by a citation, a reader may wish to review the cited article. One can download those papers, but the context in the original paper may have been lost. This system, Citation-Sensitive In-Browser Summariser, will preview the cited document with a summary tailored to the point of citation. From a survey of researchers, the desired summary should contain catalog data, a general overview of the work, and *specific data related to the citation context*. The first two may be general metadata, available as the document is produced, but the last is specific to the relationship between the cited article and an article produced after it.
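The idea of a summary tailored to the citation context can be sketched very roughly as follows. This is not the authors' system: it is a toy word-overlap ranking, with an invented stopword list, invented sample text, and hypothetical function names, that picks the sentences of the cited document sharing the most words with the sentence that cites it.

```python
import re

# A tiny, invented stopword list; a real system would use something richer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "that", "on", "with"}


def tokens(text):
    """Lowercased set of alphabetic words, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def split_sentences(text):
    """Naive sentence split on whitespace that follows ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def contextual_summary(citation_context, cited_text, k=2):
    """Return up to k sentences of cited_text that overlap most with the context."""
    context = tokens(citation_context)
    sentences = split_sentences(cited_text)
    scored = [(len(tokens(s) & context), i) for i, s in enumerate(sentences)]
    # Highest overlap first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    # Drop sentences with no overlap and restore document order.
    keep = sorted(i for score, i in top if score > 0)
    return [sentences[i] for i in keep]


# Invented example text, standing in for a cited paper and a citing sentence.
cited_text = (
    "Tea is grown in many regions. "
    "Green tea leaves are steamed soon after picking. "
    "Black tea leaves are fully oxidised before drying. "
    "Prices vary by season."
)
citation_context = "Unlike black tea, green tea leaves are steamed rather than oxidised [4]."

summary = contextual_summary(citation_context, cited_text, k=2)
print(summary)
```

With this sample the two middle sentences are selected, since they share the most words with the citing sentence. Even this toy version depends on clean sentence boundaries in the cited text.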
I *want* this, not just for academic articles, but for the informal citations that web links support. It seems, from the questions, that there’s much interest. The tool is being incorporated into Elsevier’s (boo, hiss) Science Direct. Success has been helped by having access to the publisher’s XML. She hopes, though, that it can be made open. She admits to Lesk that the citation-context summaries are very simplistic: it’s a choice made for speed. She answers Christine Borgman’s assertion about the value of such a tool being open with a sad allusion to the lawyers and her hope to open it. There is a technical issue
It also reminded me of the Mac service “summary.” I opened the PDF of the speaker’s paper in Preview and used the summarize service: this was disappointing, as the end-of-paragraph parsing of this paper seemed off. I remembered a PDF reader called Skim: i’ve never really understood how that was different from the earlier version of Preview. It doesn’t support summarize any better.
Finding Topic Trends in Digital Libraries
Levent Bolelli, Seyda Ertekin, Ding Zhou and C. Lee Giles [Presented on behalf of the authors] Google & Facebook folks researching along with the Penn State folks.
He cites CiteSeer, possibly more recent than Kurt Bollacker’s work, and PLSA, so i begin to feel less out of date, but then he dives in and i’m back to feeling like i’ve lots to review.
This research reminds me of the second paper in that, instead of limiting the analysis to a single unit (the academic research paper itself), they include the evolving accumulation of additional data. The second paper used the live authors’ home pages (i wonder about archiving those reference documents, as webpages experience varieties of bit rot); this paper adds the terms from “CiteULike” tags, keywords in queries, and terms used in citations back to those papers.
The importance here is elevated by Christine Borgman’s talk this morning where she showed how the term “digital library” has become far less used, but the concepts embodied by that term continue in active work.
CEBBIP: A Parser of Bibliographic Information in Chinese Electronic Books
Liangcai Gao, Zhi Tang and Xiaofan Lin [Gao presenting]
apabi.com.cn is the biggest provider of Chinese books, with a need to extract the bibliographic data from the PDF files. Manually, it would take too long, so this paper automates the process. I remember seeing similar papers addressing the general difficulties of extracting structural metadata from PDF documents in the Houston JCDL; a question to Cecile Paris about how her team was able to extract section context revealed just how challenging the problem continues to be. (She got the XML, lucky her.)