Title: Discovering latent topical structure by second-order similarity analysis
Authors: Cribbin, T
Issue Date: 2011
Publisher: American Society for Information Science and Technology
Citation: Journal of the American Society for Information Science and Technology, 62(6): 1188-1207, Jun 2011
Abstract: Document similarity models are typically derived from a term-document vector space representation by comparing all vector-pairs using some similarity measure. Computing similarity directly from a ‘bag of words’ model can be problematic because term independence causes the relationships between synonymous and related terms, and the contextual influences that determine the ‘sense’ of polysemous terms, to be ignored. This paper compares two methods that potentially address these problems by modelling the higher-order relationships that lie latent within the original vector space. The first is latent semantic analysis (LSA), a dimension reduction method which is a well-known means of addressing the vocabulary mismatch problem in information retrieval systems. The second is the lesser-known, yet conceptually simple, approach of second-order similarity (SOS) analysis, where similarity is measured in terms of profiles of first-order similarities as computed directly from the term-document space. Nearest neighbour tests show that SOS analysis produces similarity models that are consistently better than both first-order and LSA-derived models at resolving both coarse- and fine-level semantic clusters. SOS analysis has been criticised for its cubic complexity. A second contribution is the novel application of vector truncation to reduce the run-time by a constant factor. Speed-ups of four to ten times are found to be easily achievable without losing the structural benefits associated with SOS analysis.
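Illustrative note: the idea of second-order similarity described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's implementation. It assumes cosine as the similarity measure at both stages, and it assumes that truncation means keeping the `k` largest entries of each first-order profile; the abstract specifies neither. The function names `cosine_similarity_matrix`, `truncate_rows` and `second_order_similarity` are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    U = X / norms
    return U @ U.T


def truncate_rows(S, k):
    """Keep the k largest entries of each row and zero the rest.

    Assumption: this is one plausible reading of "vector truncation";
    the abstract does not say how the vectors are truncated.
    """
    if k >= S.shape[1]:
        return S.copy()
    T = np.zeros_like(S)
    idx = np.argpartition(S, -k, axis=1)[:, -k:]  # column indices of the top k per row
    rows = np.arange(S.shape[0])[:, None]
    T[rows, idx] = S[rows, idx]
    return T


def second_order_similarity(X, k=None):
    """Compare documents by their profiles of first-order similarities."""
    first_order = cosine_similarity_matrix(X)  # document-by-document matrix
    if k is not None:
        first_order = truncate_rows(first_order, k)
    return cosine_similarity_matrix(first_order)


# Toy document-term matrix (rows are documents, columns are terms).
# Documents 0 and 1 share no terms, but both overlap with document 2.
X = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])
S1 = cosine_similarity_matrix(X)
S2 = second_order_similarity(X)
```

In this toy example documents 0 and 1 have a first-order similarity of zero, yet their second-order similarity is positive because both are similar to document 2. The dense truncation shown here only illustrates the idea; an actual speed-up would presumably depend on storing the truncated profiles sparsely.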
Description: This is the post-print of the article. Copyright © 2011 ASIS&T
ISSN: 1532-2882
Appears in Collections: Publications
Computer Science
Dept of Computer Science Research Papers

Files in This Item:
File: Fulltext.pdf
Size: 1.16 MB
Format: Adobe PDF

Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.