Investigation of Luhn's claim on information retrieval

Authors: İLKER KOCABAŞ, BEKİR TANER DİNÇER, BAHAR KARAOĞLAN

Abstract: In this study, we show how Luhn's claim about the degree of importance of a word in a document can be related to information retrieval. His basic idea is transformed into z-scores, which serve as term weights for modeling term frequency (tf) within documents. The Luhn-based models presented in this paper are treated as the TF component of the proposed TF × IDF weighting schemes. The resulting term weighting functions are applied to the TREC-6, -7, and -8 databases. The experimental results support Luhn's claim: mean average precision (MAP) is high for terms whose frequencies lie around the mean frequency of terms within a document. On the other hand, a weighting that sharply discriminates the importance of medium frequencies from that of low and high frequencies degrades retrieval performance. Therefore, any weighting scheme (TF) that is directly proportional to tf is likely to achieve high retrieval performance, provided that it optimally reflects the differences in importance across tf values and also optimally eliminates the terms that have high frequencies.
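
To make the idea of z-score based term weights concrete, the following Python fragment is a minimal sketch and not the weighting functions evaluated in the paper, which the abstract does not specify. It computes the z-score of each term's frequency relative to the mean term frequency of its document, then multiplies a TF component by an IDF factor. The Gaussian-shaped TF component, the smoothed IDF formula, and all function names are assumptions chosen purely for illustration.

```python
import math
from collections import Counter


def z_scores(tf):
    """Z-score of each term's frequency against the document's mean term frequency."""
    freqs = list(tf.values())
    mean = sum(freqs) / len(freqs)
    std = math.sqrt(sum((f - mean) ** 2 for f in freqs) / len(freqs))
    if std == 0:
        # All terms occur equally often: every term sits exactly at the mean.
        return {term: 0.0 for term in tf}
    return {term: (f - mean) / std for term, f in tf.items()}


def tf_component(z):
    """Hypothetical TF component: largest at z = 0, smaller for low and high tf."""
    return math.exp(-0.5 * z * z)


def idf(term, docs_as_sets):
    """Smoothed inverse document frequency (an assumed, commonly used variant)."""
    df = sum(1 for doc in docs_as_sets if term in doc)
    return math.log((len(docs_as_sets) + 1) / (df + 1)) + 1


def luhn_style_weights(doc_tokens, docs_as_sets):
    """TF x IDF weights for one document, using the z-score based TF component."""
    tf = Counter(doc_tokens)
    z = z_scores(tf)
    return {t: tf_component(z[t]) * idf(t, docs_as_sets) for t in tf}


# Toy collection of two tokenized documents (made-up data).
docs = [["a", "b", "b", "c", "c", "c"], ["a", "d"]]
doc_sets = [set(d) for d in docs]
weights = luhn_style_weights(docs[0], doc_sets)
```

In this toy collection, the term `b` occurs at exactly the mean frequency of the first document and therefore receives the largest TF component. Note that the abstract reports that schemes separating medium frequencies too sharply from low and high ones degrade retrieval, so the shape of `tf_component` would need tuning in practice.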

Keywords: Luhn, information retrieval, term weighting, indexing
