An index-based joint multilingual/cross-lingual text categorization using topic expansion via BabelNet

Authors: ENIAFE FESTUS AYETIRAN

Abstract: Most state-of-the-art text categorization algorithms are supervised and therefore require prior training. Besides the effort involved in developing training datasets and the need to repeat training for different texts, working with multilingual texts poses additional challenges. One of these is that the developer needs a working knowledge of the many languages involved. Term expansion, such as query expansion, has been applied in numerous applications; however, a major drawback of most of them is that the actual meaning of the terms is usually not taken into consideration. Considering the semantics of terms is necessary because most natural language words are polysemous. In this paper, as a specific contribution to the document index approach to text categorization, we present a joint multilingual/cross-lingual text categorization algorithm (JointMC) based on semantic expansion of class topic terms through an optimized knowledge-based word sense disambiguation. The lexical knowledge in BabelNet is used for both the word sense disambiguation and the expansion of the topics' terms. The categorization algorithm computes the distributed semantic similarity between the expanded class topics and the text documents in the test corpus. We evaluate the algorithm on a multilabel text categorization task using the JRC-Acquis dataset, which is based on the subject domain classification of the European Commission's EuroVoc microthesaurus. We compare the performance of the classifier with a variant of it that uses the original, unexpanded class topics. Furthermore, we compare it with two state-of-the-art supervised algorithms (one each for the multilingual and cross-lingual tasks) evaluated on the same dataset. Empirical results on five experimental languages show that categorization with expanded topics outperforms categorization with the original topics by a very wide margin. Our algorithm also outperforms the existing supervised technique that used the same dataset. Surprisingly, cross-language categorization shows similar performance and is marginally better for some of the languages.
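
As a rough illustration of the index-based idea described in the abstract, the following minimal Python sketch expands each class topic with related terms and assigns every label whose expanded topic is sufficiently similar to a document. It is not the JointMC implementation: the `expansions` dictionary is a hypothetical stand-in for terms that would be retrieved from BabelNet after word sense disambiguation (which is omitted here), plain bag-of-words cosine similarity is swapped in for the paper's semantic similarity measure, and the `threshold` value is an arbitrary assumption.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens.
    return [t for t in text.lower().split() if t.isalpha()]


def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_topic(topic_terms, expansions):
    # Append related terms for each topic term. `expansions` is a
    # hypothetical lookup table, not a call to the BabelNet API.
    expanded = list(topic_terms)
    for term in topic_terms:
        expanded.extend(expansions.get(term, []))
    return expanded


def categorize(document, topics, expansions, threshold=0.1):
    # Multilabel assignment: return every label whose expanded topic
    # reaches the (assumed) similarity threshold.
    doc_vec = Counter(tokenize(document))
    labels = []
    for label, terms in topics.items():
        topic_vec = Counter(expand_topic(terms, expansions))
        if cosine(doc_vec, topic_vec) >= threshold:
            labels.append(label)
    return labels


topics = {"agriculture": ["farming"], "finance": ["bank"]}
expansions = {
    "farming": ["crop", "harvest", "agriculture"],
    "bank": ["loan", "credit", "lender"],
}
doc = "the crop harvest was delayed by rain"

print(categorize(doc, topics, expansions))  # expanded topics
print(categorize(doc, topics, {}))          # original topics only
```

In this toy example the document shares no term with the original topic labels, so only the expanded topics produce a match. In a multilingual or cross-lingual setting, the expansion table could in principle also hold terms in other languages, which is again an assumption of this sketch rather than a detail stated in the abstract.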

Keywords: Topic expansion, topic model, distributional semantic model, word sense disambiguation
