Authors: Atefe Pakzad, Morteza Analoui

Abstract: The present study aims to generate low-dimensional explicit distributional semantic vectors. In explicit semantic vectors, each dimension corresponds to a word, which makes the vectors interpretable. In this study, a new approach is proposed to obtain low-dimensional explicit semantic vectors. First, the proposed approach considers three criteria, namely word similarity, number of zeros, and word frequency, as features of the words in a corpus. Rules are then extracted from a decision tree built on these three features to obtain the initial basis words. Second, a binary weighting method based on the binary particle swarm optimization algorithm is proposed, which selects NB = 1000 context words. In addition, a word selection method is used to provide NS = 1000 context words. Third, the golden words of the corpus are extracted using the binary weighting method. These golden words are then added to the context words selected by the word selection method to form the golden context words. The ukWaC corpus is used to construct the word vectors, and the MEN, RG-65, and SimLex-999 test sets are used to evaluate them. The results are compared with a baseline that uses the 5K most frequent words in the corpus as context words and a fixed window to count co-occurrences. The word vectors are obtained using the 1000 selected context words together with the golden context words. Compared with the baseline method, the proposed approach increases Spearman's correlation coefficient on the MEN, RG-65, and SimLex-999 test sets by 4.66%, 14.73%, and 1.08%, respectively.
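The following is a minimal sketch, not the authors' implementation, of the overall pipeline described in the abstract: building explicit co-occurrence vectors over a chosen set of context words with a fixed window (as in the baseline setup) and scoring them against a word-similarity test set with Spearman's correlation. The corpus, the context-word list, and the test pairs are toy placeholders; in the paper the context words come from the decision-tree rules, the binary-PSO weighting, and the golden-word extraction, and evaluation uses MEN, RG-65, and SimLex-999.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy corpus standing in for ukWaC.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog played".split(),
]

# Placeholder for the selected context (basis) words; the paper uses
# about 1000 words chosen by its selection method plus the golden words.
context_words = ["cat", "dog", "sat", "mat", "rug", "played"]
ctx_index = {w: j for j, w in enumerate(context_words)}

def explicit_vectors(sentences, window=2):
    """Count co-occurrences of every word with the context words
    inside a fixed symmetric window, giving explicit (interpretable)
    vectors whose dimensions correspond to the context words."""
    counts = {}
    for sent in sentences:
        for i, w in enumerate(sent):
            row = counts.setdefault(w, np.zeros(len(context_words)))
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in ctx_index:
                    row[ctx_index[sent[j]]] += 1
    return counts

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0 or nv == 0 else float(u @ v) / (nu * nv)

vectors = explicit_vectors(corpus)

# Toy "test set": (word1, word2, human similarity rating).
test_pairs = [("cat", "dog", 7.0), ("cat", "mat", 2.0), ("dog", "rug", 1.5)]
model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in test_pairs]
human_scores = [r for _, _, r in test_pairs]

rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation on the toy test set: {rho:.3f}")
```

The key design point illustrated here is that the quality of such explicit vectors depends almost entirely on which context words form the dimensions, which is exactly the selection problem the paper addresses with its rule-based and PSO-based methods.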

Keywords: Explicit word vectors, rule-based selection method, golden context words, final basis words
