Exploring bigram character features for Arabic text clustering

Authors: DIA EDDIN ABUZEINA

Abstract: The vector space model (VSM) is an algebraic model that is widely used for data representation in text mining applications. However, the VSM poses a critical challenge, as it requires a high-dimensional feature space. Therefore, many feature selection techniques, such as employing roots or stems (i.e. words without infixes and prefixes, and/or suffixes) instead of using complete word forms, are proposed to tackle this space challenge problem. Recently, the literature shows that one more basic unit feature can be used to handle the textual features, which is the twoneighboring character form that we call microword. To evaluate this feature type, we measure the accuracy of the Arabic text clustering using two feature types: the complete word form and the microword form. Hence, the microword is two consecutive characters which are also known as the Bigram character feature. In the experiment, the principal component analysis (PCA) is used to reduce the feature vector dimensions while the k-means algorithm is used for the clustering purposes. The testing set includes 250 documents of five categories. The entire corpus contains 54,472 words, whereas the vocabulary contains 13,356 unique words. The experimental results show that the complete word form score accuracy is 97.2% while the two-character form score is 96.8%. In conclusion, the accuracies are almost the same; however, the two-character form uses a smaller vocabulary as well as less PCA subspaces. The study experiments might be a significant indication of the necessity to consider the Bigram character feature in the future text processing and natural language processing applications.

Keywords: Arabic, text, clustering, features, dimensionality reduction, k-means, principal component analysis, vector space model

Full Text: PDF