Authors: BURAK AYTAN, CEMAL OKAN ŞAKAR
Abstract: Spell checking and correction are important steps in the text normalization process. These tasks are more challenging in agglutinative languages such as Turkish, since many words can be derived from a root word by appending multiple suffixes. In this study, we propose a two-step deep learning-based model for misspelled word detection in Turkish. A false positive reduction model is integrated into the system to reduce the false positive predictions caused by foreign words and abbreviations, which are commonly used on Internet sharing platforms. For this purpose, we create a multi-class dataset by developing a mobile application for labeling. We compare the effect of different tokenizers, including character-based, syllable-based, and byte-pair encoding (BPE) approaches, together with Long Short-Term Memory (LSTM) and Bi-directional LSTM (Bi-LSTM) networks. The findings show that the proposed Bi-LSTM-based model with the BPE tokenizer is superior to the benchmark methods. The results also indicate that the false positive reduction step significantly increases the precision of the base detection model in exchange for a comparatively small drop in its recall.
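As a rough illustration of the detection step described in the abstract, the sketch below builds a BPE-tokenized Bi-LSTM word classifier in Python. The tokenizer corpus file (turkish_words.txt), vocabulary size, sequence length, and layer sizes are our own assumptions for the sketch, not the authors' actual configuration.

import tensorflow as tf
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a small BPE tokenizer on a hypothetical Turkish word list.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=2000, special_tokens=["[PAD]", "[UNK]"])
tokenizer.train(["turkish_words.txt"], trainer)  # hypothetical corpus file

MAX_LEN = 12  # assumed upper bound on BPE tokens per word
PAD_ID = tokenizer.token_to_id("[PAD]")

def encode(word):
    """Map one word to a fixed-length sequence of BPE token ids."""
    ids = tokenizer.encode(word).ids[:MAX_LEN]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))

# Bi-LSTM binary classifier: 1 = misspelled, 0 = correctly spelled.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.get_vocab_size(), 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

In the paper's two-step pipeline, words flagged by a detector of this kind would then pass through the separate false positive reduction model before being reported as misspellings.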
Keywords: Text normalization, spell checker, tokenizers, long short-term memory, agglutinative languages