A new dictionary-based preprocessor that uses radix-190 numbering

Authors: METE ERAY ŞENERGİN, ERHAN ALİRİZA İNCE

Abstract: Several studies have shown that placing a preprocessor in front of a standard postcompressor yields higher compression gains on natural-language text files, and much subsequent research has sought to improve the gain attained by such concatenated systems. With the same goal in mind, this paper proposes a new word-based preprocessor named METEHAN190 (M190) and compares its performance with that of four other state-of-the-art preprocessors. The experiments used source files from the Wall Street Journal (WSJ) archive and from the Calgary, Canterbury, Gutenberg, and Pizza and Chili corpora. The postcompressors adopted were Prediction by Partial Matching using method D (PPMD) and the Monstrous PPM II compressor (PPMonstr). In all three experiments, WRT and M190 achieved the two highest compression gains. For small text and transcription files from the Calgary corpus, M190 outperformed all other preprocessors, including WRT. On the other hand, average encoding and decoding times show that the semistatic byte-oriented methods are much faster than the static dictionary-based methods that encode words with characters.

Keywords: Lossless text compression, preprocessing, postcompressor, dictionary, semistatic byte-oriented preprocessors, METEHAN190
