A MapReduce-based distributed SVM algorithm for binary classification

Authors: FERHAT ÖZGÜR ÇATAK, MEHMET ERDAL BALABAN

Abstract: Although the support vector machine (SVM) algorithm generalizes well to unseen examples after the training phase and achieves a small loss value, it is not well suited to large-scale real-life classification and regression problems: SVM training does not scale to datasets with hundreds of thousands of examples. In previous studies on distributed machine-learning algorithms, the SVM was trained in a costly, preconfigured computing environment. In this research, we present a MapReduce-based distributed parallel SVM training algorithm for binary classification problems. This work shows how to distribute optimization problems over cloud computing systems with the MapReduce technique. In the second step of this work, we use statistical learning theory to find the predictive hypothesis that minimizes the empirical risk over the hypothesis spaces created by the Reduce function of MapReduce. The results of this research are important for training SVM-based classifiers on big datasets. We train on the split dataset iteratively with the MapReduce technique; the accuracy of the resulting classifier function converges to that of the global optimal classifier function in a finite number of iterations. The algorithm's performance was measured on samples from the letter recognition and pen-based recognition of handwritten digits datasets.

Keywords: Support vector machine, machine learning, cloud computing, MapReduce, large-scale dataset
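
The iterative scheme summarized in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation. It assumes that each Map task trains a linear SVM on its data split together with the current global support vectors, and that the Reduce step merges the emitted support vectors and keeps the hypothesis with the lowest empirical risk. The Pegasos-style solver, the synthetic data, and all names (train_linear_svm, map_step, reduce_step, distributed_svm, make_data) are hypothetical choices made for illustration; a real deployment would run the Map tasks in parallel on a MapReduce framework.

```python
import random

def decision(w, x):
    # w holds one weight per feature plus a trailing bias term.
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

def empirical_risk(w, data):
    # Fraction of misclassified examples (0-1 loss).
    return sum(1 for x, y in data if y * decision(w, x) <= 0) / len(data)

def train_linear_svm(data, epochs=50, lam=0.01, seed=0):
    # Assumption: a Pegasos-style subgradient solver stands in for the
    # SVM optimizer; the source does not specify which solver is used.
    rng = random.Random(seed)
    w = [0.0] * (len(data[0][0]) + 1)
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):
            t += 1
            eta = 1.0 / (lam * t)
            xb = list(x) + [1.0]
            margin = y * sum(wi * xi for wi, xi in zip(w, xb))
            w = [wi * (1.0 - eta * lam) for wi in w]
            if margin < 1.0:  # hinge loss is active for this example
                w = [wi + eta * y * xi for wi, xi in zip(w, xb)]
    return w

def map_step(chunk, global_svs):
    # Train on the local split plus the global support vectors,
    # then emit the hypothesis and the examples on or inside the margin.
    train = list(dict.fromkeys(chunk + global_svs))
    w = train_linear_svm(train)
    svs = [(x, y) for x, y in train if y * decision(w, x) <= 1.0]
    return w, svs

def reduce_step(map_outputs, full_data):
    # Merge support vectors and keep the hypothesis that minimizes
    # the empirical risk on the training data.
    merged = list(dict.fromkeys(sv for _, svs in map_outputs for sv in svs))
    best_w = min((w for w, _ in map_outputs),
                 key=lambda w: empirical_risk(w, full_data))
    return best_w, merged

def distributed_svm(data, n_chunks=4, max_iter=5, tol=1e-3):
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    global_svs, prev_risk = [], None
    for _ in range(max_iter):
        outputs = [map_step(c, global_svs) for c in chunks]  # parallel in practice
        best_w, global_svs = reduce_step(outputs, data)
        risk = empirical_risk(best_w, data)
        if prev_risk is not None and abs(prev_risk - risk) <= tol:
            break  # risk has stopped changing between iterations
        prev_risk = risk
    return best_w, risk

def make_data(n=200, seed=42):
    # Synthetic, well-separated two-class data (illustrative only).
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        data.append(((rng.gauss(2 * y, 0.5), rng.gauss(2 * y, 0.5)), y))
    return data

if __name__ == "__main__":
    weights, risk = distributed_svm(make_data())
    print("weights:", weights, "empirical risk:", risk)
```

In this sketch the reducer evaluates empirical risk on the full training set, which is convenient in a single process; on a cluster that evaluation would itself likely be a distributed pass over the data.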
