Authors: Bakshi Rohit Prasad, Sonali Agarwal
Abstract: Machine learning (ML) on Big Data has grown beyond the capacity of traditional machines and technologies, and ML for large-scale datasets is a current focus of research. Most ML algorithms suffer from memory constraints, computational complexity, and scalability issues. The least squares twin support vector machine (LSTSVM) is an extension of the support vector machine (SVM). It is much faster than SVM and is widely used for classification tasks. However, when applied to large-scale datasets with millions or billions of samples and/or a large number of classes, it faces computational and storage bottlenecks. This paper proposes a novel scalable design for LSTSVM named distributed LSTSVM (DLSTSVM). This design exploits distributed computation on a cluster of machines to provide a scalable solution to LSTSVM. Very large datasets are partitioned and distributed in the form of resilient distributed datasets on top of the Spark cluster computing engine. LSTSVM is trained to generate two nonparallel hyperplanes, which are obtained by solving two systems of linear equations, each involving data instances from both classes. In designing DLSTSVM, we employ distributed matrix operations based on the MapReduce paradigm to distribute tasks over multiple machines in the cluster. Thus, memory constraints with extremely large datasets are avoided. Experimental results show a reduction in computation time compared with existing scalable solutions for SVM and its variants. Moreover, detailed experiments demonstrate the scalability of the proposed design on large datasets.
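As a minimal illustration (not the authors' implementation), the sketch below shows one way the two LSTSVM linear systems could be assembled with distributed matrix operations on Spark: each partition contributes local Gram matrices and column sums, these are summed in a reduce step, and only small (n+1)-by-(n+1) matrices reach the driver, where the two systems are solved. The function names, parameters, and regularization constants c1 and c2 here are hypothetical and only follow the standard LSTSVM formulation.

import numpy as np
from pyspark import SparkContext

def _partition_gram(rows):
    # Each partition returns its local (X^T X, X^T e) contribution,
    # where X stacks the partition's augmented samples [x, 1].
    X = np.array(list(rows))
    if X.size == 0:
        return []
    return [(X.T @ X, X.sum(axis=0))]

def train_dlstsvm(rdd_pos, rdd_neg, c1=1.0, c2=1.0):
    # Augment every sample with a trailing 1 so the bias is part of z = [w; b].
    aug = lambda x: np.append(np.asarray(x, dtype=float), 1.0)
    add = lambda a, b: (a[0] + b[0], a[1] + b[1])
    # H = [A e] for the positive class, G = [B e] for the negative class;
    # their Gram matrices are accumulated partition by partition.
    HtH, Hte = rdd_pos.map(aug).mapPartitions(_partition_gram).reduce(add)
    GtG, Gte = rdd_neg.map(aug).mapPartitions(_partition_gram).reduce(add)
    # Standard LSTSVM solutions, solved locally on the driver:
    # z1 = -(G^T G + (1/c1) H^T H)^{-1} G^T e,  z2 = (H^T H + (1/c2) G^T G)^{-1} H^T e
    z1 = -np.linalg.solve(GtG + (1.0 / c1) * HtH, Gte)   # [w1; b1]
    z2 =  np.linalg.solve(HtH + (1.0 / c2) * GtG, Hte)   # [w2; b2]
    return z1, z2

# Example usage (hypothetical data):
# sc = SparkContext(appName="dlstsvm-sketch")
# z1, z2 = train_dlstsvm(sc.parallelize(pos_rows), sc.parallelize(neg_rows))

A new sample would then be assigned to the class whose hyperplane it lies closer to, which is the usual LSTSVM decision rule.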
Keywords: Distributed machine learning, Big Data, cluster computing, least squares twin support vector machine (LSTSVM), MapReduce, parallel processing