New metrics for clustering of identical products over imperfect data

Authors: ZEKİ YETGİN, FURKAN GÖZÜKARA

Abstract: This paper introduces the concept of product identity-clustering based on new similarity metrics and new performance metrics for web-crawled products. Product identity-clustering is defined here as the clustering of identical products, e.g., for price comparison purposes. Products blindly crawled from web sources, e.g., online marketplaces, have different description formats: the features describing the same product differ in both number and representation. This results in imperfect feature vectors, i.e., vectors that are nonuniform in length and structure, contain features of various data types (numeric, categorical), and have unknown structure. Furthermore, the product information usually contains redundant, missing, or faulty data, which are regarded as noise here. Product identity-clustering becomes challenging when the vectors' metadata are not known in advance and the feature vectors are both imperfect and noisy. In this paper, the product identity-clustering concept is introduced as a new mining metric in e-commerce. Then, novel similarity metrics are introduced to improve on the product identity-clustering performance of legacy metrics. Finally, novel performance metrics are proposed to measure the performance of identity-clustering algorithms. Using these metrics, the legacy similarity metrics (Euclidean, cosine, etc.) are compared with the proposed similarity metrics. The results show that the legacy metrics fail to discriminate identical web-crawled products, whereas the proposed metrics achieve better performance on the product identity-clustering problem.

Keywords: Product clustering, similarity metrics, identity clustering, performance metrics, web mining
