Malaysian Journal of Mathematical Sciences, May 2020, Vol. 14, No. 2

Modified Statistical Approach for Data Preprocessing to Improve Heterogeneous Distance Functions

Dalatu, P. I. and Midi, H.

Corresponding Email:

Received date: 3 January 2019
Accepted date: 27 March 2020

Clustering is one of the most important techniques in data mining. Its main aim is to partition a set of data objects into clusters such that objects in the same cluster are more similar to each other than to objects in other clusters. We propose a modified statistical approach for data preprocessing that improves heterogeneous distance functions. It starts from the Heterogeneous Euclidean-Overlap Metric (HEOM) and replaces the range function, which serves as a local normalization, with the interquartile range function. The resulting approach is called the Interquartile Range-Heterogeneous Euclidean-Overlap Metric (IQR-HEOM). It is intended to overcome a weakness of HEOM: dividing attribute differences by the range allows outliers to have a large effect on the contribution of the attributes. Cohesion measures how closely related the objects in a cluster are, while silhouette measures how distinct or well separated a cluster is from other clusters. To evaluate the performance of the proposed approach, a simulation study and real-life data sets were considered. In the comparison, the proposed approach outperformed the existing methods, and it continued to show better performance when the data were contaminated.
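The difference between the two normalizations can be illustrated with a minimal sketch. It assumes the standard HEOM form (overlap distance for nominal attributes, scaled absolute difference for numeric attributes, distance 1 for missing values) and takes IQR-HEOM to be the same function with the range replaced by the interquartile range, as described above. The function names, the toy data, and the choice of quartile method are illustrative assumptions, not the authors' implementation.

```python
import math
import statistics


def attribute_scales(rows, numeric_cols, use_iqr=False):
    """Per-attribute normalizer: range (HEOM) or interquartile range (IQR-HEOM).

    Hypothetical helper; the quartile method ("inclusive") is an assumption.
    """
    scales = {}
    for col in numeric_cols:
        values = [row[col] for row in rows if row[col] is not None]
        if use_iqr:
            q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
            scale = q3 - q1
        else:
            scale = max(values) - min(values)
        # Guard against a zero scale (constant attribute) to avoid division by zero.
        scales[col] = scale if scale > 0 else 1.0
    return scales


def heom(x, y, numeric_cols, scales):
    """Heterogeneous distance between two records x and y."""
    total = 0.0
    for col, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            d = 1.0                           # missing value: maximal distance
        elif col in numeric_cols:
            d = abs(a - b) / scales[col]      # numeric: scaled absolute difference
        else:
            d = 0.0 if a == b else 1.0        # nominal: overlap distance
        total += d * d
    return math.sqrt(total)


# Toy data: one numeric attribute with an outlier (100.0) and one nominal attribute.
rows = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "b"), (100.0, "a")]
numeric_cols = {0}

range_scales = attribute_scales(rows, numeric_cols, use_iqr=False)
iqr_scales = attribute_scales(rows, numeric_cols, use_iqr=True)

d_range = heom(rows[0], rows[1], numeric_cols, range_scales)
d_iqr = heom(rows[0], rows[1], numeric_cols, iqr_scales)
print(d_range, d_iqr)
```

In this toy example the single outlier stretches the range to 99, so under range normalization the numeric difference between the first two records almost vanishes, whereas the interquartile range (2.0 here) is unaffected by the outlier. Note that with interquartile scaling a numeric attribute's contribution is no longer bounded by 1.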

Keywords: K-Means, simulation, interquartile, heterogeneous, clustering


