In:
Journal of Information Science, SAGE Publications
Abstract:
The original K-nearest neighbour (KNN) algorithm was meant to classify homogeneous complete data, that is, data with only numerical features whose values exist completely. Thus, it faces problems when used with heterogeneous incomplete (HI) data, which also has categorical features and is plagued with missing values. Many solutions have been proposed over the years but most have pitfalls. For example, some solve heterogeneity by converting categorical features into numerical ones, inflicting structural damage. Others solve incompleteness by imputation or elimination, causing semantic disturbance. Almost all use the same K for all query objects, leading to misclassification. In the present work, we introduce KNN_HI, a KNN-based algorithm for HI data classification that avoids all these pitfalls. Leveraging rough set theory, KNN_HI preserves both categorical and numerical features, leaves missing values untouched and uses a different K for each query. The end result is an accurate classifier, as demonstrated by extensive experimentation on nine datasets mostly from the University of California Irvine repository, using a 10-fold cross-validation technique. We show that KNN_HI outperforms six recently published KNN-based algorithms in terms of precision, recall, accuracy and F-Score. In addition to its function as a mighty classifier, KNN_HI can also serve as a K calculator, helping KNN-based algorithms that use a single K value for all queries to find the best such value. Sure enough, we show how four such algorithms improve their performance using the K obtained by KNN_HI. Finally, KNN_HI exhibits impressive resilience to the degree of incompleteness, degree of heterogeneity and the metric used to measure distance.
Type of Medium:
Online Resource
ISSN:
0165-5515, 1741-6485
DOI:
10.1177/01655515211069539
Language:
English
Publisher:
SAGE Publications
Publication Date:
2022
ZDB-ID:
439125-1, 2025062-9
SSG:
24,1