The research presented in this thesis aims to address the issue of undiagnosed diabetes. Current evidence indicates that one in seventy people in the United Kingdom is living with undiagnosed diabetes, and that only one in a hundred people can identify the main signs of the condition. Some of the tools available for predicting diabetes are too simplistic, rely on superficial data for inference, or both. On the positive side, the National Health Service (NHS) is improving data recording in this domain by offering health checks to adults aged 40 to 70. Data from such a programme could be used to mitigate the issue of superficial data, and also to help develop a predictive tool that facilitates a change from the current reactive model of care to a proactive one. This thesis presents a tool based on a machine learning ensemble for predicting diabetes onset. Ensembles often perform better than a single classifier, and accuracy and diversity have been highlighted as the two vital requirements for constructing good ensemble classifiers. The experiments in this thesis explore the relationship between the diversity obtained from heterogeneous ensemble classifiers and the accuracy of predictions obtained through feature subset selection, in order to predict diabetes onset. Data from a national health check programme (similar to the NHS Health Check) was used. The aim is to predict diabetes onset better than similar studies in the literature. For the experiments, predictions from five base classifiers performing the same task (Sequential Minimal Optimisation (SMO), Radial Basis Function (RBF), Naïve Bayes (NB), Repeated Incremental Pruning to Produce Error Reduction (RIPPER) and the C4.5 decision tree) are combined in all possible combinations of two or more classifiers to construct 26 ensemble models. The training data feature space was searched to select the best feature subset for each classifier.
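The figure of 26 ensembles follows from counting every subset of the five base classifiers that contains at least two members (10 pairs, 10 triples, 5 quadruples and 1 quintuple). The short sketch below illustrates that count only; the function name `enumerate_ensembles` and the `min_size` parameter are hypothetical and are not taken from the thesis.

```python
from itertools import combinations

# Base classifier names as listed in the abstract.
BASE_CLASSIFIERS = ["SMO", "RBF", "NB", "RIPPER", "C4.5"]


def enumerate_ensembles(names, min_size=2):
    """Return every subset of `names` with at least `min_size` members.

    Hypothetical helper for illustration; a single classifier is not
    counted as an ensemble, hence the default minimum size of 2.
    """
    return [
        combo
        for size in range(min_size, len(names) + 1)
        for combo in combinations(names, size)
    ]


ensembles = enumerate_ensembles(BASE_CLASSIFIERS)
print(len(ensembles))  # 26
```

Each tuple in `ensembles` names one candidate ensemble, for example `("NB", "RIPPER", "C4.5")`.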
The selected subsets are used to train the classifiers, and their predictions are combined using the k-Nearest Neighbours algorithm as a meta-classifier. Results are analysed using four performance metrics (accuracy, sensitivity, specificity and AUC) to determine (i) whether ensembles always perform better than a single classifier; and (ii) the impact of diversity (from heterogeneous classifiers) and accuracy (through feature subset selection) on ensemble performance. At the base classification level, RBF produced better results than the other four classifiers, with 78% accuracy, 82% sensitivity, 73% specificity and 85% AUC. A comparative study shows that the RBF model is more accurate than 9 ensembles, more sensitive than 13 ensembles, more specific than 9 ensembles, and produced a better AUC than 25 ensembles. This means that ensembles do not always perform better than their constituent classifiers. Of the ensembles that performed better than RBF, the combination of C4.5, RIPPER and NB produced the best results, with 83% accuracy, 87% sensitivity, 79% specificity and 86% AUC. Compared with the RBF model, this represents a 5.37% improvement in accuracy, which is statistically significant (p = 0.0332). The experiments show how data from medical health examinations can be used to address the issue of undiagnosed diabetes. Models constructed with such data would facilitate the much-desired shift from reactive to proactive care for individuals at high risk of diabetes. From the machine learning viewpoint, it was established that ensembles constructed from diverse and accurate base learners have the potential to produce a significant improvement in accuracy compared with their individual constituent classifiers. In addition, the ensemble presented in this thesis is at least 1% and at most 23% more accurate than similar studies found in the literature. This validates the superiority of the method implemented.
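To make the combination step concrete, the sketch below shows one way a k-Nearest Neighbours meta-classifier could combine base-classifier predictions, followed by the accuracy, sensitivity and specificity calculations. It is a minimal illustration and not the thesis's implementation: the function names, the Hamming distance, the choice of k = 3 and all data values are assumptions made up for the example, and AUC is omitted because it requires ranked scores rather than hard labels.

```python
from collections import Counter


def knn_meta_predict(meta_train, labels, query, k=3):
    """Majority vote among the k training rows closest to `query`.

    Each row holds the 0/1 predictions of the base classifiers for one
    person. Distance is the number of base classifiers that disagree
    (Hamming distance, an assumption); ties keep training order.
    """
    ranked = sorted(
        range(len(meta_train)),
        key=lambda i: sum(a != b for a, b in zip(meta_train[i], query)),
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def evaluate(y_true, y_pred):
    """Accuracy, sensitivity and specificity from hard 0/1 labels.

    Assumes both classes occur in `y_true`, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Illustrative, made-up meta-level data: one column per base classifier.
meta_train = [(1, 1, 1), (1, 1, 0), (0, 1, 1), (0, 0, 0), (0, 0, 1), (1, 0, 0)]
train_labels = [1, 1, 1, 0, 0, 0]

meta_test = [(1, 1, 1), (0, 0, 0), (1, 0, 1), (0, 1, 0)]
y_true = [1, 0, 1, 1]

y_pred = [knn_meta_predict(meta_train, train_labels, row) for row in meta_test]
metrics = evaluate(y_true, y_pred)
print(y_pred, metrics)
```

Here the meta-classifier never sees the original health check features; it learns only from how the base classifiers vote, which is why diverse base learners give it more to work with.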
Published - 26 Mar 2018