Diabetes is a very prevalent condition, especially in adults. A diabetic patient has a lifelong elevated blood glucose level. There are two types of diabetes: type 1 and type 2. In type 1 diabetes, the body's immune system attacks the cells that produce insulin, while in type 2 diabetes the body either does not produce enough insulin or the body's cells do not react to insulin. Type 2 diabetes is more common in the United Kingdom; about 90% of diabetes patients have type 2. Further details on diabetes, its early identification and how to live with it are available on the National Health Service (NHS) UK website (https://www.nhs.uk/conditions/diabetes/).
The objective of this project is to explore the PimaIndiansDiabetes dataset, which contains diabetes diagnoses for 768 female patients described by eight features. The dataset contains missing values. It is available on Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database) and is also included in the "mlbench" package. The mlbench package provides two versions of this dataset: PimaIndiansDiabetes, which uses the integer 0 in place of missing values, and PimaIndiansDiabetes2, which uses NA in place of missing values. For this study, PimaIndiansDiabetes2 has been used. The code included in this report installs the required packages and loads the dataset from mlbench.
This report first explores the dataset and defines all the variables. There are eight independent variables and one dependent variable. The response variable is "diabetes"; it is categorical, taking the value "pos" for presence of the disease and "neg" for absence. After exploring the dataset we perform data visualization and identify the cases with missing values. The report then discusses approaches to handling missing data and uses two methods to impute the missing values. Both imputed datasets are then split into train and test sets (two of each), each identified clearly. Machine learning classification techniques are used to develop three types of training model: KNN, SVM and GLM. Two models of each type are trained on the two different imputed datasets, giving six training models in total. These models are used to predict both test datasets, and the resulting confusion matrices and accuracies are compared and discussed.
The following code installs the required packages and loads the libraries needed for obtaining the dataset, exploring it, visualising it and developing the machine learning models.
# installing the required packages for analysis
if(!require(mlbench)) install.packages("mlbench", repos = "http://cran.us.r-project.org")
## Loading required package: mlbench
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Loading required package: tidyverse
## -- Attaching packages ------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
if(!require(knitr)) install.packages("knitr", repos = "http://cran.us.r-project.org")
## Loading required package: knitr
if(!require(mice)) install.packages("mice", repos = "http://cran.us.r-project.org")
## Loading required package: mice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
if(!require(corrplot)) install.packages("corrplot", repos = "http://cran.us.r-project.org")
## Loading required package: corrplot
## corrplot 0.84 loaded
if(!require(VIM)) install.packages("VIM", repos = "http://cran.us.r-project.org")
## Loading required package: VIM
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# loading the required libraries for analysis
library(mlbench)
library(tidyverse)
library(caret)
library(knitr)
library(mice)
library(randomForest)
library(corrplot)
library(VIM)
The following code loads the dataset PimaIndiansDiabetes2 and stores it in the object "diab".
# loading the data and saving it in object diab
data("PimaIndiansDiabetes2")
diab<-PimaIndiansDiabetes2
The dimensions of the dataset are as follows.
# checking the dimensions of dataset
dim(diab)
## [1] 768 9
There are 768 instances of 9 variables in the dataset.
The structure of the dataset, which gives information on each variable and its type, is obtained with the following code.
# checking structure of dataset
str(diab)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 NA 70 96 ...
## $ triceps : num 35 29 NA 23 35 NA 32 NA 45 NA ...
## $ insulin : num NA NA NA 94 168 NA 88 NA 543 NA ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
We can look at the first few lines of the dataset to get some intuition about it using the head() function, as below.
# to have a look at first few lines of the dataset
head(diab)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 NA 33.6 0.627 50 pos
## 2 1 85 66 29 NA 26.6 0.351 31 neg
## 3 8 183 64 NA NA 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 NA NA 25.6 0.201 30 neg
All eight independent variables are numeric, while the dependent variable "diabetes" is a factor with two levels.
The following code makes a table giving a description of each variable.
# explaining each variable by making a table of details
variable.name <- c("pregnant","glucose","pressure","triceps","insulin",
"mass","pedigree", "age","diabetes")
variable.type<-c("numeric","numeric","numeric","numeric","numeric",
"numeric","numeric","numeric","factor")
variable.description <- c("Number of times pregnant",
"Plasma glucose concentration at 2 hours in an oral glucose tolerance test",
"Diastolic blood pressure", "Triceps skin fold thickness", "2-hour serum insulin (µU/ml)",
"Body Mass Index", "History of Diabetes Mellitus in relatives & relationship with relatives",
"Age of the Patient", "Presence of Diabetes")
datadesc <- data.frame(cbind(variable.name, variable.type, variable.description))
kable(datadesc)
| variable.name | variable.type | variable.description |
|---|---|---|
| pregnant | numeric | Number of times pregnant |
| glucose | numeric | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| pressure | numeric | Diastolic blood pressure |
| triceps | numeric | Triceps skin fold thickness |
| insulin | numeric | 2-hour serum insulin (µU/ml) |
| mass | numeric | Body Mass Index |
| pedigree | numeric | History of Diabetes Mellitus in relatives & relationship with relatives |
| age | numeric | Age of the Patient |
| diabetes | factor | Presence of Diabetes |
We can find how many missing values are in each variable using the following code.
# checking whether the dataset has missing values
sapply(diab, function(x) sum(is.na(x)))
## pregnant glucose pressure triceps insulin mass pedigree age
## 0 5 35 227 374 11 0 0
## diabetes
## 0
We can see that three of the predictor variables have no missing values, while five have missing values. Insulin has the highest number of missing values (374), followed by triceps with 227. There are no NAs in the response variable.
Let us get a summary of the dataset using the following code.
# summary table of the dataset
summary(diab)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
## insulin mass pedigree age diabetes
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00 neg:500
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00 pos:268
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
Let us see how many complete cases we have using the following code.
# checking how many complete instances are in the dataset
dim(na.omit(diab))
## [1] 392 9
We can see that only 392 observations (cases) out of 768 are complete, so complete cases make up only about 51% of the dataset.
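The same figure can be obtained directly as a proportion. The following one-liner is a convenience sketch (not part of the original analysis) using complete.cases().
# proportion of complete cases (approximately 0.51)
mean(complete.cases(diab))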
We want to see the proportion of "pos" and "neg" cases in our dataset; using the following code we can get the proportion table.
# getting proportion of positive and negative diabetes cases in the dataset
prop.table(table(diab$diabetes))
##
## neg pos
## 0.6510417 0.3489583
The proportion of "neg" cases (absence of disease) is higher, at about 65%.
The following is a bar chart of the response variable showing the frequency of each class in the dataset.
# Bar chart for positive and negative cases in the dataset
diab%>%ggplot(aes(diabetes))+geom_bar()+ggtitle("Diabetes Diagnosis", subtitle = "Figure 1")
The following graph shows the distribution of positive and negative cases in relation to the number of times the patient has been pregnant.
diab%>%ggplot(aes(pregnant))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Pregnancies and Diabetes", subtitle = "Figure 2")
Figure 2 shows that neg cases tend to have a lower number of pregnancies.
The following code makes a bar chart showing the distribution of glucose levels in pos and neg cases.
diab%>%ggplot(aes(glucose))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Glucose and Diabetes", subtitle = "Figure 3")
As per Figure 3, the neg cases have lower blood glucose values.
The following code makes a bar chart showing the distribution of blood pressure values in pos and neg cases.
diab%>%ggplot(aes(pressure))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Blood Pressure and Diabetes", subtitle = "Figure 4")
We observe that high blood pressure values occur more frequently in pos cases.
The following code makes a bar chart showing skin fold thickness values in pos and neg cases.
diab%>%ggplot(aes(triceps))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Skin Thickness and Diabetes", subtitle = "Figure 5")
Triceps, which measures skin fold thickness, tends to be higher in pos cases, as per Figure 5.
The following code makes a bar chart showing insulin levels in pos and neg cases.
diab%>%ggplot(aes(insulin))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Insulin and Diabetes", subtitle = "Figure 6")
As per Figure 6, insulin levels are higher in pos cases.
The following code makes a bar chart showing BMI values in pos and neg cases.
diab%>%ggplot(aes(mass))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("BMI and Diabetes", subtitle = "Figure 7")
As per the distribution of BMI, pos cases include many patients with higher BMI.
The following code makes a bar chart showing the diabetes pedigree function values in pos and neg cases.
diab%>%ggplot(aes(pedigree))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Diabetes pedigree function", subtitle = "Figure 8")
Figure 8 shows that, like many other diseases, diabetes is linked to genetics; positive cases have higher pedigree values, indicating that their relatives have a history of diabetes.
The following code makes a bar chart showing the distribution of age in pos and neg cases.
diab%>%ggplot(aes(age))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Age and Diabetes", subtitle = "Figure 9")
As per Figure 9, younger patients tend not to have diabetes.
The following figure and table summarise the patterns of missing values so that they can be grouped and processed further.
# Exploring cases of missing values
md.pattern(diab,rotate.names = TRUE)
## pregnant pedigree age diabetes glucose mass pressure triceps insulin
## 392 1 1 1 1 1 1 1 1 1 0
## 140 1 1 1 1 1 1 1 1 0 1
## 192 1 1 1 1 1 1 1 0 0 2
## 2 1 1 1 1 1 1 0 1 0 2
## 26 1 1 1 1 1 1 0 0 0 3
## 1 1 1 1 1 1 0 1 1 1 1
## 1 1 1 1 1 1 0 1 1 0 2
## 2 1 1 1 1 1 0 1 0 0 3
## 7 1 1 1 1 1 0 0 0 0 4
## 1 1 1 1 1 0 1 1 1 1 1
## 4 1 1 1 1 0 1 1 1 0 2
## 0 0 0 0 5 11 35 227 374 652
There are 11 distinct patterns of missing values; this information is used for imputation of the missing values to complete the dataset for onward analysis.
A graphical representation of the features with missing values and the proportion of missing values in each feature is shown in the following plot, produced by this code.
# Visualizing patterns in missing values (NAs)
mice_plot <- aggr(diab, col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(diab), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## insulin 0.486979167
## triceps 0.295572917
## pressure 0.045572917
## mass 0.014322917
## glucose 0.006510417
## pregnant 0.000000000
## pedigree 0.000000000
## age 0.000000000
## diabetes 0.000000000
It is found that 48.7% of insulin values and 29.6% of triceps values are missing.
Let us make a subset of the diab dataset containing only the complete cases, so we can analyse the correlations between the features.
#complete cases of diab dataset
diab_cc<-diab[complete.cases(diab),]
The following code makes the correlation plot.
# Exploring correlation and visualization of diab_cc #
corrplot(cor(diab_cc[, -9]), type = "upper", method = "number")
We can see a moderately high positive correlation of 0.68 between pregnant and age. This is followed by a correlation of 0.66 between mass (BMI) and triceps (skin fold thickness), and another moderate positive correlation of 0.58 between glucose and insulin.
We will compare these correlation values with those of the completed datasets after imputation later in this report. If the correlations are similar in the imputed datasets, that will indicate that the distributions of the various features have not been significantly changed by the imputation process.
There are many techniques for handling missing data; the main options are described below.
Deletion of incomplete cases may be used if the dataset is very large and only a very small number of NAs are present. This is not suitable in our case.
The mean of a numerical feature can be used to replace its NAs. This is common but can change the distribution and add bias to the data.
The median of a numerical feature can be used instead of the mean when the data contain outliers that would distort the mean, but again this can change the distribution of the variable and may add bias.
The mode can be used to replace NAs in categorical features, but this too can affect the distribution of the feature.
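For illustration only, the following sketch shows how median imputation could be done with dplyr. It is not used anywhere else in this report, and the object name diab_median is hypothetical.
# sketch: replace NAs in every numeric column with that column's median (not used further)
diab_median <- diab %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))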
Using a regression model, the missing values can be predicted; this is also quite a common method. The imputed values lie on the regression line fitted using the complete cases and the available features of the incomplete cases.
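As a sketch of plain (deterministic) regression imputation, the mice package provides the "norm.predict" method; the code below is illustrative only, and the objects imp_reg and diab_imp_reg are not used elsewhere in this report.
# sketch: deterministic regression imputation with mice (imputed values fall on the regression line)
imp_reg <- mice(diab, method = "norm.predict", m = 1)
diab_imp_reg <- complete(imp_reg)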
Stochastic regression imputation also uses regression to impute the NAs, but adds a small amount of random variability to the regression predictions. This method tries to preserve the variability in the dataset. I have used this as my first imputation method to complete the dataset.
Random forest imputation is also very common. The algorithm starts with the median (for numeric features) or the most frequent value as the replacement for each NA and then refines these estimates over several iterations by growing random forests and using their proximities. I have used this as my second method to impute the data.
So, there are two imputed datasets.
The following code performs stochastic regression imputation and creates the diab_imp_sr dataset.
# regression imputation of missing values #
set.seed(100, sample.kind="Rounding")
## Warning in set.seed(100, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
imp <- mice(diab, method = "norm.nob", m = 1) # Impute data
##
## iter imp variable
## 1 1 glucose pressure triceps insulin mass
## 2 1 glucose pressure triceps insulin mass
## 3 1 glucose pressure triceps insulin mass
## 4 1 glucose pressure triceps insulin mass
## 5 1 glucose pressure triceps insulin mass
diab_imp_sr <- complete(imp) # Store data
The following is the correlation plot for the diab_imp_sr dataset (imputed using stochastic regression).
# visualizing and checking the correlation between features after imputation #
corrplot(cor(diab_imp_sr[, -9]), type = "upper", method = "number")
As we can see, the correlations between the variables have changed only very slightly, which suggests that the imputed values are in line with the values in the complete cases.
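This can also be checked numerically. The following sketch computes the largest absolute change in any pairwise correlation between the complete-cases data and the imputed data; the exact value depends on the imputation seed.
# sketch: largest absolute change in pairwise correlations after stochastic regression imputation
max(abs(cor(diab_imp_sr[, -9]) - cor(diab_cc[, -9])))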
The following code performs random forest imputation and creates the diab_imp_rf dataset.
# random forest imputation of missing values #
set.seed(101, sample.kind="Rounding")
## Warning in set.seed(101, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
diab_imp_rf<-rfImpute(diabetes ~ ., diab, iter=5, ntree=500)
## ntree OOB 1 2
## 500: 23.31% 15.00% 38.81%
## ntree OOB 1 2
## 500: 24.22% 16.20% 39.18%
## ntree OOB 1 2
## 500: 24.09% 15.60% 39.93%
## ntree OOB 1 2
## 500: 23.96% 15.80% 39.18%
## ntree OOB 1 2
## 500: 24.22% 16.20% 39.18%
The following is the correlation plot for the diab_imp_rf dataset (imputed using the random forest method).
# visualizing and checking the correlation between features after random forest imputation #
corrplot(cor(diab_imp_rf[, -1]), type = "upper", method = "number")
The correlation values are similar to those of the original complete-cases dataset, hence the imputed values appear to have been predicted sensibly.
As we have two imputed datasets, we will divide each of them into a train and a test set. train1 is the training set and test1 is the test set obtained from the diab_imp_sr dataset (imputed using stochastic regression).
The following code creates the train1 and test1 sets.
# test and train set from stochastic regression imputed dataset #
set.seed(102, sample.kind="Rounding")
## Warning in set.seed(102, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
ind<-createDataPartition(y=diab$diabetes,times=1, p=0.2, list=FALSE)
train1<-diab_imp_sr[-ind,]
test1<-diab_imp_sr[ind,]
Now we make train and test sets from the diab_imp_rf dataset (imputed using the random forest method): train2 is the training set and test2 is the test set obtained from this dataset.
The following code produces the train2 and test2 datasets.
# test and train set from random forest imputed dataset #
set.seed(103, sample.kind="Rounding")
## Warning in set.seed(103, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
ind<-createDataPartition(y=diab$diabetes,times=1, p=0.2, list=FALSE)
train2<-diab_imp_rf[-ind,]
test2<-diab_imp_rf[ind,]
Using the train1 set we will build three training models using the machine learning procedures KNN, SVM and logistic regression. We will call these models train_knn1, train_svm1 and train_glm1.
Similarly, we will use the train2 dataset to train three models with the same methods; these will be called train_knn2, train_svm2 and train_glm2.
We will use the test1 and test2 datasets to check the accuracy of all six models and evaluate their performance.
We will use 10-fold cross-validation to develop the KNN and SVM models and select their tuning parameters based on cross-validation.
The following code builds the train_knn1 model using 10-fold cross-validation.
######### MODELS TRAINING ###############
#### KNN MODEL ####
# train knn model using 10-fold cross validation for train1 dataset
set.seed(104, sample.kind="Rounding")
## Warning in set.seed(104, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
train_knn1 <- train(diabetes ~ .,
method = "knn",
data = train1,
tuneGrid = data.frame(k = seq(3, 51, 2)),
trControl = trainControl(method = "cv", number = 10, p = 0.9))
Using the code below we can find the best tuning parameter (the number of neighbors, k).
# checking the best tuning parameter for knn
ggplot(train_knn1)
train_knn1$bestTune
## k
## 17 35
Similarly, we train the train_knn2 model using the same procedure.
# train knn model using 10-fold cross validation for train2 dataset
set.seed(105, sample.kind="Rounding")
## Warning in set.seed(105, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
train_knn2 <- train(diabetes ~ .,
method = "knn",
data = train2,
tuneGrid = data.frame(k = seq(3, 51, 2)),
trControl = trainControl(method = "cv", number = 10, p = 0.9))
# checking the best tuning parameter for knn
ggplot(train_knn2)
train_knn2$bestTune
## k
## 4 9
We use 10-fold cross-validation with accuracy as the metric to train the Support Vector Machine models. Since the decision boundary over the eight features is unlikely to be linear, we use a radial kernel instead of a linear kernel.
The following code develops the train_svm1 model.
#### SVM MODEL ####
# train svm model using 10-fold cross validation using train1 dataset #
set.seed(106, sample.kind="Rounding")
## Warning in set.seed(106, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
k <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
train_svm1 <- train(diabetes~., data=train1, method="svmRadial", metric=metric, trControl=k)
The following code develops the train_svm2 model.
# train svm model using 10-fold cross validation using train2 dataset #
set.seed(107, sample.kind="Rounding")
## Warning in set.seed(107, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
k <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
train_svm2 <- train(diabetes~., data=train2, method="svmRadial", metric=metric, trControl=k)
As our response variable is dichotomous, we can use a logistic regression model. The following code develops the train_glm1 model.
#### LOGISTIC REGRESSION MODEL ####
# train logistic regression model using train1 dataset #
train_glm1<-train(diabetes~.,data=train1,method="glm")
The following code develops the train_glm2 model.
# train logistic regression model using train2 dataset #
train_glm2<-train(diabetes~.,data=train2,method="glm")
We use the test1 dataset to test the train_knn1 model; the following code performs the prediction and gives the accuracy as well as the confusion matrix.
#### predictions of models developed using train1 dataset ####
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_knn1
pred_knn1<-predict(train_knn1,test1)
mean(pred_knn1==test1$diabetes)
## [1] 0.7467532
cm1<-confusionMatrix(pred_knn1,test1$diabetes)
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 86 25
## pos 14 29
##
## Accuracy : 0.7468
## 95% CI : (0.6705, 0.8133)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.006192
##
## Kappa : 0.4166
##
## Mcnemar's Test P-Value : 0.109315
##
## Sensitivity : 0.8600
## Specificity : 0.5370
## Pos Pred Value : 0.7748
## Neg Pred Value : 0.6744
## Prevalence : 0.6494
## Detection Rate : 0.5584
## Detection Prevalence : 0.7208
## Balanced Accuracy : 0.6985
##
## 'Positive' Class : neg
##
cm1$overall[1]
## Accuracy
## 0.7467532
cm1$byClass[1:2]
## Sensitivity Specificity
## 0.860000 0.537037
We have used the train_knn1 model to predict the test1 dataset; both the training and test sets come from the data imputed by stochastic regression. As per the above results, the accuracy is 74.68%. All other parameters are also given in the confusion matrix output.
Now, we use the test2 dataset for prediction using train_knn1.
## predicting test2 (test part of dataset imputed using random forest) using train_knn1 #
pred_knn12<-predict(train_knn1,test2)
mean(pred_knn12==test2$diabetes)
## [1] 0.7597403
cm2<-confusionMatrix(pred_knn12,test2$diabetes)
cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 87 24
## pos 13 30
##
## Accuracy : 0.7597
## 95% CI : (0.6844, 0.8248)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.002122
##
## Kappa : 0.4465
##
## Mcnemar's Test P-Value : 0.100178
##
## Sensitivity : 0.8700
## Specificity : 0.5556
## Pos Pred Value : 0.7838
## Neg Pred Value : 0.6977
## Prevalence : 0.6494
## Detection Rate : 0.5649
## Detection Prevalence : 0.7208
## Balanced Accuracy : 0.7128
##
## 'Positive' Class : neg
##
cm2$overall[1]
## Accuracy
## 0.7597403
cm2$byClass[1:2]
## Sensitivity Specificity
## 0.8700000 0.5555556
In the above case, the test set is test2, which is part of the data imputed using the random forest method. The accuracy is now 75.97%. The accuracy is higher on test2 than on test1, but the difference is only about 1.3 percentage points.
The following code uses train_svm1 on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_svm1 #
pred_svm1<-predict(train_svm1,test1)
mean(pred_svm1==test1$diabetes)
## [1] 0.7077922
cm3<-confusionMatrix(pred_svm1,test1$diabetes)
cm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 80 25
## pos 20 29
##
## Accuracy : 0.7078
## 95% CI : (0.6292, 0.7782)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.07413
##
## Kappa : 0.3444
##
## Mcnemar's Test P-Value : 0.55098
##
## Sensitivity : 0.8000
## Specificity : 0.5370
## Pos Pred Value : 0.7619
## Neg Pred Value : 0.5918
## Prevalence : 0.6494
## Detection Rate : 0.5195
## Detection Prevalence : 0.6818
## Balanced Accuracy : 0.6685
##
## 'Positive' Class : neg
##
cm3$overall[1]
## Accuracy
## 0.7077922
cm3$byClass[1:2]
## Sensitivity Specificity
## 0.800000 0.537037
The accuracy is 70.78%.
The following code uses train_svm1 on the test2 dataset.
## predicting test2 (test part of dataset imputed using random forest method)
## using train_svm1
pred_svm12<-predict(train_svm1,test2)
mean(pred_svm12==test2$diabetes)
## [1] 0.7467532
cm4<-confusionMatrix(pred_svm12,test2$diabetes)
cm4
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 87 26
## pos 13 28
##
## Accuracy : 0.7468
## 95% CI : (0.6705, 0.8133)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.006192
##
## Kappa : 0.4113
##
## Mcnemar's Test P-Value : 0.054664
##
## Sensitivity : 0.8700
## Specificity : 0.5185
## Pos Pred Value : 0.7699
## Neg Pred Value : 0.6829
## Prevalence : 0.6494
## Detection Rate : 0.5649
## Detection Prevalence : 0.7338
## Balanced Accuracy : 0.6943
##
## 'Positive' Class : neg
##
cm4$overall[1]
## Accuracy
## 0.7467532
cm4$byClass[1:2]
## Sensitivity Specificity
## 0.8700000 0.5185185
The accuracy is 74.68%.
The following code uses train_glm1 on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_glm1
pred_glm1<-predict(train_glm1,test1)
mean(pred_glm1==test1$diabetes)
## [1] 0.7337662
cm5<-confusionMatrix(pred_glm1,test1$diabetes)
cm5
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 81 22
## pos 19 32
##
## Accuracy : 0.7338
## 95% CI : (0.6566, 0.8017)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.01593
##
## Kappa : 0.4078
##
## Mcnemar's Test P-Value : 0.75478
##
## Sensitivity : 0.8100
## Specificity : 0.5926
## Pos Pred Value : 0.7864
## Neg Pred Value : 0.6275
## Prevalence : 0.6494
## Detection Rate : 0.5260
## Detection Prevalence : 0.6688
## Balanced Accuracy : 0.7013
##
## 'Positive' Class : neg
##
cm5$overall[1]
## Accuracy
## 0.7337662
cm5$byClass[1:2]
## Sensitivity Specificity
## 0.8100000 0.5925926
The prediction accuracy is now 73.38%.
The following code uses train_glm1 on the test2 dataset.
## predicting test2 (test part of dataset imputed using random forest method)
## using train_glm1
pred_glm12<-predict(train_glm1,test2)
mean(pred_glm12==test2$diabetes)
## [1] 0.7467532
cm6<-confusionMatrix(pred_glm12,test2$diabetes)
cm6
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 84 23
## pos 16 31
##
## Accuracy : 0.7468
## 95% CI : (0.6705, 0.8133)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.006192
##
## Kappa : 0.4268
##
## Mcnemar's Test P-Value : 0.336668
##
## Sensitivity : 0.8400
## Specificity : 0.5741
## Pos Pred Value : 0.7850
## Neg Pred Value : 0.6596
## Prevalence : 0.6494
## Detection Rate : 0.5455
## Detection Prevalence : 0.6948
## Balanced Accuracy : 0.7070
##
## 'Positive' Class : neg
##
cm6$overall[1]
## Accuracy
## 0.7467532
cm6$byClass[1:2]
## Sensitivity Specificity
## 0.8400000 0.5740741
The test2 set accuracy is 74.68%.
We now use the models developed with the train2 dataset to predict both test sets. The following code uses the train_knn2 model on the test2 dataset.
#### predictions of models developed using train2 dataset ####
## predicting test2 (test part of dataset imputed using random forest)
## using train_knn2
pred_knn2<-predict(train_knn2,test2)
mean(pred_knn2==test2$diabetes)
## [1] 0.7337662
cm7<-confusionMatrix(pred_knn2,test2$diabetes)
cm7
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 78 19
## pos 22 35
##
## Accuracy : 0.7338
## 95% CI : (0.6566, 0.8017)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.01593
##
## Kappa : 0.4227
##
## Mcnemar's Test P-Value : 0.75478
##
## Sensitivity : 0.7800
## Specificity : 0.6481
## Pos Pred Value : 0.8041
## Neg Pred Value : 0.6140
## Prevalence : 0.6494
## Detection Rate : 0.5065
## Detection Prevalence : 0.6299
## Balanced Accuracy : 0.7141
##
## 'Positive' Class : neg
##
cm7$overall[1]
## Accuracy
## 0.7337662
cm7$byClass[1:2]
## Sensitivity Specificity
## 0.7800000 0.6481481
The accuracy is 73.38%; here both the test set and the training set are based on data imputed using the random forest method.
The following code uses the train_knn2 model on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_knn2
pred_knn21<-predict(train_knn2,test1)
mean(pred_knn21==test1$diabetes)
## [1] 0.7402597
cm8<-confusionMatrix(pred_knn21,test1$diabetes)
cm8
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 80 20
## pos 20 34
##
## Accuracy : 0.7403
## 95% CI : (0.6635, 0.8075)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.01009
##
## Kappa : 0.4296
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 0.6296
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.6296
## Prevalence : 0.6494
## Detection Rate : 0.5195
## Detection Prevalence : 0.6494
## Balanced Accuracy : 0.7148
##
## 'Positive' Class : neg
##
cm8$overall[1]
## Accuracy
## 0.7402597
cm8$byClass[1:2]
## Sensitivity Specificity
## 0.8000000 0.6296296
The accuracy of the model is 74.03% on the test1 dataset. Both test datasets are predicted with similar accuracies, which suggests that the model has not been overtrained.
The following code uses the train_svm2 model on the test2 dataset.
## predicting test2 (test part of dataset imputed using random forest)
## using train_svm2
pred_svm2<-predict(train_svm2,test2)
mean(pred_svm2==test2$diabetes)
## [1] 0.7142857
cm9<-confusionMatrix(pred_svm2,test2$diabetes)
cm9
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 83 27
## pos 17 27
##
## Accuracy : 0.7143
## 95% CI : (0.636, 0.7841)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.05265
##
## Kappa : 0.3447
##
## Mcnemar's Test P-Value : 0.17484
##
## Sensitivity : 0.8300
## Specificity : 0.5000
## Pos Pred Value : 0.7545
## Neg Pred Value : 0.6136
## Prevalence : 0.6494
## Detection Rate : 0.5390
## Detection Prevalence : 0.7143
## Balanced Accuracy : 0.6650
##
## 'Positive' Class : neg
##
cm9$overall[1]
## Accuracy
## 0.7142857
cm9$byClass[1:2]
## Sensitivity Specificity
## 0.83 0.50
The prediction accuracy is 71.43%.
The following code uses the train_svm2 model on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_svm2
pred_svm21<-predict(train_svm2,test1)
mean(pred_svm21==test1$diabetes)
## [1] 0.7532468
cm10<-confusionMatrix(pred_svm21,test1$diabetes)
cm10
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 83 21
## pos 17 33
##
## Accuracy : 0.7532
## 95% CI : (0.6774, 0.8191)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.003683
##
## Kappa : 0.4488
##
## Mcnemar's Test P-Value : 0.626496
##
## Sensitivity : 0.8300
## Specificity : 0.6111
## Pos Pred Value : 0.7981
## Neg Pred Value : 0.6600
## Prevalence : 0.6494
## Detection Rate : 0.5390
## Detection Prevalence : 0.6753
## Balanced Accuracy : 0.7206
##
## 'Positive' Class : neg
##
cm10$overall[1]
## Accuracy
## 0.7532468
cm10$byClass[1:2]
## Sensitivity Specificity
## 0.8300000 0.6111111
The prediction accuracy is 75.32% on the test1 dataset.
The following code uses the train_glm2 model on the test2 dataset.
## predicting test2 (test part of dataset imputed using random forest)
## using train_glm2
pred_glm2<-predict(train_glm2,test2)
mean(pred_glm2==test2$diabetes)
## [1] 0.7467532
cm11<-confusionMatrix(pred_glm2,test2$diabetes)
cm11
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 83 22
## pos 17 32
##
## Accuracy : 0.7468
## 95% CI : (0.6705, 0.8133)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.006192
##
## Kappa : 0.4318
##
## Mcnemar's Test P-Value : 0.521839
##
## Sensitivity : 0.8300
## Specificity : 0.5926
## Pos Pred Value : 0.7905
## Neg Pred Value : 0.6531
## Prevalence : 0.6494
## Detection Rate : 0.5390
## Detection Prevalence : 0.6818
## Balanced Accuracy : 0.7113
##
## 'Positive' Class : neg
##
cm11$overall[1]
## Accuracy
## 0.7467532
cm11$byClass[1:2]
## Sensitivity Specificity
## 0.8300000 0.5925926
The prediction accuracy is 74.68%.
The following code uses the train_glm2 model on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_glm2
pred_glm21<-predict(train_glm2,test1)
mean(pred_glm21==test1$diabetes)
## [1] 0.7597403
cm12<-confusionMatrix(pred_glm21,test1$diabetes)
cm12
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 82 19
## pos 18 35
##
## Accuracy : 0.7597
## 95% CI : (0.6844, 0.8248)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.002122
##
## Kappa : 0.4702
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.8200
## Specificity : 0.6481
## Pos Pred Value : 0.8119
## Neg Pred Value : 0.6604
## Prevalence : 0.6494
## Detection Rate : 0.5325
## Detection Prevalence : 0.6558
## Balanced Accuracy : 0.7341
##
## 'Positive' Class : neg
##
cm12$overall[1]
## Accuracy
## 0.7597403
cm12$byClass[1:2]
## Sensitivity Specificity
## 0.8200000 0.6481481
The prediction accuracy is 75.97%.
The following table summarises all the models, their training datasets and imputation methods, and their accuracies on both test sets.
# making a dataframe of all the results and presenting them in a nice table
results<-data.frame(Model=c("KNN1","KNN2","SVM1","SVM2","GLM1","GLM2"),
Training_Data = c("train1",
"train2","train1","train2","train1","train2"),
NA_imputation_method =c("Stochastic Regression",
"Random Forest", "Stochastic Regression", "Random Forest",
"Stochastic Regression","Random Forest"),
Accuracy_test1=c(mean(pred_knn1==test1$diabetes),
mean(pred_knn21==test1$diabetes),mean(pred_svm1==test1$diabetes),
mean(pred_svm21==test1$diabetes),mean(pred_glm1==test1$diabetes),
mean(pred_glm21==test1$diabetes)),
Accuracy_test2=c(mean(pred_knn12==test2$diabetes),
mean(pred_knn2==test2$diabetes),
mean(pred_svm12==test2$diabetes),
mean(pred_svm2==test2$diabetes),
mean(pred_glm12==test2$diabetes),
mean(pred_glm2==test2$diabetes)))
results%>%knitr::kable()
| Model | Training_Data | NA_imputation_method | Accuracy_test1 | Accuracy_test2 |
|---|---|---|---|---|
| KNN1 | train1 | Stochastic Regression | 0.7467532 | 0.7597403 |
| KNN2 | train2 | Random Forest | 0.7402597 | 0.7337662 |
| SVM1 | train1 | Stochastic Regression | 0.7077922 | 0.7467532 |
| SVM2 | train2 | Random Forest | 0.7532468 | 0.7142857 |
| GLM1 | train1 | Stochastic Regression | 0.7337662 | 0.7467532 |
| GLM2 | train2 | Random Forest | 0.7597403 | 0.7467532 |
From the accuracy values, we can see that the logistic regression and KNN models have performed better than the SVM models on this dataset. It is also noted that some accuracy values are the same for two different models: for example, the KNN1 model on the test2 dataset and the GLM2 model on the test1 dataset both give an accuracy of 75.97%, which is the highest among all the cases. A closer analysis of these two models shows that, although they have the same accuracy, their confusion matrices are different: for KNN1/test2 the sensitivity is 87% and the specificity is 55.56%, while for GLM2/test1 the sensitivity is 82% and the specificity is 64.81%. An important point here is that it may not be a good decision to select a model based on accuracy alone; it is important to analyse all the information given in the confusion matrix as well as the statistical and practical significance of the model parameters. In the current analysis, if we compare just these two high-accuracy models, we need to decide which is more important: identifying true positives or predicting negatives correctly. In our case, the positive class in all the confusion matrices is "neg", i.e. absence of diabetes, so a high sensitivity means we are able to correctly identify a person who is unlikely to have diabetes.
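To make this comparison explicit, the following short sketch pulls the sensitivity and specificity of these two models out of the confusion matrix objects created above (cm2 and cm12).
# sketch: comparing the two highest-accuracy models beyond accuracy alone
data.frame(Model = c("KNN1 / test2", "GLM2 / test1"),
  Sensitivity = c(cm2$byClass["Sensitivity"], cm12$byClass["Sensitivity"]),
  Specificity = c(cm2$byClass["Specificity"], cm12$byClass["Specificity"]))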
A close look at all the confusion matrices indicates that in every model the sensitivity is high, usually around 80% or more, but the specificity is low. All of these models have one thing in common: they have high positive predictive power (the positive class here being "neg", they correctly identify patients without diabetes). These models are less suitable for identifying "pos" cases, as their negative predictive power is low. Further machine learning methods such as random forests or neural networks could be applied to develop a model with greater power to identify the presence of diabetes; a sketch of one such model is given below.
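As an illustration of one such extension (not one of the six models evaluated above), the sketch below trains a random forest classifier on train1 with caret and scores it on test1; the seed value and the default tuning grid are arbitrary choices, and its results are not reported here.
# sketch only: a random forest classifier trained on train1 and evaluated on test1
set.seed(108, sample.kind = "Rounding")
train_rf1 <- train(diabetes ~ ., data = train1, method = "rf",
  trControl = trainControl(method = "cv", number = 10))
pred_rf1 <- predict(train_rf1, test1)
confusionMatrix(pred_rf1, test1$diabetes)$overall["Accuracy"]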
It is also noted that, during data collection and compilation, special effort is needed to ensure the data are free from missing values or errors. Missing values may be handled using the techniques discussed in this report or using more advanced procedures such as the hybrid classifier LANFIS, which combines logistic regression with an adaptive network-based fuzzy inference system (Ramezani et al., 2017) and achieved an accuracy of 88.07% on the same dataset.
In this project, the PimaIndiansDiabetes dataset was used to develop classification models. The dataset contains missing values; only 51% of the instances are complete cases. Ignoring the incomplete cases and using only the remaining data is not a suitable option, so the missing data were imputed using stochastic regression and a random forest model. The two imputed datasets were used to train models based on KNN, SVM and logistic regression; the KNN and SVM models were trained using 10-fold cross-validation with accuracy as the metric. The trained models were tested on the test portions of both imputed datasets. These models showed accuracies ranging roughly from 70% to 76%, with high power to identify cases where the disease is absent. The best accuracy, 75.97%, was obtained by the KNN and GLM models. Of the six models, GLM appears comparatively better, as it matches the best KNN accuracy while showing more balanced sensitivity and specificity.