Diabetes is a very prevalent condition, especially in adults. A diabetic patient has a lifelong elevated blood glucose level. There are two types of diabetes: type 1 and type 2. In type 1 diabetes, the body's immune system attacks the cells that produce insulin, while in type 2 diabetes the body either does not produce enough insulin or the body's cells do not react to insulin. Type 2 diabetes is more common in the United Kingdom; about 90% of diabetes patients have type 2. Further details on diabetes, its early identification and how to live with it are available on the National Health Service (NHS) UK website (https://www.nhs.uk/conditions/diabetes/).
The objective of this project is to explore the PimaIndiansDiabetes dataset, which contains diabetes diagnoses for 768 female patients described by eight features. The dataset contains missing values. It is available on Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database) and is also included in the "mlbench" package. The mlbench package provides two versions of this dataset: PimaIndiansDiabetes, which uses the integer 0 in place of missing values, and PimaIndiansDiabetes2, which uses NA in place of missing values. For this study, PimaIndiansDiabetes2 has been used. The code included in this report installs the required packages and loads the dataset from mlbench.
This report first explores the dataset and defines all the variables. There are eight independent variables and one dependent variable. The response variable is "diabetes"; it is categorical, taking the value "pos" for presence of the disease and "neg" for absence. After exploring the dataset we perform data visualization and identify the cases with missing values. The report then discusses approaches to handling missing data and uses two methods to impute the missing values. Both imputed datasets are then split into train and test sets (two of each), each identified clearly. Machine learning classification techniques are used to develop three types of training model: KNN, SVM and GLM. Two models of each type are trained on the two different imputed datasets, giving six training models in total. These models are used to predict both test datasets, and the resulting confusion matrices and accuracies are compared and discussed.
The following code installs the required packages and loads the libraries needed for obtaining the dataset, exploring it, visualising it and developing the machine learning models.
# installing the required packages for analysis
if(!require(mlbench)) install.packages("mlbench", repos = "http://cran.us.r-project.org")
## Loading required package: mlbench
if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Loading required package: tidyverse
## -- Attaching packages ------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
if(!require(knitr)) install.packages("knitr", repos = "http://cran.us.r-project.org")
## Loading required package: knitr
if(!require(mice)) install.packages("mice", repos = "http://cran.us.r-project.org")
## Loading required package: mice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
if(!require(corrplot)) install.packages("corrplot", repos = "http://cran.us.r-project.org")
## Loading required package: corrplot
## corrplot 0.84 loaded
if(!require(VIM)) install.packages("VIM", repos = "http://cran.us.r-project.org")
## Loading required package: VIM
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
# loading the required libraries for analysis
library(mlbench)
library(tidyverse)
library(caret)
library(knitr)
library(mice)
library(randomForest)
library(corrplot)
library(VIM)
The following code loads the dataset PimaIndiansDiabetes2 and stores it in the object "diab".
# loading the data and saving it in object diab
data("PimaIndiansDiabetes2")
diab<-PimaIndiansDiabetes2
The dimensions of the dataset are as follows.
# checking the dimensions of dataset
dim(diab)
## [1] 768 9
There are 768 instances of 9 variables in the dataset.
The structure of the dataset, which gives information on each variable and its type, is obtained with the following code.
# checking structure of dataset
str(diab)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 NA 70 96 ...
## $ triceps : num 35 29 NA 23 35 NA 32 NA 45 NA ...
## $ insulin : num NA NA NA 94 168 NA 88 NA 543 NA ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
We can look at the first few lines of the dataset to get some intuition about it using the head() function, as below.
# to have a look at first few lines of the dataset
head(diab)
## pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1 6 148 72 35 NA 33.6 0.627 50 pos
## 2 1 85 66 29 NA 26.6 0.351 31 neg
## 3 8 183 64 NA NA 23.3 0.672 32 pos
## 4 1 89 66 23 94 28.1 0.167 21 neg
## 5 0 137 40 35 168 43.1 2.288 33 pos
## 6 5 116 74 NA NA 25.6 0.201 30 neg
All eight independent variables are numeric, while the dependent variable "diabetes" is a factor with two levels.
The following code makes a table giving a description of each variable.
# explaining each variable by making a table of details
variable.name <- c("pregnant","glucose","pressure","triceps","insulin",
"mass","pedigree", "age","diabetes")
variable.type<-c("numeric","numeric","numeric","numeric","numeric",
"numeric","numeric","numeric","factor")
variable.description <- c("Number of times pregnant",
"Plasma glucose concentration at 2 hours in an oral glucose tolerance test",
"Diastolic blood pressure", "Triceps skin fold thickness", "2-hour serum insulin (µU/ml)",
"Body Mass Index", "History of Diabetes Mellitus in relatives & relationship with relatives",
"Age of the Patient", "Presence of Diabetes")
datadesc <- data.frame(cbind(variable.name, variable.type, variable.description))
kable(datadesc)
| variable.name | variable.type | variable.description |
|---|---|---|
| pregnant | numeric | Number of times pregnant |
| glucose | numeric | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| pressure | numeric | Diastolic blood pressure |
| triceps | numeric | Triceps skin fold thickness |
| insulin | numeric | 2-hour serum insulin (µU/ml) |
| mass | numeric | Body Mass Index |
| pedigree | numeric | History of Diabetes Mellitus in relatives & relationship with relatives |
| age | numeric | Age of the Patient |
| diabetes | factor | Presence of Diabetes |
We can find how many missing values are in each variable using the following code.
# checking whether the dataset has missing values
sapply(diab, function(x) sum(is.na(x)))
## pregnant glucose pressure triceps insulin mass pedigree age
## 0 5 35 227 374 11 0 0
## diabetes
## 0
We can see that three of the predictor variables have no missing values, while five have missing values. Insulin has the highest number of missing values (374), followed by triceps with 227. There are no NAs in the response variable.
Let us get a summary of the dataset using the following code.
# summary table of the dataset
summary(diab)
## pregnant glucose pressure triceps
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
## insulin mass pedigree age diabetes
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00 neg:500
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00 pos:268
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
Let us see how many complete cases we have using the following code.
# checking how many complete instances are in the dataset
dim(na.omit(diab))
## [1] 392 9
We can see that only 392 observations (cases) out of 768 are complete, so complete cases make up only about 51% of the dataset.
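The same figure can be obtained directly as a proportion. The following one-liner is a convenience sketch (not part of the original analysis) using complete.cases().
# proportion of complete cases (approximately 0.51)
mean(complete.cases(diab))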
We want to see the proportion of "pos" and "neg" cases in our dataset; using the following code we can get the proportion table.
# getting proportion of positive and negative diabetes cases in the dataset
prop.table(table(diab$diabetes))
##
## neg pos
## 0.6510417 0.3489583
The proportion of "neg" cases (absence of disease) is higher, at about 65%.
The following is a bar chart of the response variable showing the frequency of each class in the dataset.
# Bar chart for positive and negative cases in the dataset
diab%>%ggplot(aes(diabetes))+geom_bar()+ggtitle("Diabetes Diagnosis", subtitle = "Figure 1")
The following graph shows the distribution of positive and negative cases in relation to the number of times the patient has been pregnant.
diab%>%ggplot(aes(pregnant))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Pregnancies and Diabetes", subtitle = "Figure 2")
Figure 2 shows that neg cases tend to have a lower number of pregnancies.
The following code makes a bar chart showing the distribution of glucose levels in pos and neg cases.
diab%>%ggplot(aes(glucose))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Glucose and Diabetes", subtitle = "Figure 3")
As per Figure 3, the neg cases have lower blood glucose values.
The following code makes a bar chart showing the distribution of blood pressure values in pos and neg cases.
diab%>%ggplot(aes(pressure))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Blood Pressure and Diabetes", subtitle = "Figure 4")
We observe that high blood pressure values occur more frequently in pos cases.
The following code makes a bar chart showing skin fold thickness values in pos and neg cases.
diab%>%ggplot(aes(triceps))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Skin Thickness and Diabetes", subtitle = "Figure 5")
Triceps, which measures skin fold thickness, tends to be higher in pos cases, as per Figure 5.
The following code makes a bar chart showing insulin levels in pos and neg cases.
diab%>%ggplot(aes(insulin))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Insulin and Diabetes", subtitle = "Figure 6")
As per Figure 6, insulin levels are higher in pos cases.
The following code makes a bar chart showing BMI values in pos and neg cases.
diab%>%ggplot(aes(mass))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("BMI and Diabetes", subtitle = "Figure 7")
As per the distribution of BMI, pos cases include many patients with higher BMI.
The following code makes a bar chart showing the diabetes pedigree function values in pos and neg cases.
diab%>%ggplot(aes(pedigree))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Diabetes pedigree function", subtitle = "Figure 8")
Figure 8 shows that, like many other diseases, diabetes is linked to genetics; positive cases have higher pedigree values, indicating that their relatives have a history of diabetes.
The following code makes a bar chart showing the distribution of age in pos and neg cases.
diab%>%ggplot(aes(age))+geom_bar(aes(group=diabetes)) + facet_wrap(~diabetes)+
ggtitle("Age and Diabetes", subtitle = "Figure 9")
As per Figure 9, younger patients tend not to have diabetes.
The following figure and table summarise the patterns of missing values so that they can be grouped and processed further.
# Exploring cases of missing values
md.pattern(diab,rotate.names = TRUE)
## pregnant pedigree age diabetes glucose mass pressure triceps insulin
## 392 1 1 1 1 1 1 1 1 1 0
## 140 1 1 1 1 1 1 1 1 0 1
## 192 1 1 1 1 1 1 1 0 0 2
## 2 1 1 1 1 1 1 0 1 0 2
## 26 1 1 1 1 1 1 0 0 0 3
## 1 1 1 1 1 1 0 1 1 1 1
## 1 1 1 1 1 1 0 1 1 0 2
## 2 1 1 1 1 1 0 1 0 0 3
## 7 1 1 1 1 1 0 0 0 0 4
## 1 1 1 1 1 0 1 1 1 1 1
## 4 1 1 1 1 0 1 1 1 0 2
## 0 0 0 0 5 11 35 227 374 652
There are 11 distinct patterns of missing values; this information is used for imputation of the missing values to complete the dataset for onward analysis.
A graphical representation of the features with missing values and the proportion of missing values in each feature is shown in the following plot, produced by this code.
# Visualizing patterns in missing values (NAs)
mice_plot <- aggr(diab, col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(diab), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## insulin 0.486979167
## triceps 0.295572917
## pressure 0.045572917
## mass 0.014322917
## glucose 0.006510417
## pregnant 0.000000000
## pedigree 0.000000000
## age 0.000000000
## diabetes 0.000000000
It is found that 48.7% of insulin values and 29.6% of triceps values are missing.
Let us make a subset of the diab dataset containing only the complete cases, so we can analyse the correlations between the features.
#complete cases of diab dataset
diab_cc<-diab[complete.cases(diab),]
The following code makes the correlation plot.
# Exploring correlation and visualization of diab_cc #
corrplot(cor(diab_cc[, -9]), type = "upper", method = "number")
We can see a moderately high positive correlation of 0.68 between pregnant and age. This is followed by a correlation of 0.66 between mass (BMI) and triceps (skin fold thickness), and another moderate positive correlation of 0.58 between glucose and insulin.
We will compare these correlation values with those of the completed datasets after imputation later in this report. If the correlations are similar in the imputed datasets, that will indicate that the distributions of the various features have not been significantly changed by the imputation process.
There are many techniques for handling missing data; the main options are described below.
Deletion of incomplete cases may be used if the dataset is very large and only a very small number of NAs are present. This is not suitable in our case.
The mean of a numerical feature can be used to replace its NAs. This is common but can change the distribution and add bias to the data.
The median of a numerical feature can be used instead of the mean when the data contain outliers that would distort the mean, but again this can change the distribution of the variable and may add bias.
The mode can be used to replace NAs in categorical features, but this too can affect the distribution of the feature.
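For illustration only, the following sketch shows how median imputation could be done with dplyr. It is not used anywhere else in this report, and the object name diab_median is hypothetical.
# sketch: replace NAs in every numeric column with that column's median (not used further)
diab_median <- diab %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))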
Using a regression model, the missing values can be predicted; this is also quite a common method. The imputed values lie on the regression line fitted using the complete cases and the available features of the incomplete cases.
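As a sketch of plain (deterministic) regression imputation, the mice package provides the "norm.predict" method; the code below is illustrative only, and the objects imp_reg and diab_imp_reg are not used elsewhere in this report.
# sketch: deterministic regression imputation with mice (imputed values fall on the regression line)
imp_reg <- mice(diab, method = "norm.predict", m = 1)
diab_imp_reg <- complete(imp_reg)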
Stochastic regression imputation also uses regression to impute the NAs, but adds a small amount of random variability to the regression predictions. This method tries to preserve the variability in the dataset. I have used this as my first imputation method to complete the dataset.
Random forest imputation is also very common. The algorithm starts with the median (for numeric features) or the most frequent value as the replacement for each NA and then refines these estimates over several iterations by growing random forests and using their proximities. I have used this as my second method to impute the data.
So, there are two imputed datasets.
The following code performs stochastic regression imputation and creates the diab_imp_sr dataset.
# regression imputation of missing values #
set.seed(100, sample.kind="Rounding")
## Warning in set.seed(100, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
imp <- mice(diab, method = "norm.nob", m = 1) # Impute data
##
## iter imp variable
## 1 1 glucose pressure triceps insulin mass
## 2 1 glucose pressure triceps insulin mass
## 3 1 glucose pressure triceps insulin mass
## 4 1 glucose pressure triceps insulin mass
## 5 1 glucose pressure triceps insulin mass
diab_imp_sr <- complete(imp) # Store data
The following is the correlation plot for the diab_imp_sr dataset (imputed using stochastic regression).
# visualizing and checking the correlation between features after imputation #
corrplot(cor(diab_imp_sr[, -9]), type = "upper", method = "number")
As we can see, the correlations between the variables have changed only very slightly, which suggests that the imputed values are in line with the values in the complete cases.
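This can also be checked numerically. The following sketch computes the largest absolute change in any pairwise correlation between the complete-cases data and the imputed data; the exact value depends on the imputation seed.
# sketch: largest absolute change in pairwise correlations after stochastic regression imputation
max(abs(cor(diab_imp_sr[, -9]) - cor(diab_cc[, -9])))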
The following code performs random forest imputation and creates the diab_imp_rf dataset.
# random forest imputation of missing values #
set.seed(101, sample.kind="Rounding")
## Warning in set.seed(101, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
diab_imp_rf<-rfImpute(diabetes ~ ., diab, iter=5, ntree=500)
## ntree OOB 1 2
## 500: 23.31% 15.00% 38.81%
## ntree OOB 1 2
## 500: 24.22% 16.20% 39.18%
## ntree OOB 1 2
## 500: 24.09% 15.60% 39.93%
## ntree OOB 1 2
## 500: 23.96% 15.80% 39.18%
## ntree OOB 1 2
## 500: 24.22% 16.20% 39.18%
The following is the correlation plot for the diab_imp_rf dataset (imputed using the random forest method).
# visualizing and checking the correlation between features after random forest imputation #
corrplot(cor(diab_imp_rf[, -1]), type = "upper", method = "number")
The correlation values are similar to those of the original complete-cases dataset, hence the imputed values appear to have been predicted sensibly.
As we have two imputed datasets, we will divide each of them into a train and a test set. train1 is the training set and test1 is the test set obtained from the diab_imp_sr dataset (imputed using stochastic regression).
The following code creates the train1 and test1 sets.
# test and train set from stochastic regression imputed dataset #
set.seed(102, sample.kind="Rounding")
## Warning in set.seed(102, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
ind<-createDataPartition(y=diab$diabetes,times=1, p=0.2, list=FALSE)
train1<-diab_imp_sr[-ind,]
test1<-diab_imp_sr[ind,]
Now we make train and test sets from the diab_imp_rf dataset (imputed using the random forest method): train2 is the training set and test2 is the test set obtained from this dataset.
The following code produces the train2 and test2 datasets.
# test and train set from random forest imputed dataset #
set.seed(103, sample.kind="Rounding")
## Warning in set.seed(103, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
ind<-createDataPartition(y=diab$diabetes,times=1, p=0.2, list=FALSE)
train2<-diab_imp_rf[-ind,]
test2<-diab_imp_rf[ind,]
Using the train1 set we will build three training models using the machine learning procedures KNN, SVM and logistic regression. We will call these models train_knn1, train_svm1 and train_glm1.
Similarly, we will use the train2 dataset to train three models with the same methods; these will be called train_knn2, train_svm2 and train_glm2.
We will use the test1 and test2 datasets to check the accuracy of all six models and evaluate their performance.
We will use 10-fold cross-validation to develop the KNN and SVM models and select their tuning parameters based on cross-validation.
The following code builds the train_knn1 model using 10-fold cross-validation.
######### MODELS TRAINING ###############
#### KNN MODEL ####
# train knn model using 10-fold cross validation for train1 dataset
set.seed(104, sample.kind="Rounding")
## Warning in set.seed(104, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
train_knn1 <- train(diabetes ~ .,
method = "knn",
data = train1,
tuneGrid = data.frame(k = seq(3, 51, 2)),
trControl = trainControl(method = "cv", number = 10, p = 0.9))
Using the code below we can find the best tuning parameter (the number of neighbors, k).
# checking the best tuning parameter for knn
ggplot(train_knn1)
train_knn1$bestTune
## k
## 17 35
Similarly, we train the train_knn2 model using the same procedure.
# train knn model using 10-fold cross validation for train2 dataset
set.seed(105, sample.kind="Rounding")
## Warning in set.seed(105, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
train_knn2 <- train(diabetes ~ .,
method = "knn",
data = train2,
tuneGrid = data.frame(k = seq(3, 51, 2)),
trControl = trainControl(method = "cv", number = 10, p = 0.9))
# checking the best tuning parameter for knn
ggplot(train_knn2)
train_knn2$bestTune
## k
## 4 9
We use 10-fold cross-validation with accuracy as the metric to train the Support Vector Machine models. Since the decision boundary over the eight features is unlikely to be linear, we use a radial kernel instead of a linear kernel.
The following code develops the train_svm1 model.
#### SVM MODEL ####
# train svm model using 10-fold cross validation using train1 dataset #
set.seed(106, sample.kind="Rounding")
## Warning in set.seed(106, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
k <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
train_svm1 <- train(diabetes~., data=train1, method="svmRadial", metric=metric, trControl=k)
The following code develops the train_svm2 model.
# train svm model using 10-fold cross validation using train2 dataset #
set.seed(107, sample.kind="Rounding")
## Warning in set.seed(107, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
k <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
train_svm2 <- train(diabetes~., data=train2, method="svmRadial", metric=metric, trControl=k)
As our response variable is dichotomous, we can use a logistic regression model. The following code develops the train_glm1 model.
#### LOGISTIC REGRESSION MODEL ####
# train logistic regression model using train1 dataset #
train_glm1<-train(diabetes~.,data=train1,method="glm")
The following code develops the train_glm2 model.
# train logistic regression model using train2 dataset #
train_glm2<-train(diabetes~.,data=train2,method="glm")
We use the test1 dataset to test the train_knn1 model; the following code performs the prediction and gives the accuracy as well as the confusion matrix.
#### predictions of models developed using train1 dataset ####
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_knn1
pred_knn1<-predict(train_knn1,test1)
mean(pred_knn1==test1$diabetes)
## [1] 0.7467532
cm1<-confusionMatrix(pred_knn1,test1$diabetes)
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 86 25
## pos 14 29
##
## Accuracy : 0.7468
## 95% CI : (0.6705, 0.8133)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.006192
##
## Kappa : 0.4166
##
## Mcnemar's Test P-Value : 0.109315
##
## Sensitivity : 0.8600
## Specificity : 0.5370
## Pos Pred Value : 0.7748
## Neg Pred Value : 0.6744
## Prevalence : 0.6494
## Detection Rate : 0.5584
## Detection Prevalence : 0.7208
## Balanced Accuracy : 0.6985
##
## 'Positive' Class : neg
##
cm1$overall[1]
## Accuracy
## 0.7467532
cm1$byClass[1:2]
## Sensitivity Specificity
## 0.860000 0.537037
We have used the train_knn1 model to predict the test1 dataset; both the training and test sets come from the data imputed by stochastic regression. As per the above results, the accuracy is 74.68%. All other parameters are also given in the confusion matrix output.
Now, we use the test2 dataset for prediction using train_knn1.
## predicting test2 (test part of dataset imputed using random forest) using train_knn1 #
pred_knn12<-predict(train_knn1,test2)
mean(pred_knn12==test2$diabetes)
## [1] 0.7597403
cm2<-confusionMatrix(pred_knn12,test2$diabetes)
cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 87 24
## pos 13 30
##
## Accuracy : 0.7597
## 95% CI : (0.6844, 0.8248)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.002122
##
## Kappa : 0.4465
##
## Mcnemar's Test P-Value : 0.100178
##
## Sensitivity : 0.8700
## Specificity : 0.5556
## Pos Pred Value : 0.7838
## Neg Pred Value : 0.6977
## Prevalence : 0.6494
## Detection Rate : 0.5649
## Detection Prevalence : 0.7208
## Balanced Accuracy : 0.7128
##
## 'Positive' Class : neg
##
cm2$overall[1]
## Accuracy
## 0.7597403
cm2$byClass[1:2]
## Sensitivity Specificity
## 0.8700000 0.5555556
In the above case, the test set is test2, which is part of the data imputed using the random forest method. The accuracy is now 75.97%. The accuracy is higher on test2 than on test1, but the difference is only about 1.3 percentage points.
The following code uses train_svm1 on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_svm1 #
pred_svm1<-predict(train_svm1,test1)
mean(pred_svm1==test1$diabetes)
## [1] 0.7077922
cm3<-confusionMatrix(pred_svm1,test1$diabetes)
cm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 80 25
## pos 20 29
##
## Accuracy : 0.7078
## 95% CI : (0.6292, 0.7782)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.07413
##
## Kappa : 0.3444
##
## Mcnemar's Test P-Value : 0.55098
##
## Sensitivity : 0.8000
## Specificity : 0.5370
## Pos Pred Value : 0.7619
## Neg Pred Value : 0.5918
## Prevalence : 0.6494
## Detection Rate : 0.5195
## Detection Prevalence : 0.6818
## Balanced Accuracy : 0.6685
##
## 'Positive' Class : neg
##
cm3$overall[1]
## Accuracy
## 0.7077922
cm3$byClass[1:2]
## Sensitivity Specificity
## 0.800000 0.537037
The accuracy is 70.78%.
The following code uses train_svm1 on the test2 dataset.
## predicting test2 (test part of dataset imputed using random forest method)
## using train_svm1
pred_svm12<-predict(train_svm1,test2)
mean(pred_svm12==test2$diabetes)
## [1] 0.7467532
cm4<-confusionMatrix(pred_svm12,test2$diabetes)
cm4
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 87 26
## pos 13 28
##
## Accuracy : 0.7468
## 95% CI : (0.6705, 0.8133)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.006192
##
## Kappa : 0.4113
##
## Mcnemar's Test P-Value : 0.054664
##
## Sensitivity : 0.8700
## Specificity : 0.5185
## Pos Pred Value : 0.7699
## Neg Pred Value : 0.6829
## Prevalence : 0.6494
## Detection Rate : 0.5649
## Detection Prevalence : 0.7338
## Balanced Accuracy : 0.6943
##
## 'Positive' Class : neg
##
cm4$overall[1]
## Accuracy
## 0.7467532
cm4$byClass[1:2]
## Sensitivity Specificity
## 0.8700000 0.5185185
The accuracy is 74.68%.
The following code uses train_glm1 on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_glm1
pred_glm1<-predict(train_glm1,test1)
mean(pred_glm1==test1$diabetes)
## [1] 0.7337662
cm5<-confusionMatrix(pred_glm1,test1$diabetes)
cm5
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 81 22
## pos 19 32
##
## Accuracy : 0.7338
## 95% CI : (0.6566, 0.8017)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.01593
##
## Kappa : 0.4078
##
## Mcnemar's Test P-Value : 0.75478
##
## Sensitivity : 0.8100
## Specificity : 0.5926
## Pos Pred Value : 0.7864
## Neg Pred Value : 0.6275
## Prevalence : 0.6494
## Detection Rate : 0.5260
## Detection Prevalence : 0.6688
## Balanced Accuracy : 0.7013
##
## 'Positive' Class : neg
##
cm5$overall[1]
## Accuracy
## 0.7337662
cm5$byClass[1:2]
## Sensitivity Specificity
## 0.8100000 0.5925926
The prediction accuracy is now 73.38%.
The following code uses train_glm1 on the test2 dataset.
## predicting test2 (test part of dataset imputed using random forest method)
## using train_glm1
pred_glm12<-predict(train_glm1,test2)
mean(pred_glm12==test2$diabetes)
## [1] 0.7467532
cm6<-confusionMatrix(pred_glm12,test2$diabetes)
cm6
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 84 23
## pos 16 31
##
## Accuracy : 0.7468
## 95% CI : (0.6705, 0.8133)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.006192
##
## Kappa : 0.4268
##
## Mcnemar's Test P-Value : 0.336668
##
## Sensitivity : 0.8400
## Specificity : 0.5741
## Pos Pred Value : 0.7850
## Neg Pred Value : 0.6596
## Prevalence : 0.6494
## Detection Rate : 0.5455
## Detection Prevalence : 0.6948
## Balanced Accuracy : 0.7070
##
## 'Positive' Class : neg
##
cm6$overall[1]
## Accuracy
## 0.7467532
cm6$byClass[1:2]
## Sensitivity Specificity
## 0.8400000 0.5740741
The test2 set accuracy is 74.68%.
We now use the models developed with the train2 dataset to predict both test sets. The following code uses the train_knn2 model on the test2 dataset.
#### predictions of models developed using train2 dataset ####
## predicting test2 (test part of dataset imputed using random forest)
## using train_knn2
pred_knn2<-predict(train_knn2,test2)
mean(pred_knn2==test2$diabetes)
## [1] 0.7337662
cm7<-confusionMatrix(pred_knn2,test2$diabetes)
cm7
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 78 19
## pos 22 35
##
## Accuracy : 0.7338
## 95% CI : (0.6566, 0.8017)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.01593
##
## Kappa : 0.4227
##
## Mcnemar's Test P-Value : 0.75478
##
## Sensitivity : 0.7800
## Specificity : 0.6481
## Pos Pred Value : 0.8041
## Neg Pred Value : 0.6140
## Prevalence : 0.6494
## Detection Rate : 0.5065
## Detection Prevalence : 0.6299
## Balanced Accuracy : 0.7141
##
## 'Positive' Class : neg
##
cm7$overall[1]
## Accuracy
## 0.7337662
cm7$byClass[1:2]
## Sensitivity Specificity
## 0.7800000 0.6481481
The accuracy is 73.38%; here both the test set and the training set are based on data imputed using the random forest method.
The following code uses the train_knn2 model on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_knn2
pred_knn21<-predict(train_knn2,test1)
mean(pred_knn21==test1$diabetes)
## [1] 0.7402597
cm8<-confusionMatrix(pred_knn21,test1$diabetes)
cm8
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 80 20
## pos 20 34
##
## Accuracy : 0.7403
## 95% CI : (0.6635, 0.8075)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.01009
##
## Kappa : 0.4296
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8000
## Specificity : 0.6296
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.6296
## Prevalence : 0.6494
## Detection Rate : 0.5195
## Detection Prevalence : 0.6494
## Balanced Accuracy : 0.7148
##
## 'Positive' Class : neg
##
cm8$overall[1]
## Accuracy
## 0.7402597
cm8$byClass[1:2]
## Sensitivity Specificity
## 0.8000000 0.6296296
The accuracy of the model is 74.03% on the test1 dataset. Both test datasets are predicted with similar accuracies, which suggests that the model has not been overtrained.
The following code uses the train_svm2 model on the test2 dataset.
## predicting test2 (test part of dataset imputed using random forest)
## using train_svm2
pred_svm2<-predict(train_svm2,test2)
mean(pred_svm2==test2$diabetes)
## [1] 0.7142857
cm9<-confusionMatrix(pred_svm2,test2$diabetes)
cm9
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 83 27
## pos 17 27
##
## Accuracy : 0.7143
## 95% CI : (0.636, 0.7841)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.05265
##
## Kappa : 0.3447
##
## Mcnemar's Test P-Value : 0.17484
##
## Sensitivity : 0.8300
## Specificity : 0.5000
## Pos Pred Value : 0.7545
## Neg Pred Value : 0.6136
## Prevalence : 0.6494
## Detection Rate : 0.5390
## Detection Prevalence : 0.7143
## Balanced Accuracy : 0.6650
##
## 'Positive' Class : neg
##
cm9$overall[1]
## Accuracy
## 0.7142857
cm9$byClass[1:2]
## Sensitivity Specificity
## 0.83 0.50
The prediction accuracy is 71.43%.
The following code uses the train_svm2 model on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_svm2
pred_svm21<-predict(train_svm2,test1)
mean(pred_svm21==test1$diabetes)
## [1] 0.7532468
cm10<-confusionMatrix(pred_svm21,test1$diabetes)
cm10
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 83 21
## pos 17 33
##
## Accuracy : 0.7532
## 95% CI : (0.6774, 0.8191)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.003683
##
## Kappa : 0.4488
##
## Mcnemar's Test P-Value : 0.626496
##
## Sensitivity : 0.8300
## Specificity : 0.6111
## Pos Pred Value : 0.7981
## Neg Pred Value : 0.6600
## Prevalence : 0.6494
## Detection Rate : 0.5390
## Detection Prevalence : 0.6753
## Balanced Accuracy : 0.7206
##
## 'Positive' Class : neg
##
cm10$overall[1]
## Accuracy
## 0.7532468
cm10$byClass[1:2]
## Sensitivity Specificity
## 0.8300000 0.6111111
The prediction accuracy is 75.32% on the test1 dataset.
The following code uses the train_glm2 model on the test2 dataset.
## predicting test2 (test part of dataset imputed using random forest)
## using train_glm2
pred_glm2<-predict(train_glm2,test2)
mean(pred_glm2==test2$diabetes)
## [1] 0.7467532
cm11<-confusionMatrix(pred_glm2,test2$diabetes)
cm11
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 83 22
## pos 17 32
##
## Accuracy : 0.7468
## 95% CI : (0.6705, 0.8133)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.006192
##
## Kappa : 0.4318
##
## Mcnemar's Test P-Value : 0.521839
##
## Sensitivity : 0.8300
## Specificity : 0.5926
## Pos Pred Value : 0.7905
## Neg Pred Value : 0.6531
## Prevalence : 0.6494
## Detection Rate : 0.5390
## Detection Prevalence : 0.6818
## Balanced Accuracy : 0.7113
##
## 'Positive' Class : neg
##
cm11$overall[1]
## Accuracy
## 0.7467532
cm11$byClass[1:2]
## Sensitivity Specificity
## 0.8300000 0.5925926
The prediction accuracy is 74.68%.
The following code uses the train_glm2 model on the test1 dataset.
## predicting test1 (test part of dataset imputed using stochastic regression)
## using train_glm2
pred_glm21<-predict(train_glm2,test1)
mean(pred_glm21==test1$diabetes)
## [1] 0.7597403
cm12<-confusionMatrix(pred_glm21,test1$diabetes)
cm12
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 82 19
## pos 18 35
##
## Accuracy : 0.7597
## 95% CI : (0.6844, 0.8248)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.002122
##
## Kappa : 0.4702
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.8200
## Specificity : 0.6481
## Pos Pred Value : 0.8119
## Neg Pred Value : 0.6604
## Prevalence : 0.6494
## Detection Rate : 0.5325
## Detection Prevalence : 0.6558
## Balanced Accuracy : 0.7341
##
## 'Positive' Class : neg
##
cm12$overall[1]
## Accuracy
## 0.7597403
cm12$byClass[1:2]
## Sensitivity Specificity
## 0.8200000 0.6481481
The prediction accuracy is 75.97%.
The following table summarises all the models, their training datasets and imputation methods, and their accuracies on both test sets.
# making a dataframe of all the results and presenting them in a nice table
results<-data.frame(Model=c("KNN1","KNN2","SVM1","SVM2","GLM1","GLM2"),
Training_Data = c("train1",
"train2","train1","train2","train1","train2"),
NA_imputation_method =c("Stochastic Regression",
"Random Forest", "Stochastic Regression", "Random Forest",
"Stochastic Regression","Random Forest"),
Accuracy_test1=c(mean(pred_knn1==test1$diabetes),
mean(pred_knn21==test1$diabetes),mean(pred_svm1==test1$diabetes),
mean(pred_svm21==test1$diabetes),mean(pred_glm1==test1$diabetes),
mean(pred_glm21==test1$diabetes)),
Accuracy_test2=c(mean(pred_knn12==test2$diabetes),
mean(pred_knn2==test2$diabetes),
mean(pred_svm12==test2$diabetes),
mean(pred_svm2==test2$diabetes),
mean(pred_glm12==test2$diabetes),
mean(pred_glm2==test2$diabetes)))
results%>%knitr::kable()
| Model | Training_Data | NA_imputation_method | Accuracy_test1 | Accuracy_test2 |
|---|---|---|---|---|
| KNN1 | train1 | Stochastic Regression | 0.7467532 | 0.7597403 |
| KNN2 | train2 | Random Forest | 0.7402597 | 0.7337662 |
| SVM1 | train1 | Stochastic Regression | 0.7077922 | 0.7467532 |
| SVM2 | train2 | Random Forest | 0.7532468 | 0.7142857 |
| GLM1 | train1 | Stochastic Regression | 0.7337662 | 0.7467532 |
| GLM2 | train2 | Random Forest | 0.7597403 | 0.7467532 |
From the accuracy values, we can see that the logistic regression and KNN models have performed better than the SVM models on this dataset. It is also noted that some accuracy values are the same for two different models: for example, the KNN1 model on the test2 dataset and the GLM2 model on the test1 dataset both give an accuracy of 75.97%, which is the highest among all the cases. A closer analysis of these two models shows that, although they have the same accuracy, their confusion matrices are different: for KNN1/test2 the sensitivity is 87% and the specificity is 55.56%, while for GLM2/test1 the sensitivity is 82% and the specificity is 64.81%. An important point here is that it may not be a good decision to select a model based on accuracy alone; it is important to analyse all the information given in the confusion matrix as well as the statistical and practical significance of the model parameters. In the current analysis, if we compare just these two high-accuracy models, we need to decide which is more important: identifying true positives or predicting negatives correctly. In our case, the positive class in all the confusion matrices is "neg", i.e. absence of diabetes, so a high sensitivity means we are able to correctly identify a person who is unlikely to have diabetes.
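To make this comparison explicit, the following short sketch pulls the sensitivity and specificity of these two models out of the confusion matrix objects created above (cm2 and cm12).
# sketch: comparing the two highest-accuracy models beyond accuracy alone
data.frame(Model = c("KNN1 / test2", "GLM2 / test1"),
  Sensitivity = c(cm2$byClass["Sensitivity"], cm12$byClass["Sensitivity"]),
  Specificity = c(cm2$byClass["Specificity"], cm12$byClass["Specificity"]))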
A close look at all the confusion matrices indicates that in every model the sensitivity is high, usually around 80% or more, but the specificity is low. All of these models have one thing in common: they have high positive predictive power (the positive class here being "neg", they correctly identify patients without diabetes). These models are less suitable for identifying "pos" cases, as their negative predictive power is low. Further machine learning methods such as random forests or neural networks could be applied to develop a model with greater power to identify the presence of diabetes; a sketch of one such model is given below.
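As an illustration of one such extension (not one of the six models evaluated above), the sketch below trains a random forest classifier on train1 with caret and scores it on test1; the seed value and the default tuning grid are arbitrary choices, and its results are not reported here.
# sketch only: a random forest classifier trained on train1 and evaluated on test1
set.seed(108, sample.kind = "Rounding")
train_rf1 <- train(diabetes ~ ., data = train1, method = "rf",
  trControl = trainControl(method = "cv", number = 10))
pred_rf1 <- predict(train_rf1, test1)
confusionMatrix(pred_rf1, test1$diabetes)$overall["Accuracy"]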
It is also noted that, during data collection and compilation, special effort is needed to ensure the data are free from missing values or errors. Missing values may be handled using the techniques discussed in this report or using more advanced procedures such as the hybrid classifier LANFIS, which combines logistic regression with an adaptive network-based fuzzy inference system (Ramezani et al., 2017) and achieved an accuracy of 88.07% on the same dataset.
In this project, the PimaIndiansDiabetes dataset was used to develop classification models. The dataset contains missing values; only 51% of the instances are complete cases. Ignoring the incomplete cases and using only the remaining data is not a suitable option, so the missing data were imputed using stochastic regression and a random forest model. The two imputed datasets were used to train models based on KNN, SVM and logistic regression; the KNN and SVM models were trained using 10-fold cross-validation with accuracy as the metric. The trained models were tested on the test portions of both imputed datasets. These models showed accuracies ranging roughly from 70% to 76%, with high power to identify cases where the disease is absent. The best accuracy, 75.97%, was obtained by the KNN and GLM models. Of the six models, GLM appears comparatively better, as it matches the best KNN accuracy while showing more balanced sensitivity and specificity.