About the dataset

The dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases and is available on Kaggle.

Content

All patients are females, at least 21 years old, of Pima Indian heritage, for whom the following information is available: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age and the Outcome (whether the patient has diabetes).

Using this dataset, a decision tree analysis is performed in order to classify whether a patient has the disease or not.

Loading the libraries and dataset

library(tidyverse) # data manipulation (slice_sample)
library(caret)     # data partitioning, model training and evaluation
library(corrplot)  # visualization of the correlation matrix
library(mice)      # inspection of missing-data patterns
diabetes <- read.csv(file = "diabetes.csv") # Loading the Pima Indians Diabetes dataset

Exploring the dataset

str(diabetes) # Examining the structure of this dataset
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...
slice_sample(diabetes, n = 10) # Drawing a random sample of 10 observations from this dataset

We first notice that the dataset has 768 rows and 9 variables. The data is recorded at the patient level: each row is the record of one patient.
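As a quick sanity check, the dimensions can also be read off directly:

dim(diabetes)     # 768 rows and 9 columns, matching the str() output above
summary(diabetes) # minimum, maximum, quartiles and mean of every variable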

md.pattern(diabetes, plot = FALSE) # Looking for missing values
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'
##     Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 768           1       1             1             1       1   1
##               0       0             0             0       0   0
##     DiabetesPedigreeFunction Age Outcome  
## 768                        1   1       1 0
##                            0   0       0 0

We notice that there are no missing (NA) values in this dataset.
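The same conclusion can be reached with a plain base-R check:

anyNA(diabetes)          # FALSE, since the data frame contains no NA values
colSums(is.na(diabetes)) # number of NAs per column, all zero here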

diabetes2 <- diabetes # Copying the data so the original keeps a numeric Outcome for the correlation matrix below
diabetes2$Outcome <- factor(diabetes$Outcome, levels = c(0:1), labels = c("Healthy", "Diabetes")) # Labelling the outcome levels
freq <- table(diabetes2$Outcome) # Frequency of each outcome class
freq[2]/(freq[1]+freq[2]) # Proportion of patients with diabetes
##  Diabetes 
## 0.3489583
contrasts(diabetes2$Outcome) # Checking the dummy coding of the outcome factor
##          Diabetes
## Healthy         0
## Diabetes        1

We see that about 35% of the patients have the disease.
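For reference, the full class distribution can be printed in one step from the frequency table built above:

prop.table(freq)                 # share of Healthy and Diabetes patients
round(100 * prop.table(freq), 1) # the same split expressed in percent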

cordiabetes <- cor(diabetes) # Getting the correlation matrix between variables
corrplot(cordiabetes,
         method = "color",
         order = "hclust",
         addCoef.col = "black",
         number.cex = .6) # Visualizing the correlation matrix to identify patterns between variables. 

We notice that all variables are positively correlated with the dependent variable. SkinThickness and BloodPressure show the weakest correlation with the variable of interest, so they could be dropped to simplify the model; here their removal is left commented out and all eight predictors are kept for training.
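A quick way to verify this is to pull the Outcome column out of the correlation matrix computed above and sort it:

sort(cordiabetes[, "Outcome"], decreasing = TRUE) # correlation of every variable with Outcome
# SkinThickness and BloodPressure are expected to sit at the bottom of this ranking,
# while Glucose shows the strongest association with the outcome.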

# diabetes$BloodPressure <- NULL # optional: drop the two weakly correlated variables
# diabetes$SkinThickness <- NULL
diabetes$Outcome <- factor(diabetes$Outcome, levels = c(0:1), labels = c("Healthy", "Diabetes")) # Recoding the outcome as a factor for classification

Preparing the data for the Decision Tree Model

partition <- createDataPartition(y = diabetes$Outcome, p = .70, list = FALSE, times = 1)
data_train <- diabetes[partition, ] # Creating the dataset for training (70% of the patients)
data_test <- diabetes[-partition, ] # Creating the dataset for testing (the remaining 30%)
fit_control <- trainControl(method = "cv", number = 5) # Setting up 5-fold cross-validation
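Since createDataPartition samples within each outcome class, the proportion of diabetic patients should be roughly the same in both subsets. A quick check:

prop.table(table(data_train$Outcome)) # class shares in the training set
prop.table(table(data_test$Outcome))  # class shares in the test set; both should be close to the ~35% prevalence above
# Note: calling set.seed() before createDataPartition() would make this particular split reproducible.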

Training the model

DTFit <- train(Outcome ~ ., data = data_train, method = "C5.0", trControl = fit_control, verbose = FALSE) # Training a C5.0 decision tree with 5-fold cross-validation
DTFit 
## C5.0 
## 
## 538 samples
##   8 predictor
##   2 classes: 'Healthy', 'Diabetes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 431, 430, 430, 431, 430 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa    
##   rules  FALSE    1      0.7119072  0.3637562
##   rules  FALSE   10      0.7473001  0.4313719
##   rules  FALSE   20      0.7473001  0.4313719
##   rules   TRUE    1      0.7323641  0.3966719
##   rules   TRUE   10      0.7695050  0.4733488
##   rules   TRUE   20      0.7695050  0.4733488
##   tree   FALSE    1      0.7063171  0.3574779
##   tree   FALSE   10      0.7584285  0.4471459
##   tree   FALSE   20      0.7584285  0.4471459
##   tree    TRUE    1      0.7323641  0.3962494
##   tree    TRUE   10      0.7676532  0.4683259
##   tree    TRUE   20      0.7676532  0.4683259
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 10, model = rules and
##  winnow = TRUE.
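The selected tuning parameters and the behaviour of the tuning grid can also be inspected directly with the usual caret accessors:

DTFit$bestTune # the trials, model and winnow values chosen by cross-validation
plot(DTFit)    # accuracy across the tuning grid
varImp(DTFit)  # variable importance of the final C5.0 model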

Testing the model on new data

diabetes_prediction <- predict(DTFit, data_test) # Predicting outcomes for the held-out test set
confusionMatrix(diabetes_prediction, data_test$Outcome) # Comparing predictions against the true outcomes
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Diabetes
##   Healthy      137       39
##   Diabetes      13       41
##                                           
##                Accuracy : 0.7739          
##                  95% CI : (0.7143, 0.8263)
##     No Information Rate : 0.6522          
##     P-Value [Acc > NIR] : 4.208e-05       
##                                           
##                   Kappa : 0.4608          
##                                           
##  Mcnemar's Test P-Value : 0.0005265       
##                                           
##             Sensitivity : 0.9133          
##             Specificity : 0.5125          
##          Pos Pred Value : 0.7784          
##          Neg Pred Value : 0.7593          
##              Prevalence : 0.6522          
##          Detection Rate : 0.5957          
##    Detection Prevalence : 0.7652          
##       Balanced Accuracy : 0.7129          
##                                           
##        'Positive' Class : Healthy         
##
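Note that caret takes the first factor level, Healthy, as the positive class, so the sensitivity of 0.91 above measures how well healthy patients are recognised. To read the same results from the point of view of detecting diabetes, the positive class can be set explicitly:

confusionMatrix(diabetes_prediction, data_test$Outcome, positive = "Diabetes")
# With positive = "Diabetes", sensitivity becomes 41 / (41 + 39), about 0.51,
# i.e. the value reported as specificity in the output above; accuracy and kappa are unchanged.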