Content
All patients are females at least 21 years old of Pima Indian heritage, for whom the following information is available:
- Pregnancies: Number of times pregnant.
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- BloodPressure: Diastolic blood pressure (mm Hg).
- SkinThickness: Triceps skin fold thickness (mm).
- Insulin: 2-Hour serum insulin (mu U/ml).
- BMI: Body mass index (weight in kg/(height in m)^2).
- DiabetesPedigreeFunction: Diabetes pedigree function.
- Age: Age (years).
- Outcome: Class variable (0 or 1).
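As a small worked example of the BMI definition in the list above, the formula can be written as a one-line function. The helper name `bmi` and the example weight and height are illustrative assumptions; they are not part of the dataset or of the analysis.

```r
# Hypothetical helper, for illustration only: BMI = weight in kg / (height in m)^2
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Made-up example values: 70 kg at 1.75 m
bmi_example <- bmi(70, 1.75)
round(bmi_example, 1)
```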
Using this dataset, a decision tree analysis is performed to classify whether a patient has the disease or not.
Loading the libraries and dataset
library(tidyverse)
library(caret)
library(corrplot)
library(mice)
diabetes <- read.csv(file = "diabetes.csv")
Exploring the dataset
str(diabetes) # Examining the structure of this dataset
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
slice_sample(diabetes, n = 10) # Drawing 10 random observations from this dataset
We first notice that the dataset has 768 rows and 9 variables: eight predictors plus the Outcome class. The data is recorded at the patient level, so each row is the record of one patient.
md.pattern(diabetes, plot = FALSE) # Looking for missing values
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 768 1 1 1 1 1 1
## 0 0 0 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 768 1 1 1 0
## 0 0 0 0
We notice that there are no missing values (NA) in this dataset.
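The check above only counts values coded as NA. The str() output shows zeros in BloodPressure, SkinThickness, Insulin and BMI, which are not physiologically plausible and may stand for unrecorded measurements. The following is a minimal sketch of how such zeros could be counted and recoded, using three columns of the first ten rows copied from the str() output. Treating zeros as missing is an assumption here, not a step the analysis performs, and the names `sample_rows`, `zero_counts`, `recoded` and `na_counts` are illustrative.

```r
# Three columns of the first ten rows, copied from the str() output above
sample_rows <- data.frame(
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Count the zeros in each column
zero_counts <- colSums(sample_rows == 0)
zero_counts

# Assumption: a zero in these columns means "not measured", so recode it as NA
recoded <- sample_rows
recoded[recoded == 0] <- NA

# After recoding, an NA-based check would report these values as missing
na_counts <- colSums(is.na(recoded))
na_counts
```

If the zeros were recoded this way on the full data, md.pattern() would no longer report the dataset as completely observed, and an imputation step might then be worth considering.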
diabetes2 <- diabetes
diabetes2$Outcome <- factor(diabetes$Outcome, levels=c(0:1), labels=c("Healthy", "Diabetes"))
freq <- table(diabetes2$Outcome)
freq[2]/(freq[1]+freq[2]) # Looking more in depth at the target variable
## Diabetes
## 0.3489583
contrasts(diabetes2$Outcome)
## Diabetes
## Healthy 0
## Diabetes 1
We see that about 35% of the patients have the disease.
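The same proportion can be obtained for both classes at once with prop.table(). The sketch below is self-contained: instead of reading diabetes.csv, it rebuilds the outcome vector from the counts implied by the output above (0.3489583 × 768 = 268 patients with diabetes, hence 500 healthy). The names `outcome` and `props` are illustrative.

```r
# Counts implied by the output above: 268 of 768 patients have diabetes
outcome <- factor(c(rep(0, 500), rep(1, 268)),
                  levels = 0:1, labels = c("Healthy", "Diabetes"))

# Share of each class in the target variable
props <- prop.table(table(outcome))
round(props, 3)
```

With roughly a 65/35 split, the classes are moderately imbalanced, which is worth keeping in mind when reading accuracy figures later on.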
cordiabetes <- cor(diabetes) #Getting the correlation matrix between variables
corrplot(cordiabetes,
method = "color",
order = "hclust",
addCoef.col = "black",
number.cex = .6) # Visualizing the correlation matrix to identify patterns between variables.
We notice that all variables are positively correlated with the dependent variable. In order to simplify the model, the variables SkinThickness and BloodPressure will not be considered because of their low correlation with the variable of interest.
#diabetes$BloodPressure = NULL
#diabetes$SkinThickness = NULL
diabetes$Outcome <- factor(diabetes$Outcome, levels=c(0:1), labels=c("Healthy", "Diabetes"))
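The variable-selection reasoning above can be made explicit by ranking the predictors by the absolute value of their correlation with Outcome. The sketch below is self-contained and uses only the first ten rows shown in the str() output, so its numbers are not representative of the full 768 rows; it illustrates the technique, not the result. The names `first_rows` and `ranked` are illustrative. Note that the correlation must be computed while Outcome is still numeric, that is, before the conversion to a factor.

```r
# First ten rows of a few columns, copied from the str() output above
first_rows <- data.frame(
  Pregnancies   = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8),
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Outcome       = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)
)

# Correlation of every column with Outcome, dropping Outcome itself
cor_with_outcome <- cor(first_rows)[, "Outcome"]
cor_with_outcome <- cor_with_outcome[names(cor_with_outcome) != "Outcome"]

# Rank predictors from strongest to weakest absolute correlation
ranked <- sort(abs(cor_with_outcome), decreasing = TRUE)
ranked
```

Applied to the full dataset, the variables at the bottom of such a ranking would be the candidates for removal.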