STEP-1 : Importing Diabetes Data


# Set the working directory and import the data
setwd("C:/Users/Shradha/Desktop/SEM-8/R Lab/Mini Project")
diabetes <- read.csv("diabetes.csv")

STEP-2 : Exploratory Data Analysis on Diabetes Data


Head : head()

The head() function returns the first six observations of the dataset by default.

head(diabetes)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
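
head() also accepts an n argument to control how many rows are shown, and tail() works the same way for the last rows:

head(diabetes, n = 3) # first three observations
tail(diabetes, n = 3) # last three observations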

Summary : summary()

Here we compute the minimum, 1st quartile, median, mean, 3rd quartile and maximum for all numeric variables of the dataset at once using the summary() function.

summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Structure : str()

The str() function displays the internal structure of an R object.

str(diabetes)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Column Names : colnames()

The colnames() function returns the column names of the dataset.

colnames(diabetes)
## [1] "Pregnancies"              "Glucose"                 
## [3] "BloodPressure"            "SkinThickness"           
## [5] "Insulin"                  "BMI"                     
## [7] "DiabetesPedigreeFunction" "Age"                     
## [9] "Outcome"
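
Related helpers report the dimensions of the dataset:

dim(diabetes)  # number of rows and columns (768 and 9)
nrow(diabetes) # number of observations
ncol(diabetes) # number of variables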

Minimum : min() & Maximum : max()

Here we will find the minimum and maximum values of the Glucose column of the diabetes dataset.

min_glucose <- min(diabetes$Glucose)
print(paste("Minimum Glucose Value :",min_glucose))
## [1] "Minimum Glucose Value : 0"
max_glucose <- max(diabetes$Glucose)
print(paste("Maximum Glucose Value :",max_glucose))
## [1] "Maximum Glucose Value : 199"

Range : range()

The range() function returns the minimum and maximum together as a single vector of length two, whose elements we access by index as shown below.

range_Glucose <- range(diabetes$Glucose)
print(range_Glucose)
## [1]   0 199
print(paste("Minimum Glucose Value :",range_Glucose[1]))
## [1] "Minimum Glucose Value : 0"
print(paste("Maximum Glucose Value :",range_Glucose[2]))
## [1] "Maximum Glucose Value : 199"

Mean : mean()

The mean is calculated by taking the sum of the values and dividing by the number of values in a data series. Here we will find the mean of the Glucose column.

Mean_Glucose <- mean(diabetes$Glucose)
print(paste("Mean of Glucose :",Mean_Glucose))
## [1] "Mean of Glucose : 120.89453125"

Median : median()

The middlemost value in a data series is called the median. Let us find the median of the Glucose column.

Median_Glucose <- median(diabetes$Glucose)
print(paste("Median of Glucose :",Median_Glucose))
## [1] "Median of Glucose : 117"

Mode : table() & sort()

The mode is the value with the highest number of occurrences in a set of data. Base R has no function for the statistical mode (the built-in mode() returns an object's storage mode instead), but we can easily find it using the functions table() and sort().

Let us find the most frequently occurring value in the Glucose column.

Mode_Glucose <- table(diabetes$Glucose)
sort(Mode_Glucose,decreasing = TRUE)
## 
##  99 100 106 111 125 129  95 102 105 108 112 109 122  90 107 114 117 119 120 124 
##  17  17  14  14  14  14  13  13  13  13  13  12  12  11  11  11  11  11  11  11 
## 128  84 115  88  91  92  97 101 103 123 126 146  96 136 137 139 158  85  87  93 
##  11  10  10   9   9   9   9   9   9   9   9   9   8   8   8   8   8   7   7   7 
##  94 116 130 144 147  80  81  83  89 104 110 118 121 134 143 151 154 162 173   0 
##   7   7   7   7   7   6   6   6   6   6   6   6   6   6   6   6   6   6   6   5 
## 113 127 131 132 133 138 140 141 142 145 155 179 180 181  71  74  78 135 148 152 
##   5   5   5   5   5   5   5   5   5   5   5   5   5   5   4   4   4   4   4   4 
## 165 168 187 189 197  68  73  79  82  86  98 150 156 161 163 164 166 167 171 183 
##   4   4   4   4   4   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 184 194 196  57  75  76  77 153 157 159 170 174 175 176 188 193 195  44  56  61 
##   3   3   3   2   2   2   2   2   2   2   2   2   2   2   2   2   2   1   1   1 
##  62  65  67  72 149 160 169 172 177 178 182 186 190 191 198 199 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
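
Reading off the sorted table, the mode of Glucose is actually a tie: 99 and 100 both occur 17 times. To make this reusable, the logic can be wrapped in a small helper function; a minimal sketch (get_mode is our own name, not a base R function):

# Helper: most frequent value(s) of a vector; returns all values in case of a tie
get_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
get_mode(diabetes$Glucose) # 99 and 100 (both occur 17 times)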

First & Third Quartile : quantile()

Quartiles split the ordered data into four equal parts: 25% of the observations fall below the first (lower) quartile, the median marks the halfway point, and 75% fall below the third (upper) quartile. The quantile() function returns the value at any given probability.

q1 <- quantile(diabetes$Glucose,0.25) # first quartile
print(paste("First Quartile :",q1))
## [1] "First Quartile : 99"
q3 <- quantile(diabetes$Glucose,0.75) # third quartile
print(paste("Third Quartile :",q3))
## [1] "Third Quartile : 140.25"

Interquartile Range : IQR()

The interquartile range (the difference between the third and first quartiles) can be computed with the IQR() function. Let's find the IQR for the Glucose column. Alternatively, since we already computed the first and third quartiles with quantile(), we can simply subtract the two values.

IQR_Glucose <- IQR(diabetes$Glucose)
print(paste("Interquartile range for Glucose :",IQR_Glucose))
## [1] "Interquartile range for Glucose : 41.25"
# alternative (see the First & Third Quartile section above)
iqr_gluco <- q3 - q1
print(paste("Interquartile range for Glucose :",iqr_gluco))
## [1] "Interquartile range for Glucose : 41.25"

Standard Deviation : sd() & Variance : var()

Standard deviation is a measure of the amount of variation in a set of values. Variance is the average squared deviation of the data points from the mean, and the standard deviation is its square root.

sd_Glucose <- sd(diabetes$Glucose)
print(paste("Standard Deviation for Glucose Column :",sd_Glucose))
## [1] "Standard Deviation for Glucose Column : 31.9726181951362"
var_Glucose <- var(diabetes$Glucose)
print(paste("Variance for Glucose Column :",var_Glucose))
## [1] "Variance for Glucose Column : 1022.24831425196"

STEP-3 : Predicting Diabetes


#DO NOT MODIFY THIS CODE
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2) #for data visualization
library(grid) # for grids
library(gridExtra) # for arranging the grids
library(corrplot) # for Correlation plot
library(caret) # for confusion matrix
library(e1071) # for naive bayes

Plotting Histograms of Numeric Values

p1 <- ggplot(diabetes, aes(x=Pregnancies)) + ggtitle("Number of times pregnant") +
  geom_histogram(aes(y = 100*(..count..)/sum(..count..)), binwidth = 1, colour="black", fill="blue") + ylab("Percentage")
p2 <- ggplot(diabetes, aes(x=Glucose)) + ggtitle("Glucose") +
  geom_histogram(aes(y = 100*(..count..)/sum(..count..)), binwidth = 5, colour="black", fill="orange") + ylab("Percentage")
p3 <- ggplot(diabetes, aes(x=BloodPressure)) + ggtitle("Blood Pressure") +
  geom_histogram(aes(y = 100*(..count..)/sum(..count..)), binwidth = 2, colour="black", fill="green") + ylab("Percentage")
p4 <- ggplot(diabetes, aes(x=SkinThickness)) + ggtitle("Skin Thickness") +
  geom_histogram(aes(y = 100*(..count..)/sum(..count..)), binwidth = 2, colour="black", fill="pink") + ylab("Percentage")
p5 <- ggplot(diabetes, aes(x=Insulin)) + ggtitle("Insulin") +
  geom_histogram(aes(y = 100*(..count..)/sum(..count..)), binwidth = 20, colour="black", fill="red") + ylab("Percentage")
p6 <- ggplot(diabetes, aes(x=BMI)) + ggtitle("Body Mass Index") +
  geom_histogram(aes(y = 100*(..count..)/sum(..count..)), binwidth = 1, colour="black", fill="yellow") + ylab("Percentage")
p7 <- ggplot(diabetes, aes(x=DiabetesPedigreeFunction)) + ggtitle("Diabetes Pedigree Function") +
  geom_histogram(aes(y = 100*(..count..)/sum(..count..)), colour="black", fill="purple") + ylab("Percentage")
p8 <- ggplot(diabetes, aes(x=Age)) + ggtitle("Age") +
  geom_histogram(aes(y = 100*(..count..)/sum(..count..)), binwidth=1, colour="black", fill="lightblue") + ylab("Percentage")
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, ncol=2)
grid.rect(width = 1, height = 1, gp = gpar(lwd = 1, col = "black", fill = NA))

All the variables seem to have reasonably broad distributions, so all of them will be kept for the regression analysis.


Correlation between Numeric Variables

Here, sapply() returns a logical vector indicating which columns of the diabetes dataset are numeric. cor() then produces the correlation matrix of those numeric columns, and corrplot() provides a visual representation of the correlation matrix, with automatic variable reordering that helps reveal hidden patterns among variables.

numeric.var <- sapply(diabetes, is.numeric)
corr.matrix <- cor(diabetes[,numeric.var])
corrplot(corr.matrix, main="\n\nCorrelation Plot for Numerical Variables", order = "hclust", tl.col = "black", tl.srt=45, tl.cex=0.8, cl.cex=0.8)
box(which = "outer", lty = "solid")

The numeric variables show only weak correlations with one another.
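
For a pairwise check, cor() can also be applied to two columns directly, or a single entry can be read from the matrix:

cor(diabetes$Glucose, diabetes$Outcome) # correlation of one predictor with the outcome
corr.matrix["Glucose", "Outcome"]       # same value, read from the correlation matrix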


Correlation between Numeric Variables and Outcomes

attach(diabetes)
par(mfrow=c(2,4))
boxplot(Pregnancies~Outcome, main="No. of Pregnancies vs. Diabetes", 
        xlab="Outcome", ylab="Pregnancies",col="red")
boxplot(Glucose~Outcome, main="Glucose vs. Diabetes", 
        xlab="Outcome", ylab="Glucose",col="pink")
boxplot(BloodPressure~Outcome, main="Blood Pressure vs. Diabetes", 
        xlab="Outcome", ylab="Blood Pressure",col="green")
boxplot(SkinThickness~Outcome, main="Skin Thickness vs. Diabetes", 
        xlab="Outcome", ylab="Skin Thickness",col="orange")
boxplot(Insulin~Outcome, main="Insulin vs. Diabetes", 
        xlab="Outcome", ylab="Insulin",col="yellow")
boxplot(BMI~Outcome, main="BMI vs. Diabetes", 
        xlab="Outcome", ylab="BMI",col="purple")
boxplot(DiabetesPedigreeFunction~Outcome, main="Diabetes Pedigree Function vs. Diabetes", xlab="Outcome", ylab="DiabetesPedigreeFunction",col="lightgreen")
boxplot(Age~Outcome, main="Age vs. Diabetes", 
        xlab="Outcome", ylab="Age",col="lightblue")
box(which = "outer", lty = "solid")

Blood pressure and skin thickness show little variation between outcomes, so they will be excluded from the model. The other variables show some association with diabetes and will be kept.


Logistic Regression

diabetes$BloodPressure <- NULL
diabetes$SkinThickness <- NULL
train <- diabetes[1:540,]
test <- diabetes[541:768,]
model <- glm(Outcome ~ ., family = binomial(link = 'logit'), data = train)
summary(model)
## 
## Call:
## glm(formula = Outcome ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4366  -0.7741  -0.4312   0.8021   2.7310  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.3461752  0.8157916 -10.231  < 2e-16 ***
## Pregnancies               0.1246856  0.0373214   3.341 0.000835 ***
## Glucose                   0.0315778  0.0042497   7.431 1.08e-13 ***
## Insulin                  -0.0013400  0.0009441  -1.419 0.155781    
## BMI                       0.0881521  0.0164090   5.372 7.78e-08 ***
## DiabetesPedigreeFunction  0.9642132  0.3430094   2.811 0.004938 ** 
## Age                       0.0018904  0.0107225   0.176 0.860053    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 700.47  on 539  degrees of freedom
## Residual deviance: 526.56  on 533  degrees of freedom
## AIC: 540.56
## 
## Number of Fisher Scoring iterations: 5

The three most relevant features are “Glucose”, “BMI” and “Pregnancies”, given their low p-values.

“Insulin” and “Age” do not appear to be statistically significant.

anova(model, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Outcome
## 
## Terms added sequentially (first to last)
## 
## 
##                          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                       539     700.47              
## Pregnancies               1   26.314       538     674.16 2.901e-07 ***
## Glucose                   1  102.960       537     571.20 < 2.2e-16 ***
## Insulin                   1    0.062       536     571.14  0.803341    
## BMI                       1   36.135       535     535.00 1.841e-09 ***
## DiabetesPedigreeFunction  1    8.414       534     526.59  0.003723 ** 
## Age                       1    0.031       533     526.56  0.860201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the table of deviance, we can see that adding Insulin and Age has little effect on the residual deviance.
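
Given this, a reduced model without Insulin and Age could be fitted and compared; a minimal sketch (model_reduced is hypothetical and not used in the pipeline below, which keeps the full model):

# Sketch: refit without the non-significant terms and compare AIC
model_reduced <- glm(Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction,
                     family = binomial(link = 'logit'), data = train)
AIC(model, model_reduced) # the lower AIC indicates the more parsimonious fit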


Validation on the Test Set

fitted.results <- predict(model,newdata=test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
(conf_matrix_logi<-table(fitted.results, test$Outcome))
##               
## fitted.results   0   1
##              0 136  34
##              1  14  44
misClassificationError <- mean(fitted.results != test$Outcome)
print(paste('Accuracy', 1 - misClassificationError))
## [1] "Accuracy 0.789473684210526"

Decision Tree

library(rpart)
model2 <- rpart(Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, data=train,method="class")
plot(model2, uniform=TRUE, 
    main="Classification Tree for Diabetes")
text(model2, use.n=TRUE, all=TRUE, cex=.8)
box(which = "outer", lty = "solid")

This means that if a person’s BMI is less than 45.4 and his/her Diabetes Pedigree Function is less than 0.8745, then the person is more likely to have diabetes.
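
The base plot can be hard to read; if the rpart.plot package is available (an assumption, it is not among the libraries loaded above), it draws a cleaner tree:

# Optional: nicer tree rendering (assumes rpart.plot is installed)
library(rpart.plot)
rpart.plot(model2)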


Confusion Table and Accuracy

treePred <- predict(model2, test, type = 'class')
(conf_matrix_dtree<-table(treePred, test$Outcome))
##         
## treePred   0   1
##        0 121  29
##        1  29  49
mean(treePred==test$Outcome)
## [1] 0.745614

Naive Bayes

# creating Naive Bayes model
model_naive <- naiveBayes(Outcome ~ ., data = train)
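
Besides class labels, predict() for a naiveBayes model can also return posterior class probabilities with type = "raw", for example:

# Posterior probabilities of each class for the first test observations
head(predict(model_naive, newdata = test[1:6], type = "raw"))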

Confusion Table and Accuracy

# keep only the six predictor columns (drop Outcome)
toppredict_set <- test[1:6]
dim(toppredict_set)
## [1] 228   6
preds_naive <- predict(model_naive, newdata = toppredict_set)
(conf_matrix_naive <- table(preds_naive, test$Outcome))
##            
## preds_naive   0   1
##           0 129  29
##           1  21  49
mean(preds_naive==test$Outcome)
## [1] 0.7807018

Conclusion


If we compare the accuracy and sensitivity of our models to find the highest values, we can summarise as follows:

confusionMatrix(conf_matrix_logi)
## Confusion Matrix and Statistics
## 
##               
## fitted.results   0   1
##              0 136  34
##              1  14  44
##                                           
##                Accuracy : 0.7895          
##                  95% CI : (0.7307, 0.8405)
##     No Information Rate : 0.6579          
##     P-Value [Acc > NIR] : 9.506e-06       
##                                           
##                   Kappa : 0.5016          
##                                           
##  Mcnemar's Test P-Value : 0.006099        
##                                           
##             Sensitivity : 0.9067          
##             Specificity : 0.5641          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.7586          
##              Prevalence : 0.6579          
##          Detection Rate : 0.5965          
##    Detection Prevalence : 0.7456          
##       Balanced Accuracy : 0.7354          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(conf_matrix_dtree)
## Confusion Matrix and Statistics
## 
##         
## treePred   0   1
##        0 121  29
##        1  29  49
##                                           
##                Accuracy : 0.7456          
##                  95% CI : (0.6839, 0.8008)
##     No Information Rate : 0.6579          
##     P-Value [Acc > NIR] : 0.002723        
##                                           
##                   Kappa : 0.4349          
##                                           
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.8067          
##             Specificity : 0.6282          
##          Pos Pred Value : 0.8067          
##          Neg Pred Value : 0.6282          
##              Prevalence : 0.6579          
##          Detection Rate : 0.5307          
##    Detection Prevalence : 0.6579          
##       Balanced Accuracy : 0.7174          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(conf_matrix_naive)
## Confusion Matrix and Statistics
## 
##            
## preds_naive   0   1
##           0 129  29
##           1  21  49
##                                           
##                Accuracy : 0.7807          
##                  95% CI : (0.7213, 0.8326)
##     No Information Rate : 0.6579          
##     P-Value [Acc > NIR] : 3.562e-05       
##                                           
##                   Kappa : 0.5005          
##                                           
##  Mcnemar's Test P-Value : 0.3222          
##                                           
##             Sensitivity : 0.8600          
##             Specificity : 0.6282          
##          Pos Pred Value : 0.8165          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.6579          
##          Detection Rate : 0.5658          
##    Detection Prevalence : 0.6930          
##       Balanced Accuracy : 0.7441          
##                                           
##        'Positive' Class : 0               
## 
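
The accuracies can also be collected into a single table for a side-by-side view (the accuracy helper is our own, defined here for convenience):

# Overall accuracy of each model, computed from its confusion matrix
accuracy <- function(cm) sum(diag(cm)) / sum(cm)
data.frame(Model    = c("Logistic Regression", "Decision Tree", "Naive Bayes"),
           Accuracy = c(accuracy(conf_matrix_logi),
                        accuracy(conf_matrix_dtree),
                        accuracy(conf_matrix_naive)))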

In this project, we compared the performance of the Logistic Regression, Decision Tree and Naive Bayes algorithms and found that Logistic Regression performed best on this standard, unaltered dataset. Naive Bayes came second, with higher accuracy than the Decision Tree. Logistic Regression achieved an accuracy of 79%, the Decision Tree 74% and Naive Bayes 78%.