Pima Indians Diabetes Dataset

Background:

The data was collected and made available by the National Institute of Diabetes and Digestive and Kidney Diseases as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients are females of Pima Indian heritage (a subgroup of Native Americans) aged 21 and above.

Business task:

The task is to investigate the factors associated with diabetes (Outcome: 1 = presence of diabetes, 0 = absence of diabetes).

Hypothesis

The following hypotheses were formulated in order to achieve the above objective:

  • H0: There is no association between having diabetes and the risk factors pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age.
  • H1: There is an association between having diabetes and those risk factors.

Data

The data comprises the 8 predictors described above, with the response variable indicating whether or not someone is diabetic. The data was extracted from Kaggle: https://www.kaggle.com/cjboat/diabetes2

Model used

Logistic regression is the method used because the outcome variable is categorical. The data has 8 independent variables, each of which is numeric. Binary logistic regression was used, since the outcome is binary: yes/no as to whether someone has diabetes.
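In general, binary logistic regression models the log-odds of the outcome as a linear function of the predictors:

log(p/(1-p)) = b0 + b1(Pregnancies) + b2(Glucose) + ... + b8(Age)

where p is the probability that a patient is diabetic and b0, ..., b8 are coefficients estimated from the data.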

Load and clean the data

#LOAD DATA
MyData <- read.csv("C:\\Users\\Lusui\\OneDrive - CM Advocates, LLP\\Documents\\R\\diabetes2.csv")

#Clean data
#Check missing values
any(is.na(MyData))
## [1] FALSE
#Check for null values (note: is.null() on a whole data frame is always
#FALSE, so the is.na() check above is the meaningful one here)
any(is.null(MyData))
## [1] FALSE
#Both results are FALSE, meaning there are no missing values and no null values
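Although there are no NA values, zeros in several of these columns are physiologically implausible and effectively act as missing-value codes; this is what the cleaning step below addresses. A quick count of zeros per column, as a sketch:

#Count zero values per column (zeros act as missing-value codes here)
colSums(MyData[, c("Glucose", "BloodPressure", "SkinThickness",
                   "Insulin", "BMI")] == 0)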

Further cleaning of the data

Notes: Histograms are used to visually represent any outliers. The outliers package is required.

#Test outliers of each variable
#install.packages("outliers")
library(outliers)
outlier(MyData)
##              Pregnancies                  Glucose            BloodPressure 
##                    17.00                     0.00                     0.00 
##            SkinThickness                  Insulin                      BMI 
##                    99.00                   846.00                    67.10 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                     2.42                    81.00                     1.00
hist(MyData$Pregnancies, xlab= "Pregnancies")

hist(MyData$Glucose, xlab= "Glucose")

hist(MyData$BloodPressure, xlab= "BloodPressure")

hist(MyData$SkinThickness, xlab= "Skin Thickness")

hist(MyData$Insulin, xlab= "Insulin")

hist(MyData$BMI, xlab= "BMI")

hist(MyData$DiabetesPedigreeFunction, xlab= "Diabetes Pedigree Function")

hist(MyData$Age, xlab= "Age")
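The eight hist() calls above can also be written as a single loop; a compact sketch that draws all predictor histograms in one 2 x 4 grid:

#Draw all eight predictor histograms in one 2 x 4 grid
op <- par(mfrow = c(2, 4))
for (v in setdiff(names(MyData), "Outcome")) {
  hist(MyData[[v]], xlab = v, main = v)
}
par(op)  #restore the previous plotting layout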

Replacing the outliers with the median values

Notes: The implausible zero values flagged above are replaced with the median of each variable to reduce the margin of error in our data. We then re-run the outlier check to confirm the zeros are gone.

MyData$BloodPressure[MyData$BloodPressure == 0] <- median(MyData$BloodPressure)
MyData$Glucose[MyData$Glucose == 0] <- median(MyData$Glucose)
MyData$SkinThickness[MyData$SkinThickness == 0] <- median(MyData$SkinThickness)
MyData$Insulin[MyData$Insulin == 0] <- median(MyData$Insulin)
MyData$BMI[MyData$BMI == 0] <- median(MyData$BMI)
summary(MyData)
##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:23.00  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :121.66   Mean   : 72.39   Mean   :27.33  
##  3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.: 30.50   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 31.25   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 94.65   Mean   :32.45   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.25   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
library(outliers)
outlier(MyData)
##              Pregnancies                  Glucose            BloodPressure 
##                    17.00                    44.00                   122.00 
##            SkinThickness                  Insulin                      BMI 
##                    99.00                   846.00                    67.10 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                     2.42                    81.00                     1.00

Notes: The any(is.na()) function was used to check for missing values in the data. The outlier() function from the outliers package was used to find, for each variable, the value furthest from the mean. The hist() function was used to visualize the distribution of each predictor. The last step of this process was replacing the zero values with the medians of their respective predictor variables.
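One caveat with the replacement above: median() is computed over a column that still contains the zeros, which pulls the replacement value down (after replacement, the 1st quartile and median of SkinThickness coincide at 23 for this reason). A variant that computes the median over the non-zero entries only, as a sketch:

#Variant: replace zeros with the median of the non-zero values only,
#so the invalid zeros do not drag the replacement value down
for (v in c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")) {
  nonzero_median <- median(MyData[[v]][MyData[[v]] != 0])
  MyData[[v]][MyData[[v]] == 0] <- nonzero_median
}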

Fitting a model on training and testing sets for accuracy improvement

Note: From the summary of the full model we can see that all independent variables are significant except BloodPressure, SkinThickness, Insulin, and Age.
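The chunk that produces that summary is not shown in the source; a minimal sketch, assuming the full model was fitted on the cleaned MyData (as the final model below is):

#A sketch of the full-model fit whose summary is described above
#(the original chunk is not shown)
fullmodel <- glm(Outcome ~ ., family = "binomial", data = MyData)
summary(fullmodel)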

Final model

Notes: We removed all the insignificant variables (BloodPressure, SkinThickness, Insulin, and Age).

finalmodel <- glm(Outcome ~ . - SkinThickness - Age - BloodPressure - Insulin,
                  family = "binomial", data = MyData)
summary(finalmodel)
## 
## Call:
## glm(formula = Outcome ~ . - SkinThickness - Age - BloodPressure - 
##     Insulin, family = "binomial", data = MyData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8223  -0.7247  -0.3997   0.7253   2.4335  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.183754   0.705543 -13.017  < 2e-16 ***
## Pregnancies               0.143331   0.027545   5.204 1.95e-07 ***
## Glucose                   0.036868   0.003487  10.572  < 2e-16 ***
## BMI                       0.088757   0.014726   6.027 1.67e-09 ***
## DiabetesPedigreeFunction  0.881919   0.294717   2.992  0.00277 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 716.34  on 763  degrees of freedom
## AIC: 726.34
## 
## Number of Fisher Scoring iterations: 5

Checking the assumptions of normality, constant variance, and autocorrelation

#Extracting residuals
Residuals<-residuals(finalmodel)
#Checking the first assumption: normality of the residuals
shapiro.test(Residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  Residuals
## W = 0.93571, p-value < 2.2e-16
#Checking the second assumption: constant variance
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(finalmodel)
## 
##  studentized Breusch-Pagan test
## 
## data:  finalmodel
## BP = 58.028, df = 4, p-value = 7.529e-12
#Checking the third assumption: autocorrelation
library(car)  #dwtest() is from lmtest; car is loaded here for vif() used later
## Loading required package: carData
dwtest(finalmodel)
## 
##  Durbin-Watson test
## 
## data:  finalmodel
## DW = 1.9651, p-value = 0.3129
## alternative hypothesis: true autocorrelation is greater than 0

Notes: The first assumption, normality, was tested by applying the Shapiro-Wilk test to the residuals. The second assumption, constant variance, was tested using bptest(). The last assumption, autocorrelation, was tested using dwtest(); both bptest() and dwtest() come from the lmtest package.

Multicollinearity check

#Multicollinearity check
library(car)
vif(finalmodel)
##              Pregnancies                  Glucose                      BMI 
##                 1.025307                 1.001716                 1.019218 
## DiabetesPedigreeFunction 
##                 1.010518

Results and interpretation

From the coefficient estimates above, our final fitted equation is log(p/(1-p)) = -9.183754 + 0.143331(Pregnancies) + 0.036868(Glucose) + 0.088757(BMI) + 0.881919(DiabetesPedigreeFunction)

The predicted probability of diabetes for one pregnancy (all other predictors set to 0) is log(p/(1-p)) = -9.183754 + 0.143331(1) + 0.036868(0) + 0.088757(0) + 0.881919(0) = -9.040423, so p = exp(-9.040423) / (1 + exp(-9.040423)) = 0.000119

The predicted probability of diabetes for a glucose level of 1 is log(p/(1-p)) = -9.183754 + 0.143331(0) + 0.036868(1) + 0.088757(0) + 0.881919(0) = -9.146886, so p = exp(-9.146886) / (1 + exp(-9.146886)) = 0.000107

The predicted probability of diabetes for a BMI of 1 is log(p/(1-p)) = -9.183754 + 0.143331(0) + 0.036868(0) + 0.088757(1) + 0.881919(0) = -9.094997, so p = exp(-9.094997) / (1 + exp(-9.094997)) = 0.000112

The predicted probability of diabetes for a diabetes pedigree function of 1 is log(p/(1-p)) = -9.183754 + 0.143331(0) + 0.036868(0) + 0.088757(0) + 0.881919(1) = -8.301835, so p = exp(-8.301835) / (1 + exp(-8.301835)) = 0.000248

Note that these profiles (every other predictor at 0) are not clinically realistic; they serve only to illustrate the contribution of each coefficient.
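These worked examples can be reproduced directly with predict(); a sketch where each row sets one predictor to 1 and every other predictor to 0:

#Reproduce the worked examples with predict(); each row sets one
#predictor to 1 and the rest to 0
newpts <- data.frame(
  Pregnancies = c(1, 0, 0, 0), Glucose = c(0, 1, 0, 0),
  BloodPressure = 0, SkinThickness = 0, Insulin = 0,
  BMI = c(0, 0, 1, 0), DiabetesPedigreeFunction = c(0, 0, 0, 1), Age = 0
)
predict(finalmodel, newdata = newpts, type = "response")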

Conclusion

Checking the accuracy of our model, the p-values of Pregnancies, Glucose, BMI, and DiabetesPedigreeFunction were all less than the significance level (0.05), meaning these predictors make a statistically meaningful contribution to the model.

  • Using the coefficient estimates from our model, we can conclude that a female with more pregnancies, a higher glucose level, a higher BMI, or a higher diabetes pedigree function is more likely to have diabetes. For example, the coefficient estimate for Glucose (0.036868) shows that the probability of being diabetic, and the odds of having diabetes, increase as the glucose level increases. An increased BMI likewise indicates a higher risk of developing diabetes. This shows that these factors do affect diabetes; the odds ratios in the sketch below make the effect sizes easier to read.
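Exponentiating the coefficients gives odds ratios (e.g., each additional pregnancy multiplies the odds of diabetes by about exp(0.143331) = 1.15):

#Odds ratios for the retained predictors (exponentiated coefficients)
exp(coef(finalmodel))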

Our model's accuracy of 0.7756098 indicates that about 78% of the time it correctly classified whether or not a patient was at high risk of diabetes. This shows our model is good.
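The chunk that produced this accuracy is not shown in the source; a minimal sketch of how it could be computed, where the seed, the 70/30 train/test split, and the 0.5 classification cutoff are all assumptions:

#A sketch of an accuracy computation (the original split is not shown;
#the seed, 70/30 split, and 0.5 cutoff are assumptions)
set.seed(123)
train_idx <- sample(seq_len(nrow(MyData)), size = round(0.7 * nrow(MyData)))
train <- MyData[train_idx, ]
test <- MyData[-train_idx, ]
fit <- glm(Outcome ~ . - SkinThickness - Age - BloodPressure - Insulin,
           family = "binomial", data = train)
pred <- ifelse(predict(fit, newdata = test, type = "response") > 0.5, 1, 0)
mean(pred == test$Outcome)  #proportion of test patients classified correctly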

The assumptions of logistic regression hold:

  • Little or no multicollinearity: all the VIF values above are close to 1, so there is no multicollinearity between the independent variables.
  • No autocorrelation: this assumption holds, as the Durbin-Watson p-value (0.3129) is greater than the alpha level of 0.05, and DW = 1.9651 is close to 2, indicating no meaningful autocorrelation.
  • Normality: logistic regression does not require a linear relationship between the dependent and independent variables, and the residuals do not need to be normally distributed. The Shapiro-Wilk p-value is below the alpha level, so the residuals are indeed not normal, but this does not invalidate the model.
