The data was collected and made available by the National Institute of Diabetes and Digestive and Kidney Diseases as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.
The task is to investigate the factors associated with diabetes (Outcome: 1 = presence of diabetes, 0 = absence of diabetes).
The following hypotheses were formulated in order to achieve the above objective:
* H0: There is no association between having diabetes and risk factors like pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function and age.
* H1: There is an association between having diabetes and those risk factors.
The data comprises 8 predictors, as described above, and a response variable indicating whether or not someone is diabetic. The data was extracted from Kaggle: https://www.kaggle.com/cjboat/diabetes2
Logistic regression is the method used because the outcome variable is categorical. The data has 8 independent variables, all of which are numeric. Binary logistic regression was used since the outcome is binary: yes/no to whether someone has diabetes.
#LOAD DATA
MyData <- read.csv("C:\\Users\\Lusui\\OneDrive - CM Advocates, LLP\\Documents\\R\\diabetes2.csv")
#Clean data
#Check missing values
any(is.na(MyData))
## [1] FALSE
#Check that the data frame itself is not NULL (is.null() does not test individual values)
any(is.null(MyData))
## [1] FALSE
#Both results are FALSE: there are no NA values and the data frame is not NULL
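One caveat on the check above: is.na() only detects values that are coded as NA. A glucose level or BMI of 0 is not physiologically plausible, so zeros in such columns most likely stand in for missing measurements. A minimal sketch for counting zeros per column, using a small made-up data frame (the object toy and its values are hypothetical and not taken from diabetes2.csv):

```r
# Hypothetical toy data; the values are illustrative only.
toy <- data.frame(
  Glucose       = c(148, 0, 183, 89),
  BloodPressure = c(72, 66, 0, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1)
)

# is.na() finds nothing because the gaps are coded as 0, not NA.
any_na <- any(is.na(toy))

# Count the zeros in each column instead.
zero_counts <- colSums(toy == 0)

print(any_na)
print(zero_counts)
```

On the real data the equivalent call would be colSums(MyData == 0). Zeros in Pregnancies and Outcome are legitimate values and should not be treated as missing.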
Notes: I used histograms to visually represent any outliers. The outliers package is required for the outlier() function.
#Test outliers of each variable
#install.packages("outliers")
library(outliers)
outlier(MyData)
## Pregnancies Glucose BloodPressure
## 17.00 0.00 0.00
## SkinThickness Insulin BMI
## 99.00 846.00 67.10
## DiabetesPedigreeFunction Age Outcome
## 2.42 81.00 1.00
hist(MyData$Pregnancies, xlab= "Pregnancies")
hist(MyData$Glucose, xlab= "Glucose")
hist(MyData$BloodPressure, xlab= "BloodPressure")
hist(MyData$SkinThickness, xlab= "Skin Thickness")
hist(MyData$Insulin, xlab= "Insulin")
hist(MyData$BMI, xlab= "BMI")
hist(MyData$DiabetesPedigreeFunction, xlab= "Diabetes Pedigree Function")
hist(MyData$Age, xlab= "Age")
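Notes: Zero values in BloodPressure, Glucose, SkinThickness, Insulin and BMI are replaced with the median of the respective variable. This is to reduce the margin of error in our data. We then rerun outlier() to check the most extreme values that remain.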
MyData$BloodPressure[MyData$BloodPressure %in% 0]<-median(MyData$BloodPressure)
MyData$Glucose[MyData$Glucose %in% 0]<-median(MyData$Glucose)
MyData$SkinThickness[MyData$SkinThickness %in% 0]<-median(MyData$SkinThickness)
MyData$Insulin[MyData$Insulin %in% 0]<-median(MyData$Insulin)
MyData$BMI[MyData$BMI %in% 0]<-median(MyData$BMI)
summary(MyData)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:23.00
## Median : 3.000 Median :117.00 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :121.66 Mean : 72.39 Mean :27.33
## 3rd Qu.: 6.000 3rd Qu.:140.25 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.00 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 30.50 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median : 31.25 Median :32.00 Median :0.3725 Median :29.00
## Mean : 94.65 Mean :32.45 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.25 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
library(outliers)
outlier(MyData)
## Pregnancies Glucose BloodPressure
## 17.00 44.00 122.00
## SkinThickness Insulin BMI
## 99.00 846.00 67.10
## DiabetesPedigreeFunction Age Outcome
## 2.42 81.00 1.00
Notes: The any(is.na()) call was used to check for missing values in the data. The outlier() function from the outliers library was used to find, for each variable, the value farthest from the mean. The hist() function was used to visualize the distribution of the predictor variables. The last step of this process was replacing the zero values with the medians of their respective predictor variables.
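The replacement code above computes each median with the zeros still included, which pulls the replacement value downward when a variable has many zeros. A possible variant, offered only as a sketch and not as what this analysis used, takes the median of the non-zero values. The helper name replace_zero_with_median and the vector insulin_toy are hypothetical.

```r
# Hypothetical helper: replace zeros with the median of the non-zero values.
replace_zero_with_median <- function(x) {
  nonzero_median <- median(x[x != 0])
  x[x == 0] <- nonzero_median
  x
}

# Made-up values, not taken from diabetes2.csv.
insulin_toy <- c(0, 94, 168, 0, 88)

# Median with zeros included (the approach used above) versus excluded.
median_with_zeros    <- median(insulin_toy)
median_without_zeros <- median(insulin_toy[insulin_toy != 0])

cleaned <- replace_zero_with_median(insulin_toy)
print(cleaned)
```

Applying this variant to MyData would change the summary statistics and the model estimates reported in this document.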
Note: From the summary of the full model with all eight predictors, we can see that all independent variables are significant except BloodPressure, SkinThickness, Insulin and Age.
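The summary referred to here is not reproduced in this report. A minimal sketch of how a full model could be fitted and its p-values inspected is given below. It uses simulated data so that it runs on its own; the data frame sim, its columns x1 to x3, and the effect sizes are all hypothetical.

```r
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# Only x1 and x2 influence the simulated outcome; x3 is pure noise.
sim$y <- rbinom(n, size = 1,
                prob = plogis(-0.5 + 1.2 * sim$x1 + 0.8 * sim$x2))

# Full model with every predictor, analogous to Outcome ~ . on MyData.
fullmodel <- glm(y ~ ., family = "binomial", data = sim)

# Wald p-values for each term, taken from the coefficient table.
pvals <- summary(fullmodel)$coefficients[, "Pr(>|z|)"]
print(round(pvals, 4))

# Predictors (excluding the intercept) with p-values below 0.05.
significant <- names(pvals)[pvals < 0.05 & names(pvals) != "(Intercept)"]
print(significant)
```

On the real data the analogous call would be glm(Outcome ~ ., family = "binomial", data = MyData), followed by summary().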
Notes: We removed all insignificant variables and refitted the model.
finalmodel<-glm(Outcome~.-SkinThickness-Age-BloodPressure-Insulin,family="binomial", data= MyData)
summary(finalmodel)
##
## Call:
## glm(formula = Outcome ~ . - SkinThickness - Age - BloodPressure -
## Insulin, family = "binomial", data = MyData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8223 -0.7247 -0.3997 0.7253 2.4335
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.183754 0.705543 -13.017 < 2e-16 ***
## Pregnancies 0.143331 0.027545 5.204 1.95e-07 ***
## Glucose 0.036868 0.003487 10.572 < 2e-16 ***
## BMI 0.088757 0.014726 6.027 1.67e-09 ***
## DiabetesPedigreeFunction 0.881919 0.294717 2.992 0.00277 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 716.34 on 763 degrees of freedom
## AIC: 726.34
##
## Number of Fisher Scoring iterations: 5
#Extracting residuals
Residuals<-residuals(finalmodel)
#checking first assumption: normality of the residuals
shapiro.test(Residuals)
##
## Shapiro-Wilk normality test
##
## data: Residuals
## W = 0.93571, p-value < 2.2e-16
#checking 2nd assumption constant variance
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(finalmodel)
##
## studentized Breusch-Pagan test
##
## data: finalmodel
## BP = 58.028, df = 4, p-value = 7.529e-12
#checking third assumption: autocorrelation (dwtest() is provided by lmtest, loaded above)
library(car)
## Loading required package: carData
dwtest(finalmodel)
##
## Durbin-Watson test
##
## data: finalmodel
## DW = 1.9651, p-value = 0.3129
## alternative hypothesis: true autocorrelation is greater than 0
Notes: The first assumption, normality, was tested using the Shapiro-Wilk test on the residuals. The second assumption, constant variance, was tested using bptest(). The last assumption, no autocorrelation, was tested using dwtest() from the lmtest library.
#Multicollinearity check
library(car)
vif(finalmodel)
## Pregnancies Glucose BMI
## 1.025307 1.001716 1.019218
## DiabetesPedigreeFunction
## 1.010518
Our final formula is: Log(p/(1-p)) = -9.27321 + 0.15213(Pregnancies) + 0.03810(Glucose) + 0.08779(BMI) + 0.90231(DiabetesPedigreeFunction)
The predicted probability of diabetes with Pregnancies = 1 and all other predictors set to 0 is:
Log(p/(1-p)) = -9.27321 + 0.15213(1) + 0.03810(0) + 0.08779(0) + 0.90231(0)
Log(p/(1-p)) = -9.12108
p = exp(-9.12108) / (1 + exp(-9.12108)) = 0.000109
The predicted probability of diabetes with Glucose = 1 and all other predictors set to 0 is:
Log(p/(1-p)) = -9.27321 + 0.15213(0) + 0.03810(1) + 0.08779(0) + 0.90231(0)
Log(p/(1-p)) = -9.23511
p = exp(-9.23511) / (1 + exp(-9.23511)) = 0.0000975
The predicted probability of diabetes with BMI = 1 and all other predictors set to 0 is:
Log(p/(1-p)) = -9.27321 + 0.15213(0) + 0.03810(0) + 0.08779(1) + 0.90231(0)
Log(p/(1-p)) = -9.18542
p = exp(-9.18542) / (1 + exp(-9.18542)) = 0.000103
The predicted probability of diabetes with DiabetesPedigreeFunction = 1 and all other predictors set to 0 is:
Log(p/(1-p)) = -9.27321 + 0.15213(0) + 0.03810(0) + 0.08779(0) + 0.90231(1)
Log(p/(1-p)) = -8.37090
p = exp(-8.37090) / (1 + exp(-8.37090)) = 0.000231
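The hand calculations above can be reproduced in R with plogis(), which computes exp(x) / (1 + exp(x)). The sketch below uses the coefficients exactly as stated in the final formula. The helper name predict_prob is hypothetical, and the last call is only an illustration that plugs in the median values of the four predictors as reported by summary(MyData).

```r
# Coefficients as stated in the final formula above.
coefs <- c(
  Intercept                = -9.27321,
  Pregnancies              =  0.15213,
  Glucose                  =  0.03810,
  BMI                      =  0.08779,
  DiabetesPedigreeFunction =  0.90231
)

# Hypothetical helper: predicted probability for one set of predictor values.
predict_prob <- function(Pregnancies = 0, Glucose = 0, BMI = 0,
                         DiabetesPedigreeFunction = 0) {
  log_odds <- coefs[["Intercept"]] +
    coefs[["Pregnancies"]] * Pregnancies +
    coefs[["Glucose"]] * Glucose +
    coefs[["BMI"]] * BMI +
    coefs[["DiabetesPedigreeFunction"]] * DiabetesPedigreeFunction
  plogis(log_odds)  # same as exp(log_odds) / (1 + exp(log_odds))
}

# One predictor set to 1 and all others to 0, as in the hand calculations.
p_preg <- predict_prob(Pregnancies = 1)
p_gluc <- predict_prob(Glucose = 1)
p_bmi  <- predict_prob(BMI = 1)
p_dpf  <- predict_prob(DiabetesPedigreeFunction = 1)
print(signif(c(p_preg, p_gluc, p_bmi, p_dpf), 3))

# A more realistic profile: the medians reported by summary(MyData).
p_median <- predict_prob(Pregnancies = 3, Glucose = 117, BMI = 32,
                         DiabetesPedigreeFunction = 0.3725)
print(round(p_median, 3))
```

Setting a single predictor to 1 and the rest to 0 describes a patient with, for example, a glucose level and BMI of 0, which is why those probabilities are so small. Evaluating the formula at observed values gives a more interpretable figure.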
Checking the significance of the terms in our model, the p-values of Pregnancies, Glucose, BMI and DiabetesPedigreeFunction were less than the level of significance (0.05). This means that each of these predictors makes a statistically significant contribution to the model.
With an accuracy of 0.7756098, our model classified about 78% of patients correctly, whether as diabetic or as non-diabetic. This suggests that the model performs reasonably well.
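The code that produced this accuracy is not shown. One common way to obtain such a figure is to threshold the predicted probabilities and tabulate them against the observed outcome. The sketch below does this with made-up numbers; the vectors actual and predicted_prob and the 0.5 cutoff are assumptions for illustration only.

```r
# Made-up observed outcomes and predicted probabilities.
actual         <- c(1, 0, 0, 1, 1, 0, 0, 1, 0, 0)
predicted_prob <- c(0.81, 0.12, 0.47, 0.35, 0.66, 0.58, 0.09, 0.92, 0.21, 0.30)

# Classify as diabetic (1) when the predicted probability exceeds the cutoff.
predicted_class <- ifelse(predicted_prob > 0.5, 1, 0)

# Confusion matrix: rows are predictions, columns are observed outcomes.
confusion <- table(Predicted = predicted_class, Actual = actual)
print(confusion)

# Accuracy is the share of all patients classified correctly.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

With a fitted model, the probabilities would come from predict(finalmodel, newdata = ..., type = "response"), ideally on data that was held out from fitting.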
The assumptions of logistic regression were checked as follows:
* Little or no multicollinearity. For our model there is no multicollinearity between the independent variables, since all VIF values are close to 1.
* Autocorrelation. This assumption holds, as the p-value = 0.3747 is greater than the alpha level of 0.05, and D = 1.9746 is close to 2, which indicates no evidence of autocorrelation.
* Normality. Logistic regression does not require a linear relationship between the dependent and independent variables, and the residuals do not need to be normally distributed. With a p-value less than the alpha level, the Shapiro-Wilk test shows that the residuals are not normally distributed, but this does not violate an assumption of the model.