Diabetes describes a group of metabolic diseases in which a person has high blood sugar because of problems producing or responding to insulin. Diabetes can affect anyone, regardless of age, race, gender, or lifestyle. For this assignment, I used the “Pima Indians Diabetes” data set from Kaggle. A population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to predict whether or not a patient has diabetes, based on the diagnostic measurements it contains. The data set consists of several medical predictor variables and one target variable, “Outcome”. The “Outcome” variable is binary: 0 indicates that the woman does not have diabetes and 1 indicates that diabetes is present. The predictor variables include the number of pregnancies the patient has had, her BMI, insulin level, age, and so on. I fit a logistic regression model to examine which factors are associated with diabetes, first using listwise deletion and then using multiple imputation, in order to compare how the two methods handle missing values.
This data set contains 768 observations and 9 variables. Missing values are recorded as 0 rather than NA.
Pregnancies: number of pregnancies.
Glucose: plasma glucose concentration in an oral glucose tolerance test
BloodPressure: diastolic blood pressure (mm Hg).
SkinThickness: triceps skin fold thickness (mm).
Insulin: 2-hour serum insulin (mu U/ml).
BMI: body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: diabetes pedigree function.
Age: age in years.
Outcome: 0 indicates diabetes not present and 1 indicates diabetes present.
library(readr)
diabetes<-read_csv("C:\\users\\Sangita Roy\\Desktop\\diabetes.csv")
head(diabetes)
summary(diabetes)
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Pregnancies | 0.000 | 1.000 | 3.000 | 3.845 | 6.000 | 17.000 |
| Glucose | 0.0 | 99.0 | 117.0 | 120.9 | 140.2 | 199.0 |
| BloodPressure | 0.00 | 62.00 | 72.00 | 69.11 | 80.00 | 122.00 |
| SkinThickness | 0.00 | 0.00 | 23.00 | 20.54 | 32.00 | 99.00 |
| Insulin | 0.0 | 0.0 | 30.5 | 79.8 | 127.2 | 846.0 |
| BMI | 0.00 | 27.30 | 32.00 | 31.99 | 36.60 | 67.10 |
| DiabetesPedigreeFunction | 0.0780 | 0.2437 | 0.3725 | 0.4719 | 0.6262 | 2.4200 |
| Age | 21.00 | 24.00 | 29.00 | 33.24 | 41.00 | 81.00 |
| Outcome | 0.000 | 0.000 | 0.000 | 0.349 | 1.000 | 1.000 |
library(dplyr)
# Recode 0 as NA for variables in which a value of 0 is not physically possible
diabetes2 <- mutate(diabetes,
  Age = ifelse(Age == 0, NA, Age),
  BMI = ifelse(BMI == 0, NA, BMI),
  BloodPressure = ifelse(BloodPressure == 0, NA, BloodPressure),
  SkinThickness = ifelse(SkinThickness == 0, NA, SkinThickness),
  Glucose = ifelse(Glucose == 0, NA, Glucose))
head(diabetes2)
In this data set, a value of 0 is used to record missing values. I recoded these zeros as NA so that missing values can be distinguished from genuine zeros, such as a woman reporting no pregnancies. Age, BMI, BloodPressure, SkinThickness, and Glucose cannot truly be 0, so any 0 in these variables must be a missing value. In the new data set, missing values appear as NA.
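The recoding step can be sketched on a small made-up data frame. The helper name `zero_to_na` and the toy values are assumptions for illustration; only the column names come from the data set. The sketch uses base R and counts the recoded values per column, which shows at a glance which variables account for most of the missingness.

```r
# Toy data: made-up values, column names borrowed from the diabetes data
toy <- data.frame(
  Pregnancies   = c(0, 2, 5, 1),
  Glucose       = c(148, 0, 183, 89),
  BMI           = c(33.6, 26.6, 0, 28.1),
  SkinThickness = c(35, 0, 0, 23)
)

# Hypothetical helper: recode 0 as NA, but only in the named columns
zero_to_na <- function(df, cols) {
  for (col in cols) {
    zero_rows <- which(df[[col]] == 0)  # which() skips any existing NA
    df[[col]][zero_rows] <- NA
  }
  df
}

# Pregnancies is left out on purpose: 0 is a valid count there
toy_na <- zero_to_na(toy, c("Glucose", "BMI", "SkinThickness"))

# Number of missing values per column after recoding
na_counts <- colSums(is.na(toy_na))
print(na_counts)
```

Running the same count on the real data before deleting or imputing anything would show how much each variable contributes to the incomplete rows.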
dim(diabetes2)
[1] 768 9
diabetes3<-na.omit(diabetes2)
dim(diabetes3)
[1] 532 9
Every row that contains at least one NA has been removed (listwise deletion), so the new data set contains 532 complete observations.
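One point worth checking with listwise deletion is which columns decide whether a row is dropped. `na.omit()` on a whole data frame removes a row if any column is missing, including columns that never enter the model. The sketch below uses made-up values and an assumed set of model variables to contrast that with deletion restricted to the model's own variables.

```r
# Toy data: made-up values, column names borrowed from the diabetes data
toy <- data.frame(
  Outcome       = c(1, 0, 1, 0, 1),
  Glucose       = c(148, NA, 183, 89, 137),
  BMI           = c(33.6, 26.6, NA, 28.1, 43.1),
  SkinThickness = c(35, 29, NA, NA, 35)
)

# Listwise deletion over every column, as na.omit() does on a data frame
all_cols <- na.omit(toy)

# Listwise deletion restricted to the variables a model would use
# (assumed here: Outcome, Glucose, BMI)
model_vars <- c("Outcome", "Glucose", "BMI")
model_only <- toy[complete.cases(toy[, model_vars]), ]

n_all   <- nrow(all_cols)    # rows complete on all four columns
n_model <- nrow(model_only)  # rows complete on the three model columns
cat("complete on all columns:  ", n_all, "\n")
cat("complete on model columns:", n_model, "\n")
```

In the toy data, the row missing only SkinThickness survives the restricted deletion. Whether the same choice would change the 532 figure here depends on how the missing values are distributed across columns.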
library(Zelig)
library(texreg)
z_dia <- zlogit$new()
z_dia$zelig(as.factor(Outcome) ~ Pregnancies + Glucose + BMI + Age, data = diabetes3)
summary(z_dia)
Model:
Call: z_dia$zelig(formula = as.factor(Outcome) ~ Pregnancies + Glucose + BMI + Age, data = diabetes3)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2025 -0.6618 -0.3879 0.6701 2.4160
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.338825 0.871605 -10.715 < 2e-16
Pregnancies 0.113733 0.042581 2.671 0.00756
Glucose 0.035152 0.004098 8.579 < 2e-16
BMI 0.087309 0.017840 4.894 9.88e-07
Age 0.026389 0.013113 2.012 0.04417
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 676.79 on 531 degrees of freedom
Residual deviance: 481.12 on 527 degrees of freedom
AIC: 491.12
Number of Fisher Scoring iterations: 5
Next step: Use ‘setx’ method
htmlreg(z_dia)
| | Model 1 |
|---|---|
| (Intercept) | -9.34*** |
| | (0.87) |
| Pregnancies | 0.11** |
| | (0.04) |
| Glucose | 0.04*** |
| | (0.00) |
| BMI | 0.09*** |
| | (0.02) |
| Age | 0.03* |
| | (0.01) |
| AIC | 491.12 |
| BIC | 512.50 |
| Log Likelihood | -240.56 |
| Deviance | 481.12 |
| Num. obs. | 532 |
| *** p < 0.001, ** p < 0.01, * p < 0.05 | |
This is the best-fitting model. It includes Pregnancies, Glucose, BMI, and Age and shows the relationship between these variables and the presence of diabetes in women. I started with a simpler model containing Pregnancies, Glucose, and BMI and then gradually added more variables. I selected this model because it has the lowest AIC and BIC values and all of its coefficients are significant. The intercept is the log odds of diabetes for a hypothetical woman with no pregnancies, a glucose level of 0, a BMI of 0, and an age of 0. Such a woman cannot exist, since glucose, BMI, and age cannot be 0, so the intercept of -9.34 has no substantive interpretation. Holding the other variables constant, each additional pregnancy increases the log odds of having diabetes by 0.11. Each one-unit increase in glucose increases the log odds by 0.04, and each one-unit increase in BMI increases them by 0.09. Lastly, each additional year of age increases the log odds by 0.03. All coefficients are significant (p < .05).
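Log odds are hard to read directly, so it can help to convert the coefficients to odds ratios and to a predicted probability. The sketch below copies the estimates from the summary output above. The profile of a woman with 2 pregnancies, glucose of 120, BMI of 32, and age 35 is a made-up example, not a row from the data.

```r
# Coefficients copied from the listwise-deletion summary output
b <- c(
  "(Intercept)" = -9.338825,
  Pregnancies   =  0.113733,
  Glucose       =  0.035152,
  BMI           =  0.087309,
  Age           =  0.026389
)

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(b[-1])
print(round(odds_ratios, 3))

# Hypothetical profile (made-up values); the leading 1 multiplies the intercept
profile <- c("(Intercept)" = 1, Pregnancies = 2, Glucose = 120, BMI = 32, Age = 35)

# Linear predictor (log odds) and the implied probability of diabetes
log_odds <- sum(b * profile)
prob <- plogis(log_odds)
cat("log odds:", round(log_odds, 3), " probability:", round(prob, 3), "\n")
```

An odds ratio above 1 means the odds of diabetes rise as the predictor increases, which is the case for all four predictors here.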
library(Amelia)
# diabetes2 is already in the workspace, so no data() call is needed
a.out <- amelia(diabetes2, m = 20)  # create 20 imputed data sets
-- Imputation 1 --
1 2 3 4 5
-- Imputation 2 --
1 2 3 4 5
-- Imputation 3 --
1 2 3 4 5
-- Imputation 4 --
1 2 3 4 5 6
-- Imputation 5 --
1 2 3 4 5
-- Imputation 6 --
1 2 3 4 5 6
-- Imputation 7 --
1 2 3 4
-- Imputation 8 --
1 2 3 4 5 6
-- Imputation 9 --
1 2 3 4 5 6 7
-- Imputation 10 --
1 2 3 4 5 6 7 8 9 10
-- Imputation 11 --
1 2 3 4 5 6
-- Imputation 12 --
1 2 3 4 5 6
-- Imputation 13 --
1 2 3 4 5 6 7
-- Imputation 14 --
1 2 3 4 5 6
-- Imputation 15 --
1 2 3 4 5
-- Imputation 16 --
1 2 3 4 5
-- Imputation 17 --
1 2 3 4 5 6 7
-- Imputation 18 --
1 2 3 4
-- Imputation 19 --
1 2 3 4 5
-- Imputation 20 --
1 2 3 4 5
z.out <- zelig(as.factor(Outcome) ~ Pregnancies + Glucose + BMI + Age, model = "logit", data = a.out, cite = FALSE)
summary(z.out)
Model: Combined Imputations
Estimate Std.Error z value Pr(>|z|)
(Intercept) -9.13156 0.71560 -12.76 < 2e-16
Pregnancies 0.11614 0.03177 3.66 0.00026
Glucose 0.03630 0.00352 10.31 < 2e-16
BMI 0.09334 0.01466 6.37 1.9e-10
Age 0.01146 0.00913 1.25 0.20951
For results from individual imputed datasets, use summary(x, subset = i:j)
Next step: Use 'setx' method
Multiple imputation estimates the missing values instead of deleting incomplete rows from the data set, as listwise deletion does. Amelia produced 20 imputed data sets, each with its own predicted values for the missing entries. The table above shows the coefficients combined across the 20 imputations. The coefficients do not differ much between listwise deletion and multiple imputation. However, Age is no longer significant in the combined imputation model (p = 0.21). The intercept of -9.13 is the log odds of diabetes for a hypothetical woman with no pregnancies, a glucose level of 0, a BMI of 0, and an age of 0. Holding the other variables constant, each additional pregnancy increases the log odds of having diabetes by 0.12. Each one-unit increase in glucose increases the log odds by 0.04, and each one-unit increase in BMI increases them by 0.09. Lastly, each additional year of age increases the log odds by 0.01. The largest differences between the two methods are in the intercept and Age, both of which are smaller in magnitude than under listwise deletion (-9.13 versus -9.34, and 0.01 versus 0.03).
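Combined estimates of this kind are usually obtained with Rubin's rules. I assume that is what the combined summary does here, but the output does not say so. The sketch below applies the rules to the Age coefficient using only imputed data sets 1 and 2, whose individual summaries are printed in this report. With two of the 20 imputations it will not reproduce the combined table exactly, and `pool_rubin` is a hypothetical helper name.

```r
# Age estimates and standard errors copied from imputed data sets 1 and 2
est <- c(0.011678, 0.011030)
se  <- c(0.009121, 0.009112)

# Hypothetical helper implementing Rubin's rules for a single coefficient
pool_rubin <- function(est, se) {
  m       <- length(est)
  q_bar   <- mean(est)   # pooled point estimate
  within  <- mean(se^2)  # average within-imputation variance
  between <- var(est)    # between-imputation variance
  total   <- within + (1 + 1 / m) * between
  c(estimate = q_bar, std_error = sqrt(total))
}

pooled <- pool_rubin(est, se)
print(signif(pooled, 4))
```

The pooled standard error is never smaller than the average within-imputation one, because the between-imputation term adds the uncertainty that comes from not knowing the missing values.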
summary(z.out, subset = 1)
Imputed Dataset 1
Call:
z5$zelig(formula = as.factor(Outcome) ~ Pregnancies + Glucose +
BMI + Age, data = a.out)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2892 -0.7099 -0.4027 0.7285 2.3801
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.142608 0.715957 -12.770 < 2e-16
Pregnancies 0.116164 0.031782 3.655 0.000257
Glucose 0.036571 0.003515 10.403 < 2e-16
BMI 0.092460 0.014616 6.326 2.52e-10
Age 0.011678 0.009121 1.280 0.200459
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 993.48 on 767 degrees of freedom
Residual deviance: 721.81 on 763 degrees of freedom
AIC: 731.81
Number of Fisher Scoring iterations: 5
Next step: Use 'setx' method
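This is the model fit to the first imputed data set, in which the missing values have been filled in with predicted values. The AIC rose from 491 under listwise deletion to 732 for the first imputation. The two values are not directly comparable, however, because the models are fit to different numbers of observations (532 versus 768).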
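The AIC and BIC values reported above can be checked by hand. For a logistic regression on ungrouped binary outcomes, the residual deviance equals -2 times the log likelihood, so AIC is the deviance plus 2k and BIC is the deviance plus k log(n). The sketch below uses only numbers copied from the outputs in this report. The deviance-per-observation figures at the end are a rough descriptive scaling that I added, not a formal model comparison.

```r
k <- 5  # estimated parameters: intercept plus four slopes

# Listwise-deletion fit (values copied from the output above)
dev_lw <- 481.12
n_lw   <- 532
aic_lw <- dev_lw + 2 * k
bic_lw <- dev_lw + k * log(n_lw)

# First imputed data set (values copied from the output above)
dev_mi <- 721.81
n_mi   <- 768
aic_mi <- dev_mi + 2 * k

cat("listwise AIC:", aic_lw, " BIC:", round(bic_lw, 2), "\n")
cat("imputation 1 AIC:", aic_mi, "\n")

# Deviance per observation: a rough way to see past the sample-size difference
per_obs <- c(listwise = dev_lw / n_lw, imputed = dev_mi / n_mi)
print(round(per_obs, 3))
```

Most of the jump in AIC comes from the imputed fit summing the deviance over 236 additional observations, not from a worse fit per observation.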
summary(z.out, subset = 2)
Imputed Dataset 2
Call:
z5$zelig(formula = as.factor(Outcome) ~ Pregnancies + Glucose +
BMI + Age, data = a.out)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2839 -0.7087 -0.4038 0.7298 2.3755
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.143771 0.713697 -12.812 < 2e-16
Pregnancies 0.117480 0.031812 3.693 0.000222
Glucose 0.036265 0.003509 10.335 < 2e-16
BMI 0.094059 0.014603 6.441 1.19e-10
Age 0.011030 0.009112 1.211 0.226066
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 993.48 on 767 degrees of freedom
Residual deviance: 722.31 on 763 degrees of freedom
AIC: 732.31
Number of Fisher Scoring iterations: 5
Next step: Use 'setx' method
This is the model fit to the second imputed data set. Its coefficients are close to those from the first imputation.
z.out$setx()
z.out$sim()
plot(z.out)
The coefficients did not differ much between listwise deletion and multiple imputation. However, Age was no longer significant in the combined imputation model. I found this interesting because I initially expected the log odds of having diabetes to increase significantly with age. Multiple imputation is a better approach to handling missing values here because it predicts the missing values instead of discarding incomplete observations. Although the true missing values will never be known, multiple imputation produces a set of equally plausible guesses, some of which may be closer to the true values than others. The results from the different imputed data sets are not identical, but they are similar.