Type 2 diabetes is a problem with the body that causes blood sugar levels to rise higher than normal (hyperglycemia) because the body does not use insulin properly. Specifically, the body cannot make enough insulin to keep blood sugar levels normal. Type 2 diabetes is associated with various health complications such as neuropathy (nerve damage), glaucoma, cataracts and various skin disorders. Early detection of diabetes is crucial to proper treatment so as to alleviate complications.
The data set contains information on 392 randomly selected women who are at risk for diabetes. The data set contains the following variables:
Variable
Description
pregnant
Number of times pregnant
glucose
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
diastolic
Diastolic blood pressure (mm Hg)
triceps
Triceps skin fold thickness (mm)
insulin
2 hour serum insulin (mu U/ml)
bmi
Body mass index (\(kg/m^2\), mass in kilograms divided by height in meters-squared)
pedigree
Numeric strength of diabetes in family line (higher numbers mean stronger history)
age
Age
diabetes
Does the patient have diabetes (0 if “No”, 1 if “Yes”)
The data can be found in the Diabetes data set on Canvas. Download Diabetes.txt, and put it in the same folder as this quarto file.
0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.
1. Read in the data set, call it “dia”, remove the “row” column, and change the class of any categorical variables to a factor. Print a summary of the data and make sure the data makes sense.
# your code heredia <-read.table("Diabetes.txt", header =TRUE)# |> #mutate(chd = as.factor(chd))#I dont think there are any categorical variables as seen by the str function:# Assuming 'df' is your data framedia$row <-NULLstr(dia)
2. Explore the data. Create a correlation matrix (or correlation plot) for the covariates. Comment on why or why not you think multicollinearity may be a problem for this data set.
After creating this matrix heat map, there seems to be moderate correlations between age and pregnancy, as well as triceps and bmi. There are others, but those are the ones that stick out the most.Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. So yes, I think that multicollinearity may be a problem we will want to work on, but it will be hard to tell without doing any actual analysis.
3. Explore the data. Create boxplots of the response (diabetes) against the following predictors: glucose, bmi, pedigree, and age (4 plots in total. You may want to use the grid.arrange function from the gridExtra package to display them in a 2x2 grid). Briefly comment on one interesting trend you observe.
# your code here#boxplot for glucoseggplot(data = dia) +geom_boxplot(mapping =aes(y = glucose, x = diabetes)) +theme(aspect.ratio =1) +coord_flip()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
# Boxplot for bmiggplot(data = dia) +geom_boxplot(mapping =aes(y = bmi, x = diabetes)) +theme(aspect.ratio =1) +coord_flip()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
# Boxplot for pedigreeggplot(data = dia) +geom_boxplot(mapping =aes(y = pedigree, x = diabetes)) +theme(aspect.ratio =1) +coord_flip()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
# Boxplot for ageggplot(data = dia) +geom_boxplot(mapping =aes(y = age, x = diabetes)) +theme(aspect.ratio =1) +coord_flip()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
We can see from all these charts and such, that many of our independent variables are skewed left. This is the trend I see in every graph except for glucose, which looks much more normal than the other graphs.
4. Explore the data. Create jittered scatterplots of the response against the following predictors: pregnant, diastolic, triceps, insulin (4 plots in total. You may want to use the grid.arrange function from the gridExtra package to display them in a 2x2 grid). Briefly comment on one interesting trend you observe.
# your code here# Jittered Scatterplot for pregnantggplot(data = dia) +geom_point(mapping =aes(y = diabetes, x = pregnant)) +geom_jitter(mapping =aes(y = diabetes, x = pregnant),height =0.1) +theme(aspect.ratio =1)
# Jittered Scatterplot for diastolicggplot(data = dia) +geom_point(mapping =aes(y = diabetes, x = diastolic)) +geom_jitter(mapping =aes(y = diabetes, x = diastolic),height =0.1) +theme(aspect.ratio =1)
# Jittered Scatterplot for tricepsggplot(data = dia) +geom_point(mapping =aes(y = diabetes, x = triceps)) +geom_jitter(mapping =aes(y = diabetes, x = triceps),height =0.1) +theme(aspect.ratio =1)
# Jittered Scatterplot for insulinggplot(data = dia) +geom_point(mapping =aes(y = diabetes, x = insulin)) +geom_jitter(mapping =aes(y = diabetes, x = insulin),height =0.1) +theme(aspect.ratio =1)
The interesting trend I find, is that I find no direct corellations between any of our independent variables to the dependent variable. All of these different things we are mesuring, look like they have spread on the top and bottom part of the scatterplots. This means that there we can’t just find a direct link between any of these and having diatetes.
5. Briefly explain why traditional multiple linear regression methods are not suitable for this data set. (your reasons should refer to this data set (i.e. be specific, not general))
For this dataset, we are trying to determine the binary response of if someone has diabetes or not. In contrast, traditional linear regression assumes that the response variable (of diabetes in our case) is continuous and normally distributed, which is not the case for this specific example. Thus, we will have to adjust our model to only guess if someone has diabetes or not, and not just a continuous normal distrobution.
6. Use a variable selection procedure to help you decide which, if any, variables to omit from the logistic regression model you will soon fit. You may choose which selection method to use (best subsets, backward, sequential replacement, LASSO, or elastic net) and which metric/criteria to use (AIC, BIC, or CV/PMSE). Briefly justify (in a few sentences) why you chose the method and metric that you did.
# your code here# Does there appear to be multicollinearity based on the correlation plot?corrplot(cor(dia[,-8]))
# For illustration, we will use best subsets (exhaustive) with the BIC metricdia_best_subsets_bic <-bestglm(as.data.frame(dia),IC ="BIC",method ="exhaustive",TopModels =1,family = binomial)
Morgan-Tatar search since family is non-gaussian.
summary(dia_best_subsets_bic$BestModel)
Call:
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.092018 1.080251 -9.342 < 2e-16 ***
glucose 0.036189 0.004982 7.264 3.76e-13 ***
bmi 0.074449 0.020267 3.673 0.000239 ***
pedigree 1.087129 0.419408 2.592 0.009541 **
age 0.053012 0.013439 3.945 8.00e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 498.10 on 391 degrees of freedom
Residual deviance: 347.23 on 387 degrees of freedom
AIC: 357.23
Number of Fisher Scoring iterations: 5
We do not have a ton of variables to choose from, so I will use best subsets to find the best option. I also decided to use BIC because it is known for being more aggressive at penalizing model complexity. I also used LASSO to pick the best lambda for our model.We can see what was fitted: glucose, bmi, pedigree, and age. These are the variables we will use.
7. Write out the logistic regression model for this data set using the covariates that you have chosen. You should use parameters/Greek letters (NOT the “fitted” model using numbers…since you have not fit a model yet).
8. Fit a logistic regression model using the covariates you chose. Print a summary of the results.
library(glmnet) # for ridge, lasso, and elastic net
Warning: package 'glmnet' was built under R version 4.3.3
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loaded glmnet 4.1-8
set.seed(12345) # make sure to set your seed when doing cross validation!env_x <-as.matrix(dia[, 1:8])env_y <- dia$diabetes# use cross validation to pick the "best" (based on MSE) lambdaenv_lasso_cv <-cv.glmnet(x = env_x,y = env_y,family ="binomial",type.measure ="mse",alpha =1) # 1 is code for "LASSO"coef(env_lasso_cv, s ="lambda.min")
9 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -8.751795919
pregnant 0.060772900
glucose 0.033415679
diastolic .
triceps 0.009007501
insulin .
bmi 0.055515581
pedigree 0.853890171
age 0.031082678
Call:
glm(formula = diabetes ~ glucose + bmi + pedigree + age, family = binomial(link = "logit"),
data = dia)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.092018 1.080251 -9.342 < 2e-16 ***
glucose 0.036189 0.004982 7.264 3.76e-13 ***
bmi 0.074449 0.020267 3.673 0.000239 ***
pedigree 1.087129 0.419408 2.592 0.009541 **
age 0.053012 0.013439 3.945 8.00e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 498.10 on 391 degrees of freedom
Residual deviance: 347.23 on 387 degrees of freedom
AIC: 357.23
Number of Fisher Scoring iterations: 5
Questions 9-12 involve using diagnostics to check the logistic regression model assumptions. For each assumption, (1) include the relevant code for the diagnostic(s), and (2) explain whether or not you think the assumption is violated and why you think that.
9. The X’s vs log odds are linear (monotone in probability) (Use scatterplots with smoothers)
# your code here# here is the plot for age - repeat for the other predictorsscatter.smooth(x = dia$glucose, y =as.numeric(dia$diabetes) -1)
scatter.smooth(x = dia$bmi, y =as.numeric(dia$diabetes) -1)
scatter.smooth(x = dia$pedigree, y =as.numeric(dia$diabetes) -1)
scatter.smooth(x = dia$age, y =as.numeric(dia$diabetes) -1)
This monotonicity is met excpet for in the age scatterplot. We can see that as age increases (on the far right of the graph), there is some downwardness. It is not terrible, but it is good to note that in later years, there is not an increase in probability of diabetes.
10. The observations are independent (no diagnostic tools or code needed - just think about how the data was collected and briefly write your thoughts)
We randomly sampled women for this dataset, so we can safely assume that these are in fact independent. Perhaps if this was a real scenerio we would really verify that these are random women. How were they randomly sampled from a greater population? For now, we just assume it was done correctly.
11. The model describes all observations (i.e., there are no influential points) (Use residuals vs. leverage plot)
# your code hereplot(dia_logistic, which =5, cook.levels = .5) # Residuals by leverage
There are some points with very high leverage, but none that are outside cook’s number. This means we are fine, for this assumption
12. No multicollinearity (Use variance inflation factors)
# your code here# this code uses the pseudo R-Squaredvif(dia_logistic)
glucose bmi pedigree age
1.022291 1.013365 1.005454 1.030187
# this code uses the real R-Squared (and is valid since we are looking at how# the predictor variables relate, and we are not using the response)dia_lm <-lm(diabetes ~ glucose + bmi + pedigree + age,data = dia)vif(dia_lm)
glucose bmi pedigree age
1.190042 1.065001 1.040289 1.135755
max(vif(dia_lm))# < 10
[1] 1.190042
mean(vif(dia_lm)) # < 5
[1] 1.107772
We can cheer because the VIF max is 1.19, and the mean is 1.1. Thus, we know that there are no multicollinearity issues here.
13. Briefly comment on if all assumptions are met. If there is anything you would like to do before proceeding to statistical inference, do that here.
# your code here, if neededpairs(dia[, c("age", "bmi", "pedigree", "glucose","diabetes")])
I just wanted to put the pairs graph here so that we could really assure that there is no problem with multicollinearity. I am satisfied that all of our assumptions are met. We can recognize the slight deformation of monotoniciity in the age scatterplot, but I think it can be ignored, as that it really is at the edge of our sample, with the oldes lady of the bunch. So, I am gonna pretend it is fine, and we can later throw out that datapoint if really effects our model.
14. For the coefficient for bmi, compute (and show) the estimated log odds ratio (\(\hat{\beta}_{bmi}\), pull this value from the model output), odds ratio (\(e^{\hat{\beta}_{bmi}}\)), and the odds ratio converted to a percentage (\(100 \times (e^{\hat{\beta}_{bmi}} - 1)\%\)). (If you cannot view the math used in this question (and subsequent), you can see it by rendering the document.)
# your code here# Pull the estimated coefficient for BMI from the model outputbeta_bmi <-coef(dia_lm)["bmi"]# Compute the estimated log odds ratio for BMIlog_odds_ratio_bmi <- beta_bmi# Compute the odds ratio for BMIodds_ratio_bmi <-exp(beta_bmi)# Compute the odds ratio converted to a percentageodds_ratio_percentage <-100* (exp(beta_bmi) -1)# Display the resultscat("Estimated Log Odds Ratio for BMI:", log_odds_ratio_bmi, "\n")
Estimated Log Odds Ratio for BMI: 0.01038215
cat("Odds Ratio for BMI:", odds_ratio_bmi, "\n")
Odds Ratio for BMI: 1.010436
cat("Odds Ratio for BMI (Percentage):", odds_ratio_percentage, "%\n")
Odds Ratio for BMI (Percentage): 1.043623 %
15. Interpret the coefficient for bmi based on the FOUR different ways we discussed in class.
Interpretation 1: The coefficient for “bmi” of 0.01038215 represents the estimated change in the log odds of diabetes associated with a one-unit increase in “bmi”, holding all other variables constant.
Interpretation 2: The odds ratio for “bmi” of 1.010436 indicates the ratio of the odds of diabetes for individuals with one unit increase in “bmi” to the odds of diabetes for individuals with no increase in “bmi”.
Interpretation 3: The odds ratio for “bmi”, when converted to a percentage of 1.043%, represents the percentage increase in the odds of diabetes for every one-unit increase in “bmi”.
Directionality Interpretation: The positive sign of the coefficient indicates the direction of the association between “bmi” and the log odds of diabetes. The coefficient is positive, which suggests that higher values of “bmi” are associated with higher log odds of diabetes.
16. Create (and output) 95% confidence intervals for \(\beta_k\), \(e^{\beta_k}\), and \(100 \times (e^{\beta_k} - 1)\%\) for all predictors using the confint function.
# your code heredia_conf_int_LOR <-confint(dia_logistic)[-1,] # omit intercept row
17. Interpret the 95% confidence interval for \(100 \times (e^{\beta_{bmi}} - 1)\%\).
Interpretation using \(100 \times (e^{\beta_{bmi}} - 1)\%\): We are 95% confident that the true percentage increase in the odds of diabetes per one-unit increase in BMI lies between approximately 3.63% and 12.23%.
18. Calculate a 95% confidence interval for the predicted probability that a patient has diabetes where pregnant = 1, glucose = 90, diastolic = 62, triceps = 18, insulin = 59, bmi = 25.1, pedigree = 1.268 and age = 25. Note that you may not need to use all of these values depending on the variables you chose to include in your model. Do you think this patient will develop diabetes? Why or why not?
# your code here# Patient characteristicspatient_data <-data.frame(glucose =90,bmi =25.1,pedigree =1.268,age =25)# Predict probability of diabetes for the patientpredicted_prob <-predict(dia_lm, newdata = patient_data, type ="response")# Get standard error of the predicted probabilityse_predicted_prob <-sqrt(predict(dia_lm, newdata = patient_data, type ="response", se.fit =TRUE)$se.fit^2)# Calculate z-score for 95% confidence intervalz <-qnorm(0.975)# Calculate margin of errormargin_of_error <- z * se_predicted_prob# Calculate confidence intervalconfidence_interval <-c(predicted_prob - margin_of_error, predicted_prob + margin_of_error)# Print confidence intervalcat("95% Confidence Interval for Predicted Probability of Diabetes:", confidence_interval, "\n")
95% Confidence Interval for Predicted Probability of Diabetes: -0.005437571 0.2263336
We are 95$ confident that if a new patient were to be sampled from the population with the aforementioned measurements for each independent variable, that the probability that they have diabetes is between 0 and .226.
19. Compute the likelihood ratio test statistic (aka deviance, aka model chi-squared test) for the model, and compute the associated \(p\)-value. Print out the test statistic and the \(p\)-value. Based on the results, what do you conclude?
# your code here# Likelihood ratio test statisticlike_ratio <- dia_logistic$null.deviance - dia_logistic$deviancecat("The likelyhood ration test statistic for the model is: ", like_ratio)
The likelyhood ration test statistic for the model is: 150.8628
cat(", \nand the p value is: ")
,
and the p value is:
# Likelihood ratio p-valuepchisq(q = like_ratio,df =length(coef(dia_logistic)) -1,lower.tail =FALSE)
[1] 1.329928e-31
The likelyhood ratio test statistic quantifies the difference in fit between the full logistic regression model (with predictors) and the null model. A higher likelihood ratio test statistic indicates a better fit of the full model compared to the null model. With a pvalue this small, we are quite sure that our model is putting in good work!
20. Compute (and output) the pseudo \(R^2\) value for the model.
# your code here1- dia_logistic$deviance/dia_logistic$null.deviance
[1] 0.3028779
21. What is the cutoff value for the model that minimizes the percent misclassified (or equivalently maximizes accuracy)? Show your code and output this cutoff value.
# your code here# get the predicted probabilities for all patients:dia_preds <-predict(dia_logistic,type ="response")# Look at the first few predictions:head(dia_preds)
# create a sequence from 0 to 1 to represent all possible cut-off values (c) # that we could choose:possible_cutoffs <-seq(0, 1, by = .01)# create an empty vector where we will store the percent misclassified for each# possible cut-off value we created:percent_missclass <-rep(NA, length(possible_cutoffs))# for each possible cut-off value, (1) grab the cut-off value, (2) for all 757# patients, store a 1 in "classify" if their predicted probability is larger # than the cut-off value, and (3) compute the average percent misclassified # across the 757 patients when using that cut-off by averaging the number of # times "classify" (0 or 1 based on how that cut-off classified a person) is # not the same as heart_binary (the truth):for(i in1:length(possible_cutoffs)){ classify <-ifelse(dia_preds > possible_cutoffs[i], 1, 0) percent_missclass[i] <-mean(classify != dia$diabetes)}# percent_misclass holds the average misclassification rates for each cut-offpercent_missclass
# put this information in a dataframe so we can plot it with ggplot:missclass_df <-as.data.frame(cbind(percent_missclass, possible_cutoffs))# plot the misclassification rate against the cut-off value:ggplot(data = missclass_df) +geom_line(aes(x = possible_cutoffs, y = percent_missclass))
# choose the "best" cut-off that minimizes the percent misclassified:cutoff_best <- possible_cutoffs[which.min(percent_missclass)]percent_missclass ==min(percent_missclass)
The true cuttoff value is .663, as shown by the code above
22. Create (and output) a confusion matrix using the cutoff value you found above.
# your code herepreds <- dia_preds > cutoff_best# create a confusion matrix with the truth and the predicted classification: conf_mat <-table("truth"= dia$diabetes, "predicted"= preds)conf_mat
predicted
truth FALSE TRUE
0 230 32
1 44 86
# note that we can also add column and row sums, which is useful for # calculations:addmargins(conf_mat)
predicted
truth FALSE TRUE Sum
0 230 32 262
1 44 86 130
Sum 274 118 392
23. Based on the confusion matrix, what is the value for the specificity, and what does the specificity measure? Print the specificity.
The exact specificy for our model is .878. Specificity is a measure of the proportion of actual negatives that are correctly identified as such by the classifier.
24. Based on the confusion matrix, what is the value for the sensitivity, and what does the sensitivity measure? Print the sensitivity.
Our model has a sensitivity of .6615. Sensitivity is a measure of the proportion of actual positives that are correctly identified as such by the classifier.
25. Based on the confusion matrix, what is the percent correctly classified (accuracy), and what does the percent correctly classified measure? Print the percent correctly classified.
# your code here# True positivestrue_positives <-86# True negativestrue_negatives <-230# Total observationstotal_observations <-392# Accuracy calculationaccuracy <- (true_positives + true_negatives) / total_observations# Output the accuracycat("Accuracy:", accuracy, "\n")
Accuracy: 0.8061224
The accuracy of our model is 0.806. This measures the proportion of all correctly classified instances.
26. Plot (and output) the ROC curve for the model (either using the pROC package or the ROCR package).
# your code here# METHOD 1: using the pROC packagemy_roc <-roc(dia$diabetes, dia_preds)
27. What is the AUC for the ROC curve plotted above? Print the value of the AUC.
# your code hereauc(my_roc) # AUC
Area under the curve: 0.8605
28. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.
What I really learned this time was how important it is to respect the multicollinearity problems. It was interesting to me that I was not to worried about our multicollinearity problems, but it was quite a big issue. I am grateful to be able to use best_sets searching algorithm to make sure to find an actually effective set of input variables. It was a lot of fun this time, and I am grateful to work on real-world prolbems like learning about diabetes.
29. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
For a non-statisticians, I would say that this data-analysis was all about learning about the probabilities of diabetes. We cannot just assume that becuase one single thing like bmi will be the direct thing that causes diabetes, or glucose, or insulin. Thus, we learned that we can accuratley predict who will have diabetes by 80%. This is very exciting, and very helpful for understanding if someone has diabetes, just based on daily life measurments.