Homework 7



Logistic Regression


Author


Spencer Hamilton

Data and Description

Type 2 diabetes is a problem with the body that causes blood sugar levels to rise higher than normal (hyperglycemia) because the body does not use insulin properly. Specifically, the body cannot make enough insulin to keep blood sugar levels normal. Type 2 diabetes is associated with various health complications such as neuropathy (nerve damage), glaucoma, cataracts and various skin disorders. Early detection of diabetes is crucial to proper treatment so as to alleviate complications.

The data set contains information on 392 randomly selected women who are at risk for diabetes. The data set contains the following variables:

Variable Description
pregnant Number of times pregnant
glucose Plasma glucose concentration at 2 hours in an oral glucose tolerance test
diastolic Diastolic blood pressure (mm Hg)
triceps Triceps skin fold thickness (mm)
insulin 2 hour serum insulin (mu U/ml)
bmi Body mass index (\(kg/m^2\), mass in kilograms divided by height in meters-squared)
pedigree Numeric strength of diabetes in family line (higher numbers mean stronger history)
age Age
diabetes Does the patient have diabetes (0 if “No”, 1 if “Yes”)

The data can be found in the Diabetes data set on Canvas. Download Diabetes.txt, and put it in the same folder as this quarto file.

0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.

1. Read in the data set, call it “dia”, remove the “row” column, and change the class of any categorical variables to a factor. Print a summary of the data and make sure the data makes sense.

# your code here
dia <- read.table("Diabetes.txt", header = TRUE)# |> 
  #mutate(chd = as.factor(chd))
#I dont think there are any categorical variables as seen by the str function:
# Assuming 'df' is your data frame
dia$row <- NULL

str(dia)
'data.frame':   392 obs. of  9 variables:
 $ pregnant : int  1 0 3 2 1 5 0 1 1 3 ...
 $ glucose  : int  89 137 78 197 189 166 118 103 115 126 ...
 $ diastolic: int  66 40 50 70 60 72 84 30 70 88 ...
 $ triceps  : int  23 35 32 45 23 19 47 38 30 41 ...
 $ insulin  : int  94 168 88 543 846 175 230 83 96 235 ...
 $ bmi      : num  28.1 43.1 31 30.5 30.1 25.8 45.8 43.3 34.6 39.3 ...
 $ pedigree : num  0.167 2.288 0.248 0.158 0.398 ...
 $ age      : int  21 33 26 53 59 51 31 33 32 27 ...
 $ diabetes : int  0 1 1 1 1 1 1 0 1 0 ...

2. Explore the data. Create a correlation matrix (or correlation plot) for the covariates. Comment on why or why not you think multicollinearity may be a problem for this data set.

# your code here
matrix <- cor(dia)

# Create correlation plot
corrplot(matrix, method = "color", 
         tl.col = "black", tl.srt = 45)

# Cross-Tabulation (Contingency Table) for age 
table(dia$age, dia$diabetes)
    
      0  1
  21 31  2
  22 38  5
  23 25  3
  24 25  6
  25 21  9
  26 19  5
  27 11  3
  28 14  7
  29  6  8
  30  7  3
  31  5  7
  32  3  3
  33  4  7
  34  6  2
  35  3  3
  36  2  5
  37  6  2
  38  1  2
  39  6  1
  40  3  3
  41  2  3
  42  4  3
  43  2  7
  44  0  1
  45  2  2
  46  2  4
  47  1  2
  48  2  1
  49  2  0
  50  1  1
  51  1  5
  52  0  2
  53  0  3
  54  0  2
  55  1  1
  56  0  1
  57  1  1
  58  1  3
  59  0  1
  60  1  1
  61  1  0
  63  1  0
  81  1  0
table(dia$age) # number of people in each age level

21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 
33 43 28 31 30 24 14 21 14 10 12  6 11  8  6  7  8  3  7  6  5  7  9  1  4  6 
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 63 81 
 3  3  2  2  6  2  3  2  2  1  2  4  1  2  1  1  1 

After creating this matrix heat map, there seems to be moderate correlations between age and pregnancy, as well as triceps and bmi. There are others, but those are the ones that stick out the most.Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. So yes, I think that multicollinearity may be a problem we will want to work on, but it will be hard to tell without doing any actual analysis.

3. Explore the data. Create boxplots of the response (diabetes) against the following predictors: glucose, bmi, pedigree, and age (4 plots in total. You may want to use the grid.arrange function from the gridExtra package to display them in a 2x2 grid). Briefly comment on one interesting trend you observe.

# your code here

#boxplot for glucose
ggplot(data = dia) +
  geom_boxplot(mapping = aes(y = glucose, x = diabetes)) +
  theme(aspect.ratio = 1) +
  coord_flip()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

# Boxplot for bmi
ggplot(data = dia) +
  geom_boxplot(mapping = aes(y = bmi, x = diabetes)) +
  theme(aspect.ratio = 1) +
  coord_flip()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

# Boxplot for pedigree
ggplot(data = dia) +
  geom_boxplot(mapping = aes(y = pedigree, x = diabetes)) +
  theme(aspect.ratio = 1) +
  coord_flip()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

# Boxplot for age
ggplot(data = dia) +
  geom_boxplot(mapping = aes(y = age, x = diabetes)) +
  theme(aspect.ratio = 1) +
  coord_flip()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

We can see from all these charts and such, that many of our independent variables are skewed left. This is the trend I see in every graph except for glucose, which looks much more normal than the other graphs.

4. Explore the data. Create jittered scatterplots of the response against the following predictors: pregnant, diastolic, triceps, insulin (4 plots in total. You may want to use the grid.arrange function from the gridExtra package to display them in a 2x2 grid). Briefly comment on one interesting trend you observe.

# your code here

# Jittered Scatterplot for pregnant
ggplot(data = dia) +
  geom_point(mapping = aes(y = diabetes, x = pregnant)) +
  geom_jitter(mapping = aes(y = diabetes, x = pregnant),
              height = 0.1) +
  theme(aspect.ratio = 1)

# Jittered Scatterplot for diastolic
ggplot(data = dia) +
  geom_point(mapping = aes(y = diabetes, x = diastolic)) +
  geom_jitter(mapping = aes(y = diabetes, x = diastolic),
              height = 0.1) +
  theme(aspect.ratio = 1)

# Jittered Scatterplot for triceps
ggplot(data = dia) +
  geom_point(mapping = aes(y = diabetes, x = triceps)) +
  geom_jitter(mapping = aes(y = diabetes, x = triceps),
              height = 0.1) +
  theme(aspect.ratio = 1)

# Jittered Scatterplot for insulin
ggplot(data = dia) +
  geom_point(mapping = aes(y = diabetes, x = insulin)) +
  geom_jitter(mapping = aes(y = diabetes, x = insulin),
              height = 0.1) +
  theme(aspect.ratio = 1)

The interesting trend I find, is that I find no direct corellations between any of our independent variables to the dependent variable. All of these different things we are mesuring, look like they have spread on the top and bottom part of the scatterplots. This means that there we can’t just find a direct link between any of these and having diatetes.

5. Briefly explain why traditional multiple linear regression methods are not suitable for this data set. (your reasons should refer to this data set (i.e. be specific, not general))

For this dataset, we are trying to determine the binary response of if someone has diabetes or not. In contrast, traditional linear regression assumes that the response variable (of diabetes in our case) is continuous and normally distributed, which is not the case for this specific example. Thus, we will have to adjust our model to only guess if someone has diabetes or not, and not just a continuous normal distrobution.

6. Use a variable selection procedure to help you decide which, if any, variables to omit from the logistic regression model you will soon fit. You may choose which selection method to use (best subsets, backward, sequential replacement, LASSO, or elastic net) and which metric/criteria to use (AIC, BIC, or CV/PMSE). Briefly justify (in a few sentences) why you chose the method and metric that you did.

# your code here

# Does there appear to be multicollinearity based on the correlation plot?
corrplot(cor(dia[,-8]))

# For illustration, we will use best subsets (exhaustive) with the BIC metric
dia_best_subsets_bic <- bestglm(as.data.frame(dia),
                                  IC = "BIC",
                                  method = "exhaustive",
                                  TopModels = 1,
                                  family = binomial)
Morgan-Tatar search since family is non-gaussian.
summary(dia_best_subsets_bic$BestModel)

Call:
glm(formula = y ~ ., family = family, data = Xi, weights = weights)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -10.092018   1.080251  -9.342  < 2e-16 ***
glucose       0.036189   0.004982   7.264 3.76e-13 ***
bmi           0.074449   0.020267   3.673 0.000239 ***
pedigree      1.087129   0.419408   2.592 0.009541 ** 
age           0.053012   0.013439   3.945 8.00e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 498.10  on 391  degrees of freedom
Residual deviance: 347.23  on 387  degrees of freedom
AIC: 357.23

Number of Fisher Scoring iterations: 5

We do not have a ton of variables to choose from, so I will use best subsets to find the best option. I also decided to use BIC because it is known for being more aggressive at penalizing model complexity. I also used LASSO to pick the best lambda for our model.We can see what was fitted: glucose, bmi, pedigree, and age. These are the variables we will use.

7. Write out the logistic regression model for this data set using the covariates that you have chosen. You should use parameters/Greek letters (NOT the “fitted” model using numbers…since you have not fit a model yet).

\(\text{diabetes}_i = \beta_1×\text{glucose}_i + \beta_2×\text{bmi}_i + \beta_3×\text{pedigree}_i +\) $ _4×_i + _i$

8. Fit a logistic regression model using the covariates you chose. Print a summary of the results.

library(glmnet)  # for ridge, lasso, and elastic net
Warning: package 'glmnet' was built under R version 4.3.3
Loading required package: Matrix

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loaded glmnet 4.1-8
set.seed(12345)  # make sure to set your seed when doing cross validation!

env_x <- as.matrix(dia[, 1:8])
env_y <- dia$diabetes
# use cross validation to pick the "best" (based on MSE) lambda
env_lasso_cv <- cv.glmnet(x = env_x,
                          y = env_y,
                          family = "binomial",
                          type.measure = "mse",
                          alpha = 1)  # 1 is code for "LASSO"
coef(env_lasso_cv, s = "lambda.min")
9 x 1 sparse Matrix of class "dgCMatrix"
                      s1
(Intercept) -8.751795919
pregnant     0.060772900
glucose      0.033415679
diastolic    .          
triceps      0.009007501
insulin      .          
bmi          0.055515581
pedigree     0.853890171
age          0.031082678
dia_logistic <- glm(diabetes ~ glucose + bmi + pedigree + age,
                      data = dia,
                      family = binomial(link = "logit"))
summary(dia_logistic)

Call:
glm(formula = diabetes ~ glucose + bmi + pedigree + age, family = binomial(link = "logit"), 
    data = dia)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -10.092018   1.080251  -9.342  < 2e-16 ***
glucose       0.036189   0.004982   7.264 3.76e-13 ***
bmi           0.074449   0.020267   3.673 0.000239 ***
pedigree      1.087129   0.419408   2.592 0.009541 ** 
age           0.053012   0.013439   3.945 8.00e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 498.10  on 391  degrees of freedom
Residual deviance: 347.23  on 387  degrees of freedom
AIC: 357.23

Number of Fisher Scoring iterations: 5

Questions 9-12 involve using diagnostics to check the logistic regression model assumptions. For each assumption, (1) include the relevant code for the diagnostic(s), and (2) explain whether or not you think the assumption is violated and why you think that.

9. The X’s vs log odds are linear (monotone in probability) (Use scatterplots with smoothers)

# your code here

# here is the plot for age - repeat for the other predictors
scatter.smooth(x = dia$glucose, y = as.numeric(dia$diabetes) - 1)

scatter.smooth(x = dia$bmi, y = as.numeric(dia$diabetes) - 1)

scatter.smooth(x = dia$pedigree, y = as.numeric(dia$diabetes) - 1)

scatter.smooth(x = dia$age, y = as.numeric(dia$diabetes) - 1)

This monotonicity is met excpet for in the age scatterplot. We can see that as age increases (on the far right of the graph), there is some downwardness. It is not terrible, but it is good to note that in later years, there is not an increase in probability of diabetes.

10. The observations are independent (no diagnostic tools or code needed - just think about how the data was collected and briefly write your thoughts)

We randomly sampled women for this dataset, so we can safely assume that these are in fact independent. Perhaps if this was a real scenerio we would really verify that these are random women. How were they randomly sampled from a greater population? For now, we just assume it was done correctly.

11. The model describes all observations (i.e., there are no influential points) (Use residuals vs. leverage plot)

# your code here

plot(dia_logistic, which = 5, cook.levels = .5) # Residuals by leverage

There are some points with very high leverage, but none that are outside cook’s number. This means we are fine, for this assumption

12. No multicollinearity (Use variance inflation factors)

# your code here
# this code uses the pseudo R-Squared
vif(dia_logistic)
 glucose      bmi pedigree      age 
1.022291 1.013365 1.005454 1.030187 
# this code uses the real R-Squared (and is valid since we are looking at how
# the predictor variables relate, and we are not using the response)
dia_lm <- lm(diabetes ~ glucose + bmi + pedigree + age,
               data = dia)
vif(dia_lm)
 glucose      bmi pedigree      age 
1.190042 1.065001 1.040289 1.135755 
max(vif(dia_lm))# < 10
[1] 1.190042
mean(vif(dia_lm)) # < 5
[1] 1.107772

We can cheer because the VIF max is 1.19, and the mean is 1.1. Thus, we know that there are no multicollinearity issues here.

13. Briefly comment on if all assumptions are met. If there is anything you would like to do before proceeding to statistical inference, do that here.

# your code here, if needed

pairs(dia[, c("age", "bmi", "pedigree", "glucose","diabetes")])

I just wanted to put the pairs graph here so that we could really assure that there is no problem with multicollinearity. I am satisfied that all of our assumptions are met. We can recognize the slight deformation of monotoniciity in the age scatterplot, but I think it can be ignored, as that it really is at the edge of our sample, with the oldes lady of the bunch. So, I am gonna pretend it is fine, and we can later throw out that datapoint if really effects our model.

14. For the coefficient for bmi, compute (and show) the estimated log odds ratio (\(\hat{\beta}_{bmi}\), pull this value from the model output), odds ratio (\(e^{\hat{\beta}_{bmi}}\)), and the odds ratio converted to a percentage (\(100 \times (e^{\hat{\beta}_{bmi}} - 1)\%\)). (If you cannot view the math used in this question (and subsequent), you can see it by rendering the document.)

# your code here

# Pull the estimated coefficient for BMI from the model output
beta_bmi <- coef(dia_lm)["bmi"]

# Compute the estimated log odds ratio for BMI
log_odds_ratio_bmi <- beta_bmi

# Compute the odds ratio for BMI
odds_ratio_bmi <- exp(beta_bmi)

# Compute the odds ratio converted to a percentage
odds_ratio_percentage <- 100 * (exp(beta_bmi) - 1)

# Display the results
cat("Estimated Log Odds Ratio for BMI:", log_odds_ratio_bmi, "\n")
Estimated Log Odds Ratio for BMI: 0.01038215 
cat("Odds Ratio for BMI:", odds_ratio_bmi, "\n")
Odds Ratio for BMI: 1.010436 
cat("Odds Ratio for BMI (Percentage):", odds_ratio_percentage, "%\n")
Odds Ratio for BMI (Percentage): 1.043623 %

15. Interpret the coefficient for bmi based on the FOUR different ways we discussed in class.

Interpretation 1: The coefficient for “bmi” of 0.01038215 represents the estimated change in the log odds of diabetes associated with a one-unit increase in “bmi”, holding all other variables constant.

Interpretation 2: The odds ratio for “bmi” of 1.010436 indicates the ratio of the odds of diabetes for individuals with one unit increase in “bmi” to the odds of diabetes for individuals with no increase in “bmi”.

Interpretation 3: The odds ratio for “bmi”, when converted to a percentage of 1.043%, represents the percentage increase in the odds of diabetes for every one-unit increase in “bmi”.

Directionality Interpretation: The positive sign of the coefficient indicates the direction of the association between “bmi” and the log odds of diabetes. The coefficient is positive, which suggests that higher values of “bmi” are associated with higher log odds of diabetes.

16. Create (and output) 95% confidence intervals for \(\beta_k\), \(e^{\beta_k}\), and \(100 \times (e^{\beta_k} - 1)\%\) for all predictors using the confint function.

# your code here

dia_conf_int_LOR <- confint(dia_logistic)[-1,] # omit intercept row
Waiting for profiling to be done...
# A
dia_conf_int_LOR
              2.5 %     97.5 %
glucose  0.02676551 0.04635419
bmi      0.03564664 0.11539762
pedigree 0.28298910 1.92733798
age      0.02715532 0.08002198
# B
exp(dia_conf_int_LOR)
            2.5 %   97.5 %
glucose  1.027127 1.047445
bmi      1.036290 1.122320
pedigree 1.327091 6.871195
age      1.027527 1.083311
# C
100 * (exp(dia_conf_int_LOR) - 1) # Most straigtforward
             2.5 %     97.5 %
glucose   2.712692   4.744534
bmi       3.628959  12.231960
pedigree 32.709070 587.119461
age       2.752739   8.331088

17. Interpret the 95% confidence interval for \(100 \times (e^{\beta_{bmi}} - 1)\%\).

Interpretation using \(100 \times (e^{\beta_{bmi}} - 1)\%\): We are 95% confident that the true percentage increase in the odds of diabetes per one-unit increase in BMI lies between approximately 3.63% and 12.23%.

18. Calculate a 95% confidence interval for the predicted probability that a patient has diabetes where pregnant = 1, glucose = 90, diastolic = 62, triceps = 18, insulin = 59, bmi = 25.1, pedigree = 1.268 and age = 25. Note that you may not need to use all of these values depending on the variables you chose to include in your model. Do you think this patient will develop diabetes? Why or why not?

# your code here
# Patient characteristics
patient_data <- data.frame(
  glucose = 90,
  bmi = 25.1,
  pedigree = 1.268,
  age = 25
)

# Predict probability of diabetes for the patient
predicted_prob <- predict(dia_lm, newdata = patient_data, type = "response")

# Get standard error of the predicted probability
se_predicted_prob <- sqrt(predict(dia_lm, newdata = patient_data, type = "response", se.fit = TRUE)$se.fit^2)

# Calculate z-score for 95% confidence interval
z <- qnorm(0.975)

# Calculate margin of error
margin_of_error <- z * se_predicted_prob

# Calculate confidence interval
confidence_interval <- c(predicted_prob - margin_of_error, predicted_prob + margin_of_error)

# Print confidence interval
cat("95% Confidence Interval for Predicted Probability of Diabetes:", confidence_interval, "\n")
95% Confidence Interval for Predicted Probability of Diabetes: -0.005437571 0.2263336 

We are 95$ confident that if a new patient were to be sampled from the population with the aforementioned measurements for each independent variable, that the probability that they have diabetes is between 0 and .226.

19. Compute the likelihood ratio test statistic (aka deviance, aka model chi-squared test) for the model, and compute the associated \(p\)-value. Print out the test statistic and the \(p\)-value. Based on the results, what do you conclude?

# your code here

# Likelihood ratio test statistic
like_ratio <- dia_logistic$null.deviance - dia_logistic$deviance

cat("The likelyhood ration test statistic for the model is: ", like_ratio)
The likelyhood ration test statistic for the model is:  150.8628
cat(",   \nand the p value is: ")
,   
and the p value is: 
# Likelihood ratio p-value
pchisq(q = like_ratio,
       df = length(coef(dia_logistic)) - 1,
       lower.tail = FALSE)
[1] 1.329928e-31

The likelyhood ratio test statistic quantifies the difference in fit between the full logistic regression model (with predictors) and the null model. A higher likelihood ratio test statistic indicates a better fit of the full model compared to the null model. With a pvalue this small, we are quite sure that our model is putting in good work!

20. Compute (and output) the pseudo \(R^2\) value for the model.

# your code here
1 - dia_logistic$deviance/dia_logistic$null.deviance
[1] 0.3028779

21. What is the cutoff value for the model that minimizes the percent misclassified (or equivalently maximizes accuracy)? Show your code and output this cutoff value.

# your code here
# get the predicted probabilities for all patients:
dia_preds <- predict(dia_logistic,
                       type = "response")

# Look at the first few predictions:

head(dia_preds)
         1          2          3          4          5          6 
0.02975966 0.90981302 0.03511263 0.90798689 0.92750569 0.76456752 
# create a sequence from 0 to 1 to represent all possible cut-off values (c) 
# that we could choose:

possible_cutoffs <- seq(0, 1, by = .01)



# create an empty vector where we will store the percent misclassified for each
# possible cut-off value we created:

percent_missclass <- rep(NA, length(possible_cutoffs))


# for each possible cut-off value, (1) grab the cut-off value, (2) for all 757
# patients, store a 1 in "classify" if their predicted probability is larger 
# than the cut-off value, and (3) compute the average percent misclassified 
# across the 757 patients when using that cut-off by averaging the number of 
# times "classify" (0 or 1 based on how that cut-off classified a person) is 
# not the same as heart_binary (the truth):

for(i in 1:length(possible_cutoffs)){
  classify <- ifelse(dia_preds > possible_cutoffs[i], 1, 0)
  percent_missclass[i] <- mean(classify != dia$diabetes)
}


# percent_misclass holds the average misclassification rates for each cut-off
percent_missclass
  [1] 0.6683673 0.6632653 0.6581633 0.6275510 0.5969388 0.5510204 0.5331633
  [8] 0.4923469 0.4795918 0.4591837 0.4387755 0.4132653 0.3826531 0.3724490
 [15] 0.3367347 0.3188776 0.3137755 0.3137755 0.3035714 0.2959184 0.2755102
 [22] 0.2653061 0.2576531 0.2525510 0.2448980 0.2372449 0.2295918 0.2219388
 [29] 0.2193878 0.2193878 0.2244898 0.2270408 0.2244898 0.2295918 0.2244898
 [36] 0.2142857 0.2168367 0.2193878 0.2168367 0.2040816 0.1989796 0.1989796
 [43] 0.1989796 0.1938776 0.1964286 0.1964286 0.1989796 0.1964286 0.1989796
 [50] 0.2015306 0.2040816 0.2168367 0.2168367 0.2193878 0.2168367 0.2193878
 [57] 0.2193878 0.2168367 0.2142857 0.2117347 0.2168367 0.2142857 0.2193878
 [64] 0.2270408 0.2270408 0.2244898 0.2193878 0.2244898 0.2244898 0.2321429
 [71] 0.2321429 0.2346939 0.2321429 0.2295918 0.2346939 0.2423469 0.2397959
 [78] 0.2474490 0.2525510 0.2576531 0.2678571 0.2678571 0.2678571 0.2755102
 [85] 0.2806122 0.2780612 0.2806122 0.2882653 0.2959184 0.2959184 0.3010204
 [92] 0.3112245 0.3137755 0.3214286 0.3239796 0.3265306 0.3290816 0.3316327
 [99] 0.3341837 0.3290816 0.3316327
# put this information in a dataframe so we can plot it with ggplot:
missclass_df <- as.data.frame(cbind(percent_missclass, possible_cutoffs))

# plot the misclassification rate against the cut-off value:
ggplot(data = missclass_df) +
  geom_line(aes(x = possible_cutoffs, y = percent_missclass))

# choose the "best" cut-off that minimizes the percent misclassified:

cutoff_best <- possible_cutoffs[which.min(percent_missclass)]

percent_missclass == min(percent_missclass) 
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE

The true cuttoff value is .663, as shown by the code above

22. Create (and output) a confusion matrix using the cutoff value you found above.

# your code here

preds <- dia_preds > cutoff_best

# create a confusion matrix with the truth and the predicted classification: 

conf_mat <- table("truth" = dia$diabetes, "predicted" = preds)
conf_mat
     predicted
truth FALSE TRUE
    0   230   32
    1    44   86
# note that we can also add column and row sums, which is useful for 
# calculations:

addmargins(conf_mat)
     predicted
truth FALSE TRUE Sum
  0     230   32 262
  1      44   86 130
  Sum   274  118 392

23. Based on the confusion matrix, what is the value for the specificity, and what does the specificity measure? Print the specificity.

# your code here
# True negatives
true_negatives <- 230

# False positives
false_positives <- 32

# Specificity calculation
specificity <- true_negatives / (true_negatives + false_positives)

# Output the specificity
cat("Specificity:", specificity, "\n")
Specificity: 0.8778626 

The exact specificy for our model is .878. Specificity is a measure of the proportion of actual negatives that are correctly identified as such by the classifier.

24. Based on the confusion matrix, what is the value for the sensitivity, and what does the sensitivity measure? Print the sensitivity.

# your code here

# True positives
true_positives <- 86

# False negatives
false_negatives <- 44

# Sensitivity calculation
sensitivity <- true_positives / (true_positives + false_negatives)

# Output the sensitivity
cat("Sensitivity:", sensitivity, "\n")
Sensitivity: 0.6615385 

Our model has a sensitivity of .6615. Sensitivity is a measure of the proportion of actual positives that are correctly identified as such by the classifier.

25. Based on the confusion matrix, what is the percent correctly classified (accuracy), and what does the percent correctly classified measure? Print the percent correctly classified.

# your code here
# True positives
true_positives <- 86

# True negatives
true_negatives <- 230

# Total observations
total_observations <- 392

# Accuracy calculation
accuracy <- (true_positives + true_negatives) / total_observations

# Output the accuracy
cat("Accuracy:", accuracy, "\n")
Accuracy: 0.8061224 

The accuracy of our model is 0.806. This measures the proportion of all correctly classified instances.

26. Plot (and output) the ROC curve for the model (either using the pROC package or the ROCR package).

# your code here

# METHOD 1: using the pROC package
my_roc <- roc(dia$diabetes, dia_preds)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
ggplot() +
  geom_path(mapping = aes(x = 1 - my_roc$specificities, 
                          y = my_roc$sensitivities), 
            linewidth = 2) +
  geom_abline(slope = 1, intercept = 0) +
  theme_bw() + 
  xlab("1 - Specificity (False Positive Rate)") +
  ylab("Sensitivity (True Positive Rate)") +
  theme(aspect.ratio = 1)

27. What is the AUC for the ROC curve plotted above? Print the value of the AUC.

# your code here
auc(my_roc)  # AUC
Area under the curve: 0.8605

28. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.

What I really learned this time was how important it is to respect the multicollinearity problems. It was interesting to me that I was not to worried about our multicollinearity problems, but it was quite a big issue. I am grateful to be able to use best_sets searching algorithm to make sure to find an actually effective set of input variables. It was a lot of fun this time, and I am grateful to work on real-world prolbems like learning about diabetes.

29. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.

For a non-statisticians, I would say that this data-analysis was all about learning about the probabilities of diabetes. We cannot just assume that becuase one single thing like bmi will be the direct thing that causes diabetes, or glucose, or insulin. Thus, we learned that we can accuratley predict who will have diabetes by 80%. This is very exciting, and very helpful for understanding if someone has diabetes, just based on daily life measurments.