In this homework, you will apply logistic regression to a real-world dataset: the Pima Indians Diabetes Database. This dataset contains medical records from 768 women of Pima Indian heritage, aged 21 or older, and is used to predict the onset of diabetes (binary outcome: 0 = no diabetes, 1 = diabetes) based on physiological measurements.
The data is publicly available from the UCI Machine Learning Repository and can be imported directly.
Dataset URL: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
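Columns (no header in the CSV, so we need to assign them manually): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome.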
Task Overview: You will load the data, build a logistic regression model to predict diabetes onset using a subset of predictors (Glucose, BMI, Age), interpret the model, evaluate it with a confusion matrix and metrics, and analyze the ROC curve and AUC.
Cleaning the dataset
Don’t change the following code
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
data <- read.csv(url, header = FALSE)
colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")
data$Outcome <- as.factor(data$Outcome)
# Handle missing values (replace 0s with NA because 0 makes no sense here)
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA
colSums(is.na(data))
## Pregnancies Glucose BloodPressure
## 0 5 35
## SkinThickness Insulin BMI
## 0 0 11
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
- Fit a logistic regression model to predict Outcome using Glucose, BMI, and Age.
Provide the model summary.
Calculate and interpret R²: 1 - (model$deviance / model$null.deviance). What does it indicate about the model’s explanatory power?
## Enter your code here
logistic <- glm(Outcome ~ Glucose + BMI + Age, data=data, family="binomial")
summary(logistic)
##
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.032377 0.711037 -12.703 < 2e-16 ***
## Glucose 0.035548 0.003481 10.212 < 2e-16 ***
## BMI 0.089753 0.014377 6.243 4.3e-10 ***
## Age 0.028699 0.007809 3.675 0.000238 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 974.75 on 751 degrees of freedom
## Residual deviance: 724.96 on 748 degrees of freedom
## (16 observations deleted due to missingness)
## AIC: 732.96
##
## Number of Fisher Scoring iterations: 4
Calculating R^2
r_square <- 1 - (logistic$deviance/logistic$null.deviance)
r_square
## [1] 0.25626
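This indicates that Glucose, BMI, and Age together explain about 25.6% of the null deviance. Because this is a deviance-based pseudo-R², it measures how much the model improves on an intercept-only model rather than a literal share of variance explained, so the explanatory power is moderate.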
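As a rough check of the calculation above, the pseudo-R² can be reproduced from the two deviances printed in the model summary. The drop in deviance can also be compared with a chi-square distribution (a likelihood-ratio test against the intercept-only model). This is a minimal sketch: the variable names are illustrative and the numbers are copied by hand from the summary output.

```r
# Deviances and degrees of freedom copied from the model summary
null_dev  <- 974.75   # null deviance (intercept-only model)
resid_dev <- 724.96   # residual deviance (Glucose + BMI + Age)
df_diff   <- 751 - 748

# Deviance-based pseudo-R^2
pseudo_r2 <- 1 - resid_dev / null_dev
round(pseudo_r2, 5)

# Likelihood-ratio test: is the drop in deviance larger than chance?
lr_stat <- null_dev - resid_dev
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
p_value
```

With the fitted object available, the same quantities come from `logistic$null.deviance`, `logistic$deviance`, `logistic$df.null`, and `logistic$df.residual`.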
What does the intercept represent (log-odds of diabetes when predictors are zero)?
For each predictor (Glucose, BMI, Age), does a one-unit increase raise or lower the odds of diabetes? Are they significant (p-value < 0.05)?
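Glucose -> a one-unit increase raises the odds of diabetes (coefficient 0.0355, odds ratio about 1.036). Yes, this variable is significant (p < 2e-16).
BMI -> a one-unit increase raises the odds of diabetes (coefficient 0.0898, odds ratio about 1.094). Yes, this variable is significant (p = 4.3e-10).
Age -> a one-unit increase raises the odds of diabetes (coefficient 0.0287, odds ratio about 1.029). Yes, this variable is significant (p = 0.000238).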
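The coefficients are on the log-odds scale, so exponentiating them gives the multiplicative change in the odds for a one-unit increase. The sketch below uses the estimates copied from the summary output and also works through one prediction by hand. The intercept is the log-odds when Glucose, BMI, and Age are all zero, which is not a realistic patient, so it mainly serves as a baseline. The names `coefs`, `odds_ratios`, `log_odds`, and `prob` are illustrative.

```r
# Coefficients copied from the model summary
coefs <- c("(Intercept)" = -9.032377, Glucose = 0.035548,
           BMI = 0.089753, Age = 0.028699)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs[-1])
round(odds_ratios, 3)

# Worked prediction for row 1 of the data (Glucose = 148, BMI = 33.6, Age = 50)
log_odds <- sum(coefs * c(1, 148, 33.6, 50))
prob <- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))
round(prob, 3)
```

The hand-computed probability should agree with the first fitted value reported for row 1 (about 0.664). On the fitted object, `exp(coef(logistic))` gives the same odds ratios.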
Predict probabilities using the fitted model.
Create predicted classes with a 0.5 threshold (1 if probability > 0.5, else 0).
Build a confusion matrix (Predicted vs. Actual Outcome).
Calculate and report the metrics:
Accuracy: (TP + TN) / Total
Sensitivity (Recall): TP / (TP + FN)
Specificity: TN / (TN + FP)
Precision: TP / (TP + FP)
Use the following starter code
# Keep only rows with no missing values in Glucose, BMI, or Age
data_subset <- data[complete.cases(data[, c("Glucose", "BMI", "Age")]), ]
# Create a numeric version of the outcome (0 = no diabetes, 1 = diabetes). This is required for calculating confusion matrices.
data_subset$Outcome_num <- ifelse(data_subset$Outcome == "1", 1, 0)
# Predicted probabilities
predicted.data <- data.frame(
probability.of.hd=logistic$fitted.values,
age=data_subset$Age, glucose = data_subset$Glucose, bmi = data_subset$BMI)
head(predicted.data)
## probability.of.hd age glucose bmi
## 1 0.66360006 50 148 33.6
## 2 0.06101402 31 85 26.6
## 3 0.61834186 32 183 23.3
## 4 0.06043396 21 89 28.1
## 5 0.65771328 33 137 43.1
## 6 0.14802668 30 116 25.6
#xtabs(~ probability.of.hd + age + glucose + bmi, data=predicted.data)
# NOTE: I tried running this code from the class notes and it was running for about 5 minutes with no output, so I stopped it. I hope that isn't a problem
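A likely reason the `xtabs` call never finished: `xtabs` allocates one cell for every combination of the distinct values of its variables, and `probability.of.hd` is continuous, so nearly every row is its own level before age, glucose, and bmi are even crossed in. One workaround is to bin the probabilities first and tabulate the bins against the outcome. The sketch below uses made-up probabilities and outcomes for illustration only, and the bin edges are an arbitrary choice.

```r
# Made-up predicted probabilities and outcomes, for illustration only
probs  <- c(0.05, 0.12, 0.31, 0.48, 0.52, 0.67, 0.74, 0.91)
actual <- factor(c(0, 0, 0, 1, 0, 1, 1, 1), levels = c(0, 1))

# Bin the continuous probabilities into four equal-width intervals
prob_bin <- cut(probs, breaks = c(0, 0.25, 0.5, 0.75, 1),
                include.lowest = TRUE)

# Cross-tabulate the bins against the actual outcome
binned <- table(ProbabilityBin = prob_bin, Actual = actual)
binned
```

The resulting table has only 4 x 2 cells, regardless of how many distinct probabilities the model produces.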
# Predicted classes
predicted.data <- data.frame(
probability.of.hd=logistic$fitted.values,
diabetes = data_subset$Outcome, age=data_subset$Age, glucose = data_subset$Glucose, bmi = data_subset$BMI)
head(predicted.data)
## probability.of.hd diabetes age glucose bmi
## 1 0.66360006 1 50 148 33.6
## 2 0.06101402 0 31 85 26.6
## 3 0.61834186 1 32 183 23.3
## 4 0.06043396 0 21 89 28.1
## 5 0.65771328 1 33 137 43.1
## 6 0.14802668 0 30 116 25.6
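Pairing `logistic$fitted.values` with `data_subset` works because `glm` drops rows with missing predictors by default, so the fitted values cover the same complete cases that `complete.cases` keeps. The sketch below checks that alignment on a small made-up data frame; `toy`, `fit`, and `toy_subset` are hypothetical names, not objects from the assignment.

```r
# Made-up data with two missing predictor values, for illustration only
toy <- data.frame(
  y = factor(c(0, 0, 1, 0, 1, 1, 0, 1)),
  x = c(1.0, 2.0, NA, 3.5, 3.0, 4.0, 2.5, NA)
)

# glm silently drops the rows where x is NA (default na.action = na.omit)
fit <- glm(y ~ x, data = toy, family = "binomial")

# Keep the same complete rows by hand
toy_subset <- toy[complete.cases(toy[, c("y", "x")]), ]

# The fitted values and the subset should line up row for row
n_fitted  <- length(fit$fitted.values)
n_subset  <- nrow(toy_subset)
same_rows <- identical(names(fit$fitted.values), rownames(toy_subset))
c(n_fitted = n_fitted, n_subset = n_subset, same_rows = same_rows)
```

The same check on the real objects would be `identical(names(logistic$fitted.values), rownames(data_subset))`.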
# Confusion matrix
Confusion Matrix
# Create a numeric version of the outcome (0 = no diabetes, 1 = diabetes). This is required for calculating confusion matrices.
data_subset$Outcome_num <- ifelse(data_subset$Outcome == "1", 1, 0)
# Predicted probabilities
predicted.probs <- logistic$fitted.values
# Predicted classes: 1 if prob > 0.5, else 0
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)
# Confusion matrix
confusion <- table(
Predicted = factor(predicted.classes, levels = c(0, 1)),
Actual = factor(data_subset$Outcome_num, levels = c(0, 1))
)
confusion
## Actual
## Predicted 0 1
## 0 429 114
## 1 59 150
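# Extract values from the table (rows = Predicted, columns = Actual)
TN <- confusion["0", "0"] # 429
FP <- confusion["1", "0"] # 59
FN <- confusion["0", "1"] # 114
TP <- confusion["1", "1"] # 150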
#Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN) # also called recall or true positive rate
specificity <- TN / (TN + FP) # true negative rate
precision <- TP / (TP + FP) # positive predictive value
cat("Accuracy:", round(accuracy, 3), "\nSensitivity:", round(sensitivity, 3), "\nSpecificity:", round(specificity, 3), "\nPrecision:", round(precision, 3))
## Accuracy: 0.77
## Sensitivity: 0.568
## Specificity: 0.879
## Precision: 0.718
Interpret: How well does the model perform? Is it better at detecting diabetes (sensitivity) or non-diabetes (specificity)? Why might this matter for medical diagnosis?
This model isn’t incredible, but it’s not horrible either. Overall accuracy is 77%. It correctly identifies individuals without diabetes 87.9% of the time (specificity), but it correctly identifies individuals who have diabetes only 56.8% of the time (sensitivity), so it is better at detecting non-diabetes. For a medical diagnosis, I would say this is a pretty weak model. When it comes to an individual’s health, missed cases are costly: with sensitivity this low, 114 of the 264 people with diabetes were predicted healthy (false negatives), which leaves them without treatment and endangers them further.
Plot the ROC curve, use the “data_subset” from Q2.
Calculate AUC.
#Enter your code here
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# ROC curve & AUC on full data
roc_obj <- roc(response = data_subset$Outcome,
predictor = logistic$fitted.values,
levels = c("0", "1"),
direction = "<") # smaller prob = Healthy
# Print AUC value
auc_val <- auc(roc_obj); auc_val
## Area under the curve: 0.828
# Plot ROC with AUC displayed
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")
What does AUC indicate (0.5 = random, 1.0 = perfect)?
The AUC (Area Under the Curve) measures how well the model separates individuals who have diabetes from those who do not, across all possible thresholds. It can be read as the probability that a randomly chosen individual with diabetes receives a higher predicted probability than a randomly chosen individual without it. If the ROC curve were a straight diagonal (AUC = 0.5), the model would be no better than a random guess. The closer the curve bends toward the upper left corner (AUC approaching 1.0), the better the model discriminates. Our model has an AUC of 0.828, which is pretty good. However, I think for a medical situation like this the model should be more accurate.
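To make the AUC concrete, it can be computed directly as the share of (diabetic, non-diabetic) pairs in which the diabetic individual gets the higher score, with ties counting as half. The sketch below does this in base R on four made-up scores. The function name `auc_by_pairs` is hypothetical, and this is not how `pROC` computes the value internally, only an equivalent definition.

```r
# Made-up scores and labels, for illustration only
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

auc_by_pairs <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  # Compare every positive score with every negative score
  diffs <- outer(pos, neg, "-")
  # A positive ranked above a negative counts 1, a tie counts 0.5
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

toy_auc <- auc_by_pairs(scores, labels)
toy_auc
```

Here three of the four positive-negative pairs are ordered correctly, so the toy AUC is 0.75.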
For diabetes diagnosis, prioritize sensitivity (catching cases) or specificity (avoiding false positives)? Suggest a threshold and explain.
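For diabetes diagnosis, you should prioritize sensitivity (catching cases). Having false positives isn’t ideal, but they do far less harm than false negatives. In a medical scenario, a false negative means the patient won’t receive the necessary treatment because they’ve been wrongly classified as not having diabetes. Such errors can cause the individual’s health to severely deteriorate and can even lead to death. Since sensitivity at the 0.5 threshold is only 56.8%, I would suggest lowering the classification threshold below 0.5, for example to around 0.3, and checking the resulting sensitivity and specificity against the ROC curve. Lowering the threshold flags more people as positive, which raises sensitivity at the cost of more false positives. Note that the threshold changes the confusion matrix but not the AUC, which summarizes performance across all thresholds.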
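The trade-off between sensitivity and specificity can be seen by recomputing both at several thresholds. The sketch below uses ten made-up probabilities and outcomes for illustration only. `metrics_at` is a hypothetical helper, and the thresholds 0.3, 0.5, and 0.7 are arbitrary choices.

```r
# Sensitivity and specificity at a given classification threshold
metrics_at <- function(probs, actual, threshold) {
  predicted <- ifelse(probs > threshold, 1, 0)
  TP <- sum(predicted == 1 & actual == 1)
  FN <- sum(predicted == 0 & actual == 1)
  TN <- sum(predicted == 0 & actual == 0)
  FP <- sum(predicted == 1 & actual == 0)
  c(threshold   = threshold,
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}

# Made-up predicted probabilities and outcomes, for illustration only
probs  <- c(0.05, 0.12, 0.31, 0.48, 0.52, 0.67, 0.74, 0.91, 0.22, 0.38)
actual <- c(0, 0, 0, 1, 0, 1, 1, 1, 0, 1)

# One row per threshold
threshold_table <- t(sapply(c(0.3, 0.5, 0.7),
                            function(th) metrics_at(probs, actual, th)))
threshold_table
```

On this toy data, sensitivity falls and specificity rises as the threshold increases. The same helper could be applied to the real model with `metrics_at(logistic$fitted.values, data_subset$Outcome_num, 0.3)`.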