Introduction:

In this homework, you will apply logistic regression to a real-world dataset: the Pima Indians Diabetes Database. This dataset contains medical records from 768 women of Pima Indian heritage, aged 21 or older, and is used to predict the onset of diabetes (binary outcome: 0 = no diabetes, 1 = diabetes) based on physiological measurements.

The data is publicly available from the UCI Machine Learning Repository and can be imported directly.

Dataset URL: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

Columns (no header in the CSV, so we need to assign them manually):

  1. Pregnancies: Number of times pregnant
  2. Glucose: Plasma glucose concentration (2-hour test)
  3. BloodPressure: Diastolic blood pressure (mm Hg)
  4. SkinThickness: Triceps skin fold thickness (mm)
  5. Insulin: 2-hour serum insulin (mu U/ml)
  6. BMI: Body mass index (weight in kg/(height in m)^2)
  7. DiabetesPedigreeFunction: Diabetes pedigree function (a function scoring genetic risk)
  8. Age: Age in years
  9. Outcome: Class variable (0 = no diabetes, 1 = diabetes)

Task Overview: You will load the data, build a logistic regression model to predict diabetes onset using a subset of predictors (Glucose, BMI, Age), interpret the model, evaluate it with a confusion matrix and metrics, and analyze the ROC curve and AUC.

Cleaning the dataset: Don't change the following code.

# Past assignment "logistic regression-semi clean.rmd" used as a reference for formats/notes

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
url <- "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

data <- read.csv(url, header = FALSE)

colnames(data) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome")

data$Outcome <- as.factor(data$Outcome)

# Handle missing values (replace 0s with NA because 0 makes no sense here)
data$Glucose[data$Glucose == 0] <- NA
data$BloodPressure[data$BloodPressure == 0] <- NA
data$BMI[data$BMI == 0] <- NA


colSums(is.na(data))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0
data_subset <- data[complete.cases(data[, c("Glucose", "BMI", "Age")]), ]

data_subset$Outcome_num <- ifelse(data_subset$Outcome == "1", 1, 0)
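
Before modeling, an optional sanity check can confirm what the cleaning step produced (not part of the required code above):

# 16 rows are dropped (5 missing Glucose + 11 missing BMI), leaving 752 of 768
nrow(data_subset)
# Class balance: 488 non-diabetic (0) vs. 264 diabetic (1) cases
table(data_subset$Outcome)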

Question 1: Create and Interpret a Logistic Regression Model

Fit a logistic regression model to predict Outcome using Glucose, BMI, and Age.

## Enter your code here

model_df <- glm(Outcome ~ Glucose + BMI + Age, data = data_subset, family = "binomial")

summary(model_df)
## 
## Call:
## glm(formula = Outcome ~ Glucose + BMI + Age, family = "binomial", 
##     data = data_subset)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.032377   0.711037 -12.703  < 2e-16 ***
## Glucose      0.035548   0.003481  10.212  < 2e-16 ***
## BMI          0.089753   0.014377   6.243  4.3e-10 ***
## Age          0.028699   0.007809   3.675 0.000238 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 974.75  on 751  degrees of freedom
## Residual deviance: 724.96  on 748  degrees of freedom
## AIC: 732.96
## 
## Number of Fisher Scoring iterations: 4

What does the intercept represent (log-odds of diabetes when predictors are zero)?

The intercept represents the log-odds of diabetes when all three predictors (Glucose, BMI, and Age) are zero. By itself it is not meaningful, since a glucose level or BMI of zero is physiologically impossible.
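
For a concrete sense of scale, the intercept's log-odds can be mapped to a probability with the logistic function (a minimal sketch reusing model_df from above):

# Convert the intercept from log-odds to a probability: p = 1 / (1 + exp(-x))
plogis(coef(model_df)["(Intercept)"])
# ~ 0.00012, i.e. essentially zero probability at Glucose = BMI = Age = 0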

For each predictor (Glucose, BMI, Age), does a one-unit increase raise or lower the odds of diabetes? Are they significant (p-value < 0.05)?

All three coefficients are positive, so a one-unit increase in Glucose, BMI, or Age raises the odds of diabetes. Each predictor is statistically significant, with p-values well below 0.05 (all under 0.001).
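
Exponentiating the coefficients turns log-odds into odds ratios, which are easier to read (a short sketch reusing model_df):

# Odds ratios: a one-unit increase multiplies the odds of diabetes by exp(coefficient)
exp(coef(model_df))
# e.g. exp(0.0898) is about 1.094, so each additional BMI unit raises the odds by roughly 9.4%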

Question 2: Confusion Matrix and Important Metrics

Calculate and report the metrics:

Accuracy: (TP + TN) / Total
Sensitivity (Recall): TP / (TP + FN)
Specificity: TN / (TN + FP)
Precision: TP / (TP + FP)

Use the following starter code:

# Keep only rows with no missing values in Glucose, BMI, or Age
data_subset <- data[complete.cases(data[, c("Glucose", "BMI", "Age")]), ]

# Create a numeric version of the outcome (0 = no diabetes, 1 = diabetes). This is required for calculating confusion matrices.
data_subset$Outcome_num <- ifelse(data_subset$Outcome == "1", 1, 0)


# Predicted probabilities
predicted.data <- model_df$fitted.values
head(predicted.data)  # preview the first few fitted probabilities (the full vector has 752 values)
##          1          2          3          4          5          6 
## 0.66360006 0.06101402 0.61834186 0.06043396 0.65771328 0.14802668
# Predicted classes
predicted.classes <- ifelse(predicted.data > 0.5, 1, 0)


# Confusion matrix
confusion <- table(
  Predicted = predicted.classes,
  Actual = data_subset$Outcome_num
  )
 

confusion
##          Actual
## Predicted   0   1
##         0 429 114
##         1  59 150


# Extract values from the confusion matrix (rows = Predicted, columns = Actual)
TN <- confusion["0", "0"]  # true negatives
FP <- confusion["1", "0"]  # false positives
FN <- confusion["0", "1"]  # false negatives
TP <- confusion["1", "1"]  # true positives

# Metrics
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)

cat("Accuracy:", round(accuracy, 3), "\nSensitivity:", round(sensitivity, 3), "\nSpecificity:", round(specificity, 3), "\nPrecision:", round(precision, 3))
## Accuracy: 0.77 
## Sensitivity: 0.568 
## Specificity: 0.879 
## Precision: 0.718

Interpret: How well does the model perform? Is it better at detecting diabetes (sensitivity) or non-diabetes (specificity)? Why might this matter for medical diagnosis?

The model performs reasonably well overall (accuracy = 0.77), but it is much better at detecting non-diabetes (specificity = 0.879) than diabetes (sensitivity = 0.568). For medical diagnosis this is a concern: the model misses roughly 43% of true diabetes cases, and a missed diagnosis leaves the patient untreated, which harms the individual and exposes the provider to risk.

Question 3: ROC Curve, AUC, and Interpretation

# Enter your code here

roc_obj <- roc(response = data_subset$Outcome_num,
                predictor = predicted.data,
                levels = c(0, 1),
                direction = "<")

auc_value <- auc(roc_obj)
auc_value
## Area under the curve: 0.828
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")

What does AUC indicate (0.5 = random, 1.0 = perfect)?

The AUC summarizes how well the model separates the two classes across all possible thresholds: 0.5 means the model ranks cases no better than random guessing, while 1.0 means perfect discrimination. Our AUC of 0.828 indicates good, though not perfect, discriminative ability.
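
The AUC also has a direct probabilistic reading: it is the chance that a randomly chosen diabetic case receives a higher predicted probability than a randomly chosen non-diabetic case. A quick empirical check (a sketch reusing predicted.data and data_subset; ties count as half):

# Concordance check: fraction of (positive, negative) pairs ranked correctly
pos <- predicted.data[data_subset$Outcome_num == 1]
neg <- predicted.data[data_subset$Outcome_num == 0]
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
# should agree with auc_value, about 0.828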

For diabetes diagnosis, prioritize sensitivity (catching cases) or specificity (avoiding false positives)? Suggest a threshold and explain.

For diabetes diagnosis, sensitivity should be prioritized: a missed case (false negative) means a serious, potentially life-threatening condition goes untreated, while a false positive only triggers additional testing. Lowering the threshold below 0.5, for example to around 0.3, would flag more at-risk patients and raise sensitivity at the acceptable cost of some specificity.
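
To see the trade-off concretely, here is a hedged sketch recomputing predictions at a 0.3 cutoff (the exact counts depend on the fitted model, so no specific numbers are claimed):

# Lower the cutoff from 0.5 to 0.3: more true diabetics flagged (higher sensitivity),
# at the cost of more false positives (lower specificity)
predicted.classes.03 <- ifelse(predicted.data > 0.3, 1, 0)
table(Predicted = predicted.classes.03, Actual = data_subset$Outcome_num)

# pROC can report sensitivity and specificity at a chosen threshold directly
coords(roc_obj, x = 0.3, input = "threshold", ret = c("sensitivity", "specificity"))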