class: center, middle, inverse, title-slide

.title[
# STA 321 Logistic Regression Project: Predicting a Patient’s Odds of CHD
]
.subtitle[
## 
]
.author[
### Josie Gallop, Chloé Winters, Ava Destefano
]
.date[
### 2025-02-22
]

---

<!-- Start of Josie's Slides -->

# Agenda

<font size = 6>

• Binary Predictive Modeling: logistic regression <BR> <BR>
• Introduction: variables <BR> <BR>
• Exploratory Data Analysis <BR> <BR>
• Three Candidate Models: full, reduced, forward selection <BR> <BR>
• Model Selection Process: CV and ROC Analysis <BR> <BR>
• Conclusion and Recommendations <BR> <BR>

</font>

---

# Introduction

<font size = 6>

• Data set found on kaggle.com (Dileep, 2019) <BR> <BR>
• From a cardiovascular study in Framingham, Massachusetts <BR> <BR>
• 4,238 observations of 16 variables <BR> <BR>
• Goal: estimate the odds of a patient being at risk of developing coronary heart disease (CHD) <BR> <BR>

</font>

---

## Variables

<font size = 6>

.pull-left[
- gender (coded as `male`)
- age
- education
- currentSmoker
- cigsPerDay
- BPMeds
- prevalentStroke
- prevalentHyp
- diabetes
]

.pull-right[
- totChol
- sysBP
- diaBP
- BMI
- heartRate
- glucose
- TenYearCHD (binary response variable)
  - 0 = "no" and 1 = "yes"
]

</font>

---
class:inverse middle center
name: practical question

# Some Practical Questions?

---

## First Few Entries of the Data Set
---

## Fixing the Missing Values

<font size = 6>

• Some variables contain missing values <BR> <BR>
• Fix this with multiple imputation <BR> <BR>
• Use the `mice()` function (Multivariate Imputation by Chained Equations) <BR> <BR>
• The imputed data set has no missing values <BR> <BR>

</font>

---

## Correcting the Variable Types

<font size = 6>

• Some variables in the data set have incorrect types <BR> <BR>
• cigsPerDay: change to numeric <BR> <BR>
• age: change to numeric <BR> <BR>
• BPMeds: change to integer <BR> <BR>
• education: change to character <BR> <BR>

</font>

---
class:inverse middle center
name:storytelling

# Visualizations of Quantitative Variable Distributions

---

## sysBP Distribution

<img src="FinalSlidesWeek5_files/figure-html/unnamed-chunk-5-1.png" width="600px" style="display: block; margin: auto;" />

---

## diaBP Distribution

<img src="FinalSlidesWeek5_files/figure-html/unnamed-chunk-6-1.png" width="600px" style="display: block; margin: auto;" />

---

## heartRate Distribution

<img src="FinalSlidesWeek5_files/figure-html/unnamed-chunk-7-1.png" width="600px" style="display: block; margin: auto;" />

---

## cigsPerDay Distribution

<img src="FinalSlidesWeek5_files/figure-html/unnamed-chunk-8-1.png" width="600px" style="display: block; margin: auto;" />

---

# Complete Table of the Data Set

```
## # A tibble: 4,238 × 16
##     male   age education currentSmoker cigsPerDay BPMeds prevalentStroke
##    <int> <dbl> <chr>             <int>      <dbl>  <int>           <int>
##  1     1    39 4                     0          0      0               0
##  2     0    46 2                     0          0      0               0
##  3     1    48 1                     1         20      0               0
##  4     0    61 3                     1         30      0               0
##  5     0    46 3                     1         23      0               0
##  6     0    43 2                     0          0      0               0
##  7     0    63 1                     0          0      0               0
##  8     0    45 2                     1         20      0               0
##  9     1    52 1                     0          0      0               0
## 10     1    43 1                     1         30      0               0
## # ℹ 4,228 more rows
## # ℹ 9 more variables: prevalentHyp <int>, diabetes <int>, totChol <dbl>,
## #   sysBP <dbl>, diaBP <dbl>, BMI <dbl>, heartRate <dbl>, glucose <dbl>,
## #   TenYearCHD <int>
```

<!-- End of Josie's Slides -->

---

## Pairwise Scatterplot Analysis

<img src="FinalSlidesWeek5_files/figure-html/unnamed-chunk-11-1.png" width="500px" style="display: block; margin: auto;" />
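
---

## Sketch: Variable Type Corrections in R

A minimal sketch of the variable type corrections described earlier, using base R conversion functions. The three rows are copied from the data table shown earlier; the data frame name `heartdisease` and the starting column types are assumptions made for illustration, not the project's actual code.

```r
# Toy stand-in for the raw data: three rows copied from the deck's data table.
# The name `heartdisease` and the initial column types are assumed.
heartdisease <- data.frame(
  age        = c(39L, 46L, 48L),
  education  = c(4L, 2L, 1L),
  cigsPerDay = c(0L, 0L, 20L),
  BPMeds     = c(0, 0, 0)
)

# Convert each column to the type listed on the
# "Correcting the Variable Types" slide.
heartdisease$cigsPerDay <- as.numeric(heartdisease$cigsPerDay)
heartdisease$age        <- as.numeric(heartdisease$age)
heartdisease$BPMeds     <- as.integer(heartdisease$BPMeds)
heartdisease$education  <- as.character(heartdisease$education)

# Inspect the resulting column types.
str(heartdisease)
```

Storing education as character means a model formula would treat it as a categorical predictor rather than a numeric score.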
---

## Variable Standardization

<font size = 6>

• All numeric variables are now standardized <BR> <BR>
• This puts all predictors on a common scale so their coefficients are comparable <BR> <BR>

</font>

---

## New Data Set

<font size = 6>

• Create a final data set called **sd.heartdisease** <BR> <BR>
• Replaces old variables with standardized ones <BR> <BR>
• Essential for model building <BR> <BR>

</font>

---

## Data Split

<font size = 6>

• Split the data into two groups <BR> <BR>
• 80% for training <BR> <BR>
• 20% for testing <BR> <BR>
• Training data will be used for building our models <BR> <BR>

</font>

---
class:inverse middle center
name:model building

# Model Building Process

---

## Full Model

<font size = 6>

• Includes all variables <BR> <BR>

</font>

---

<h3 align="center"> Full Model Summary Statistics </h3>

|                |   Estimate| Std. Error|     z value| Pr(>&#124;z&#124;)|
|:---------------|----------:|----------:|-----------:|------------------:|
|(Intercept)     | -2.2024269|  0.1240704| -17.7514313|          0.0000000|
|male            |  0.4852112|  0.1013449|   4.7877196|          0.0000017|
|sd.age          |  0.5168830|  0.0539408|   9.5824163|          0.0000000|
|education2      | -0.2356171|  0.1154925|  -2.0401079|          0.0413396|
|education3      | -0.1026466|  0.1376433|  -0.7457430|          0.4558227|
|education4      |  0.0115144|  0.1519726|   0.0757666|          0.9396048|
|currentSmoker   |  0.0219380|  0.1443470|   0.1519811|          0.8792019|
|sd.cigsPerDay   |  0.2507207|  0.0679008|   3.6924543|          0.0002221|
|BPMeds          |  0.3270416|  0.2170889|   1.5064870|          0.1319422|
|prevalentStroke |  0.9388714|  0.4468276|   2.1011937|          0.0356240|

---

## Reduced Model

<font size = 6>

• Includes the variables "currentSmoker", "sd.cigsPerDay", "sd.sysBP", "sd.diaBP", and "sd.totChol" <BR> <BR>
• Based on the risk factors most widely recognized in practice <BR> <BR>
• This model could be used as a starting point <BR> <BR>

</font>

---

<h3 align="center"> Reduced Model Summary Statistics </h3>

|              |   Estimate| Std. Error|    z value| Pr(>&#124;z&#124;)|
|:-------------|----------:|----------:|----------:|------------------:|
|(Intercept)   | -1.7675361|  0.0822320| -21.494511|          0.0000000|
|currentSmoker | -0.1443498|  0.1392986|  -1.036262|          0.3000800|
|sd.cigsPerDay |  0.2683287|  0.0637373|   4.209920|          0.0000255|
|sd.sysBP      |  0.6381074|  0.0641837|   9.941893|          0.0000000|
|sd.diaBP      | -0.1409295|  0.0650918|  -2.165088|          0.0303809|
|sd.totChol    |  0.1113871|  0.0431866|   2.579205|          0.0099028|

---

## Stepwise Model

<font size = 6>

• Uses forward selection <BR> <BR>
• Includes variables "currentSmoker", "sd.cigsPerDay", "sd.sysBP", "sd.diaBP", "sd.totChol", "sd.age", "male", "sd.glucose", "prevalentStroke", "prevalentHyp", and "BPMeds". <BR> <BR>

</font>

---

<h3 align="center"> Stepwise Model Summary Statistics </h3>

|                |   Estimate| Std. Error|     z value| Pr(>&#124;z&#124;)|
|:---------------|----------:|----------:|-----------:|------------------:|
|(Intercept)     | -2.2897777|  0.1108872| -20.6496069|          0.0000000|
|currentSmoker   |  0.0083694|  0.1431896|   0.0584499|          0.9533902|
|sd.cigsPerDay   |  0.2497211|  0.0677217|   3.6874613|          0.0002265|
|sd.sysBP        |  0.3080000|  0.0777695|   3.9604228|          0.0000748|
|sd.diaBP        | -0.0324529|  0.0698694|  -0.4644798|          0.6423040|
|sd.totChol      |  0.0755135|  0.0454483|   1.6615237|          0.0966083|
|sd.age          |  0.5363993|  0.0527922|  10.1605849|          0.0000000|
|male            |  0.5164091|  0.0993494|   5.1979103|          0.0000002|
|sd.glucose      |  0.1825977|  0.0373944|   4.8830291|          0.0000010|
|prevalentStroke |  0.9413243|  0.4425788|   2.1269078|          0.0334277|
|prevalentHyp    |  0.2310484|  0.1281164|   1.8034261|          0.0713213|
|BPMeds          |  0.3233004|  0.2160729|   1.4962560|          0.1345870|

---

<h3 align="center"> Cross Validation </h3>

<font size = 5>

.pull-left[
• We use 5-fold cross-validation.
<BR> <BR>
• Candidate01 is the full model <BR> <BR>
• Candidate02 is the reduced model <BR> <BR>
• Candidate03 is the stepwise model <BR> <BR>
]

.pull-right[
<h5 align="center"> Predictive Error Table </h5>
<BR>

| PE1 | PE2 | PE3 |
|:---:|:---:|:---:|
|0.8546|0.8495|0.8546|
]

<BR> <BR>
</font>

<!-- Start of Chloé's Slides -->

---

## ROC Curve

<img src="FinalSlidesWeek5_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />

---

## ROC Analysis

<font size = 6>

• An AUC closer to 1 indicates better performance <BR> <BR>
• The reduced model has the lowest AUC <BR> <BR>
• This contrasts with the cross-validation results, where the reduced model had the lowest PE <BR> <BR>
• Possible issue with false positives and negatives <BR> <BR>
• Forward selection looks like a good choice <BR> <BR>

</font>

---
class:inverse middle center
name:conclusion

# Conclusion

---

# Conclusion

<font size = 6>

• The reduced model performs best on predictive error (lowest PE) <BR> <BR>
• By AUC, the forward selection model was best <BR> <BR>
• Fewer variables could have caused false positives & negatives <BR>

</font>

---
class:inverse middle center
name:general

# General Discussion

---

## Recommendations & Limitations

<font size = 6>

• Expand data collection <BR> <BR>
• Consider other variables, such as income and family history <BR> <BR>
• Consider other candidate models <BR> <BR>
• Investigate potential false positives and negatives <BR>

</font>

---

## Final Statements

<font size = 6>

• Both the reduced and forward selection models have benefits <BR> <BR>
• Lower PE in reduced model <BR> <BR>
• Higher AUC in forward selection model <BR> <BR>
• Potential false positives and negatives <BR> <BR>
• Both models provide important information regarding risk of CHD <BR>

</font>

---
class:inverse middle center
name:Reference

# References

---

# References

<font size = 6>

• Dileep. (2019, June 7). Logistic regression to predict heart disease. Kaggle. https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression?resource=download&select=framingham.csv <BR> <BR>
• Hajar, R. (2017). Risk factors for coronary artery disease: Historical perspectives. Heart Views: The Official Journal of the Gulf Heart Association. https://pmc.ncbi.nlm.nih.gov/articles/PMC5686931/ <BR>

</font>

---
class: inverse center middle

# Q & A

---

# Credits

<font size = 6>

• Josie Gallop, Slides 1 - 14, 38 <BR> <BR>
• Ava Destefano, Slides 15 - 26 <BR> <BR>
• Chloé Winters, Slides 27 - 37 <BR> <BR>

</font>

---
name: Thank you
class: inverse1 center middle

# Thank you!

Slides created using R packages:

[**xaringan**](https://github.com/yihui/xaringan)<br>
[**gadenbuie/xaringanthemer**](https://github.com/gadenbuie/xaringanthemer)<br>
[**knitr**](http://yihui.name/knitr)<br>
[**R Markdown**](https://rmarkdown.rstudio.com)<br>
via <br>
[**RStudio Desktop**](https://posit.co/download/rstudio-desktop/)
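
---

## Appendix: Sketch of Standardization, Data Split, and Model Fit

The standardization, 80/20 split, and logistic regression steps from the model building slides can be sketched in base R as follows. This is an illustration on simulated data, not the project's actual code: the values, the seed, the simulated coefficients, and the data frame name `heartdisease` are all assumptions; only the variable names and `sd.heartdisease` come from the slides.

```r
set.seed(321)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data. Variable names mirror the deck;
# the values are made up for illustration only.
n <- 500
heartdisease <- data.frame(
  age        = round(runif(n, 32, 70)),
  sysBP      = rnorm(n, mean = 132, sd = 22),
  cigsPerDay = rpois(n, lambda = 9)
)

# Simulate a binary response whose log-odds rise with age and sysBP.
log.odds <- -8 + 0.06 * heartdisease$age + 0.02 * heartdisease$sysBP
heartdisease$TenYearCHD <- rbinom(n, size = 1, prob = plogis(log.odds))

# Standardize each numeric predictor: (x - mean) / sd.
sd.heartdisease <- data.frame(
  sd.age        = as.numeric(scale(heartdisease$age)),
  sd.sysBP      = as.numeric(scale(heartdisease$sysBP)),
  sd.cigsPerDay = as.numeric(scale(heartdisease$cigsPerDay)),
  TenYearCHD    = heartdisease$TenYearCHD
)

# 80% of the rows for training, the remaining 20% for testing.
train.id <- sample(seq_len(n), size = floor(0.8 * n))
train <- sd.heartdisease[train.id, ]
test  <- sd.heartdisease[-train.id, ]

# Logistic regression models the log-odds of TenYearCHD = 1.
fit <- glm(TenYearCHD ~ sd.age + sd.sysBP + sd.cigsPerDay,
           family = binomial, data = train)

# Predicted probabilities of CHD on the held-out test set.
test.prob <- predict(fit, newdata = test, type = "response")

# Coefficient table in the same layout as the summary slides.
summary(fit)$coefficients
```

Exponentiating a coefficient, for example `exp(coef(fit)["sd.age"])`, gives the multiplicative change in the odds of CHD for a one standard deviation increase in that predictor.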