class: center, middle, inverse, title-slide

.title[
# STA 321 Logistic Regression Project: Predicting a Patient’s Odds of CHD
]
.subtitle[
## 
]
.author[
### Josie Gallop, Chloé Winters, Ava Destefano
]
.date[
### 2025-02-16
]

---

<!-- Start of Josie's Slides -->
# Agenda

<font size = 5>

.pull-left[
- Binary Predictive Modeling
- Logistic Regression
- Introduction
- Variables
- Practical Questions
- Exploratory Data Analysis
- Three Candidate Models
  - Full Model
  - Reduced Model
  - Forward Selection Model
- Model Selection Process
  - Cross Validation
  - ROC Analysis
- Conclusion and Recommendations
]

<BR>
<BR>
</font>

---

# Introduction

<font size = 5>

.pull-left[
- Data found on kaggle.com (Dileep, 2019).
- Ongoing cardiovascular study in Framingham, Massachusetts.
- 4,238 observations of 16 variables.
- Various personal and medical risk factors.
- Logistic regression model
  - Odds of a patient being at risk of developing coronary heart disease (CHD) within a 10-year period.
- Three candidate models:
  - Full model
  - Reduced model
  - Forward selection model
]

<BR>
<BR>
</font>

---

## Variables

<font size = 6>

.pull-left[
- male (gender)
- age
- education
- currentSmoker
- cigsPerDay
- BPMeds
- prevalentStroke
- prevalentHyp
- diabetes
]

.pull-right[
- totChol
- sysBP
- diaBP
- BMI
- heartRate
- glucose
- TenYearCHD (binary response variable)
  - 0 = "no" and 1 = "yes"
]

<BR>
<BR>
</font>

---

## First Few Entries of the Data Set
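
A minimal sketch of how the first few entries could be displayed. It assumes the Kaggle file was saved locally as `framingham.csv` (the file name that appears in the reference URL); the path and the object name `framingham` are assumptions, not taken from the original slides.

```r
# Assumption: the Kaggle download sits in the working directory as "framingham.csv".
path <- "framingham.csv"

if (file.exists(path)) {
  # Hypothetical object name; the slides do not show the original one.
  framingham <- read.csv(path)

  # First six rows of the raw data.
  print(head(framingham))

  # Number of missing values in each variable.
  print(colSums(is.na(framingham)))
}
```

The `file.exists()` guard only keeps the sketch from failing when the file is not present.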

---

## Fixing the Missing Values

<font size = 6>

.pull-left[
- Some variables contain missing values
- Fix this with multiple imputation
- Use the `mice()` function from the **mice** package
- This leaves no missing values in the data set
]

<BR>
<BR>
</font>

---

## Correcting the Variable Types

<font size = 6>

.pull-left[
- Some variables in the data set have incorrect types
  - cigsPerDay: change to numeric
  - age: change to numeric
  - BPMeds: change to integer
  - education: change to character
]

<BR>
<BR>
</font>

---

class: inverse1 middle center
name: storytelling

# Visualizations of Quantitative Variable Distributions

---

## **sysBP** Distribution

<img src="Week4GroupSlides_files/figure-html/unnamed-chunk-5-1.png" width="600px" style="display: block; margin: auto;" />

---

## **diaBP** Distribution

<img src="Week4GroupSlides_files/figure-html/unnamed-chunk-6-1.png" width="600px" style="display: block; margin: auto;" />

---

## **cigsPerDay** Distribution

<img src="Week4GroupSlides_files/figure-html/unnamed-chunk-7-1.png" width="600px" style="display: block; margin: auto;" />

---

# Complete Table of the Data Set

```
## # A tibble: 4,238 × 16
##     male   age education currentSmoker cigsPerDay BPMeds prevalentStroke
##    <int> <dbl> <chr>             <int>      <dbl>  <int>           <int>
##  1     1    39 4                     0          0      0               0
##  2     0    46 2                     0          0      0               0
##  3     1    48 1                     1         20      0               0
##  4     0    61 3                     1         30      0               0
##  5     0    46 3                     1         23      0               0
##  6     0    43 2                     0          0      0               0
##  7     0    63 1                     0          0      0               0
##  8     0    45 2                     1         20      0               0
##  9     1    52 1                     0          0      0               0
## 10     1    43 1                     1         30      0               0
## # ℹ 4,228 more rows
## # ℹ 9 more variables: prevalentHyp <int>, diabetes <int>, totChol <dbl>,
## #   sysBP <dbl>, diaBP <dbl>, BMI <dbl>, heartRate <dbl>, glucose <dbl>,
## #   TenYearCHD <int>
```

<!-- End of Josie's Slides -->

---

class: inverse1 center top

## Pairwise Scatterplot Analysis

<img src="Week4GroupSlides_files/figure-html/unnamed-chunk-9-1.png" width="500px" style="display: block; margin: auto;" />

---

class: inverse4, top

<h1 align="Left"> Variable Standardization </h1>

- Now all **numeric** variables will be standardized
- This will increase predictive
power

---

class: inverse4, top

<h1 align="Left"> New Data Set </h1>

- Create a final data set called **heartdisease**
- Replaces old variables with standardized ones
- Essential for model building

---

<h1 align="Left"> Data Split </h1>

- Split the data into two groups
  - 80% for training
  - 20% for testing
- Training data will be used for building our models

---

class: inverse middle center
name: model building

# Model Building Process

---

<h1 align="Center"> Full Model </h1>

- Includes all variables

<h3 align="center"> Full Model Equation: </h3>

$$\begin{aligned}
\log\left(\frac{p}{1-p}\right) = {} & -2.2024 + 0.4852 \times \text{male} + 0.5169 \times \text{sd.age} \\
& - 0.2356 \times \text{education2} - 0.1026 \times \text{education3} + 0.0115 \times \text{education4} \\
& + 0.0219 \times \text{currentSmoker} + 0.2507 \times \text{sd.cigsPerDay} + 0.3270 \times \text{BPMeds} \\
& + 0.9389 \times \text{prevalentStroke} + 0.2312 \times \text{prevalentHyp} + 0.0912 \times \text{diabetes} \\
& + 0.0803 \times \text{sd.totChol} + 0.3075 \times \text{sd.sysBP} - 0.0327 \times \text{sd.diaBP} \\
& + 0.0071 \times \text{sd.BMI} - 0.0137 \times \text{sd.heartRate} + 0.1730 \times \text{sd.glucose}
\end{aligned}$$

---

<h5 align="Left"> Full Model Summary Statistics </h5>

|                |   Estimate| Std. Error|     z value| Pr(>&#124;z&#124;)|
|:---------------|----------:|----------:|-----------:|------------------:|
|(Intercept)     | -2.2591331|  0.1431848| -15.7777494|          0.0000000|
|male            |  0.5094673|  0.1003674|   5.0760236|          0.0000004|
|sd.age          |  0.5325551|  0.0533351|   9.9850814|          0.0000000|
|education       | -0.0174590|  0.0458474|  -0.3808063|          0.7033470|
|currentSmoker   |  0.0162456|  0.1443256|   0.1125620|          0.9103778|
|sd.cigsPerDay   |  0.2502066|  0.0679496|   3.6822392|          0.0002312|
|BPMeds          |  0.3182373|  0.2166349|   1.4690027|          0.1418320|
|prevalentStroke |  0.9323836|  0.4438094|   2.1008647|          0.0356528|
|prevalentHyp    |  0.2304012|  0.1286008|   1.7916001|          0.0731970|
|diabetes        |  0.1079748|  0.2996369|   0.3603520|          0.7185839|
|sd.totChol      |  0.0770525|  0.0456074|   1.6894732|          0.0911288|
|sd.sysBP        |  0.3060400|  0.0782452|   3.9112953|          0.0000918|
|sd.diaBP        | -0.0327021|  0.0713340|  -0.4584366|          0.6466388|
|sd.BMI          |  0.0144815|  0.0481059|   0.3010344|          0.7633883|
|sd.heartRate    | -0.0171909|  0.0466690|  -0.3683580|          0.7126063|
|sd.glucose      |  0.1707987|  0.0499369|   3.4202885|          0.0006255|

---

<h1 align="Center"> Reduced Model </h1>

- Includes the variables "currentSmoker", "sd.cigsPerDay", "sd.sysBP", "sd.diaBP", and "sd.totChol".

<h3 align="center"> Reduced Model Equation: </h3>

$$\begin{aligned}
\log\left(\frac{p}{1-p}\right) = {} & -1.7675 - 0.1443 \times \text{currentSmoker} + 0.2683 \times \text{sd.cigsPerDay} \\
& + 0.6381 \times \text{sd.sysBP} - 0.1409 \times \text{sd.diaBP} + 0.1114 \times \text{sd.totChol}
\end{aligned}$$

---

<h1 align="Left"> Reduced Model </h1>

Table: Reduced Model Summary of Inferential Statistics

|              |   Estimate| Std. Error|    z value| Pr(>&#124;z&#124;)|
|:-------------|----------:|----------:|----------:|------------------:|
|(Intercept)   | -1.7675361|  0.0822320| -21.494511|          0.0000000|
|currentSmoker | -0.1443498|  0.1392986|  -1.036262|          0.3000800|
|sd.cigsPerDay |  0.2683287|  0.0637373|   4.209920|          0.0000255|
|sd.sysBP      |  0.6381074|  0.0641837|   9.941893|          0.0000000|
|sd.diaBP      | -0.1409295|  0.0650918|  -2.165088|          0.0303809|
|sd.totChol    |  0.1113871|  0.0431866|   2.579205|          0.0099028|

---

<h1 align="Center"> Stepwise Model </h1>

- Uses forward selection
- Includes the variables "currentSmoker", "sd.cigsPerDay", "sd.sysBP", "sd.diaBP", "sd.totChol", "sd.age", "male", "sd.glucose", "prevalentStroke", "prevalentHyp", and "BPMeds".

<h3 align="center"> Stepwise Model Equation: </h3>

$$\begin{aligned}
\log\left(\frac{p}{1-p}\right) = {} & -2.2898 + 0.0084 \times \text{currentSmoker} + 0.2497 \times \text{sd.cigsPerDay} \\
& + 0.3080 \times \text{sd.sysBP} - 0.0325 \times \text{sd.diaBP} + 0.0755 \times \text{sd.totChol} \\
& + 0.5364 \times \text{sd.age} + 0.5164 \times \text{male} + 0.1826 \times \text{sd.glucose} \\
& + 0.9413 \times \text{prevalentStroke} + 0.2310 \times \text{prevalentHyp} + 0.3233 \times \text{BPMeds}
\end{aligned}$$

---

<h1 align="Left"> Stepwise Model Summary Statistics </h1>

|                |   Estimate| Std. Error|     z value| Pr(>&#124;z&#124;)|
|:---------------|----------:|----------:|-----------:|------------------:|
|(Intercept)     | -2.2897777|  0.1108872| -20.6496069|          0.0000000|
|currentSmoker   |  0.0083694|  0.1431896|   0.0584499|          0.9533902|
|sd.cigsPerDay   |  0.2497211|  0.0677217|   3.6874613|          0.0002265|
|sd.sysBP        |  0.3080000|  0.0777695|   3.9604228|          0.0000748|
|sd.diaBP        | -0.0324529|  0.0698694|  -0.4644798|          0.6423040|
|sd.totChol      |  0.0755135|  0.0454483|   1.6615237|          0.0966083|
|sd.age          |  0.5363993|  0.0527922|  10.1605849|          0.0000000|
|male            |  0.5164091|  0.0993494|   5.1979103|          0.0000002|
|sd.glucose      |  0.1825977|  0.0373944|   4.8830291|          0.0000010|
|prevalentStroke |  0.9413243|  0.4425788|   2.1269078|          0.0334277|
|prevalentHyp    |  0.2310484|  0.1281164|   1.8034261|          0.0713213|
|BPMeds          |  0.3233004|  0.2160729|   1.4962560|          0.1345870|

<!-- Start of Chloé's Slides -->

---

## ROC Curve

<img src="Week4GroupSlides_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />

---

## ROC Analysis

<font size = 6>

- An AUC closer to 1 indicates better performance <BR> <BR>
- The reduced model has the lowest AUC <BR> <BR>
- This contradicts the earlier prediction-error results, which favored the reduced model <BR> <BR>
- Possible issue with false positives and negatives <BR> <BR>
- Forward selection looks like a good choice <BR> <BR>

</font>

---

# Conclusion

<font size = 6>

- The reduced model has the lowest prediction error (PE) <BR> <BR>
- By AUC, the forward selection model performed best <BR> <BR>
- Fewer variables could have caused false positives and negatives <BR>

</font>

---

## Recommendations & Limitations

<font size = 6>

- Expand data collection <BR> <BR>
- Consider other variables, such as income and family history <BR> <BR>
- Consider other candidate models <BR> <BR>
- Investigate potential false positives and negatives <BR>

</font>

---

## Final Statements

<font size = 6>

- Benefits in using reduced and forward selection models <BR> <BR>
- Lower PE in reduced model <BR> <BR>
- Higher AUC in forward selection model <BR> <BR>
- Potential false positives and negatives <BR> <BR>
- Both models provide important information regarding risk of CHD <BR>

</font>

---

# References

<font size = 6>

- Dileep. (2019, June 7). Logistic regression to predict heart disease. Kaggle. https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression?resource=download&select=framingham.csv <BR> <BR>
- Hajar, R. (2017). Risk factors for coronary artery disease: Historical perspectives. Heart Views: The Official Journal of the Gulf Heart Association. https://pmc.ncbi.nlm.nih.gov/articles/PMC5686931/ <BR>

</font>

---

class: center, middle

# Q & A

---

# Credits

<font size = 6>

- Josie Gallop, Slides 1 - 12, 32 <BR> <BR>
- Ava Destefano, Slides 13 - 23 <BR> <BR>
- Chloé Winters, Slides 24 - 31 <BR> <BR>

</font>

---

name: Thank you
class: inverse1 center, middle

# Thank you!

Slides created using R packages:

[**xaringan**](https://github.com/yihui/xaringan)<br>
[**gadenbuie/xaringanthemer**](https://github.com/gadenbuie/xaringanthemer)<br>
[**knitr**](http://yihui.name/knitr)<br>
[**R Markdown**](https://rmarkdown.rstudio.com)<br>
via <br>
[**RStudio Desktop**](https://posit.co/download/rstudio-desktop/)