class: center, middle, inverse, title-slide

.title[
# Logistic Regression on 10 Year Heart Disease Risk
]
.author[
### By: Evan Parker, Edward Coleman, Zack Shin, and Johnny Zhang
]
.institute[
### West Chester University of Pennsylvania
]
.date[
### December 15, 2022
Prepared for
STA 490: Capstone Statistics
]

---
class: inverse, middle

## <center><b><font color = gold>Agenda</font></b></center>

### Dataset Description

### Variable Breakdown

### Final Dataset Creation

### Model Building

### Model Selection

---
class: inverse

# <center><b><font color = gold>Dataset Description</font></b></center>

- 1948: 5,000 research participants, 30-60 y.o.
- Framingham, Massachusetts
- Examined every 2 years
- 1971: Re-rostered because some patients had passed away
- 4,240 observations
- Both demographic and medical information

**From this Analysis:** We hope to uncover key patient characteristics that predict whether a patient will be diagnosed with *Coronary Heart Disease*. This could help medical professionals diagnose patients sooner, potentially saving their lives in the process.

---
class: inverse

# <center><b><font color = gold>List of Variables</font></b></center>

- **Male**: Dummy Variable, 0 for Female, 1 for Male
- **Age**: Patient's age in years
- **Education**: 1: Some High School, 2: High School/GED, 3: Some College/Vocational School, 4: College
- **SmokingStatus**: Dummy Variable, 0 for Non-Smoker, 1 for Smoker
- **CigarettesPerDay**: Number of cigarettes smoked per day
- **BloodPressureMeds**: Dummy Variable, 0 for Not Prescribed, 1 for Prescribed
- **StrokeHistory**: Dummy Variable, 0 for No Stroke, 1 for Stroke
- **HighBloodPressure**: Dummy Variable, 0 for no HBP, 1 for HBP
- **Diabetes**: Dummy Variable, 0 for Not Diagnosed, 1 for Diagnosed
- **TotalCholesterol**: Patient's cholesterol levels (mg/dL)
- **SystolicBloodPressure**: Patient's systolic BP (mmHg)
- **DiastolicBloodPressure**: Patient's diastolic BP (mmHg)
- **BMI**: Patient's BMI (kg/m^2)
- **BPM**: Patient's heart rate (beats per minute)
- **GlucoseLevel**: Patient's glucose level (mg/dL)

**Response**:

- **CoronaryHeartDisease**: Dummy Variable, 0 for Not at Risk, 1 for At Risk

---

# <center><b><font color = purple>Male</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/male chart breakdown-1.png" style="display: block; margin: auto;" />

- More females than males
- *0* observations with missing data

---

# <center><b><font color = purple>Age</font></b></center>

.pull-left[

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/age chart breakdown-1.png" style="display: block; margin: auto;" />

- Split observations into age groups of 10 years
- Very few in 70's category, thus combined with 60's
- *0* observations with missing data

]

.pull-right[

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/age histogram breakdown-1.png" style="display: block; margin: auto;" />

- Appears to be approximately Normal

]

---

# <center><b><font color = purple>Education</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/education chart breakdown-1.png" style="display: block; margin: auto;" />

- Most patients did not attend college
- *105* observations with missing data

---

# <center><b><font color = purple>Smoking Status</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/smoking status chart breakdown-1.png" style="display: block; margin: auto;" />

- Almost an even split
- *0* observations with missing data

---

# <center><b><font color = purple>Cigarettes Per Day</font></b></center>

.pull-left[

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/cigarettes chart breakdown-1.png" style="display: block; margin: auto;" />

- Split into groups of 10 cigarettes
- Most patients did not smoke
- Of smokers, most fall into *11-20* category
- *29* observations with missing data

]

.pull-right[

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/cigarettes histogram breakdown-1.png"
style="display: block; margin: auto;" />

- Skewed plot due to half of the sample not smoking
- Some high outliers in 60's

]

---

# <center><b><font color = purple>Blood Pressure Medication</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/blood pressure meds barchart breakdown-1.png" style="display: block; margin: auto;" />

- Large majority were not prescribed
- *53* observations with missing data

---

# <center><b><font color = purple>Stroke History</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/stroke history barchart breakdown-1.png" style="display: block; margin: auto;" />

- Large majority did not have a previous stroke
- *0* observations with missing data

---

# <center><b><font color = purple>High Blood Pressure</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/blood pressure barchart breakdown-1.png" style="display: block; margin: auto;" />

- About two-thirds do not have High Blood Pressure
- *0* observations with missing data

---

# <center><b><font color = purple>Diabetes</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/diabetes barchart breakdown-1.png" style="display: block; margin: auto;" />

- Large majority have not been diagnosed
- *0* observations with missing data

---

# <center><b><font color = purple>Total Cholesterol</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/cholesterol histogram breakdown-1.png" style="display: block; margin: auto;" />

- Approximately Normal histogram
- Some high outliers in 600's range
- *50* observations with missing data

---

# <center><b><font color = purple>Systolic Blood Pressure</font></b></center>

<img
src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/systolic histogram breakdown-1.png" style="display: block; margin: auto;" />

- Normal Histogram
- *0* observations with missing data

---

# <center><b><font color = purple>Diastolic Blood Pressure</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/diastolic histogram breakdown-1.png" style="display: block; margin: auto;" />

- Normal Histogram
- *0* observations with missing data

---

# <center><b><font color = purple>BMI</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/bmi histogram breakdown-1.png" style="display: block; margin: auto;" />

- Normal Histogram
- Few high outliers in 50's and 60's
- *19* observations with missing data

---

# <center><b><font color = purple>BPM</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/bpm histogram breakdown-1.png" style="display: block; margin: auto;" />

- Normal Histogram
- *1* observation with missing data

---

# <center><b><font color = purple>Glucose</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/glucose histogram breakdown-1.png" style="display: block; margin: auto;" />

- Normal Histogram, potential right tail
- Some high outliers in 200's and 300's
- *388* observations with missing data

---

# <center><b><font color = purple>Coronary Heart Disease</font></b></center>

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/coronary barchart breakdown-1.png" style="display: block; margin: auto;" />

- Most patients were not at risk of Coronary Heart Disease
- *0* observations with missing data

---
class: inverse

# <center><b><font color = gold>Missing Data</font></b></center>

- **Education**, **CigarettesPerDay**, **BloodPressureMeds**, **TotalCholesterol**, **BMI**, **BPM**, and **GlucoseLevel** all have missing values
- Possible reasons: patients did not want to disclose information, or medical professionals did not run the test

**Our Attempt**

- Create a temporary data set with only complete observations, then run a regression to create a predictive model for the missing values
- Variables were not highly correlated, so the linear regression was not accurate
- Instead, replace missing values with the mode of each variable

---
class: inverse

# <center><b><font color = gold>Variable Recoding</font></b></center>

- **Male**: 0 - Female | 1 - Male
- **Age**: 30 to 39 - 30's | 40 to 49 - 40's | 50 to 59 - 50's | 60+ - 60+
- **Education**: 0 - No College | 1-3 - Some College | 4 - Graduated College
- **SmokingStatus**: 0 - No | 1 - Yes
- **CigarettesPerDay**: 0 - 0 | 1-10 - 1-10 | 11-20 - 11-20 | 21-30 - 21-30 | 30+ - 30+
- **BloodPressureMedication**: 0 - Not Prescribed | 1 - Prescribed
- **StrokeHistory**: 0 - No | 1 - Yes
- **HighBloodPressure**: 0 - No | 1 - Yes
- **Diabetes**: 0 - Not Diagnosed | 1 - Diagnosed
- **CHDRisk**: 0 - Not at Risk | 1 - At Risk

---

# <center><b><font color = purple>PCA's</font></b></center>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/Final%20Project%20PCA's.png" style="display: block; margin: auto;" />

.pull-left[

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/pca hist 1-1.png" style="display: block; margin: auto;" />

]

.pull-right[

<img src="data:image/png;base64,#Logistic-Regression-on-10-Year-Heart-Disease-Risk-Slideshow_files/figure-html/pca hist 2-1.png" style="display: block; margin: auto;" />

]

---
class: inverse, middle

# <center><b><font color = gold>Our Final Dataset is Complete!</font></b></center>

- Replaced Missing Values
- Re-coded Variables
- Added PCA 1 and PCA 2

**Analysis**:

- Since **CHDRisk** is a
dummy variable, the analysis will be a *logistic regression analysis*

---

# <center><b><font color = purple>Model 1: All Predictors</font></b></center>

<font color = purple><b>CHDRisk</b> = <b>Male</b> + <b>Age</b> + <b>Education</b> + <b>SmokingStatus</b> + <b>CigarettesPerDay</b> + <b>BloodPressureMedication</b> + <b>StrokeHistory</b> + <b>HighBloodPressure</b> + <b>Diabetes</b> + <b>TotalCholesterol</b> + <b>SystolicBloodPressure</b> + <b>DiastolicBloodPressure</b> + <b>BMI</b> + <b>BPM</b> + <b>GlucoseLevel</b> + <b>PCA1</b> + <b>PCA2</b></font>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/Final%20Project%20Model%201%20Resize%201.png" style="display: block; margin: auto;" />

---

# <center><b><font color = purple>Model 2: Categorical Predictors</font></b></center>

<font color = purple><b>CHDRisk</b> = <b>Male</b> + <b>Age</b> + <b>Education</b> + <b>SmokingStatus</b> + <b>CigarettesPerDay</b> + <b>BloodPressureMedication</b> + <b>StrokeHistory</b> + <b>HighBloodPressure</b> + <b>Diabetes</b> + <b>PCA1</b> + <b>PCA2</b></font>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/Final%20Project%20Model%202.png" style="display: block; margin: auto;" />

---

# <center><b><font color = purple>Model 1 vs Model 2</font></b></center>

Because *Model 2* is nested within *Model 1*, we run a likelihood ratio test

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/Final%20Project%20Model%201%20v%20Model%202.png" style="display: block; margin: auto;" />

- Statistically significant p-value
- *Model 1* performs better than *Model 2*
- Numerical Variables are important for Coronary Heart Disease diagnoses

---

# <center><b><font color = purple>Model 3: Numerical Predictors</font></b></center>

<font color = purple><b>CHDRisk</b> = <b>TotalCholesterol</b> + <b>SystolicBloodPressure</b> + <b>DiastolicBloodPressure</b> + <b>BMI</b> +
<b>BPM</b> + <b>GlucoseLevel</b> + <b>PCA1</b> + <b>PCA2</b></font>

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/Final%20Project%20Model%203.png" style="display: block; margin: auto;" />

---

# <center><b><font color = purple>Model 1 vs Model 3</font></b></center>

Because *Model 3* is nested within *Model 1*, we run a likelihood ratio test

<img src="data:image/png;base64,#https://raw.githubusercontent.com/EPKeep32/ParkerSTA490/main/Images/Final%20Project%20Model%201%20v%20Model%203.png" style="display: block; margin: auto;" />

- Statistically significant p-value
- *Model 1* performs better than *Model 3*
- Categorical Variables are important for Coronary Heart Disease diagnoses

---
class: inverse

# <center><b><font color = gold>Final Model</font></b></center>

<center><i>Model 1</i> is the best model!</center>

log-odds(**CHDRisk**) = 1.271 − 0.4996(**Male.Male**)
− 0.1659(**Age.40's**) − 0.3964(**Age.50's**) − 0.5961(**Age.60+**)
+ 0.1894(**Education.SomeCollege**) + 0.4123(**Education.GraduatedCollege**)
− 0.5369(**SmokingStatus.Yes**)
− 0.1485(**CigarettesPerDay.1to10**) − 0.4599(**CigarettesPerDay.11to20**) − 0.712(**CigarettesPerDay.21to30**) − 1.051(**CigarettesPerDay.30+**)
− 1.23(**BloodPressureMedication.Prescribed**) − 1.491(**StrokeHistory.Yes**) − 0.6388(**HighBloodPressure.Yes**) − 0.8538(**Diabetes.Diagnosed**)
− 0.000136(**TotalCholesterol**) − 0.0002852(**SystolicBloodPressure**) + 0.0003342(**DiastolicBloodPressure**)
+ 0.0003147(**BMI**) − 0.00007206(**BPM**) + 0.0002211(**Glucose**)
− 0.1139(**PCA1**) + 0.7346(**PCA2**)

- For categorical variables, each variable has several indicator terms. If the patient is in their 50's, use only the **Age.50's** term and ignore **Age.40's** and **Age.60+**. A patient in their 30's (the baseline category) uses none of the **Age** terms
- The model estimates the odds of a patient being diagnosed with Coronary Heart Disease: exponentiating the log-odds above gives the odds

---
class: inverse, middle

# <center><b><font color = gold>Conclusion and Discussion</font></b></center>

- Further analysis can be done with interaction and hierarchical variables
- Model is limited to the sample (both time and location)
- Model might be out of date given the time period of the sample and the medical advances since 1948 and 1971

## <center><b>Any Questions?</b></center>
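
---

# <center><b><font color = purple>Appendix: From the Equation to Odds</font></b></center>

The Final Model slide says the equation is used to find a patient's odds. The sketch below shows one way that calculation could look, assuming the right-hand side of the equation is the log-odds of a standard logistic regression. The coefficients are copied from that slide; the function names, the patient's values, and the PCA scores of 0 are hypothetical placeholders, not taken from the data. Which outcome the odds refer to depends on how **CHDRisk** was coded when the model was fit, so read the output with that in mind.

```python
import math

# Intercept and coefficients copied from the Final Model slide.
INTERCEPT = 1.271
COEFFICIENTS = {
    "Male.Male": -0.4996,
    "Age.40's": -0.1659,
    "Age.50's": -0.3964,
    "Age.60+": -0.5961,
    "Education.SomeCollege": 0.1894,
    "Education.GraduatedCollege": 0.4123,
    "SmokingStatus.Yes": -0.5369,
    "CigarettesPerDay.1to10": -0.1485,
    "CigarettesPerDay.11to20": -0.4599,
    "CigarettesPerDay.21to30": -0.712,
    "CigarettesPerDay.30+": -1.051,
    "BloodPressureMedication.Prescribed": -1.23,
    "StrokeHistory.Yes": -1.491,
    "HighBloodPressure.Yes": -0.6388,
    "Diabetes.Diagnosed": -0.8538,
    "TotalCholesterol": -0.000136,
    "SystolicBloodPressure": -0.0002852,
    "DiastolicBloodPressure": 0.0003342,
    "BMI": 0.0003147,
    "BPM": -0.00007206,
    "Glucose": 0.0002211,
    "PCA1": -0.1139,
    "PCA2": 0.7346,
}


def linear_predictor(patient):
    """Intercept plus sum of coefficient * value; terms left out count as 0."""
    unknown = set(patient) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"terms not in the model: {sorted(unknown)}")
    return INTERCEPT + sum(COEFFICIENTS[name] * value
                           for name, value in patient.items())


def odds_and_probability(log_odds):
    """Convert log-odds to (odds, probability) with the logistic transform."""
    odds = math.exp(log_odds)
    return odds, odds / (1.0 + odds)


# Hypothetical patient (illustrative values only). Indicator terms are 1 when
# they apply; baseline categories (e.g. Age 30's, No College) are omitted.
patient = {
    "Male.Male": 1,
    "Age.50's": 1,
    "SmokingStatus.Yes": 1,
    "CigarettesPerDay.11to20": 1,
    "TotalCholesterol": 240,
    "SystolicBloodPressure": 130,
    "DiastolicBloodPressure": 85,
    "BMI": 27.5,
    "BPM": 75,
    "Glucose": 80,
    "PCA1": 0.0,  # placeholder: real scores come from the PCA step
    "PCA2": 0.0,  # placeholder
}

log_odds = linear_predictor(patient)
odds, probability = odds_and_probability(log_odds)
print(f"log-odds={log_odds:.4f}  odds={odds:.4f}  probability={probability:.4f}")
```

Only one indicator per categorical variable is set to 1, which matches the "use only the term for the patient's category" rule on the Final Model slide.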