Peter Phung, Ahmed Elsaeyed, Coffy Andrews, Alec McCabe, Krutika Patel
2022-12-06
This project looks at a data set created by the Framingham Heart Study of 1948.
The data has been used to develop the Framingham Risk Score, an algorithm that estimates the likelihood of a person developing cardiovascular disease within a specified period of time.
We will be using the data set to train and test predictive models on their ability to estimate the risk of cardiovascular disease.
Heart disease is the leading cause of death in the United States. It accounts for over one fifth of all deaths per year, with over 600,000 reported deaths in 2020 alone (https://www.cdc.gov/nchs/products/databriefs/db427.htm). To put this further into perspective, roughly one person dies every 34 seconds from cardiovascular disease in the US. For these reasons, considerable effort has gone into the study of treatments, medicines, preventative measures and monitoring practices related to heart disease. As it turns out, applied data science techniques and the use of real-world data have proven to be highly effective tools in combating this pressing threat.
Our project objective is to develop a predictive model that classifies whether a patient will develop coronary heart disease in the future, based on select personal attributes and lifestyle factors. Such a model would help researchers and doctors identify at-risk patients and prevent future disease by addressing current risk factors.
Our data is sourced from the Framingham Heart Study, which was initiated by the United States Public Health Service in 1948 and whose origin is linked to the cardiovascular health of President Franklin D. Roosevelt. The initial cohort consisted of 5,209 participants between the ages of 30 and 59. Participants were given questionnaires and exams every two years, and these expanded over time. The study tracked a large cohort of patients over time and was continued for three generations of the original participants.
The Framingham Heart Study was initiated by the United States Public Health Service in 1948.
The origin is linked to the cardiovascular health of President Franklin D. Roosevelt.
It investigates the epidemiology and risk factors that contribute to a person’s cardiovascular health.
The study is set in the town of Framingham, MA.
The initial cohort consisted of 5,209 participants from the town between the ages of 30 and 59.
Patients were given questionnaires and exams every two years, which expanded over time.
The study tracked a large cohort of patients over time and was continued for three generations of the original participants.
The study collected measurements of 16 variables from the patients in order to create its database.
Variables: Male, Age, Education, Current Smoker, Cigs Per Day, BP Medications, Prevalent Stroke, Prevalent Hypertension, Diabetes, Total Cholesterol, Systolic Blood Pressure, Diastolic Blood Pressure, BMI, Heart Rate, Glucose, CHD in 10 years
Most of the variables are categorized by their role in a person’s cardiovascular health.
Demographic Risk Factors:
Behavioral Risk Factors:
Medical History Factors:
Physical Exam Risk:
| Variable | Description |
|---|---|
| Sex | Participant Sex (Male or Female) |
| Age | Age at exam (years) |
| Education | Attained Education |
| Current Smoker | Whether or not the patient is a current smoker |
| Cigs Per Day | The number of cigarettes that the person smoked on average in one day |
| BP Meds | Whether or not the patient was on blood pressure medication |
| Prevalent Stroke | Whether or not the patient had previously had a stroke |
| Prevalent Hyp | Whether or not the patient was hypertensive |
| Diabetes | Whether or not the patient had diabetes |
| Tot Chol | Total cholesterol |
| Sys BP | Systolic blood pressure |
| Dia BP | Diastolic blood pressure |
| BMI | Body Mass Index |
| Heart Rate | Heart rate |
| Glucose | Glucose level |
| Ten Year CHD | 10-year risk of coronary heart disease (target: 1 = Yes, 2 = No) |
| Model | Precision | Recall | AIC | AUC | F-score | Accuracy | Error |
|---|---|---|---|---|---|---|---|
| Bin. Log. w/ Original Data | 0.68 | 0.11 | 1939.25 | 0.71 | 0.195 | 0.86 | 0.14 |
| Bin. Log. w/ Modified Data | 0.67 | 0.02 | 1507.5 | 0.7 | 0.033 | 0.87 | 0.13 |
| Step AIC Bin. Log. w/ Original Data | 0.7 | 0.11 | 1928.55 | 0.71 | 0.196 | 0.86 | 0.14 |
| Step AIC Bin. Log. w/ Modified Data | 0.67 | 0.02 | 1496.61 | 0.69 | 0.033 | 0.87 | 0.13 |
The figures reveal that precision, recall, AUC, and F-score are higher when the original data was used as opposed to the modified data, whereas accuracy is marginally higher with the modified data (0.87 versus 0.86).
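To make the relationship between these metrics concrete, the following is a minimal sketch of how precision, recall, F-score, accuracy, and error follow from the four confusion-matrix counts. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; they are not the confusion matrices from this report.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute summary metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Guard against division by zero when a model never predicts "positive".
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
        "error": 1 - accuracy,
    }


# Hypothetical counts for an imbalanced test set (NOT the study's data):
# 150 positives, 850 negatives, and a model that rarely predicts "positive".
metrics = classification_metrics(tp=15, fp=7, fn=135, tn=843)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With counts like these, accuracy stays high because most cases are negative, even though recall is low. This is the same pattern the table shows: accuracy near 0.86 alongside recall of 0.11 or less.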
The figures also reveal that the step-AIC binary logistic regression model using the original data has the second highest AIC (1928.55), but it also has the highest precision (0.70) and F-score (0.196) of all the models.
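For reference, AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood; lower values are better. The sketch below computes it for a binary outcome, assuming predicted probabilities are already available. The helper names and the toy values are hypothetical and are not taken from the report's models.

```python
import math


def bernoulli_log_likelihood(y_true, p_pred):
    """Log-likelihood of 0/1 outcomes given predicted probabilities."""
    eps = 1e-12  # keep log() away from 0 and 1
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Toy outcomes and predicted probabilities (illustrative only).
y = [1, 0, 0, 1, 0]
p = [0.8, 0.2, 0.1, 0.6, 0.3]
ll = bernoulli_log_likelihood(y, p)
print(f"log-likelihood: {ll:.4f}, AIC: {aic(ll, n_params=2):.4f}")
```

Because the log-likelihood sums over observations, AIC values are directly comparable only between models fitted to the same rows. The gap between the original-data and modified-data AIC values in the table may therefore partly reflect a difference in the number of observations rather than model quality.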
In this paper, four binary logistic regression models were generated in order to predict the 10-year risk of coronary heart disease.
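As a reminder of what a binary logistic regression model does, here is a minimal sketch that fits one with plain batch gradient descent on a toy data set. This is an illustration only: the report does not state its fitting routine, statistical packages typically use iteratively reweighted least squares rather than gradient descent, and the function names, learning rate, and data below are all assumptions.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, xi):
    """Predicted probability of the positive class for one row."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, xi)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j, x in enumerate(xi):
                grad_w[j] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data: one standardized predictor and a 0/1 outcome (illustrative only).
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
weights, bias = fit_logistic(X, y)
print("P(y=1 | x=2.0)  =", round(predict_proba(weights, bias, [2.0]), 3))
print("P(y=1 | x=-2.0) =", round(predict_proba(weights, bias, [-2.0]), 3))
```

A case is classified as positive when its predicted probability exceeds a chosen threshold (commonly 0.5), which is where metrics such as precision and recall come from.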
The results of the analysis indicate that transforming the skewed variables toward a normal distribution and removing outliers led to worse performance on most metrics (precision, recall, F-score, and AUC) than fitting the same models to the original, unaltered data set. That said, the AUCs of all the models were nearly the same.
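The two preprocessing steps mentioned above can be sketched as follows. The report does not say which transformation or outlier rule was used, so this sketch assumes a log(1 + x) transform for right-skewed values and the common 1.5 × IQR rule for outliers; the function names and the sample values are hypothetical.

```python
import math
import statistics


def log_transform(values):
    """Apply log(1 + x) to reduce right skew in non-negative values."""
    return [math.log1p(v) for v in values]


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical glucose-like readings with one extreme value.
readings = [70, 75, 78, 80, 82, 85, 88, 90, 95, 300]
kept = drop_outliers(readings)
print("kept:", kept)
print("log-transformed:", [round(v, 2) for v in log_transform(kept)])
```

In a medical data set, extreme values such as very high glucose may belong to exactly the patients most at risk. Removing them can discard informative positive cases, which is one plausible reason for the lower recall on the modified data.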
The final model that was selected in order to predict the 10-year risk of coronary heart disease was the best performing and the most parsimonious.
We were able to test the validity of this model on the held-out testing set that was created when the data was split into training and testing sets.
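A minimal sketch of such a split is shown below. The split ratio and random seed used in the report are not stated, so the 80/20 proportion, the seed, and the function name `train_test_split` are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for participant records: 100 row identifiers.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "testing rows")
```

The model is fitted on the training rows only, and the metrics are then computed on the testing rows, so the reported performance reflects data the model has not seen.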