Final Project: Cardiovascular Disease
Abstract
Cardiovascular disease is a major concern around the globe. The issue is so large that it is the leading cause of death globally, according to the WHO [1]. Risk factors for cardiovascular disease include age, poor diet, the lack of exercise, high blood pressure, high cholesterol, and smoking or drinking excessively. This research aims to identify the risk factors that are most associated with determining cardiovascular disease. To answer the research question, 2 logistic regression models were implemented. The first logistic regression model included 8 variables associated with the risk factors of cardiovascular disease (Age, Systolic Blood Pressure, Cholesterol, Activity level, Alcohol Abuse, Glucose level, and Smoking level) and generated a model accuracy of 72.86%. The second logistic regression model included only 1 risk factor of cardiovascular disease (Systolic Blood Pressure) and was able to generate a model accuracy of 71.25%. It can be concluded that High Blood Pressure is most associated with cardiovascular disease. Steps that can be taken to reduce one’s blood pressure, and in turn cardiovascular disease chance, are to: lose weight, eat healthy, exercise regularly, and reduce stress[5].
Libraries
library(tidyverse)
library(GGally)
library(DT)
library(vtable)
library(ggpubr)
library(MASS)
library(caret)Part 1 - Introduction
Cardiovascular disease is a serious problem for many Americans. According to the CDC, “heart disease is the leading cause of death in the United States” [2]. Though this is a stark statistic, there are steps that can be taken to reduce one’s likelihood of contracting cardiovascular disease. A model that can predict heart disease can be leveraged to inform patients on which factors are most associated with heart disease.
- Research Question:
- Are any of the following variables factors that predict cardiovascular disease?
- gender
- age
- body weight
- body height
- blood pressure
- cholesterol
- smoking
- drinking alcohol
- activity level
- How effective is the model at predicting if a person is suffering from cardiovascular disease?
- Are any of the following variables factors that predict cardiovascular disease?
Part 2 - Data
The data source was found on Kaggle.com. The features in the dataset pertain to factual information collected from the patient, subjective information reported by the patient and results from medical examinations of the patient. All dataset values were collected at the moment of medical examination.
Data Source: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
options(scipen=10000)
# load data
url <- "https://raw.githubusercontent.com/SaneSky109/DATA606/main/Data_Project/Data/cardio_train.csv"
cardio.data <- read.csv(url, sep = ";")
# remove unecessary column: id
cardio.data <- cardio.data[,-1]
# create factors
cardio.data$cardio <- factor(cardio.data$cardio)
cardio.data$gender <- factor(cardio.data$gender)
cardio.data$cholesterol <- factor(cardio.data$cholesterol)
cardio.data$gluc <- factor(cardio.data$gluc)
cardio.data$smoke <- factor(cardio.data$smoke)
cardio.data$alco <- factor(cardio.data$alco)
cardio.data$active <- factor(cardio.data$active)
# rename factor levels
levels(cardio.data$cardio) <- c("No", "Yes")
levels(cardio.data$gender) <- c("Female", "Male")
levels(cardio.data$cholesterol) <- c("Norm", "Higher", "Highest")
levels(cardio.data$gluc) <- c("Norm", "Higher", "Highest")
levels(cardio.data$smoke) <- c("No", "Yes")
levels(cardio.data$alco) <- c("No", "Yes")
levels(cardio.data$active) <- c("No", "Yes")
# transform age since it is in days
cardio.data$age <- round(cardio.data$age/365, 3)
# remove outliers of ap_hi
# I am assuming the that these measures are errors and
# I am just dropping them due to problems it will cause with modeling
# Highest pressure recorded in an individual was 370/360.(https://pubmed.ncbi.nlm.nih.gov/7741618/)
summary(cardio.data$ap_hi)
cardio.data <- cardio.data[cardio.data$ap_hi <= 370,]
cardio.data <- cardio.data[cardio.data$ap_hi > 50,]
summary(cardio.data$ap_hi)
# remove outliers of ap_lo
summary(cardio.data$ap_lo)
cardio.data <- cardio.data[cardio.data$ap_lo <= 360,]
cardio.data <- cardio.data[cardio.data$ap_lo > 25,]
summary(cardio.data$ap_lo)
cardio.data$bmi <- cardio.data$weight/((cardio.data$height/100)^2)Cases
After pre-processing the data, there are 68,985 cases and 12 features. Each observation represents a patient. The variables are listed in the table below (only showing first 50 observations).
Dependent Variable
The response variable, cardio, is a binary indicator for cardiovascular disease.
Independent Variables
The independent variables I aim to consider are:
age(quantitative): Age of patient in yearsgender(qualitative): Gender of patientheight(quantitative): Height of patient in cmweight(quantitative): Weight of patient in kgap_hi(quantitative): Systolic blood pressureap_lo(quantitative): Diastolic blood pressurecholesterol(qualitative): Cholesterol level of patientgluc(qualitative): Glucose level of patientsmoke(qualitative): Binary variable to determine if a patient smokesalco(qualitative): Binary variable to determine if a patient drinks alcoholactive(qualitative): Yes/No if patient is physically active
Some notes on specific variables:
genderis limited to a dichotomous identity (male or female)ap_hiandap_loseemed to have data values that were inaccurate. These problematic rows were eliminated.- I transformed
ageto be in terms of years rather than days.
Type of Study
This is an observational study.
Part 3 - Exploratory data analysis
Summary Statistics
summary(cardio.data)## age gender height weight
## Min. :29.58 Female:44795 Min. : 55.0 Min. : 11.00
## 1st Qu.:48.38 Male :23986 1st Qu.:159.0 1st Qu.: 65.00
## Median :53.98 Median :165.0 Median : 72.00
## Mean :53.33 Mean :164.4 Mean : 74.12
## 3rd Qu.:58.42 3rd Qu.:170.0 3rd Qu.: 82.00
## Max. :64.97 Max. :250.0 Max. :200.00
## ap_hi ap_lo cholesterol gluc smoke
## Min. : 60.0 Min. : 30.00 Norm :51581 Norm :58472 No :62728
## 1st Qu.:120.0 1st Qu.: 80.00 Higher : 9314 Higher : 5074 Yes: 6053
## Median :120.0 Median : 80.00 Highest: 7886 Highest: 5235
## Mean :126.6 Mean : 81.38
## 3rd Qu.:140.0 3rd Qu.: 90.00
## Max. :240.0 Max. :190.00
## alco active cardio bmi
## No :65092 No :13524 No :34741 Min. : 3.472
## Yes: 3689 Yes:55257 Yes:34040 1st Qu.: 23.875
## Median : 26.346
## Mean : 27.523
## 3rd Qu.: 30.119
## Max. :298.667
# numeric data
st(cardio.data[,c(1,3:6)], title = "Numeric Summary Statistics")| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| age | 68781 | 53.327 | 6.762 | 29.584 | 48.375 | 58.422 | 64.967 |
| height | 68781 | 164.362 | 8.185 | 55 | 159 | 170 | 250 |
| weight | 68781 | 74.123 | 14.331 | 11 | 65 | 82 | 200 |
| ap_hi | 68781 | 126.615 | 16.764 | 60 | 120 | 140 | 240 |
| ap_lo | 68781 | 81.378 | 9.688 | 30 | 80 | 90 | 190 |
# categorical data
st(cardio.data[,-c(1,3:6,8,13)], title = "Categorical Summary Statistics")| Variable | N | Percent |
|---|---|---|
| gender | 68781 | |
| … Female | 44795 | 65.1% |
| … Male | 23986 | 34.9% |
| cholesterol | 68781 | |
| … Norm | 51581 | 75% |
| … Higher | 9314 | 13.5% |
| … Highest | 7886 | 11.5% |
| smoke | 68781 | |
| … No | 62728 | 91.2% |
| … Yes | 6053 | 8.8% |
| alco | 68781 | |
| … No | 65092 | 94.6% |
| … Yes | 3689 | 5.4% |
| active | 68781 | |
| … No | 13524 | 19.7% |
| … Yes | 55257 | 80.3% |
| cardio | 68781 | |
| … No | 34741 | 50.5% |
| … Yes | 34040 | 49.5% |
Exploratory Visualizations
Looking for Multicolinearity
# anything above 0.7 generally indicates multicolinearity
ggpairs(cardio.data, columns = c(1,3:6))Since the variables ap_hi(Systolic blood pressure) and ap_lo(Diastolic blood pressure) are strongly correlated (\(0.697 \approx 0.7\)) I will only use ap_hi in my logistic regression analysis.
Histograms
ggplot(cardio.data, aes(x=age)) +
geom_histogram(binwidth=4, fill="#69b3a2", color="#e9ecef") +
ggtitle("Distribution: Age")Most patients are in their late 40s to early 60s
ggplot(cardio.data, aes(x=ap_hi)) +
geom_histogram(binwidth=10, fill="#69b3a2", color="#e9ecef") +
ggtitle("Distribution: Systolic Blood Pressure")Most data points for ap_hi fall 115 mm Hg and 125 mm Hg.
Information from the CDC [4] claims that people:
Are NORMAL blood pressure at values less than 120 mm Hg
Are AT RISK of high blood pressure between the values 120 -139 mm Hg
Have HIGH BLOOD PRESSURE at 140 mm Hg or higher
In context of this dataset:
new.data <- cardio.data
new.data$norm.bp <- ifelse(new.data$ap_hi <= 120, 1, 0)
new.data$at.risk <- ifelse(new.data$ap_hi > 120 & new.data$ap_hi < 140, 1, 0)
new.data$high.bp <- ifelse(new.data$ap_hi >= 140, 1, 0)
new.data %>%
summarise(Normal = sum(norm.bp)/68781, At_Risk = sum(at.risk)/68781, High = sum(high.bp)/68781)## Normal At_Risk High
## 1 0.5887382 0.1425684 0.2686934
ggplot(cardio.data, aes(x=weight)) +
geom_histogram(binwidth=15, fill="#69b3a2", color="#e9ecef") +
ggtitle("Weight")Most of the data falls within 50 to 100 kilograms (110 to 220 lbs)
Boxplots
# Age
age1 <- ggplot(cardio.data, aes(x = cardio, y = age)) +
geom_boxplot() +
ggtitle("Boxplot: Age")
# Height
height1 <- ggplot(cardio.data, aes(x = cardio, y = height)) +
geom_boxplot() +
ggtitle("Boxplot: Height")
# Weight
weight1 <- ggplot(cardio.data, aes(x = cardio, y = weight)) +
geom_boxplot() +
ggtitle("Boxplot: Weight")
# Systolic blood pressure
sbp <- ggplot(cardio.data, aes(x = cardio, y = ap_hi)) +
geom_boxplot() +
ggtitle("Boxplot: Systolic blood pressure")
# Put plots together
ggarrange(age1, height1, weight1, sbp,
labels = c("A","B","C","D"),
ncol = 2, nrow = 2)- Age looks to have an affect on whether a patient has cardiovascular disease due to boxplots in plot A differing across groups.
- Height does not seem to have significant differences across groups, therefore it will not be included in final model.
- Weight does not seem to have significant differences across groups, therefore it will not be included in final model.
- Systolic Blood Pressure differs between groups, therefore it will remain in the logistic regression model.
Part 4 - Logistic Regression
# Split the data into training and test set
set.seed(12345)
training.samples <- cardio.data$cardio %>%
createDataPartition(p = 0.7, list = FALSE)
train.data <- cardio.data[training.samples, ]
test.data <- cardio.data[-training.samples, ]
summary(train.data$cardio)## No Yes
## 24319 23828
summary(test.data$cardio)## No Yes
## 10422 10212
Both training and testing datasets have similar cardio factor level distributions.
Using Step-wise Variable selection to create Logistic Regression Model
A model was programmed to generate the highest AIC given the full range of variables, not including height, weight and ap_lo. Step-wise selection was used to create the model. The step wise selection process eliminated gender from the model.
# stepwise variable selection
full.model <- glm(cardio ~ age + ap_hi + cholesterol + active + alco + gluc + smoke + gender, data = train.data, family = binomial)
step.model <- full.model %>% stepAIC(trace = FALSE)
summary(step.model)##
## Call:
## glm(formula = cardio ~ age + ap_hi + cholesterol + active + alco +
## gluc + smoke, family = binomial, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8227 -0.9269 -0.3425 0.9423 2.6968
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.5380014 0.1319109 -79.887 < 0.0000000000000002 ***
## age 0.0503634 0.0016042 31.395 < 0.0000000000000002 ***
## ap_hi 0.0627467 0.0008314 75.476 < 0.0000000000000002 ***
## cholesterolHigher 0.3687159 0.0324998 11.345 < 0.0000000000000002 ***
## cholesterolHighest 1.1567672 0.0424940 27.222 < 0.0000000000000002 ***
## activeYes -0.2402398 0.0260102 -9.236 < 0.0000000000000002 ***
## alcoYes -0.1294908 0.0504180 -2.568 0.0102 *
## glucHigher -0.0034201 0.0427889 -0.080 0.9363
## glucHighest -0.3538718 0.0471929 -7.498 0.0000000000000646 ***
## smokeYes -0.1824701 0.0397941 -4.585 0.0000045321595044 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66741 on 48146 degrees of freedom
## Residual deviance: 54248 on 48137 degrees of freedom
## AIC: 54268
##
## Number of Fisher Scoring iterations: 4
In the summary output for the step-wise model, all variables, except glucHigher, are statistically significant. The variables with the largest log odds are:
- Categorical:
cholesterol- a person who has
cholesterolHigher, rather thancholesterolNormwill see an increase in the log odds ofcardioby 0.37 - a person who has
cholesterolHighest, rather thancholesterolHigherwill see an increase in the log odds ofcardioby 1.16
- a person who has
active- a person who is
activeYeswill see a decrease in the log odds ofcardioby 0.24
- a person who is
gluc- a person who has
glucHighest, rather thanglucHigherwill see a decrease in the log odds ofcardioby 0.36
- a person who has
- Numeric:
ap_hi- Every millimeter of mercury in
ap_hiincreases the log odds ofcardioby 0.06
- Every millimeter of mercury in
age- Every year increase in
ageincreases the log odds ofcardioby 0.05
- Every year increase in
Analyze the predicted results of model
# Create confusion matrix for model results
probabilities2 <- full.model %>% predict(test.data, type = "response")
predicted.classes2 <- ifelse(probabilities2 > 0.5, "Yes", "No")
model.results2 <- full.model %>% predict(test.data, type = "response")
testing.results2 <- test.data$cardio
model.vs.testing2 <- table(predicted.classes2, testing.results2)
colnames(model.vs.testing2) <- c("No", "Yes")
rownames(model.vs.testing2) <- c("No", "Yes")
confusionMatrix(model.vs.testing2) ## Confusion Matrix and Statistics
##
## testing.results2
## predicted.classes2 No Yes
## No 8332 3511
## Yes 2090 6701
##
## Accuracy : 0.7286
## 95% CI : (0.7224, 0.7346)
## No Information Rate : 0.5051
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4563
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.7995
## Specificity : 0.6562
## Pos Pred Value : 0.7035
## Neg Pred Value : 0.7623
## Prevalence : 0.5051
## Detection Rate : 0.4038
## Detection Prevalence : 0.5740
## Balanced Accuracy : 0.7278
##
## 'Positive' Class : No
##
Logistic Regression Model using fewer variables
To reduce the number of variables in the model, I only considered ap_hi for predicting cardio.
# cardio ~ ap_hi
final.model <- glm(cardio ~ ap_hi, data = train.data, family = binomial)
summary(final.model)##
## Call:
## glm(formula = cardio ~ ap_hi, family = binomial, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9610 -1.0041 -0.3993 1.0665 2.7884
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.6892597 0.1031688 -84.22 <0.0000000000000002 ***
## ap_hi 0.0688908 0.0008227 83.74 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66741 on 48146 degrees of freedom
## Residual deviance: 56566 on 48145 degrees of freedom
## AIC: 56570
##
## Number of Fisher Scoring iterations: 4
In the summary output above, ap_hi is statistically significant.
ap_hi- Every millimeter of mercury in
ap_hiincreases the log odds ofcardioby 0.07
- Every millimeter of mercury in
Analyze the predicted results of model
probabilities1 <- final.model %>% predict(test.data, type = "response")
predicted.classes1 <- ifelse(probabilities1 > 0.5, "Yes", "No")
model.results1 <- final.model %>% predict(test.data, type = "response")
testing.results1 <- test.data$cardio
model.vs.testing1 <- table(predicted.classes1, testing.results1)
colnames(model.vs.testing1) <- c("No", "Yes")
rownames(model.vs.testing1) <- c("No", "Yes")
confusionMatrix(model.vs.testing1) ## Confusion Matrix and Statistics
##
## testing.results1
## predicted.classes1 No Yes
## No 8387 3897
## Yes 2035 6315
##
## Accuracy : 0.7125
## 95% CI : (0.7063, 0.7187)
## No Information Rate : 0.5051
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4239
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.8047
## Specificity : 0.6184
## Pos Pred Value : 0.6828
## Neg Pred Value : 0.7563
## Prevalence : 0.5051
## Detection Rate : 0.4065
## Detection Prevalence : 0.5953
## Balanced Accuracy : 0.7116
##
## 'Positive' Class : No
##
Part 5 - Conclusion
After analyzing the data from the cardiovascular disease dataset, the data shows that Systolic Blood Pressure is a significant predictor in determining if a patient suffers from cardiovascular disease. I was able to generate a model that had 71.25% accuracy in predicting of an unknown patient suffers from cardiovascular disease.
I also created a model to assess the predicting power with all variables from the dataset in the model, except height, weight, diastolic blood pressure, and gender. The model was able to reach an accuracy level of 72.86% in predicting if an unknown patient suffers from cardiovascular disease. This overall performance is not much better than that of the simpler model, therefore the variables in the simple model are the more important variables for predicting cardiovascular disease.
It can be concluded that Blood Pressure is the most associated factor in the dataset for predicting cardiovascular disease.
References
https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
https://www.cdc.gov/heartdisease/index.htm#:~:text=Heart%20disease%20is%20the%20leading,can%20lead%20to%20heart%20attack.
https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
https://www.webmd.com/hypertension-high-blood-pressure/how-to-lower-blood-pressure