Final Project: Cardiovascular Disease
Abstract
Cardiovascular disease is a major concern around the globe. The issue is so large that it is the leading cause of death globally, according to the WHO [1]. Risk factors for cardiovascular disease include age, poor diet, the lack of exercise, high blood pressure, high cholesterol, and smoking or drinking excessively. This research aims to identify the risk factors that are most associated with determining cardiovascular disease. To answer the research question, 2 logistic regression models were implemented. The first logistic regression model included 8 variables associated with the risk factors of cardiovascular disease (Age, Systolic Blood Pressure, Cholesterol, Activity level, Alcohol Abuse, Glucose level, and Smoking level) and generated a model accuracy of 72.86%. The second logistic regression model included only 1 risk factor of cardiovascular disease (Systolic Blood Pressure) and was able to generate a model accuracy of 71.25%. It can be concluded that High Blood Pressure is most associated with cardiovascular disease. Steps that can be taken to reduce one’s blood pressure, and in turn cardiovascular disease chance, are to: lose weight, eat healthy, exercise regularly, and reduce stress[5].
Libraries
library(tidyverse)
library(GGally)
library(DT)
library(vtable)
library(ggpubr)
library(MASS)
library(caret)
Part 1 - Introduction
Cardiovascular disease is a serious problem for many Americans. According to the CDC, “heart disease is the leading cause of death in the United States” [2]. Though this is a stark statistic, there are steps that can be taken to reduce one’s likelihood of contracting cardiovascular disease. A model that can predict heart disease can be leveraged to inform patients on which factors are most associated with heart disease.
- Research Question:
- Are any of the following variables factors that predict cardiovascular disease?
- gender
- age
- body weight
- body height
- blood pressure
- cholesterol
- smoking
- drinking alcohol
- activity level
- How effective is the model at predicting if a person is suffering from cardiovascular disease?
- Are any of the following variables factors that predict cardiovascular disease?
Part 2 - Data
The data source was found on Kaggle.com. The features in the dataset pertain to factual information collected from the patient, subjective information reported by the patient and results from medical examinations of the patient. All dataset values were collected at the moment of medical examination.
Data Source: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
options(scipen=10000)
# load data
<- "https://raw.githubusercontent.com/SaneSky109/DATA606/main/Data_Project/Data/cardio_train.csv"
url <- read.csv(url, sep = ";")
cardio.data
# remove unecessary column: id
<- cardio.data[,-1]
cardio.data
# create factors
$cardio <- factor(cardio.data$cardio)
cardio.data$gender <- factor(cardio.data$gender)
cardio.data$cholesterol <- factor(cardio.data$cholesterol)
cardio.data$gluc <- factor(cardio.data$gluc)
cardio.data$smoke <- factor(cardio.data$smoke)
cardio.data$alco <- factor(cardio.data$alco)
cardio.data$active <- factor(cardio.data$active)
cardio.data
# rename factor levels
levels(cardio.data$cardio) <- c("No", "Yes")
levels(cardio.data$gender) <- c("Female", "Male")
levels(cardio.data$cholesterol) <- c("Norm", "Higher", "Highest")
levels(cardio.data$gluc) <- c("Norm", "Higher", "Highest")
levels(cardio.data$smoke) <- c("No", "Yes")
levels(cardio.data$alco) <- c("No", "Yes")
levels(cardio.data$active) <- c("No", "Yes")
# transform age since it is in days
$age <- round(cardio.data$age/365, 3)
cardio.data# remove outliers of ap_hi
# I am assuming the that these measures are errors and
# I am just dropping them due to problems it will cause with modeling
# Highest pressure recorded in an individual was 370/360.(https://pubmed.ncbi.nlm.nih.gov/7741618/)
summary(cardio.data$ap_hi)
<- cardio.data[cardio.data$ap_hi <= 370,]
cardio.data <- cardio.data[cardio.data$ap_hi > 50,]
cardio.data
summary(cardio.data$ap_hi)
# remove outliers of ap_lo
summary(cardio.data$ap_lo)
<- cardio.data[cardio.data$ap_lo <= 360,]
cardio.data <- cardio.data[cardio.data$ap_lo > 25,]
cardio.data
summary(cardio.data$ap_lo)
$bmi <- cardio.data$weight/((cardio.data$height/100)^2) cardio.data
Cases
After pre-processing the data, there are 68,985 cases and 12 features. Each observation represents a patient. The variables are listed in the table below (only showing first 50 observations).
Dependent Variable
The response variable, cardio
, is a binary indicator for cardiovascular disease.
Independent Variables
The independent variables I aim to consider are:
age
(quantitative): Age of patient in yearsgender
(qualitative): Gender of patientheight
(quantitative): Height of patient in cmweight
(quantitative): Weight of patient in kgap_hi
(quantitative): Systolic blood pressureap_lo
(quantitative): Diastolic blood pressurecholesterol
(qualitative): Cholesterol level of patientgluc
(qualitative): Glucose level of patientsmoke
(qualitative): Binary variable to determine if a patient smokesalco
(qualitative): Binary variable to determine if a patient drinks alcoholactive
(qualitative): Yes/No if patient is physically active
Some notes on specific variables:
gender
is limited to a dichotomous identity (male or female)ap_hi
andap_lo
seemed to have data values that were inaccurate. These problematic rows were eliminated.- I transformed
age
to be in terms of years rather than days.
Type of Study
This is an observational study.
Part 3 - Exploratory data analysis
Summary Statistics
summary(cardio.data)
## age gender height weight
## Min. :29.58 Female:44795 Min. : 55.0 Min. : 11.00
## 1st Qu.:48.38 Male :23986 1st Qu.:159.0 1st Qu.: 65.00
## Median :53.98 Median :165.0 Median : 72.00
## Mean :53.33 Mean :164.4 Mean : 74.12
## 3rd Qu.:58.42 3rd Qu.:170.0 3rd Qu.: 82.00
## Max. :64.97 Max. :250.0 Max. :200.00
## ap_hi ap_lo cholesterol gluc smoke
## Min. : 60.0 Min. : 30.00 Norm :51581 Norm :58472 No :62728
## 1st Qu.:120.0 1st Qu.: 80.00 Higher : 9314 Higher : 5074 Yes: 6053
## Median :120.0 Median : 80.00 Highest: 7886 Highest: 5235
## Mean :126.6 Mean : 81.38
## 3rd Qu.:140.0 3rd Qu.: 90.00
## Max. :240.0 Max. :190.00
## alco active cardio bmi
## No :65092 No :13524 No :34741 Min. : 3.472
## Yes: 3689 Yes:55257 Yes:34040 1st Qu.: 23.875
## Median : 26.346
## Mean : 27.523
## 3rd Qu.: 30.119
## Max. :298.667
# numeric data
st(cardio.data[,c(1,3:6)], title = "Numeric Summary Statistics")
Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
age | 68781 | 53.327 | 6.762 | 29.584 | 48.375 | 58.422 | 64.967 |
height | 68781 | 164.362 | 8.185 | 55 | 159 | 170 | 250 |
weight | 68781 | 74.123 | 14.331 | 11 | 65 | 82 | 200 |
ap_hi | 68781 | 126.615 | 16.764 | 60 | 120 | 140 | 240 |
ap_lo | 68781 | 81.378 | 9.688 | 30 | 80 | 90 | 190 |
# categorical data
st(cardio.data[,-c(1,3:6,8,13)], title = "Categorical Summary Statistics")
Variable | N | Percent |
---|---|---|
gender | 68781 | |
… Female | 44795 | 65.1% |
… Male | 23986 | 34.9% |
cholesterol | 68781 | |
… Norm | 51581 | 75% |
… Higher | 9314 | 13.5% |
… Highest | 7886 | 11.5% |
smoke | 68781 | |
… No | 62728 | 91.2% |
… Yes | 6053 | 8.8% |
alco | 68781 | |
… No | 65092 | 94.6% |
… Yes | 3689 | 5.4% |
active | 68781 | |
… No | 13524 | 19.7% |
… Yes | 55257 | 80.3% |
cardio | 68781 | |
… No | 34741 | 50.5% |
… Yes | 34040 | 49.5% |
Exploratory Visualizations
Looking for Multicolinearity
# anything above 0.7 generally indicates multicolinearity
ggpairs(cardio.data, columns = c(1,3:6))
Since the variables ap_hi
(Systolic blood pressure) and ap_lo
(Diastolic blood pressure) are strongly correlated (\(0.697 \approx 0.7\)) I will only use ap_hi
in my logistic regression analysis.
Histograms
ggplot(cardio.data, aes(x=age)) +
geom_histogram(binwidth=4, fill="#69b3a2", color="#e9ecef") +
ggtitle("Distribution: Age")
Most patients are in their late 40s to early 60s
ggplot(cardio.data, aes(x=ap_hi)) +
geom_histogram(binwidth=10, fill="#69b3a2", color="#e9ecef") +
ggtitle("Distribution: Systolic Blood Pressure")
Most data points for ap_hi
fall 115 mm Hg and 125 mm Hg.
Information from the CDC [4] claims that people:
Are NORMAL blood pressure at values less than 120 mm Hg
Are AT RISK of high blood pressure between the values 120 -139 mm Hg
Have HIGH BLOOD PRESSURE at 140 mm Hg or higher
In context of this dataset:
<- cardio.data
new.data
$norm.bp <- ifelse(new.data$ap_hi <= 120, 1, 0)
new.data
$at.risk <- ifelse(new.data$ap_hi > 120 & new.data$ap_hi < 140, 1, 0)
new.data
$high.bp <- ifelse(new.data$ap_hi >= 140, 1, 0)
new.data
%>%
new.data summarise(Normal = sum(norm.bp)/68781, At_Risk = sum(at.risk)/68781, High = sum(high.bp)/68781)
## Normal At_Risk High
## 1 0.5887382 0.1425684 0.2686934
ggplot(cardio.data, aes(x=weight)) +
geom_histogram(binwidth=15, fill="#69b3a2", color="#e9ecef") +
ggtitle("Weight")
Most of the data falls within 50 to 100 kilograms (110 to 220 lbs)
Boxplots
# Age
<- ggplot(cardio.data, aes(x = cardio, y = age)) +
age1 geom_boxplot() +
ggtitle("Boxplot: Age")
# Height
<- ggplot(cardio.data, aes(x = cardio, y = height)) +
height1 geom_boxplot() +
ggtitle("Boxplot: Height")
# Weight
<- ggplot(cardio.data, aes(x = cardio, y = weight)) +
weight1 geom_boxplot() +
ggtitle("Boxplot: Weight")
# Systolic blood pressure
<- ggplot(cardio.data, aes(x = cardio, y = ap_hi)) +
sbp geom_boxplot() +
ggtitle("Boxplot: Systolic blood pressure")
# Put plots together
ggarrange(age1, height1, weight1, sbp,
labels = c("A","B","C","D"),
ncol = 2, nrow = 2)
- Age looks to have an affect on whether a patient has cardiovascular disease due to boxplots in plot A differing across groups.
- Height does not seem to have significant differences across groups, therefore it will not be included in final model.
- Weight does not seem to have significant differences across groups, therefore it will not be included in final model.
- Systolic Blood Pressure differs between groups, therefore it will remain in the logistic regression model.
Part 4 - Logistic Regression
# Split the data into training and test set
set.seed(12345)
<- cardio.data$cardio %>%
training.samples createDataPartition(p = 0.7, list = FALSE)
<- cardio.data[training.samples, ]
train.data <- cardio.data[-training.samples, ]
test.data
summary(train.data$cardio)
## No Yes
## 24319 23828
summary(test.data$cardio)
## No Yes
## 10422 10212
Both training and testing datasets have similar cardio
factor level distributions.
Using Step-wise Variable selection to create Logistic Regression Model
A model was programmed to generate the highest AIC given the full range of variables, not including height
, weight
and ap_lo
. Step-wise selection was used to create the model. The step wise selection process eliminated gender
from the model.
# stepwise variable selection
<- glm(cardio ~ age + ap_hi + cholesterol + active + alco + gluc + smoke + gender, data = train.data, family = binomial)
full.model
<- full.model %>% stepAIC(trace = FALSE)
step.model summary(step.model)
##
## Call:
## glm(formula = cardio ~ age + ap_hi + cholesterol + active + alco +
## gluc + smoke, family = binomial, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8227 -0.9269 -0.3425 0.9423 2.6968
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.5380014 0.1319109 -79.887 < 0.0000000000000002 ***
## age 0.0503634 0.0016042 31.395 < 0.0000000000000002 ***
## ap_hi 0.0627467 0.0008314 75.476 < 0.0000000000000002 ***
## cholesterolHigher 0.3687159 0.0324998 11.345 < 0.0000000000000002 ***
## cholesterolHighest 1.1567672 0.0424940 27.222 < 0.0000000000000002 ***
## activeYes -0.2402398 0.0260102 -9.236 < 0.0000000000000002 ***
## alcoYes -0.1294908 0.0504180 -2.568 0.0102 *
## glucHigher -0.0034201 0.0427889 -0.080 0.9363
## glucHighest -0.3538718 0.0471929 -7.498 0.0000000000000646 ***
## smokeYes -0.1824701 0.0397941 -4.585 0.0000045321595044 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66741 on 48146 degrees of freedom
## Residual deviance: 54248 on 48137 degrees of freedom
## AIC: 54268
##
## Number of Fisher Scoring iterations: 4
In the summary output for the step-wise model, all variables, except glucHigher
, are statistically significant. The variables with the largest log odds are:
- Categorical:
cholesterol
- a person who has
cholesterolHigher
, rather thancholesterolNorm
will see an increase in the log odds ofcardio
by 0.37 - a person who has
cholesterolHighest
, rather thancholesterolHigher
will see an increase in the log odds ofcardio
by 1.16
- a person who has
active
- a person who is
activeYes
will see a decrease in the log odds ofcardio
by 0.24
- a person who is
gluc
- a person who has
glucHighest
, rather thanglucHigher
will see a decrease in the log odds ofcardio
by 0.36
- a person who has
- Numeric:
ap_hi
- Every millimeter of mercury in
ap_hi
increases the log odds ofcardio
by 0.06
- Every millimeter of mercury in
age
- Every year increase in
age
increases the log odds ofcardio
by 0.05
- Every year increase in
Analyze the predicted results of model
# Create confusion matrix for model results
<- full.model %>% predict(test.data, type = "response")
probabilities2
<- ifelse(probabilities2 > 0.5, "Yes", "No")
predicted.classes2
<- full.model %>% predict(test.data, type = "response")
model.results2
<- test.data$cardio
testing.results2
<- table(predicted.classes2, testing.results2)
model.vs.testing2
colnames(model.vs.testing2) <- c("No", "Yes")
rownames(model.vs.testing2) <- c("No", "Yes")
confusionMatrix(model.vs.testing2)
## Confusion Matrix and Statistics
##
## testing.results2
## predicted.classes2 No Yes
## No 8332 3511
## Yes 2090 6701
##
## Accuracy : 0.7286
## 95% CI : (0.7224, 0.7346)
## No Information Rate : 0.5051
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4563
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.7995
## Specificity : 0.6562
## Pos Pred Value : 0.7035
## Neg Pred Value : 0.7623
## Prevalence : 0.5051
## Detection Rate : 0.4038
## Detection Prevalence : 0.5740
## Balanced Accuracy : 0.7278
##
## 'Positive' Class : No
##
Logistic Regression Model using fewer variables
To reduce the number of variables in the model, I only considered ap_hi
for predicting cardio
.
# cardio ~ ap_hi
<- glm(cardio ~ ap_hi, data = train.data, family = binomial)
final.model summary(final.model)
##
## Call:
## glm(formula = cardio ~ ap_hi, family = binomial, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9610 -1.0041 -0.3993 1.0665 2.7884
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.6892597 0.1031688 -84.22 <0.0000000000000002 ***
## ap_hi 0.0688908 0.0008227 83.74 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66741 on 48146 degrees of freedom
## Residual deviance: 56566 on 48145 degrees of freedom
## AIC: 56570
##
## Number of Fisher Scoring iterations: 4
In the summary output above, ap_hi
is statistically significant.
ap_hi
- Every millimeter of mercury in
ap_hi
increases the log odds ofcardio
by 0.07
- Every millimeter of mercury in
Analyze the predicted results of model
<- final.model %>% predict(test.data, type = "response")
probabilities1
<- ifelse(probabilities1 > 0.5, "Yes", "No")
predicted.classes1
<- final.model %>% predict(test.data, type = "response")
model.results1
<- test.data$cardio
testing.results1
<- table(predicted.classes1, testing.results1)
model.vs.testing1
colnames(model.vs.testing1) <- c("No", "Yes")
rownames(model.vs.testing1) <- c("No", "Yes")
confusionMatrix(model.vs.testing1)
## Confusion Matrix and Statistics
##
## testing.results1
## predicted.classes1 No Yes
## No 8387 3897
## Yes 2035 6315
##
## Accuracy : 0.7125
## 95% CI : (0.7063, 0.7187)
## No Information Rate : 0.5051
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4239
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.8047
## Specificity : 0.6184
## Pos Pred Value : 0.6828
## Neg Pred Value : 0.7563
## Prevalence : 0.5051
## Detection Rate : 0.4065
## Detection Prevalence : 0.5953
## Balanced Accuracy : 0.7116
##
## 'Positive' Class : No
##
Part 5 - Conclusion
After analyzing the data from the cardiovascular disease dataset, the data shows that Systolic Blood Pressure is a significant predictor in determining if a patient suffers from cardiovascular disease. I was able to generate a model that had 71.25% accuracy in predicting of an unknown patient suffers from cardiovascular disease.
I also created a model to assess the predicting power with all variables from the dataset in the model, except height, weight, diastolic blood pressure, and gender. The model was able to reach an accuracy level of 72.86% in predicting if an unknown patient suffers from cardiovascular disease. This overall performance is not much better than that of the simpler model, therefore the variables in the simple model are the more important variables for predicting cardiovascular disease.
It can be concluded that Blood Pressure is the most associated factor in the dataset for predicting cardiovascular disease.
References
https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
https://www.cdc.gov/heartdisease/index.htm#:~:text=Heart%20disease%20is%20the%20leading,can%20lead%20to%20heart%20attack.
https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
https://www.webmd.com/hypertension-high-blood-pressure/how-to-lower-blood-pressure