Cardiovascular Disease Dataset Analysis

Introduction

Background: According to the World Health Organization, cardiovascular diseases (CVDs) caused an estimated 17.9 million deaths globally in 2019, making them the leading cause of death worldwide.

Problem Statement: Using the Cardiovascular Disease Dataset from Kaggle, analyze factors that could contribute to cardiovascular diseases.

Data

The dataset contains health records of 70,000 patients, with 12 variables recorded for each patient. Below is a snippet of the data table; not all variables and observations are shown due to limited space.

id	age	gender	height	weight	ap_hi	ap_lo	cholesterol	gluc	active	cardio
0	18393	2	168	62	110	80	1	1	1	0
1	20228	1	156	85	140	90	3	1	1	1
2	18857	1	165	64	130	70	3	1	0	1
3	17623	2	169	82	150	100	1	1	1	1
4	17474	1	156	56	100	60	1	1	0	0
8	21914	1	151	67	120	80	2	2	0	0

Data Wrangling, Munging, and Cleaning

Of the 12 variables, we examine only the association between the two types of blood pressure (systolic BP and diastolic BP) and CVDs, specifically within the 30 to 40 age group. Because age is recorded in days, this range corresponds to 10950 to 14600 days (30 × 365 and 40 × 365). The Cardiovascular Disease Dataset is filtered to this age group and reduced to three variables: systolic BP (“ap_hi”), diastolic BP (“ap_lo”), and presence/absence of cardiovascular disease (“cardio”).

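library(dplyr)
# Keep patients aged 30 to 40 (age is stored in days) and the three variables of interest
blood_pressures <- cardio_data %>%
  filter(age >= 10950 & age <= 14600) %>%
  select(ap_hi, ap_lo, cardio)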

After data wrangling, we get an overview of the systolic BP and diastolic BP of this age group with a boxplot drawn using the ggplot2 package.
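
The boxplot itself is not reproduced here. As a rough sketch of how such a plot might be drawn, the code below stacks the two pressure columns into long format and passes them to geom_boxplot(). The data frame values are toy numbers copied from the data snippet purely for illustration (they are not the filtered 30 to 40 age group), and the names bp_long and bp_boxplot are hypothetical.

```r
library(ggplot2)

# Toy stand-in for `blood_pressures` (illustrative values only)
blood_pressures <- data.frame(
  ap_hi  = c(110, 140, 130, 150, 100, 120),
  ap_lo  = c(80, 90, 70, 100, 60, 80),
  cardio = c(0, 1, 1, 1, 0, 0)
)

# Stack the two pressure columns: one row per reading
bp_long <- data.frame(
  type  = rep(c("Systolic (ap_hi)", "Diastolic (ap_lo)"),
              each = nrow(blood_pressures)),
  value = c(blood_pressures$ap_hi, blood_pressures$ap_lo)
)

# One box per blood pressure type; outliers appear as individual points
bp_boxplot <- ggplot(bp_long, aes(x = type, y = value)) +
  geom_boxplot() +
  labs(x = NULL, y = "Blood pressure (mmHg)")
```

Printing bp_boxplot renders the figure; points drawn beyond the whiskers are the candidates for outlier removal.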

The boxplot reveals outliers. We remove them with the interquartile range (IQR) method, keeping only values strictly between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.

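# Systolic readings with CVD status, sorted from highest to lowest
systolic <- blood_pressures %>%
  select(ap_hi, cardio) %>%
  arrange(desc(ap_hi))

# Quartiles and interquartile range of systolic BP
systolic_Q1 <- quantile(systolic$ap_hi, .25)
systolic_Q3 <- quantile(systolic$ap_hi, .75)
systolic_IQR <- IQR(systolic$ap_hi)

# Keep rows strictly inside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR
systolic_no_outlier <- subset(systolic,
                              ap_hi > (systolic_Q1 - 1.5 * systolic_IQR) &
                                ap_hi < (systolic_Q3 + 1.5 * systolic_IQR))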

The same code, applied to ap_lo instead of ap_hi, cleans the diastolic BP data.
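
Since the same steps are repeated for each blood pressure column, they could be wrapped in a small helper. The following is a minimal sketch in base R, not the code used for the analysis; the function name remove_iqr_outliers and the toy diastolic data are hypothetical.

```r
# Hypothetical helper: keep rows whose `column` value lies strictly inside
# the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
remove_iqr_outliers <- function(data, column) {
  values <- data[[column]]
  q1 <- quantile(values, 0.25, names = FALSE)
  q3 <- quantile(values, 0.75, names = FALSE)
  iqr <- q3 - q1
  keep <- values > (q1 - 1.5 * iqr) & values < (q3 + 1.5 * iqr)
  data[keep, , drop = FALSE]
}

# Toy diastolic data with one implausible reading (1000 mmHg)
diastolic <- data.frame(
  ap_lo  = c(80, 90, 70, 100, 60, 80, 1000),
  cardio = c(0, 1, 1, 1, 0, 0, 1)
)
diastolic_no_outlier <- remove_iqr_outliers(diastolic, "ap_lo")
```

In this toy example the quartiles are 75 and 95, so the fences sit at 45 and 125 and only the 1000 mmHg reading is dropped.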

Plotting the data again confirms that the systolic BP and diastolic BP outliers have been removed.

Hypotheses

Before hypothesizing an association between the two kinds of blood pressure and CVDs, we first look at the distribution of systolic BP and diastolic BP across all ages for the group with CVDs present.

According to the American Heart Association, a systolic BP of 130 mmHg or higher, or a diastolic BP of 80 mmHg or higher, is considered high blood pressure. Since the group with CVDs present shows high density at 80 to 150 mmHg on the density plot, we make the following hypotheses for the targeted age group:

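\[ \begin{aligned} H_0 &: \mu_0 = \mu_1 \\ H_a &: \mu_0 < \mu_1 \end{aligned} \]

Here \(\mu_1\) is the mean systolic BP (or diastolic BP) of the group with CVDs and \(\mu_0\) is the mean of the group without CVDs.

Null Hypothesis: there is no difference in mean systolic BP (or diastolic BP) between the CVD and non-CVD groups.

Alternative Hypothesis: the mean systolic BP (or diastolic BP) is higher for the CVD group than for the non-CVD group.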

Method

Since we are comparing mean systolic BP (or diastolic BP) between two groups (CVDs vs. non-CVDs), we use a two-sample z-test to test the null hypothesis against the alternative.
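
For reference, the statistic behind a one-sided two-sample z-test can be computed by hand in base R. This is a minimal sketch rather than the code used for the results; the function name two_sample_z and the toy vectors are hypothetical, and the supplied standard deviations are treated as known.

```r
# Hypothetical helper: one-sided two-sample z-test (H_a: mean(x) > mean(y)),
# treating sigma_x and sigma_y as known standard deviations
two_sample_z <- function(x, y, sigma_x, sigma_y) {
  se <- sqrt(sigma_x^2 / length(x) + sigma_y^2 / length(y))  # standard error
  z  <- (mean(x) - mean(y)) / se                             # test statistic
  p  <- pnorm(z, lower.tail = FALSE)                         # upper-tail p-value
  list(z = z, p_value = p)
}

# Toy data for illustration only
x <- c(130, 140, 150, 120, 135)
y <- c(110, 120, 115, 125, 105)
result <- two_sample_z(x, y, sigma_x = sd(x), sigma_y = sd(y))
```

A large positive z with a p-value below the chosen significance level favors the alternative that the first group has the higher mean.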

We use the z.test() function from the BSDA package to perform the test, at the 5% level of significance.
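
The test calls in the results use inputs named systolic_1, systolic_0, systolic_1_sd, and systolic_0_sd, whose construction is not shown. One plausible way to build them is sketched below, assuming that the suffixes 1 and 0 refer to the value of cardio and that the _sd variables are sample standard deviations. The data frame here is a toy stand-in, not the cleaned dataset.

```r
# Toy stand-in for the cleaned systolic data (`systolic_no_outlier`)
systolic_no_outlier <- data.frame(
  ap_hi  = c(110, 140, 130, 150, 100, 120),
  cardio = c(0, 1, 1, 1, 0, 0)
)

# Split systolic readings by CVD status (assumed: 1 = present, 0 = absent)
systolic_1 <- systolic_no_outlier$ap_hi[systolic_no_outlier$cardio == 1]
systolic_0 <- systolic_no_outlier$ap_hi[systolic_no_outlier$cardio == 0]

# Sample standard deviations, to be passed as sigma.x and sigma.y (assumption)
systolic_1_sd <- sd(systolic_1)
systolic_0_sd <- sd(systolic_0)
```

The diastolic inputs would follow the same pattern with ap_lo in place of ap_hi.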

Results - Systolic BP

library(BSDA)
z.test(systolic_1, systolic_0, alternative = "greater",
       mu = 0, sigma.x = systolic_1_sd, sigma.y = systolic_0_sd, conf.level = .95)

## 
##  Two-sample z-Test
## 
## data:  systolic_1 and systolic_0
## z = 6.5588, p-value = 2.713e-11
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.970174       NA
## sample estimates:
## mean of x mean of y 
##  119.2217  115.2573

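Because the p-value is below 0.05, we reject the null hypothesis: the mean systolic BP is significantly higher in the CVD group than in the non-CVD group.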

Results - Diastolic BP

z.test(diastolic_1, diastolic_0, alternative = "greater",
       mu = 0, sigma.x = diastolic_1_sd, sigma.y = diastolic_0_sd, conf.level = .95)

## 
##  Two-sample z-Test
## 
## data:  diastolic_1 and diastolic_0
## z = 14.353, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  5.921546       NA
## sample estimates:
## mean of x mean of y 
##  81.91057  75.22256

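Because the p-value is below 0.05, we reject the null hypothesis: the mean diastolic BP is significantly higher in the CVD group than in the non-CVD group.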
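
As a sanity check, the p-values and one-sided confidence bounds in both outputs can be recovered from the reported means and z statistics alone. The sketch below uses only base R; the helper name check_z_output is hypothetical, and the standard error is inferred from the reported z rather than taken from the data.

```r
# Hypothetical helper: recover the one-sided p-value and lower confidence
# bound from the reported group means and z statistic
check_z_output <- function(mean_x, mean_y, z, conf_level = 0.95) {
  diff <- mean_x - mean_y
  se   <- diff / z  # standard error implied by the reported z
  list(
    p_value = pnorm(z, lower.tail = FALSE),
    lower   = diff - qnorm(conf_level) * se
  )
}

# Numbers copied from the two z.test() outputs above
systolic_check  <- check_z_output(119.2217, 115.2573, 6.5588)
diastolic_check <- check_z_output(81.91057, 75.22256, 14.353)
```

Up to rounding of the printed inputs, the lower bounds come out at about 2.97 and 5.92, matching the reported confidence intervals.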

Conclusion

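Based on the two-sample z-tests on the selected variables, mean systolic BP and mean diastolic BP are both significantly higher in patients with CVDs than in those without. We conclude that elevated systolic BP and diastolic BP are associated with CVDs in the 30 to 40 age group, which supports treating high blood pressure as a CVD risk factor for this group.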