Rpubs Presentation Link:
Do male or female have a higher likelihood of a heart attack? At what age is it most common to have a heart attack? This investigation will explain the likelihood of a heart attack through the comparison of gender and age.
The Male and Female age, gender and target variable are used in this investigation. After giving the summary statistics for male and female separately, a histogram and barplot is plotted to present the data in a more comprehensive manner. In addition, a Chi square test of association between the gender and risk of heart attack is used to test if there is an association between the gender and risk of heart attack. Furthermore, a normality test is used to ensure that the data is is drawn from a normal population distribution. After both assumptions are tested, a two sample t tests is used to evaluate if there is a difference in the mean age of Male and Female who have a high chance of heart attack.
The following hypothesis have been set out and tested through this investigation:
There is association between gender and risk of heart attack of an individual.
The age of Male and Female whom have a high chance of a heart attack are the same.
The dataset used contains 14 variables and 303 observations of various individuals from the United States of America, Cleverland. The dataset used in this task was generated from: https://www.kaggle.com/madhav000/attack-prediction-accuracy-more-than-80.
The observation of the data was drawn from https://archive.ics.uci.edu/ml/datasets/Heart+Disease. The dataset contain 76 variables from different countries such as Hungary and Switzerland but have been subset for easier analysis and the Cleverland database is the one that have been selected for this task.
Age: Age of individual in years.
Sex: Gender of the individual. 0 for Female and 1 for Male. Sex variable have been renamed to Gender.
cp: Chest pain type from 1 to 4. Does not have any levels.
trestbps: Resting blood pressure. (in mm Hg on admission to the hospital)
chol: Serum cholesterol measured in mg/dl.
fbs: Fasting blood > 120 mg/dl. 0 for False, 1 for True.
restecg: Resting electrocardiographic results. 0 for normal, 1 for having abnormality, and 2 for probable or definite hypertrophy.
thalach: Maximum heart rate achieved masured in beats per minutes.
exang: Exercise induced angina. 0 for False, 1 for True.
oldpeak: ST depression induced by exercise relative to rest. Does not have any levels.
slope: Slope of peak exercise ST segment. 1 for upsloping, 2 for flat, 3 for downsloping.
ca: Number of major vessels colored by flouroscopy, measured from 0 to 3.
thal: Thalassemia a inherited blood disorder, 0 for normal, 1 for fixed defect and 2 for reversable defect.
target: Likelihood of heart attack. 0 for less chance of heart attack and 1 for more chance of heart attack.
library(readr)
library(dplyr)
library(car)
library(lattice)
library(ggplot2)
heart <- read_csv("heart.csv")
## Parsed with column specification:
## cols(
## age = col_double(),
## sex = col_double(),
## cp = col_double(),
## trestbps = col_double(),
## chol = col_double(),
## fbs = col_double(),
## restecg = col_double(),
## thalach = col_double(),
## exang = col_double(),
## oldpeak = col_double(),
## slope = col_double(),
## ca = col_double(),
## thal = col_double(),
## target = col_double()
## )
str(heart)
## tibble [303 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:303] 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : num [1:303] 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : num [1:303] 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: num [1:303] 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : num [1:303] 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : num [1:303] 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : num [1:303] 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : num [1:303] 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : num [1:303] 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num [1:303] 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : num [1:303] 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : num [1:303] 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : num [1:303] 1 2 2 2 2 1 2 3 3 2 ...
## $ target : num [1:303] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_double(),
## .. cp = col_double(),
## .. trestbps = col_double(),
## .. chol = col_double(),
## .. fbs = col_double(),
## .. restecg = col_double(),
## .. thalach = col_double(),
## .. exang = col_double(),
## .. oldpeak = col_double(),
## .. slope = col_double(),
## .. ca = col_double(),
## .. thal = col_double(),
## .. target = col_double()
## .. )
summary(heart)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :1.000 Median :130.0
## Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.314 Mean :0.5446
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
heart$sex <- heart$sex %>% factor(levels=c(0,1),
labels=c("Female","Male"))
heart$target <- heart$target %>% factor(levels=c(0,1),
labels=c(0,1))
heart <- heart %>% rename(gender = sex)
str(heart)
## tibble [303 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:303] 63 37 41 56 57 57 56 44 52 57 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 1 2 2 2 ...
## $ cp : num [1:303] 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: num [1:303] 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : num [1:303] 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : num [1:303] 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : num [1:303] 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : num [1:303] 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : num [1:303] 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num [1:303] 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : num [1:303] 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : num [1:303] 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : num [1:303] 1 2 2 2 2 1 2 3 3 2 ...
## $ target : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_double(),
## .. cp = col_double(),
## .. trestbps = col_double(),
## .. chol = col_double(),
## .. fbs = col_double(),
## .. restecg = col_double(),
## .. thalach = col_double(),
## .. exang = col_double(),
## .. oldpeak = col_double(),
## .. slope = col_double(),
## .. ca = col_double(),
## .. thal = col_double(),
## .. target = col_double()
## .. )
heart_summary1 <- heart %>% group_by(`gender`) %>% summarise(Min = min(age,na.rm = TRUE),
Q1 = quantile(age,probs = .25,na.rm = TRUE),
Median = median(age, na.rm = TRUE),
Q3 = quantile(age,probs = .75,na.rm = TRUE),
Max = max(age,na.rm = TRUE),
Mean = mean(age, na.rm = TRUE),
SD = sd(age, na.rm = TRUE),
n = n(),
Missing = sum(is.na(target)))
## `summarise()` ungrouping output (override with `.groups` argument)
heart_summary1
heart %>% histogram(~target | gender, data= ., main = "Risk of heart attack")
heart %>% histogram(~age | gender, data= ., main = "Age of observations", breaks=10)
heart %>% ggplot(aes(x=age)) + geom_histogram(aes(y=..density..), colour="black")+
geom_density(alpha=.2, fill="dodgerblue3")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
heart_filtered <- heart %>% filter(target == 1)
table_age <- table(heart_filtered$gender, heart_filtered$age)
barplot(table_age, main="Bar plot of Age",
ylab="Number of observations", xlab="Age",
ylim=c(0,8),legend=row.names(table_age), beside=TRUE,
args.legend=c(x="topright",horiz=FALSE,title="Gender"),
col=c( "#FF66FF","#0099FF"))
boxplot(age ~ gender, data=heart_filtered)
heart_summary2 <- heart_filtered %>% group_by(gender) %>% summarise(Mean = round(mean(age, na.rm = TRUE),2),
Min = min(age,na.rm = TRUE),
Q1 = quantile(age,probs = .25,na.rm = TRUE),
Median = median(age, na.rm = TRUE),
Q3 = quantile(age,probs = .75,na.rm = TRUE),
Max= max(age,na.rm=TRUE),
n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
heart_summary2
heart_summary3 <- heart_filtered %>% group_by(gender) %>% summarise(Mean = round(mean(age, na.rm = TRUE),2),
SD = round(sd(age, na.rm = TRUE),3),
n = n(),
tcrit = round(qt(p = 0.975, df = n - 1),3),
SE = round(SD/sqrt(n),3),
`95% CI Lower Bound` = round(Mean - tcrit * SE,2),
`95% CI Upper Bound` = round(Mean + tcrit * SE,2))
## `summarise()` ungrouping output (override with `.groups` argument)
heart_summary3
State the null and alternate hypothesis
H0: Likelihood that heart attack and gender are related.
H1: Likelihood that heart attack and gender are not related.
Assumption is that no more than 25% of expected counts are less than 5 and all individual counts are 1 or greater.
Table_heart2 shows that Male have a 0.44 chance of high likelihood of heart attack while Female have a 0.75 chance of high likelihood of heart attack.
table_heart <- table(heart$target, heart$gender)
table_heart
##
## Female Male
## 0 24 114
## 1 72 93
table_heart %>% addmargins()
##
## Female Male Sum
## 0 24 114 138
## 1 72 93 165
## Sum 96 207 303
table_heart2 <- table_heart %>% prop.table(margin=2)
table_heart2
##
## Female Male
## 0 0.2500000 0.5507246
## 1 0.7500000 0.4492754
barplot(table_heart2, main="Bar plot For Male and Female",
ylab="Proportion within Gender", xlab="Likelihood of heart attack",
ylim=c(0,1),legend=row.names(table_heart2), beside=TRUE,
args.legend=c(x="topleft",horiz=FALSE,title="Likelihood of heart attack"))
chi_heart <- chisq.test(table_heart, p=c(0.5,0.5))
chi_heart
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_heart
## X-squared = 22.717, df = 1, p-value = 1.877e-06
chi_heart$expected
##
## Female Male
## 0 43.72277 94.27723
## 1 52.27723 112.72277
chi_heart$observed
##
## Female Male
## 0 24 114
## 1 72 93
heart_filtered <- heart %>% filter(target == 1)
heart_male <- heart_filtered %>% filter(gender=="Male")
heart_male$age %>% qqPlot(dist="norm", xlab = "Age of Male")
## [1] 45 81
heart_female <- heart_filtered %>% filter(gender=="Female")
heart_female$age %>% qqPlot(dist="norm", ylab = "Age of Female")
## [1] 65 53
leveneTest(age ~ gender, data = heart)
The t test is used to compare the difference between the Male and Female population mean age. The two sample t test assumes the population being compared are independent of each other, the data for both Male and Female population have equal variance and are normally distributed. The following assumption have been checked as mentioned above.
H0: M1 - M2 = 0
HA: M1 - M2 =/= 0
where M1 and M2 is the mean age of Male and Female respectively.
result <- t.test(age ~ gender, data=heart_filtered,
var.equal=TRUE, alternative ="two.sided")
result
##
## Two Sample t-test
##
## data: age by gender
## t = 2.4739, df = 163, p-value = 0.01439
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7370777 6.5675818
## sample estimates:
## mean in group Female mean in group Male
## 54.55556 50.90323
result$conf.int
## [1] 0.7370777 6.5675818
## attr(,"conf.level")
## [1] 0.95
result$p.value
## [1] 0.01439017
The first analysis was based on a categorical assocation if gender plays a role in the likelihood of a heart attack. Based on the visualisation, there looks to be evidence that the Female gender are at a higher risk of heart attack, however using the chi square we see a differing answer.
Using the chi square test of association between gender and likelihood, the test is statistically significant and we can conclude that a person’s gender is not related to the likelihood of them having a heart attack. Therefore whether a person has a high or low chance of heart attack is not associated with their gender.
The second investigation was to provide insight on whether male and female have the same age of a high likelihood of a heart attack. The mean age for both the Male and Female exhibited evidence of normality upon inspecting the QQ plot. The central limit theorem also sets out that the distribution of mean of more than 30 samples will convergent to normality. Therefore, t test can be applied due to large sample size in both groups. The Leeve test of homogeneity of variance was applied to check if the variance of the age of male and female are equal. Upon inspecting the p value of 0.09, we can conclude that the population variance are homogeneous.
Therefore, since both assumption of the t test was tested above. We can conclude that there is normality and equal variance. The t test is used to evaluate individuals with a high likelihood of heart attack. The results of the t test shows that there is a evidence of a difference in the mean age of Male and Female. In conclusion, the mean age of Male and Female who have a high likelihood of heart attack is different.
However at a stricter confidence interval of 0.01, the p value of 0.014 is greater than 0.01 and the null hypothesis would not be rejected, suggesting that the mean age of Male and Female are the same.
World Health Organization 2017, Cardiovascular disease (CVDs) Webpage (HTML Format), World Health Organization, Melbourne, viewed 5 September 2020, https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)