Sai Vamsi Chunduru - S3884753, Pragati Patidar - S3858702, Kyron Reshi - S3920193, Arjun Padmanabha Pillai - S3887231
Last updated on 17 October 2021
# Packages used below (assumed to be loaded in a setup chunk that is not shown)
library(readr)    # read_csv()
library(dplyr)    # %>%, filter(), group_by(), summarise()
library(ggplot2)  # histogram and density plot
library(car)      # qqPlot()
hd <- read_csv("heart.csv")
# Renaming variables
names(hd)[5] <- "Cholesterol"
names(hd)[9] <- "Exercise_Induced_Angina"
# Converting variables to factors
# target: 0 = no heart disease, 1 = heart disease
hd$target <- hd$target %>% factor(levels = c(0, 1),
                                  labels = c("no heart disease", "heart disease"))
# Exercise_Induced_Angina: 1 = Yes, 0 = No
hd$Exercise_Induced_Angina <- hd$Exercise_Induced_Angina %>%
  factor(levels = c(1, 0), labels = c("Yes", "No"))
# Summarising the required variables
summary(hd$Cholesterol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   126.0   211.0   240.0   246.3   274.5   564.0
summary(hd$Exercise_Induced_Angina)
## Yes  No
##  99 204
summary(hd$target)
## no heart disease    heart disease
##              138              165
-Among individuals without heart disease, the largest share of the sample has cholesterol between 220 and 240 mg/dl, followed by 280 to 300 mg/dl. Among individuals with heart disease, the largest share falls between 200 and 250 mg/dl.
-The plot below shows the distribution of cholesterol levels for all individuals in this investigation. The distribution appears positively (right) skewed, which is consistent with the mean (246.3) exceeding the median (240.0).
# Histogram with a density overlay to check the distribution of the data (bins = 30)
hd %>% ggplot(aes(x = Cholesterol)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, colour = "black") +
  geom_density(alpha = .2, fill = "dodgerblue3")
##
##       no heart disease heart disease Sum
##   Yes               76            23  99
##   No                62           142 204
##   Sum              138           165 303
##
##       no heart disease heart disease
##   Yes        0.5507246     0.1393939
##   No         0.4492754     0.8606061
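The code that produced the two tables above is not shown in this report. The following is a minimal sketch of how they could be obtained, assuming `table()`, `addmargins()` and `prop.table()` were used. To keep the example self-contained, the counts are typed in from the output above instead of being read from `heart.csv`. The object names `table_ang_target` and `table_ang_target1` are taken from the later code and test output.

```r
# Cross-tabulation of exercise-induced angina (rows) by heart disease status (columns).
# With the data frame loaded this would presumably be
#   table(hd$Exercise_Induced_Angina, hd$target)
# (an assumption; here the counts are copied from the output above).
table_ang_target <- as.table(matrix(
  c(76, 62, 23, 142), nrow = 2,
  dimnames = list(c("Yes", "No"), c("no heart disease", "heart disease"))
))

# Counts with row and column totals
addmargins(table_ang_target)

# Proportions within each heart disease group (each column sums to 1)
table_ang_target1 <- prop.table(table_ang_target, margin = 2)
round(table_ang_target1, 7)
```

Using `margin = 2` conditions on the heart disease group, so the bars in the plot that follows compare angina status within each group.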
# Grouped bar plot of the column proportions: angina status within each heart disease group
barplot(table_ang_target1, main = "Bar plot for Exercise Induced Angina",
        ylab = "Proportion within heart disease group", xlab = "Heart disease status",
        ylim = c(0, 1), legend = row.names(table_ang_target1), beside = TRUE,
        args.legend = list(x = "topleft", horiz = FALSE, title = "Exercise induced angina"))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_ang_target
## X-squared = 55.945, df = 1, p-value = 7.454e-14
##
##       no heart disease heart disease
##   Yes         45.08911      53.91089
##   No          92.91089     111.08911
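The call that produced the chi-squared output and the expected counts above is not shown. The following is a minimal sketch, assuming `chisq.test()` was applied to the 2x2 table `table_ang_target` (the name that appears in the output). The counts are again copied from the contingency table so the block runs without `heart.csv`, and `chi_result` is a hypothetical name introduced for this sketch.

```r
# Observed counts: exercise-induced angina (rows) by heart disease status (columns)
table_ang_target <- as.table(matrix(
  c(76, 62, 23, 142), nrow = 2,
  dimnames = list(c("Yes", "No"), c("no heart disease", "heart disease"))
))

# Chi-squared test of association; for a 2x2 table chisq.test() applies
# Yates' continuity correction by default (correct = TRUE)
chi_result <- chisq.test(table_ang_target)
chi_result

# Expected counts under independence: row total * column total / grand total
chi_result$expected
```

All expected counts are well above 5, so the chi-squared approximation is reasonable here, and the very small p-value indicates an association between exercise-induced angina and heart disease status at the 5% level.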
The two-sample t-test (also known as the independent samples t-test) tests whether the unknown population means of two groups are equal. Here we want to know whether the mean cholesterol measurement (in mg/dl) of people who have heart disease differs from that of people who do not have heart disease.
The test uses two variables. The variable target defines the two groups, heart disease and no heart disease. The second variable, Cholesterol, is the continuous measurement of interest.
-The data values are independent. The cholesterol measurement for any one person does not depend on the cholesterol measurement for another person.
-The summary statistics table shows the sample mean cholesterol for each group (no heart disease = 251.09, heart disease = 242.23).
-The histogram above shows the distribution of the data, and the data are treated as a simple random sample from the population.
-Normality is checked with a QQ plot for each group. Because both groups have more than 30 observations (n = 138 and n = 165), the sampling distribution of the mean will be approximately normal even if the data are not exactly normal.
-The data values are cholesterol measurements, which are continuous, as the histogram above shows.
-The variances of the heart disease and no heart disease groups are assumed to be equal, which is checked with Levene's test for homogeneity of variances.
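The Levene's test call itself is not shown in this report. The following is a minimal base R sketch of the median-centred version of the test (the Brown-Forsythe variant), which `car::leveneTest()` is understood to compute with its default `center = median`. The function name `levene_median` is hypothetical and introduced only for illustration.

```r
# Median-centred Levene test in base R.
# x: numeric measurements, g: grouping variable.
levene_median <- function(x, g) {
  g <- factor(g)
  # Absolute deviation of each value from the median of its own group
  z <- abs(x - ave(x, g, FUN = median))
  # One-way ANOVA on the deviations; its F test is the Levene test
  fit <- anova(lm(z ~ g))
  c(F = fit[["F value"]][1], df1 = fit[["Df"]][1],
    df2 = fit[["Df"]][2], p.value = fit[["Pr(>F)"]][1])
}

# Intended use with the report's data frame (not run here):
# levene_median(hd$Cholesterol, hd$target)
```

A p-value above 0.05 from this test would mean the hypothesis of equal variances is not rejected, which supports setting `var.equal = TRUE` in the t-test.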
The p-value reported for Levene's test (0.1388) is greater than 0.05, so the population variances are treated as homogeneous and the two-sample t-test with equal variances can be applied.
b- State the null and alternative hypotheses for the appropriate hypothesis test.
c- Report the test statistic t, the p-value and the 95% CI of the mean difference from the results of the hypothesis test.
Hypothesis test:
H0: μ1 = μ2 (the mean cholesterol measurement of people who have heart disease is equal to that of people who do not have heart disease.)
HA: μ1 ≠ μ2 (the mean cholesterol measurement of people who have heart disease is not equal to that of people who do not have heart disease.)
hd %>% group_by(target) %>% summarise(Min = min(Cholesterol,na.rm = TRUE),
Q1 = quantile(Cholesterol,probs = .25,na.rm = TRUE),
Median = median(Cholesterol, na.rm = TRUE),
Q3 = quantile(Cholesterol,probs = .75,na.rm = TRUE),
Max = max(Cholesterol,na.rm = TRUE),
Mean = mean(Cholesterol, na.rm = TRUE),
SD = sd(Cholesterol, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Cholesterol))) -> table1
knitr::kable(table1)

| target | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| no heart disease | 131 | 217.25 | 249 | 283 | 409 | 251.0870 | 49.45461 | 138 | 0 |
| heart disease | 126 | 208.00 | 234 | 267 | 564 | 242.2303 | 53.55287 | 165 | 0 |
# QQ plot for target == "no heart disease" to check normality
target_no <- hd %>% filter(target == "no heart disease")
target_no$Cholesterol %>% qqPlot(dist = "norm")
## [1] 82 56
# QQ plot for target == "heart disease" to check normality
target_yes <- hd %>% filter(target == "heart disease")
target_yes$Cholesterol %>% qqPlot(dist = "norm")
## [1] 86 29
test_result<- t.test(Cholesterol ~ target,
data = hd,
var.equal = TRUE, alternative = "two.sided" ) #Independent two sample t-Test
test_result
##
## Two Sample t-test
##
## data: Cholesterol by target
## t = 1.4842, df = 301, p-value = 0.1388
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.885882 20.599189
## sample estimates:
## mean in group no heart disease    mean in group heart disease
##                       251.0870                       242.2303
# Use the p-value and the CI of the mean difference to make a decision about the null hypothesis.
test_result$p.value
## [1] 0.1387903
test_result$conf.int
## [1] -2.885882 20.599189
## attr(,"conf.level")
## [1] 0.95
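As a cross-check, the pooled two-sample t statistic, p-value and 95% CI can be recomputed by hand from the group summaries in the table above. This is a sketch under the same equal-variance assumption as the `t.test()` call. The variable names are introduced here for illustration, and small differences in the last decimal places can arise because the table values are rounded.

```r
# Group summaries copied from the summary statistics table
n1 <- 138; m1 <- 251.0870; s1 <- 49.45461   # no heart disease
n2 <- 165; m2 <- 242.2303; s2 <- 53.55287   # heart disease

dof <- n1 + n2 - 2                           # degrees of freedom
# Pooled standard deviation under the equal-variance assumption
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / dof)
se <- sp * sqrt(1 / n1 + 1 / n2)             # standard error of the mean difference

t_stat  <- (m1 - m2) / se                    # test statistic
p_value <- 2 * pt(-abs(t_stat), dof)         # two-sided p-value
ci      <- (m1 - m2) + c(-1, 1) * qt(0.975, dof) * se   # 95% CI for the difference

round(c(t = t_stat, df = dof, p = p_value, lower = ci[1], upper = ci[2]), 4)
```

Because the p-value (0.139) is greater than 0.05 and the 95% CI (-2.89 to 20.60) contains 0, the null hypothesis is not rejected: the data do not show a statistically significant difference in mean cholesterol between the two groups at the 5% level.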