NUID:002893549
Importing the libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(infer)
library(ggplot2)
library(ggcorrplot)
options(scipen=999)
Specifying colours to be used while plotting visualizations
my_colors<-c("steel blue", "green" ,"orange", "grey" ,"yellow" )
path<-"/Users/kareena_610/Desktop/R-Programming/Heart Disease/heart_disease_uci.csv"
data<-read.csv(path,header=FALSE)
head(data, n=5)
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## 1 id age sex dataset cp trestbps chol fbs restecg
## 2 1 63 Male Cleveland typical angina 145 233 TRUE lv hypertrophy
## 3 2 67 Male Cleveland asymptomatic 160 286 FALSE lv hypertrophy
## 4 3 67 Male Cleveland asymptomatic 120 229 FALSE lv hypertrophy
## 5 4 37 Male Cleveland non-anginal 130 250 FALSE normal
## V10 V11 V12 V13 V14 V15 V16
## 1 thalch exang oldpeak slope ca thal num
## 2 150 FALSE 2.3 downsloping 0 fixed defect 0
## 3 108 TRUE 1.5 flat 3 normal 2
## 4 129 TRUE 2.6 flat 2 reversable defect 1
## 5 187 FALSE 3.5 downsloping 0 normal 0
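# Row 1 holds the column names (read in as data because header=FALSE); columns 1 and 4 are id and dataset
# Note: removing row 2 as well discards the first patient record (id 1)
data<-data[-c(1,2),-c(1,4)]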
We name columns as on the UCI Website
colnames(data)<- c("age","sex","cp","trestbps","chol","fbs","restecg","thalach",
"exang","oldpeak","slope","ca","thal","hd")
Some values are missing in the exang, fbs, restecg and ca columns. Blank entries are recoded as NA and incomplete rows are dropped
data[data == " "] <- NA
data[data == ""] <- NA
data<-data %>%
drop_na()
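The header row can also be consumed at read time, which avoids removing rows by position. The snippet below is a minimal sketch, not the code used in this report: csv_text is a hypothetical three-row excerpt built from the head() output above, and demo is an illustrative name.

```r
# Hypothetical excerpt standing in for heart_disease_uci.csv (first columns only)
csv_text <- "id,age,sex,dataset,cp
1,63,Male,Cleveland,typical angina
2,67,Male,Cleveland,asymptomatic
3,67,Male,Cleveland,asymptomatic"

# header = TRUE takes the column names from the first line,
# so every remaining line is kept as an observation
demo <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Drop the identifier columns by name instead of by position
demo <- demo[, !(names(demo) %in% c("id", "dataset"))]
str(demo)
```

Read this way, age arrives as a numeric column and no patient record is lost.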
typeof(data) # typeof() reports the storage type; a data frame is stored internally as a list
## [1] "list"
data<-as.data.frame(data)
nrow(data)
## [1] 298
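A "list" result from typeof() does not mean the object is not a data frame. The sketch below, on a small made-up data frame (df_demo is an illustrative name), shows how typeof(), class() and is.data.frame() differ.

```r
# Small made-up data frame used only for illustration
df_demo <- data.frame(age = c(63, 67, 37), sex = c("Male", "Male", "Female"))

storage_type <- typeof(df_demo)      # internal storage type
object_class <- class(df_demo)       # S3 class used for method dispatch
is_df <- is.data.frame(df_demo)      # direct check

print(storage_type)
print(object_class)
print(is_df)
```

class() or is.data.frame() is therefore the more informative check before calling as.data.frame().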
The dataset consists of 13 attributes plus the diagnosis (hd). The numeric codes below follow the UCI documentation; this CSV stores the categorical values as text labels (for example "Male"/"Female" and TRUE/FALSE)
V1<-age: age of subject (in years)
V2<-sex: sex of subject (1 = male; 0 = female)
V3<-cp: chest pain type Value 1: typical angina Value 2: atypical angina Value 3: non-anginal pain Value 4: asymptomatic
V4<-trestbps: resting blood pressure (on admission to the hospital) (in mm Hg)
V5<-chol: serum cholesterol (in mg/dl)
V6<-fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
V7<-restecg: resting electrocardiographic results Value 0: normal Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
V8<-thalach: Max. heart rate reached
V9<-exang: Exercise induced angina (1 = yes; 0 = no)
V10<-oldpeak: ST depression induced by exercise relative to rest
V11<-slope: the slope of the peak exercise ST segment Value 1: upsloping Value 2: flat Value 3: downsloping
V12<-ca: No. of major vessels (0-3) colored by fluoroscopy
V13<-thal: categorical variable Value 3: normal Value 6: fixed defect Value 7: Reversible defect
V14<-hd: diagnosis of heart disease (angiographic disease status) Value 0: Absence of Heart Disease (<50% Vessel Narrowing) Value 1,2,3,4: Presence of Heart Disease (>50% Vessel Narrowing)
Viewing the Structure of the data frame
str(data)
## 'data.frame': 298 obs. of 14 variables:
## $ age : chr "67" "67" "37" "41" ...
## $ sex : chr "Male" "Male" "Male" "Female" ...
## $ cp : chr "asymptomatic" "asymptomatic" "non-anginal" "atypical angina" ...
## $ trestbps: chr "160" "120" "130" "130" ...
## $ chol : chr "286" "229" "250" "204" ...
## $ fbs : chr "FALSE" "FALSE" "FALSE" "FALSE" ...
## $ restecg : chr "lv hypertrophy" "lv hypertrophy" "normal" "lv hypertrophy" ...
## $ thalach : chr "108" "129" "187" "172" ...
## $ exang : chr "TRUE" "TRUE" "FALSE" "FALSE" ...
## $ oldpeak : chr "1.5" "2.6" "3.5" "1.4" ...
## $ slope : chr "flat" "flat" "downsloping" "upsloping" ...
## $ ca : chr "3" "2" "0" "0" ...
## $ thal : chr "normal" "reversable defect" "normal" "normal" ...
## $ hd : chr "2" "1" "0" "0" ...
Changing the datatype into factors/categories
Changing sex as factors [M, F]
data<-data %>%
mutate(sex=if_else(sex=="Female", "F", "M"))
data$sex<-as.factor(data$sex)
Changing the values of hd to [“Disease Absent”, “Disease Present”]
data<-data %>%
mutate(hd2=ifelse(hd==0, "Disease Absent", "Disease Present"))
data$hd<-as.factor(data$hd)
data$hd2<-as.factor(data$hd2)
Converting the categorical data columns to factors
data$cp<-as.factor(data$cp)
data$fbs<-as.factor(data$fbs)
data$restecg<-as.factor(data$restecg)
data$exang<-as.factor(data$exang)
data$slope<-as.factor(data$slope)
data$ca<-as.factor(data$ca)
data$thal<-as.factor(data$thal)
Converting the numeric columns, which were read in as character, to numeric
data$thalach<-as.numeric(data$thalach)
data$trestbps<-as.numeric(data$trestbps)
data$age<-as.numeric(data$age)
data$chol<-as.numeric(data$chol)
Let's look at the summary of the data
data_summ<-summary(data)
print(data_summ) #printing the summary
## age sex cp trestbps chol
## Min. :29.00 F: 96 asymptomatic :144 Min. : 94.0 Min. :100.0
## 1st Qu.:48.00 M:202 atypical angina: 49 1st Qu.:120.0 1st Qu.:211.0
## Median :56.00 non-anginal : 83 Median :130.0 Median :242.5
## Mean :54.49 typical angina : 22 Mean :131.7 Mean :246.8
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:275.8
## Max. :77.00 Max. :200.0 Max. :564.0
## fbs restecg thalach exang
## FALSE:256 lv hypertrophy :145 Min. : 71.0 FALSE:199
## TRUE : 42 normal :149 1st Qu.:132.2 TRUE : 99
## st-t abnormality: 4 Median :152.5
## Mean :149.3
## 3rd Qu.:165.8
## Max. :202.0
## oldpeak slope ca thal hd
## Length:298 downsloping: 20 0:175 fixed defect : 17 0:159
## Class :character flat :139 1: 65 normal :164 1: 56
## Mode :character upsloping :139 2: 38 reversable defect:117 2: 35
## 3: 20 3: 35
## 4: 13
##
## hd2
## Disease Absent :159
## Disease Present:139
##
##
##
##
Isolating the numeric data
disc_data<-data %>%
select(age, trestbps,chol,thalach,oldpeak)
disc_data<-lapply(disc_data,as.numeric) #converting every selected column to numeric (oldpeak is still character at this point)
disc_data<-as.data.frame(disc_data)
corr_matrix<-cor(disc_data)
ggcorrplot(corr_matrix, type="full", lab= TRUE)
Plotting Linear Regression model for age and thalach (Max. Heart Rate) as they are negatively correlated
ggplot(data, aes(x=age, y=thalach))+
geom_point()+
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
Linear regression model for age Vs thalach (Max. Heart Rate)
model1<-lm(thalach~age,data=data)
summary(model1)
##
## Call:
## lm(formula = thalach ~ age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.984 -11.941 4.214 16.118 45.188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 203.0999 7.5982 26.730 < 0.0000000000000002 ***
## age -0.9868 0.1376 -7.174 0.00000000000589 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 296 degrees of freedom
## Multiple R-squared: 0.1481, Adjusted R-squared: 0.1452
## F-statistic: 51.46 on 1 and 296 DF, p-value: 0.000000000005889
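Result: Age and maximum heart rate are negatively associated: in the fitted line, each additional year of age corresponds to a drop of about 0.99 beats per minute. Age nevertheless explains only about 14.8% of the variance in Max. Heart rate in this sample (R-squared 0.1481, adjusted R-squared 0.1452)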
Plotting Box Plot for Cholesterol Vs Fasting Blood Sugar
ggplot(data, aes(x = chol, y = fbs, color=fbs)) +
geom_boxplot() +
labs(x = "Cholesterol", y = "Fasting Blood Sugar") +
ggtitle("Cholesterol Vs. Fasting Blood Sugar")
Scatter plot of cholesterol vs. resting blood pressure
data %>%
ggplot(aes(x = chol, y = trestbps)) +
geom_point() +
labs(x = "Cholesterol", y = "Resting BP") +
ggtitle("Cholesterol vs. Resting Blood Pressure")
Plotting a Box Plot of Age Vs HD disease status using the ggplot2 package
ggplot(data, aes(x = age, y = hd2, color = hd2)) +
geom_boxplot() +
labs(x = "Age", y = "Heart Disease State") +
ggtitle("Age Vs the Heart Disease State") +
scale_color_manual(values = c("Disease Absent" = "steel blue", "Disease Present" = "light pink")) # the plot maps color, not fill, so the color scale is the one to set
T-Test to check whether the mean age differs significantly between the HD States (Disease Present or Disease Absent) in the sample
## Perform t-test
t_test_age_hd <- t.test(data$age ~ data$hd2, var.equal = FALSE, conf.level=0.95)
t_test_age_hd
##
## Welch Two Sample t-test
##
## data: data$age by data$hd2
## t = -4.0634, df = 295.08, p-value = 0.00006205
## alternative hypothesis: true difference in means between group Disease Absent and group Disease Present is not equal to 0
## 95 percent confidence interval:
## -6.092920 -2.116754
## sample estimates:
## mean in group Disease Absent mean in group Disease Present
## 52.57862 56.68345
Result: Null: There is no difference between the mean of ages among Heart Disease Present and Heart Disease Absent
Alternate: There is a difference between the mean of ages among Heart Disease Present and Heart Disease Absent
Critical T value: approximately 1.97 (two-sided, alpha=0.05, df = 295.08)
Conclusion: Since the absolute value of the observed t-statistic (|t| = 4.06) is greater than the critical t value at alpha=0.05, we reject the Null Hypothesis in favour of the Alternative Hypothesis. According to the Welch two-sample t-test, the mean ages of the Heart Disease Present and Heart Disease Absent groups differ significantly at the 5% significance level; the Disease Present group is about 4.1 years older on average (56.68 vs 52.58).
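The critical value and the p-value can be recomputed from the reported statistic rather than read from a table. This is a minimal sketch that plugs in the t value and Welch degrees of freedom printed in the test output above; the variable names are illustrative.

```r
alpha <- 0.05
t_obs <- -4.0634      # t statistic reported by t.test() above
df_welch <- 295.08    # Welch-Satterthwaite degrees of freedom reported above

# Two-sided critical value: upper alpha/2 quantile of the t distribution
t_crit <- qt(1 - alpha / 2, df = df_welch)

# Reject when the absolute statistic exceeds the critical value
reject_null <- abs(t_obs) > t_crit

# Two-sided p-value from the same distribution
p_two_sided <- 2 * pt(-abs(t_obs), df = df_welch)

print(t_crit)
print(reject_null)
print(p_two_sided)
```

Comparing abs(t_obs) with t_crit matters here because the reported statistic is negative.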
Visualizing the Frequency Distribution of different HD States [0,1,2,3,4] and the observed ECG slopes
freq_tab_slopehd2<-table(data$hd,data$slope)
freq_tab_slopehd2<-as.data.frame(freq_tab_slopehd2)
names(freq_tab_slopehd2)<-c("HD.State", "Slope","Frequency")
freq_tab_slopehd2
## HD.State Slope Frequency
## 1 0 downsloping 8
## 2 1 downsloping 2
## 3 2 downsloping 3
## 4 3 downsloping 5
## 5 4 downsloping 2
## 6 0 flat 48
## 7 1 flat 32
## 8 2 flat 25
## 9 3 flat 24
## 10 4 flat 10
## 11 0 upsloping 103
## 12 1 upsloping 22
## 13 2 upsloping 7
## 14 3 upsloping 6
## 15 4 upsloping 1
Using Dodge
ggplot(freq_tab_slopehd2, aes(x=HD.State, y=Frequency, fill=Slope))+
geom_bar(stat="identity", position="dodge")+
ggtitle("Frequency chart of HD state Vs the ECG Slope")+
theme_minimal()
Visualizing the Frequency Distribution of different HD States [Disease Present, Disease Absent] and the observed ECG slopes
freq_tab_slopehd<-table(data$hd2,data$slope)
freq_tab_slopehd<-as.data.frame(freq_tab_slopehd)
names(freq_tab_slopehd)<-c("HD.State", "Slope","Frequency")
freq_tab_slopehd
## HD.State Slope Frequency
## 1 Disease Absent downsloping 8
## 2 Disease Present downsloping 12
## 3 Disease Absent flat 48
## 4 Disease Present flat 91
## 5 Disease Absent upsloping 103
## 6 Disease Present upsloping 36
Using Dodge
ggplot(freq_tab_slopehd, aes(x=HD.State, y=Frequency, fill=Slope))+
geom_bar(stat="identity", position="dodge")+
ggtitle("Frequency chart of HD state Vs the ECG Slope")+
theme_minimal()
Using stack
ggplot(freq_tab_slopehd, aes(x=HD.State, y=Frequency, fill=Slope))+
geom_bar(stat="identity", position="stack")+
ggtitle("Frequency chart of HD state Vs the ECG Slope")+
theme_minimal()+
scale_fill_manual(values=my_colors)
Chi-Square Test to check whether the observed frequencies differ significantly from the frequencies expected if HD state and slope were independent
contingency_tab<-table(data$hd2,data$slope)
print(contingency_tab)
##
## downsloping flat upsloping
## Disease Absent 8 48 103
## Disease Present 12 91 36
Running Chi-Square Test
chi_square<- chisq.test(contingency_tab)
print(chi_square)
##
## Pearson's Chi-squared test
##
## data: contingency_tab
## X-squared = 45.259, df = 2, p-value = 0.0000000001487
Result:
Null: There is no association between Heart Disease State and Slope
Alternate:There is an association between Heart Disease State and Slope
Critical Chi_Square value: 5.99
Conclusion: According to the Chi_Square test, since the observed chi_square value (45.259) exceeds the critical chi square value at the 5 percent significance level, we reject the null hypothesis in favour of the alternative hypothesis that there exists an association between Heart disease state and slope of ECG
Relationship between the state of HD and the cp (Chest Pain)
freq_tab_cphd<-table(data$hd2,data$cp)
freq_tab_cphd<-as.data.frame(freq_tab_cphd)
names(freq_tab_cphd)<-c("HD.State","CP","Frequency")
print(freq_tab_cphd)
## HD.State CP Frequency
## 1 Disease Absent asymptomatic 39
## 2 Disease Present asymptomatic 105
## 3 Disease Absent atypical angina 40
## 4 Disease Present atypical angina 9
## 5 Disease Absent non-anginal 65
## 6 Disease Present non-anginal 18
## 7 Disease Absent typical angina 15
## 8 Disease Present typical angina 7
Data Visualization of HD state corresponding to the Chest Pain presented in the sample
ggplot(freq_tab_cphd, aes(x=HD.State, y=Frequency,fill=CP))+
geom_bar(stat="identity", position="stack")+
ggtitle("Frequency chart of HD state Vs the Chest Pain")+
theme_minimal()
Chi-Square Test to check for an association between Heart Disease State and the Chest Pain categories
contingency_tab2<-table(data$hd2,data$cp)
print(contingency_tab2)
##
## asymptomatic atypical angina non-anginal typical angina
## Disease Absent 39 40 65 15
## Disease Present 105 9 18 7
Running Chi-Square Test
chi_square2<- chisq.test(contingency_tab2)
print(chi_square2)
##
## Pearson's Chi-squared test
##
## data: contingency_tab2
## X-squared = 78.397, df = 3, p-value < 0.00000000000000022
Result:
Null: There is no association between Heart Disease State and Chest Pain
Alternate:There is an association between Heart Disease State and Chest Pain
Critical Chi_Square value: 7.81
Conclusion: According to the Chi Square test since the observed chi square value exceeds the critical chi square value at 95 percent confidence level, we reject the null hypothesis and accept the alternative hypothesis that there exists an association between the heart disease state and chest pain observed.
Chi-Square Test to check for an association between Heart Disease State and sex of the subject
contingency_tab4<-table(data$hd2,data$sex)
print(contingency_tab4)
##
## F M
## Disease Absent 71 88
## Disease Present 25 114
Running Chi-Square Test
chi_square4<- chisq.test(contingency_tab4)
print(chi_square4)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: contingency_tab4
## X-squared = 22.949, df = 1, p-value = 0.000001664
Result: Null: There is no association between Heart Disease State and sex of the subject
Alternate:There is an association between Heart Disease State and sex of the subject
Critical Chi_Square value: 3.84 (df = 1 for a 2 x 2 table)
Conclusion: According to the Chi Square test, since the observed chi square value (22.949) exceeds the critical chi square value at the 5 percent significance level, we reject the null hypothesis in favour of the alternative hypothesis that there exists an association between the heart disease state and sex of the subject.
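The critical value depends on the degrees of freedom, which for a contingency table are (rows - 1) x (columns - 1). The sketch below computes the critical values for the three tables tested above and rebuilds the HD state by sex test from the printed counts. chisq_critical is a hypothetical helper written for this illustration, not part of any package.

```r
# Hypothetical helper: degrees of freedom and critical value for an r x c table
chisq_critical <- function(n_rows, n_cols, alpha = 0.05) {
  df <- (n_rows - 1) * (n_cols - 1)
  c(df = df, critical = qchisq(1 - alpha, df = df))
}

crit_slope <- chisq_critical(2, 3)  # hd2 x slope
crit_cp    <- chisq_critical(2, 4)  # hd2 x cp
crit_sex   <- chisq_critical(2, 2)  # hd2 x sex

# Rebuild the HD state by sex table from the counts printed above
sex_tab <- matrix(c(71, 88,
                    25, 114),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("Disease Absent", "Disease Present"),
                                  c("F", "M")))

# chisq.test() applies Yates' continuity correction to 2 x 2 tables by default
sex_test <- chisq.test(sex_tab)

print(rbind(crit_slope, crit_cp, crit_sex))
print(sex_test$statistic)
```

The statistic is compared against the critical value for the matching degrees of freedom, so a 2 x 2 table uses a lower threshold than a 2 x 3 table.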
Example 1: T-Test to check significant difference in means of age in HD States
result<-data %>%
specify(response = age, explanatory = hd2) %>%
hypothesize(null="independence") %>%
calculate(stat="t", order=c("Disease Present","Disease Absent"))
result
## Response: age (numeric)
## Explanatory: hd2 (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 4.06
Result:
Null: There is no difference between the mean of ages among Heart Disease Present and Heart Disease Absent
Alternate:There is a difference between the mean of ages among Heart Disease Present and Heart Disease Absent
Critical T value: 1.96
Conclusion: According to the Two sample unpaired T-test, the observed statistic, 4.06 exceeds the critical T value and hence we reject the Null Hypothesis and accept the Alternative Hypothesis. We can conclude that there is a significant difference in mean of ages for Heart Disease Present and Heart Disease Absent subjects groups at a confidence level of 95%.
Example 2: Chi-square test to check the association between chest pain type and heart disease presence
chi_sq2<-data %>%
specify(hd2 ~ cp) %>%
hypothesize(null = "independence") %>%
generate(reps=1000,type="permute" ) %>%
calculate(stat = "Chisq")
chi_sq2
## Response: hd2 (factor)
## Explanatory: cp (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 2.40
## 2 2 5.43
## 3 3 3.08
## 4 4 10.2
## 5 5 2.35
## 6 6 3.21
## 7 7 2.39
## 8 8 3.26
## 9 9 2.15
## 10 10 0.747
## # ℹ 990 more rows
get_p_value(chi_sq2, obs_stat=chi_sq2, direction="two.sided")
## Warning: The first row and first column value of the given `obs_stat` will be
## used.
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.37
Result:
Null: There is no association between Heart Disease State and Chest Pain
Alternate:There is an association between Heart Disease State and Chest Pain
Critical Chi_Square value: 7.81
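Conclusion: The p-value of 0.37 printed above should not be interpreted. obs_stat was given the permutation distribution itself, so, as the warning indicates, its first row and first column value (the replicate number, 1) was used in place of the observed statistic. The observed chi square value from the Pearson test earlier in this report (78.397) exceeds the critical chi square value of 7.81 at the 5 percent significance level, so we reject the null hypothesis in favour of the alternative hypothesis that there exists an association between the heart disease state and chest pain observed.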
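A permutation p-value needs the observed statistic computed separately from the null distribution. The following is a minimal sketch of that workflow with infer, assuming R 4.1 or later for the native pipe. It rebuilds one row per subject from the HD state by chest pain counts printed earlier; toy, obs_stat, null_dist and p_val are illustrative names, and the seed is arbitrary.

```r
library(infer)

# Rebuild one row per subject from the contingency table printed earlier
counts <- matrix(c(39, 40, 65, 15,
                   105, 9, 18, 7),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(hd2 = c("Disease Absent", "Disease Present"),
                                 cp = c("asymptomatic", "atypical angina",
                                        "non-anginal", "typical angina")))
toy <- as.data.frame(as.table(counts))
toy <- toy[rep(seq_len(nrow(toy)), toy$Freq), c("hd2", "cp")]

# Observed statistic: no generate() step
obs_stat <- toy |>
  specify(hd2 ~ cp) |>
  hypothesize(null = "independence") |>
  calculate(stat = "Chisq")

# Null distribution: permute the labels, then compute the same statistic
set.seed(1)  # arbitrary seed so the permutations are reproducible
null_dist <- toy |>
  specify(hd2 ~ cp) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "Chisq")

# Chi-square statistics are only extreme in the upper tail
p_val <- get_p_value(null_dist, obs_stat = obs_stat, direction = "greater")

print(obs_stat)
print(p_val)
```

When no permuted statistic reaches the observed one, infer reports a p-value of 0 with a warning; this is better read as "less than 1 in the number of replicates" than as an exact zero.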
Example 3: Does the sex of the subject play a role in the prevalence of HD state as per the given sample data?
data %>%
specify(hd ~ sex )%>%
hypothesize(null = "independence") %>%
calculate(stat = "Chisq", order=c("F", "M"))
## Warning: Statistic is not based on a difference or ratio; the `order` argument
## will be ignored. Check `?calculate` for details.
## Response: hd (factor)
## Explanatory: sex (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 24.4
Result:
Null: There is no association between Heart Disease State and Sex
Alternate:There is an association between Heart Disease State and Sex
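Critical Chi_Square value: 9.49 (df = 4, because hd has five levels [0,1,2,3,4] and sex has two)
Conclusion: According to the Chi Square test, since the observed chi square value (24.4) exceeds the critical chi square value at the 5 percent significance level, we reject the null hypothesis in favour of the alternative hypothesis that there exists an association between the heart disease state and the sex of the subjects. The statistic differs from the 22.949 reported earlier because that test used the two-level hd2 variable with Yates' continuity correction.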