Heart Disease Dataset [EDA+ Inference]

NUID:002893549

Importing the libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(infer)
library(ggplot2)
library(ggcorrplot)
options(scipen=999)

Specifying colours to be used while plotting visualizations

my_colors<-c("steel blue", "green" ,"orange", "grey" ,"yellow" )

path<-"/Users/kareena_610/Desktop/R-Programming/Heart Disease/heart_disease_uci.csv"
data<-read.csv(path,header=FALSE)
head(data, n=5)

##   V1  V2   V3        V4             V5       V6   V7    V8             V9
## 1 id age  sex   dataset             cp trestbps chol   fbs        restecg
## 2  1  63 Male Cleveland typical angina      145  233  TRUE lv hypertrophy
## 3  2  67 Male Cleveland   asymptomatic      160  286 FALSE lv hypertrophy
## 4  3  67 Male Cleveland   asymptomatic      120  229 FALSE lv hypertrophy
## 5  4  37 Male Cleveland    non-anginal      130  250 FALSE         normal
##      V10   V11     V12         V13 V14               V15 V16
## 1 thalch exang oldpeak       slope  ca              thal num
## 2    150 FALSE     2.3 downsloping   0      fixed defect   0
## 3    108  TRUE     1.5        flat   3            normal   2
## 4    129  TRUE     2.6        flat   2 reversable defect   1
## 5    187 FALSE     3.5 downsloping   0            normal   0

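# Drop the header row (row 1) and the id and dataset columns (columns 1 and 4).
# Note: -c(1,2) also drops row 2, which is the first patient record (id 1).
data<-data[-c(1,2),-c(1,4)]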
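As a sketch of an alternative, read.csv() can take the column names from the header line and mark blank cells as NA while reading, so that no data row has to be removed by position. The inline csv_text below is only a stand-in for the real file (it holds the header and the first four records printed above), and demo is a hypothetical name.

```r
# Stand-in for the CSV file: the header line plus the first four records shown above.
csv_text <- "id,age,sex,dataset,cp,trestbps,chol,fbs,restecg,thalch,exang,oldpeak,slope,ca,thal,num
1,63,Male,Cleveland,typical angina,145,233,TRUE,lv hypertrophy,150,FALSE,2.3,downsloping,0,fixed defect,0
2,67,Male,Cleveland,asymptomatic,160,286,FALSE,lv hypertrophy,108,TRUE,1.5,flat,3,normal,2
3,67,Male,Cleveland,asymptomatic,120,229,FALSE,lv hypertrophy,129,TRUE,2.6,flat,2,reversable defect,1
4,37,Male,Cleveland,non-anginal,130,250,FALSE,normal,187,FALSE,3.5,downsloping,0,normal,0"

# header = TRUE uses the first line as column names;
# na.strings turns empty or single-space cells into NA at read time.
demo <- read.csv(text = csv_text, header = TRUE, na.strings = c("", " "))

# Drop the id and dataset columns by name instead of by position.
demo <- demo[, !(names(demo) %in% c("id", "dataset"))]
str(demo)
```

Read this way, all four records are kept and the numeric columns arrive as numbers rather than character strings.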

We name columns as on the UCI Website

colnames(data)<- c("age","sex","cp","trestbps","chol","fbs","restecg","thalach",
                   "exang","oldpeak","slope","ca","thal","hd")

Some columns (exang, fbs, restecg, ca) have missing values. Blank entries are recoded as NA and the incomplete rows are dropped

data[data == " "] <- NA
data[data == ""] <- NA
data<-data %>% 
    drop_na()

typeof(data) # typeof() reports "list" for any data frame, because a data frame is stored as a list of columns

## [1] "list"

data<-as.data.frame(data)
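A small illustration of why typeof() prints "list" here: it reports the storage type, while class() and is.data.frame() report what the object is treated as. The toy data frame df is hypothetical.

```r
# A toy data frame, unrelated to the heart disease data.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

typeof(df)         # storage type of the object
class(df)          # class used for method dispatch
is.data.frame(df)  # direct check for a data frame
```

So an object can already be a data frame even though typeof() says "list".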

nrow(data)

## [1] 298

The dataset consists of 14 columns: 13 attributes and the diagnosis (hd). The numeric codes listed below follow the UCI documentation; this CSV stores the categorical columns as text labels instead (for example Male/Female, TRUE/FALSE, "typical angina")

V1<-age: age of subject
V2<-sex: sex of subject (1 = male; 0 = female)
V3<-cp: chest pain type Value 1: typical angina Value 2: atypical angina Value 3: non-anginal pain Value 4: asymptomatic
V4<-trestbps: resting blood pressure (on admission to the hospital) (in mm Hg)
V5<-chol: serum cholesterol (in mg/dl)
V6<-fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
V7<-restecg: resting electrocardiographic results Value 0: normal Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
V8<-thalach: Max. heart rate reached
V9<-exang: Exercise induced angina (1 = yes; 0 = no)
V10<-oldpeak: ST depression induced by exercise relative to rest
V11<-slope: the slope of the peak exercise ST segment Value 1: upsloping Value 2: flat Value 3: downsloping
V12<-ca: No. of major vessels (0-3) colored by fluoroscopy
V13<-thal: categorical variable Value 3: normal Value 6: fixed defect Value 7: Reversible defect
V14<-hd: diagnosis of heart disease (angiographic disease status) Value 0: Absence of Heart Disease (<50% Vessel Narrowing) Value 1,2,3,4: Presence of Heart Disease (>50% Vessel Narrowing)

Viewing the Structure of the data frame

str(data)

## 'data.frame':    298 obs. of  14 variables:
##  $ age     : chr  "67" "67" "37" "41" ...
##  $ sex     : chr  "Male" "Male" "Male" "Female" ...
##  $ cp      : chr  "asymptomatic" "asymptomatic" "non-anginal" "atypical angina" ...
##  $ trestbps: chr  "160" "120" "130" "130" ...
##  $ chol    : chr  "286" "229" "250" "204" ...
##  $ fbs     : chr  "FALSE" "FALSE" "FALSE" "FALSE" ...
##  $ restecg : chr  "lv hypertrophy" "lv hypertrophy" "normal" "lv hypertrophy" ...
##  $ thalach : chr  "108" "129" "187" "172" ...
##  $ exang   : chr  "TRUE" "TRUE" "FALSE" "FALSE" ...
##  $ oldpeak : chr  "1.5" "2.6" "3.5" "1.4" ...
##  $ slope   : chr  "flat" "flat" "downsloping" "upsloping" ...
##  $ ca      : chr  "3" "2" "0" "0" ...
##  $ thal    : chr  "normal" "reversable defect" "normal" "normal" ...
##  $ hd      : chr  "2" "1" "0" "0" ...

Changing the datatype into factors/categories

Changing sex as factors [M, F]

data<-data %>% 
      mutate(sex=if_else(sex=="Female", "F", "M"))

data$sex<-as.factor(data$sex)

Creating a binary column hd2 from hd with the values [“Disease Absent”, “Disease Present”]

data<-data %>% 
     mutate(hd2=ifelse(hd==0, "Disease Absent", "Disease Present"))
data$hd<-as.factor(data$hd)
data$hd2<-as.factor(data$hd2)

Converting the categorical data columns to factors

data$cp<-as.factor(data$cp)
data$fbs<-as.factor(data$fbs)
data$restecg<-as.factor(data$restecg)
data$exang<-as.factor(data$exang)
data$slope<-as.factor(data$slope)
data$ca<-as.factor(data$ca)
data$thal<-as.factor(data$thal)

Converting the numeric columns (read in as character) to numeric

data$thalach<-as.numeric(data$thalach)
data$trestbps<-as.numeric(data$trestbps)
data$age<-as.numeric(data$age)
data$chol<-as.numeric(data$chol)
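The same conversion can be written once for a set of columns with dplyr's across(). This is only a sketch on a toy data frame (toy and num_cols are hypothetical names, with a few values taken from the records shown earlier); it also includes oldpeak, which the summary below shows is still stored as character after the four conversions above.

```r
library(dplyr)

# Toy data frame with numbers stored as character, like str(data) shows above.
toy <- data.frame(age     = c("67", "37"),
                  chol    = c("286", "250"),
                  oldpeak = c("1.5", "3.5"),
                  sex     = c("Male", "Male"),
                  stringsAsFactors = FALSE)

# Columns that hold numbers; oldpeak is included here.
num_cols <- c("age", "chol", "oldpeak")

# Convert every listed column in one step.
toy <- toy %>%
  mutate(across(all_of(num_cols), as.numeric))

sapply(toy, class)
```

Columns not listed in num_cols, such as sex, are left untouched.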

Let's see the summary of the data

data_summ<-summary(data)
print(data_summ) #printing the summary

##       age        sex                   cp         trestbps          chol      
##  Min.   :29.00   F: 96   asymptomatic   :144   Min.   : 94.0   Min.   :100.0  
##  1st Qu.:48.00   M:202   atypical angina: 49   1st Qu.:120.0   1st Qu.:211.0  
##  Median :56.00           non-anginal    : 83   Median :130.0   Median :242.5  
##  Mean   :54.49           typical angina : 22   Mean   :131.7   Mean   :246.8  
##  3rd Qu.:61.00                                 3rd Qu.:140.0   3rd Qu.:275.8  
##  Max.   :77.00                                 Max.   :200.0   Max.   :564.0  
##     fbs                  restecg       thalach        exang    
##  FALSE:256   lv hypertrophy  :145   Min.   : 71.0   FALSE:199  
##  TRUE : 42   normal          :149   1st Qu.:132.2   TRUE : 99  
##              st-t abnormality:  4   Median :152.5              
##                                     Mean   :149.3              
##                                     3rd Qu.:165.8              
##                                     Max.   :202.0              
##    oldpeak                  slope     ca                     thal     hd     
##  Length:298         downsloping: 20   0:175   fixed defect     : 17   0:159  
##  Class :character   flat       :139   1: 65   normal           :164   1: 56  
##  Mode  :character   upsloping  :139   2: 38   reversable defect:117   2: 35  
##                                       3: 20                           3: 35  
##                                                                       4: 13  
##                                                                              
##               hd2     
##  Disease Absent :159  
##  Disease Present:139  
##                       
##                       
##                       
##

Exploratory Data Analysis using some basic visualizations

Correlation Matrix for the dataset

## Isolating the continuous columns
disc_data<-data %>% 
     select(age, trestbps,chol,thalach,oldpeak)
disc_data<-lapply(disc_data,as.numeric) # converting each selected column to numeric (oldpeak is still character at this point)
disc_data<-as.data.frame(disc_data)
corr_matrix<-cor(disc_data)
ggcorrplot(corr_matrix, type="full", lab= TRUE)

Plotting a linear fit of thalach (Max. Heart Rate) against age, as the two are negatively correlated

ggplot(data, aes(x=age, y=thalach))+
    geom_point()+
    geom_smooth(method="lm")

## `geom_smooth()` using formula = 'y ~ x'

Linear regression model for age Vs thalach (Max. Heart Rate)

model1<-lm(thalach~age,data=data)
summary(model1)

## 
## Call:
## lm(formula = thalach ~ age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.984 -11.941   4.214  16.118  45.188 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 203.0999     7.5982  26.730 < 0.0000000000000002 ***
## age          -0.9868     0.1376  -7.174     0.00000000000589 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 296 degrees of freedom
## Multiple R-squared:  0.1481, Adjusted R-squared:  0.1452 
## F-statistic: 51.46 on 1 and 296 DF,  p-value: 0.000000000005889

Result: Age and maximum heart rate attained are negatively associated (slope of about -0.99 beats per minute per year of age, p < 0.001), but age explains only about 15% of the variance observed in Max. Heart rate in this sample (R-squared = 0.148, adjusted R-squared = 0.145)
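The numbers in the regression summary are tied together, and a short check makes that visible. The sketch below only re-uses values copied from the summary(model1) output; with a single predictor, F equals the squared t value of the slope, and R-squared follows from F and the residual degrees of freedom.

```r
# Values copied from the summary(model1) output above.
t_age    <- -7.174   # t value of the age coefficient
df_resid <- 296      # residual degrees of freedom
n        <- 298      # number of rows used in the fit

f_stat <- t_age^2                               # one predictor: F = t^2
r2     <- f_stat / (f_stat + df_resid)          # R-squared from F and residual df
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid     # adjusted R-squared
r      <- -sqrt(r2)                             # Pearson correlation; sign follows the slope

round(c(F = f_stat, R2 = r2, adj_R2 = adj_r2, r = r), 4)
```

The implied correlation between age and thalach is therefore moderate and negative, which is what the correlation matrix plot displays.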

Plotting Box Plot for Cholesterol Vs Fasting Blood Sugar

 ggplot(data, aes(x = chol, y = fbs, color=fbs)) +
    geom_boxplot() +
    labs(x = "Cholesterol", y = "Fasting Blood Sugar") +
    ggtitle("Cholesterol Vs. Fasting Blood Sugar")

Scatter plot of cholesterol vs. resting blood pressure

data %>%
    ggplot(aes(x = chol, y = trestbps)) +
    geom_point() +
    labs(x = "Cholesterol", y = "Resting BP") +
    ggtitle("Cholesterol vs. Resting Blood Pressure")

Plotting a Box Plot of Age Vs heart disease status using the ggplot2 package

ggplot(data, aes(x = age, y = hd2, color = hd2)) +
    geom_boxplot() +
    labs(x = "Age", y = "Heart Disease State") +
    ggtitle("Age Vs the Heart Disease State") +
    scale_color_manual(values = c("Disease Absent" = "steel blue", "Disease Present" = "light pink")) # colour scale, matching the color aesthetic set in aes()

T-Test: is there a significant difference in mean age between the HD states (Disease Present vs Disease Absent) in the sample?

## Perform t-test
t_test_age_hd <- t.test(data$age ~ data$hd2, var.equal = FALSE, conf.level=0.95)
t_test_age_hd

## 
##  Welch Two Sample t-test
## 
## data:  data$age by data$hd2
## t = -4.0634, df = 295.08, p-value = 0.00006205
## alternative hypothesis: true difference in means between group Disease Absent and group Disease Present is not equal to 0
## 95 percent confidence interval:
##  -6.092920 -2.116754
## sample estimates:
##  mean in group Disease Absent mean in group Disease Present 
##                      52.57862                      56.68345

Result:

Null: There is no difference in mean age between the Heart Disease Present and Heart Disease Absent groups

Alternate: There is a difference in mean age between the Heart Disease Present and Heart Disease Absent groups

Critical T value: approximately 1.97 (two-sided, alpha=0.05, df=295.08)

Conclusion: Since the absolute value of the observed t-statistic (|t| = 4.06) is greater than the critical t value at alpha=0.05, we reject the Null Hypothesis in favour of the Alternative Hypothesis. According to the Welch two-sample T-test, there is a significant difference in mean age between the Heart Disease Present (56.7 years) and Heart Disease Absent (52.6 years) groups at the 5% significance level.
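The critical value, p-value and confidence interval can be recomputed from the printed t statistic, degrees of freedom and group means. This is a sketch that only uses numbers copied from the t.test() output above; the variable names are hypothetical.

```r
# Values copied from the Welch t-test output above.
t_obs        <- -4.0634
df_w         <- 295.08
mean_absent  <- 52.57862
mean_present <- 56.68345

# Two-sided critical value at alpha = 0.05 for these degrees of freedom.
t_crit <- qt(0.975, df = df_w)

# Two-sided p-value: probability of a |t| at least this large under the null.
p_val <- 2 * pt(-abs(t_obs), df = df_w)

# Standard error recovered from the difference in means and the t statistic,
# then the 95% confidence interval for the difference (Absent minus Present).
diff_means <- mean_absent - mean_present
se         <- diff_means / t_obs
ci         <- diff_means + c(-1, 1) * t_crit * se

t_crit
p_val
ci
```

The decision rule compares abs(t_obs) with t_crit, which is why the negative sign of the statistic does not matter for a two-sided test.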

Relationship between the state of HD and the ECG slopes

Visualizing the Frequency Distribution of different HD States [0,1,2,3,4] and the observed ECG slopes

freq_tab_slopehd2<-table(data$hd,data$slope)

freq_tab_slopehd2<-as.data.frame(freq_tab_slopehd2)
names(freq_tab_slopehd2)<-c("HD.State", "Slope","Frequency")
freq_tab_slopehd2

##    HD.State       Slope Frequency
## 1         0 downsloping         8
## 2         1 downsloping         2
## 3         2 downsloping         3
## 4         3 downsloping         5
## 5         4 downsloping         2
## 6         0        flat        48
## 7         1        flat        32
## 8         2        flat        25
## 9         3        flat        24
## 10        4        flat        10
## 11        0   upsloping       103
## 12        1   upsloping        22
## 13        2   upsloping         7
## 14        3   upsloping         6
## 15        4   upsloping         1

Using Dodge

ggplot(freq_tab_slopehd2, aes(x=HD.State, y=Frequency, fill=Slope))+
    geom_bar(stat="identity", position="dodge")+
    ggtitle("Frequency chart of HD state Vs the ECG Slope")+
    theme_minimal()

Visualizing the Frequency Distribution of different HD States [Disease Present, Disease Absent] and the observed ECG slopes

freq_tab_slopehd<-table(data$hd2,data$slope)

freq_tab_slopehd<-as.data.frame(freq_tab_slopehd)
names(freq_tab_slopehd)<-c("HD.State", "Slope","Frequency")
freq_tab_slopehd

##          HD.State       Slope Frequency
## 1  Disease Absent downsloping         8
## 2 Disease Present downsloping        12
## 3  Disease Absent        flat        48
## 4 Disease Present        flat        91
## 5  Disease Absent   upsloping       103
## 6 Disease Present   upsloping        36

Using Dodge

ggplot(freq_tab_slopehd, aes(x=HD.State, y=Frequency, fill=Slope))+
    geom_bar(stat="identity", position="dodge")+
    ggtitle("Frequency chart of HD state Vs the ECG Slope")+
    theme_minimal()

Using stack

ggplot(freq_tab_slopehd, aes(x=HD.State, y=Frequency, fill=Slope))+
    geom_bar(stat="identity", position="stack")+
    ggtitle("Frequency chart of HD state Vs the ECG Slope")+
    theme_minimal()+
    scale_fill_manual(values=my_colors)

Chi-Square Test to check whether the observed frequencies differ significantly from the frequencies expected if HD state and slope were independent

contingency_tab<-table(data$hd2,data$slope)
print(contingency_tab)

##                  
##                   downsloping flat upsloping
##   Disease Absent            8   48       103
##   Disease Present          12   91        36

Running Chi-Square Test

chi_square<- chisq.test(contingency_tab)
print(chi_square)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_tab
## X-squared = 45.259, df = 2, p-value = 0.0000000001487

Result:

Null: There is no association between Heart Disease State and Slope

Alternate:There is an association between Heart Disease State and Slope

Critical Chi_Square value: 5.99

Conclusion: According to the Chi_Square test, since the observed chi_square value (45.259) exceeds the critical chi square value (5.99) at the 5 percent significance level, we reject the null hypothesis in favour of the alternative hypothesis that there exists an association between Heart disease state and slope of ECG.
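To show where the statistic and the critical value come from, the test can be worked by hand from the counts in the contingency table above. This is a sketch; obs, expected and the other names are hypothetical.

```r
# Observed counts copied from contingency_tab above (filled column by column).
obs <- matrix(c(8, 12, 48, 91, 103, 36), nrow = 2,
              dimnames = list(hd2   = c("Disease Absent", "Disease Present"),
                              slope = c("downsloping", "flat", "upsloping")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square statistic and its degrees of freedom.
chi_sq <- sum((obs - expected)^2 / expected)
dof    <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Critical value at alpha = 0.05 and the p-value (upper tail only).
crit  <- qchisq(0.95, df = dof)
p_val <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

round(expected, 1)
c(chi_sq = chi_sq, df = dof, critical = crit, p_value = p_val)
```

The largest gaps between observed and expected counts are in the flat and upsloping columns, which is where the association mainly comes from.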

Relationship between the state of HD and the cp (Chest Pain)

freq_tab_cphd<-table(data$hd2,data$cp)
freq_tab_cphd<-as.data.frame(freq_tab_cphd)
names(freq_tab_cphd)<-c("HD.State","CP","Frequency")
print(freq_tab_cphd)

##          HD.State              CP Frequency
## 1  Disease Absent    asymptomatic        39
## 2 Disease Present    asymptomatic       105
## 3  Disease Absent atypical angina        40
## 4 Disease Present atypical angina         9
## 5  Disease Absent     non-anginal        65
## 6 Disease Present     non-anginal        18
## 7  Disease Absent  typical angina        15
## 8 Disease Present  typical angina         7

Data Visualization of HD state corresponding to the Chest Pain presented in the sample

ggplot(freq_tab_cphd, aes(x=HD.State, y=Frequency,fill=CP))+
    geom_bar(stat="identity", position="stack")+
    ggtitle("Frequency chart of HD state Vs the Chest Pain")+
    theme_minimal()

Chi-Square Test to check for an association between HD state and the Chest Pain categories

contingency_tab2<-table(data$hd2,data$cp)
print(contingency_tab2)

##                  
##                   asymptomatic atypical angina non-anginal typical angina
##   Disease Absent            39              40          65             15
##   Disease Present          105               9          18              7

Running Chi-Square Test

chi_square2<- chisq.test(contingency_tab2)
print(chi_square2)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_tab2
## X-squared = 78.397, df = 3, p-value < 0.00000000000000022

Result:

Null: There is no association between Heart Disease State and Chest Pain

Alternate:There is an association between Heart Disease State and Chest Pain

Critical Chi_Square value: 7.81

Conclusion: According to the Chi Square test, since the observed chi square value (78.397) exceeds the critical chi square value (7.81) at the 5 percent significance level, we reject the null hypothesis in favour of the alternative hypothesis that there exists an association between the heart disease state and chest pain observed.
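The test says that an association exists but not what it looks like. One way to see its direction, sketched here from the counts in contingency_tab2 above, is to convert each chest pain column into proportions. The names cp_tab and col_prop are hypothetical.

```r
# Observed counts copied from contingency_tab2 above (filled column by column).
cp_tab <- matrix(c(39, 105, 40, 9, 65, 18, 15, 7), nrow = 2,
                 dimnames = list(hd2 = c("Disease Absent", "Disease Present"),
                                 cp  = c("asymptomatic", "atypical angina",
                                         "non-anginal", "typical angina")))

# margin = 2 makes each column (chest pain type) sum to 1.
col_prop <- prop.table(cp_tab, margin = 2)

# Percentage of each chest pain group with and without disease.
round(100 * col_prop, 1)
```

In this sample about 73% of the asymptomatic subjects have disease present, against roughly 18% to 32% in the other chest pain groups.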

Chi-Square Test to check for an association between HD state and the sex of the subject

contingency_tab4<-table(data$hd2,data$sex)
print(contingency_tab4)

##                  
##                     F   M
##   Disease Absent   71  88
##   Disease Present  25 114

Running Chi-Square Test

chi_square4<- chisq.test(contingency_tab4)
print(chi_square4)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_tab4
## X-squared = 22.949, df = 1, p-value = 0.000001664

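Result:

Null: There is no association between Heart Disease State and sex of the subject

Alternate: There is an association between Heart Disease State and sex of the subject

Critical Chi_Square value: 3.84 (df = 1)

Conclusion: Since the observed chi square value (22.949) exceeds the critical chi square value at the 5 percent significance level, we reject the null hypothesis in favour of the alternative hypothesis that there exists an association between the heart disease state and the sex of the subject.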
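For a 2 x 2 table the test has one degree of freedom, and chisq.test() applies Yates' continuity correction by default, as the output header above shows. The sketch below rebuilds the table from the printed counts and compares the corrected and uncorrected statistics; sex_tab and the other names are hypothetical.

```r
# Observed counts copied from contingency_tab4 above (filled column by column).
sex_tab <- matrix(c(71, 25, 88, 114), nrow = 2,
                  dimnames = list(hd2 = c("Disease Absent", "Disease Present"),
                                  sex = c("F", "M")))

# Default for 2 x 2 tables: Yates' continuity correction is applied.
with_yates <- chisq.test(sex_tab)

# Same test without the correction, for comparison.
without_yates <- chisq.test(sex_tab, correct = FALSE)

# Critical value at alpha = 0.05 for df = 1.
crit <- qchisq(0.95, df = 1)

c(corrected   = unname(with_yates$statistic),
  uncorrected = unname(without_yates$statistic),
  critical    = crit)
```

The correction shrinks the statistic slightly, so the corrected test is the more conservative of the two.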

Statistical inferences using infer package

Example 1: T-Test to check significant difference in means of age in HD States

result<-data %>%
    specify(response = age, explanatory = hd2) %>%
    hypothesize(null="independence") %>% 
    calculate(stat="t", order=c("Disease Present","Disease Absent"))
result

## Response: age (numeric)
## Explanatory: hd2 (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  4.06

Result:

Null: There is no difference between the mean of ages among Heart Disease Present and Heart Disease Absent

Alternate:There is a difference between the mean of ages among Heart Disease Present and Heart Disease Absent

Critical T value: approximately 1.97 (two-sided, alpha=0.05, df of about 295)

Conclusion: According to the two-sample unpaired T-test, the observed statistic, 4.06, exceeds the critical T value and hence we reject the Null Hypothesis in favour of the Alternative Hypothesis. We can conclude that there is a significant difference in mean age between the Heart Disease Present and Heart Disease Absent groups at the 5% significance level.

Example 2: Chi-square test to check the association between chest pain type and heart disease presence

chi_sq2<-data %>%
    specify(hd2 ~ cp) %>%
    hypothesize(null = "independence") %>%
    generate(reps=1000,type="permute" ) %>% 
    calculate(stat = "Chisq")
chi_sq2

## Response: hd2 (factor)
## Explanatory: cp (factor)
## Null Hypothesis: independence
## # A tibble: 1,000 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1  2.40 
##  2         2  5.43 
##  3         3  3.08 
##  4         4 10.2  
##  5         5  2.35 
##  6         6  3.21 
##  7         7  2.39 
##  8         8  3.26 
##  9         9  2.15 
## 10        10  0.747
## # ℹ 990 more rows

get_p_value(chi_sq2, obs_stat=chi_sq2, direction="two.sided")

## Warning: The first row and first column value of the given `obs_stat` will be
## used.

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1    0.37

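Result:

Null: There is no association between Heart Disease State and Chest Pain

Alternate: There is an association between Heart Disease State and Chest Pain

Critical Chi_Square value: 7.81

Note: The p-value of 0.37 printed above is not the p-value of the observed data. The obs_stat argument was given the simulated null distribution itself, so, as the warning says, only its first value (2.40, the statistic of the first permuted replicate) was used as the observed statistic. The observed chi square value for this table is 78.397 (from chisq.test on contingency_tab2 above), which exceeds the critical value, so the null hypothesis of no association is rejected at the 5 percent significance level. For a chi square statistic the p-value should also be one-sided (direction="greater"), since only large values count as evidence against the null.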
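A permutation test with infer needs two separate objects: the statistic observed in the data and the null distribution it is compared against. The sketch below keeps them apart. Because the original CSV is not available here, the rows are rebuilt from the chest pain counts printed earlier; demo, obs_chisq and null_dist are hypothetical names, and the seed is arbitrary.

```r
library(dplyr)   # for the %>% pipe
library(infer)

# Rebuild one row per subject from the printed HD state by chest pain counts.
cp_counts <- data.frame(
  hd2 = rep(c("Disease Absent", "Disease Present"), times = 4),
  cp  = rep(c("asymptomatic", "atypical angina", "non-anginal", "typical angina"),
            each = 2),
  n   = c(39, 105, 40, 9, 65, 18, 15, 7)
)
demo <- cp_counts[rep(seq_len(nrow(cp_counts)), cp_counts$n), c("hd2", "cp")]
demo$hd2 <- factor(demo$hd2)
demo$cp  <- factor(demo$cp)

# Step 1: the chi-square statistic observed in the data (no generate() step).
obs_chisq <- demo %>%
  specify(hd2 ~ cp) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")

# Step 2: the null distribution from 1000 permutations.
set.seed(2023)
null_dist <- demo %>%
  specify(hd2 ~ cp) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")

# Step 3: one-sided p-value, the share of permuted statistics at least as large.
p_val <- get_p_value(null_dist, obs_stat = obs_chisq, direction = "greater")
obs_chisq
p_val
```

With only 1000 replicates the reported p-value cannot be smaller than 1/1000 in any meaningful sense, so a very small or zero value should be read as "below 0.001" rather than as exactly zero.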

Example 3: Does sex play a role in the prevalence of HD state (the five-level hd column) in the given sample data?

data %>%
    specify(hd ~ sex )%>%
    hypothesize(null = "independence") %>%
    calculate(stat = "Chisq", order=c("F", "M"))

## Warning: Statistic is not based on a difference or ratio; the `order` argument
## will be ignored. Check `?calculate` for details.

## Response: hd (factor)
## Explanatory: sex (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  24.4

Result:

Null: There is no association between Heart Disease State and Sex

Alternate:There is an association between Heart Disease State and Sex

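Critical Chi_Square value: 9.49 (df = 4, because this example uses hd with five levels and sex with two)

Conclusion: According to the Chi Square test, since the observed chi square value (24.4) exceeds the critical chi square value (9.49) at the 5 percent significance level, we reject the null hypothesis in favour of the alternative hypothesis that there exists an association between the heart disease state and the sex of the subjects. The warning above only says that the order argument has no effect for a chi square statistic.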
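The critical value depends on the degrees of freedom, which in turn depend on the size of the contingency table: df = (rows - 1) * (columns - 1). A small helper makes this explicit; chisq_crit is a hypothetical function written for this illustration, not part of any package.

```r
# Degrees of freedom and critical value for an r x c contingency table.
chisq_crit <- function(n_rows, n_cols, alpha = 0.05) {
  dof <- (n_rows - 1) * (n_cols - 1)
  c(df = dof, critical = qchisq(1 - alpha, df = dof))
}

chisq_crit(2, 2)  # hd2 by sex
chisq_crit(2, 3)  # hd2 by slope
chisq_crit(2, 4)  # hd2 by cp
chisq_crit(5, 2)  # hd (five levels) by sex
```

Using the five-level hd column instead of the binary hd2 column therefore raises the degrees of freedom from 1 to 4 and the critical value with it.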


Kareena Anil Mulchandani

20/06/2023
