Part 1 - Introduction

The heart is one of the most important organs in the body. Five great vessels enter and leave the heart: the superior and inferior vena cava, the pulmonary artery, the pulmonary vein, and the aorta. A malfunction of the heart is called heart disease or cardiac disease.

Many factors can increase the risk of heart disease. Some of these factors are outside our control, but many can be avoided by choosing a healthy lifestyle. The factors that cannot be controlled are gender, age, family history, and heart shape. The controllable risk factors are high blood pressure, cholesterol level, obesity, smoking, and diabetes.

Heart disease is a leading cause of death. One person dies every 36 seconds in the United States from cardiovascular disease. About 655,000 Americans die from heart disease each year, which is 1 in every 4 deaths. In this analysis, I will use a heart disease dataset to explore the features most strongly associated with heart disease. I will also fit a logistic regression model to predict whether or not a patient has heart disease.

Purpose of the analysis:

Heart disease analysis and prediction

Research question

Which factor is most strongly associated with heart disease in males and in females?

Analysis Methods:

1- Statistical Analysis

2- Feature importance/selection

3- Logistic regression modeling and prediction

Libraries required for the analysis.

library(ggplot2)
library(DATA606)
library(psych)
library(corrplot)
library(dplyr)
library(caTools)
library(caret)
library(randomForest)

Part 2 - Data

# load data
#Data <- read.csv("heart.csv")

Data <- read.csv("https://github.com/GehadGad/Heart-disease-dataset/raw/main/heart.csv")

Data Creators:

  1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.

  2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.

  3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.

  4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

#Display the first few rows in the data
head(Data)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  52   1  0      125  212   0       1     168     0     1.0     2  2    3
## 2  53   1  0      140  203   1       0     155     1     3.1     0  0    3
## 3  70   1  0      145  174   0       1     125     1     2.6     0  0    3
## 4  61   1  0      148  203   0       1     161     0     0.0     2  1    3
## 5  62   0  0      138  294   1       1     106     0     1.9     1  3    2
## 6  58   0  0      100  248   0       0     122     0     1.0     1  0    2
##   target
## 1      0
## 2      0
## 3      0
## 4      0
## 5      0
## 6      1

There are 14 variables (columns) in this dataset and 1025 observations from individual patients.

Part 3 - Exploratory data analysis

To understand the data better, I created the following table, which gives the description, data type, and values of each feature.

Feature | Definition | Type | Value
age | Patient’s age in years | Numerical | 29 - 77
sex | Gender | Nominal | (0) female, (1) male
cp | Type of chest pain | Nominal | (0) typical angina, (1) atypical angina, (2) non-anginal pain, (3) asymptomatic
trestbps | Resting blood pressure in mmHg | Numerical | 94 - 200
chol | Serum cholesterol in mg/dl | Numerical | 126 - 564
fbs | Fasting blood sugar higher than 120 mg/dl | Nominal | (0) False, (1) True
restecg | Resting electrocardiographic results | Nominal | (0) normal, (1) having ST-T wave abnormality, (2) showing probable left ventricular hypertrophy
thalach | Maximum heart rate achieved | Numerical | 71 - 202
exang | Exercise-induced angina | Nominal | (0) No, (1) Yes
oldpeak | ST depression induced by exercise relative to rest | Numerical | 0 - 6.2
slope | The slope of the peak exercise ST segment | Nominal | (1) upsloping, (2) flat, (3) downsloping
ca | Number of major vessels colored by fluoroscopy | Nominal | 0, 1, 2, 3
thal | Thalassemia | Nominal | (3) normal, (6) fixed defect, (7) reversible defect
target | Diagnosis of heart disease | Nominal | (0) heart disease not present, (1) heart disease present

Note: the value labels listed for slope, ca, and thal do not match the codes found in this file. The summary output shows slope ranging from 0 to 2, ca from 0 to 4, and thal from 0 to 3.

Type of study

This is an observational study.

Dependent Variable

Target is the response (dependent) variable, and it is qualitative.

Independent Variable

The independent variables are age, gender, and all the other features.

Check if there are missing values (NA) in the data

sum(is.na(Data))
## [1] 0

There are no missing values in this data.
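A single total can hide where missing values occur. As a minimal sketch, the per-column count below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from the heart dataset); the same `colSums(is.na(...))` call could be applied to `Data`.

```r
# Hypothetical toy data with two missing entries (not the real dataset)
toy <- data.frame(
  age  = c(52, NA, 70, 61),
  chol = c(212, 203, NA, 203),
  sex  = c(1, 1, 1, 0)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Total number of missing values, equivalent to sum(is.na(toy))
na_total <- sum(na_per_column)
print(na_total)
```

A per-column view makes it easier to decide whether to drop a feature or impute it if missing values do appear.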

#Count the number of patients who have or have not been diagnosed with heart disease.

Data %>% count(target)
##   target   n
## 1      0 499
## 2      1 526

499 patients do not have heart disease and 526 have heart disease.

#The proportion of patients with each chest pain type.

Data %>% group_by(cp) %>% 
  summarise( percent = 100 * n() / nrow( Data ))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
##      cp percent
##   <int>   <dbl>
## 1     0   48.5 
## 2     1   16.3 
## 3     2   27.7 
## 4     3    7.51

About 48.5% of patients have typical angina, 16.3% have atypical angina, 27.7% have non-anginal pain, and 7.5% are asymptomatic.

#The proportion of female and male patients in the dataset.

Data %>% 
    group_by( sex ) %>% 
    summarise( percent = 100 * n() / nrow( Data ))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##     sex percent
##   <int>   <dbl>
## 1     0    30.4
## 2     1    69.6

There are 30.4% females and 69.6% males in the dataset.

  • Check the percentage of males and females with heart disease.
# table() orders the counts by target value: 0 (not present) first, then 1 (present).
Sub_female <- table(Data[Data$sex==0,]$target)
Sub_male <- table(Data[Data$sex==1,]$target)
FM_combine <- rbind(Sub_female,Sub_male)

#Rename the column and row names to match that order.
colnames(FM_combine) <- c("Does not have heart disease", "Has heart disease")
rownames(FM_combine) <- c("Females", "Males")

#Display the table
FM_combine
##         Does not have heart disease Has heart disease
## Females                          86               226
## Males                           413               300

With target coded as (0) heart disease not present and (1) heart disease present, 226 of 312 females and 300 of 713 males are diagnosed with heart disease.

This means that about 42% of males in this dataset are diagnosed with heart disease, compared with about 72% of females.

Finding 1:

In this dataset, a larger share of females than males is diagnosed with heart disease.
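The percentages can be read directly from the counts with `prop.table()`. The sketch below re-enters the counts printed above by hand, with columns in the order `table()` returns them (target = 0, then target = 1, where 1 means heart disease present according to the feature table). The object names `counts` and `row_pct` are introduced here only for illustration.

```r
# Counts copied by hand from the table above; columns follow table() order
counts <- matrix(
  c(86, 226,
    413, 300),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("Females", "Males"),
                  target = c("0", "1"))
)

# Row percentages: share of each target value within each sex
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
print(row_pct)
```

Using `margin = 1` normalizes within each row, which is the right direction for comparing diagnosis rates between the sexes when the group sizes differ.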

summary(Data)
##       age             sex               cp            trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :1.0000   Median :130.0  
##  Mean   :54.43   Mean   :0.6956   Mean   :0.9424   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
##       chol          fbs            restecg          thalach     
##  Min.   :126   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:132.0  
##  Median :240   Median :0.0000   Median :1.0000   Median :152.0  
##  Mean   :246   Mean   :0.1493   Mean   :0.5298   Mean   :149.1  
##  3rd Qu.:275   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak          slope             ca        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
##  Mean   :0.3366   Mean   :1.072   Mean   :1.385   Mean   :0.7541  
##  3rd Qu.:1.0000   3rd Qu.:1.800   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
##       thal           target      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.324   Mean   :0.5132  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000

The summary function displays useful information about each feature, such as the minimum, maximum, first and third quartiles, mean, and median.

describe(Data)
##          vars    n   mean    sd median trimmed   mad min   max range  skew
## age         1 1025  54.43  9.07   56.0   54.66  8.90  29  77.0  48.0 -0.25
## sex         2 1025   0.70  0.46    1.0    0.74  0.00   0   1.0   1.0 -0.85
## cp          3 1025   0.94  1.03    1.0    0.83  1.48   0   3.0   3.0  0.53
## trestbps    4 1025 131.61 17.52  130.0  130.39 14.83  94 200.0 106.0  0.74
## chol        5 1025 246.00 51.59  240.0  243.26 48.93 126 564.0 438.0  1.07
## fbs         6 1025   0.15  0.36    0.0    0.06  0.00   0   1.0   1.0  1.97
## restecg     7 1025   0.53  0.53    1.0    0.52  0.00   0   2.0   2.0  0.18
## thalach     8 1025 149.11 23.01  152.0  150.40 23.72  71 202.0 131.0 -0.51
## exang       9 1025   0.34  0.47    0.0    0.30  0.00   0   1.0   1.0  0.69
## oldpeak    10 1025   1.07  1.18    0.8    0.89  1.19   0   6.2   6.2  1.21
## slope      11 1025   1.39  0.62    1.0    1.45  1.48   0   2.0   2.0 -0.48
## ca         12 1025   0.75  1.03    0.0    0.57  0.00   0   4.0   4.0  1.26
## thal       13 1025   2.32  0.62    2.0    2.38  0.00   0   3.0   3.0 -0.52
## target     14 1025   0.51  0.50    1.0    0.52  0.00   0   1.0   1.0 -0.05
##          kurtosis   se
## age         -0.53 0.28
## sex         -1.28 0.01
## cp          -1.15 0.03
## trestbps     0.97 0.55
## chol         3.96 1.61
## fbs          1.87 0.01
## restecg     -1.31 0.02
## thalach     -0.10 0.72
## exang       -1.52 0.01
## oldpeak      1.29 0.04
## slope       -0.65 0.02
## ca           0.68 0.03
## thal         0.24 0.02
## target      -2.00 0.02

The describe function displays important information about each feature, such as the minimum, maximum, standard deviation, number of observations (an easy way to check for missing values), mean, median, and other useful statistics.

Area under the curve.

Find the probability that a patient's age is \(\le\) 50, treating age as approximately normally distributed

summary(Data$`age`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   48.00   56.00   54.43   61.00   77.00
sd(Data$`age`)
## [1] 9.07229
  • (\(\mu\) = 54.43, \(\sigma\) \(\approx\) 9.07)
  • To find the probability that a patient's age is \(\le\) 50, we calculate the z-score \(Z = \frac{x-\mu}{\sigma}\) at \(x = 50\): \(Z = \frac{50 - 54.43}{9.072} \approx -0.4883\)
pnorm(50, mean = 54.43, sd = 9.072)
## [1] 0.3126631
normalPlot(mean = 54.43, sd = 9.072, bounds = c(-Inf, 50), tails = FALSE)

Under this normal approximation, about 31.3% of patients are aged 50 or younger, which is the shaded region of the plot.
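The standardization step can be checked numerically. This sketch assumes, as the calculation above does, that age is approximately normal with mean 54.43 and standard deviation 9.072; the variable names are illustrative.

```r
mu <- 54.43     # sample mean of age, from summary() above
sigma <- 9.072  # sample standard deviation of age, from sd() above
x <- 50         # age threshold

# z-score of the threshold
z <- (x - mu) / sigma

# Probability from the standard normal, and directly from N(mu, sigma)
p_from_z <- pnorm(z)
p_direct <- pnorm(x, mean = mu, sd = sigma)

print(c(z = z, p_from_z = p_from_z, p_direct = p_direct))
```

Both routes give the same probability, because `pnorm(x, mean, sd)` standardizes internally.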

Data Visualization

This bar plot shows the distribution of the heart disease diagnosis.

Data$target[Data$target > 0] <- 1
barplot(table(Data$target),
        main="Heart disease dist", col="blue")

This mosaic plot shows the association between two variables, here gender and heart disease diagnosis.

mosaicplot(Data$sex ~ Data$target,
           main="Heart disease outcome by Gender", shade=FALSE,color=TRUE,
           xlab="Gender", ylab="Heart disease")

This boxplot displays the age distribution for each heart disease diagnosis group.

boxplot(Data$age ~ Data$target,
        main="Heart disease diagnosis distribution by Age",
         ylab="Age",xlab="Heart disease diagnosed")

This plot shows the distribution of chest pain types within each heart disease diagnosis group. There are four types of chest pain: (0) typical angina, (1) atypical angina, (2) non-anginal pain, and (3) asymptomatic.

Data$sex <- as.factor(Data$sex)
Data$target <- as.factor(Data$target)
Data$cp <- as.factor(Data$cp)

# position = "fill" scales each bar to 1, so the y axis is a proportion
ggplot(data = Data, aes(x = target, fill = cp)) + 
  geom_bar(position = "fill") +
  labs(title = "Heart disease diagnosis distribution by chest pain",
       x = "Heart disease diagnosis",
       y = "Proportion",
       fill = "Chest pain type") +
  theme_test()

Another plot shows the distribution of the number of major vessels within each heart disease diagnosis group.

Data$ca <- as.factor(Data$ca)

# position = "fill" scales each bar to 1, so the y axis is a proportion
ggplot(data = Data, aes(x = target, fill = ca)) + 
  geom_bar(position = "fill") +
  labs(title = "Heart disease diagnosis distribution by number of major vessels",
       x = "Heart disease diagnosis",
       y = "Proportion",
       fill = "Major vessels (ca)") +
  theme_test()

Histogram of patient’s age and gender

#Data$sex <- as.factor(Data$sex)
#Build a small data frame holding the mean age of female and male patients
#so that it can be drawn in each facet_wrap panel
meanAge <- data.frame(sex = c(0, 1), age = c(mean(Data[Data$sex==0,]$age),mean(Data[Data$sex==1,]$age)))

#ggplot of age of the patients categorized by sex
Plot <- ggplot(Data, aes(x=age, fill=as.factor(sex))) +
  geom_histogram(alpha=0.5, position="identity")+ 
  geom_vline(aes(xintercept = age), meanAge)+
  facet_wrap(~as.factor(sex))+
  labs(title="Histogram of patients' age by gender", 
       x="Age of patients", y="Count", fill="Sex")+
  geom_text(meanAge, mapping=aes(x=age, y=8.5, label=paste("Mean=", signif(age,4))),
            size=4, angle=90, vjust=-0.4, hjust=0)+
  scale_fill_discrete(breaks=c("0", "1"),
                      labels=c("0 - Female", "1 - Male"))

#display the Plot
Plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Heart disease diagnosis frequency by Resting electrocardiographic results and sex

Data$restecg <- as.factor(Data$restecg)
Data %>%
  ggplot(aes(x = target, fill=restecg)) + 
  geom_bar(position = "dodge") +
  facet_grid(~sex) +
  scale_fill_brewer(palette = "Dark2") +
  labs(title="Heart disease diagnosis frequency by restecg and sex")

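# Note: sex, cp, ca, and restecg were converted to factors above, so as.numeric()
# returns their level codes (1, 2, ...) rather than the original 0-based values;
# for example, sex becomes 1 = female, 2 = male.
Data$age <- as.numeric(Data$age)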
Data$sex <- as.numeric(Data$sex)
Data$cp <- as.numeric(Data$cp)
Data$trestbps <- as.numeric(Data$trestbps)
Data$chol <- as.numeric(Data$chol)
Data$fbs <- as.numeric(Data$fbs)
Data$restecg <- as.numeric(Data$restecg)
Data$thalach <- as.numeric(Data$thalach)
Data$exang <- as.numeric(Data$exang)
Data$oldpeak <- as.numeric(Data$oldpeak)
Data$slope <- as.numeric(Data$slope)
Data$ca <- as.numeric(Data$ca)
Data$thal <- as.numeric(Data$thal)

correlations <- cor(Data[,1:13])
corrplot(correlations, method="circle")

A dot-representation was used where blue represents positive correlation and red negative. The larger the dot the larger the correlation.
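One detail of the numeric conversion above is worth checking: calling `as.numeric()` on a factor returns the internal level codes, not the original labels, so a column coded 0/1 becomes 1/2 once it has passed through `as.factor()`. Pearson correlations are unaffected by a constant shift, but any later filter on the raw values is affected. The sketch below uses a made-up vector (`f` is hypothetical) to show the difference and one common way to recover the original values.

```r
# Hypothetical 0/1 column that has been turned into a factor
f <- as.factor(c(0, 1, 1, 0, 1))

# Level codes: the first level "0" becomes 1, the second level "1" becomes 2
codes <- as.numeric(f)
print(codes)

# Going through character first recovers the original values
values <- as.numeric(as.character(f))
print(values)
```

If the original 0/1 coding matters downstream, for example when subsetting by sex, the `as.character()` route avoids the off-by-one shift.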

Feature Importance

There are different ways to identify the important features in the data.

1- Correlation

2- Random Forest: Gini importance, or Mean Decrease in Impurity (MDI), scores each feature by summing the impurity decrease of every split that uses the feature across all trees, weighted by the number of samples that reach the split.
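To make the impurity idea concrete, here is a small sketch of the Gini decrease produced by a single split. The labels and the split are made up, and `gini()` is a helper defined here for illustration; `randomForest` aggregates quantities of this kind internally, and its exact bookkeeping may differ.

```r
# Gini impurity of a vector of class labels: 1 - sum of squared class shares
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Hypothetical node with 4 negatives and 4 positives, split into two children
parent <- c(0, 0, 0, 0, 1, 1, 1, 1)
left   <- c(0, 0, 0, 1)
right  <- c(0, 1, 1, 1)

# Impurity of the children, weighted by the share of samples in each child
weighted_children <- (length(left) * gini(left) +
                      length(right) * gini(right)) / length(parent)

# Impurity decrease credited to the feature used for this split
decrease <- gini(parent) - weighted_children
print(decrease)
```

A feature that is used in many splits with large decreases accumulates a high importance score.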

corelations = data.frame(cor(Data[,1:13], use = "complete.obs"))
corelations
##                  age         sex          cp    trestbps        chol
## age       1.00000000 -0.10324030 -0.07196627  0.27112141  0.21982253
## sex      -0.10324030  1.00000000 -0.04111909 -0.07897377 -0.19825787
## cp       -0.07196627 -0.04111909  1.00000000  0.03817742 -0.08164102
## trestbps  0.27112141 -0.07897377  0.03817742  1.00000000  0.12797743
## chol      0.21982253 -0.19825787 -0.08164102  0.12797743  1.00000000
## fbs       0.12124348  0.02720046  0.07929359  0.18176662  0.02691716
## restecg  -0.13269617 -0.05511721  0.04358061 -0.12379409 -0.14741024
## thalach  -0.39022708 -0.04936524  0.30683928 -0.03926407 -0.02177209
## exang     0.08816338  0.13915681 -0.40151271  0.06119697  0.06738223
## oldpeak   0.20813668  0.08468656 -0.17473348  0.18743411  0.06488031
## slope    -0.16910511 -0.02666629  0.13163278 -0.12044531 -0.01424787
## ca        0.27155053  0.11172891 -0.17620647  0.10455372  0.07425934
## thal      0.07229745  0.19842425 -0.16334148  0.05927618  0.10024418
##                   fbs     restecg      thalach       exang     oldpeak
## age       0.121243479 -0.13269617 -0.390227075  0.08816338  0.20813668
## sex       0.027200461 -0.05511721 -0.049365243  0.13915681  0.08468656
## cp        0.079293586  0.04358061  0.306839282 -0.40151271 -0.17473348
## trestbps  0.181766624 -0.12379409 -0.039264069  0.06119697  0.18743411
## chol      0.026917164 -0.14741024 -0.021772091  0.06738223  0.06488031
## fbs       1.000000000 -0.10405124 -0.008865857  0.04926057  0.01085948
## restecg  -0.104051244  1.00000000  0.048410637 -0.06560553 -0.05011425
## thalach  -0.008865857  0.04841064  1.000000000 -0.38028087 -0.34979616
## exang     0.049260570 -0.06560553 -0.380280872  1.00000000  0.31084376
## oldpeak   0.010859481 -0.05011425 -0.349796163  0.31084376  1.00000000
## slope    -0.061902374  0.08608609  0.395307843 -0.26733547 -0.57518854
## ca        0.137156259 -0.07807235 -0.207888416  0.10784854  0.22181603
## thal     -0.042177320 -0.02050406 -0.098068165  0.19720104  0.20267203
##                slope          ca        thal
## age      -0.16910511  0.27155053  0.07229745
## sex      -0.02666629  0.11172891  0.19842425
## cp        0.13163278 -0.17620647 -0.16334148
## trestbps -0.12044531  0.10455372  0.05927618
## chol     -0.01424787  0.07425934  0.10024418
## fbs      -0.06190237  0.13715626 -0.04217732
## restecg   0.08608609 -0.07807235 -0.02050406
## thalach   0.39530784 -0.20788842 -0.09806817
## exang    -0.26733547  0.10784854  0.19720104
## oldpeak  -0.57518854  0.22181603  0.20267203
## slope     1.00000000 -0.07344041 -0.09409006
## ca       -0.07344041  1.00000000  0.14901387
## thal     -0.09409006  0.14901387  1.00000000

Split the data into females and males in order to find the most important feature associated with heart disease in each gender.

#After the as.numeric() conversion above, sex holds the factor level codes:
#1 = female (originally 0) and 2 = male (originally 1).
#Create a subset for males only.
Male_Data <- subset(Data, sex == 2)
#Create another subset for females only.
Female_Data <- subset(Data, sex == 1)
#Feature selection using the random forest technique
Feature_Importance_Males = randomForest(target~., data=Male_Data)
# Importance based on mean decrease in Gini
importance(Feature_Importance_Males)
##          MeanDecreaseGini
## age             33.768879
## sex              0.000000
## cp              45.846075
## trestbps        27.677086
## chol            32.230865
## fbs              4.468138
## restecg          6.968435
## thalach         51.957658
## exang           13.220629
## oldpeak         39.155201
## slope           17.413060
## ca              47.126068
## thal            21.242575
varImp(Feature_Importance_Males)
##            Overall
## age      33.768879
## sex       0.000000
## cp       45.846075
## trestbps 27.677086
## chol     32.230865
## fbs       4.468138
## restecg   6.968435
## thalach  51.957658
## exang    13.220629
## oldpeak  39.155201
## slope    17.413060
## ca       47.126068
## thal     21.242575
varImpPlot(Feature_Importance_Males, col= "red", pch= 20)

Feature_Importance_Females = randomForest(target~., data=Female_Data)
# Importance based on mean decrease in Gini
importance(Feature_Importance_Females)
##          MeanDecreaseGini
## age             12.759597
## sex              0.000000
## cp              12.462631
## trestbps         9.075635
## chol             7.477978
## fbs              1.499751
## restecg          3.403075
## thalach          9.121434
## exang           10.717974
## oldpeak         16.221557
## slope            8.874301
## ca              11.131809
## thal            19.887746
varImp(Feature_Importance_Females)
##            Overall
## age      12.759597
## sex       0.000000
## cp       12.462631
## trestbps  9.075635
## chol      7.477978
## fbs       1.499751
## restecg   3.403075
## thalach   9.121434
## exang    10.717974
## oldpeak  16.221557
## slope     8.874301
## ca       11.131809
## thal     19.887746
varImpPlot(Feature_Importance_Females, col= "red", pch= 20)
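Ranking the scores makes the leading features easier to read than scanning the table. As a sketch, the vector below re-enters by hand the MeanDecreaseGini values from the importance table above in which thalach ranks first (sex is omitted because its importance is 0 within a single-sex subset); `imp`, `ranked`, and `share` are names introduced here for illustration.

```r
# Importance values copied by hand from the table above
imp <- c(age = 33.768879, cp = 45.846075, trestbps = 27.677086,
         chol = 32.230865, fbs = 4.468138, restecg = 6.968435,
         thalach = 51.957658, exang = 13.220629, oldpeak = 39.155201,
         slope = 17.413060, ca = 47.126068, thal = 21.242575)

# Sort from most to least important
ranked <- sort(imp, decreasing = TRUE)
top3 <- names(ranked)[1:3]
print(top3)

# Express each score as a percentage of the total, which makes
# subsets of different sizes easier to compare
share <- round(100 * ranked / sum(ranked), 1)
print(share)
```

Raw Gini totals grow with the number of rows in the subset, so the percentage shares are the safer quantity to compare between the two models.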

Part 4 - Inference

Hypothesis statement

\(H_0\): There is no association between chest pain and heart disease diagnosis

\(H_A\): There is an association between chest pain and heart disease diagnosis

qqnorm(Data$age)
qqline(Data$age)

Logistic Regression Prediction

#Split the data into training and testing sets to fit a logistic regression model
set.seed(123)
split=sample.split(Data$target, SplitRatio = 0.75)
Train_Data=subset(Data,split == TRUE)
Test_Data=subset(Data,split == FALSE)
#Perform a logistic regression model
Log_model <- glm(target ~., data=Train_Data, family = "binomial")
summary(Log_model)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = Train_Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5089  -0.3761   0.1144   0.5971   2.6952  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.903072   1.787465   2.743 0.006087 ** 
## age         -0.006921   0.014644  -0.473 0.636483    
## sex         -1.772722   0.291722  -6.077 1.23e-09 ***
## cp           0.842312   0.115374   7.301 2.86e-13 ***
## trestbps    -0.020367   0.006726  -3.028 0.002460 ** 
## chol        -0.005041   0.002349  -2.146 0.031881 *  
## fbs         -0.411562   0.332874  -1.236 0.216314    
## restecg      0.426110   0.216430   1.969 0.048975 *  
## thalach      0.026101   0.006489   4.022 5.76e-05 ***
## exang       -1.001130   0.260629  -3.841 0.000122 ***
## oldpeak     -0.588309   0.135907  -4.329 1.50e-05 ***
## slope        0.379709   0.222490   1.707 0.087889 .  
## ca          -0.735775   0.118129  -6.229 4.71e-10 ***
## thal        -0.937204   0.179351  -5.226 1.74e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1064.15  on 767  degrees of freedom
## Residual deviance:  548.96  on 754  degrees of freedom
## AIC: 576.96
## 
## Number of Fisher Scoring iterations: 6

There is a strong association between cp (chest pain) and heart disease diagnosis: the p-value of 2.86e-13 is far below the 0.05 significance level, so we reject the hypothesis of no association between chest pain and heart disease diagnosis.
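Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below re-enters three estimates and standard errors from the summary above by hand (the vector names are illustrative) and adds approximate 95% Wald intervals.

```r
# Estimates and standard errors copied by hand from summary(Log_model)
est <- c(sex = -1.772722, cp = 0.842312, thalach = 0.026101)
se  <- c(sex = 0.291722, cp = 0.115374, thalach = 0.006489)

# Odds ratios: multiplicative change in the odds of target = 1 per unit increase
odds_ratio <- exp(est)

# Approximate 95% Wald confidence intervals on the odds-ratio scale
z_crit <- qnorm(0.975)
ci_lower <- exp(est - z_crit * se)
ci_upper <- exp(est + z_crit * se)

print(round(cbind(odds_ratio, ci_lower, ci_upper), 3))
```

An odds ratio of about 2.3 for cp means that each one-unit increase in the cp code is associated with roughly 2.3 times the odds of target = 1, holding the other predictors fixed. Since cp is nominal, treating its code as a number is itself a simplification.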

predictTrain = predict(Log_model, type='response')
#Confusion matrix using threshold of 0.5
table(Train_Data$target, predictTrain>0.5)
##    
##     FALSE TRUE
##   0   295   79
##   1    35  359
#Calculate the accuracy on the training set
(295+359)/nrow(Train_Data)
## [1] 0.8515625
#Predictions on Test set
predictTest = predict(Log_model, newdata=Test_Data, type='response')
#Confusion matrix using threshold of 0.5
table(Test_Data$target, predictTest>0.5)
##    
##     FALSE TRUE
##   0   103   22
##   1    12  120
#Accuracy
(103+120)/(nrow(Test_Data))
## [1] 0.8677043
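Accuracy alone does not show which kind of error the model makes. As a sketch, the test-set confusion matrix printed above is re-entered by hand below and used to compute sensitivity, specificity, and precision; the object name `cm` and the metric names are introduced here for illustration.

```r
# Test-set confusion matrix copied from the output above
# rows = actual target, columns = predicted probability > 0.5
cm <- matrix(
  c(103, 22,
    12, 120),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

accuracy    <- sum(diag(cm)) / sum(cm)             # correct / all
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])    # true positives / actual positives
specificity <- cm["0", "FALSE"] / sum(cm["0", ])   # true negatives / actual negatives
precision   <- cm["1", "TRUE"] / sum(cm[, "TRUE"]) # true positives / predicted positives

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 3))
```

In a screening setting, sensitivity is often the metric to watch, since a missed diagnosis is usually costlier than a false alarm.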

Anova test

anova(Log_model, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: target
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       767    1064.15              
## age       1   40.290       766    1023.86 2.189e-10 ***
## sex       1   80.482       765     943.38 < 2.2e-16 ***
## cp        1  142.059       764     801.32 < 2.2e-16 ***
## trestbps  1   17.529       763     783.79 2.830e-05 ***
## chol      1    5.270       762     778.52   0.02170 *  
## fbs       1    3.061       761     775.46   0.08019 .  
## restecg   1    2.469       760     772.99   0.11610    
## thalach   1   71.812       759     701.18 < 2.2e-16 ***
## exang     1   25.360       758     675.82 4.756e-07 ***
## oldpeak   1   50.334       757     625.49 1.297e-12 ***
## slope     1    0.343       756     625.14   0.55791    
## ca        1   48.249       755     576.90 3.755e-12 ***
## thal      1   27.931       754     548.96 1.257e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
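Each row of the deviance table is a likelihood-ratio test: the drop in residual deviance when a term is added is compared with a chi-squared distribution whose degrees of freedom equal the number of added parameters. The sketch below reproduces two of the p-values from the deviance figures printed above; the variable names are illustrative.

```r
# Deviances copied from the table above
null_dev      <- 1064.15  # residual deviance of the intercept-only model
dev_after_age <- 1023.86  # residual deviance after adding age

# Drop in deviance attributed to age, and its chi-squared p-value (1 df)
drop_age <- null_dev - dev_after_age
p_age <- pchisq(drop_age, df = 1, lower.tail = FALSE)

# Same calculation for chol, using the deviance drop listed in the table
p_chol <- pchisq(5.270, df = 1, lower.tail = FALSE)

print(c(drop_age = drop_age, p_age = p_age, p_chol = p_chol))
```

Because terms are added sequentially, these p-values depend on the order of the predictors in the formula, unlike the Wald tests in `summary(Log_model)`.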

t-test

A t-test is a statistical method used to determine whether the means of two groups differ significantly.

# t-test to confirm the association between chest pain and heart disease
# (cp holds the numeric codes 1-4 after the as.numeric() conversion above)

ttest_cp <- t.test(Data$cp ~ Data$target, var.equal= TRUE)  
ttest_cp
## 
##  Two Sample t-test
## 
## data:  Data$cp by Data$target
## t = -15.445, df = 1023, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.009114 -0.781608
## sample estimates:
## mean in group 0 mean in group 1 
##        1.482966        2.378327

Chi-squared test

CHI_cp <- chisq.test(Data$cp, Data$target) 

# Print the results to see if p<0.05.
print(CHI_cp)
## 
##  Pearson's Chi-squared test
## 
## data:  Data$cp and Data$target
## X-squared = 280.98, df = 3, p-value < 2.2e-16
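The Pearson statistic reported by `chisq.test()` can be reproduced by hand from observed and expected counts. Since the cp-by-target counts are not printed in this report, the sketch below uses the sex-by-target counts shown earlier (86, 226, 413, 300) purely as an illustration; `obs` and the other names are introduced here.

```r
# Observed sex-by-target counts, re-entered by hand (columns: target = 0, 1)
obs <- matrix(
  c(86, 226,
    413, 300),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("Females", "Males"), target = c("0", "1"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-squared statistic, degrees of freedom, and p-value
stat_manual <- sum((obs - expected)^2 / expected)
df_manual <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_manual <- pchisq(stat_manual, df = df_manual, lower.tail = FALSE)

# Built-in test without continuity correction, for comparison
builtin <- chisq.test(obs, correct = FALSE)

print(c(manual = stat_manual, builtin = unname(builtin$statistic)))
```

For 2 x 2 tables `chisq.test()` applies Yates' continuity correction by default, so `correct = FALSE` is needed to match the hand calculation.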

Part 5 - Conclusion

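1- In this dataset, females are diagnosed with heart disease at a higher rate than males (226 of 312 females, about 72%, versus 300 of 713 males, about 42%, have target = 1), which agrees with the negative sex coefficient in the logistic regression.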

2- Chest pain type is among the features most strongly associated with heart disease for both males and females.

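3- Maximum heart rate achieved (thalach) has the highest random forest importance for males, whereas thal (thalassemia) has the highest importance for females. The sex == 1 subset corresponds to females, because as.numeric() applied to the sex factor returns level codes (1 = female, 2 = male).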

4- There is a high association between chest pain and heart disease diagnosis.

Limitation

The dataset lacks some useful information, such as smoking, obesity, or family history, that could improve prediction.