The heart is one of the most important organs in the body. Five great vessels enter and leave the heart: the superior vena cava, the inferior vena cava, the pulmonary artery, the pulmonary vein, and the aorta. A malfunction of the heart is called heart disease or cardiac disease.
Many factors increase the risk of heart disease. Some are beyond a person's control, but many can be avoided by living a healthy lifestyle. The uncontrollable risk factors are gender, age, family history, and heart shape. The controllable risk factors are high blood pressure, cholesterol level, obesity, smoking, and diabetes.
Heart disease is a leading cause of death. In the United States, one person dies from cardiovascular disease every 36 seconds. About 655,000 Americans die from heart disease each year, which is 1 in every 4 deaths. In this analysis, I use a heart disease dataset to explore the most important features associated with heart disease. I also fit a logistic regression model to predict whether a patient has heart disease.
Heart disease analysis and prediction
Which factor contributes most to heart disease in males and in females?
1- Statistical Analysis
2- Feature importance/selection
3- Logistic regression modeling and prediction
library(ggplot2)
library(DATA606)
library(psych)
library(corrplot)
library(dplyr)
library(caTools)
library(caret)
library(randomForest)
# load data
#Data <- read.csv("heart.csv")
Data <- read.csv("https://github.com/GehadGad/Heart-disease-dataset/raw/main/heart.csv")
Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
#Display the first few rows in the data
head(Data)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 52 1 0 125 212 0 1 168 0 1.0 2 2 3
## 2 53 1 0 140 203 1 0 155 1 3.1 0 0 3
## 3 70 1 0 145 174 0 1 125 1 2.6 0 0 3
## 4 61 1 0 148 203 0 1 161 0 0.0 2 1 3
## 5 62 0 0 138 294 1 1 106 0 1.9 1 3 2
## 6 58 0 0 100 248 0 0 122 0 1.0 1 0 2
## target
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
There are 14 variables in this dataset and 1025 observations, one per patient.
To understand the data better, I created the following table, which gives the description, data type, and values of each feature.
| Feature | Definition | Type | Value |
|---|---|---|---|
| age | Patient’s age in years | Numerical | 29 - 77 |
| sex | Gender | Nominal | (0) female, (1) male |
| cp | Type of chest pain | Nominal | (0) typical angina, (1) atypical angina, (2) non-anginal pain, (3) asymptomatic |
| trestbps | Resting blood pressure in mmHg | Numerical | 94 - 200 |
| chol | Serum cholesterol in mg/dl | Numerical | 126 - 564 |
| fbs | Fasting blood sugar higher than 120 mg/dl | Nominal | (0) false, (1) true |
| restecg | Resting electrocardiographic results | Nominal | (0) normal, (1) having ST-T wave abnormality, (2) showing probable left ventricular hypertrophy |
| thalach | Maximum heart rate achieved | Numerical | 71 - 202 |
| exang | Exercise-induced angina | Nominal | (0) no, (1) yes |
| oldpeak | ST depression induced by exercise relative to rest | Numerical | 0 - 6.2 |
| slope | The slope of the peak exercise ST segment | Nominal | upsloping, flat, downsloping (coded 0 - 2 in this file; 1 - 3 in the original UCI coding) |
| ca | Number of major vessels colored by fluoroscopy | Nominal | 0 - 4 in this file (0 - 3 in the original UCI coding) |
| thal | Thalassemia | Nominal | normal, fixed defect, reversible defect (coded 0 - 3 in this file; 3, 6, 7 in the original UCI coding) |
| target | Diagnosis of heart disease | Nominal | (0) heart disease not present, (1) heart disease present |
This is an observational study.
target is the response (dependent) variable, and it is qualitative.
The independent variables are age, gender, and all the other features.
Check whether there are missing values (NA) in the data.
sum(is.na(Data))
## [1] 0
There are no missing values in this data.
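A single total hides where any missing values sit. The sketch below, on a small hypothetical data frame (not the heart dataset), shows how the same check can be broken down by column and by row.

```r
# Hypothetical data frame with two missing values (illustration only)
df <- data.frame(age = c(52, NA, 70),
                 chol = c(212, 203, NA),
                 target = c(0, 0, 1))

# Number of missing values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Number of rows that contain no missing values at all
complete_rows <- sum(complete.cases(df))
print(complete_rows)
```

Applied to Data, `colSums(is.na(Data))` would be expected to return zero for every column, given the total of 0 reported above.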
#Count the patients who have and have not been diagnosed with heart disease.
Data %>% count(target)
## target n
## 1 0 499
## 2 1 526
There are 499 patients without heart disease and 526 patients with heart disease.
#The proportion of patients with each chest pain type.
Data %>% group_by(cp) %>%
summarise( percent = 100 * n() / nrow( Data ))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## cp percent
## <int> <dbl>
## 1 0 48.5
## 2 1 16.3
## 3 2 27.7
## 4 3 7.51
Of the patients, 48.5% have typical angina, 16.3% have atypical angina, 27.7% have non-anginal pain, and 7.5% are asymptomatic.
#The proportion of female and male patients in the dataset.
Data %>%
group_by( sex ) %>%
summarise( percent = 100 * n() / nrow( Data ))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## sex percent
## <int> <dbl>
## 1 0 30.4
## 2 1 69.6
The dataset is 30.4% female and 69.6% male.
Sub_female <- table(Data[Data$sex==0,]$target)
Sub_male <- table(Data[Data$sex==1,]$target)
FM_combine <- rbind(Sub_female,Sub_male)
#Rename the columns and rows. table() orders the columns by target value: 0 (no heart disease), then 1 (heart disease).
colnames(FM_combine) <- c("Does not have heart disease", "Has heart disease")
rownames(FM_combine) <- c("Females", "Males")
#Display the table
FM_combine
## Does not have heart disease Has heart disease
## Females 86 226
## Males 413 300
Of the 312 females, 226 were diagnosed with heart disease; of the 713 males, 300 were diagnosed with heart disease.
This means that 72% of females in this dataset are diagnosed with heart disease, compared with only 42% of males.
With target coded as 0 = not present and 1 = present, females in this dataset are diagnosed with heart disease more often than males.
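The percentages can be derived directly from the counts rather than by hand. The sketch below re-enters the four counts from the table and labels the columns only by their target code, which is the order in which table() returns them; the object names are illustrative.

```r
# Counts copied from the table above; columns are in table() order: target = 0, then target = 1
counts <- matrix(c(86, 226,
                   413, 300),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Females", "Males"),
                                 c("target = 0", "target = 1")))

# Row-wise percentages: each row sums to 100
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
print(row_pct)
```

Keeping the column labels tied to the target code makes it easy to check the result against the feature table, where 0 means heart disease not present and 1 means present.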
summary(Data)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.0000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :1.0000 Median :130.0
## Mean :54.43 Mean :0.6956 Mean :0.9424 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.0000 Max. :200.0
## chol fbs restecg thalach
## Min. :126 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:132.0
## Median :240 Median :0.0000 Median :1.0000 Median :152.0
## Mean :246 Mean :0.1493 Mean :0.5298 Mean :149.1
## 3rd Qu.:275 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :1.000 Median :0.0000
## Mean :0.3366 Mean :1.072 Mean :1.385 Mean :0.7541
## 3rd Qu.:1.0000 3rd Qu.:1.800 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.324 Mean :0.5132
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
The summary function displays useful information about each feature: the minimum, the maximum, the first and third quartiles, the mean, and the median.
describe(Data)
## vars n mean sd median trimmed mad min max range skew
## age 1 1025 54.43 9.07 56.0 54.66 8.90 29 77.0 48.0 -0.25
## sex 2 1025 0.70 0.46 1.0 0.74 0.00 0 1.0 1.0 -0.85
## cp 3 1025 0.94 1.03 1.0 0.83 1.48 0 3.0 3.0 0.53
## trestbps 4 1025 131.61 17.52 130.0 130.39 14.83 94 200.0 106.0 0.74
## chol 5 1025 246.00 51.59 240.0 243.26 48.93 126 564.0 438.0 1.07
## fbs 6 1025 0.15 0.36 0.0 0.06 0.00 0 1.0 1.0 1.97
## restecg 7 1025 0.53 0.53 1.0 0.52 0.00 0 2.0 2.0 0.18
## thalach 8 1025 149.11 23.01 152.0 150.40 23.72 71 202.0 131.0 -0.51
## exang 9 1025 0.34 0.47 0.0 0.30 0.00 0 1.0 1.0 0.69
## oldpeak 10 1025 1.07 1.18 0.8 0.89 1.19 0 6.2 6.2 1.21
## slope 11 1025 1.39 0.62 1.0 1.45 1.48 0 2.0 2.0 -0.48
## ca 12 1025 0.75 1.03 0.0 0.57 0.00 0 4.0 4.0 1.26
## thal 13 1025 2.32 0.62 2.0 2.38 0.00 0 3.0 3.0 -0.52
## target 14 1025 0.51 0.50 1.0 0.52 0.00 0 1.0 1.0 -0.05
## kurtosis se
## age -0.53 0.28
## sex -1.28 0.01
## cp -1.15 0.03
## trestbps 0.97 0.55
## chol 3.96 1.61
## fbs 1.87 0.01
## restecg -1.31 0.02
## thalach -0.10 0.72
## exang -1.52 0.01
## oldpeak 1.29 0.04
## slope -0.65 0.02
## ca 0.68 0.03
## thal 0.24 0.02
## target -2.00 0.02
The describe function displays further information about each feature: the number of observations (an easy way to check for missing values), the mean, the standard deviation, the median, the minimum and maximum, and measures such as skew and kurtosis.
Area under the curve.
Find the probability that a patient's age is \(\le\) 50, assuming that age is approximately normally distributed.
summary(Data$`age`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 48.00 56.00 54.43 61.00 77.00
sd(Data$`age`)
## [1] 9.07229
pnorm(50, mean = 54.43, sd = 9.072)
## [1] 0.3126631
normalPlot(mean = 54.43, sd = 9.072, bounds = c(-Inf, 50), tails = FALSE)
Under the normal approximation, the shaded region shows that about 31.3% of patients are aged 50 or younger.
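The area under the curve can also be reached by standardising the cutoff first. This sketch uses the mean and standard deviation reported above and shows that the z-score route and the direct pnorm() call agree; the variable names are illustrative.

```r
# Mean and standard deviation of age, copied from the output above
mu    <- 54.43
sigma <- 9.072

# Standardise the cutoff, then evaluate the standard normal CDF
z <- (50 - mu) / sigma
p_from_z <- pnorm(z)

# Equivalent direct call with the mean and sd supplied
p_direct <- pnorm(50, mean = mu, sd = sigma)

print(c(z = z, p_from_z = p_from_z, p_direct = p_direct))
```

The result is only as good as the normality assumption; comparing it with the empirical share, `mean(Data$age <= 50)`, is a quick sanity check.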
This barplot shows the distribution of the heart disease diagnosis.
Data$target[Data$target > 0] <- 1
barplot(table(Data$target),
main="Heart disease dist", col="blue")
This mosaic plot shows the association between two categorical variables, here gender and heart disease diagnosis.
mosaicplot(Data$sex ~ Data$target,
main="Heart disease outcome by Gender", shade=FALSE,color=TRUE,
xlab="Gender", ylab="Heart disease")
This boxplot shows the age distribution for each heart disease diagnosis.
boxplot(Data$age ~ Data$target,
main="Heart disease diagnosis distribution by Age",
ylab="Age",xlab="Heart disease diagnosed")
This plot shows the distribution of chest pain types within each heart disease diagnosis. There are four types of chest pain: (0) typical angina, (1) atypical angina, (2) non-anginal pain, and (3) asymptomatic.
Data$sex <- as.factor(Data$sex)
Data$target <- as.factor(Data$target)
Data$cp <- as.factor(Data$cp)
ggplot(data = Data, aes(x = target, fill = cp)) +
geom_bar(position = "fill") +
labs(title = "Heart disease diagnosis distribution by chest pain",
x = "Heart disease diagnosis",
y = "Proportion of patients") +
theme_test()
This plot shows the distribution of the number of major vessels within each heart disease diagnosis.
Data$ca <- as.factor(Data$ca)
ggplot(data = Data, aes(x = target, fill = ca)) +
geom_bar(position = "fill") +
labs(title = "Heart disease diagnosis distribution by number of major vessels",
x = "Heart disease diagnosis",
y = "Proportion of patients") +
theme_test()
Histogram of patients' age by gender
#Data$sex <- as.factor(Data$sex)
#Make a new data frame that stores the mean age of the male and female
#patients so that it can be used with the ggplot facet_wrap function
meanAge <- data.frame(sex = c(0, 1), age = c(mean(Data[Data$sex==0,]$age),mean(Data[Data$sex==1,]$age)))
#ggplot of age of the patients categorized by sex
Plot <- ggplot(Data, aes(x=age, fill=as.factor(sex))) +
geom_histogram(alpha=0.5, position="identity")+
geom_vline(aes(xintercept = age), meanAge)+
facet_wrap(~as.factor(sex))+
labs(title="Histogram of patients' age by gender",
x="Age of patients", y="Count", fill="Sex")+
geom_text(meanAge, mapping=aes(x=age, y=8.5, label=paste("Mean=", signif(age,4))),
size=4, angle=90, vjust=-0.4, hjust=0)+
scale_fill_discrete(breaks=c("0", "1"),
labels=c("0 - Female", "1 - Male"))
#display the Plot
Plot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Heart disease diagnosis frequency by Resting electrocardiographic results and sex
Data$restecg <- as.factor(Data$restecg)
Data %>%
ggplot(aes(x = target, fill=restecg)) +
geom_bar(position = "dodge") +
facet_grid(~sex) +
scale_fill_brewer(palette = "Dark2") +
labs(title="Heart disease diagnosis frequency by restecg and sex")
Data$age <- as.numeric(Data$age)
Data$sex <- as.numeric(Data$sex)
Data$cp <- as.numeric(Data$cp)
Data$trestbps <- as.numeric(Data$trestbps)
Data$chol <- as.numeric(Data$chol)
Data$fbs <- as.numeric(Data$fbs)
Data$restecg <- as.numeric(Data$restecg)
Data$thalach <- as.numeric(Data$thalach)
Data$exang <- as.numeric(Data$exang)
Data$oldpeak <- as.numeric(Data$oldpeak)
Data$slope <- as.numeric(Data$slope)
Data$ca <- as.numeric(Data$ca)
Data$thal <- as.numeric(Data$thal)
correlations <- cor(Data[,1:13])
corrplot(correlations, method="circle")
The plot uses dots: blue represents a positive correlation and red a negative one. The larger the dot, the stronger the correlation.
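One detail in the conversion step above is worth checking. The columns sex, cp, ca, and restecg were turned into factors for plotting, and calling as.numeric() on a factor returns its internal level codes (1, 2, ...) rather than the original labels (0, 1, ...). Correlations are unaffected by such a shift, but a later filter on the raw codes, such as sex == 1, would then select the rows originally coded 0. The toy vector below (not the dataset) illustrates the behaviour and one common way to recover the original values.

```r
# Toy example: a 0/1 variable stored as a factor
sex_factor <- factor(c(0, 1, 1, 0))

# as.numeric() on a factor returns the level codes, so 0/1 becomes 1/2
codes <- as.numeric(sex_factor)

# Converting through character first recovers the original labels
original <- as.numeric(as.character(sex_factor))

print(codes)
print(original)
```

The group means of cp printed by the t-test later in this report (1.48 and 2.38, against an original coding of 0 to 3) are consistent with this kind of one-unit shift.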
There are different ways to identify the important features in the data:
1- Correlation
2- Random Forest: Gini importance, or Mean Decrease in Impurity (MDI), scores each feature by summing the impurity decrease over all splits (across all trees) that use the feature, weighted by the number of samples each split handles.
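To make the impurity decrease concrete, the sketch below computes it by hand for a single hypothetical split of ten patients. It illustrates the quantity that is accumulated per feature; it does not reproduce how the randomForest package sums and scales it internally.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Hypothetical node with 10 patients, split into two child nodes
parent <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
left   <- c(0, 0, 0, 0, 1)
right  <- c(0, 1, 1, 1, 1)

# Impurity decrease: parent impurity minus the size-weighted child impurity
weighted_children <- (length(left) * gini(left) +
                      length(right) * gini(right)) / length(parent)
decrease <- gini(parent) - weighted_children
print(decrease)
```

A feature that produces many splits with a large decrease, on nodes holding many samples, ends up with a high MeanDecreaseGini.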
corelations = data.frame(cor(Data[,1:13], use = "complete.obs"))
corelations
## age sex cp trestbps chol
## age 1.00000000 -0.10324030 -0.07196627 0.27112141 0.21982253
## sex -0.10324030 1.00000000 -0.04111909 -0.07897377 -0.19825787
## cp -0.07196627 -0.04111909 1.00000000 0.03817742 -0.08164102
## trestbps 0.27112141 -0.07897377 0.03817742 1.00000000 0.12797743
## chol 0.21982253 -0.19825787 -0.08164102 0.12797743 1.00000000
## fbs 0.12124348 0.02720046 0.07929359 0.18176662 0.02691716
## restecg -0.13269617 -0.05511721 0.04358061 -0.12379409 -0.14741024
## thalach -0.39022708 -0.04936524 0.30683928 -0.03926407 -0.02177209
## exang 0.08816338 0.13915681 -0.40151271 0.06119697 0.06738223
## oldpeak 0.20813668 0.08468656 -0.17473348 0.18743411 0.06488031
## slope -0.16910511 -0.02666629 0.13163278 -0.12044531 -0.01424787
## ca 0.27155053 0.11172891 -0.17620647 0.10455372 0.07425934
## thal 0.07229745 0.19842425 -0.16334148 0.05927618 0.10024418
## fbs restecg thalach exang oldpeak
## age 0.121243479 -0.13269617 -0.390227075 0.08816338 0.20813668
## sex 0.027200461 -0.05511721 -0.049365243 0.13915681 0.08468656
## cp 0.079293586 0.04358061 0.306839282 -0.40151271 -0.17473348
## trestbps 0.181766624 -0.12379409 -0.039264069 0.06119697 0.18743411
## chol 0.026917164 -0.14741024 -0.021772091 0.06738223 0.06488031
## fbs 1.000000000 -0.10405124 -0.008865857 0.04926057 0.01085948
## restecg -0.104051244 1.00000000 0.048410637 -0.06560553 -0.05011425
## thalach -0.008865857 0.04841064 1.000000000 -0.38028087 -0.34979616
## exang 0.049260570 -0.06560553 -0.380280872 1.00000000 0.31084376
## oldpeak 0.010859481 -0.05011425 -0.349796163 0.31084376 1.00000000
## slope -0.061902374 0.08608609 0.395307843 -0.26733547 -0.57518854
## ca 0.137156259 -0.07807235 -0.207888416 0.10784854 0.22181603
## thal -0.042177320 -0.02050406 -0.098068165 0.19720104 0.20267203
## slope ca thal
## age -0.16910511 0.27155053 0.07229745
## sex -0.02666629 0.11172891 0.19842425
## cp 0.13163278 -0.17620647 -0.16334148
## trestbps -0.12044531 0.10455372 0.05927618
## chol -0.01424787 0.07425934 0.10024418
## fbs -0.06190237 0.13715626 -0.04217732
## restecg 0.08608609 -0.07807235 -0.02050406
## thalach 0.39530784 -0.20788842 -0.09806817
## exang -0.26733547 0.10784854 0.19720104
## oldpeak -0.57518854 0.22181603 0.20267203
## slope 1.00000000 -0.07344041 -0.09409006
## ca -0.07344041 1.00000000 0.14901387
## thal -0.09409006 0.14901387 1.00000000
Split the data into females and males to find the most important factor associated with heart disease in each gender.
#Create a subset for males only.
Male_Data <- subset(Data, sex==1)
#Create another subset for females only.
Female_Date <- subset(Data, sex != 1)
#Feature selection using random forest technique
Feature_Importance_Males = randomForest(target~., data=Male_Data)
# Show the feature importance based on the mean decrease in Gini impurity
importance(Feature_Importance_Males)
## MeanDecreaseGini
## age 12.759597
## sex 0.000000
## cp 12.462631
## trestbps 9.075635
## chol 7.477978
## fbs 1.499751
## restecg 3.403075
## thalach 9.121434
## exang 10.717974
## oldpeak 16.221557
## slope 8.874301
## ca 11.131809
## thal 19.887746
varImp(Feature_Importance_Males)
## Overall
## age 12.759597
## sex 0.000000
## cp 12.462631
## trestbps 9.075635
## chol 7.477978
## fbs 1.499751
## restecg 3.403075
## thalach 9.121434
## exang 10.717974
## oldpeak 16.221557
## slope 8.874301
## ca 11.131809
## thal 19.887746
varImpPlot(Feature_Importance_Males, col= "red", pch= 20)
Feature_Importance_Females = randomForest(target~., data=Female_Date)
# Show the feature importance based on the mean decrease in Gini impurity
importance(Feature_Importance_Females)
## MeanDecreaseGini
## age 33.768879
## sex 0.000000
## cp 45.846075
## trestbps 27.677086
## chol 32.230865
## fbs 4.468138
## restecg 6.968435
## thalach 51.957658
## exang 13.220629
## oldpeak 39.155201
## slope 17.413060
## ca 47.126068
## thal 21.242575
varImp(Feature_Importance_Females)
## Overall
## age 33.768879
## sex 0.000000
## cp 45.846075
## trestbps 27.677086
## chol 32.230865
## fbs 4.468138
## restecg 6.968435
## thalach 51.957658
## exang 13.220629
## oldpeak 39.155201
## slope 17.413060
## ca 47.126068
## thal 21.242575
varImpPlot(Feature_Importance_Females, col= "red", pch= 20)
Hypothesis statement
\(H_0\): There is no association between chest pain and heart disease diagnosis
\(H_A\): There is an association between chest pain and heart disease diagnosis
qqnorm(Data$age)
qqline(Data$age)
#Split the data into training and testing sets for the logistic regression model
set.seed(123)
split=sample.split(Data$target, SplitRatio = 0.75)
Train_Data=subset(Data,split == TRUE)
Test_Data=subset(Data,split == FALSE)
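sample.split() from caTools splits the labels while preserving the relative ratio of each class. The sketch below shows the same idea in base R on a hypothetical outcome vector that only mimics the class counts of the dataset; it is an illustration of stratified sampling, not a reimplementation of the caTools function.

```r
set.seed(123)

# Hypothetical outcome vector with the same class counts as the dataset
target <- rep(c(0, 1), times = c(499, 526))

# Sample 75% of the row indices within each class
by_class  <- split(seq_along(target), target)
train_idx <- unlist(lapply(by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}))

train <- target[train_idx]
test  <- target[-train_idx]

print(table(train))
print(table(test))
```

Sampling within each class keeps the share of diagnosed patients nearly identical in the training and testing sets, which makes the two accuracy figures easier to compare.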
#Fit a logistic regression model
Log_model <- glm(target ~., data=Train_Data, family = "binomial")
summary(Log_model)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = Train_Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5089 -0.3761 0.1144 0.5971 2.6952
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.903072 1.787465 2.743 0.006087 **
## age -0.006921 0.014644 -0.473 0.636483
## sex -1.772722 0.291722 -6.077 1.23e-09 ***
## cp 0.842312 0.115374 7.301 2.86e-13 ***
## trestbps -0.020367 0.006726 -3.028 0.002460 **
## chol -0.005041 0.002349 -2.146 0.031881 *
## fbs -0.411562 0.332874 -1.236 0.216314
## restecg 0.426110 0.216430 1.969 0.048975 *
## thalach 0.026101 0.006489 4.022 5.76e-05 ***
## exang -1.001130 0.260629 -3.841 0.000122 ***
## oldpeak -0.588309 0.135907 -4.329 1.50e-05 ***
## slope 0.379709 0.222490 1.707 0.087889 .
## ca -0.735775 0.118129 -6.229 4.71e-10 ***
## thal -0.937204 0.179351 -5.226 1.74e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1064.15 on 767 degrees of freedom
## Residual deviance: 548.96 on 754 degrees of freedom
## AIC: 576.96
##
## Number of Fisher Scoring iterations: 6
The coefficient for cp (chest pain) is highly significant, with a p-value of 2.86e-13, far below the 0.05 significance level. We therefore reject the null hypothesis of no association and conclude that chest pain is strongly associated with heart disease diagnosis.
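The cp coefficient in the summary is on the log-odds scale, so exponentiating it gives an odds ratio that is easier to read. The sketch below uses the estimate and standard error printed above and adds an approximate 95% Wald interval. Because the model treats cp as a numeric code, the ratio describes a one-step increase in that code, which is a simplification for a nominal variable.

```r
# Coefficient and standard error for cp, copied from the model summary
beta_cp <- 0.842312
se_cp   <- 0.115374

# Odds ratio for a one-unit increase in the cp code
odds_ratio <- exp(beta_cp)

# Approximate 95% Wald confidence interval on the odds-ratio scale
ci <- exp(beta_cp + c(-1, 1) * qnorm(0.975) * se_cp)

print(round(odds_ratio, 2))
print(round(ci, 2))
```

An interval that stays clearly above 1 tells the same story as the small p-value: higher cp codes go with higher odds of target = 1.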
predictTrain = predict(Log_model, type='response')
#Confusion matrix using threshold of 0.5
table(Train_Data$target, predictTrain>0.5)
##
## FALSE TRUE
## 0 295 79
## 1 35 359
#Calculate the accuracy on the training set
(295+359)/nrow(Train_Data)
## [1] 0.8515625
#Predictions on Test set
predictTest = predict(Log_model, newdata=Test_Data, type='response')
#Confusion matrix using threshold of 0.5
table(Test_Data$target, predictTest>0.5)
##
## FALSE TRUE
## 0 103 22
## 1 12 120
#Accuracy
(103+120)/(nrow(Test_Data))
## [1] 0.8677043
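Accuracy alone does not show which kind of error the model makes. As a sketch, the test-set confusion matrix above can be re-entered by hand to derive sensitivity and specificity as well; the object names here are illustrative.

```r
# Test-set confusion matrix copied from the output above
# rows = actual class (0, 1), columns = predicted class (FALSE, TRUE)
cm <- matrix(c(103, 22,
               12, 120),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])   # true positive rate
specificity <- cm["0", "FALSE"] / sum(cm["0", ])  # true negative rate

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

In a screening context, sensitivity is often the more important of the two, since a missed diagnosis is usually costlier than a false alarm.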
Anova test
anova(Log_model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: target
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 767 1064.15
## age 1 40.290 766 1023.86 2.189e-10 ***
## sex 1 80.482 765 943.38 < 2.2e-16 ***
## cp 1 142.059 764 801.32 < 2.2e-16 ***
## trestbps 1 17.529 763 783.79 2.830e-05 ***
## chol 1 5.270 762 778.52 0.02170 *
## fbs 1 3.061 761 775.46 0.08019 .
## restecg 1 2.469 760 772.99 0.11610
## thalach 1 71.812 759 701.18 < 2.2e-16 ***
## exang 1 25.360 758 675.82 4.756e-07 ***
## oldpeak 1 50.334 757 625.49 1.297e-12 ***
## slope 1 0.343 756 625.14 0.55791
## ca 1 48.249 755 576.90 3.755e-12 ***
## thal 1 27.931 754 548.96 1.257e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
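The sequential table tests one term at a time. The overall fit can be summarised from the null and residual deviances reported in the model summary, as sketched below with a likelihood-ratio test against the intercept-only model and McFadden's pseudo R-squared. The pseudo R-squared formula assumes ungrouped binary data, where the deviance equals minus twice the log-likelihood.

```r
# Values copied from the model summary
null_dev  <- 1064.15
null_df   <- 767
resid_dev <- 548.96
resid_df  <- 754

# Likelihood-ratio test of the full model against the intercept-only model
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared computed from the two deviances
pseudo_r2 <- 1 - resid_dev / null_dev

print(c(lr_stat = lr_stat, lr_df = lr_df, pseudo_r2 = round(pseudo_r2, 3)))
print(p_value)
```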
t-test
A t-test is a statistical method used to determine whether the means of two groups differ significantly.
# t-test to confirm the association between chest pain and heart disease
ttest_cp <- t.test(Data$cp ~ Data$target, var.equal= TRUE)
ttest_cp
##
## Two Sample t-test
##
## data: Data$cp by Data$target
## t = -15.445, df = 1023, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.009114 -0.781608
## sample estimates:
## mean in group 0 mean in group 1
## 1.482966 2.378327
Chi-squared test
CHI_cp <- chisq.test(Data$cp, Data$target)
# Print the results to see if p<0.05.
print(CHI_cp)
##
## Pearson's Chi-squared test
##
## data: Data$cp and Data$target
## X-squared = 280.98, df = 3, p-value < 2.2e-16
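The Pearson statistic compares the observed counts with the counts expected if the two variables were independent. The sketch below works this through on a hypothetical 2 x 2 table (not the cp by target table) and checks the manual result against chisq.test() with the continuity correction switched off.

```r
# Hypothetical 2 x 2 table of counts (illustration only)
observed <- matrix(c(30, 10,
                     20, 40),
                   nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic, with (rows - 1) * (cols - 1) degrees of freedom
x2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

# chisq.test() without continuity correction gives the same statistic
builtin <- chisq.test(observed, correct = FALSE)
print(c(manual = x2, builtin = unname(builtin$statistic)))
print(p_value)
```

For the 4 x 2 table of cp by target, the same formula gives (4 - 1) * (2 - 1) = 3 degrees of freedom, matching the df = 3 in the output above.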
1- The share of patients diagnosed with heart disease differs markedly between males and females in this dataset.
2- Chest pain is the most common factor associated with heart disease for both males and females.
3- Maximum heart rate achieved ranks highest in the random forest importance for females, whereas thalassemia ranks highest for males.
4- There is a strong association between chest pain and heart disease diagnosis.
The dataset is missing some useful information, such as smoking, obesity, or family history, that could help with prediction.