In this project, data from Kaggle on individuals with and without heart disease was analyzed. Two-sample t-tests at \(\alpha = 0.05\) give sufficient statistical evidence of a difference between individuals with HeartDisease and those without HeartDisease in both mean RestingBP (t = 3.34, two-sided p ≈ 0.0009) and mean Cholesterol level (t = -7.63, p < 0.001). A logistic regression model to predict whether an individual has HeartDisease was also developed from the features present in the data. In that model, Cholesterol is a significant predictor, while RestingBP is not once the other features are accounted for.
The data was obtained from Kaggle.
According to the Kaggle source, this dataset was created by combining five heart disease datasets that were previously available only independently. The datasets are combined over 11 common features, which, according to the source, makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:
Cleveland: 303 observations
Hungarian: 294 observations
Switzerland: 123 observations
Long Beach VA: 200 observations
Statlog (Heart) Data Set: 270 observations
Total: 1190 observations
Duplicated: 272 observations
Final dataset: 918 observations
Every dataset used can be found under the Index of heart disease datasets from UCI Machine Learning Repository on the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/
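This is an observational study.
There are 918 observations and 12 variables. Eleven (11) of the 12 variables are explanatory variables; the remaining variable, HeartDisease, is the response (1 = heart disease, 0 = no heart disease).
Of the eleven (11) explanatory variables, six are stored as numeric and five are categorical. The explanatory variables are:
Numeric: Age, RestingBP, Cholesterol, FastingBS, MaxHR, Oldpeak
Categorical: Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope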
Required Libraries
library(tidyverse)
library(caTools) # To split data into training and test data
library(Amelia) # To visualize missing data
library(cowplot) # To combine plots in a grid
Load the data
url <- "https://raw.githubusercontent.com/chinedu2301/DATA606-Statistics-and-Probability-for-Data-Analytics/main/heart.csv"
heart <- read_csv(url)
Check the head of the data
# Check the head of the data
head(heart)
## # A tibble: 6 x 12
## Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 40 M ATA 140 289 0 Normal 172
## 2 49 F NAP 160 180 0 Normal 156
## 3 37 M ATA 130 283 0 ST 98
## 4 48 F ASY 138 214 0 Normal 108
## 5 54 M NAP 150 195 0 Normal 122
## 6 39 M NAP 120 339 0 Normal 170
## # ... with 4 more variables: ExerciseAngina <chr>, Oldpeak <dbl>,
## # ST_Slope <chr>, HeartDisease <dbl>
Get a glimpse of the data types and structure
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", ~
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",~
## $ RestingBP <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
## $ Cholesterol <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
## $ FastingBS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",~
## $ MaxHR <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", ~
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl~
## $ HeartDisease <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~
There are 918 observations and 12 variables in the dataset
Get the summary
summary(heart)
## Age Sex ChestPainType RestingBP
## Min. :28.00 Length:918 Length:918 Min. : 0.0
## 1st Qu.:47.00 Class :character Class :character 1st Qu.:120.0
## Median :54.00 Mode :character Mode :character Median :130.0
## Mean :53.51 Mean :132.4
## 3rd Qu.:60.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## Cholesterol FastingBS RestingECG MaxHR
## Min. : 0.0 Min. :0.0000 Length:918 Min. : 60.0
## 1st Qu.:173.2 1st Qu.:0.0000 Class :character 1st Qu.:120.0
## Median :223.0 Median :0.0000 Mode :character Median :138.0
## Mean :198.8 Mean :0.2331 Mean :136.8
## 3rd Qu.:267.0 3rd Qu.:0.0000 3rd Qu.:156.0
## Max. :603.0 Max. :1.0000 Max. :202.0
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## Length:918 Min. :-2.6000 Length:918 Min. :0.0000
## Class :character 1st Qu.: 0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median : 0.6000 Mode :character Median :1.0000
## Mean : 0.8874 Mean :0.5534
## 3rd Qu.: 1.5000 3rd Qu.:1.0000
## Max. : 6.2000 Max. :1.0000
From the summary statistics, the mean age of individuals in the dataset is 53.5 while the median age is 54. Also, the mean RestingBP is 132.4, the mean Cholesterol level is 198.8, and the mean MaxHR is 136.8.
Compute the mean and standard deviations for the RestingBP and Cholesterol levels for both those with Heart Disease and those without Heart Disease.
# Get the mean and standard deviations of RestingBP and Cholesterol level for individuals with HeartDisease
heart_1 <- heart %>% filter(HeartDisease == 1)
meanBP_heart_1 <- mean(heart_1$RestingBP)
stdBP_heart_1 <- sd(heart_1$RestingBP)
meanCh_heart_1 <- mean(heart_1$Cholesterol)
stdCh_heart_1 <- sd(heart_1$Cholesterol)
n_heart_1 <- nrow(heart_1)
# Get the mean and standard deviations of RestingBP and Cholesterol level for individuals without HeartDisease
heart_0 <- heart %>% filter(HeartDisease == 0)
meanBP_heart_0 <- mean(heart_0$RestingBP)
stdBP_heart_0 <- sd(heart_0$RestingBP)
meanCh_heart_0 <- mean(heart_0$Cholesterol)
stdCh_heart_0 <- sd(heart_0$Cholesterol)
n_heart_0 <- nrow(heart_0)
# Arrange the values in a dataframe
meanBP <- c(meanBP_heart_1, meanBP_heart_0)
stdBP <- c(stdBP_heart_1, stdBP_heart_0)
meanCho <- c(meanCh_heart_1, meanCh_heart_0)
stdCho <- c(stdCh_heart_1, stdCh_heart_0)
table <- data.frame(meanBP, stdBP, meanCho, stdCho)
row.names(table) <- c("Heart Disease", "No Heart Disease")
headers <- c("Mean RestingBP", "Std RestingBP", "Mean Cholesterol", "Std Cholesterol")
colnames(table) <- headers
table
## Mean RestingBP Std RestingBP Mean Cholesterol Std Cholesterol
## Heart Disease 134.1850 19.82868 175.9409 126.39140
## No Heart Disease 130.1805 16.49958 227.1220 74.63466
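The same per-group means and standard deviations can be produced in one grouped call rather than variable by variable. The sketch below uses base R's `aggregate()` on only the first six rows shown in the `glimpse()` output, so its numbers illustrate the mechanics and are not the full-data values in the table; the name `heart_head` is introduced here for illustration.

```r
# First six rows of the dataset, copied from the glimpse() output above
heart_head <- data.frame(
  RestingBP    = c(140, 160, 130, 138, 150, 120),
  Cholesterol  = c(289, 180, 283, 214, 195, 339),
  HeartDisease = c(0, 1, 0, 1, 0, 0)
)

# Mean and standard deviation of each measure within each HeartDisease group
group_summary <- aggregate(
  cbind(RestingBP, Cholesterol) ~ HeartDisease,
  data = heart_head,
  FUN  = function(x) c(mean = mean(x), sd = sd(x))
)
print(group_summary)
```

Applied to the full `heart` data frame, the same call should reproduce the table above, with one row per HeartDisease value.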
State the Null and Alternative Hypothesis
Null Hypothesis, \(H_{0}\) : There is no difference in the mean RestingBP of those with Heart Disease and those without Heart Disease. \(\mu_{BPHeartDisease} - \mu_{BPNoHeartDisease} = 0\)
Alternative Hypothesis, \(H_{1}\) : There is some difference in the mean RestingBP of those with Heart Disease and those without Heart Disease. \(\mu_{BPHeartDisease} - \mu_{BPNoHeartDisease} \neq 0\)
Check conditions:
Independence: The two groups come from different samples of individuals. Hence, the independence criterion is satisfied.
Normality: Each group contains several hundred observations, which is large enough to assume a nearly normal sampling distribution for the difference in means.
Compute Test Statistics:
\(SE_{diff} = \sqrt{\frac{s_{hd}^{2}}{n_{hd}} + \frac{s_{nhd}^{2}}{n_{nhd}}}\)
\(\bar{x}_{diff} = \bar{x}_{BP,hd} - \bar{x}_{BP,nhd}\)
Test statistic \(T = \frac{\bar{x}_{diff} - \mu_{diff}}{SE_{diff}}\)
mu_diff <- 0
xbar_diff <- meanBP_heart_1 - meanBP_heart_0
SE_diff <- round((sqrt((stdBP_heart_1^2)/n_heart_1 + (stdBP_heart_0^2)/n_heart_0)),4)
t <- round((xbar_diff - mu_diff)/SE_diff, 4)
paste0("The test statistic, t is : ", t)
## [1] "The test statistic, t is : 3.3394"
Compute the p-value:
alpha <- 0.05
df <- n_heart_0 + n_heart_1 - 2
# Two-sided p-value: twice the tail area beyond |t|
p_value <- round(2*pt(abs(t), df, lower.tail = FALSE), 6)
paste0("The p-value is ", p_value)
## [1] "The p-value is 0.000873"
Conclusion:
Since the p-value is less than 0.05, we reject the null hypothesis at \(\alpha = 0.05\). Therefore, there is sufficient statistical evidence of a difference in the mean RestingBP of those with Heart Disease and those without Heart Disease. In this sample, the mean RestingBP is about 4 units higher in the Heart Disease group (134.2 vs 130.2).
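A two-sided p-value is twice the tail area beyond \(|t|\), so it always lies between 0 and 1. As a cross-check, the sketch below recomputes the RestingBP test from the summary statistics alone and compares the pooled degrees of freedom used here with the Welch-Satterthwaite approximation that matches the unpooled standard error. The group sizes are an assumption: they are not printed in the output and are inferred from the mean of HeartDisease (0.5534) and the 918 total observations.

```r
# Summary statistics copied from the table above
mean_hd  <- 134.1850; sd_hd  <- 19.82868   # Heart Disease
mean_nhd <- 130.1805; sd_nhd <- 16.49958   # No Heart Disease

# ASSUMPTION: group sizes inferred from mean(HeartDisease) = 0.5534 and n = 918
n_hd  <- 508
n_nhd <- 410

# Unpooled standard error and test statistic
v_hd   <- sd_hd^2 / n_hd
v_nhd  <- sd_nhd^2 / n_nhd
se     <- sqrt(v_hd + v_nhd)
t_stat <- (mean_hd - mean_nhd) / se

# Welch-Satterthwaite degrees of freedom versus the pooled n1 + n2 - 2
df_welch  <- (v_hd + v_nhd)^2 / (v_hd^2 / (n_hd - 1) + v_nhd^2 / (n_nhd - 1))
df_pooled <- n_hd + n_nhd - 2

# Two-sided p-value: twice the upper-tail area beyond |t|
p_welch  <- 2 * pt(abs(t_stat), df_welch,  lower.tail = FALSE)
p_pooled <- 2 * pt(abs(t_stat), df_pooled, lower.tail = FALSE)

print(c(t = t_stat, df_welch = df_welch, p_welch = p_welch, p_pooled = p_pooled))
```

With groups this large the two choices of degrees of freedom give nearly the same p-value, so the decision at \(\alpha = 0.05\) does not depend on which one is used. With the raw data available, `t.test(RestingBP ~ HeartDisease, data = heart)` runs the Welch version directly.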
State the Null and Alternative Hypothesis
Null Hypothesis, \(H_{0}\) : There is no difference in the mean Cholesterol level of those with Heart Disease and those without Heart Disease. \(\mu_{CLHeartDisease} - \mu_{CLNoHeartDisease} = 0\)
Alternative Hypothesis, \(H_{1}\) : There is some difference in the mean Cholesterol level of those with Heart Disease and those without Heart Disease. \(\mu_{CLHeartDisease} - \mu_{CLNoHeartDisease} \neq 0\)
Check conditions:
Independence: The two groups come from different samples of individuals. Hence, the independence criterion is satisfied.
Normality: Each group contains several hundred observations, which is large enough to assume a nearly normal sampling distribution for the difference in means.
Compute Test Statistics:
\(SE_{diff} = \sqrt{\frac{s_{hd}^{2}}{n_{hd}} + \frac{s_{nhd}^{2}}{n_{nhd}}}\)
\(\bar{x}_{diff} = \bar{x}_{CL,hd} - \bar{x}_{CL,nhd}\)
Test statistic \(T = \frac{\bar{x}_{diff} - \mu_{diff}}{SE_{diff}}\)
mu_diff <- 0
xbar_diff_cl <- meanCh_heart_1 - meanCh_heart_0
SE_diff <- round((sqrt((stdCh_heart_1^2)/n_heart_1 + (stdCh_heart_0^2)/n_heart_0)),4)
t2 <- round((xbar_diff_cl - mu_diff)/SE_diff, 4)
paste0("The test statistic, t is : ", t2)
## [1] "The test statistic, t is : -7.6269"
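As a hand check, the reported statistic can be reproduced from the summary table. The group sizes \(n_{hd} = 508\) and \(n_{nhd} = 410\) are an assumption: they are not printed in the output and are inferred from the mean of HeartDisease (0.5534) and the 918 total observations.

\[
SE_{diff} = \sqrt{\frac{126.39140^{2}}{508} + \frac{74.63466^{2}}{410}} \approx \sqrt{31.4464 + 13.5862} \approx 6.7106
\]

\[
T = \frac{(175.9409 - 227.1220) - 0}{6.7106} = \frac{-51.1811}{6.7106} \approx -7.627
\]

The negative sign reflects that the sample mean Cholesterol is lower in the Heart Disease group.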
Compute the p - value:
alpha <- 0.05
df <- n_heart_0 + n_heart_1 - 2
# Two-sided p-value: twice the tail area beyond |t2|
p_value2 <- round(2*pt(abs(t2), df, lower.tail = FALSE), 6)
paste0("The p-value is ", p_value2)
## [1] "The p-value is 0"
Conclusion:
Since the p-value is less than 0.05 (it rounds to 0 at six decimal places), we reject the null hypothesis at \(\alpha = 0.05\). Therefore, there is sufficient statistical evidence that the mean Cholesterol level of those with Heart Disease differs from that of those without Heart Disease. In this sample, the Heart Disease group has the lower mean Cholesterol (175.9 vs 227.1).
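A p-value printed as 0 is a rounding artifact: the true value is positive but smaller than the six decimal places kept. The sketch below, which uses the reported test statistic and \(df = 918 - 2\), shows two base R ways to report such a value without losing its order of magnitude.

```r
# Test statistic and degrees of freedom as reported above
t2 <- -7.6269
df <- 918 - 2

# Two-sided p-value without rounding
p_value2 <- 2 * pt(abs(t2), df, lower.tail = FALSE)

# Scientific notation keeps the order of magnitude visible
print(format(p_value2, scientific = TRUE, digits = 3))

# format.pval() reports values below a chosen floor as "<eps"
print(format.pval(p_value2, eps = 1e-6))
```

Reporting "p < 0.000001" in this way is usually clearer to a reader than "p = 0".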
Check for Null values
# Check for NA values
any(is.na(heart))
## [1] FALSE
Visualize the NA values
# use missmap function from the Amelia package to check for NA values
missmap(heart, main = "Heart Data - Missing Values", col = c("yellow", "black"), legend = FALSE)
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.
There are no NA values in the dataset
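The absence of NA values does not rule out missing measurements recorded as a placeholder. The summary output shows a minimum of 0 for both RestingBP and Cholesterol, which `is.na()` cannot flag. The sketch below counts zeros per column on a toy data frame; treating zeros as missing is an assumption made here for illustration and was not part of the original analysis.

```r
# Toy data: the first six rows from head(heart), plus one HYPOTHETICAL row
# of zeros, added only so the check has something to find
toy <- data.frame(
  RestingBP   = c(140, 160, 130, 138, 150, 120, 0),
  Cholesterol = c(289, 180, 283, 214, 195, 339, 0)
)

# Count zeros per column; is.na() would not flag these
zero_counts <- colSums(toy == 0)
print(zero_counts)

# One possible treatment (an assumption, not the original analysis):
# recode zeros as NA so they are excluded from means and tests
toy_clean <- toy
toy_clean[toy_clean == 0] <- NA
print(colMeans(toy_clean, na.rm = TRUE))
```

Running `colSums(heart[, c("RestingBP", "Cholesterol")] == 0)` on the full data would show how many observations are affected, and therefore how much they could pull the group means.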
Bar Graph by Gender
# Bar Chart by Sex for the entire data set
p1 <- ggplot(heart, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph by Sex - All") + ylab(NULL)
# Bar plot by Sex for only those with Heart Disease
p2 <- ggplot(heart_1, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph by Sex - Heart Disease") + ylab(NULL)
# Bar plot by Sex for only those with no heart disease
p3 <- ggplot(heart_0, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
labs(title = "Bar Graph by Sex - No Heart Disease") + ylab(NULL)
# Bar plot of individuals who have heart disease by Sex
p4 <- heart %>% mutate(heart_prob = ifelse(HeartDisease == 1, "Yes", "No")) %>%
ggplot(aes(x = heart_prob, fill = Sex)) + geom_bar() + theme_bw() + ylab(NULL) +
labs(title = "HeartDisease vs No HeartDisease")
# Plot all bar graphs in a grid
plot_grid(p1, p2, p3, p4)
Histogram to show distribution by age
# Histogram to show age distribution in the dataset
p5 <- heart |> ggplot(aes(x = Age)) + geom_histogram(fill = "brown", binwidth = 2) + theme_bw() +
labs(title = "Distribution by Age") + ylab(NULL)
# Histogram of Cholesterol level
p6 <- ggplot(heart, aes(x = Cholesterol)) + geom_histogram(binwidth = 12, fill = "brown") +
labs(title = "Distribution of Cholesterol level") + ylab(NULL) + theme_bw()
# Histogram of RestingBP
p7 <- heart %>% ggplot(aes(x = RestingBP)) + geom_histogram(binwidth = 15, fill = "brown") +
labs(title = "Distribution of RestingBP") + ylab(NULL) + theme_bw()
# Plot all the histograms in a grid
plot_grid(p5, p6, p7)
Scatter plot of RestingBP vs Cholesterol
# RestingBP vs Cholesterol
heart |> ggplot(aes(x = Cholesterol, y = RestingBP, color = RestingECG)) + geom_point() +
labs(title = "RestingBP vs Cholesterol") + theme_bw()
Box Plot of RestingBP for each ChestPainType
# Boxplot by ChestPainType
heart |> ggplot() + geom_boxplot(aes(x = ChestPainType, y = RestingBP)) +
labs(title = "Box Plot of Resting BP vs ChestPainType") + theme_bw()
Use the caTools library to split the dataset into training and testing datasets
# Set a seed
set.seed(101)
#Split the sample
sample <- sample.split(heart$HeartDisease, SplitRatio = 0.8)
# Training Data
heart_train <- subset(heart, sample == TRUE)
# Testing Data
heart_test <- subset(heart, sample == FALSE)
Train the model using a logistic model
# Train the model
heart_logistic_model <- glm(formula = HeartDisease ~ . , family = binomial(link = 'logit'),
data = heart_train)
Get the summary of the model
# Get the summary of the logistic model
summary(heart_logistic_model)
##
## Call:
## glm(formula = HeartDisease ~ ., family = binomial(link = "logit"),
## data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7225 -0.4271 0.1908 0.4654 2.5234
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.293756 1.507163 -0.195 0.845466
## Age 0.009792 0.014414 0.679 0.496944
## SexM 1.355461 0.303400 4.468 7.91e-06 ***
## ChestPainTypeATA -1.555284 0.346433 -4.489 7.14e-06 ***
## ChestPainTypeNAP -1.595361 0.295502 -5.399 6.71e-08 ***
## ChestPainTypeTA -1.319753 0.462158 -2.856 0.004295 **
## RestingBP 0.002430 0.006451 0.377 0.706407
## Cholesterol -0.004666 0.001204 -3.876 0.000106 ***
## FastingBS 0.940464 0.291908 3.222 0.001274 **
## RestingECGNormal -0.287463 0.296022 -0.971 0.331507
## RestingECGST -0.270019 0.378956 -0.713 0.476134
## MaxHR -0.005918 0.005358 -1.105 0.269320
## ExerciseAnginaY 0.886983 0.269180 3.295 0.000984 ***
## Oldpeak 0.452125 0.133341 3.391 0.000697 ***
## ST_SlopeFlat 1.529064 0.466951 3.275 0.001058 **
## ST_SlopeUp -0.743409 0.483267 -1.538 0.123976
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1009.24 on 733 degrees of freedom
## Residual deviance: 495.73 on 718 degrees of freedom
## AIC: 527.73
##
## Number of Fisher Scoring iterations: 5
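The estimates in the summary are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of HeartDisease for a one-unit increase in that predictor (or relative to the reference level for a categorical one), holding the other predictors fixed. The sketch below does this for a few coefficients copied from the summary; with the fitted model available, `exp(coef(heart_logistic_model))` gives the full set.

```r
# Coefficient estimates (log-odds scale) copied from the model summary above
log_odds <- c(
  SexM            =  1.355461,
  Cholesterol     = -0.004666,
  FastingBS       =  0.940464,
  ExerciseAnginaY =  0.886983,
  Oldpeak         =  0.452125
)

# Exponentiate to get odds ratios
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))
```

For example, the SexM odds ratio of roughly 3.9 means the model estimates the odds of HeartDisease for males at about 3.9 times those for females with the same values of the other predictors. The Cholesterol odds ratio is slightly below 1, matching its negative coefficient.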
Predict values using the model
fit_heart_probabilities <- predict(heart_logistic_model, newdata = heart_test, type = "response")
Properly group the probabilities
# Make probabilities greater than 0.5 to be 1
fit_heart_results <- ifelse(fit_heart_probabilities > 0.5, 1, 0)
Accuracy
# Misclassification Error
misclassError <- mean(fit_heart_results != heart_test$HeartDisease)
accuracy <- round((1 - misclassError), 4) * 100
paste0("The accuracy of the logistic regression model is ", accuracy, "%")
## [1] "The accuracy of the logistic regression model is 88.59%"
Confusion Matrix
print("-CONFUSION MATRIX-")
## [1] "-CONFUSION MATRIX-"
table(heart_test$HeartDisease, fit_heart_results > 0.5)
##
## FALSE TRUE
## 0 68 14
## 1 7 95
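Accuracy alone does not show how the errors split between the two classes. The sketch below rebuilds the confusion matrix from the counts printed above (rows are the actual HeartDisease value, columns are the prediction) and derives sensitivity, specificity, and precision from it. The variable names are introduced here for illustration.

```r
# Confusion matrix copied from the output above:
# rows = actual HeartDisease (0, 1), columns = predicted (FALSE, TRUE)
cm <- matrix(c(68, 14,
                7, 95),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

tn <- cm["0", "FALSE"]; fp <- cm["0", "TRUE"]
fn <- cm["1", "FALSE"]; tp <- cm["1", "TRUE"]

accuracy    <- (tp + tn) / sum(cm)   # share of all test cases classified correctly
sensitivity <- tp / (tp + fn)        # share of actual heart disease cases detected
specificity <- tn / (tn + fp)        # share of actual non-cases correctly cleared
precision   <- tp / (tp + fp)        # share of predicted cases that are real cases

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

On this test set the model detects about 93% of the heart disease cases (95 of 102) and correctly clears about 83% of those without heart disease (68 of 82), so its errors lean toward false positives rather than missed cases.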
From exploratory data analysis, males appear more likely to have HeartDisease than females, and in the summary of the logistic model, SexM (male sex) is a significant predictor of HeartDisease. The hypothesis tests show a significant difference between those with and without heart disease in mean Cholesterol level and also in mean RestingBP (t = 3.34, two-sided p ≈ 0.0009). In the logistic model, Cholesterol is again a significant predictor of HeartDisease, whereas RestingBP is not (p = 0.71) once the other features are accounted for. Other significant predictors of HeartDisease in the model summary include ChestPainType, exercise-induced angina (ExerciseAngina), fasting blood sugar (FastingBS), Oldpeak, and a flat ST_Slope.