CUNY SPS DATA606 Final Project

ABSTRACT

In this project, data from Kaggle on individuals with and without heart disease was analyzed. Two-sample t-tests show a statistically significant difference between individuals with HeartDisease and those without HeartDisease in both mean Cholesterol level (t = -7.63) and mean RestingBP (t = 3.34) at the 0.05 significance level. A logistic regression model was also developed to predict whether an individual has HeartDisease based on the features in the data; it reached 88.59% accuracy on the test set, with Cholesterol, but not RestingBP, among its significant predictors.

INTRODUCTION

This project aims to:

Determine if the mean Resting Blood Pressure (RestingBP) of individuals in the dataset who develop heart disease is significantly different from the mean Resting Blood Pressure of individuals who do not develop heart disease.

Determine if the mean Cholesterol level of individuals who develop heart disease is significantly different from the mean Cholesterol level of individuals who do not develop heart disease.

Predict whether an individual will develop heart disease or not using a logistic regression model in R.

DATA

Data Source:

The data was obtained from Kaggle.

Data Collection:

According to the Kaggle source, this dataset was created by combining datasets that were already available independently but had not been combined before. Five heart datasets are combined over 11 common features, which the source describes as the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

Cleveland: 303 observations

Hungarian: 294 observations

Switzerland: 123 observations

Long Beach VA: 200 observations

Statlog (Heart) Data Set: 270 observations

Total: 1190 observations
Duplicated: 272 observations

Final dataset: 918 observations

Every dataset used can be found under the Index of heart disease datasets from UCI Machine Learning Repository on the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/

Type of Study:

This is an observational study: the measurements were recorded for existing patients, and no treatment was assigned.

Cases:

There are 12 variables and 918 observations. Eleven (11) of the 12 variables are explanatory variables

Response Variable (Dependent Variable) :

The dependent variable is “HeartDisease”, which is coded as 1 if the individual has Heart Disease and as 0 if the individual does not have Heart Disease. HeartDisease is a two-level categorical variable.

HeartDisease: output class [1: heart disease, 0: Normal]

Explanatory Variables (Independent Variable) :

There are eleven (11) explanatory variables: five are numerical and six are categorical. The explanatory variables are:

Age: age of the patient [years] - Numerical variable

Sex: sex of the patient [M: Male, F: Female] - Two level categorical variable

ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic] - Four level categorical variable

RestingBP: resting blood pressure [mm Hg] - Numerical variable

Cholesterol: serum cholesterol [mg/dl] - Numerical variable

FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise] - Two level categorical variable

RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria] - Three level categorical variable

MaxHR: maximum heart rate achieved [Numeric value between 60 and 202] - Numerical variable

ExerciseAngina: exercise-induced angina [Y: Yes, N: No] - Two level categorical variable

Oldpeak: oldpeak = ST [Numeric value measured in depression] - Numerical variable

ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping] - Three level categorical variable

DATA ANALYSIS

Required Libraries

library(tidyverse)
library(caTools) # To split data into training and test data
library(Amelia) # To visualize missing data
library(cowplot) # To combine plots in a grid

Load the data

url <- "https://raw.githubusercontent.com/chinedu2301/DATA606-Statistics-and-Probability-for-Data-Analytics/main/heart.csv"
heart <- read_csv(url)

Check the head of the data

# Check the head of the data
head(heart)

## # A tibble: 6 x 12
##     Age Sex   ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
##   <dbl> <chr> <chr>             <dbl>       <dbl>     <dbl> <chr>      <dbl>
## 1    40 M     ATA                 140         289         0 Normal       172
## 2    49 F     NAP                 160         180         0 Normal       156
## 3    37 M     ATA                 130         283         0 ST            98
## 4    48 F     ASY                 138         214         0 Normal       108
## 5    54 M     NAP                 150         195         0 Normal       122
## 6    39 M     NAP                 120         339         0 Normal       170
## # ... with 4 more variables: ExerciseAngina <chr>, Oldpeak <dbl>,
## #   ST_Slope <chr>, HeartDisease <dbl>

Get a glimpse of the data types and structure

glimpse(heart)

## Rows: 918
## Columns: 12
## $ Age            <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", ~
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",~
## $ RestingBP      <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
## $ Cholesterol    <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
## $ FastingBS      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",~
## $ MaxHR          <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", ~
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl~
## $ HeartDisease   <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~

There are 918 observations and 12 variables in the dataset

Get the summary

summary(heart)

##       Age            Sex            ChestPainType        RestingBP    
##  Min.   :28.00   Length:918         Length:918         Min.   :  0.0  
##  1st Qu.:47.00   Class :character   Class :character   1st Qu.:120.0  
##  Median :54.00   Mode  :character   Mode  :character   Median :130.0  
##  Mean   :53.51                                         Mean   :132.4  
##  3rd Qu.:60.00                                         3rd Qu.:140.0  
##  Max.   :77.00                                         Max.   :200.0  
##   Cholesterol      FastingBS       RestingECG            MaxHR      
##  Min.   :  0.0   Min.   :0.0000   Length:918         Min.   : 60.0  
##  1st Qu.:173.2   1st Qu.:0.0000   Class :character   1st Qu.:120.0  
##  Median :223.0   Median :0.0000   Mode  :character   Median :138.0  
##  Mean   :198.8   Mean   :0.2331                      Mean   :136.8  
##  3rd Qu.:267.0   3rd Qu.:0.0000                      3rd Qu.:156.0  
##  Max.   :603.0   Max.   :1.0000                      Max.   :202.0  
##  ExerciseAngina        Oldpeak          ST_Slope          HeartDisease   
##  Length:918         Min.   :-2.6000   Length:918         Min.   :0.0000  
##  Class :character   1st Qu.: 0.0000   Class :character   1st Qu.:0.0000  
##  Mode  :character   Median : 0.6000   Mode  :character   Median :1.0000  
##                     Mean   : 0.8874                      Mean   :0.5534  
##                     3rd Qu.: 1.5000                      3rd Qu.:1.0000  
##                     Max.   : 6.2000                      Max.   :1.0000

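From the summary statistics, the mean age of individuals in the dataset is 53.5 years and the median age is 54. The mean RestingBP is 132.4 mm Hg, the mean Cholesterol level is 198.8 mg/dl, and the mean MaxHR is 136.8. The minimum RestingBP and Cholesterol values are both 0, which suggests that some missing measurements are recorded as zeros.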

Compute the mean and standard deviations for the RestingBP and Cholesterol levels for both those with Heart Disease and those without Heart Disease.

# Get the mean and standard deviations of RestingBP and Cholesterol level for individuals with HeartDisease
heart_1 <- heart %>% filter(HeartDisease == 1)
meanBP_heart_1 <- mean(heart_1$RestingBP)
stdBP_heart_1 <- sd(heart_1$RestingBP)
meanCh_heart_1 <- mean(heart_1$Cholesterol)
stdCh_heart_1 <- sd(heart_1$Cholesterol)
n_heart_1 <- nrow(heart_1)

# Get the mean and standard deviations of RestingBP and Cholesterol level for individuals without HeartDisease
heart_0 <- heart %>% filter(HeartDisease == 0)
meanBP_heart_0 <- mean(heart_0$RestingBP)
stdBP_heart_0 <- sd(heart_0$RestingBP)
meanCh_heart_0 <- mean(heart_0$Cholesterol)
stdCh_heart_0 <- sd(heart_0$Cholesterol)
n_heart_0 <- nrow(heart_0)

# Arrange the values in a data frame
# (named summary_table so it does not mask the base R table() function)
meanBP <- c(meanBP_heart_1, meanBP_heart_0)
stdBP <- c(stdBP_heart_1, stdBP_heart_0)
meanCho <- c(meanCh_heart_1, meanCh_heart_0)
stdCho <- c(stdCh_heart_1, stdCh_heart_0)
summary_table <- data.frame(meanBP, stdBP, meanCho, stdCho)
row.names(summary_table) <- c("Heart Disease", "No Heart Disease")
headers <- c("Mean RestingBP", "Std RestingBP", "Mean Cholesterol", "Std Cholesterol")
colnames(summary_table) <- headers
summary_table

##                  Mean RestingBP Std RestingBP Mean Cholesterol Std Cholesterol
## Heart Disease          134.1850      19.82868         175.9409       126.39140
## No Heart Disease       130.1805      16.49958         227.1220        74.63466

Hypothesis Testing for difference in mean RestingBP of individuals with Heart Disease and those without Heart Disease

State the Null and Alternative Hypothesis
Null Hypothesis, \(H_{0}\) : There is no difference in the mean RestingBP of those with Heart Disease and those without Heart Disease. \(\mu_{BPHeartDisease} - \mu_{BPNoHeartDisease} = 0\)
Alternative Hypothesis, \(H_{1}\) : There is some difference in the mean RestingBP of those with Heart Disease and those without Heart Disease. \(\mu_{BPHeartDisease} - \mu_{BPNoHeartDisease} \neq 0\)

Check conditions:
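Independence: The two groups consist of different individuals, and the observations are assumed to be independent within and between groups. Hence, the independence condition is treated as satisfied.
Normality: Both groups are large (several hundred observations each). Hence, the sampling distribution of the difference in means can be assumed to be nearly normal.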

Compute Test Statistics:
\(SE_{diff} = \sqrt{\frac{s_{hd}^{2}}{n_{hd}} + \frac{s_{nhd}^{2}}{n_{nhd}}}\)
\(\bar{x}_{diff} = \bar{x}_{BP,HD} - \bar{x}_{BP,NHD}\)
Test statistic \(T = \frac{\bar{x}_{diff} - \mu_{diff}}{SE_{diff}}\)

mu_diff <- 0
xbar_diff <- meanBP_heart_1 - meanBP_heart_0
SE_diff <- round((sqrt((stdBP_heart_1^2)/n_heart_1 + (stdBP_heart_0^2)/n_heart_0)),4)
t <- round((xbar_diff - mu_diff)/SE_diff, 4)
paste0("The test statistic, t is : ", t)

## [1] "The test statistic, t is : 3.3394"

Compute the p-value:

alpha <- 0.05
df <- n_heart_0 + n_heart_1 - 2
# Two-sided p-value: double the tail area beyond |t| so the result lies in [0, 1]
p_value <- round(2 * pt(-abs(t), df), 6)
paste0("The p-value is ", p_value)

## [1] "The p-value is 0.000873"

Conclusion:
Since the p-value is less than 0.05, we reject the null hypothesis at \(\alpha = 0.05\). Therefore, there is sufficient statistical evidence that the mean RestingBP of those with Heart Disease differs from that of those without Heart Disease; the mean is about 4 mm Hg higher in the Heart Disease group (134.2 vs 130.2).
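
A two-sided p-value doubles the tail area beyond |t|, so it always lies between 0 and 1. The sketch below reproduces the RestingBP calculation from summary statistics alone. The helper name `two_sample_t` is hypothetical, the means and standard deviations are the rounded values from the table above, and the group sizes 508 and 410 are an assumption inferred from the HeartDisease mean of 0.5534 over 918 rows.

```r
# Two-sample t statistic and two-sided p-value from summary statistics.
# two_sample_t() is a hypothetical helper, not part of any package.
two_sample_t <- function(mean1, sd1, n1, mean2, sd2, n2) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)   # standard error of the difference
  t_stat <- (mean1 - mean2) / se
  df <- n1 + n2 - 2                     # same df convention as this report
  # Double the tail area beyond |t| so the result stays within [0, 1]
  p_value <- 2 * pt(-abs(t_stat), df)
  list(t = t_stat, df = df, p_value = p_value)
}

# Rounded RestingBP summary values; group sizes are inferred (assumption)
bp_test <- two_sample_t(134.1850, 19.82868, 508, 130.1805, 16.49958, 410)
bp_test$t
bp_test$p_value
```

With the full data available, `t.test(RestingBP ~ HeartDisease, data = heart)` runs Welch's version of the same test, which uses an approximate, smaller number of degrees of freedom and can serve as a cross-check.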

Hypothesis Testing for difference in mean Cholesterol level of individuals with Heart Disease and those without Heart Disease

State the Null and Alternative Hypothesis
Null Hypothesis, \(H_{0}\) : There is no difference in the mean Cholesterol level of those with Heart Disease and those without Heart Disease. \(\mu_{CLHeartDisease} - \mu_{CLNoHeartDisease} = 0\)
Alternative Hypothesis, \(H_{1}\) : There is some difference in the mean Cholesterol level of those with Heart Disease and those without Heart Disease. \(\mu_{CLHeartDisease} - \mu_{CLNoHeartDisease} \neq 0\)

Compute Test Statistics:
\(SE_{diff} = \sqrt{\frac{s_{hd}^{2}}{n_{hd}} + \frac{s_{nhd}^{2}}{n_{nhd}}}\)
\(\bar{x}_{diff} = \bar{x}_{CL,HD} - \bar{x}_{CL,NHD}\)
Test statistic \(T = \frac{\bar{x}_{diff} - \mu_{diff}}{SE_{diff}}\)

mu_diff <- 0
xbar_diff_cl <- meanCh_heart_1 - meanCh_heart_0
SE_diff <- round((sqrt((stdCh_heart_1^2)/n_heart_1 + (stdCh_heart_0^2)/n_heart_0)),4)
t2 <- round((xbar_diff_cl - mu_diff)/SE_diff, 4)
paste0("The test statistic, t is : ", t2)

## [1] "The test statistic, t is : -7.6269"

Compute the p - value:

alpha <- 0.05
df <- n_heart_0 + n_heart_1 - 2
# Two-sided p-value: double the tail area beyond |t2|
p_value2 <- round(2 * pt(-abs(t2), df), 6)
paste0("The p-value is ", p_value2)

## [1] "The p-value is 0"

Conclusion:
Since the p-value is less than 0.05 (it rounds to 0 at six decimal places), we reject the null hypothesis at \(\alpha = 0.05\). Therefore, there is sufficient statistical evidence that the mean Cholesterol level of those with Heart Disease differs from that of those without Heart Disease. In this dataset the mean is lower in the Heart Disease group (175.9 vs 227.1).
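
A confidence interval shows the size and direction of the difference as well as its significance. The sketch below computes a 95% interval for the difference in mean Cholesterol from the rounded summary statistics. The variable names are illustrative, and the group sizes 508 and 410 are an assumption inferred from the data summary rather than values printed in this report.

```r
# 95% confidence interval for the difference in mean Cholesterol
# (heart disease minus no heart disease), from rounded summary values.
mean_hd <- 175.9409; sd_hd <- 126.39140; n_hd <- 508     # n_hd inferred (assumption)
mean_nhd <- 227.1220; sd_nhd <- 74.63466; n_nhd <- 410   # n_nhd inferred (assumption)

diff_means <- mean_hd - mean_nhd
se_diff <- sqrt(sd_hd^2 / n_hd + sd_nhd^2 / n_nhd)
df_ci <- n_hd + n_nhd - 2          # same df convention as the hypothesis test
t_crit <- qt(0.975, df_ci)         # critical value for 95% confidence

ci_chol <- c(lower = diff_means - t_crit * se_diff,
             upper = diff_means + t_crit * se_diff)
ci_chol
```

An interval that lies entirely below zero agrees with rejecting the null hypothesis and indicates a lower mean in the Heart Disease group.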

Logistic Regression Model to predict whether an Individual will have heart disease or not.

Exploratory Data Analysis:

Check for Null values

# Check for NA values
any(is.na(heart))

## [1] FALSE

Visualize the NA values

# use missmap function from the Amelia package to check for NA values
missmap(heart, main = "Heart Data - Missing Values", col = c("yellow", "black"), legend = FALSE)

## Warning: Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `imputations`.

There are no NA values in the dataset

Bar Graph by Gender

# Bar Chart by Sex for the entire data set
p1 <- ggplot(heart, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph by Sex - All") + ylab(NULL)

# Bar plot by Sex for only those with Heart Disease
p2 <- ggplot(heart_1, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph by Sex - Heart Disease") + ylab(NULL)

# Bar plot by Sex for only those with no heart disease
p3 <- ggplot(heart_0, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph by Sex - No Heart Disease") + ylab(NULL)

# Bar plot of individuals who have heart disease by Sex
p4 <- heart %>% mutate(heart_prob = ifelse(HeartDisease == 1, "Yes", "No")) %>%
  ggplot(aes(x = heart_prob, fill = Sex)) + geom_bar() + theme_bw() + ylab(NULL) +
  labs(title = "HeartDisease vs No HeartDisease")

# Plot all bar graphs in a grid
plot_grid(p1, p2, p3, p4)

Histogram to show distribution by age

# Histogram to show age distribution in the dataset
p5 <- heart |> ggplot(aes(x = Age)) + geom_histogram(fill = "brown", binwidth = 2) + theme_bw() + 
  labs(title = "Distribution by Age") + ylab(NULL)

# Histogram of Cholesterol level
p6 <- ggplot(heart, aes(x = Cholesterol)) + geom_histogram(binwidth = 12, fill = "brown") +
  labs(title = "Distribution of Cholesterol level") + ylab(NULL) + theme_bw()

# Histogram of RestingBP
p7 <- heart %>% ggplot(aes(x = RestingBP)) + geom_histogram(binwidth = 15, fill = "brown") +
  labs(title = "Distribution of RestingBP") + ylab(NULL) + theme_bw()

# Plot all the histograms in a grid
plot_grid(p5, p6, p7)

Scatter plot of RestingBP vs Cholesterol

# RestingBP vs Cholesterol
heart |> ggplot(aes(x = Cholesterol, y = RestingBP, color = RestingECG)) + geom_point() +
  labs(title = "RestingBP vs Cholesterol") + theme_bw()

Box Plot of RestingBP for each ChestPainType

# Boxplot by ChestPainType
heart |> ggplot() + geom_boxplot(aes(x = ChestPainType, y = RestingBP)) + 
  labs(title = "Box Plot of Resting BP vs ChestPainType") + theme_bw()

Train Test Split

Use the caTools library to split the dataset into training (80%) and testing (20%) datasets

# Set a seed
set.seed(101)

#Split the sample
sample <- sample.split(heart$HeartDisease, SplitRatio = 0.8) 

# Training Data
heart_train <- subset(heart, sample == TRUE)

# Testing Data
heart_test <- subset(heart, sample == FALSE)
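
`sample.split` splits on the label vector so that both subsets keep roughly the same ratio of classes. The base R sketch below illustrates that idea on made-up labels; `stratified_flags` is a hypothetical helper written for illustration and is not how caTools is implemented.

```r
# Illustrative stratified split: mark about 80% of each class as training rows.
# stratified_flags() is a hypothetical helper written for this sketch.
stratified_flags <- function(labels, ratio = 0.8) {
  flags <- logical(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    n_train <- round(ratio * length(idx))
    # Sample positions within idx, which is safe even when idx has length 1
    flags[idx[sample(length(idx), n_train)]] <- TRUE
  }
  flags
}

set.seed(101)
toy_labels <- rep(c(0, 1), times = c(40, 60))   # made-up labels, not the heart data
is_train <- stratified_flags(toy_labels)

# Class proportions are the same in both subsets
prop.table(table(toy_labels[is_train]))
prop.table(table(toy_labels[!is_train]))
```

Keeping the class ratio equal in both subsets means the test accuracy is measured on the same mix of cases the model was trained on.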

Train the model

Train the model using a logistic model

# Train the model
heart_logistic_model <- glm(formula = HeartDisease ~ . , family = binomial(link = 'logit'), 
                            data = heart_train)

Get the summary of the model

# Get the summary of the logistic model
summary(heart_logistic_model)

## 
## Call:
## glm(formula = HeartDisease ~ ., family = binomial(link = "logit"), 
##     data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7225  -0.4271   0.1908   0.4654   2.5234  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.293756   1.507163  -0.195 0.845466    
## Age               0.009792   0.014414   0.679 0.496944    
## SexM              1.355461   0.303400   4.468 7.91e-06 ***
## ChestPainTypeATA -1.555284   0.346433  -4.489 7.14e-06 ***
## ChestPainTypeNAP -1.595361   0.295502  -5.399 6.71e-08 ***
## ChestPainTypeTA  -1.319753   0.462158  -2.856 0.004295 ** 
## RestingBP         0.002430   0.006451   0.377 0.706407    
## Cholesterol      -0.004666   0.001204  -3.876 0.000106 ***
## FastingBS         0.940464   0.291908   3.222 0.001274 ** 
## RestingECGNormal -0.287463   0.296022  -0.971 0.331507    
## RestingECGST     -0.270019   0.378956  -0.713 0.476134    
## MaxHR            -0.005918   0.005358  -1.105 0.269320    
## ExerciseAnginaY   0.886983   0.269180   3.295 0.000984 ***
## Oldpeak           0.452125   0.133341   3.391 0.000697 ***
## ST_SlopeFlat      1.529064   0.466951   3.275 0.001058 ** 
## ST_SlopeUp       -0.743409   0.483267  -1.538 0.123976    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1009.24  on 733  degrees of freedom
## Residual deviance:  495.73  on 718  degrees of freedom
## AIC: 527.73
## 
## Number of Fisher Scoring iterations: 5
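
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. The sketch below does this for a few estimates copied by hand from the summary above. With the fitted object available, `exp(coef(heart_logistic_model))` would give the same quantities for every term.

```r
# Convert selected log-odds estimates (copied from the model summary) to odds ratios
log_odds <- c(SexM = 1.355461,
              Cholesterol = -0.004666,
              FastingBS = 0.940464,
              ExerciseAnginaY = 0.886983,
              Oldpeak = 0.452125)

odds_ratios <- exp(log_odds)
round(odds_ratios, 3)

# An odds ratio above 1 means higher odds of HeartDisease for a one-unit
# increase (or for that category versus the reference level), holding the
# other predictors fixed; below 1 means lower odds.
```

For example, the SexM estimate corresponds to odds of HeartDisease that are almost four times as high for males as for females, with the other predictors held fixed.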

Fit the model

Predict values using the model

fit_heart_probabilities <- predict(heart_logistic_model, newdata = heart_test, type = "response")

Convert the probabilities to class labels

# Classify probabilities greater than 0.5 as 1 (heart disease) and the rest as 0
fit_heart_results <- ifelse(fit_heart_probabilities > 0.5, 1, 0)

Evaluate the model

Accuracy

# Misclassification Error
misclassError <- mean(fit_heart_results != heart_test$HeartDisease)
accuracy = round((1 - misclassError), 4) * 100
paste0("The accuracy of the logistic regression model is ", accuracy, "%")

## [1] "The accuracy of the logistic regression model is 88.59%"

Confusion Matrix

print("-CONFUSION MATRIX-")

## [1] "-CONFUSION MATRIX-"

table(heart_test$HeartDisease, fit_heart_results > 0.5)

##    
##     FALSE TRUE
##   0    68   14
##   1     7   95
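
Accuracy alone hides how the errors split between the two classes. Reading the rows as actual classes and the columns as predictions, the counts above also give sensitivity, specificity, and precision. The sketch below recomputes them, with the matrix entered by hand from the printed output and the object names chosen for illustration.

```r
# Confusion matrix entered from the printed output:
# rows = actual class, columns = predicted class
conf_mat <- matrix(c(68, 7, 14, 95), nrow = 2,
                   dimnames = list(actual = c("0", "1"),
                                   predicted = c("0", "1")))

tn <- conf_mat["0", "0"]; fp <- conf_mat["0", "1"]
fn <- conf_mat["1", "0"]; tp <- conf_mat["1", "1"]

metrics <- c(accuracy    = (tp + tn) / sum(conf_mat),
             sensitivity = tp / (tp + fn),   # share of actual cases detected
             specificity = tn / (tn + fp),   # share of non-cases correctly identified
             precision   = tp / (tp + fp))   # share of predicted cases that are real
round(metrics, 4)
```

The model misses 7 of the 102 actual heart disease cases in the test set and raises 14 false alarms among the 82 individuals without heart disease.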

CONCLUSION

From exploratory data analysis, we see that males are more likely to have HeartDisease than females. Also, from the summary of the logistic model, SexM (male sex) is a significant predictor of HeartDisease. From the hypothesis tests, the mean Cholesterol level differs significantly between those with and without heart disease (t = -7.63), and so does the mean RestingBP (t = 3.34, two-sided p-value below 0.001). In the logistic model, which adjusts for the other features, Cholesterol remains a significant predictor of HeartDisease, while RestingBP does not (p = 0.706). Other significant predictors in the model summary are ChestPainType, ExerciseAngina (exercise-induced angina), FastingBS (fasting blood sugar), Oldpeak, and ST_Slope (Flat).