ABSTRACT

In this project, data from Kaggle on individuals with and without heart disease was analyzed. A two-sample t-test shows a statistically significant difference in the mean RestingBP of individuals with HeartDisease and those without HeartDisease (t = 3.3394, p ≈ 0.0009). A second two-sample t-test shows a statistically significant difference in the mean Cholesterol level of the two groups (t = -7.6269, p < 0.001). In addition, a logistic regression model was developed to predict whether an individual has HeartDisease or not, based on the features present in the data.

INTRODUCTION

This project aims to:
  • Determine if the mean Resting Blood Pressure (RestingBP) of individuals in the dataset who develop heart disease is significantly different from the mean RestingBP of individuals who do not develop heart disease.
  • Determine if the mean Cholesterol level of individuals who develop heart disease is significantly different from the mean Cholesterol level of individuals who do not develop heart disease.
  • Predict whether an individual will develop heart disease or not using a logistic regression model in R.

DATA

    Data Source:

    The data was obtained from Kaggle.

    Data Collection:

    According to the Kaggle source, this dataset was created by combining five heart disease datasets that were previously available only independently. The datasets are combined over 11 common features, which, according to the source, makes it the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

  • Cleveland: 303 observations
  • Hungarian: 294 observations
  • Switzerland: 123 observations
  • Long Beach VA: 200 observations
  • Statlog (Heart) Data Set: 270 observations

    Total: 1190 observations
    Duplicated: 272 observations
    Final dataset: 918 observations
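
The arithmetic behind these counts can be checked in a few lines of R. This is only a sanity check on the figures quoted above; the variable names are illustrative and do not come from the Kaggle source.

```r
# Observation counts per source dataset, as listed above
source_counts <- c(Cleveland = 303, Hungarian = 294, Switzerland = 123,
                   LongBeachVA = 200, Statlog = 270)

total_obs <- sum(source_counts)   # combined rows before de-duplication
duplicated_obs <- 272             # duplicates reported by the Kaggle source
final_obs <- total_obs - duplicated_obs

c(total = total_obs, final = final_obs)
```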

    Every dataset used can be found under the index of heart disease datasets in the UCI Machine Learning Repository at the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/

    Type of Study:

    This is an observational study

    Cases:

    There are 12 variables and 918 observations. Eleven (11) of the 12 variables are explanatory variables and one is the response variable.

    Response Variable (Dependent Variable) :

    The dependent variable is “HeartDisease”, which is coded as 1 if the individual has heart disease and as 0 if the individual does not. HeartDisease is a two level categorical variable.
  • HeartDisease: output class [1: heart disease, 0: Normal]

    Explanatory Variables (Independent Variables):

    There are eleven (11) explanatory variables: five are numerical and six are categorical. The explanatory variables are:

  • Age: age of the patient [years] - Numerical variable
  • Sex: sex of the patient [M: Male, F: Female] - Two level categorical variable
  • ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic] - Four level categorical variable
  • RestingBP: resting blood pressure [mm Hg] - Numerical variable
  • Cholesterol: serum cholesterol [mg/dl] - Numerical variable
  • FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise] - Two level categorical variable
  • RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria] - Three level categorical variable
  • MaxHR: maximum heart rate achieved [Numeric value between 60 and 202] - Numerical variable
  • ExerciseAngina: exercise-induced angina [Y: Yes, N: No] - Two level categorical variable
  • Oldpeak: oldpeak = ST [Numeric value measured in depression] - Numerical variable
  • ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping] - Three level categorical variable
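
The categorical variables above arrive as character or 0/1 columns when the CSV is read. A minimal sketch of encoding them explicitly as factors is shown below, using a small toy data frame with illustrative values (not rows taken from the dataset). R's modelling functions generally treat character columns as factors on their own, so this step is optional and is shown only to make the variable types explicit.

```r
# Toy rows mimicking some of the categorical columns described above
# (values are illustrative, not taken from the dataset)
toy <- data.frame(
  Sex = c("M", "F", "M"),
  ChestPainType = c("ATA", "NAP", "ASY"),
  FastingBS = c(0, 1, 0),
  RestingECG = c("Normal", "ST", "LVH"),
  stringsAsFactors = FALSE
)

# Convert each categorical column to a factor
categorical_cols <- c("Sex", "ChestPainType", "FastingBS", "RestingECG")
toy[categorical_cols] <- lapply(toy[categorical_cols], factor)

# Number of levels observed in each column of the toy data
sapply(toy, nlevels)
```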

DATA ANALYSIS

    Required Libraries

    library(tidyverse)
    library(caTools) # To split data into training and test data
    library(Amelia) # To visualize missing data
    library(cowplot) # To combine plots in a grid

    Load the data

    url <- "https://raw.githubusercontent.com/chinedu2301/DATA606-Statistics-and-Probability-for-Data-Analytics/main/heart.csv"
    heart <- read_csv(url)

    Check the head of the data

    # Check the head of the data
    head(heart)
    ## # A tibble: 6 x 12
    ##     Age Sex   ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
    ##   <dbl> <chr> <chr>             <dbl>       <dbl>     <dbl> <chr>      <dbl>
    ## 1    40 M     ATA                 140         289         0 Normal       172
    ## 2    49 F     NAP                 160         180         0 Normal       156
    ## 3    37 M     ATA                 130         283         0 ST            98
    ## 4    48 F     ASY                 138         214         0 Normal       108
    ## 5    54 M     NAP                 150         195         0 Normal       122
    ## 6    39 M     NAP                 120         339         0 Normal       170
    ## # ... with 4 more variables: ExerciseAngina <chr>, Oldpeak <dbl>,
    ## #   ST_Slope <chr>, HeartDisease <dbl>

    Get a glimpse of the data types and structure

    glimpse(heart)
    ## Rows: 918
    ## Columns: 12
    ## $ Age            <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
    ## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", ~
    ## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",~
    ## $ RestingBP      <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
    ## $ Cholesterol    <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
    ## $ FastingBS      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
    ## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",~
    ## $ MaxHR          <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
    ## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", ~
    ## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
    ## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl~
    ## $ HeartDisease   <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~

    There are 918 observations and 12 variables in the dataset

    Get the summary

    summary(heart)
    ##       Age            Sex            ChestPainType        RestingBP    
    ##  Min.   :28.00   Length:918         Length:918         Min.   :  0.0  
    ##  1st Qu.:47.00   Class :character   Class :character   1st Qu.:120.0  
    ##  Median :54.00   Mode  :character   Mode  :character   Median :130.0  
    ##  Mean   :53.51                                         Mean   :132.4  
    ##  3rd Qu.:60.00                                         3rd Qu.:140.0  
    ##  Max.   :77.00                                         Max.   :200.0  
    ##   Cholesterol      FastingBS       RestingECG            MaxHR      
    ##  Min.   :  0.0   Min.   :0.0000   Length:918         Min.   : 60.0  
    ##  1st Qu.:173.2   1st Qu.:0.0000   Class :character   1st Qu.:120.0  
    ##  Median :223.0   Median :0.0000   Mode  :character   Median :138.0  
    ##  Mean   :198.8   Mean   :0.2331                      Mean   :136.8  
    ##  3rd Qu.:267.0   3rd Qu.:0.0000                      3rd Qu.:156.0  
    ##  Max.   :603.0   Max.   :1.0000                      Max.   :202.0  
    ##  ExerciseAngina        Oldpeak          ST_Slope          HeartDisease   
    ##  Length:918         Min.   :-2.6000   Length:918         Min.   :0.0000  
    ##  Class :character   1st Qu.: 0.0000   Class :character   1st Qu.:0.0000  
    ##  Mode  :character   Median : 0.6000   Mode  :character   Median :1.0000  
    ##                     Mean   : 0.8874                      Mean   :0.5534  
    ##                     3rd Qu.: 1.5000                      3rd Qu.:1.0000  
    ##                     Max.   : 6.2000                      Max.   :1.0000

    From the summary statistics, we can see that the mean age of individuals in the dataset is about 53.5 years while the median age is 54 years. Also, the mean RestingBP is 132.4 mm Hg, the mean Cholesterol level is 198.8 mg/dl, and the mean MaxHR is 136.8.
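
The summary also shows a minimum of 0 for both RestingBP and Cholesterol, which is not a physiologically plausible measurement and may indicate missing values recorded as 0. A minimal sketch of counting such zeros is given below; `count_zeros` is a hypothetical helper and the toy data frame holds illustrative values, not rows from the dataset. On the real data the same call could be made with `heart` in place of `toy`.

```r
# Hypothetical helper: count exact zeros in the selected numeric columns
count_zeros <- function(df, cols) {
  sapply(df[cols], function(x) sum(x == 0, na.rm = TRUE))
}

# Toy data with illustrative values
toy <- data.frame(
  RestingBP   = c(140, 0, 130, 120),
  Cholesterol = c(289, 0, 0, 214)
)

zero_counts <- count_zeros(toy, c("RestingBP", "Cholesterol"))
zero_counts
```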

    Compute the mean and standard deviations for the RestingBP and Cholesterol levels for both those with Heart Disease and those without Heart Disease.

    # Get the mean and standard deviations of RestingBP and Cholesterol level for individuals with HeartDisease
    heart_1 <- heart %>% filter(HeartDisease == 1)
    meanBP_heart_1 <- mean(heart_1$RestingBP)
    stdBP_heart_1 <- sd(heart_1$RestingBP)
    meanCh_heart_1 <- mean(heart_1$Cholesterol)
    stdCh_heart_1 <- sd(heart_1$Cholesterol)
    n_heart_1 <- nrow(heart_1)
    
    # Get the mean and standard deviations of RestingBP and Cholesterol level for individuals without HeartDisease
    heart_0 <- heart %>% filter(HeartDisease == 0)
    meanBP_heart_0 <- mean(heart_0$RestingBP)
    stdBP_heart_0 <- sd(heart_0$RestingBP)
    meanCh_heart_0 <- mean(heart_0$Cholesterol)
    stdCh_heart_0 <- sd(heart_0$Cholesterol)
    n_heart_0 <- nrow(heart_0)
    
    # Arrange the values in a dataframe
    meanBP <- c(meanBP_heart_1, meanBP_heart_0)
    stdBP <- c(stdBP_heart_1, stdBP_heart_0)
    meanCho <- c(meanCh_heart_1, meanCh_heart_0)
    stdCho <- c(stdCh_heart_1, stdCh_heart_0)
    table <- data.frame(meanBP, stdBP, meanCho, stdCho)
    row.names(table) <- c("Heart Disease", "No Heart Disease")
    headers <- c("Mean RestingBP", "Std RestingBP", "Mean Cholesterol", "Std Cholesterol")
    colnames(table) <- headers
    table
    ##                  Mean RestingBP Std RestingBP Mean Cholesterol Std Cholesterol
    ## Heart Disease          134.1850      19.82868         175.9409       126.39140
    ## No Heart Disease       130.1805      16.49958         227.1220        74.63466
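
The same per-group table can be produced in a single call. The sketch below uses base R's `aggregate()` on a toy data frame with illustrative values; it is an alternative to the step-by-step code above, not the method used in this report.

```r
# Toy data with illustrative values (not rows from the dataset)
toy <- data.frame(
  HeartDisease = c(1, 1, 1, 0, 0, 0),
  RestingBP    = c(140, 150, 130, 120, 130, 125),
  Cholesterol  = c(200, 0, 250, 230, 240, 220)
)

# Mean, standard deviation and group size for each variable, by group
group_summary <- aggregate(
  cbind(RestingBP, Cholesterol) ~ HeartDisease,
  data = toy,
  FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))
)
group_summary
```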

    Hypothesis Testing for difference in mean RestingBP of individuals with Heart Disease and those without Heart Disease

    State the Null and Alternative Hypothesis
    Null Hypothesis, \(H_{0}\) : There is no difference in the mean RestingBP of those with Heart Disease and those without Heart Disease. \(\mu_{BPHeartDisease} - \mu_{BPNoHeartDisease} = 0\)
    Alternative Hypothesis, \(H_{1}\) : There is some difference in the mean RestingBP of those with Heart Disease and those without Heart Disease. \(\mu_{BPHeartDisease} - \mu_{BPNoHeartDisease} \neq 0\)

    Check conditions:
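    Independence: The two samples are made up of different individuals, so they are treated as independent of each other. Hence, the independence criterion is satisfied.
    Normality: The sample size in each group is large enough that the sampling distribution of the difference in means can be assumed to be nearly normal.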

    Compute Test Statistics:
    \(SE_{diff} = \sqrt{\frac{s_{hd}^{2}}{n_{hd}} + \frac{s_{nhd}^{2}}{n_{nhd}}}\)
    \(\bar{x}_{diff} = \bar{x}_{BP,hd} - \bar{x}_{BP,nhd}\)
    Test statistic \(T = \frac{\bar{x}_{diff} - \mu_{diff}}{SE_{diff}}\)

    mu_diff <- 0
    xbar_diff <- meanBP_heart_1 - meanBP_heart_0
    SE_diff <- round((sqrt((stdBP_heart_1^2)/n_heart_1 + (stdBP_heart_0^2)/n_heart_0)),4)
    t <- round((xbar_diff - mu_diff)/SE_diff, 4)
    paste0("The test statistic, t is : ", t)
    ## [1] "The test statistic, t is : 3.3394"

    Compute the p - value:

    alpha <- 0.05
    df <- n_heart_0 + n_heart_1 - 2
    # Two-sided p-value: take the lower tail at -|t| so the result stays in [0, 1]
    p_value <- round(2*pt(-abs(t), df), 6)
    paste0("The p-value is ", p_value)
    ## [1] "The p-value is 0.000873"

    Conclusion:
    Since the p-value is less than 0.05, we reject the null hypothesis at \(\alpha = 0.05\). Therefore, there is sufficient statistical evidence that the mean RestingBP of those with Heart Disease differs from that of those without Heart Disease. In this dataset, the mean RestingBP is about 4 mm Hg higher in the Heart Disease group (134.2 vs 130.2).
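
As a cross-check on the hand computation, the sketch below recomputes the test from the summary statistics in the table above. `welch_from_summary` is a hypothetical helper, and the group sizes 508 and 410 are an assumption: they are not printed in this report but are inferred from the mean of HeartDisease (0.5534 × 918 ≈ 508). The helper uses the Welch-Satterthwaite degrees of freedom rather than n1 + n0 - 2, and takes the lower tail at -|t| so that the two-sided p-value always lies in [0, 1].

```r
# Hypothetical helper: unequal-variance (Welch) two-sample t-test from summary statistics
welch_from_summary <- function(m1, s1, n1, m0, s0, n0) {
  v1 <- s1^2 / n1                      # variance of the first group mean
  v0 <- s0^2 / n0                      # variance of the second group mean
  se <- sqrt(v1 + v0)                  # standard error of the difference
  t_stat <- (m1 - m0) / se
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v0)^2 / (v1^2 / (n1 - 1) + v0^2 / (n0 - 1))
  p <- 2 * pt(-abs(t_stat), df)        # two-sided p-value, always in [0, 1]
  list(t = t_stat, df = df, p_value = p)
}

# RestingBP means and standard deviations from the table above;
# group sizes 508 and 410 are inferred, not reported (assumption)
bp_test <- welch_from_summary(134.1850, 19.82868, 508,
                              130.1805, 16.49958, 410)
bp_test
```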

    Hypothesis Testing for difference in mean Cholesterol level of individuals with Heart Disease and those without Heart Disease

    State the Null and Alternative Hypothesis
    Null Hypothesis, \(H_{0}\) : There is no difference in the mean Cholesterol level of those with Heart Disease and those without Heart Disease. \(\mu_{CLHeartDisease} - \mu_{CLNoHeartDisease} = 0\)
    Alternative Hypothesis, \(H_{1}\) : There is some difference in the mean Cholesterol level of those with Heart Disease and those without Heart Disease. \(\mu_{CLHeartDisease} - \mu_{CLNoHeartDisease} \neq 0\)

    Check conditions:
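    Independence: The two samples are made up of different individuals, so they are treated as independent of each other. Hence, the independence criterion is satisfied.
    Normality: The sample size in each group is large enough that the sampling distribution of the difference in means can be assumed to be nearly normal.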

    Compute Test Statistics:
    \(SE_{diff} = \sqrt{\frac{s_{hd}^{2}}{n_{hd}} + \frac{s_{nhd}^{2}}{n_{nhd}}}\)
    \(\bar{x}_{diff} = \bar{x}_{CL,hd} - \bar{x}_{CL,nhd}\)
    Test statistic \(T = \frac{\bar{x}_{diff} - \mu_{diff}}{SE_{diff}}\)

    mu_diff <- 0
    xbar_diff_cl <- meanCh_heart_1 - meanCh_heart_0
    SE_diff <- round((sqrt((stdCh_heart_1^2)/n_heart_1 + (stdCh_heart_0^2)/n_heart_0)),4)
    t2 <- round((xbar_diff_cl - mu_diff)/SE_diff, 4)
    paste0("The test statistic, t is : ", t2)
    ## [1] "The test statistic, t is : -7.6269"

    Compute the p - value:

    alpha <- 0.05
    df <- n_heart_0 + n_heart_1 - 2
    # Two-sided p-value: take the lower tail at -|t| so the result stays in [0, 1]
    p_value2 <- round(2*pt(-abs(t2), df), 6)
    paste0("The p-value is ", p_value2)
    ## [1] "The p-value is 0"

    Conclusion:
    Since the p-value is less than 0.05 (it rounds to 0 at six decimal places), we reject the null hypothesis at \(\alpha = 0.05\). Therefore, there is sufficient statistical evidence that the mean Cholesterol level of those with Heart Disease differs from that of those without Heart Disease. In this dataset, the mean Cholesterol level is lower in the Heart Disease group (175.9 vs 227.1).
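
R's built-in `t.test()` performs the same kind of two-sample comparison directly on the raw values. The sketch below runs it on simulated data (the group means, standard deviations and sizes are made up for illustration) and compares its statistic with the manual formula used in this section. On the real data, a call of the form `t.test(Cholesterol ~ HeartDisease, data = heart)` would be the analogous step.

```r
# Simulated groups; all parameters are illustrative, not estimates from the dataset
set.seed(1)
group_hd  <- rnorm(50, mean = 180, sd = 100)
group_nhd <- rnorm(40, mean = 230, sd = 70)

# Built-in two-sided test with unequal variances (Welch)
welch <- t.test(group_hd, group_nhd, alternative = "two.sided", var.equal = FALSE)

# Manual test statistic using the SE formula from this section
manual_se <- sqrt(var(group_hd) / length(group_hd) +
                  var(group_nhd) / length(group_nhd))
manual_t <- (mean(group_hd) - mean(group_nhd)) / manual_se

c(t_test = unname(welch$statistic), manual = manual_t)
```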

    Logistic Regression Model to predict whether an Individual will have heart disease or not.

    Exploratory Data Analysis:

    Check for Null values

    # Check for NA values
    any(is.na(heart))
    ## [1] FALSE

    Visualize the NA values

    # use missmap function from the Amelia package to check for NA values
    missmap(heart, main = "Heart Data - Missing Values", col = c("yellow", "black"), legend = FALSE)
    ## Warning: Unknown or uninitialised column: `arguments`.
    
    ## Warning: Unknown or uninitialised column: `arguments`.
    ## Warning: Unknown or uninitialised column: `imputations`.

    There are no NA values in the dataset
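
When missing values are present, a per-column count is often more informative than a single TRUE/FALSE. A minimal sketch on a toy data frame with illustrative values follows; on the real data the same expression could be applied to `heart`.

```r
# Toy data with deliberately missing entries (illustrative values)
toy <- data.frame(
  Age   = c(40, NA, 37),
  Sex   = c("M", "F", NA),
  MaxHR = c(172, 156, 98)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```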

    Bar Graph by Sex

    # Bar Chart by Sex for the entire data set
    p1 <- ggplot(heart, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
      labs(title = "Bar Graph by Sex - All") + ylab(NULL)
    
    # Bar plot by Sex for only those with Heart Disease
    p2 <- ggplot(heart_1, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
      labs(title = "Bar Graph by Sex - Heart Disease") + ylab(NULL)
    
    # Bar plot by Sex for only those with no heart disease
    p3 <- ggplot(heart_0, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
      labs(title = "Bar Graph by Sex - No Heart Disease") + ylab(NULL)
    
    # Bar plot of individuals who have heart disease by Sex
    p4 <- heart %>% mutate(heart_prob = ifelse(HeartDisease == 1, "Yes", "No")) %>%
      ggplot(aes(x = heart_prob, fill = Sex)) + geom_bar() + theme_bw() + ylab(NULL) +
      labs(title = "HeartDisease vs No HeartDisease")
    
    # Plot all bar graphs in a grid
    plot_grid(p1, p2, p3, p4)

    Histogram to show distribution by age

    # Histogram to show age distribution in the dataset
    p5 <- heart |> ggplot(aes(x = Age)) + geom_histogram(fill = "brown", binwidth = 2) + theme_bw() + 
      labs(title = "Distribution by Age") + ylab(NULL)
    
    # Histogram of Cholesterol level
    p6 <- ggplot(heart, aes(x = Cholesterol)) + geom_histogram(binwidth = 12, fill = "brown") +
      labs(title = "Distribution of Cholesterol level") + ylab(NULL) + theme_bw()
    
    # Histogram of RestingBP
    p7 <- heart %>% ggplot(aes(x = RestingBP)) + geom_histogram(binwidth = 15, fill = "brown") +
      labs(title = "Distribution of RestingBP") + ylab(NULL) + theme_bw()
    
    # Plot all the histograms in a grid
    plot_grid(p5, p6, p7)

    Scatter plot of RestingBP vs Cholesterol

    # RestingBP vs Cholesterol
    heart |> ggplot(aes(x = Cholesterol, y = RestingBP, color = RestingECG)) + geom_point() +
      labs(title = "RestingBP vs Cholesterol") + theme_bw()

    Box Plot of RestingBP for each ChestPainType

    # Boxplot by ChestPainType
    heart |> ggplot() + geom_boxplot(aes(x = ChestPainType, y = RestingBP)) + 
      labs(title = "Box Plot of Resting BP vs ChestPainType") + theme_bw()

    Train Test Split

    Use the caTools library to split the dataset into training (80%) and testing (20%) datasets

    # Set a seed
    set.seed(101)
    
    #Split the sample
    sample <- sample.split(heart$HeartDisease, SplitRatio = 0.8) 
    
    # Training Data
    heart_train <- subset(heart, sample == TRUE)
    
    # Testing Data
    heart_test <- subset(heart, sample == FALSE)
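
The split above is made on the HeartDisease labels so that both classes are represented in similar proportions in the training and testing data. The base-R sketch below illustrates that idea on a toy label vector; `stratified_split` is a hypothetical helper written for illustration and is not the caTools implementation.

```r
# Hypothetical helper: mark roughly `ratio` of each class as training rows
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)                       # positions of this class
    n_train <- round(ratio * length(idx))        # training rows for this class
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(101)
y <- rep(c(0, 1), times = c(40, 60))             # toy labels: 40 zeros, 60 ones
in_train <- stratified_split(y, 0.8)

# Rows per class in the test (FALSE) and training (TRUE) sets
table(y, in_train)
```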

    Train the model

    Train the model using a logistic model

    # Train the model
    heart_logistic_model <- glm(formula = HeartDisease ~ . , family = binomial(link = 'logit'), 
                                data = heart_train)

    Get the summary of the model

    # Get the summary of the logistic model
    summary(heart_logistic_model)
    ## 
    ## Call:
    ## glm(formula = HeartDisease ~ ., family = binomial(link = "logit"), 
    ##     data = heart_train)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -2.7225  -0.4271   0.1908   0.4654   2.5234  
    ## 
    ## Coefficients:
    ##                   Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)      -0.293756   1.507163  -0.195 0.845466    
    ## Age               0.009792   0.014414   0.679 0.496944    
    ## SexM              1.355461   0.303400   4.468 7.91e-06 ***
    ## ChestPainTypeATA -1.555284   0.346433  -4.489 7.14e-06 ***
    ## ChestPainTypeNAP -1.595361   0.295502  -5.399 6.71e-08 ***
    ## ChestPainTypeTA  -1.319753   0.462158  -2.856 0.004295 ** 
    ## RestingBP         0.002430   0.006451   0.377 0.706407    
    ## Cholesterol      -0.004666   0.001204  -3.876 0.000106 ***
    ## FastingBS         0.940464   0.291908   3.222 0.001274 ** 
    ## RestingECGNormal -0.287463   0.296022  -0.971 0.331507    
    ## RestingECGST     -0.270019   0.378956  -0.713 0.476134    
    ## MaxHR            -0.005918   0.005358  -1.105 0.269320    
    ## ExerciseAnginaY   0.886983   0.269180   3.295 0.000984 ***
    ## Oldpeak           0.452125   0.133341   3.391 0.000697 ***
    ## ST_SlopeFlat      1.529064   0.466951   3.275 0.001058 ** 
    ## ST_SlopeUp       -0.743409   0.483267  -1.538 0.123976    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 1009.24  on 733  degrees of freedom
    ## Residual deviance:  495.73  on 718  degrees of freedom
    ## AIC: 527.73
    ## 
    ## Number of Fisher Scoring iterations: 5
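
The coefficients in the summary are on the log-odds scale, so exponentiating them gives odds ratios, which are usually easier to interpret. The sketch below does this for a few estimates copied from the summary above; with the fitted object available, `exp(coef(heart_logistic_model))` would give the odds ratios for all terms.

```r
# Selected estimates copied from the model summary above (log-odds scale)
coefs <- c(SexM = 1.355461, Cholesterol = -0.004666, FastingBS = 0.940464,
           ExerciseAnginaY = 0.886983, Oldpeak = 0.452125)

# Odds ratio for a one-unit increase (or for the named level vs the baseline),
# holding the other predictors fixed
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

An odds ratio above 1 means the predictor is associated with higher odds of HeartDisease in this model, and a value below 1 means lower odds.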

    Fit the model

    Predict the probability of heart disease for the test data using the model

    fit_heart_probabilities <- predict(heart_logistic_model, newdata = heart_test, type = "response")

    Convert the predicted probabilities into class labels

    # Classify probabilities greater than 0.5 as 1 (heart disease), otherwise as 0
    fit_heart_results <- ifelse(fit_heart_probabilities > 0.5, 1, 0)

    Evaluate the model

    Accuracy

    # Misclassification Error
    misclassError <- mean(fit_heart_results != heart_test$HeartDisease)
    accuracy <- round((1 - misclassError), 4) * 100
    paste0("The accuracy of the logistic regression model is ", accuracy, "%")
    ## [1] "The accuracy of the logistic regression model is 88.59%"

    Confusion Matrix

    print("-CONFUSION MATRIX-")
    ## [1] "-CONFUSION MATRIX-"
    table(heart_test$HeartDisease, fit_heart_results > 0.5)
    ##    
    ##     FALSE TRUE
    ##   0    68   14
    ##   1     7   95
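
Accuracy alone does not show how the errors are split between the two classes. The sketch below derives sensitivity, specificity and precision from the confusion matrix above, with rows as actual classes and columns as predictions; the counts are copied from that output and the variable names are illustrative.

```r
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm <- matrix(c(68, 7, 14, 95), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

tn <- cm["0", "FALSE"]   # no heart disease, predicted no heart disease
fp <- cm["0", "TRUE"]    # no heart disease, predicted heart disease
fn <- cm["1", "FALSE"]   # heart disease, predicted no heart disease
tp <- cm["1", "TRUE"]    # heart disease, predicted heart disease

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),   # share of actual cases that are detected
  specificity = tn / (tn + fp),   # share of non-cases correctly cleared
  precision   = tp / (tp + fp)    # share of positive predictions that are correct
)
round(metrics, 4)
```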

    CONCLUSION

    From exploratory data analysis, we see that males are more likely to have HeartDisease than females. Also, from the summary of the logistic model, “SexM” (male sex) is a significant predictor of HeartDisease. Furthermore, the hypothesis tests show a significant difference between those with and without heart disease in both mean Cholesterol level and mean RestingBP. In the logistic model, which adjusts for the other features, Cholesterol level is a significant predictor of HeartDisease, while RestingBP is not. In addition, other significant predictors of HeartDisease from the model summary include ChestPainType, exercise-induced angina (ExerciseAngina), Fasting Blood Sugar (FastingBS), Oldpeak, and a flat ST_Slope. On the test data, the model reached an accuracy of 88.59%.
