Data Preparation

Load required libraries:

library(tidyverse)

Load the data from the GitHub repository:

url <- "https://raw.githubusercontent.com/chinedu2301/DATA606-Statistics-and-Probability-for-Data-Analytics/main/heart.csv"
heart <- read_csv(url)

Look at the first 10 rows of the data:

head(heart, n = 10)

Get a glimpse of the variables in the dataset:

# get a glimpse of the variables
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age            <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,~
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", ~
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",~
## $ RestingBP      <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, ~
## $ Cholesterol    <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, ~
## $ FastingBS      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",~
## $ MaxHR          <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9~
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", ~
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, ~
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl~
## $ HeartDisease   <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1~

There are 918 rows and 12 columns in this dataset.
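The counts can also be generated from the data rather than copied by hand from the glimpse() output, which avoids transcription slips. A minimal sketch: the helper name `describe_dims` is hypothetical, and the small data frame is made-up data used only to show the call.

```r
# Hypothetical helper: build the row/column sentence from the data itself.
describe_dims <- function(df) {
  sprintf("There are %d rows and %d columns in this dataset.",
          nrow(df), ncol(df))
}

# Made-up stand-in for `heart`, used only to demonstrate the call.
toy <- data.frame(Age = c(40, 49, 37), Sex = c("M", "F", "M"))
dims_sentence <- describe_dims(toy)
dims_sentence
```

With the real data loaded, the call would be `describe_dims(heart)`.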

Research question

This project aims to:
  • Determine if the mean resting blood pressure (RestingBP) of individuals in the dataset who develop heart disease is significantly different from the mean resting blood pressure of individuals who do not develop heart disease.
  • Determine if the mean Cholesterol level of individuals who develop heart disease is significantly different from the mean Cholesterol level of individuals who do not develop heart disease.
  • Predict whether an individual will develop heart disease or not using a logistic regression model in R.
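The three aims above could be approached roughly as follows. This is only a sketch under stated assumptions: the proposal does not name a specific test, so Welch's two-sample t-test (the default of `t.test`) is assumed here for the mean comparisons, and the choice of predictors in the model is illustrative. The data frame `sim` is simulated and does not contain values from the real dataset.

```r
# Simulated stand-in for `heart` (values are NOT from the real dataset).
set.seed(606)
n <- 200
sim <- data.frame(
  RestingBP   = round(rnorm(n, mean = 130, sd = 18)),
  Cholesterol = round(rnorm(n, mean = 220, sd = 50)),
  Age         = sample(30:75, n, replace = TRUE)
)
# Simulate a 0/1 outcome loosely related to Age.
sim$HeartDisease <- rbinom(n, size = 1, prob = plogis(-3 + 0.06 * sim$Age))

# Aims 1 and 2: Welch two-sample t-tests comparing the group means.
bp_test   <- t.test(RestingBP ~ HeartDisease, data = sim)
chol_test <- t.test(Cholesterol ~ HeartDisease, data = sim)

# Aim 3: logistic regression of the binary outcome on some predictors.
fit <- glm(HeartDisease ~ Age + RestingBP + Cholesterol,
           data = sim, family = binomial)

bp_test$p.value
chol_test$p.value
summary(fit)$coefficients
```

On the real data, `sim` would be replaced by `heart`, and the model formula would be extended to whichever of the eleven predictors are retained.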
Cases

    There are 12 variables and 918 observations in the dataset. Eleven (11) of the 12 variables are potential predictors of the twelfth (12th) variable, HeartDisease.
    Each observation represents the characteristics of an individual, such as Age, Sex, RestingBP, and Cholesterol level, together with whether that individual has heart disease or not.

    Data collection

    This dataset was downloaded from Kaggle and then uploaded to my GitHub repository.

    Type of study

    This is an observational study: the variables were recorded as observed, with no treatment assigned and no control group.

    Data Source

    This data was collected from Kaggle; the copy used in this analysis is read from the GitHub URL given in the Data Preparation section.

    Response Variable (Dependent Variable)

    The dependent variable is “HeartDisease”, which is coded as 1 if the individual has heart disease and as 0 if the individual does not. HeartDisease is a two-level categorical variable.
  • HeartDisease: output class [1: heart disease, 0: Normal]
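Because HeartDisease is read in as a numeric column (`<dbl>` in the glimpse output) while it is conceptually a two-level category, it may help to label the levels explicitly before plotting or tabulating. A minimal sketch; the helper name `label_heart_disease` is hypothetical and the input vector is made up for illustration.

```r
# Hypothetical helper: attach labels to the 0/1 coding described above.
label_heart_disease <- function(x) {
  factor(x, levels = c(0, 1), labels = c("Normal", "Heart disease"))
}

# Made-up values, used only to demonstrate the call.
codes <- c(0, 1, 0, 1, 1)
labelled <- label_heart_disease(codes)
table(labelled)
```

Note that `glm(..., family = binomial)` accepts either the 0/1 column or a two-level factor as the response, so the labelled version is mainly a convenience for summaries and plots.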
    Independent Variables (Explanatory or predictor variables)

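    There are eleven (11) explanatory variables. As read in, six are numeric (Age, RestingBP, Cholesterol, FastingBS, MaxHR, Oldpeak; FastingBS is a 0/1 indicator) and five are categorical (Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope). The explanatory variables are: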

  • Age: age of the patient [years]
  • Sex: sex of the patient [M: Male, F: Female]
  • ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
  • RestingBP: resting blood pressure [mm Hg]
  • Cholesterol: serum cholesterol [mg/dl]
  • FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
  • RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
  • MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
  • ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
  • Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
  • ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
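The categorical predictors in the list above are read in as character columns. Converting them to factors before modelling makes the levels explicit. A minimal base R sketch; the helper name `chars_to_factors` is hypothetical and the data frame is a made-up stand-in for `heart`.

```r
# Hypothetical helper: convert every character column to a factor,
# leaving numeric columns untouched.
chars_to_factors <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], factor)
  df
}

# Made-up stand-in for `heart`, used only to demonstrate the call.
toy <- data.frame(
  Age      = c(40, 49, 37),
  Sex      = c("M", "F", "M"),
  ST_Slope = c("Up", "Flat", "Up"),
  stringsAsFactors = FALSE
)
toy_f <- chars_to_factors(toy)
str(toy_f)
```

By default `factor()` orders levels alphabetically, so the reference level in a later model is the alphabetically first category unless it is changed with `relevel()`.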
    Relevant Summary statistics

    Relevant statistics are:

    Summary statistics of all variables

    summary(heart)
    ##       Age            Sex            ChestPainType        RestingBP    
    ##  Min.   :28.00   Length:918         Length:918         Min.   :  0.0  
    ##  1st Qu.:47.00   Class :character   Class :character   1st Qu.:120.0  
    ##  Median :54.00   Mode  :character   Mode  :character   Median :130.0  
    ##  Mean   :53.51                                         Mean   :132.4  
    ##  3rd Qu.:60.00                                         3rd Qu.:140.0  
    ##  Max.   :77.00                                         Max.   :200.0  
    ##   Cholesterol      FastingBS       RestingECG            MaxHR      
    ##  Min.   :  0.0   Min.   :0.0000   Length:918         Min.   : 60.0  
    ##  1st Qu.:173.2   1st Qu.:0.0000   Class :character   1st Qu.:120.0  
    ##  Median :223.0   Median :0.0000   Mode  :character   Median :138.0  
    ##  Mean   :198.8   Mean   :0.2331                      Mean   :136.8  
    ##  3rd Qu.:267.0   3rd Qu.:0.0000                      3rd Qu.:156.0  
    ##  Max.   :603.0   Max.   :1.0000                      Max.   :202.0  
    ##  ExerciseAngina        Oldpeak          ST_Slope          HeartDisease   
    ##  Length:918         Min.   :-2.6000   Length:918         Min.   :0.0000  
    ##  Class :character   1st Qu.: 0.0000   Class :character   1st Qu.:0.0000  
    ##  Mode  :character   Median : 0.6000   Mode  :character   Median :1.0000  
    ##                     Mean   : 0.8874                      Mean   :0.5534  
    ##                     3rd Qu.: 1.5000                      3rd Qu.:1.0000  
    ##                     Max.   : 6.2000                      Max.   :1.0000

    From the summary statistics, the mean age of individuals in the dataset is about 53.5 and the median age is 54. The mean RestingBP is 132.4, the mean Cholesterol level is 198.8, and the mean MaxHR is 136.8.
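The summary output also reports a minimum of 0 for both RestingBP and Cholesterol. Such values are not physiologically plausible, so they may be placeholders for missing measurements. This is an assumption, since the source does not say how missing values were recorded. A minimal sketch for checking how many zeros there are and how much they affect a mean; the helper names are hypothetical and the data frame is made up.

```r
# Hypothetical helper: count exact zeros in the chosen columns.
count_zeros <- function(df, cols) {
  vapply(df[cols], function(x) sum(x == 0, na.rm = TRUE), integer(1))
}

# Hypothetical helper: mean of a column after dropping zeros.
mean_excluding_zeros <- function(x) mean(x[x != 0], na.rm = TRUE)

# Made-up stand-in for `heart`, used only to demonstrate the calls.
toy <- data.frame(
  RestingBP   = c(140, 0, 130, 120),
  Cholesterol = c(289, 0, 0, 214)
)
zero_counts <- count_zeros(toy, c("RestingBP", "Cholesterol"))
zero_counts
chol_mean_nonzero <- mean_excluding_zeros(toy$Cholesterol)
chol_mean_nonzero
```

On the real data the calls would be `count_zeros(heart, c("RestingBP", "Cholesterol"))` and `mean_excluding_zeros(heart$Cholesterol)`.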

    Visualizations

    # Bar chart by gender
    ggplot(heart, aes(x = Sex)) + geom_bar(fill = "brown") + theme_bw() + 
      labs(title = "Bar Graph of total count by Gender") + ylab(NULL)

    The bar chart shows that the dataset contains considerably more males than females.

    # Barchart of individuals who have heart disease by gender
    heart %>% mutate(heart_prob = ifelse(HeartDisease == 1, "Yes", "No")) %>% 
      ggplot(aes(x = heart_prob, fill = Sex)) + geom_bar() + theme_bw() + 
      labs(title = "Bar Graph by Individuals who have HeartDisease") + xlab("HeartDisease") + ylab(NULL)
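The counts behind a stacked bar chart like the one above can also be tabulated directly, which makes the comparison between the sexes easier to quantify. A minimal base R sketch; the data frame is made up, and only the column names follow the dataset.

```r
# Made-up stand-in for `heart`, used only to demonstrate the calls.
toy <- data.frame(
  Sex          = c("M", "M", "F", "M", "F", "M"),
  HeartDisease = c(1, 0, 0, 1, 1, 1)
)
toy$heart_prob <- ifelse(toy$HeartDisease == 1, "Yes", "No")

# Cross-tabulate sex against heart disease status.
counts <- table(Sex = toy$Sex, HeartDisease = toy$heart_prob)
counts

# Row-wise proportions: share of each sex with and without heart disease.
props <- prop.table(counts, margin = 1)
props
```

Row-wise proportions are used here because the sexes are unequally represented, so raw counts alone can be misleading.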

    # Histogram of RestingBP
    ggplot(heart, aes(x = RestingBP)) + geom_histogram(binwidth = 15, fill = "brown") + 
      labs(title = "Distribution of RestingBP") + ylab(NULL)

    # Histogram of Age
    ggplot(heart, aes(x = Age)) + geom_histogram(binwidth = 2, fill = "brown") + 
      labs(title = "Distribution of Age") + ylab(NULL)

    # Histogram of Cholesterol level
    ggplot(heart, aes(x = Cholesterol)) + geom_histogram(binwidth = 12, fill = "brown") + 
      labs(title = "Distribution of Cholesterol level") + ylab(NULL)