CUNY SPS DATA 622 HW1 - Exploratory Analysis

Datasets To be Analyzed

1.0 First Dataset - Heart Disease Prediction

This section analyzes whether an individual has heart disease.

Load required libraries:

library(tidyverse)
library(corrplot)
library(ggcorrplot)
library(Amelia)
library(caTools)
library(cowplot)

Load the data from the GitHub repo

url <- "https://raw.githubusercontent.com/chinedu2301/data622-machine-learning-and-big-data/main/data/heart_disease.csv"
heart_disease <- read_csv(url)

Look at the head of the data

head(heart_disease, n = 10)

## # A tibble: 10 × 12
##      Age Sex   ChestPain…¹ Resti…² Chole…³ Fasti…⁴ Resti…⁵ MaxHR Exerc…⁶ Oldpeak
##    <dbl> <chr> <chr>         <dbl>   <dbl>   <dbl> <chr>   <dbl> <chr>     <dbl>
##  1    40 M     ATA             140     289       0 Normal    172 N           0  
##  2    49 F     NAP             160     180       0 Normal    156 N           1  
##  3    37 M     ATA             130     283       0 ST         98 N           0  
##  4    48 F     ASY             138     214       0 Normal    108 Y           1.5
##  5    54 M     NAP             150     195       0 Normal    122 N           0  
##  6    39 M     NAP             120     339       0 Normal    170 N           0  
##  7    45 F     ATA             130     237       0 Normal    170 N           0  
##  8    54 M     ATA             110     208       0 Normal    142 N           0  
##  9    37 M     ASY             140     207       0 Normal    130 Y           1.5
## 10    48 F     ATA             120     284       0 Normal    120 N           0  
## # … with 2 more variables: ST_Slope <chr>, HeartDisease <dbl>, and abbreviated
## #   variable names ¹ChestPainType, ²RestingBP, ³Cholesterol, ⁴FastingBS,
## #   ⁵RestingECG, ⁶ExerciseAngina

Get a glimpse of the variables in the dataset.

# get a glimpse of the variables
glimpse(heart_disease)

## Rows: 918
## Columns: 12
## $ Age            <dbl> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP      <dbl> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <dbl> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR          <dbl> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease   <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…

There are 918 observations (rows) and 12 variables (columns) in this dataset. Eleven (11) of the 12 variables are potential predictors of the twelfth, HeartDisease, so the data is labelled.
Each observation records the characteristics of an individual, such as Age, Sex, RestingBP, and Cholesterol level, and whether that individual has heart disease.

Questions

This project aims to:

Explore the dataset to understand the data better.

Find out whether the columns of the data are correlated.

Predict whether an individual will develop heart disease or not using a logistic regression model in R.

Data Source

This dataset was downloaded from Kaggle and then uploaded to my github repository.

Response Variable (Dependent Variable)

The dependent variable is “HeartDisease”, which is coded as 1 if the individual has heart disease and as 0 otherwise. HeartDisease is a two-level categorical variable.

HeartDisease: output class [1: heart disease, 0: Normal]

Independent Variable (Explanatory or predictor variables)

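There are eleven (11) explanatory variables. Five are numeric (Age, RestingBP, Cholesterol, MaxHR, and Oldpeak) and six are categorical (Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, and ST_Slope; FastingBS is stored as a 0/1 number). The explanatory variables are: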

Age: age of the patient [years]

Sex: sex of the patient [M: Male, F: Female]

ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]

RestingBP: resting blood pressure [mm Hg]

Cholesterol: serum cholesterol [mg/dl]

FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]

RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]

MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]

ExerciseAngina: exercise-induced angina [Y: Yes, N: No]

Oldpeak: ST depression induced by exercise relative to rest [Numeric value]

ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]

Relevant Summary statistics

Relevant statistics are:

Summary statistics of all variables

summary(heart_disease)

##       Age            Sex            ChestPainType        RestingBP    
##  Min.   :28.00   Length:918         Length:918         Min.   :  0.0  
##  1st Qu.:47.00   Class :character   Class :character   1st Qu.:120.0  
##  Median :54.00   Mode  :character   Mode  :character   Median :130.0  
##  Mean   :53.51                                         Mean   :132.4  
##  3rd Qu.:60.00                                         3rd Qu.:140.0  
##  Max.   :77.00                                         Max.   :200.0  
##   Cholesterol      FastingBS       RestingECG            MaxHR      
##  Min.   :  0.0   Min.   :0.0000   Length:918         Min.   : 60.0  
##  1st Qu.:173.2   1st Qu.:0.0000   Class :character   1st Qu.:120.0  
##  Median :223.0   Median :0.0000   Mode  :character   Median :138.0  
##  Mean   :198.8   Mean   :0.2331                      Mean   :136.8  
##  3rd Qu.:267.0   3rd Qu.:0.0000                      3rd Qu.:156.0  
##  Max.   :603.0   Max.   :1.0000                      Max.   :202.0  
##  ExerciseAngina        Oldpeak          ST_Slope          HeartDisease   
##  Length:918         Min.   :-2.6000   Length:918         Min.   :0.0000  
##  Class :character   1st Qu.: 0.0000   Class :character   1st Qu.:0.0000  
##  Mode  :character   Median : 0.6000   Mode  :character   Median :1.0000  
##                     Mean   : 0.8874                      Mean   :0.5534  
##                     3rd Qu.: 1.5000                      3rd Qu.:1.0000  
##                     Max.   : 6.2000                      Max.   :1.0000

From the summary statistics, the mean age of individuals in the dataset is about 53.5 years and the median age is 54. The mean RestingBP is 132.4, the mean Cholesterol level is 198.8, and the mean MaxHR is 136.8. The mean of HeartDisease is 0.5534, so about 55% of the individuals have heart disease.
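
The summary reports a minimum of 0 for both RestingBP and Cholesterol. A reading of zero is not plausible for a living patient, so these zeros may be placeholders for missing measurements. A minimal sketch of how such zeros could be counted, using a small made-up data frame (`toy` and its values are illustrative and are not rows from the dataset):

```r
# Toy stand-in for heart_disease; values are invented for illustration
toy <- data.frame(
  RestingBP   = c(140, 0, 130, 120),
  Cholesterol = c(289, 0, 0, 237)
)

# Count zero readings in each column
zero_counts <- colSums(toy[, c("RestingBP", "Cholesterol")] == 0)
print(zero_counts)
```

Replacing `toy` with `heart_disease` would apply the same check to the real data.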

Exploratory Data Analysis:

Check for Correlation

model.matrix(~0+., data=heart_disease) %>% 
  cor(use="pairwise.complete.obs") %>% 
  ggcorrplot(show.diag=FALSE, type="lower", lab=TRUE, lab_size=2)

The predictors are largely not correlated with one another as can be seen from the correlation plot.
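
The correlation code above first passes the data through `model.matrix(~0+., ...)`, which turns each categorical column into 0/1 indicator columns so that `cor` can work on numbers only. A small sketch of that encoding step on invented data (`toy` is hypothetical) shows which columns are produced:

```r
# Toy data with one numeric and one categorical column (invented values)
toy <- data.frame(
  Age = c(40, 49, 37, 48, 54, 39),
  Sex = c("M", "F", "M", "F", "M", "F")
)

# ~ 0 + . drops the intercept, so the first categorical column keeps
# one indicator column per level
encoded <- model.matrix(~ 0 + ., data = toy)
print(colnames(encoded))

# Pairwise correlations between the encoded columns
corr <- cor(encoded, use = "pairwise.complete.obs")
print(round(corr, 2))
```

Indicator columns that come from the same variable, such as SexF and SexM here, are negatively correlated by construction, so strong cells between them in the plot do not indicate a relationship between different predictors.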

Check for Null values

# Check for NA values
any(is.na(heart_disease))

## [1] FALSE

Visualize the na values

# use missmap function from the Amelia package to check for NA values
missmap(heart_disease, main = "Heart Data - Missing Values", col = c("yellow", "black"), legend = FALSE)

## Warning: Unknown or uninitialised column: `arguments`.
## Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `imputations`.

There are no NA values in the dataset
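
`any(is.na(...))` returns a single flag for the whole table. When the flag is TRUE, a per-column count is usually more helpful. A minimal sketch on invented data (`toy` is hypothetical):

```r
# Toy data with a few missing entries (invented values)
toy <- data.frame(
  Age         = c(40, NA, 37),
  Cholesterol = c(289, 180, NA),
  Sex         = c("M", "F", "M")
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```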

Bar Graph by Gender

#filter the dataset for only those with heart disease
heart_1 <- heart_disease %>% filter(HeartDisease == 1)
#filter the dataset for only those without heart disease
heart_0 <- heart_disease %>% filter(HeartDisease == 0)
# Bar Chart by Sex for the entire data set
p1 <- ggplot(heart_disease, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph by Sex - All") + ylab(NULL)
# Bar plot by Sex for only those with Heart Disease
p2 <- ggplot(heart_1, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph by Sex - Heart Disease") + ylab(NULL)
# Bar plot by Sex for only those with no heart disease
p3 <- ggplot(heart_0, aes(x =Sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph by Sex - No Heart Disease") + ylab(NULL)
# Bar plot of individuals who have heart disease by Sex
p4 <- heart_disease %>% mutate(heart_prob = ifelse(HeartDisease == 1, "Yes", "No")) %>%
  ggplot(aes(x = heart_prob, fill = Sex)) + geom_bar() + theme_bw() + ylab(NULL) +
  labs(title = "HeartDisease vs No HeartDisease")
# Plot all bar graphs in a grid
plot_grid(p1, p2, p3, p4)

Histograms to show the distributions of Age, Cholesterol, and RestingBP

# Histogram to show age distribution in the dataset
p5 <- heart_disease |> ggplot(aes(x = Age)) + geom_histogram(fill = "brown", binwidth = 2) + theme_bw() + 
  labs(title = "Distribution by Age") + ylab(NULL)
# Histogram of Cholesterol level
p6 <- ggplot(heart_disease, aes(x = Cholesterol)) + geom_histogram(binwidth = 12, fill = "brown") +
  labs(title = "Distribution of Cholesterol level") + ylab(NULL) + theme_bw()
# Histogram of RestingBP
p7 <- heart_disease |> ggplot(aes(x = RestingBP)) + geom_histogram(binwidth = 15, fill = "brown") +
  labs(title = "Distribution of RestingBP") + ylab(NULL) + theme_bw()
# Plot all the histograms in a grid
plot_grid(p5, p6, p7)

Scatter plot of RestingBP vs Cholesterol

# RestingBP vs Cholesterol
heart_disease |> ggplot(aes(x = Cholesterol, y = RestingBP, color = RestingECG)) + geom_point() +
  labs(title = "RestingBP vs Cholesterol") + theme_bw()

Box Plot of RestingBP for each ChestPainType

# Boxplot by ChestPainType
heart_disease |> ggplot() + geom_boxplot(aes(x = ChestPainType, y = RestingBP)) + 
  labs(title = "Box Plot of Resting BP vs ChestPainType") + theme_bw()

Train Test Split

Use the caTools package to split the dataset into training and testing datasets

# Set a seed
set.seed(1994)
#Split the sample
sample <- sample.split(heart_disease$HeartDisease, SplitRatio = 0.8) 
# Training Data
heart_train <- subset(heart_disease, sample == TRUE)
# Testing Data
heart_test <- subset(heart_disease, sample == FALSE)
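
`sample.split` is given the label column so that the split can keep the ratio of the two classes similar in both parts. The idea can be sketched in base R as below. This illustrates stratified sampling in general and is not the caTools implementation; the names `stratified_mask` and `toy_labels` are hypothetical.

```r
# Return a logical mask that marks roughly `ratio` of each class as TRUE
stratified_mask <- function(labels, ratio) {
  mask <- logical(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    n_train <- round(ratio * length(idx))
    # Sample positions within idx, then map back to row numbers
    chosen <- idx[sample.int(length(idx), n_train)]
    mask[chosen] <- TRUE
  }
  mask
}

set.seed(1994)
toy_labels <- rep(c(0, 1), times = c(40, 60))  # invented labels
mask <- stratified_mask(toy_labels, ratio = 0.8)

# Class proportions in the training portion match those of the full data
print(prop.table(table(toy_labels[mask])))
```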

Train the model

Train the model using a logistic model

# Train the model
heart_logistic_model <- glm(formula = HeartDisease ~ . , family = binomial(link = 'logit'), 
                            data = heart_train)

Get the summary of the model

# Get the summary of the logistic model
summary(heart_logistic_model)

## 
## Call:
## glm(formula = HeartDisease ~ ., family = binomial(link = "logit"), 
##     data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5574  -0.3803   0.1880   0.4854   2.4617  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.6741258  1.5794027  -1.060 0.289156    
## Age               0.0165346  0.0145153   1.139 0.254657    
## SexM              1.2972248  0.3093051   4.194 2.74e-05 ***
## ChestPainTypeATA -1.7950352  0.3586640  -5.005 5.59e-07 ***
## ChestPainTypeNAP -1.7578665  0.2888355  -6.086 1.16e-09 ***
## ChestPainTypeTA  -1.2334065  0.4718782  -2.614 0.008954 ** 
## RestingBP         0.0033623  0.0064796   0.519 0.603821    
## Cholesterol      -0.0035538  0.0011489  -3.093 0.001980 ** 
## FastingBS         1.2876688  0.3020131   4.264 2.01e-05 ***
## RestingECGNormal  0.0221109  0.3018480   0.073 0.941606    
## RestingECGST     -0.0787364  0.3897392  -0.202 0.839898    
## MaxHR            -0.0003464  0.0055610  -0.062 0.950334    
## ExerciseAnginaY   1.0607645  0.2745094   3.864 0.000111 ***
## Oldpeak           0.3001046  0.1344795   2.232 0.025641 *  
## ST_SlopeFlat      1.3126680  0.4668419   2.812 0.004926 ** 
## ST_SlopeUp       -1.0289223  0.4971077  -2.070 0.038469 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1009.24  on 733  degrees of freedom
## Residual deviance:  495.17  on 718  degrees of freedom
## AIC: 527.17
## 
## Number of Fisher Scoring iterations: 5
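
The estimates in the summary are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio. The sketch below copies four estimates from the summary above into a named vector (`log_odds` is a hypothetical name) and converts them:

```r
# Estimates copied from the model summary above (log-odds scale)
log_odds <- c(
  SexM            = 1.2972248,
  FastingBS       = 1.2876688,
  ExerciseAnginaY = 1.0607645,
  Cholesterol     = -0.0035538
)

# Exponentiating a logistic regression coefficient gives an odds ratio
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))
```

A value above 1 means higher odds of HeartDisease as the predictor increases, holding the other predictors fixed; a value below 1 means lower odds. With the fitted object in memory, `exp(coef(heart_logistic_model))` gives the same quantities for every term.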

Fit the model

Predict values using the model

fit_heart_probabilities <- predict(heart_logistic_model, newdata = heart_test, type = "response")

Convert the probabilities to class labels

# Classify probabilities greater than 0.5 as 1 and all others as 0
fit_heart_results <- ifelse(fit_heart_probabilities > 0.5, 1, 0)

Evaluate the model

Accuracy

# Mis-classification Error
misclassError <- mean(fit_heart_results != heart_test$HeartDisease)
accuracy = round((1 - misclassError), 4) * 100
paste0("The accuracy of the logistic regression model is ", accuracy, "%")

## [1] "The accuracy of the logistic regression model is 90.22%"

Confusion Matrix

print("-CONFUSION MATRIX-")

## [1] "-CONFUSION MATRIX-"

table(heart_test$HeartDisease, fit_heart_results > 0.5)

##    
##     FALSE TRUE
##   0    73    9
##   1     9   93
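
Accuracy alone does not show how the errors are divided between the two classes. As a sketch, the counts from the confusion matrix above can be turned into sensitivity, specificity, and precision (the object name `cm` is hypothetical):

```r
# Counts copied from the confusion matrix above: rows = actual, columns = predicted
cm <- matrix(c(73, 9, 9, 93), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tn <- cm["0", "0"]; fp <- cm["0", "1"]
fn <- cm["1", "0"]; tp <- cm["1", "1"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of actual cases that were flagged
specificity <- tn / (tn + fp)   # share of non-cases that were cleared
precision   <- tp / (tp + fp)   # share of flagged cases that were real

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

The accuracy computed this way, 166 correct out of 184, matches the 90.22% reported above.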

CONCLUSION

From the exploratory data analysis, males appear more likely to have heart disease than females, and in the logistic model summary SexM (male sex) is a significant predictor of HeartDisease. Hypothesis tests on the difference in means between individuals with and without heart disease show a significant difference in cholesterol level but no significant difference in RestingBP. The model summary agrees: Cholesterol is a significant predictor of HeartDisease (p = 0.002), while RestingBP is not (p = 0.604). Other significant predictors in the model summary include ChestPainType, ExerciseAngina, and FastingBS, as well as Oldpeak and ST_Slope at the 5% level.
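
The conclusion refers to hypothesis tests on the difference in group means. One common way to run such a comparison is a Welch two-sample t-test. The sketch below uses simulated values, not the Kaggle data; the data frame `sim` and its means and standard deviations are invented for illustration.

```r
set.seed(1)

# Simulated cholesterol values for two groups; NOT the Kaggle data
sim <- data.frame(
  HeartDisease = rep(c(0, 1), each = 50),
  Cholesterol  = c(rnorm(50, mean = 230, sd = 50),
                   rnorm(50, mean = 180, sd = 50))
)

# Welch two-sample t-test (unequal variances is the default in t.test)
test_result <- t.test(Cholesterol ~ HeartDisease, data = sim)
print(test_result$p.value)
```

Running the same call with `data = heart_disease`, and again with RestingBP on the left-hand side, would produce the two comparisons described above.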

2.0 Second Dataset - Death Event Prediction

This section analyzes whether a heart failure patient will have a death event (DEATH_EVENT).

Load the data from the GitHub repo

url <- "https://raw.githubusercontent.com/chinedu2301/data622-machine-learning-and-big-data/main/data/heart_failure_clinical_records_dataset.csv"
heart_failure <- read_csv(url)

Look at the head of the data

head(heart_failure, n = 10)

## # A tibble: 10 × 13
##      age anaemia creatin…¹ diabe…² eject…³ high_…⁴ plate…⁵ serum…⁶ serum…⁷   sex
##    <dbl>   <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>
##  1    75       0       582       0      20       1 265000      1.9     130     1
##  2    55       0      7861       0      38       0 263358.     1.1     136     1
##  3    65       0       146       0      20       0 162000      1.3     129     1
##  4    50       1       111       0      20       0 210000      1.9     137     1
##  5    65       1       160       1      20       0 327000      2.7     116     0
##  6    90       1        47       0      40       1 204000      2.1     132     1
##  7    75       1       246       0      15       0 127000      1.2     137     1
##  8    60       1       315       1      60       0 454000      1.1     131     1
##  9    65       0       157       0      65       0 263358.     1.5     138     0
## 10    80       1       123       0      35       1 388000      9.4     133     1
## # … with 3 more variables: smoking <dbl>, time <dbl>, DEATH_EVENT <dbl>, and
## #   abbreviated variable names ¹creatinine_phosphokinase, ²diabetes,
## #   ³ejection_fraction, ⁴high_blood_pressure, ⁵platelets, ⁶serum_creatinine,
## #   ⁷serum_sodium

Get a glimpse of the variables in the dataset.

# get a glimpse of the variables
glimpse(heart_failure)

## Rows: 299
## Columns: 13
## $ age                      <dbl> 75, 55, 65, 50, 65, 90, 75, 60, 65, 80, 75, 6…
## $ anaemia                  <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, …
## $ creatinine_phosphokinase <dbl> 582, 7861, 146, 111, 160, 47, 246, 315, 157, …
## $ diabetes                 <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ ejection_fraction        <dbl> 20, 38, 20, 20, 20, 40, 15, 60, 65, 35, 38, 2…
## $ high_blood_pressure      <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, …
## $ platelets                <dbl> 265000, 263358, 162000, 210000, 327000, 20400…
## $ serum_creatinine         <dbl> 1.90, 1.10, 1.30, 1.90, 2.70, 2.10, 1.20, 1.1…
## $ serum_sodium             <dbl> 130, 136, 129, 137, 116, 132, 137, 131, 138, …
## $ sex                      <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, …
## $ smoking                  <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, …
## $ time                     <dbl> 4, 6, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 11,…
## $ DEATH_EVENT              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …

There are 299 observations (rows) and 13 variables (columns) in this dataset. Twelve (12) of the 13 variables are potential predictors of the thirteenth, DEATH_EVENT, so the data is labelled.
Each observation records the characteristics of an individual, such as age, anaemia, and diabetes, and whether that individual had a death event (DEATH_EVENT).

Questions

This project aims to:

Explore the dataset to understand the data better.

Find out whether the columns of the data are correlated.

Predict whether an individual will have a death event or not using a logistic regression model in R.

Data Source

This dataset was downloaded from Kaggle and then uploaded to my github repository.

Response Variable (Dependent Variable)

The dependent variable is “DEATH_EVENT”, which is coded as 1 if the individual had a death event and as 0 otherwise. DEATH_EVENT is a two-level categorical variable.

DEATH_EVENT: output class [1: death event, 0: no death event]

Independent Variable (Explanatory or predictor variables)

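There are twelve (12) explanatory variables. Seven are numeric (age, creatinine_phosphokinase, ejection_fraction, platelets, serum_creatinine, serum_sodium, and time) and five are binary indicators stored as 0/1 (anaemia, diabetes, high_blood_pressure, sex, and smoking). The variables are: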

age: age of the patient [years]

anaemia: Decrease of red blood cells or hemoglobin [1: Yes, 0: No]

creatinine_phosphokinase: Level of the CPK enzyme in the blood [mcg/L]

diabetes: If the patient has diabetes [1: Yes, 0: No]

ejection_fraction: Percentage of blood leaving the heart at each contraction [percentage]

high_blood_pressure: If the patient has hypertension [1: Yes, 0: No]

platelets: Platelets in the blood [kiloplatelets/mL]

serum_creatinine: Level of serum creatinine in the blood [mg/dL]

serum_sodium: Level of serum sodium in the blood [mEq/L]

sex: Woman or man [1: Man, 0: Woman]

smoking: If the person smokes or not [1: Yes, 0: No]

time: Follow-up period [days]

DEATH_EVENT: Whether the individual had a death event or not [1: Yes, 0: No]. This is the response variable, listed here for completeness.

Relevant Summary statistics

Relevant statistics are:

Summary statistics of all variables

summary(heart_failure)

##       age           anaemia       creatinine_phosphokinase    diabetes     
##  Min.   :40.00   Min.   :0.0000   Min.   :  23.0           Min.   :0.0000  
##  1st Qu.:51.00   1st Qu.:0.0000   1st Qu.: 116.5           1st Qu.:0.0000  
##  Median :60.00   Median :0.0000   Median : 250.0           Median :0.0000  
##  Mean   :60.83   Mean   :0.4314   Mean   : 581.8           Mean   :0.4181  
##  3rd Qu.:70.00   3rd Qu.:1.0000   3rd Qu.: 582.0           3rd Qu.:1.0000  
##  Max.   :95.00   Max.   :1.0000   Max.   :7861.0           Max.   :1.0000  
##  ejection_fraction high_blood_pressure   platelets      serum_creatinine
##  Min.   :14.00     Min.   :0.0000      Min.   : 25100   Min.   :0.500   
##  1st Qu.:30.00     1st Qu.:0.0000      1st Qu.:212500   1st Qu.:0.900   
##  Median :38.00     Median :0.0000      Median :262000   Median :1.100   
##  Mean   :38.08     Mean   :0.3512      Mean   :263358   Mean   :1.394   
##  3rd Qu.:45.00     3rd Qu.:1.0000      3rd Qu.:303500   3rd Qu.:1.400   
##  Max.   :80.00     Max.   :1.0000      Max.   :850000   Max.   :9.400   
##   serum_sodium        sex            smoking            time      
##  Min.   :113.0   Min.   :0.0000   Min.   :0.0000   Min.   :  4.0  
##  1st Qu.:134.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 73.0  
##  Median :137.0   Median :1.0000   Median :0.0000   Median :115.0  
##  Mean   :136.6   Mean   :0.6488   Mean   :0.3211   Mean   :130.3  
##  3rd Qu.:140.0   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:203.0  
##  Max.   :148.0   Max.   :1.0000   Max.   :1.0000   Max.   :285.0  
##   DEATH_EVENT    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3211  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

From the summary statistics, the mean age of individuals in the dataset is about 61 (60.83) and the median age is 60. The mean of DEATH_EVENT is 0.3211, so about 32% of the individuals had a death event.

Exploratory Data Analysis:

Check for Correlation

heart_failure_correlation = cor(heart_failure, method = c("spearman"))
corrplot(heart_failure_correlation)

The predictors are largely not correlated with one another as can be seen from the correlation plot.

Check for Null values

# Check for NA values
any(is.na(heart_failure))

## [1] FALSE

Visualize the na values

# use missmap function from the Amelia package to check for NA values
missmap(heart_failure, main = "Heart Failure/Attack - Missing Values", col = c("yellow", "black"), legend = FALSE)

## Warning: Unknown or uninitialised column: `arguments`.
## Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `imputations`.

There are no NA values in the dataset

# Filter the dataset for only those with a death event
deathEvent_1 <- heart_failure %>% filter(DEATH_EVENT == 1)
# Filter the dataset for only those without a death event
deathEvent_0 <- heart_failure %>% filter(DEATH_EVENT == 0)
# Bar chart by sex for the entire data set
p1 <- ggplot(heart_failure, aes(x = sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph by Sex - All") + ylab(NULL)
# Bar plot by sex for only those with a death event
p2 <- ggplot(deathEvent_1, aes(x = sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph by Sex - DEATH_EVENT") + ylab(NULL)
# Bar plot by sex for only those without a death event
p3 <- ggplot(deathEvent_0, aes(x = sex)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph by Sex - No DEATH_EVENT") + ylab(NULL)
# Bar plot of death events by sex. sex is stored as a number (1: Man, 0: Woman),
# so it is converted to a factor for the fill aesthetic to split the bars
p4 <- heart_failure %>% mutate(heart_prob = ifelse(DEATH_EVENT == 1, "Yes", "No")) %>%
  ggplot(aes(x = heart_prob, fill = factor(sex))) + geom_bar() + theme_bw() + ylab(NULL) +
  labs(title = "DEATH_EVENT vs No DEATH_EVENT", fill = "sex")
# Plot all bar graphs in a grid
plot_grid(p1, p2, p3, p4)

# Bar Charts
p1 <- ggplot(deathEvent_1, aes(x =anaemia)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph - Anaemia") + ylab(NULL)
p2 <- ggplot(deathEvent_1, aes(x =high_blood_pressure)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph - HBP") + ylab(NULL)
p3 <- ggplot(deathEvent_1, aes(x =diabetes)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph - Diabetes") + ylab(NULL)
p4 <- ggplot(deathEvent_1, aes(x =smoking)) + geom_bar(fill = "brown") + theme_bw() +
  labs(title = "Bar Graph - Smoking") + ylab(NULL)
# Plot all bar graphs in a grid
plot_grid(p1, p2, p3, p4)

Histogram to show the distribution of ejection_fraction

# Histogram of ejection_fraction
ejection_fraction <- ggplot(heart_failure, aes(x = ejection_fraction)) + geom_histogram(binwidth = 12, fill = "brown") +
  labs(title = "Distribution of ejection_fraction") + ylab(NULL) + theme_bw()
ejection_fraction

Train Test Split

Use the caTools package to split the dataset into training and testing datasets

# Set a seed
set.seed(1994)
#Split the sample
sample <- sample.split(heart_failure$DEATH_EVENT, SplitRatio = 0.8) 
# Training Data
heart_failure_train <- subset(heart_failure, sample == TRUE)
# Testing Data
heart_failure_test <- subset(heart_failure, sample == FALSE)

Train the model

Train the model using a logistic model

# Train the model
deathEvent_logistic_model <- glm(formula = DEATH_EVENT ~ . , family = binomial(link = 'logit'), 
                            data = heart_failure_train)

Get the summary of the model

# Get the summary of the logistic model
summary(deathEvent_logistic_model)

## 
## Call:
## glm(formula = DEATH_EVENT ~ ., family = binomial(link = "logit"), 
##     data = heart_failure_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2584  -0.5229  -0.2296   0.4473   2.3042  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               7.355e+00  6.123e+00   1.201 0.229684    
## age                       5.523e-02  1.804e-02   3.062 0.002200 ** 
## anaemia                   3.646e-01  4.088e-01   0.892 0.372524    
## creatinine_phosphokinase  2.726e-04  1.952e-04   1.396 0.162567    
## diabetes                  1.631e-01  4.108e-01   0.397 0.691433    
## ejection_fraction        -7.595e-02  1.804e-02  -4.210 2.55e-05 ***
## high_blood_pressure      -8.921e-02  4.227e-01  -0.211 0.832845    
## platelets                -1.016e-06  2.200e-06  -0.462 0.644112    
## serum_creatinine          8.009e-01  2.084e-01   3.843 0.000121 ***
## serum_sodium             -5.531e-02  4.313e-02  -1.282 0.199689    
## sex                      -1.444e-01  4.706e-01  -0.307 0.758945    
## smoking                  -3.494e-01  4.619e-01  -0.756 0.449355    
## time                     -2.061e-02  3.333e-03  -6.183 6.28e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 300.42  on 238  degrees of freedom
## Residual deviance: 168.39  on 226  degrees of freedom
## AIC: 194.39
## 
## Number of Fisher Scoring iterations: 6
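
Each estimate in the summary is the change in log-odds for a one-unit change in the predictor, which is a small step for variables such as age or ejection_fraction. As a sketch, the estimates copied from the summary above can be rescaled to a 10-unit change before exponentiating (the names `per_unit` and `or_per_10` are hypothetical):

```r
# Estimates copied from the model summary above (log-odds per one unit)
per_unit <- c(
  age               = 5.523e-02,
  ejection_fraction = -7.595e-02
)

# Odds ratio for a 10-unit increase: multiply the log-odds by 10, then exponentiate
or_per_10 <- exp(10 * per_unit)
print(round(or_per_10, 3))
```

A value above 1 means the odds of DEATH_EVENT rise with the predictor, holding the others fixed; a value below 1 means they fall.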

Fit the model

Predict values using the model

fit_deathEvent_probabilities <- predict(deathEvent_logistic_model, newdata = heart_failure_test, type = "response")

Convert the probabilities to class labels

# Classify probabilities greater than 0.5 as 1 and all others as 0
fit_deathEvent_results <- ifelse(fit_deathEvent_probabilities > 0.5, 1, 0)

Evaluate the model

Accuracy

# Mis-classification Error
misclassError <- mean(fit_deathEvent_results != heart_failure_test$DEATH_EVENT)
accuracy = round((1 - misclassError), 4) * 100
paste0("The accuracy of the logistic regression model is ", accuracy, "%")

## [1] "The accuracy of the logistic regression model is 83.33%"

Confusion Matrix

print("-CONFUSION MATRIX-")

## [1] "-CONFUSION MATRIX-"

table(heart_failure_test$DEATH_EVENT, fit_deathEvent_results > 0.5)

##    
##     FALSE TRUE
##   0    39    2
##   1     8   11
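
Only about a third of the individuals in this dataset had a death event, so accuracy can look good even when many death events are missed. A short sketch using the counts from the confusion matrix above separates the two error types (variable names are hypothetical):

```r
# Counts copied from the confusion matrix above: rows = actual, columns = predicted
tn <- 39; fp <- 2
fn <- 8;  tp <- 11

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of actual death events that were predicted
specificity <- tn / (tn + fp)   # share of survivors that were predicted correctly

# Balanced accuracy averages the two rates so the larger class does not dominate
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy), 4))
```

The model predicts 11 of the 19 death events in the test set, so sensitivity is noticeably lower than the 83.33% accuracy.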

CONCLUSION

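From the exploratory data analysis, males appear to account for more death events than females. However, males make up about 65% of the dataset (the mean of sex is 0.6488), and sex is not a significant predictor in the logistic model (p = 0.759). The significant predictors in the model summary are time, ejection_fraction, serum_creatinine, and age.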

Chinedu Onyeka

March 1st, 2023
