Methods

`dplyr`

Install the dplyr package and load it into our library.

#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

`ggplot2` & more

#install.packages("ggplot2")
library(ggplot2)

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

library(Rmisc)

## Loading required package: lattice

## Loading required package: plyr

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(extrafont)

## Registering fonts with R

library(ggthemes)
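The startup messages above report that plyr was loaded after dplyr, so unqualified calls to `summarise`, `mutate`, `arrange` and the other listed names now resolve to plyr. One way to avoid surprises is to qualify a call with its package using `::`. The sketch below is only an illustration: the `df` data frame is made up, and the dplyr part runs only if dplyr is installed.

```r
# Masking only changes which function an unqualified name resolves to.
# A namespace-qualified call always reaches the intended package.

# stats::filter is masked by dplyr::filter, but is still reachable by name
x  <- c(1, 2, 3, 4, 5)
ma <- stats::filter(x, rep(1 / 3, 3))  # centered 3-point moving average
ma

# Same idea for the dplyr verbs that plyr masks (df is a made-up example)
if (requireNamespace("dplyr", quietly = TRUE)) {
  df <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 3))
  by_group <- dplyr::summarise(dplyr::group_by(df, g), mean_v = mean(v))
  print(by_group)
}
```

The message itself suggests the other option: load plyr (or packages that depend on it, such as Rmisc) before dplyr.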

Reading the Data

Our data: I found my dataset on Kaggle. You can access the data and more information about it here.

“This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.”

Content

Attribute Information:

1). age: age of the patient, in years
2). sex: (1 = male, 0 = female)
3). cp: chest pain type (4 values). Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic
4). trestbps: resting blood pressure
5). chol: serum cholesterol in mg/dl
6). fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
7). restecg: resting electrocardiographic results (values 0, 1, 2). Value 0: normal; Value 1: having ST-T wave abnormality; Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8). thalach: maximum heart rate achieved
9). exang: exercise induced angina (1 = yes, 0 = no)
10). oldpeak: ST depression induced by exercise relative to rest
11). slope: the slope of the peak exercise ST segment
12). ca: number of major vessels (0-3) colored by fluoroscopy
13). thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14). target: diagnosis of heart disease (Value 0: < 50% diameter narrowing, Value 1: > 50% diameter narrowing)

Read the data in using read.csv

heart <- read.csv("heart.csv")
heart
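Printing `heart` on its own dumps every row, so a quick shape check right after reading is often easier to scan. Here is a minimal sketch of that check. To keep it runnable without `heart.csv`, it writes a three-row excerpt (three columns only, values taken from the `str()` output in this document) to a temporary file. The `tmp` and `demo` names are just for this example.

```r
# Write a tiny CSV excerpt to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,sex,chol", "63,1,233", "37,1,250", "41,0,204"), tmp)

demo <- read.csv(tmp)

dim(demo)      # number of rows and columns
names(demo)    # column names as read from the header line
head(demo, 2)  # first two rows instead of the whole table
```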

Missing Values

# number of missing values in our dataset
sum(is.na(heart))

## [1] 0

Luckily we have no missing values, so we can continue our EDA without having to worry about NA values.
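`sum(is.na(heart))` gives a single total. If that total were not zero, a per-column count would show where the gaps are. Here is a small sketch on a made-up data frame (the `demo` values, including the missing ones, are invented for illustration):

```r
# Hypothetical data frame with two missing values, for illustration only
demo <- data.frame(
  age  = c(63, NA, 41),
  chol = c(233, 250, NA),
  sex  = c(1, 1, 0)
)

total_missing  <- sum(is.na(demo))           # one number for the whole table
missing_by_col <- colSums(is.na(demo))       # one count per column
complete_rows  <- sum(complete.cases(demo))  # rows with no missing value at all

total_missing
missing_by_col
complete_rows
```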

str(heart)

## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

We have 303 observations and 14 variables. All of the variables are integers except oldpeak, which is numeric.
Variables like sex, which should be categorical, are integers too. We could change them to categorical, however I think it is easier to just use as.factor() when graphing instead of changing the data.
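To make the trade-off concrete, here is a small sketch of both options on made-up rows (the `demo` data frame is illustrative; the 0/1 meaning follows the attribute list above). `as.factor()` keeps the raw codes as level names, while `factor()` with `labels` gives readable names that would also show up in plot legends.

```r
# Illustrative rows only; codes follow the attribute list (1 = male, 0 = female)
demo <- data.frame(sex = c(1, 1, 0, 1, 0), age = c(63, 37, 41, 56, 57))

# as.factor() keeps the raw codes as the level names
levels(as.factor(demo$sex))

# factor() with explicit labels gives readable level names instead
sex_f <- factor(demo$sex, levels = c(0, 1), labels = c("female", "male"))
table(sex_f)
```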

# class of our data
class(heart)

## [1] "data.frame"

# gives us the first 5 observations
head(heart, 5)

# gives us the last 5 observations
tail(heart, 5)

Plotting the Data

Before we take a look at the variables and the target, let's plot some data points against age to see if and how they are related.
Let's first take a look at age.

# age and cholesterol
g_age_chol <- ggplot(heart,aes(x=age,y=chol))+
    geom_point()+
    geom_smooth(method = "lm", se = FALSE)+
    scale_x_continuous(name="Age")+
    scale_y_continuous(name="Chol Level")+
    theme_economist_white(gray_bg = FALSE)+
    ggtitle("Age & Cholesterol")+
    theme(plot.title = element_text(hjust = 0.5))

# age and max heart rate
g_age_maxhr <- ggplot(heart,aes(x=age,y=thalach))+
    geom_point()+
    geom_smooth(method = "lm", se = FALSE)+
    scale_x_continuous(name="Age")+
    scale_y_continuous(name="Max heart rate")+
    theme_economist_white(gray_bg = FALSE)+
    ggtitle("Age & Max Heart Rate")+
    theme(plot.title = element_text(hjust = 0.5))

g_age_chol

g_age_maxhr

There is a positive correlation between age and cholesterol level.
After some research I found that, for cholesterol, “a reading of 240mg/dL and above is considered high”. We can see here that a large part of the population has a cholesterol level of over 240.

It looks like there is a negative correlation between age and max heart rate: the older someone gets, the lower their max heart rate tends to be. Makes sense.
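The scatter plots show the direction of each relationship. To put numbers on them, one option is `cor()` plus a simple share above the 240 mg/dL threshold. The sketch below uses made-up values, so the results say nothing about the real dataset. With the real data, the same calls would use `heart$age`, `heart$chol` and `heart$thalach`.

```r
# Made-up values for illustration only
demo <- data.frame(
  age     = c(35, 42, 50, 58, 63, 70),
  chol    = c(190, 250, 210, 260, 245, 280),
  thalach = c(185, 172, 165, 150, 148, 130)
)

# Pearson correlation: sign gives the direction, magnitude the strength
r_chol  <- cor(demo$age, demo$chol)
r_maxhr <- cor(demo$age, demo$thalach)

# Share of rows above the 240 mg/dL threshold
share_high <- mean(demo$chol > 240)

# Slope of the straight line that geom_smooth(method = "lm") fits
slope_maxhr <- coef(lm(thalach ~ age, data = demo))[["age"]]

c(r_chol = r_chol, r_maxhr = r_maxhr, share_high = share_high, slope_maxhr = slope_maxhr)
```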

Total Target

# count of patients without (target = 0) and with (target = 1) heart disease
ggplot(heart, aes(as.factor(target),fill=as.factor(target)))+
  geom_bar(stat="count")+
  guides(fill="none")+
  labs(x="Target", y = "count", caption = "    0 = no heart disease
    1 = heart disease")+
  theme_economist_white(gray_bg = FALSE)+
  theme(plot.caption = element_text(hjust = 0.5))+
  ggtitle("Total target")+
  theme(plot.title = element_text(hjust = 0.5))

Age

# quick summary for age statistics
summary(heart$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   47.50   55.00   54.37   61.00   77.00

# age by sex boxplot
g1 <- ggplot(heart, aes(x = as.factor(sex),y = age,fill=as.factor(sex)))+
  geom_boxplot() +
  theme_economist_white(gray_bg = FALSE)+
  labs(x="Sex", caption = "          0 = female
       1 = male", fill = "sex")+
  theme(plot.caption = element_text(hjust = 0.5))

# number of patients by sex bar graph
g2 <- ggplot(heart,aes(as.factor(sex), fill=as.factor(sex)))+
  geom_bar()+
  theme_economist_white(gray_bg = FALSE)+
  labs(x="sex",fill="Sex")

# age and target density
g3 <- ggplot(heart,aes(age,col=as.factor(target),fill=as.factor(target)))+
  geom_density(alpha=0.2)+
  theme_economist_white(gray_bg = FALSE)+
  guides(col="none")+
  labs(fill="Target",x="Age")

# age and target boxplot
g4 <- ggplot(heart,aes(x = as.factor(target),y =age,fill=as.factor(target)))+
  geom_boxplot()+
  theme_economist_white(gray_bg = FALSE)+
  labs(y="Age",x="Target",fill="Target")

grid.arrange(g2, g1, nrow = 1)

multiplot(g3, g4, cols = 2)

Resting blood pressure

# resting blood pressure and target density
g1 <- ggplot(heart, aes(trestbps, col=as.factor(target), fill=as.factor(target)))+
  geom_density(alpha = 0.2)+
  theme_economist_white(gray_bg = FALSE)+
  guides(col = "none")+
  labs(fill = "Target", x = "Resting Blood Pressure", y = "")

# resting blood pressure and target boxplot
g2 <- ggplot(heart,aes(as.factor(target),trestbps,fill=as.factor(target)))+
  geom_boxplot()+
  labs(y="Resting Blood Pressure",x="Target",fill="Target")+
  theme_economist_white(gray_bg = FALSE)

multiplot(g1, g2, cols = 2)

Resting blood pressure doesn’t seem to have much of an impact on target.
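One way to back up a visual impression like this is to compare a per-group summary against the spread within each group. Here is a minimal sketch with made-up values (the `demo` data frame is invented; with the real data the same calls would use `heart$trestbps` and `heart$target`):

```r
# Made-up values for illustration only
demo <- data.frame(
  target   = c(0, 0, 0, 0, 1, 1, 1, 1),
  trestbps = c(120, 130, 140, 150, 118, 130, 138, 146)
)

# Median and interquartile range per target group
group_median <- tapply(demo$trestbps, demo$target, median)
group_iqr    <- tapply(demo$trestbps, demo$target, IQR)

# Difference in medians, to be judged against the spread within each group
median_gap <- unname(group_median["1"] - group_median["0"])

group_median
group_iqr
median_gap
```

A gap that is small compared with the within-group IQR points the same way as heavily overlapping density curves.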

Chest pain type

# chest pain type bargraph
g1 <- ggplot(heart,aes(as.factor(cp),fill=as.factor(target)))+
  geom_bar(stat="count",position="fill")+
  theme_economist_white(gray_bg = FALSE)+
  labs(x="Chest Pain Type",fill="Target",y="proportion")


g1

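Chest pain type values from the attribute list: Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic. Note that str(heart) shows cp values starting at 0 in this file, so the codes on the x-axis may be shifted by one relative to these labels.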
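With `position="fill"` each bar is rescaled to a height of 1, so the plot shows the share of each target value within a chest pain type rather than raw counts. The same numbers can be read from a row-normalised table. Here is a small sketch on made-up codes (the `demo` data frame is invented for illustration):

```r
# Made-up codes for illustration only
demo <- data.frame(
  cp     = c(0, 0, 0, 0, 1, 1, 2, 2, 2, 3),
  target = c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1)
)

# Cross-tabulate chest pain type against target
counts <- table(cp = demo$cp, target = demo$target)

# margin = 1 normalises each row, so every chest pain type sums to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```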

Max heart rate

# quick summary of max heart rates
summary(heart$thalach)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.0   133.5   153.0   149.6   166.0   202.0

# max heart rate and target density
g1 <- ggplot(heart,aes(thalach,col=as.factor(target),fill=as.factor(target)))+
  geom_density(alpha=0.2)+
  guides(col="none")+
  labs(fill="Target",x="Maximum heart rate achieved")+
  theme_economist_white(gray_bg = FALSE)

# max heart rate and target boxplot
g2 <- ggplot(heart,aes(as.factor(target),thalach,fill=as.factor(target)))+
  geom_boxplot()+
  labs(y="Maximum Heart Rate Achieved",x="Target",fill="Target")+
  theme_economist_white(gray_bg = FALSE)


grid.arrange(g1, g2, nrow = 1)

We can see that patients with target = 1 clearly tend to reach a higher maximum heart rate than patients with target = 0.
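A difference that looks clear in a boxplot can also be checked with a two-sample test. The sketch below runs base R's Welch t-test on made-up values, so its output says nothing about the real data. With the real data, the call would be `t.test(thalach ~ factor(target), data = heart)`.

```r
# Made-up values for illustration only
demo <- data.frame(
  target  = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  thalach = c(120, 135, 142, 150, 128, 155, 162, 170, 148, 175)
)

# Welch two-sample t-test; groups are ordered by factor level (0, then 1)
tt <- t.test(thalach ~ factor(target), data = demo)

tt$estimate  # mean of each group
tt$conf.int  # confidence interval for mean(group 0) - mean(group 1)
tt$p.value
```

The t-test is just one possible choice here. A rank-based test such as `wilcox.test()` would be an alternative if the distributions look far from normal.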

EDA: Heart Disease

Alex Krasniewski

June 18th, 2019

Intro & Goals

Questions

Methods

`dplyr`

`ggplot2` & more

Reading the Data

Missing Values

Plotting the Data

Total Target

Age

Resting blood pressure

Chest pain type

Max heart rate

Results

Discussion
