This project follows up on Project 1. Using a new dataset, we will clean the data as we did in Project 1 and then carry out an exploratory data analysis, using R packages such as ggplot2 to visualize the data. The beginning of this project will be very similar to Project 1.
Some questions I wanted to answer are:
1). Which variable is most strongly related to heart disease?
2). Which variable is least related to heart disease?
Since the variable target is coded 0 = no heart disease; 1 = heart disease, we will concentrate on this variable. We want to see which variable has the greatest impact on whether target is 0 or 1.
dplyr
Install the dplyr package and load it into our library.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot2 & more
#install.packages("ggplot2")
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(extrafont)
## Registering fonts with R
library(ggthemes)
Our data: I found this dataset on Kaggle. You can access the data and more information about it here
“This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.”
Content
Attribute Information:
1). age: Age of the patient, in years
2). sex: (1= male, 0 = female)
3). cp: chest pain type (4 values). Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic
4). trestbps: resting blood pressure
5). chol: serum cholesterol in mg/dl
6). fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
7). restecg: resting electrocardiographic results (values 0, 1, 2). Value 0: normal; Value 1: having ST-T wave abnormality; Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8). thalach: maximum heart rate achieved
9). exang: exercise induced angina (1 = yes, 0 = no)
10). oldpeak: ST depression induced by exercise relative to rest
11). slope: the slope of the peak exercise ST segment
12). ca: number of major vessels (0-3) colored by fluoroscopy
13). thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14). target: Diagnosis of heart disease (Value 0: < 50% diameter narrowing, Value 1: > 50% diameter narrowing)
Read the data in using read.csv
heart <- read.csv("heart.csv")
heart
#number of missing values we have in our dataset
sum(is.na(heart))
## [1] 0
Luckily we have no missing values, so we can continue our EDA without having to worry about NA values.
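A single total hides where any missing values sit, so a per-column count can be a useful companion check. The sketch below uses only base R; the helper name `na_by_column` is my own and not part of any package. Note that it only detects values R already treats as NA, so placeholder codes (if the file uses any) would not be caught.

```r
# Count missing (NA) values separately for each column of a data frame.
# `df` is any data frame, for example the `heart` object read in above.
# `na_by_column` is a hypothetical helper name chosen for this sketch.
na_by_column <- function(df) {
  colSums(is.na(df))
}
```

Calling `na_by_column(heart)` returns one count per variable; with no missing values, every entry is 0.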
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
We have 303 observations and 14 variables. All of the variables are integers except oldpeak, which is numeric.
Variables like sex, which should be categorical, are integers too. We could convert them to factors; however, I think it is easier to just use as.factor() when graphing instead of changing the data.
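For comparison, here is a minimal sketch of the alternative: converting the integer-coded columns to factors in a copy, so the original data stay unchanged. The helper name `as_factor_cols` is hypothetical, and which columns count as categorical is a judgment call.

```r
# Return a copy of `df` in which the columns named in `cols` are factors.
# The data frame passed in is not modified.
# `as_factor_cols` is a hypothetical helper name chosen for this sketch.
as_factor_cols <- function(df, cols) {
  df[cols] <- lapply(df[cols], as.factor)
  df
}
```

For example, `heart_f <- as_factor_cols(heart, c("sex", "cp", "fbs", "exang", "target"))` would give a second data frame for plotting, while `heart` keeps its integer columns.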
# class of our data
class(heart)
## [1] "data.frame"
# gives us the first 5 observations
head(heart, 5)
# gives us the last 5 observations
tail(heart, 5)
Before we take a look at the variables and the target, let's plot some data points against age to see if and how they are related.
Let's first take a look at age.
# age and cholesterol
g_age_chol <- ggplot(heart,aes(x=age,y=chol))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)+
scale_x_continuous(name="Age")+
scale_y_continuous(name="Chol Level")+
theme_economist_white(gray_bg = FALSE)+
ggtitle("Age & Cholesterol")+
theme(plot.title = element_text(hjust = 0.5))
# age and max heart rate
g_age_maxhr <- ggplot(heart,aes(x=age,y=thalach))+
geom_point()+geom_smooth(method = "lm", se= FALSE)+
scale_x_continuous(name="Age")+
scale_y_continuous(name="Max heart rate")+
theme_economist_white(gray_bg = FALSE)+
ggtitle("Age & Max Heart Rate")+
theme(plot.title = element_text(hjust = 0.5))
g_age_chol
g_age_maxhr
There is a positive correlation between age and cholesterol level.
After some research I found that cholesterol levels with “a reading of 240mg/dL and above is considered high”. We can see here that a large share of the sample has a cholesterol level of over 240.
It looks like there is a negative correlation between age and max heart rate, so the older someone is, the lower their max heart rate tends to be. Makes sense.
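The scatter plots show direction but not strength, and the share of readings above 240 is hard to judge by eye. A small sketch of how both could be put into numbers with base R follows; the function name `cor_and_share` is my own, and the default cutoff of 240 is simply the threshold quoted above.

```r
# Pearson correlation between x and y, plus the share of y values above a cutoff.
# `cor_and_share` is a hypothetical helper name chosen for this sketch.
cor_and_share <- function(x, y, cutoff = 240) {
  list(
    correlation = cor(x, y),        # sign gives direction, size gives strength
    share_above = mean(y > cutoff)  # proportion of observations above the cutoff
  )
}
```

`cor_and_share(heart$age, heart$chol)` would report the age and cholesterol correlation together with the proportion of patients above 240 mg/dl, and `cor(heart$age, heart$thalach)` gives the corresponding figure for max heart rate.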
# total cases of heart disease (target = 1)
ggplot(heart, aes(as.factor(target),fill=as.factor(target)))+
geom_bar(stat="count")+
guides(fill="none")+
labs(x="Target", y = "count", caption = " 0 = no heart disease
1 = heart disease")+
theme_economist_white(gray_bg = FALSE)+
theme(plot.caption = element_text(hjust = 0.5))+
ggtitle("Total target")+
theme(plot.title = element_text(hjust = 0.5))
# quick summary for age statistics
summary(heart$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 47.50 55.00 54.37 61.00 77.00
# age by sex boxplot
g1 <- ggplot(heart, aes(x = as.factor(sex),y = age,fill=as.factor(sex)))+
geom_boxplot() +
theme_economist_white(gray_bg = FALSE)+
labs(x="Sex", caption = " 0 = female
1 = male", fill = "sex")+
theme(plot.caption = element_text(hjust = 0.5))
# sex bar graph
g2 <- ggplot(heart,aes(as.factor(sex), fill=as.factor(sex)))+
geom_bar()+
theme_economist_white(gray_bg = FALSE)+
labs(x="sex",fill="Sex")
# age and target density
g3 <- ggplot(heart,aes(age,col=as.factor(target),fill=as.factor(target)))+
geom_density(alpha=0.2)+
theme_economist_white(gray_bg = FALSE)+
guides(col="none")+
labs(fill="Target",x="Age")
# age and target boxplot
g4 <- ggplot(heart,aes(x = as.factor(target),y =age,fill=as.factor(target)))+
geom_boxplot()+
theme_economist_white(gray_bg = FALSE)+
labs(y="Age",x="Target",fill="Target")
grid.arrange(g2, g1, nrow = 1)
multiplot(g3, g4, cols = 2)
# resting blood pressure and target density
g1 <- ggplot(heart, aes(trestbps, col=as.factor(target), fill=as.factor(target)))+
geom_density(alpha = 0.2)+
theme_economist_white(gray_bg = FALSE)+
guides(col = "none")+
labs(fill = "Target", x = "Resting Blood Pressure", y = "")
g2 <- ggplot(heart,aes(as.factor(target),trestbps,fill=as.factor(target)))+
geom_boxplot()+
labs(y="Resting Blood Pressure",x="Target",fill="Target")+
theme_economist_white(gray_bg = FALSE)
multiplot(g1, g2, cols = 2)
Resting blood pressure doesn’t seem to have much of an impact on target.
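A visual impression like this can be checked with a simple two-group comparison. The sketch below runs a Welch two-sample t-test with base R's `t.test()`; the helper name `compare_groups` is hypothetical, and a t-test is only one of several reasonable choices here, not something the original analysis used.

```r
# Welch two-sample t-test of a numeric column between the two groups of `group`.
# `compare_groups` is a hypothetical helper name chosen for this sketch.
compare_groups <- function(df, value, group = "target") {
  groups <- split(df[[value]], df[[group]])  # one numeric vector per group
  stopifnot(length(groups) == 2)             # the test needs exactly two groups
  t.test(groups[[1]], groups[[2]])           # var.equal = FALSE by default (Welch)
}
```

`compare_groups(heart, "trestbps")` returns the two group means and a p-value; a small difference in means with a large p-value would be consistent with the plots above.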
# chest pain type bargraph
g1 <- ggplot(heart,aes(as.factor(cp),fill=as.factor(target)))+
geom_bar(stat="count",position="fill")+
theme_economist_white(gray_bg = FALSE)+
labs(x="Chest Pain Type",fill="Target",y="Proportion")
g1
Chest pain types: Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic
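The stacked bars can also be read off as a table of proportions. One caveat: the str() output above shows cp values that include 0, so the file appears to code chest pain type as 0 to 3 rather than 1 to 4, and the mapping to the labels listed here should be checked before interpreting individual bars. The helper name `prop_by_level` in this sketch is my own.

```r
# Share of each target value within every level of a categorical variable.
# Rows are the levels of `var`, columns are the values of `group`; each row sums to 1.
# `prop_by_level` is a hypothetical helper name chosen for this sketch.
prop_by_level <- function(df, var, group = "target") {
  prop.table(table(df[[var]], df[[group]]), margin = 1)
}
```

`prop_by_level(heart, "cp")` gives the same information as the plot, and `table(heart$cp)` shows which codes actually occur in the file.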
# quick summary of max heart rates
summary(heart$thalach)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.0 133.5 153.0 149.6 166.0 202.0
#max heart rate and target density
g1 <- ggplot(heart,aes(thalach,col=as.factor(target),fill=as.factor(target)))+
geom_density(alpha=0.2)+
guides(col="none")+
labs(fill="Target",x="Maximum heart rate achieved")+
theme_economist_white(gray_bg = FALSE)
# max heart rate and target boxplot
g2 <- ggplot(heart,aes(as.factor(target),thalach,fill=as.factor(target)))+
geom_boxplot()+
labs(y="Maximum Heart Rate Achieved",x="Target",fill="Target")+
theme_economist_white(gray_bg = FALSE)
grid.arrange(g1, g2, nrow = 1)
We can see that patients with target = 1 clearly tend to achieve a higher maximum heart rate.
We saw some interesting observations. It looks like there are more young people in this dataset with target = 1 than with target = 0. We would expect older people to have a higher rate of heart disease; however, that is not the case for this dataset.
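The group differences read from the density plots and boxplots (maximum heart rate and age by target) can be backed up with per-group summaries. This is a minimal base R sketch; the helper name `summarise_by_target` is hypothetical.

```r
# Group size, median and mean of a numeric column within each group of `group`.
# Returns a matrix with one column per group and rows "n", "median", "mean".
# `summarise_by_target` is a hypothetical helper name chosen for this sketch.
summarise_by_target <- function(df, value, group = "target") {
  sapply(
    split(df[[value]], df[[group]]),
    function(v) c(n = length(v), median = median(v), mean = mean(v))
  )
}
```

`summarise_by_target(heart, "age")` and `summarise_by_target(heart, "thalach")` put the two groups side by side so the size of each difference is visible.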
Since we did not explore all the variables in the EDA, it is not possible to conclude which variable has the strongest or weakest correlation with heart disease. We would have to explore this dataset further in order to draw further conclusions.
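As a starting point for that further exploration, one rough approach is to rank every numeric column by the absolute value of its correlation with target. This is only a sketch under strong assumptions: it treats category codes such as cp and thal as if they were numbers, a correlation with a 0/1 outcome is a crude measure, and none of it says anything about cause. The function name `rank_by_target_cor` is my own.

```r
# Correlation of each numeric column with the target, strongest first.
# `rank_by_target_cor` is a hypothetical helper name chosen for this sketch.
rank_by_target_cor <- function(df, target = "target") {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  predictors <- base::setdiff(numeric_cols, target)  # every numeric column except the target
  cors <- vapply(predictors,
                 function(v) cor(df[[v]], df[[target]]),
                 numeric(1))
  cors[order(abs(cors), decreasing = TRUE)]          # sort by strength, keep the sign
}
```

`rank_by_target_cor(heart)` returns a named vector: the first entries are candidates for the most related variable and the last for the least related, to be confirmed with plots like the ones above.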
This dataset is a small sample of the total population of individuals with heart disease, drawn from a very specific region.
You could further explore this dataset to find trends that help predict cardiovascular events, or to see if you can find any other clear indications of heart health.