This project follows up on Projects 1 and 2, and the beginning will be similar to both. We will clean a new dataset using the dplyr package, visualize the data with ggplot2 as we did in Project 2, and lastly decide whether or not to reject our null hypothesis.
dplyr: install the dplyr package and add it to our library.
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot2 & more
#install.packages("ggplot2")
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(extrafont)
## Registering fonts with R
library(ggthemes)
library(DataExplorer)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(dplyr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
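The startup messages above warn that plyr was loaded after dplyr (it comes in with Rmisc), so verbs such as mutate and summarise now point at plyr's versions. One possible way around this, sketched here under the assumption that dplyr is installed, is to call the dplyr functions with an explicit `dplyr::` prefix. The `toy` data frame is made up for illustration and is not the heart dataset.

```r
# Hypothetical toy data, not the heart dataset
toy <- data.frame(sex = c(0, 0, 1, 1), age = c(50, 60, 40, 80))

if (requireNamespace("dplyr", quietly = TRUE)) {
  # The dplyr:: prefix selects dplyr's verbs even when plyr masks them
  by_sex <- dplyr::summarise(dplyr::group_by(toy, sex), mean_age = mean(age))
  print(by_sex)
}
```

The other option, suggested by the warning itself, is to load plyr before dplyr.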
Our data: I found my dataset on Kaggle, where the data and more information about it are available.
“This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.”
Content
Attribute Information:
1). age: Age of the patient, in years
2). sex: (1 = male, 0 = female)
3). cp: chest pain type (4 values) Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic
4). trestbps: resting blood pressure in mm Hg
5). chol: serum cholesterol in mg/dl
6). fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
7). restecg: resting electrocardiographic results (values 0, 1, 2) Value 0: normal, Value 1: having ST-T wave abnormality, Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8). thalach: maximum heart rate achieved
9). exang: exercise induced angina (1 = yes, 0 = no)
10). oldpeak: ST depression induced by exercise relative to rest
11). slope: the slope of the peak exercise ST segment
12). ca: number of major vessels (0-3) colored by fluoroscopy
13). thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14). target: diagnosis of heart disease (Value 0: < 50% diameter narrowing, Value 1: > 50% diameter narrowing)
Null Hypothesis: Heart disease (target) is unaffected by any of these variables.
Alternative hypothesis: Heart disease (target) is affected by at least one of these variables.
Read the data in using read.csv
heart <- read.csv("heart.csv")
# show the data as a tibble (as_tibble() replaces the deprecated as.tbl())
as_tibble(heart)
## # A tibble: 303 x 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 63 1 3 145 233 1 0 150 0 2.3
## 2 37 1 2 130 250 0 1 187 0 3.5
## 3 41 0 1 130 204 0 0 172 0 1.4
## 4 56 1 1 120 236 0 1 178 0 0.8
## 5 57 0 0 120 354 0 1 163 1 0.6
## 6 57 1 0 140 192 0 1 148 0 0.4
## 7 56 0 1 140 294 0 0 153 0 1.3
## 8 44 1 1 120 263 0 1 173 0 0
## 9 52 1 2 172 199 1 1 162 0 0.5
## 10 57 1 2 150 168 0 1 174 0 1.6
## # … with 293 more rows, and 4 more variables: slope <int>, ca <int>,
## # thal <int>, target <int>
#number of missing values we have in our dataset
sum(is.na(heart))
## [1] 0
Luckily we have no missing values, so we can continue our EDA without having to worry about NA values.
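`sum(is.na(heart))` returns a single total, which is enough when the answer is zero. When it is not, a per-column breakdown is usually more helpful. Here is a minimal base R sketch on a made-up `toy` data frame (not the heart dataset) with two missing values added on purpose:

```r
# Hypothetical toy data with two missing values added on purpose
toy <- data.frame(age = c(63, NA, 41), chol = c(233, 250, NA), sex = c(1, 1, 0))

total_missing <- sum(is.na(toy))           # one number for the whole table
missing_by_column <- colSums(is.na(toy))   # one count per column
complete_rows <- sum(complete.cases(toy))  # rows with no NA at all

print(total_missing)
print(missing_by_column)
print(complete_rows)
```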
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
We have 303 observations and 14 variables.
# class of our data
class(heart)
## [1] "data.frame"
# gives us the first 5 observations
head(heart, 5)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## target
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
# gives us the last 5 observations
tail(heart, 5)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 299 57 0 0 140 241 0 1 123 1 0.2 1 0
## 300 45 1 3 110 264 0 1 132 0 1.2 1 0
## 301 68 1 0 144 193 1 1 141 0 3.4 1 2
## 302 57 1 0 130 131 0 1 115 1 1.2 1 1
## 303 57 0 1 130 236 0 0 174 0 0.0 1 1
## thal target
## 299 3 0
## 300 3 0
## 301 3 0
## 302 3 0
## 303 2 0
factorData <- copy(heart)
factorData$sex <- factor(heart$sex)
factorData$cp <- factor(heart$cp)
factorData$fbs <- factor(heart$fbs)
factorData$restecg <- factor(heart$restecg)
factorData$exang <- factor(heart$exang)
factorData$ca <- factor(heart$ca)
factorData$thal <- factor(heart$thal)
factorData$target <- factor(heart$target)
describe(factorData)
## vars n mean sd median trimmed mad min max range skew
## age 1 303 54.37 9.08 55.0 54.54 10.38 29 77.0 48.0 -0.20
## sex* 2 303 1.68 0.47 2.0 1.73 0.00 1 2.0 1.0 -0.78
## cp* 3 303 1.97 1.03 2.0 1.86 1.48 1 4.0 3.0 0.48
## trestbps 4 303 131.62 17.54 130.0 130.44 14.83 94 200.0 106.0 0.71
## chol 5 303 246.26 51.83 240.0 243.49 47.44 126 564.0 438.0 1.13
## fbs* 6 303 1.15 0.36 1.0 1.06 0.00 1 2.0 1.0 1.97
## restecg* 7 303 1.53 0.53 2.0 1.52 0.00 1 3.0 2.0 0.16
## thalach 8 303 149.65 22.91 153.0 150.98 22.24 71 202.0 131.0 -0.53
## exang* 9 303 1.33 0.47 1.0 1.28 0.00 1 2.0 1.0 0.74
## oldpeak 10 303 1.04 1.16 0.8 0.86 1.19 0 6.2 6.2 1.26
## slope 11 303 1.40 0.62 1.0 1.46 1.48 0 2.0 2.0 -0.50
## ca* 12 303 1.73 1.02 1.0 1.54 0.00 1 5.0 4.0 1.30
## thal* 13 303 3.31 0.61 3.0 3.36 0.00 1 4.0 3.0 -0.47
## target* 14 303 1.54 0.50 2.0 1.56 0.00 1 2.0 1.0 -0.18
## kurtosis se
## age -0.57 0.52
## sex* -1.39 0.03
## cp* -1.21 0.06
## trestbps 0.87 1.01
## chol 4.36 2.98
## fbs* 1.88 0.02
## restecg* -1.37 0.03
## thalach -0.10 1.32
## exang* -1.46 0.03
## oldpeak 1.50 0.07
## slope -0.65 0.04
## ca* 0.78 0.06
## thal* 0.25 0.04
## target* -1.97 0.03
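In the describe output above, the rows marked with an asterisk are the factor columns. Their statistics are computed on the factors' internal integer codes, which start at 1, so sex* shows a minimum of 1 and a maximum of 2 even though the raw values are 0 and 1. The sketch below uses a made-up `toy` data frame (not the heart dataset) to show those codes, and also one possible way to convert several columns to factors in a single step with base R:

```r
# Hypothetical toy data using the same 0/1 coding as the heart dataset
toy <- data.frame(age = c(63, 37, 41), sex = c(1, 1, 0), target = c(1, 0, 1))

# Convert several columns to factors in one step instead of one line each
categorical <- c("sex", "target")
toy[categorical] <- lapply(toy[categorical], factor)
str(toy)

# A factor stores 1-based integer codes, so the labels 0/1 become codes 1/2
sex_codes <- as.integer(toy$sex)
print(sex_codes)
print(mean(sex_codes))
```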
plot_histogram(heart)
plot_density(select(heart, c(age, trestbps, chol, thalach, oldpeak)))
plot_correlation(heart)
plot_correlation(factorData)
You can see that chest pain type, exercise induced angina, ST depression induced by exercise relative to rest, and max heart rate are the variables most highly correlated with the target.
It looks like fasting blood sugar and cholesterol are barely correlated with the target at all.
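Reading correlations off a heatmap can be backed up with numbers. As a rough numeric counterpart to the correlation plot, the sketch below takes the target column of a correlation matrix and ranks the other variables by absolute correlation. It runs on a made-up `toy` data frame, so the values say nothing about the real heart data.

```r
# Hypothetical toy data, not the heart dataset
toy <- data.frame(
  thalach = c(120, 130, 125, 160, 155, 170),
  fbs     = c(0, 1, 0, 0, 1, 0),
  target  = c(0, 0, 0, 1, 1, 1)
)

# Pearson correlation of every column with target (dropping target itself)
cor_with_target <- cor(toy)[, "target"]
cor_with_target <- cor_with_target[names(cor_with_target) != "target"]

# Sort by absolute value so the strongest relationships come first
ranked <- cor_with_target[order(abs(cor_with_target), decreasing = TRUE)]
print(round(ranked, 2))
```

On the real data the same two steps would be applied to `heart` instead of `toy`.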
# age and cholesterol
g_age_chol <- ggplot(heart,aes(x=age,y=chol))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)+
scale_x_continuous(name="Age")+
scale_y_continuous(name="Chol Level")+
theme_economist_white(gray_bg = FALSE)+
ggtitle("Age & Cholesterol")+
theme(plot.title = element_text(hjust = 0.5))
# age and max heart rate
g_age_maxhr <- ggplot(heart,aes(x=age,y=thalach))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)+
scale_x_continuous(name="Age")+
scale_y_continuous(name="Max heart rate")+
theme_economist_white(gray_bg = FALSE)+
ggtitle("Age & Max Heart Rate")+
theme(plot.title = element_text(hjust = 0.5))
g_age_chol
g_age_maxhr
There is a positive correlation between age and cholesterol level.
After some research I found that for cholesterol “a reading of 240mg/dL and above is considered high”. The median cholesterol level in our data is 240, so roughly half of the population is at or above that threshold.
It looks like there is a negative correlation between age and max heart rate, so the older someone gets the lower their max heart rate is. Makes sense.
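The trend lines in the two scatter plots can also be summarized numerically. The sketch below fits the same kind of least-squares line that geom_smooth(method = "lm") draws, reads off its slope and the correlation, and computes the share of rows at or above the 240 mg/dL threshold. The `toy` data frame is invented for illustration, so its numbers are not results for the heart dataset.

```r
# Hypothetical toy data, not the heart dataset
toy <- data.frame(
  age     = c(30, 40, 50, 60, 70),
  thalach = c(190, 175, 160, 150, 135),
  chol    = c(200, 230, 240, 260, 250)
)

# Slope of the least-squares line: change in max heart rate per year of age
fit <- lm(thalach ~ age, data = toy)
slope <- coef(fit)[["age"]]
r <- cor(toy$age, toy$thalach)

# Share of rows at or above the 240 mg/dL cholesterol threshold
share_high_chol <- mean(toy$chol >= 240)

print(slope)
print(r)
print(share_high_chol)
```

A negative slope and a negative correlation both correspond to the downward trend line in the age and max heart rate plot.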
# total cases with and without heart disease (target = 1 is heart disease)
ggplot(heart, aes(as.factor(target),fill=as.factor(target)))+
geom_bar(stat="count")+
guides(fill="none")+
labs(x="Target", y = "count", caption = "0 = no heart disease\n1 = heart disease")+
theme_economist_white(gray_bg = FALSE)+
theme(plot.caption = element_text(hjust = 0.5))+
ggtitle("Total target")+
theme(plot.title = element_text(hjust = 0.5))
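The bar chart shows the class balance visually; the exact counts and proportions behind it can be obtained with table and prop.table. A minimal sketch on a made-up `target` vector (not the real column):

```r
# Hypothetical toy target column, not the heart dataset
target <- c(1, 1, 0, 1, 0, 1, 1, 0)

counts <- table(target)                 # number of rows per class
shares <- round(prop.table(counts), 3)  # proportion of rows per class

print(counts)
print(shares)
```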
# quick summary for age statistics
summary(heart$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 47.50 55.00 54.37 61.00 77.00
# age by sex boxplot
g1 <- ggplot(heart, aes(x = as.factor(sex),y = age,fill=as.factor(sex)))+
geom_boxplot() +
theme_economist_white(gray_bg = FALSE)+
labs(x="Sex", caption = " 0 = female
1 = male", fill = "sex")+
theme(plot.caption = element_text(hjust = 0.5))
# sex bar graph (number of patients by sex)
g2 <- ggplot(heart,aes(as.factor(sex), fill=as.factor(sex)))+
geom_bar()+
theme_economist_white(gray_bg = FALSE)+
labs(x="sex",fill="Sex")
# age and target density
g3 <- ggplot(heart,aes(age,col=as.factor(target),fill=as.factor(target)))+
geom_density(alpha=0.2)+
theme_economist_white(gray_bg = FALSE)+
guides(colour="none")+
labs(fill="Target",x="Age")
# age and target boxplot
g4 <- ggplot(heart,aes(x = as.factor(target),y =age,fill=as.factor(target)))+
geom_boxplot()+
theme_economist_white(gray_bg = FALSE)+
labs(y="Age",x="Target",fill="Target")
grid.arrange(g2, g1, nrow = 1)
multiplot(g3, g4, cols = 2)
# resting blood pressure and target density
g1 <- ggplot(heart, aes(trestbps, col=as.factor(target), fill=as.factor(target)))+
geom_density(alpha = 0.2)+
theme_economist_white(gray_bg = FALSE)+
guides(colour = "none")+
labs(fill = "Target", x = "Resting Blood Pressure", y = "")
g2 <- ggplot(heart,aes(as.factor(target),trestbps,fill=as.factor(target)))+
geom_boxplot()+
labs(y="Resting Blood Pressure",x="Target",fill="Target")+
theme_economist_white(gray_bg = FALSE)
multiplot(g1, g2, cols = 2)
Resting blood pressure doesn’t seem to have much of an impact on target.
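A visual impression like this can be checked against per-group summary statistics. The sketch below computes the median and interquartile range of resting blood pressure for each target group using base R's aggregate. The `toy` data frame is made up, so the output only illustrates the shape of the result.

```r
# Hypothetical toy data, not the heart dataset
toy <- data.frame(
  trestbps = c(120, 130, 140, 125, 130, 135),
  target   = c(0, 0, 0, 1, 1, 1)
)

# Median and interquartile range of resting blood pressure per target group
bp_median <- aggregate(trestbps ~ target, data = toy, FUN = median)
bp_iqr    <- aggregate(trestbps ~ target, data = toy, FUN = IQR)

print(bp_median)
print(bp_iqr)
```

Similar medians with overlapping spreads would support the reading that the variable has little impact on the target.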
# quick summary of max heart rates
summary(heart$thalach)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.0 133.5 153.0 149.6 166.0 202.0
#max heart rate and target density
g1 <- ggplot(heart,aes(thalach,col=as.factor(target),fill=as.factor(target)))+
geom_density(alpha=0.2)+
guides(colour="none")+
labs(fill="Target",x="Maximum heart rate achieved")+
theme_economist_white(gray_bg = FALSE)
# max heart rate and target boxplot
g2 <- ggplot(heart,aes(as.factor(target),thalach,fill=as.factor(target)))+
geom_boxplot()+
labs(y="Maximum Heart Rate Achieved",x="Target",fill="Target")+
theme_economist_white(gray_bg = FALSE)
grid.arrange(g1, g2, nrow = 1)
We can see that patients with heart disease (target = 1) clearly tend to achieve a higher maximum heart rate.
The typical max heart rate achieved is much higher for target = 1 (heart disease) than for target = 0.
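To put a number on a difference like this, the group means can be computed with tapply and compared with a two-sample t-test. This is only a sketch on invented `toy` values. The choice of Welch's t-test (the default of t.test) is an assumption made here, not something the analysis above ran.

```r
# Hypothetical toy data, not the heart dataset
toy <- data.frame(
  thalach = c(120, 130, 125, 140, 135, 160, 155, 170, 150, 165),
  target  = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Mean max heart rate in each target group
group_means <- tapply(toy$thalach, toy$target, mean)
print(group_means)

# Welch two-sample t-test: is the difference in means larger than chance?
tt <- t.test(thalach ~ target, data = toy)
print(tt$p.value)
```

A small p-value would indicate that the difference between the two group means is unlikely to be due to chance alone.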
We can reject our null hypothesis: the plots and correlations show a clear relationship between the target and one or more of the variables in the dataset.
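The conclusion here rests on plots and correlations. One possible way to test the null hypothesis formally is to compare a logistic regression that uses the predictors against an intercept-only model with a likelihood-ratio test. The sketch below does this on simulated data, so the model choice, the two predictors and every number in it are assumptions for illustration, not results from the heart dataset.

```r
# Simulated toy data (assumption): target depends on thalach but not on chol
set.seed(1)
n <- 200
toy <- data.frame(
  thalach = rnorm(n, mean = 150, sd = 20),
  chol    = rnorm(n, mean = 245, sd = 50)
)
toy$target <- rbinom(n, size = 1, prob = plogis((toy$thalach - 150) / 15))

# Null model: target is unaffected by any variable (intercept only)
null_model <- glm(target ~ 1, data = toy, family = binomial)

# Alternative model: target depends on at least one predictor
full_model <- glm(target ~ thalach + chol, data = toy, family = binomial)

# Likelihood-ratio test comparing the two nested models
lrt <- anova(null_model, full_model, test = "Chisq")
p_value <- lrt[["Pr(>Chi)"]][2]
print(lrt)
```

A small p-value here would be evidence against the intercept-only model, which is the formal version of rejecting the null hypothesis stated earlier.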