Intro & Goals

This project follows up on Projects 1 and 2, and the first part will be similar to them. We will clean a new dataset using the dplyr package, then visualize the data as we did in Project 2 using the ggplot2 package. Lastly, we will accept or reject our null hypothesis.

Methods

dplyr

Install the dplyr package and add it to our library.

#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

ggplot2 & more

#install.packages("ggplot2")
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
library(Rmisc)
## Loading required package: lattice
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(extrafont)
## Registering fonts with R
library(ggthemes)
library(DataExplorer)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(dplyr)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
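The plyr warning above means that functions such as summarise and mutate now resolve to the plyr versions, which ignore dplyr groupings. One way to sidestep this, sketched below, is to call the dplyr versions with an explicit dplyr:: prefix. The data frame `toy` and its values are hypothetical and only there to make the snippet runnable.

```r
# Hypothetical four-row data frame, used only for illustration
toy <- data.frame(target = c(1, 1, 0, 0), age = c(63, 37, 57, 45))

# The dplyr:: prefix picks the dplyr functions even when plyr masks them
grouped <- dplyr::group_by(toy, target)
by_target <- dplyr::summarise(grouped, mean_age = mean(age))

print(by_target)
```

Loading plyr before dplyr, as the warning itself suggests, is the other option.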

Reading the Data

Our data: I found my dataset on Kaggle. You can access the data and more information about it here

“This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.”

Content

Attribute Information:

1). age: age of the patient, in years
2). sex: (1 = male, 0 = female)
3). cp: chest pain type (4 values)
    Value 1: typical angina
    Value 2: atypical angina
    Value 3: non-anginal pain
    Value 4: asymptomatic
4). trestbps: resting blood pressure
5). chol: serum cholesterol in mg/dl
6). fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
7). restecg: resting electrocardiographic results (values 0, 1, 2)
    Value 0: normal
    Value 1: having ST-T wave abnormality
    Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8). thalach: maximum heart rate achieved
9). exang: exercise induced angina (1 = yes, 0 = no)
10). oldpeak: ST depression induced by exercise relative to rest
11). slope: the slope of the peak exercise ST segment
12). ca: number of major vessels (0-3) colored by fluoroscopy
13). thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14). target: diagnosis of heart disease (Value 0: < 50% diameter narrowing, Value 1: > 50% diameter narrowing)
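The numeric codes listed above can be attached to the data as readable factor labels. This is only a minimal sketch: the data frame `toy` is hypothetical, and the labels are the ones given in the attribute list for sex and fbs.

```r
# Hypothetical toy columns using the 0/1 codes described above
toy <- data.frame(sex = c(1, 1, 0, 1, 0), fbs = c(1, 0, 0, 0, 0))

# Map each code to its label; levels and labels are matched by position
toy$sex <- factor(toy$sex, levels = c(0, 1), labels = c("female", "male"))
toy$fbs <- factor(toy$fbs, levels = c(0, 1), labels = c("false", "true"))

print(table(toy$sex))
```

Labelled factors make plot legends and tables easier to read than raw 0/1 codes.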

Hypothesis

Null Hypothesis: Heart disease (target) is unaffected by any of these variables.

Alternative hypothesis: Heart disease (target) is affected by at least one of these variables.

Read the data in using read.csv

heart <- read.csv("heart.csv")
# as_tibble() replaces the deprecated as.tbl()
as_tibble(heart)
## # A tibble: 303 x 14
##      age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak
##    <int> <int> <int>    <int> <int> <int>   <int>   <int> <int>   <dbl>
##  1    63     1     3      145   233     1       0     150     0     2.3
##  2    37     1     2      130   250     0       1     187     0     3.5
##  3    41     0     1      130   204     0       0     172     0     1.4
##  4    56     1     1      120   236     0       1     178     0     0.8
##  5    57     0     0      120   354     0       1     163     1     0.6
##  6    57     1     0      140   192     0       1     148     0     0.4
##  7    56     0     1      140   294     0       0     153     0     1.3
##  8    44     1     1      120   263     0       1     173     0     0  
##  9    52     1     2      172   199     1       1     162     0     0.5
## 10    57     1     2      150   168     0       1     174     0     1.6
## # … with 293 more rows, and 4 more variables: slope <int>, ca <int>,
## #   thal <int>, target <int>

Missing Values

#number of missing values we have in our dataset
sum(is.na(heart))
## [1] 0

Luckily we have no missing values, so we can continue our EDA without having to worry about NA values.
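If the total had not been zero, a per-column count would show where the gaps are. The sketch below uses a hypothetical data frame `toy` with a few NA values added on purpose. It is not our heart data.

```r
# Hypothetical data frame with deliberately missing entries
toy <- data.frame(
  age    = c(63, 37, NA, 56),
  chol   = c(233, NA, NA, 236),
  target = c(1, 1, 1, 1)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows with no NA in any column
complete_rows <- sum(complete.cases(toy))

print(na_per_column)
print(complete_rows)
```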

str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

We have 303 observations and 14 variables.

# class of our data
class(heart)
## [1] "data.frame"
# gives us the first 5 observations 
head(heart, 5)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2  37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3  41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4  56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5  57   0  0      120  354   0       1     163     1     0.6     2  0    2
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
# gives us the last 5 observations
tail(heart, 5)
##     age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 299  57   0  0      140  241   0       1     123     1     0.2     1  0
## 300  45   1  3      110  264   0       1     132     0     1.2     1  0
## 301  68   1  0      144  193   1       1     141     0     3.4     1  2
## 302  57   1  0      130  131   0       1     115     1     1.2     1  1
## 303  57   0  1      130  236   0       0     174     0     0.0     1  1
##     thal target
## 299    3      0
## 300    3      0
## 301    3      0
## 302    3      0
## 303    2      0
factorData <- copy(heart)
factorData$sex <- factor(heart$sex)
factorData$cp <- factor(heart$cp)
factorData$fbs <- factor(heart$fbs)
factorData$restecg <- factor(heart$restecg)
factorData$exang <- factor(heart$exang)
factorData$ca <- factor(heart$ca)
factorData$thal <- factor(heart$thal)
factorData$target <- factor(heart$target)

describe(factorData)
##          vars   n   mean    sd median trimmed   mad min   max range  skew
## age         1 303  54.37  9.08   55.0   54.54 10.38  29  77.0  48.0 -0.20
## sex*        2 303   1.68  0.47    2.0    1.73  0.00   1   2.0   1.0 -0.78
## cp*         3 303   1.97  1.03    2.0    1.86  1.48   1   4.0   3.0  0.48
## trestbps    4 303 131.62 17.54  130.0  130.44 14.83  94 200.0 106.0  0.71
## chol        5 303 246.26 51.83  240.0  243.49 47.44 126 564.0 438.0  1.13
## fbs*        6 303   1.15  0.36    1.0    1.06  0.00   1   2.0   1.0  1.97
## restecg*    7 303   1.53  0.53    2.0    1.52  0.00   1   3.0   2.0  0.16
## thalach     8 303 149.65 22.91  153.0  150.98 22.24  71 202.0 131.0 -0.53
## exang*      9 303   1.33  0.47    1.0    1.28  0.00   1   2.0   1.0  0.74
## oldpeak    10 303   1.04  1.16    0.8    0.86  1.19   0   6.2   6.2  1.26
## slope      11 303   1.40  0.62    1.0    1.46  1.48   0   2.0   2.0 -0.50
## ca*        12 303   1.73  1.02    1.0    1.54  0.00   1   5.0   4.0  1.30
## thal*      13 303   3.31  0.61    3.0    3.36  0.00   1   4.0   3.0 -0.47
## target*    14 303   1.54  0.50    2.0    1.56  0.00   1   2.0   1.0 -0.18
##          kurtosis   se
## age         -0.57 0.52
## sex*        -1.39 0.03
## cp*         -1.21 0.06
## trestbps     0.87 1.01
## chol         4.36 2.98
## fbs*         1.88 0.02
## restecg*    -1.37 0.03
## thalach     -0.10 1.32
## exang*      -1.46 0.03
## oldpeak      1.50 0.07
## slope       -0.65 0.04
## ca*          0.78 0.06
## thal*        0.25 0.04
## target*     -1.97 0.03
plot_histogram(heart)

plot_density(select(heart, c(age, trestbps, chol, thalach, oldpeak)))

plot_correlation(heart)

plot_correlation(factorData)

You can see that chest pain type, exercise induced angina, ST depression induced by exercise relative to rest, and max heart rate are the variables most strongly correlated with the target.

It looks like fasting blood sugar and cholesterol have almost no correlation with the target.
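The same ranking can be read off numerically instead of from the heatmap. The sketch below builds a hypothetical data frame `toy` from the ten rows printed by head() and tail() above, so the numbers it gives are not the ones for the full dataset. On the real data, `toy` would be replaced by `heart`.

```r
# Ten rows copied from the head()/tail() output; toy is a hypothetical name
toy <- data.frame(
  age     = c(63, 37, 41, 56, 57, 57, 45, 68, 57, 57),
  thalach = c(150, 187, 172, 178, 163, 123, 132, 141, 115, 174),
  exang   = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 0),
  oldpeak = c(2.3, 3.5, 1.4, 0.8, 0.6, 0.2, 1.2, 3.4, 1.2, 0.0),
  target  = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Pearson correlation of each predictor with the target
predictors <- c("age", "thalach", "exang", "oldpeak")
cors <- cor(toy[, predictors], toy$target)[, 1]

# Order by absolute strength, strongest first
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```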

More plotting

# age and cholesterol
g_age_chol <- ggplot(heart,aes(x=age,y=chol))+
    geom_point()+
    geom_smooth(method = "lm", se = FALSE)+
    scale_x_continuous(name="Age")+
    scale_y_continuous(name="Chol Level")+
    theme_economist_white(gray_bg = FALSE)+
    ggtitle("Age & Cholesterol")+
    theme(plot.title = element_text(hjust = 0.5))

# age and max heart rate
g_age_maxhr <- ggplot(heart,aes(x=age,y=thalach))+
    geom_point()+geom_smooth(method = "lm", se= FALSE)+
    scale_x_continuous(name="Age")+
    scale_y_continuous(name="Max heart rate")+
    theme_economist_white(gray_bg = FALSE)+
    ggtitle("Age & Max Heart Rate")+
    theme(plot.title = element_text(hjust = 0.5))

g_age_chol

g_age_maxhr

There is a positive correlation between age and cholesterol level.
After some research I found that, for cholesterol, “a reading of 240mg/dL and above is considered high”. The median cholesterol level in our data is 240, so roughly half of the patients are at or above that threshold.
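The share of patients at or above the 240 mg/dl threshold can be computed directly. This sketch uses a hypothetical vector `chol_toy` holding only the ten cholesterol values printed by head() and tail() above. On the full data it would be `heart$chol`.

```r
# Cholesterol values from the ten printed rows; chol_toy is a hypothetical name
chol_toy <- c(233, 250, 204, 236, 354, 241, 264, 193, 131, 236)

# The mean of a logical vector is the proportion of TRUE values
share_high <- mean(chol_toy >= 240)
print(share_high)
```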

It looks like there is a negative correlation between age and max heart rate: the older someone gets, the lower their max heart rate tends to be, which makes sense.
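The line drawn by geom_smooth(method = "lm") is an ordinary least squares fit, so its slope can also be pulled out with lm(). A minimal sketch follows, again on a hypothetical data frame `toy` made of the ten rows printed above, so the slope is only illustrative.

```r
# Age and max heart rate from the ten printed rows; toy is a hypothetical name
toy <- data.frame(
  age     = c(63, 37, 41, 56, 57, 57, 45, 68, 57, 57),
  thalach = c(150, 187, 172, 178, 163, 123, 132, 141, 115, 174)
)

# Fit thalach as a linear function of age
fit <- lm(thalach ~ age, data = toy)

# Slope: change in max heart rate per additional year of age
slope <- unname(coef(fit)["age"])
print(slope)
```

A negative slope corresponds to the downward trend seen in the plot.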

Total Target

# total cases of heart disease (target = 1)
ggplot(heart, aes(as.factor(target),fill=as.factor(target)))+
  geom_bar(stat="count")+
  guides(fill="none")+
  labs(x="Target", y = "count", caption = "    0 = no heart disease
    1 = heart disease")+
  theme_economist_white(gray_bg = FALSE)+
  theme(plot.caption = element_text(hjust = 0.5))+
  ggtitle("Total target")+
  theme(plot.title = element_text(hjust = 0.5))

Age

# quick summary for age statistics
summary(heart$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   47.50   55.00   54.37   61.00   77.00
# age by sex boxplot
g1 <- ggplot(heart, aes(x = as.factor(sex),y = age,fill=as.factor(sex)))+
  geom_boxplot() +
  theme_economist_white(gray_bg = FALSE)+
  labs(x="Sex", caption = "          0 = female
       1 = male", fill = "sex")+
  theme(plot.caption = element_text(hjust = 0.5))

# count of patients by sex bar graph
g2 <- ggplot(heart,aes(as.factor(sex), fill=as.factor(sex)))+
  geom_bar()+
  theme_economist_white(gray_bg = FALSE)+
  labs(x="sex",fill="Sex")

# age and target density
g3 <- ggplot(heart,aes(age,col=as.factor(target),fill=as.factor(target)))+
  geom_density(alpha=0.2)+
  theme_economist_white(gray_bg = FALSE)+
  guides(col="none")+
  labs(fill="Target",x="Age")

# age and target boxplot
g4 <- ggplot(heart,aes(x = as.factor(target),y =age,fill=as.factor(target)))+
  geom_boxplot()+
  theme_economist_white(gray_bg = FALSE)+
  labs(y="Age",x="Target",fill="Target")

grid.arrange(g2, g1, nrow = 1)

multiplot(g3, g4, cols = 2)

Resting blood pressure

# resting blood pressure and target density
g1 <- ggplot(heart, aes(trestbps, col=as.factor(target), fill=as.factor(target)))+
  geom_density(alpha = 0.2)+
  theme_economist_white(gray_bg = FALSE)+
  guides(col = "none")+
  labs(fill = "Target", x = "Resting Blood Pressure", y = "")

g2 <- ggplot(heart,aes(as.factor(target),trestbps,fill=as.factor(target)))+
  geom_boxplot()+
  labs(y="Resting Blood Pressure",x="Target",fill="Target")+
  theme_economist_white(gray_bg = FALSE)

multiplot(g1, g2, cols = 2)

Resting blood pressure doesn’t seem to have much of an impact on target.

Max heart rate

# quick summary of max heart rates
summary(heart$thalach)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.0   133.5   153.0   149.6   166.0   202.0
#max heart rate and target density
g1 <- ggplot(heart,aes(thalach,col=as.factor(target),fill=as.factor(target)))+
  geom_density(alpha=0.2)+
  guides(col="none")+
  labs(fill="Target",x="Maximum heart rate achieved")+
  theme_economist_white(gray_bg = FALSE)

# max heart rate and target boxplot
g2 <- ggplot(heart,aes(as.factor(target),thalach,fill=as.factor(target)))+
  geom_boxplot()+
  labs(y="Maximum Heart Rate Achieved",x="Target",fill="Target")+
  theme_economist_white(gray_bg = FALSE)


grid.arrange(g1, g2, nrow = 1)

We can see that patients with target = 1 clearly tend to reach a higher maximum heart rate.

In the boxplot, the typical max heart rate achieved is much higher for target = 1 (heart disease).
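The group difference can be checked with a quick per-group summary. The sketch below uses a hypothetical data frame `toy` built from the ten rows printed by head() and tail() above, so the two means are for those rows only, not for all 303 patients.

```r
# Max heart rate and target from the ten printed rows; toy is a hypothetical name
toy <- data.frame(
  thalach = c(150, 187, 172, 178, 163, 123, 132, 141, 115, 174),
  target  = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Mean and median max heart rate within each target group
mean_by_target   <- tapply(toy$thalach, toy$target, mean)
median_by_target <- tapply(toy$thalach, toy$target, median)

print(mean_by_target)
print(median_by_target)
```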

Results

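Based on the correlations and plots above, we can reject our null hypothesis. There is a clear relation between the target and one or more of the variables in the dataset, most visibly chest pain type, exercise induced angina, ST depression, and max heart rate.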
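The conclusion above rests on plots and correlations. A formal test could back it up, and the sketch below shows one possible way, not the method used in this project: a Welch t-test for a numeric variable and Fisher's exact test for a binary one. The data frame `toy` is hypothetical and holds only the ten rows printed by head() and tail() earlier, so its p-values are not meaningful. The calls would need to be run on `heart`.

```r
# Ten printed rows only; toy is a hypothetical stand-in for heart
toy <- data.frame(
  thalach = c(150, 187, 172, 178, 163, 123, 132, 141, 115, 174),
  exang   = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 0),
  target  = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Welch two-sample t-test: does mean max heart rate differ by target?
hr_test <- t.test(thalach ~ target, data = toy)

# Fisher's exact test: is exercise induced angina associated with target?
exang_test <- fisher.test(table(toy$exang, toy$target))

print(hr_test$p.value)
print(exang_test$p.value)
```

A small p-value on the full data would support rejecting the null hypothesis for that variable.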