This data has been taken from kaggle https://www.kaggle.com/ronitf/heart-disease-uci. This dataset is about heart disease and I want to study whether various parameters like cholestrol, heart rate, blood sugar etc. contribute to angina or not. Excercise/stress test help diagnose our heart condition. Also I want to see which sex group has more angina problems while exercising? Does age play a role in this heart disease? and how other below mentioned factors contribute to angina disease? Through various models I want to look how different parameters are impacting dependent variable (angina). In my data set binary dependent variable is “exang” i.e exercise induced angina (1 = yes; 0 = no) and independent variables are age, sex(1=male; 0=female), Serum Cholestrol in mg/dl(chol), fasting blood sugar >120 mg/dl, 1=true; 0=false (fbs), heart rate(thalach), resting blood pressure (trestbps). The generic function “na.fail” is used below to confirm NA/missing values in my data frame. Dimension of my dataset is 303x14 i.e 303 observations and 14 features. Through “str” I am looking at variables/features to make sure they are in numeric format.
Let’s load a file and take a look. I am also looking for NA values and looking at structure for numerice values.
library(readr)
library(tidyverse)
HeartDisease<-read_csv("/Users/kanwallatif/Documents/heart.csv", col_names = TRUE) # Importing dataset file.
na.fail(HeartDisease) #Confirming dataset for na values.
## # A tibble: 303 x 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 1 3 145 233 1 0 150 0 2.3
## 2 37 1 2 130 250 0 1 187 0 3.5
## 3 41 0 1 130 204 0 0 172 0 1.4
## 4 56 1 1 120 236 0 1 178 0 0.8
## 5 57 0 0 120 354 0 1 163 1 0.6
## 6 57 1 0 140 192 0 1 148 0 0.4
## 7 56 0 1 140 294 0 0 153 0 1.3
## 8 44 1 1 120 263 0 1 173 0 0
## 9 52 1 2 172 199 1 1 162 0 0.5
## 10 57 1 2 150 168 0 1 174 0 1.6
## # … with 293 more rows, and 4 more variables: slope <dbl>, ca <dbl>,
## # thal <dbl>, target <dbl>
str(HeartDisease) #Looking at the structure of dataset.
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : num 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : num 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: num 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : num 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : num 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : num 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : num 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : num 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : num 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : num 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : num 1 2 2 2 2 1 2 3 3 2 ...
## $ target : num 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_double(),
## .. cp = col_double(),
## .. trestbps = col_double(),
## .. chol = col_double(),
## .. fbs = col_double(),
## .. restecg = col_double(),
## .. thalach = col_double(),
## .. exang = col_double(),
## .. oldpeak = col_double(),
## .. slope = col_double(),
## .. ca = col_double(),
## .. thal = col_double(),
## .. target = col_double()
## .. )
HeartDisease$sex=as.factor(HeartDisease$sex) #Storing sex as a categorical variable.
Since my data is clean I am moving to my regression models.
Let’s try and predict exercise induced angina with adding various parameters through simple regression to see what they mean.
#With age independent variable.
m1<- glm(exang~age, family="binomial", data=HeartDisease)
summary(m1)
##
## Call:
## glm(formula = exang ~ age, family = "binomial", data = HeartDisease)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0812 -0.9178 -0.8176 1.4303 1.7010
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.98760 0.76812 -2.588 0.00966 **
## age 0.02312 0.01378 1.678 0.09334 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 382.90 on 302 degrees of freedom
## Residual deviance: 380.03 on 301 degrees of freedom
## AIC: 384.03
##
## Number of Fisher Scoring iterations: 4
#Adding sex with age.
m2<- glm(exang~age+sex, family="binomial", data=HeartDisease)
summary(m2)
##
## Call:
## glm(formula = exang ~ age + sex, family = "binomial", data = HeartDisease)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1422 -0.9378 -0.7725 1.3333 1.9046
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76690 0.84141 -3.288 0.00101 **
## age 0.02759 0.01413 1.952 0.05096 .
## sex1 0.75242 0.28635 2.628 0.00860 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 382.90 on 302 degrees of freedom
## Residual deviance: 372.69 on 300 degrees of freedom
## AIC: 378.69
##
## Number of Fisher Scoring iterations: 4
#Adding cholestrol, blood pressure with age and sex.
m3<- glm(exang~age+sex+chol+trestbps, family="binomial", data=HeartDisease)
summary(m3)
##
## Call:
## glm(formula = exang ~ age + sex + chol + trestbps, family = "binomial",
## data = HeartDisease)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2152 -0.9379 -0.7530 1.3276 1.9301
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.088387 1.263200 -3.237 0.00121 **
## age 0.021358 0.014923 1.431 0.15237
## sex1 0.844566 0.296377 2.850 0.00438 **
## chol 0.003403 0.002511 1.355 0.17531
## trestbps 0.005702 0.007373 0.773 0.43928
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 382.90 on 302 degrees of freedom
## Residual deviance: 370.15 on 298 degrees of freedom
## AIC: 380.15
##
## Number of Fisher Scoring iterations: 4
Now I am including interaction terms in to my model and take a look.
#Adding age, sex and heart rate, and interaction between blood pressure and cholestrol.
m4<- glm(exang~age+sex+thalach+trestbps*chol, family="binomial", data=HeartDisease)
summary(m4)
##
## Call:
## glm(formula = exang ~ age + sex + thalach + trestbps * chol,
## family = "binomial", data = HeartDisease)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1657 -0.7936 -0.5893 0.9022 2.2795
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.9533966 5.4886451 1.449 0.1473
## age -0.0186058 0.0175700 -1.059 0.2896
## sex1 0.8163084 0.3187588 2.561 0.0104 *
## thalach -0.0408603 0.0068971 -5.924 3.14e-09 ***
## trestbps -0.0268643 0.0423369 -0.635 0.5257
## chol -0.0133621 0.0213271 -0.627 0.5310
## trestbps:chol 0.0001411 0.0001615 0.874 0.3822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 382.90 on 302 degrees of freedom
## Residual deviance: 326.62 on 296 degrees of freedom
## AIC: 340.62
##
## Number of Fisher Scoring iterations: 4
From above models I can see that on average increase in age by 1 year the log odds of exercise induced angina increases by 0.02313. When factor sex is added comparing to females on average males have higher log odds of exercise induced angina by 0.75. Sex plays a significant role. An increase in cholestrol and heart rate by 1 unit there is an increase in log odds of excercise induced angina by 0.003403 and 0.005702 respectively. In my last model it evaluates the interaction between blood pressure and cholestrol on the log odds of excercise induced angina. This model is hard to interpret but I can see not much increase in log odds of angina, only increased by 0001411. I assume pateints are on medicines and in better condition.
library(texreg)
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
##
## extract
screenreg(list(m1, m2, m3, m4), doctype = FALSE)
##
## ===============================================================
## Model 1 Model 2 Model 3 Model 4
## ---------------------------------------------------------------
## (Intercept) -1.99 ** -2.77 ** -4.09 ** 7.95
## (0.77) (0.84) (1.26) (5.49)
## age 0.02 0.03 0.02 -0.02
## (0.01) (0.01) (0.01) (0.02)
## sex1 0.75 ** 0.84 ** 0.82 *
## (0.29) (0.30) (0.32)
## chol 0.00 -0.01
## (0.00) (0.02)
## trestbps 0.01 -0.03
## (0.01) (0.04)
## thalach -0.04 ***
## (0.01)
## trestbps:chol 0.00
## (0.00)
## ---------------------------------------------------------------
## AIC 384.03 378.69 380.15 340.62
## BIC 391.46 389.83 398.72 366.62
## Log Likelihood -190.02 -186.34 -185.07 -163.31
## Deviance 380.03 372.69 370.15 326.62
## Num. obs. 303 303 303 303
## ===============================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Looking at both AIC and BIC, I can see that my models get better with complexity. Lowest values of BIC and AIC is of model 4. The best performing model based on both of these metrics is model 4 which includes more features.
anova(m1, m2, m3, m4, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: exang ~ age
## Model 2: exang ~ age + sex
## Model 3: exang ~ age + sex + chol + trestbps
## Model 4: exang ~ age + sex + thalach + trestbps * chol
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 301 380.03
## 2 300 372.69 1 7.344 0.006729 **
## 3 298 370.15 2 2.537 0.281243
## 4 296 326.62 2 43.528 3.532e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a significant difference between model 4 and other models. This means Model 4 which includes inetrection between blood pressure and cholestrol fits the data signicantly better than other models. Deviance is a measure of error, lower deviance means better fit to data. In my analysis of deviance, I can see that model 4 is the best based on deviance as it has the lowest compared to the other models. It is statistically significance at a p-value of 0.001.
library(visreg)
visreg(m1, "age", scale = "response", line=list(col="orange"),fill=list(col="brown"),
xlab="age") + theme_bw()
## NULL
visreg(m3, "trestbps", scale = "response", line=list(col="orange"),fill=list(col="brown"),
xlab="trestbps") + theme_bw()
## NULL
visreg(m4, "chol", scale = "response" , line=list(col="orange"),fill=list(col="brown"),
xlab="chol") + theme_bw()
## Warning: Note that you are attempting to plot a 'main effect' in a model that contains an
## interaction. This is potentially misleading; you may wish to consider using the 'by'
## argument.
## Conditions used in construction of plot
## age: 55
## sex: 1
## thalach: 153
## trestbps: 130
## NULL
Looking at above visuals with excercise induced angina as a response, mentioned factors like age, blood pressure, cholestrol etc. they all contribute to a heart disease. An increase in these factors will increas the probality of having angina.
visreg(m2, "sex", by = "age", scale = "response",line=list(col="darkgoldenrod1"),
fill=list(col="firebrick4"), xlab="sex")
visreg(m2, "age", by = "sex", scale = "response", ,line=list(col="darkgoldenrod1"),
fill=list(col="firebrick4"), xlab="sex")
In the above model I can see that as age increases males have higher angina problem as compared to females. In all age groups males have higher probability of having heart disease.
From my above analysis I can tell that age, cholestrol, blood pressure have an imapct on our health and have heart disease like angina. More interestingly, males have the high probability of angina as compared to females.Though females have the probability of having anigina but have less probability as compared to males. As per visuals males have steeper and higher probability of having angina. A larger value in above selected variables will lead to greater log odds and a greater probability of having anigina.