library(MASS)
library(caret)
library(data.table)
This document reviews the heart dataset and fits a logistic regression to it. As this is an explanatory document, the code is interspersed with the findings; in a formal report, the code would be relegated to a code appendix.
Before we read the data, it is wise to review the meaning of the variables, which is documented at the data's source. From this we see that the following variables are actually categorical: sex, cp, fbs, restecg, exang, slope, and thal. While ca is technically an integer variable, it too seems restricted to a handful of values, so we will treat it as a factor as well.
heart <- fread('./Basic Data/heart.csv',
colClasses = c('integer', rep('factor', 2), rep('integer', 2),
rep('factor', 2), 'integer', 'factor', 'double',
rep('factor', 3), 'integer'))
nobs <- dim(heart)[[1]]
nvars <- dim(heart)[[2]] - 1
The dataset comprises 303 observations of 13 features and one target variable that takes only the values 0 and 1. This makes it a prime candidate for logistic regression.
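As a quick sanity check, we can tabulate the target. Combining the partition tables shown later, the counts work out to 138 zeros and 165 ones, so the classes are reasonably balanced.
table(heart$target)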
Unfortunately, the values for thal do not match those documented at the data's source. The source has “thal: 3 = normal; 6 = fixed defect; 7 = reversable defect”. The data set only contains the following values:
unique(heart$thal)
## [1] 1 2 3 0
## Levels: 0 1 2 3
Even worse, there are only three values for thal listed in the data source yet four in the data. Therefore, for the time being, we will choose not to include thal as we are concerned about its provenance and accuracy.
heart[, thal := NULL]
anyNA(heart)
## [1] FALSE
There are no NAs, so thankfully we will not need to impute missing values.
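Had there been missing values, a per-column count would show where imputation was needed; for completeness:
colSums(is.na(heart))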
While we will most likely center and scale the non-factor variables, it is instructive to look at their ranges on the original scale.
boxplot(data.frame(age = heart$age, trestbps = heart$trestbps,
chol = heart$chol, thalach = heart$thalach,
oldpeak = heart$oldpeak))
As these variables span quite different ranges, the plots are clearer when each is drawn separately with its own axis.
par(mfrow = c(3, 2))
boxplot(heart$age, main = "age")
boxplot(heart$trestbps, main = "trestbps")
boxplot(heart$chol, main = "chol")
boxplot(heart$thalach, main = "thalach")
boxplot(heart$oldpeak, main = "oldpeak")
par(mfrow = c(1, 1))
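Since we anticipate centering and scaling the non-factor variables, here is a minimal sketch using caret's preProcess. This is illustrative only; in a modeling pipeline the transformation would be estimated on the training rows alone and then applied to both sets.
numVars <- c('age', 'trestbps', 'chol', 'thalach', 'oldpeak')
# Estimate centering/scaling parameters, then apply them
pp <- preProcess(as.data.frame(heart)[, numVars], method = c('center', 'scale'))
heartScaled <- predict(pp, as.data.frame(heart)[, numVars])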
We will use the caret package to fit a logistic regression. While there are many flavors of logistic regression, we will stick with the simple version.
Note that in caret, simple logistic regression is handled as in R, via a call to glm with family = binomial(link = "logit").
We will first convert the factors to dummy variables. Setting fullRank = TRUE drops one level per factor, which avoids perfect collinearity with the intercept.
heartDum <- dummyVars(target ~ ., data = heart, fullRank = TRUE)
dheart <- predict(heartDum, heart)
The next step is to break the data into 75% training and 25% testing sets. The value of using the createDataPartition function from the caret package, as opposed to a plain call to sample, is that the former attempts to create sets whose distributions of the target variable match that of the full data.
set.seed(17895)
trainID <- createDataPartition(heart$target, p = .75, list = FALSE)
trnX <- dheart[trainID, ]
trnY <- factor(heart$target[trainID], labels = c("Absent", "Present"))
tstX <- dheart[-trainID, ]
tstY <- factor(heart$target[-trainID], labels = c("Absent", "Present"))
table(trnY)
## trnY
## Absent Present
## 101 127
table(tstY)
## tstY
## Absent Present
## 37 38
Note that the class proportions are similar in both sets, as intended.
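For contrast, a plain random split would look like the following; nothing in it forces the class proportions of the two pieces to resemble those of the full data.
# A plain random split; unlike createDataPartition, this does not
# stratify on the target, so the class balance can drift by chance
plainID <- sample(nobs, size = floor(0.75 * nobs))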
Finally, we will use 10-fold cross validation to select the “best” logistic regression. Since this is a binary classification problem, we will use the F measure as our metric, as it combines both precision and recall, which are often better indicators than accuracy. We will compare this to the simple, saturated logistic regression which would have been found via a direct call to glm.
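For reference, the F measure is the weighted harmonic mean of precision and recall; a small helper makes the relationship explicit. With beta = 1 this reduces to the F1 score, which is what caret's prSummary reports.
# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1
fMeasure <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}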
trC <- trainControl(method = "cv", number = 10,
summaryFunction = prSummary, classProbs = TRUE)
At this point we can call caret to do its job.
caretModel <- train(x = trnX, y = trnY, method = 'glm',
family = binomial(link = "logit"),
trControl = trC, metric = 'F')
combTrain <- data.frame(target = trnY, trnX)
simpleModel <- glm(target ~ ., family = binomial(link = "logit"),
data = combTrain)
caretModel
## Generalized Linear Model
##
## 228 samples
## 19 predictor
## 2 classes: 'Absent', 'Present'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 205, 205, 205, 205, 205, 205, ...
## Resampling results:
##
## AUC Precision Recall F
## 0.7683322 0.7809163 0.7736364 0.7719735
summary(caretModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6841 -0.3784 0.1447 0.4541 2.5976
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.998059 3.216067 1.243 0.213812
## age 0.041920 0.028996 1.446 0.148260
## sex.1 -2.329490 0.626832 -3.716 0.000202 ***
## cp.1 1.651003 0.682807 2.418 0.015608 *
## cp.2 2.089151 0.580165 3.601 0.000317 ***
## cp.3 1.856511 0.756799 2.453 0.014163 *
## trestbps -0.035653 0.012785 -2.789 0.005294 **
## chol -0.008570 0.004811 -1.781 0.074852 .
## fbs.1 0.595442 0.638497 0.933 0.351043
## restecg.1 0.374036 0.438307 0.853 0.393458
## restecg.2 -0.282382 4.109558 -0.069 0.945218
## thalach 0.018810 0.012913 1.457 0.145198
## exang.1 -0.743068 0.489365 -1.518 0.128905
## oldpeak -0.452165 0.264570 -1.709 0.087440 .
## slope.1 -0.564704 1.197147 -0.472 0.637135
## slope.2 0.546457 1.247844 0.438 0.661444
## ca.1 -1.823620 0.570838 -3.195 0.001400 **
## ca.2 -3.505793 0.902289 -3.885 0.000102 ***
## ca.3 -2.230546 0.953211 -2.340 0.019282 *
## ca.4 0.199652 1.689932 0.118 0.905955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 313.10 on 227 degrees of freedom
## Residual deviance: 150.33 on 208 degrees of freedom
## AIC: 190.33
##
## Number of Fisher Scoring iterations: 6
summary(simpleModel)
##
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"),
## data = combTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6841 -0.3784 0.1447 0.4541 2.5976
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.998059 3.216067 1.243 0.213812
## age 0.041920 0.028996 1.446 0.148260
## sex.1 -2.329490 0.626832 -3.716 0.000202 ***
## cp.1 1.651003 0.682807 2.418 0.015608 *
## cp.2 2.089151 0.580165 3.601 0.000317 ***
## cp.3 1.856511 0.756799 2.453 0.014163 *
## trestbps -0.035653 0.012785 -2.789 0.005294 **
## chol -0.008570 0.004811 -1.781 0.074852 .
## fbs.1 0.595442 0.638497 0.933 0.351043
## restecg.1 0.374036 0.438307 0.853 0.393458
## restecg.2 -0.282382 4.109558 -0.069 0.945218
## thalach 0.018810 0.012913 1.457 0.145198
## exang.1 -0.743068 0.489365 -1.518 0.128905
## oldpeak -0.452165 0.264570 -1.709 0.087440 .
## slope.1 -0.564704 1.197147 -0.472 0.637135
## slope.2 0.546457 1.247844 0.438 0.661444
## ca.1 -1.823620 0.570838 -3.195 0.001400 **
## ca.2 -3.505793 0.902289 -3.885 0.000102 ***
## ca.3 -2.230546 0.953211 -2.340 0.019282 *
## ca.4 0.199652 1.689932 0.118 0.905955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 313.10 on 227 degrees of freedom
## Residual deviance: 150.33 on 208 degrees of freedom
## AIC: 190.33
##
## Number of Fisher Scoring iterations: 6
Note that the results are identical. While this may seem surprising, it is not: simple logistic regression has no tuning parameters, so caret has nothing to tune and simply fits the same glm.
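As a sanity check, the fitted glm that caret stores in its finalModel slot should match the direct fit coefficient for coefficient:
all.equal(coef(caretModel$finalModel), coef(simpleModel))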
Had we used a logistic-regression-based tree or a stepwise regression, the results would almost surely have differed.
As the two models are identical, they will have the same accuracy scores, but we will show them anyway.
caretP <- predict(caretModel, newdata = tstX)
simpleP <- predict(simpleModel, newdata = as.data.frame(tstX),
type = 'response')
simpleP <- as.factor(ifelse(simpleP < 0.5, "Absent", "Present"))
names(simpleP) <- NULL
prSummary(data.frame(obs = tstY, pred = caretP), lev = c("Absent", "Present"))
## AUC Precision Recall F
## NA 0.8750000 0.7567568 0.8115942
prSummary(data.frame(obs = tstY, pred = simpleP), lev = c("Absent", "Present"))
## AUC Precision Recall F
## NA 0.8750000 0.7567568 0.8115942
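The AUC column is NA because prSummary computes it from class-probability columns, which we did not supply; a sketch of how one might include them follows (the probability columns must be named after the class levels). Note also that plugging the precision and recall above into the fMeasure helper from earlier reproduces the reported F of roughly 0.812.
caretProbs <- predict(caretModel, newdata = tstX, type = 'prob')
prSummary(data.frame(obs = tstY, pred = caretP,
                     Absent = caretProbs$Absent,
                     Present = caretProbs$Present),
          lev = c("Absent", "Present"))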
confusionMatrix(tstY, simpleP)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Absent Present
## Absent 28 9
## Present 4 34
##
## Accuracy : 0.8267
## 95% CI : (0.7219, 0.9043)
## No Information Rate : 0.5733
## P-Value [Acc > NIR] : 2.822e-06
##
## Kappa : 0.6527
##
## Mcnemar's Test P-Value : 0.2673
##
## Sensitivity : 0.8750
## Specificity : 0.7907
## Pos Pred Value : 0.7568
## Neg Pred Value : 0.8947
## Prevalence : 0.4267
## Detection Rate : 0.3733
## Detection Prevalence : 0.4933
## Balanced Accuracy : 0.8328
##
## 'Positive' Class : Absent
##
While it would be nice to attribute the increase in the F score to cross-validation, we know better: it reflects the vagaries of this particular train/test split.
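To make that concrete, one could repeat the split under different seeds and watch the test-set F vary; a sketch along the following lines (not run here):
# Re-split repeatedly and record the test-set F for the same model
# specification; the spread shows how much the score owes to the split
splitF <- sapply(1:20, function(s) {
  set.seed(s)
  id <- createDataPartition(heart$target, p = .75, list = FALSE)
  trn <- data.frame(target = factor(heart$target[id],
                                    labels = c("Absent", "Present")),
                    dheart[id, ])
  fit <- glm(target ~ ., family = binomial(link = "logit"), data = trn)
  p <- predict(fit, newdata = as.data.frame(dheart[-id, ]), type = 'response')
  pred <- factor(ifelse(p < 0.5, "Absent", "Present"),
                 levels = c("Absent", "Present"))
  obs <- factor(heart$target[-id], labels = c("Absent", "Present"))
  prSummary(data.frame(obs = obs, pred = pred),
            lev = c("Absent", "Present"))[["F"]]
})
summary(splitF)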
Since a logistic model has a well-defined likelihood, we can use the AIC to select models that balance goodness of fit against parsimony. The MASS package allows us to do this easily. Since simpleModel has all the predictors, and we are not considering interactions, it is a natural starting point for the stepwise procedure.
stepModel <- stepAIC(simpleModel, trace = 0, direction = 'both',
scope = list(upper = ~ ., lower = ~ 1))
summary(stepModel)
##
## Call:
## glm(formula = target ~ sex.1 + cp.1 + cp.2 + cp.3 + trestbps +
## chol + exang.1 + oldpeak + slope.1 + ca.1 + ca.2 + ca.3,
## family = binomial(link = "logit"), data = combTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7520 -0.4293 0.1419 0.4792 2.5429
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.496958 2.084228 4.077 4.57e-05 ***
## sex.1 -2.370282 0.598871 -3.958 7.56e-05 ***
## cp.1 1.789115 0.658522 2.717 0.006590 **
## cp.2 2.397431 0.553541 4.331 1.48e-05 ***
## cp.3 2.034189 0.742450 2.740 0.006147 **
## trestbps -0.027152 0.011571 -2.346 0.018952 *
## chol -0.007523 0.004326 -1.739 0.082076 .
## exang.1 -0.796040 0.469945 -1.694 0.090284 .
## oldpeak -0.517172 0.244040 -2.119 0.034073 *
## slope.1 -1.189447 0.474312 -2.508 0.012151 *
## ca.1 -1.776800 0.535330 -3.319 0.000903 ***
## ca.2 -2.878499 0.809229 -3.557 0.000375 ***
## ca.3 -1.997472 0.887330 -2.251 0.024379 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 313.10 on 227 degrees of freedom
## Residual deviance: 154.99 on 215 degrees of freedom
## AIC: 180.99
##
## Number of Fisher Scoring iterations: 6
Here is an interesting but predictable result. Both models show a large drop from the null deviance, and the simple model drops further; however, the step model has the better AIC by far.
simpleModel$null.deviance - simpleModel$deviance
## [1] 162.7754
stepModel$null.deviance - stepModel$deviance
## [1] 158.1149
AIC(simpleModel)
## [1] 190.3284
AIC(stepModel)
## [1] 180.9888
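For a binomial GLM with binary responses, the deviance equals minus twice the maximized log-likelihood, so AIC = deviance + 2k, where k is the number of estimated coefficients. A quick hand check:
simpleModel$deviance + 2 * length(coef(simpleModel))  # 150.33 + 2 * 20 = 190.33
stepModel$deviance + 2 * length(coef(stepModel))      # 154.99 + 2 * 13 = 180.99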
An AIC difference near 10 means the model with the higher AIC has essentially no support relative to the other (Burnham and Anderson 2002). The simple model has the lower deviance solely because it has more predictors. From an information-criterion perspective, however, it loses more than it gains.
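Burnham and Anderson quantify this via the relative likelihood exp(-delta / 2) of the higher-AIC model:
delta <- AIC(simpleModel) - AIC(stepModel)  # about 9.34
exp(-delta / 2)  # about 0.009: the simple model is roughly 1% as plausible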
stepP <- predict(stepModel, newdata = as.data.frame(tstX), type = 'response')
stepP <- as.factor(ifelse(stepP < 0.5, "Absent", "Present"))
names(stepP) <- NULL
prSummary(data.frame(obs = tstY, pred = stepP), lev = c("Absent", "Present"))
## AUC Precision Recall F
## NA 0.8823529 0.8108108 0.8450704
confusionMatrix(tstY, stepP)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Absent Present
## Absent 30 7
## Present 4 34
##
## Accuracy : 0.8533
## 95% CI : (0.7527, 0.9244)
## No Information Rate : 0.5467
## P-Value [Acc > NIR] : 1.664e-08
##
## Kappa : 0.7063
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.8824
## Specificity : 0.8293
## Pos Pred Value : 0.8108
## Neg Pred Value : 0.8947
## Prevalence : 0.4533
## Detection Rate : 0.4000
## Detection Prevalence : 0.4933
## Balanced Accuracy : 0.8558
##
## 'Positive' Class : Absent
##
It is unsurprising that stepModel performs better on the F measure as well. The simple model, having more features, likely suffers from a degree of overfitting.
We might find an even better model by hand-selecting variables.
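For instance, we might drop chol, whose coefficient is only marginally significant, and compare AICs; a purely illustrative sketch:
# Drop chol from the step model and compare information criteria
handModel <- update(stepModel, . ~ . - chol)
AIC(stepModel, handModel)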
Simple logistic regression remains a powerful tool for binary classification. As there is value in parsimony, one should not always simply select the model with the most parameters.
Burnham, Kenneth P., and David R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Second. New York: Springer Science+Business Media, Inc.