library(MASS)
library(caret)
library(data.table)

Introduction

This document reviews the heart dataset and fits a logistic regression to it. As this is an expository document, the code is interspersed with the findings; in a formal report, the code would be relegated to a code appendix.

Exploratory Data Analysis

Loading the Data

Before we read the data, it is wise to review the meanings of the variables, which are listed at the data's source. From this we see that the following variables are actually categorical:

  • sex
    • 0: female
    • 1: male
  • cp (chest pain type)
    1. typical angina
    2. atypical angina
    3. non-anginal pain
    4. asymptomatic
  • fbs: fasting blood sugar > 120 mg/dl
    • 0: FALSE
    • 1: TRUE
  • restecg: resting electrocardiographic results
    • 0: normal
    • 1: having ST-T wave abnormality
    • 2: showing probable or definite left ventricular hypertrophy
  • exang: exercise induced angina
    • 0: No
    • 1: Yes
  • slope: the slope of the peak exercise ST segment
    1. upsloping
    2. flat
    3. downsloping

While ca is technically an integer variable, it takes only a handful of distinct values, so we will treat it as a factor as well (a quick tabulation after loading the data can confirm this).

# Read the data, assigning column classes per the variable descriptions above
heart <- fread('./Basic Data/heart.csv',
               colClasses = c('integer', rep('factor', 2), rep('integer', 2),
                              rep('factor', 2), 'integer', 'factor', 'double',
                              rep('factor', 3), 'integer'))
nobs <- nrow(heart)
nvars <- ncol(heart) - 1
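
As a quick check on the claim above that ca takes only a few distinct values, one could tabulate it (a minimal sketch; output omitted):

# Tabulate ca; we expect only a handful of levels (0 through 4 in this data)
table(heart$ca)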

Dataset Composition

The dataset comprises 303 observations of 13 features and one binary target variable taking only the values 0 and 1. This makes it a prime candidate for logistic regression.
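
For reference, logistic regression models the log-odds of the target as a linear function of the predictors, log(p / (1 - p)) = β0 + β1·x1 + … + βk·xk, so the coefficients reported later are on the log-odds scale.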

Unfortunately, the values for thal do not reflect those of the data’s source. The source has “thal: 3 = normal; 6 = fixed defect; 7 = reversable defect”. The data set only contains the following values:

unique(heart$thal)
## [1] 1 2 3 0
## Levels: 0 1 2 3

Worse still, the source lists only three values for thal, yet four appear in the data. Therefore, for the time being, we will exclude thal, as we are concerned about its provenance and accuracy.

heart[, thal := NULL]
anyNA(heart)
## [1] FALSE

There are no NAs, so thankfully we will not need to impute missing values.

Ranges of Non-Factor Data

While we will most likely center and scale the non-factor variables (a sketch of this appears after the plots), it is instructive to look at their ranges on their original scales.

boxplot(data.frame(age = heart$age, trestbps = heart$trestbps,
                   chol = heart$chol, thalach = heart$thalach,
                   oldpeak = heart$oldpeak))

As these variables span rather different ranges, the picture is clearer when each is plotted separately with its own axis.

par(mfrow = c(3, 2))
boxplot(heart$age, main = "age")
boxplot(heart$trestbps, main = "trestbps")
boxplot(heart$chol, main = "chol")
boxplot(heart$thalach, main = "thalach")
boxplot(heart$oldpeak, main = "oldpeak")
par(mfrow = c(1, 1))
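
Should we decide to center and scale later, caret's preProcess function makes it straightforward. A minimal sketch, applied here purely for illustration:

# Center and scale with caret::preProcess; non-numeric (factor) columns are
# left untouched and passed through by predict()
pp <- preProcess(as.data.frame(heart), method = c("center", "scale"))
heartScaled <- predict(pp, as.data.frame(heart))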

Modeling

We will use the caret package to fit a logistic regression. While there are many flavors of logistic regression, we will stick with the simple version.

Note that in caret, simple logistic regression is handled as in R, via a call to glm with family = binomial(link = "logit").

Pre-processing

We will first convert the factors to dummy variables.

heartDum <- dummyVars(target ~ ., data = heart, fullRank = TRUE)
dheart <- predict(heartDum, heart)
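
It is worth confirming the expansion: with fullRank = TRUE, each k-level factor contributes k - 1 columns, its reference level (e.g. sex.0 or cp.0) being dropped. A quick check (output omitted):

# Inspect the dummy-expanded column names
colnames(dheart)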

The next step is to break the data into 75% training and 25% testing sets. The value of using the createDataPartition function from the caret package, as opposed to simply using sample, is that the former attempts to create sets that are balanced in their distribution of the target variable.

set.seed(17895)
trainID <- createDataPartition(heart$target, p = .75, list = FALSE)
trnX <- dheart[trainID, ]
trnY <- factor(heart$target[trainID], labels = c("Absent", "Present"))
tstX <- dheart[-trainID, ]
tstY <- factor(heart$target[-trainID], labels = c("Absent", "Present"))
table(trnY)
## trnY
##  Absent Present 
##     101     127
table(tstY)
## tstY
##  Absent Present 
##      37      38

Note that the Absent:Present proportions are similar in the two sets, as createDataPartition intends.
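
A quick check of the proportions makes the stratification explicit (a minimal sketch, output omitted):

# Class proportions should be similar across the two sets
prop.table(table(trnY))
prop.table(table(tstY))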

Finally, we will use 10-fold cross-validation to select the “best” logistic regression. Since this is a binary classification problem, we will use the F measure as our metric: it combines precision and recall, which are often better indicators than raw accuracy. We will compare this to the simple, full logistic regression (using every predictor) that would have been found via a direct call to glm.
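
For reference, the F measure reported by prSummary is the harmonic mean of precision and recall, F = 2 · precision · recall / (precision + recall), so it penalizes a model that is strong on one and weak on the other.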

trC <- trainControl(method = "cv", number = 10,
                    summaryFunction = prSummary, classProbs = TRUE)

Training the Model

At this point we can call caret to do its job.

caretModel <- train(x = trnX, y = trnY, method = 'glm',
                    family = binomial(link = "logit"),
                    trControl = trC, metric = 'F')
combTrain <- data.frame(target = trnY, trnX)
simpleModel <- glm(target ~ ., family = binomial(link = "logit"),
                   data = combTrain)
caretModel
## Generalized Linear Model 
## 
## 228 samples
##  19 predictor
##   2 classes: 'Absent', 'Present' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 205, 205, 205, 205, 205, 205, ... 
## Resampling results:
## 
##   AUC        Precision  Recall     F        
##   0.7683322  0.7809163  0.7736364  0.7719735
summary(caretModel)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6841  -0.3784   0.1447   0.4541   2.5976  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.998059   3.216067   1.243 0.213812    
## age          0.041920   0.028996   1.446 0.148260    
## sex.1       -2.329490   0.626832  -3.716 0.000202 ***
## cp.1         1.651003   0.682807   2.418 0.015608 *  
## cp.2         2.089151   0.580165   3.601 0.000317 ***
## cp.3         1.856511   0.756799   2.453 0.014163 *  
## trestbps    -0.035653   0.012785  -2.789 0.005294 ** 
## chol        -0.008570   0.004811  -1.781 0.074852 .  
## fbs.1        0.595442   0.638497   0.933 0.351043    
## restecg.1    0.374036   0.438307   0.853 0.393458    
## restecg.2   -0.282382   4.109558  -0.069 0.945218    
## thalach      0.018810   0.012913   1.457 0.145198    
## exang.1     -0.743068   0.489365  -1.518 0.128905    
## oldpeak     -0.452165   0.264570  -1.709 0.087440 .  
## slope.1     -0.564704   1.197147  -0.472 0.637135    
## slope.2      0.546457   1.247844   0.438 0.661444    
## ca.1        -1.823620   0.570838  -3.195 0.001400 ** 
## ca.2        -3.505793   0.902289  -3.885 0.000102 ***
## ca.3        -2.230546   0.953211  -2.340 0.019282 *  
## ca.4         0.199652   1.689932   0.118 0.905955    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 313.10  on 227  degrees of freedom
## Residual deviance: 150.33  on 208  degrees of freedom
## AIC: 190.33
## 
## Number of Fisher Scoring iterations: 6
summary(simpleModel)
## 
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"), 
##     data = combTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6841  -0.3784   0.1447   0.4541   2.5976  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.998059   3.216067   1.243 0.213812    
## age          0.041920   0.028996   1.446 0.148260    
## sex.1       -2.329490   0.626832  -3.716 0.000202 ***
## cp.1         1.651003   0.682807   2.418 0.015608 *  
## cp.2         2.089151   0.580165   3.601 0.000317 ***
## cp.3         1.856511   0.756799   2.453 0.014163 *  
## trestbps    -0.035653   0.012785  -2.789 0.005294 ** 
## chol        -0.008570   0.004811  -1.781 0.074852 .  
## fbs.1        0.595442   0.638497   0.933 0.351043    
## restecg.1    0.374036   0.438307   0.853 0.393458    
## restecg.2   -0.282382   4.109558  -0.069 0.945218    
## thalach      0.018810   0.012913   1.457 0.145198    
## exang.1     -0.743068   0.489365  -1.518 0.128905    
## oldpeak     -0.452165   0.264570  -1.709 0.087440 .  
## slope.1     -0.564704   1.197147  -0.472 0.637135    
## slope.2      0.546457   1.247844   0.438 0.661444    
## ca.1        -1.823620   0.570838  -3.195 0.001400 ** 
## ca.2        -3.505793   0.902289  -3.885 0.000102 ***
## ca.3        -2.230546   0.953211  -2.340 0.019282 *  
## ca.4         0.199652   1.689932   0.118 0.905955    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 313.10  on 227  degrees of freedom
## Residual deviance: 150.33  on 208  degrees of freedom
## AIC: 190.33
## 
## Number of Fisher Scoring iterations: 6

Note that the results are identical. While this may seem surprising, it is not: simple logistic regression has no tuning parameters, so there is nothing for caret to tune. The cross-validation above only supplies resampled performance estimates.
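
One can verify this from caret itself; modelLookup shows that the 'glm' method exposes no real tuning grid (a quick check, output omitted):

# The single 'parameter' row returned for 'glm' is a placeholder,
# not an actual tuning parameter
modelLookup("glm")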

Had we used a logistic-based regression tree or a stepwise regression, the results would almost surely have differed.

Testing the Model

As the two models are identical, they will have the same accuracy scores, but we will show them anyway.

caretP <- predict(caretModel, newdata = tstX)
simpleP <- predict(simpleModel, newdata = as.data.frame(tstX),
                   type = 'response')
simpleP <- as.factor(ifelse(simpleP < 0.5, "Absent", "Present"))
names(simpleP) <- NULL
prSummary(data.frame(obs = tstY, pred = caretP), lev = c("Absent", "Present"))
##       AUC Precision    Recall         F 
##        NA 0.8750000 0.7567568 0.8115942
prSummary(data.frame(obs = tstY, pred = simpleP), lev = c("Absent", "Present"))
##       AUC Precision    Recall         F 
##        NA 0.8750000 0.7567568 0.8115942
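
The AUC is NA above because prSummary received only hard class predictions; it also expects class-probability columns named after the levels. A hedged sketch of how one could supply them for the caret model:

# Obtain class probabilities and pass them alongside the predicted classes;
# prSummary looks for probability columns named after the class levels
caretProb <- predict(caretModel, newdata = tstX, type = "prob")
prSummary(data.frame(obs = tstY, pred = caretP,
                     Absent = caretProb$Absent, Present = caretProb$Present),
          lev = c("Absent", "Present"))
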
confusionMatrix(tstY, simpleP)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Absent Present
##    Absent      28       9
##    Present      4      34
##                                           
##                Accuracy : 0.8267          
##                  95% CI : (0.7219, 0.9043)
##     No Information Rate : 0.5733          
##     P-Value [Acc > NIR] : 2.822e-06       
##                                           
##                   Kappa : 0.6527          
##                                           
##  Mcnemar's Test P-Value : 0.2673          
##                                           
##             Sensitivity : 0.8750          
##             Specificity : 0.7907          
##          Pos Pred Value : 0.7568          
##          Neg Pred Value : 0.8947          
##              Prevalence : 0.4267          
##          Detection Rate : 0.3733          
##    Detection Prevalence : 0.4933          
##       Balanced Accuracy : 0.8328          
##                                           
##        'Positive' Class : Absent          
## 

While it would be nice to attribute the increase in the F score to cross-validation, we know better: it is due to the vagaries of the train/test split.

Step AIC

Since a logistic model has a well-defined likelihood, we can use the AIC to select models that balance goodness of fit against parsimony. The stepAIC function in the MASS package makes this easy. Since simpleModel contains all the predictors, and we are not considering interactions, it is a natural starting point for the stepwise procedure.
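
Recall that AIC = 2k − 2·ln(L̂), where k is the number of estimated parameters and L̂ is the maximized likelihood: lower is better, and each extra parameter must buy enough likelihood to offset its cost.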

stepModel <- stepAIC(simpleModel, trace = 0, direction = 'both',
                     scope = list(upper = ~ ., lower = ~ 1))
summary(stepModel)
## 
## Call:
## glm(formula = target ~ sex.1 + cp.1 + cp.2 + cp.3 + trestbps + 
##     chol + exang.1 + oldpeak + slope.1 + ca.1 + ca.2 + ca.3, 
##     family = binomial(link = "logit"), data = combTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7520  -0.4293   0.1419   0.4792   2.5429  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.496958   2.084228   4.077 4.57e-05 ***
## sex.1       -2.370282   0.598871  -3.958 7.56e-05 ***
## cp.1         1.789115   0.658522   2.717 0.006590 ** 
## cp.2         2.397431   0.553541   4.331 1.48e-05 ***
## cp.3         2.034189   0.742450   2.740 0.006147 ** 
## trestbps    -0.027152   0.011571  -2.346 0.018952 *  
## chol        -0.007523   0.004326  -1.739 0.082076 .  
## exang.1     -0.796040   0.469945  -1.694 0.090284 .  
## oldpeak     -0.517172   0.244040  -2.119 0.034073 *  
## slope.1     -1.189447   0.474312  -2.508 0.012151 *  
## ca.1        -1.776800   0.535330  -3.319 0.000903 ***
## ca.2        -2.878499   0.809229  -3.557 0.000375 ***
## ca.3        -1.997472   0.887330  -2.251 0.024379 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 313.10  on 227  degrees of freedom
## Residual deviance: 154.99  on 215  degrees of freedom
## AIC: 180.99
## 
## Number of Fisher Scoring iterations: 6

Comparisons

Here is an interesting, but predictable, result. Looking at the reduction in deviance from the null model, both the simple and step models improve substantially, but the simple model improves more. The step model, however, has the better AIC by far.

simpleModel$null.deviance - simpleModel$deviance
## [1] 162.7754
stepModel$null.deviance - stepModel$deviance
## [1] 158.1149
AIC(simpleModel)
## [1] 190.3284
AIC(stepModel)
## [1] 180.9888

A difference of nearly 10 means that the model with the higher AIC has essentially no support relative to the other (Burnham and Anderson 2002). The simple model has the lower deviance solely because it has more predictors. From an information-criterion perspective, however, it loses more than it gains.
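
To quantify this, the relative likelihood of the higher-AIC model is exp(−Δ/2), where Δ is the AIC difference (Burnham and Anderson 2002). A quick calculation (output omitted):

# Relative likelihood of the simple model; a value near zero means
# essentially no support relative to the step model
delta <- AIC(simpleModel) - AIC(stepModel)
exp(-delta / 2)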

stepP <- predict(stepModel, newdata = as.data.frame(tstX), type = 'response')
stepP <- as.factor(ifelse(stepP < 0.5, "Absent", "Present"))
names(stepP) <- NULL
prSummary(data.frame(obs = tstY, pred = stepP), lev = c("Absent", "Present"))
##       AUC Precision    Recall         F 
##        NA 0.8823529 0.8108108 0.8450704
confusionMatrix(tstY, stepP)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Absent Present
##    Absent      30       7
##    Present      4      34
##                                           
##                Accuracy : 0.8533          
##                  95% CI : (0.7527, 0.9244)
##     No Information Rate : 0.5467          
##     P-Value [Acc > NIR] : 1.664e-08       
##                                           
##                   Kappa : 0.7063          
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.8824          
##             Specificity : 0.8293          
##          Pos Pred Value : 0.8108          
##          Neg Pred Value : 0.8947          
##              Prevalence : 0.4533          
##          Detection Rate : 0.4000          
##    Detection Prevalence : 0.4933          
##       Balanced Accuracy : 0.8558          
##                                           
##        'Positive' Class : Absent          
## 

It is unsurprising that the step model also performs better on the F measure. The simple model, having more features, probably suffers from overfitting.

We may be able to find an even better model by hand-selecting variables.
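
Since the step model is nested within the simple model, one could also compare them formally with a likelihood ratio test; a minimal sketch (output omitted), where a large p-value would indicate that the extra predictors do not significantly improve the fit:

# Compare the nested models: the drop in residual deviance is referred
# to a chi-squared distribution with df equal to the parameter difference
anova(stepModel, simpleModel, test = "Chisq")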

Conclusion

Simple logistic regression remains a powerful tool for binary classification. As there is value in parsimony, one should not always simply select the model with the most parameters.

References

Burnham, Kenneth P., and David R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd ed. New York: Springer Science+Business Media, Inc.