Classification Analysis: Logistic Regression

Preliminary

We will not use any new packages, so we will not install any packages. In this lesson, we will use the DescTools and caret packages. Next, we load these packages for use in the session.

library(caret)
library(DescTools)

In the lesson that follows we will use the BostonHousing.csv file. The famous Boston Housing dataset contains data about different census tracts in Boston and their average (or median) values. The variable of interest is median_val, which indicates if the median home value of occupied homes in the area is greater than (Above) or less than (Below) the median value (30k). The census bureau wants to create a predictive model to predict the median_val variable for new census tracts.

The variables include:

crim: Per capita crime rate by town
zn: Proportion of residential land zoned for lots over 25,000 ft2
indus: Proportion of nonretail business acres per town
chas: Charles River dummy variable (1 if tract bounds river, 0 otherwise)
nox: Nitric oxide concentration (parts per 10 million)
rm: Average number of rooms per dwelling
age: Proportion of owner-occupied units built prior to 1940
dis: Weighted distances to five Boston employment centers
rad: Index of accessibility to radial highways (1-8,24)
tax: Full-value property-tax rate per $10,000
ptratio: Pupil/teacher ratio by town
median_val: Above if the median home value is above the overall median, 30k, otherwise Below

We use the read.csv() function to import the CSV file into R as a dataframe named BH. We set stringsAsFactors = FALSE to keep any character columns as-is.

BH <- read.csv(file = "BostonHousing.csv",
                   stringsAsFactors = FALSE)

Data Exploration & Variable Preparation

First, we can obtain high-level information about the BH dataframe to look at the variable types and to check for missing (NA) values.

Abstract(BH)

## ------------------------------------------------------------------------------ 
## BH
## 
## data frame:  505 obs. of  13 variables
##      505 complete cases (100.0%)
## 
##   Nr  ColName     Class      NAs  Levels
##   1   crim        numeric    .          
##   2   zn          numeric    .          
##   3   indus       numeric    .          
##   4   chas        integer    .          
##   5   nox         numeric    .          
##   6   rm          numeric    .          
##   7   age         numeric    .          
##   8   dis         numeric    .          
##   9   rad         integer    .          
##   10  tax         integer    .          
##   11  ptratio     numeric    .          
##   12  b           numeric    .          
##   13  median_val  character  .

Preparing Target (Y) Variables

Next, we can convert our target class variable that we want to predict, median_val to a nominal factor variable. Note: we map Below to the 1 category and Above to the 2 by using the levels argument

BH$median_val <- factor(x = BH$median_val,
                        levels = c("Below", "Above"))

We can plot the distribution of our output (Y) variable using a barplot, which is the default plot for the plot() function when plotting factor variables. As shown, there are more Below median census tracts than Above.

plot(BH$median_val, 
     main = "Median Value")

Preparing Predictor (X) Variables

All of our potential predictors are numeric, but based on the data description, chas is categorical (binary) and rad could either be treated as categorical or numeric. We will convert chas to a nominal variable and leave rad as-is.

BH$chas <- factor(x = BH$chas)

Data Preprocessing & Transformation

For Logistic Regression, we know that we are okay keeping irrelevant and redundant variables, we can incorporate categorical variables as-is (as factors) and we do not need to transform our numerical variables but we need to handle missing values.

Missing Values First, we check for missing values. If missing values are present, we can either remove them row-wise (na.omit()) or perform imputation.

PlotMiss(x = BH)

Training & Testing Sets

We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (median_val). Since the function takes a random sample, we initialize a random seed first for reproducibility.

set.seed(831)
sub <- createDataPartition(y = BH$median_val, 
                           p = 0.80, 
                           list = FALSE)

Next, we subset the rows of the BH dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.

train <- BH[sub, ] 
test <- BH[-sub, ]

Analysis

We use the glm() function in the stats package to build the generalized linear model and specify family = binomial to create the logistic regression model. The model predicts the higher class level (Above), internally converting the Y variable to 0/1.

Since we are using all other variables in the train dataframe to predict our target (Y) median_val, we can use the syntax Y ~ ..

logmod <- glm(formula = median_val ~ ., 
              family = binomial,
              data = train)

We can view the model by using the summary() function on the glm object created (logmod).

summary(logmod)

## 
## Call:
## glm(formula = median_val ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8978  -0.2039  -0.0867  -0.0176   4.0752  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.785689   7.598899  -2.735  0.00623 ** 
## crim         -0.109724   0.129875  -0.845  0.39820    
## zn            0.024000   0.012425   1.932  0.05340 .  
## indus        -0.151787   0.093135  -1.630  0.10315    
## chas1         0.630732   0.820268   0.769  0.44193    
## nox           1.850439   6.022775   0.307  0.75866    
## rm            3.997623   0.629128   6.354 2.09e-10 ***
## age          -0.018285   0.015756  -1.161  0.24584    
## dis          -0.537005   0.219787  -2.443  0.01455 *  
## rad           0.220627   0.107935   2.044  0.04095 *  
## tax          -0.003969   0.004947  -0.802  0.42242    
## ptratio      -0.425738   0.176999  -2.405  0.01616 *  
## b             0.008584   0.009933   0.864  0.38750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 366.56  on 404  degrees of freedom
## Residual deviance: 122.86  on 392  degrees of freedom
## AIC: 148.86
## 
## Number of Fisher Scoring iterations: 8

Interpreting the Model

We can obtain the coefficients for the predictor variables, which are the log odds, using the coef() function.

coef(logmod)

##   (Intercept)          crim            zn         indus         chas1 
## -20.785688797  -0.109724141   0.024000228  -0.151787128   0.630731844 
##           nox            rm           age           dis           rad 
##   1.850439490   3.997622566  -0.018284616  -0.537004916   0.220627466 
##           tax       ptratio             b 
##  -0.003968621  -0.425737819   0.008583720

We can interpret the coefficient for a particular variable, dis (weighted distance to employment centers).

coef(logmod)["dis"]

##        dis 
## -0.5370049

As shown, the coefficient is negative. This suggests that (holding all other variables constant) a one unit increase in distance is associated with a lower probability that the home will be Above the median.

To obtain the odds, we use the exp() function on the output of the coef() function.

exp(coef(logmod))

##  (Intercept)         crim           zn        indus        chas1          nox 
## 9.394854e-10 8.960813e-01 1.024291e+00 8.591712e-01 1.878985e+00 6.362615e+00 
##           rm          age          dis          rad          tax      ptratio 
## 5.446850e+01 9.818815e-01 5.844962e-01 1.246859e+00 9.960392e-01 6.532876e-01 
##            b 
## 1.008621e+00

We can interpret the odds for a particular variable, nox (nitric oxide concentration), which is positive. For a one-unit increase in nitric oxide concentration, the odds of the home price being Above the median (versus Below) increase by a factor of

exp(coef(logmod))["nox"]

##      nox 
## 6.362615

For categorical variables, the class level that is included in the model is compared to a reference category. For the chas variable, chas = 1 is included in the model (chas1) and chas = 0 is the reference. The coefficient for chas1 is:

coef(logmod)["chas1"]

##     chas1 
## 0.6307318

and the odds are:

exp(coef(logmod))["chas1"]

##    chas1 
## 1.878985

This implies that (holding all other variables constant) the odds that a census tract bordering the Charles River will be above the median relative to a census tract that does not border the Charles River are 1.8789852. Based on the positive coefficient, this implies that those census tracts that border the River are more likely to be Above the median than those that do not.

Model Performance & Fit

Training Performance

To assess the goodness of fit of the model, we compare the training and testing performance. For this reason, we need to assess how well the model does on the training sample.

We can use the fitted() function to get the vector of fitted values from the model.

head(fitted(logmod)) # first 6 fitted values

##          2          3          4          5          6          8 
## 0.01589623 0.31411886 0.23447832 0.32614364 0.02427303 0.01475666

We need choose a cutoff point to assign the probabilities to a class (Above or Below) for our target variable (median_val). A typical cutoff point is 0.5. We can use the ifelse() function to make the class assignments. Then, we can convert the vector to a factor with the same class levels as the median_val variable.

train.pred <- ifelse(test = fitted(logmod) >= 0.5,
                      yes = "Above",
                      no = "Below")
train.pred <- factor(train.pred,
                      levels = c("Below", "Above"))

We can view the first 6 fitted values and classifications side-by-side by combining them column-wise using cbind() and using the head() function to preview.

head(cbind(fitted = fitted(logmod),
      class = train.pred))

##       fitted class
## 2 0.01589623     1
## 3 0.31411886     1
## 4 0.23447832     1
## 5 0.32614364     1
## 6 0.02427303     1
## 8 0.01475666     1

We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the training dataset (train). We can set mode = "everything" to obtain all available performance measures. We identify the Above class as positive, since that is the class we are more interested in being able to predict. We will save it so that we can make comparisons.

train_conf <- confusionMatrix(data = train.pred, # predictions
                              reference = train$median_val, # actual
                              positive = "Above",
                              mode = "everything")
train_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Below Above
##      Below   328    14
##      Above     9    54
##                                          
##                Accuracy : 0.9432         
##                  95% CI : (0.916, 0.9637)
##     No Information Rate : 0.8321         
##     P-Value [Acc > NIR] : 1.305e-11      
##                                          
##                   Kappa : 0.7906         
##                                          
##  Mcnemar's Test P-Value : 0.4042         
##                                          
##             Sensitivity : 0.7941         
##             Specificity : 0.9733         
##          Pos Pred Value : 0.8571         
##          Neg Pred Value : 0.9591         
##               Precision : 0.8571         
##                  Recall : 0.7941         
##                      F1 : 0.8244         
##              Prevalence : 0.1679         
##          Detection Rate : 0.1333         
##    Detection Prevalence : 0.1556         
##       Balanced Accuracy : 0.8837         
##                                          
##        'Positive' Class : Above          
##

Testing Performance

To assess model performance, we focus on the performance of the model applied to the testing set. Next, we use the predict() function to obtain the predicted values for the testing data based on our LR model.

LR.test <- predict(object = logmod, # LR model
                   newdata = test, # testing data
                   type = "response")

Again, we use a cutoff point of 0.5 to assign the probabilities to a class (Above or Below) for our target variable (median_val).

test.pred <- ifelse(test = LR.test >= 0.5,
                      yes = "Above",
                      no = "Below")
test.pred <- factor(test.pred,
                      levels = c("Below", "Above"))

Again, we use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model, this time applied to the testing dataset (test).

test_conf <- confusionMatrix(data = test.pred, # predictions
                             reference = test$median_val, #actual
                             positive = "Above",
                             mode = "everything")
test_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Below Above
##      Below    82     3
##      Above     2    13
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.8872, 0.9836)
##     No Information Rate : 0.84            
##     P-Value [Acc > NIR] : 0.0006792       
##                                           
##                   Kappa : 0.8092          
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 0.8125          
##             Specificity : 0.9762          
##          Pos Pred Value : 0.8667          
##          Neg Pred Value : 0.9647          
##               Precision : 0.8667          
##                  Recall : 0.8125          
##                      F1 : 0.8387          
##              Prevalence : 0.1600          
##          Detection Rate : 0.1300          
##    Detection Prevalence : 0.1500          
##       Balanced Accuracy : 0.8943          
##                                           
##        'Positive' Class : Above           
##

We can describe the overall performance based on our accuracy and kappa values.

test_conf$overall[c("Accuracy", "Kappa")]

##  Accuracy     Kappa 
## 0.9500000 0.8091603

As shown, the overall model has strong performance, with high Accuracy and Kappa values.

We can describe class-level performance for the different class levels. Note: above, we set positive = "Above", since we are more interested in predicting Above median census tracts.

test_conf$byClass

##          Sensitivity          Specificity       Pos Pred Value 
##            0.8125000            0.9761905            0.8666667 
##       Neg Pred Value            Precision               Recall 
##            0.9647059            0.8666667            0.8125000 
##                   F1           Prevalence       Detection Rate 
##            0.8387097            0.1600000            0.1300000 
## Detection Prevalence    Balanced Accuracy 
##            0.1500000            0.8943452

While the overall measures are high, the sensitivity/recall is comparatively lower. Since we want to be able to correctly predict the Above class, this implies that we have some false negatives (the actual class is Above but the model predicts Below). As shown, we have a similar number of false positives (the actual class is Below and the model predicts Above).

Model Goodness of Fit

To assess the model goodness of fit, we want to know if the model is balanced, underfitting or overfitting. We compare the performance on the training and testing sets. We can use the cbind() function to compare our confusionMatrix() output side-by-side.

First, we can compare the overall performance on the training and testing sets.

cbind(Training = train_conf$overall,
      Testing = test_conf$overall)

##                    Training      Testing
## Accuracy       9.432099e-01 0.9500000000
## Kappa          7.906130e-01 0.8091603053
## AccuracyLower  9.160029e-01 0.8871650889
## AccuracyUpper  9.636630e-01 0.9835681208
## AccuracyNull   8.320988e-01 0.8400000000
## AccuracyPValue 1.304922e-11 0.0006792027
## McnemarPValue  4.042485e-01 1.0000000000

Next, we can compare the class level performance on the training and testing sets.

cbind(Training = train_conf$byClass,
      Testing = test_conf$byClass)

##                       Training   Testing
## Sensitivity          0.7941176 0.8125000
## Specificity          0.9732938 0.9761905
## Pos Pred Value       0.8571429 0.8666667
## Neg Pred Value       0.9590643 0.9647059
## Precision            0.8571429 0.8666667
## Recall               0.7941176 0.8125000
## F1                   0.8244275 0.8387097
## Prevalence           0.1679012 0.1600000
## Detection Rate       0.1333333 0.1300000
## Detection Prevalence 0.1555556 0.1500000
## Balanced Accuracy    0.8837057 0.8943452

As shown, we have similar performance on our training and testing sets. For this reason, we would conclude that the model is balanced. The model is performing slightly better on the testing set.