In this lesson, we will use the DescTools, caret, and e1071 packages. Since no new packages are needed, we do not need to install anything; we simply load the packages for use in the session.
library(caret)
library(DescTools)
library(e1071)
In the lesson that follows, we will use the BostonHousing.csv file. The well-known Boston Housing dataset contains data about census tracts in Boston. The variable of interest is median_val, which indicates whether the median value of occupied homes in a tract is greater than (Above) or less than (Below) $30,000. The census bureau wants to build a model to predict the median_val variable for new census tracts.
The variables include:
crim: per capita crime rate by town
zn: proportion of residential land zoned for large lots
indus: proportion of non-retail business acres per town
chas: Charles River indicator (1 if the tract bounds the river, 0 otherwise)
nox: nitric oxides concentration
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property tax rate
ptratio: pupil-teacher ratio by town
b: a transformation of the proportion of Black residents by town
median_val: whether the median home value is Above or Below $30,000 (the target variable)
We use the read.csv() function to import the CSV file into R as a dataframe named BH. We set stringsAsFactors = FALSE to keep any character columns as-is.
BH <- read.csv(file = "BostonHousing.csv",
stringsAsFactors = FALSE)
First, we can obtain high-level information about the BH dataframe using the Abstract() function from the DescTools package, to look at the variable types and to check for missing (NA) values.
Abstract(BH)
## ------------------------------------------------------------------------------
## BH
##
## data frame: 505 obs. of 13 variables
## 505 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 crim numeric .
## 2 zn numeric .
## 3 indus numeric .
## 4 chas integer .
## 5 nox numeric .
## 6 rm numeric .
## 7 age numeric .
## 8 dis numeric .
## 9 rad integer .
## 10 tax integer .
## 11 ptratio numeric .
## 12 b numeric .
## 13 median_val character .
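As a quick supplementary check, we could also count missing values per column using base R (a minimal sketch; output not shown):
# Count missing (NA) values in each column of BH
colSums(is.na(BH))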
Next, we convert the target class variable that we want to predict, median_val, to a nominal factor variable. Note: by using the levels argument, we map Below to the first level (1) and Above to the second level (2).
BH$median_val <- factor(x = BH$median_val,
levels = c("Below", "Above"))
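We can confirm the level ordering and class counts (output not shown):
levels(BH$median_val) # level order: "Below" then "Above"
table(BH$median_val)  # counts per class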
We can plot the distribution of our output (Y) variable using a barplot, which is the default plot for the plot() function when applied to factor variables. As shown, more census tracts are Below the median value than Above it.
plot(BH$median_val,
main = "Median Value")
All of our potential predictors are numeric, but based on the data description, chas is categorical (binary) and rad could either be treated as categorical or numeric. We will leave chas as is (since it is already binary), but convert rad to a nominal variable (since we will need to binarize it and will not need to preserve the ordering).
BH$rad <- factor(x = BH$rad)
For Support Vector Machines (SVM), keeping irrelevant and redundant variables is acceptable, but we do need to handle missing values, binarize categorical variables, and rescale numerical variables.
First, we check for missing values. As shown in the Abstract() output above, there are no missing (NA) values (100% complete cases), so we do not need to remove rows (e.g., with na.omit()) or perform imputation. We can also confirm this visually with the PlotMiss() function from the DescTools package.
PlotMiss(x = BH)
Next, we rescale the numeric variables using min-max normalization. We use the preProcess() function from the caret package with method = "range" to create the transformation object (cen_mm), and then apply it to the data with the predict() function, saving the result as BH_mm.
cen_mm <- preProcess(x = BH,
method = "range")
BH_mm <- predict(object = cen_mm,
newdata = BH)
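To verify the rescaling, we can check that the numeric columns of BH_mm now range from 0 to 1 (output not shown):
# Range of each numeric column after min-max normalization
sapply(BH_mm[sapply(BH_mm, is.numeric)], range)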
To binarize categorical variables, we can use the class2ind() function from the caret package for categorical variables with 2 class levels, and the dummyVars() function from the caret package (together with the predict() function) for categorical variables with more than 2 class levels. We binarize the rad variable and create a matrix (cats_dums) of the dummy variables.
cats <- dummyVars(formula = ~ rad,
data = BH_mm)
cats_dums <- predict(object = cats,
newdata = BH_mm)
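Although we do not need it here (chas is already coded 0/1), the class2ind() function mentioned above handles the 2-level case. A minimal, hypothetical sketch (not used in the rest of the lesson):
# Hypothetical: binarize a 2-level factor with class2ind()
# (chas is already 0/1 in BH_mm, so this step is not needed)
chas_dums <- class2ind(x = factor(BH_mm$chas), drop2nd = TRUE)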
Next, we combine the binarized variables (cats_dums) with the rest of the data (the transformed numeric variables, the factor target variable, and the binary chas variable), dropping the original rad column. We save the result as a dataframe named BH_mm_dum.
BH_mm_dum <- data.frame(BH_mm[ ,!names(BH) %in% "rad"],
cats_dums)
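We can confirm that the rad column has been replaced by its dummy columns (a quick check; output not shown):
# Column names of the combined dataframe
names(BH_mm_dum)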
We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (median_val). Since the function takes a random sample, we initialize a random seed first for reproducibility.
set.seed(831)
sub <- createDataPartition(y = BH$median_val,
p = 0.80,
list = FALSE)
Next, we subset the rows of the BH dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.
train <- BH[sub, ]
test <- BH[-sub, ]
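We can verify that the class distribution of median_val is (approximately) preserved in both sets (output not shown):
# Compare the proportion of Below/Above in the training and testing sets
prop.table(table(train$median_val))
prop.table(table(test$median_val))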
We build the basic support vector machines model using the svm() function in the e1071 package.
Since we are performing classification, we specify method = "C-classification". Note: setting scale = TRUE rescales the data (this is the default and can be omitted); the default rescaling is standardization. We will set this to FALSE and use the min-max normalized data. The default kernel is "radial"; other available kernels include "polynomial", "linear", and "sigmoid". By default, the cost hyperparameter is set to 1.
We will fit a linear support vector machine.
set.seed(831) # initialize random seed
svm_mod <- svm(formula = median_val ~ ., # use all other variables to predict median_val
data = train, # use train data
method = "C-classification", # classification
kernel = "linear", # linear kernel
scale = FALSE) # no standardization
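For reference, other kernels are specified in the same way. A minimal sketch of a radial-kernel fit (not evaluated in this lesson; results would differ from the linear model):
# Hypothetical radial-kernel SVM (not run)
set.seed(831)
svm_rbf <- svm(formula = median_val ~ .,
               data = train,
               method = "C-classification",
               kernel = "radial",
               scale = FALSE)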
We can view a summary of the model by using the summary() function.
summary(svm_mod)
##
## Call:
## svm(formula = median_val ~ ., data = train, method = "C-classification",
## kernel = "linear", scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 69
##
## ( 36 33 )
##
##
## Number of Classes: 2
##
## Levels:
## Below Above
To assess the goodness of fit of the model, we compare the training and testing performance. For this reason, we first need to assess how well the model does on the training sample. The fitted component of the object created by the svm() function (svm_mod) contains the training predictions, so we can use it directly in the confusionMatrix() function (and don't need to call predict()).
SVM_train_conf <- confusionMatrix(data = svm_mod$fitted, # predictions
reference = train$median_val, # actual
positive = "Above",
mode = "everything")
SVM_train_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below Above
## Below 331 12
## Above 6 56
##
## Accuracy : 0.9556
## 95% CI : (0.9307, 0.9734)
## No Information Rate : 0.8321
## P-Value [Acc > NIR] : 1.703e-14
##
## Kappa : 0.8351
##
## Mcnemar's Test P-Value : 0.2386
##
## Sensitivity : 0.8235
## Specificity : 0.9822
## Pos Pred Value : 0.9032
## Neg Pred Value : 0.9650
## Precision : 0.9032
## Recall : 0.8235
## F1 : 0.8615
## Prevalence : 0.1679
## Detection Rate : 0.1383
## Detection Prevalence : 0.1531
## Balanced Accuracy : 0.9029
##
## 'Positive' Class : Above
##
To assess model performance, we focus on the performance of the model applied to the testing set. Next, we use the predict() function to obtain the predicted values for the testing data based on our SVM model.
lin.te.preds <- predict(object = svm_mod,
newdata = test,
type = "class")
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the testing dataset (test).
SVM_test_conf <- confusionMatrix(data = lin.te.preds, # predictions
reference = test$median_val, # actual
positive = "Above",
mode = "everything")
SVM_test_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below Above
## Below 82 4
## Above 2 12
##
## Accuracy : 0.94
## 95% CI : (0.874, 0.9777)
## No Information Rate : 0.84
## P-Value [Acc > NIR] : 0.002204
##
## Kappa : 0.7649
##
## Mcnemar's Test P-Value : 0.683091
##
## Sensitivity : 0.7500
## Specificity : 0.9762
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.9535
## Precision : 0.8571
## Recall : 0.7500
## F1 : 0.8000
## Prevalence : 0.1600
## Detection Rate : 0.1200
## Detection Prevalence : 0.1400
## Balanced Accuracy : 0.8631
##
## 'Positive' Class : Above
##
To assess the model goodness of fit, we want to know if the model is balanced, underfitting or overfitting. We compare the performance on the training and testing sets. We can use the cbind() function to compare our confusionMatrix() output side-by-side.
First, we can compare the overall performance on the training and testing sets.
cbind(Training = SVM_train_conf$overall,
Testing = SVM_test_conf$overall)
## Training Testing
## Accuracy 9.555556e-01 0.940000000
## Kappa 8.351350e-01 0.764890282
## AccuracyLower 9.306664e-01 0.873970065
## AccuracyUpper 9.734499e-01 0.977665114
## AccuracyNull 8.320988e-01 0.840000000
## AccuracyPValue 1.703459e-14 0.002204197
## McnemarPValue 2.385928e-01 0.683091398
Next, we can compare the class level performance on the training and testing sets.
cbind(Training = SVM_train_conf$byClass,
Testing = SVM_test_conf$byClass)
## Training Testing
## Sensitivity 0.8235294 0.7500000
## Specificity 0.9821958 0.9761905
## Pos Pred Value 0.9032258 0.8571429
## Neg Pred Value 0.9650146 0.9534884
## Precision 0.9032258 0.8571429
## Recall 0.8235294 0.7500000
## F1 0.8615385 0.8000000
## Prevalence 0.1679012 0.1600000
## Detection Rate 0.1382716 0.1200000
## Detection Prevalence 0.1530864 0.1400000
## Balanced Accuracy 0.9028626 0.8630952
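If we want a single number summarizing the train-test gap, we can subtract the testing accuracy from the training accuracy (a minimal check; output not shown):
# Difference in Accuracy between training and testing performance
SVM_train_conf$overall["Accuracy"] - SVM_test_conf$overall["Accuracy"]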
We use the train() function in the caret package to tune our hyperparameter using repeated k-fold cross validation. In the case of Linear SVM, the hyperparameter that we want to tune is C, the cost. We will perform a random search for the optimal C value.
First, we set up a trainControl object (named ctrl) using the trainControl() function in the caret package. We specify that we want to perform 5-fold cross validation, repeated 3 times. We use this object as input to the trControl argument in the train() function below. We set search = "random" to perform a random search.
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
search = "random")
Next, we initialize a random seed for our cross validation, since we will randomly resample the data.
set.seed(831)
Then, we use the train() function to train the SVM model using 5-fold cross validation (repeated 3 times). We set tuneLength = 10 to try 10 random values of the hyperparameter, C. Setting method = "svmLinear2" uses the e1071 package to create an SVM model with a linear kernel, so arguments from the svm() function can also be passed.
SVMFit <- train(form = median_val ~ ., # use all other variables to predict median_val
data = train, # train using train dataframe
method = "svmLinear2", # linear SVM using e1071
scale = FALSE, # no standardization
trControl = ctrl, # sets up search and resampling
tuneLength = 10) # try 10 random C values
We can view the results of our cross validation across random C values for Accuracy and Kappa. The output will also identify the optimal C.
SVMFit
## Support Vector Machines with Linear Kernel
##
## 405 samples
## 12 predictor
## 2 classes: 'Below', 'Above'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 323, 324, 325, 324, 324, 325, ...
## Resampling results across tuning parameters:
##
## cost Accuracy Kappa
## 0.0356825 0.9226473 0.6977734
## 0.1161993 0.9390994 0.7620859
## 0.1180425 0.9390994 0.7620859
## 0.6949474 0.9358072 0.7548144
## 0.8913068 0.9374433 0.7617045
## 18.8313654 0.9300351 0.7340828
## 36.1701293 0.9275665 0.7277662
## 75.8425704 0.9014504 0.6491339
## 216.9404272 0.9259505 0.7192157
## 943.7223518 0.9028430 0.6525476
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cost = 0.1161993.
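The selected hyperparameter is also stored in the bestTune component of the train object (SVMFit$bestTune). As an alternative to the random search, we could supply an explicit grid of candidate cost values through the tuneGrid argument. A minimal sketch (the grid values below are illustrative, not from the lesson):
# Hypothetical grid search over cost (not run)
grid <- expand.grid(cost = c(0.01, 0.1, 1, 10, 100))
ctrl_grid <- trainControl(method = "repeatedcv",
                          number = 5,
                          repeats = 3,
                          search = "grid")
set.seed(831)
SVMFit_grid <- train(form = median_val ~ ., # predict median_val from all other variables
                     data = train,
                     method = "svmLinear2",
                     scale = FALSE,
                     trControl = ctrl_grid,
                     tuneGrid = grid) # explicit grid instead of random search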
We can use our best tuned model to predict our training data.
tune.tr.preds <- predict(object = SVMFit,
newdata = train)
Then, we can use the confusionMatrix() function to obtain performance information.
SVM_trtune_conf <- confusionMatrix(data = tune.tr.preds, # predictions
reference = train$median_val, # actual
positive = "Above",
mode = "everything")
SVM_trtune_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below Above
## Below 333 16
## Above 4 52
##
## Accuracy : 0.9506
## 95% CI : (0.9248, 0.9696)
## No Information Rate : 0.8321
## P-Value [Acc > NIR] : 2.822e-13
##
## Kappa : 0.8099
##
## Mcnemar's Test P-Value : 0.01391
##
## Sensitivity : 0.7647
## Specificity : 0.9881
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.9542
## Precision : 0.9286
## Recall : 0.7647
## F1 : 0.8387
## Prevalence : 0.1679
## Detection Rate : 0.1284
## Detection Prevalence : 0.1383
## Balanced Accuracy : 0.8764
##
## 'Positive' Class : Above
##
We can use our best tuned model to predict our testing data.
tune.te.preds <- predict(object = SVMFit,
newdata = test)
Then, we can use the confusionMatrix() function to obtain performance information.
SVM_tetune_conf <- confusionMatrix(data = tune.te.preds, # predictions
reference = test$median_val, # actual
positive = "Above",
mode = "everything")
SVM_tetune_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below Above
## Below 83 4
## Above 1 12
##
## Accuracy : 0.95
## 95% CI : (0.8872, 0.9836)
## No Information Rate : 0.84
## P-Value [Acc > NIR] : 0.0006792
##
## Kappa : 0.7987
##
## Mcnemar's Test P-Value : 0.3710934
##
## Sensitivity : 0.7500
## Specificity : 0.9881
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.9540
## Precision : 0.9231
## Recall : 0.7500
## F1 : 0.8276
## Prevalence : 0.1600
## Detection Rate : 0.1200
## Detection Prevalence : 0.1300
## Balanced Accuracy : 0.8690
##
## 'Positive' Class : Above
##
As shown, we have high overall model performance based on Accuracy and Kappa, but class-level sensitivity/recall is comparatively low. The F1 score, which balances precision and recall, is somewhat higher.
To assess whether the model is balanced, underfitting, or overfitting, we compare performance on the training and testing sets. We can use the cbind() function to compare the confusionMatrix() output side-by-side.
First, we can compare the overall performance on the training and testing sets.
cbind(Training = SVM_trtune_conf$overall,
Testing = SVM_tetune_conf$overall)
## Training Testing
## Accuracy 9.506173e-01 0.9500000000
## Kappa 8.098770e-01 0.7987117552
## AccuracyLower 9.247605e-01 0.8871650889
## AccuracyUpper 9.695784e-01 0.9835681208
## AccuracyNull 8.320988e-01 0.8400000000
## AccuracyPValue 2.822185e-13 0.0006792027
## McnemarPValue 1.390630e-02 0.3710933695
Next, we can compare the class-level performance on the training and testing sets.
cbind(Training = SVM_trtune_conf$byClass,
Testing = SVM_tetune_conf$byClass)
## Training Testing
## Sensitivity 0.7647059 0.7500000
## Specificity 0.9881306 0.9880952
## Pos Pred Value 0.9285714 0.9230769
## Neg Pred Value 0.9541547 0.9540230
## Precision 0.9285714 0.9230769
## Recall 0.7647059 0.7500000
## F1 0.8387097 0.8275862
## Prevalence 0.1679012 0.1600000
## Detection Rate 0.1283951 0.1200000
## Detection Prevalence 0.1382716 0.1300000
## Balanced Accuracy 0.8764182 0.8690476
As shown, we have similar performance on our training and testing sets. For this reason, we would conclude that the model is balanced.