In this lesson, we will use the DescTools, caret, and e1071 packages. Since no new packages are needed, we do not need to install anything; we simply load the packages for use in the session.
library(caret)
library(DescTools)
library(e1071)
In the lesson that follows, we will use the BostonHousing.csv file. The well-known Boston Housing dataset contains data about census tracts in Boston. The variable of interest is median_val, which indicates whether the median value of occupied homes in a tract is greater than (Above) or less than (Below) $30,000. The census bureau wants to build a model to predict the median_val variable for new census tracts.
The variables include:
crim: per capita crime rate by town
zn: proportion of residential land zoned for large lots
indus: proportion of non-retail business acres per town
chas: Charles River indicator (1 if the tract bounds the river, 0 otherwise)
nox: nitric oxides concentration
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted distances to Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property tax rate
ptratio: pupil-teacher ratio by town
b: a transformation of the proportion of Black residents by town
median_val: whether the median home value is Above or Below $30,000 (the target variable)
We use the read.csv() function to import the CSV file into R as a dataframe named BH. We set stringsAsFactors = FALSE to keep any character columns as-is.
BH <- read.csv(file = "BostonHousing.csv",
stringsAsFactors = FALSE)
First, we can obtain high-level information about the BH dataframe using the Abstract() function from the DescTools package, to look at the variable types and to check for missing (NA) values.
Abstract(BH)
## ------------------------------------------------------------------------------
## BH
##
## data frame: 505 obs. of 13 variables
## 505 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 crim numeric .
## 2 zn numeric .
## 3 indus numeric .
## 4 chas integer .
## 5 nox numeric .
## 6 rm numeric .
## 7 age numeric .
## 8 dis numeric .
## 9 rad integer .
## 10 tax integer .
## 11 ptratio numeric .
## 12 b numeric .
## 13 median_val character .
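As a quick supplementary check, we could also count missing values per column using base R (a minimal sketch; output not shown):
# Count missing (NA) values in each column of BH
colSums(is.na(BH))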
Next, we convert the target class variable that we want to predict, median_val, to a nominal factor variable. Note: by using the levels argument, we map Below to the first level (1) and Above to the second level (2).
BH$median_val <- factor(x = BH$median_val,
levels = c("Below", "Above"))
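We can confirm the level ordering and class counts (output not shown):
levels(BH$median_val) # level order: "Below" then "Above"
table(BH$median_val)  # counts per class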
We can plot the distribution of our output (Y) variable using a barplot, which is the default plot for the plot() function when applied to factor variables. As shown, more census tracts are Below the median value than Above it.
plot(BH$median_val,
main = "Median Value")
All of our potential predictors are numeric, but based on the data description, chas is categorical (binary) and rad could either be treated as categorical or numeric. We will leave chas as is (since it is already binary), but convert rad to a nominal variable (since we will need to binarize it and will not need to preserve the ordering).
BH$rad <- factor(x = BH$rad)
For Support Vector Machines (SVM), keeping irrelevant and redundant variables is acceptable, but we do need to handle missing values, binarize categorical variables, and rescale numerical variables.
First, we check for missing values. As shown in the Abstract() output above, there are no missing (NA) values (100% complete cases), so we do not need to remove rows (e.g., with na.omit()) or perform imputation. We can also confirm this visually with the PlotMiss() function from the DescTools package.
PlotMiss(x = BH)
Next, we rescale the numeric variables using min-max normalization. We use the preProcess() function from the caret package with method = "range" to create the transformation object (cen_mm), and then apply it to the data with the predict() function, saving the result as BH_mm.
cen_mm <- preProcess(x = BH,
method = "range")
BH_mm <- predict(object = cen_mm,
newdata = BH)
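To verify the rescaling, we can check that the numeric columns of BH_mm now range from 0 to 1 (output not shown):
# Range of each numeric column after min-max normalization
sapply(BH_mm[sapply(BH_mm, is.numeric)], range)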
To binarize categorical variables, we can use the class2ind() function from the caret package for categorical variables with 2 class levels, and the dummyVars() function from the caret package (together with the predict() function) for categorical variables with more than 2 class levels. We binarize the rad variable and create a matrix (cats_dums) of the dummy variables.
cats <- dummyVars(formula = ~ rad,
data = BH_mm)
cats_dums <- predict(object = cats,
newdata = BH_mm)
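Although we do not need it here (chas is already coded 0/1), the class2ind() function mentioned above handles the 2-level case. A minimal, hypothetical sketch (not used in the rest of the lesson):
# Hypothetical: binarize a 2-level factor with class2ind()
# (chas is already 0/1 in BH_mm, so this step is not needed)
chas_dums <- class2ind(x = factor(BH_mm$chas), drop2nd = TRUE)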
Next, we combine the binarized variables (cats_dums) with the rest of the data (the transformed numeric variables, the factor target variable, and the binary chas variable), dropping the original rad column. We save the result as a dataframe named BH_mm_dum.
BH_mm_dum <- data.frame(BH_mm[ ,!names(BH) %in% "rad"],
cats_dums)
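We can confirm that the rad column has been replaced by its dummy columns (a quick check; output not shown):
# Column names of the combined dataframe
names(BH_mm_dum)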
We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (median_val). Since the function takes a random sample, we initialize a random seed first for reproducibility.
set.seed(831)
sub <- createDataPartition(y = BH$median_val,
p = 0.80,
list = FALSE)
Next, we subset the rows of the BH dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.
train <- BH[sub, ]
test <- BH[-sub, ]
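We can verify that the class distribution of median_val is (approximately) preserved in both sets (output not shown):
# Compare the proportion of Below/Above in the training and testing sets
prop.table(table(train$median_val))
prop.table(table(test$median_val))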
We build the basic support vector machines model using the svm() function in the e1071 package.
Since we are performing classification, we specify method = "C-classification". Note: setting scale = TRUE rescales the data (this is the default and can be omitted); the default rescaling is standardization. We will set this to FALSE and use the min-max normalized data. The default kernel is "radial"; other available kernels include "polynomial", "linear", and "sigmoid". By default, the cost hyperparameter is set to 1.
We will fit a linear support vector machine.
set.seed(831) # initialize random seed
svm_mod <- svm(formula = median_val ~ ., # use all other variables to predict median_val
data = train, # use train data
method = "C-classification", # classification
kernel = "linear", # linear kernel
scale = FALSE) # no standardization
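For reference, other kernels are specified in the same way. A minimal sketch of a radial-kernel fit (not evaluated in this lesson; results would differ from the linear model):
# Hypothetical radial-kernel SVM (not run)
set.seed(831)
svm_rbf <- svm(formula = median_val ~ .,
               data = train,
               method = "C-classification",
               kernel = "radial",
               scale = FALSE)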
We can view a summary of the model by using the summary() function.
summary(svm_mod)
##
## Call:
## svm(formula = median_val ~ ., data = train, method = "C-classification",
## kernel = "linear", scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 69
##
## ( 36 33 )
##
##
## Number of Classes: 2
##
## Levels:
## Below Above
To assess the goodness of fit of the model, we compare the training and testing performance. For this reason, we first need to assess how well the model does on the training sample. The fitted component of the object created by the svm() function (svm_mod) contains the training predictions, so we can use it directly in the confusionMatrix() function (and don't need to call predict()).
SVM_train_conf <- confusionMatrix(data = svm_mod$fitted, # predictions
reference = train$median_val, # actual
positive = "Above",
mode = "everything")
SVM_train_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below Above
## Below 331 12
## Above 6 56
##
## Accuracy : 0.9556
## 95% CI : (0.9307, 0.9734)
## No Information Rate : 0.8321
## P-Value [Acc > NIR] : 1.703e-14
##
## Kappa : 0.8351
##
## Mcnemar's Test P-Value : 0.2386
##
## Sensitivity : 0.8235
## Specificity : 0.9822
## Pos Pred Value : 0.9032
## Neg Pred Value : 0.9650
## Precision : 0.9032
## Recall : 0.8235
## F1 : 0.8615
## Prevalence : 0.1679
## Detection Rate : 0.1383
## Detection Prevalence : 0.1531
## Balanced Accuracy : 0.9029
##
## 'Positive' Class : Above
##
To assess model performance, we focus on the performance of the model applied to the testing set. Next, we use the predict() function to obtain the predicted values for the testing data based on our SVM model.
lin.te.preds <- predict(object = svm_mod,
newdata = test,
type = "class")
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model applied to the testing dataset (test).
SVM_test_conf <- confusionMatrix(data = lin.te.preds, # predictions
reference = test$median_val, # actual
positive = "Above",
mode = "everything")
SVM_test_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below Above
## Below 82 4
## Above 2 12
##
## Accuracy : 0.94
## 95% CI : (0.874, 0.9777)
## No Information Rate : 0.84
## P-Value [Acc > NIR] : 0.002204
##
## Kappa : 0.7649
##
## Mcnemar's Test P-Value : 0.683091
##
## Sensitivity : 0.7500
## Specificity : 0.9762
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.9535
## Precision : 0.8571
## Recall : 0.7500
## F1 : 0.8000
## Prevalence : 0.1600
## Detection Rate : 0.1200
## Detection Prevalence : 0.1400
## Balanced Accuracy : 0.8631
##
## 'Positive' Class : Above
##
To assess the model goodness of fit, we want to know if the model is balanced, underfitting or overfitting. We compare the performance on the training and testing sets. We can use the cbind() function to compare our confusionMatrix() output side-by-side.
First, we can compare the overall performance on the training and testing sets.
cbind(Training = SVM_train_conf$overall,
Testing = SVM_test_conf$overall)
## Training Testing
## Accuracy 9.555556e-01 0.940000000
## Kappa 8.351350e-01 0.764890282
## AccuracyLower 9.306664e-01 0.873970065
## AccuracyUpper 9.734499e-01 0.977665114
## AccuracyNull 8.320988e-01 0.840000000
## AccuracyPValue 1.703459e-14 0.002204197
## McnemarPValue 2.385928e-01 0.683091398
Next, we can compare the class level performance on the training and testing sets.
cbind(Training = SVM_train_conf$byClass,
Testing = SVM_test_conf$byClass)
## Training Testing
## Sensitivity 0.8235294 0.7500000
## Specificity 0.9821958 0.9761905
## Pos Pred Value 0.9032258 0.8571429
## Neg Pred Value 0.9650146 0.9534884
## Precision 0.9032258 0.8571429
## Recall 0.8235294 0.7500000
## F1 0.8615385 0.8000000
## Prevalence 0.1679012 0.1600000
## Detection Rate 0.1382716 0.1200000
## Detection Prevalence 0.1530864 0.1400000
## Balanced Accuracy 0.9028626 0.8630952
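If we want a single number summarizing the train-test gap, we can subtract the testing accuracy from the training accuracy (a minimal check; output not shown):
# Difference in Accuracy between training and testing performance
SVM_train_conf$overall["Accuracy"] - SVM_test_conf$overall["Accuracy"]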
We use the train() function in the caret package to tune our hyperparameter using repeated k-fold cross validation. In the case of Linear SVM, the hyperparameter that we want to tune is C, the cost. We will perform a random search for the optimal C value.
First, we set up a trainControl object (named ctrl) using the trainControl() function in the caret package. We specify that we want to perform 5-fold cross validation, repeated 3 times. We use this object as input to the trControl argument in the train() function below. We set search = "random" to perform a random search.
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
search = "random")
Next, we initialize a random seed for our cross validation, since we will randomly resample the data.
set.seed(831)
Then, we use the train() function to train the SVM model using 5-fold cross validation (repeated 3 times). We set tuneLength = 10 to try 10 random values of the hyperparameter, C. Setting method = "svmLinear2" uses the e1071 package to create an SVM model with a linear kernel, so arguments from the svm() function can also be passed.
SVMFit <- train(form = median_val ~ ., # use all other variables to predict median_val
data = train, # train using train dataframe
method = "svmLinear2", # linear SVM using e1071
scale = FALSE, # no standardization
trControl = ctrl, # sets up search and resampling
tuneLength = 10) # try 10 random C values
We can view the results of our cross validation across random C values for Accuracy and Kappa. The output will also identify the optimal C.
SVMFit
## Support Vector Machines with Linear Kernel
##
## 405 samples
## 12 predictor
## 2 classes: 'Below', 'Above'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 323, 324, 325, 324, 324, 325, ...
## Resampling results across tuning parameters:
##
## cost Accuracy Kappa
## 0.0356825 0.9226473 0.6977734
## 0.1161993 0.9390994 0.7620859
## 0.1180425 0.9390994 0.7620859
## 0.6949474 0.9358072 0.7548144
## 0.8913068 0.9374433 0.7617045
## 18.8313654 0.9300351 0.7340828
## 36.1701293 0.9275665 0.7277662
## 75.8425704 0.9014504 0.6491339
## 216.9404272 0.9259505 0.7192157
## 943.7223518 0.9028430 0.6525476
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cost = 0.1161993.
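The selected hyperparameter is also stored in the bestTune component of the train object (SVMFit$bestTune). As an alternative to the random search, we could supply an explicit grid of candidate cost values through the tuneGrid argument. A minimal sketch (the grid values below are illustrative, not from the lesson):
# Hypothetical grid search over cost (not run)
grid <- expand.grid(cost = c(0.01, 0.1, 1, 10, 100))
ctrl_grid <- trainControl(method = "repeatedcv",
                          number = 5,
                          repeats = 3,
                          search = "grid")
set.seed(831)
SVMFit_grid <- train(form = median_val ~ ., # predict median_val from all other variables
                     data = train,
                     method = "svmLinear2",
                     scale = FALSE,
                     trControl = ctrl_grid,
                     tuneGrid = grid) # explicit grid instead of random search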
We can use our best tuned model to predict our training data.
tune.tr.preds <- predict(object = SVMFit,
newdata = train)
Then, we can use the confusionMatrix() function to obtain performance information.
SVM_trtune_conf <- confusionMatrix(data = tune.tr.preds, # predictions
reference = train$median_val, # actual
positive = "Above",
mode = "everything")
SVM_trtune_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below Above
## Below 333 16
## Above 4 52
##
## Accuracy : 0.9506
## 95% CI : (0.9248, 0.9696)
## No Information Rate : 0.8321
## P-Value [Acc > NIR] : 2.822e-13
##
## Kappa : 0.8099
##
## Mcnemar's Test P-Value : 0.01391
##
## Sensitivity : 0.7647
## Specificity : 0.9881
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.9542
## Precision : 0.9286
## Recall : 0.7647
## F1 : 0.8387
## Prevalence : 0.1679
## Detection Rate : 0.1284
## Detection Prevalence : 0.1383
## Balanced Accuracy : 0.8764
##
## 'Positive' Class : Above
##
We can use our best tuned model to predict our testing data.
tune.te.preds <- predict(object = SVMFit,
newdata = test)
Then, we can use the confusionMatrix() function to obtain performance information.
SVM_tetune_conf <- confusionMatrix(data = tune.te.preds, # predictions
reference = test$median_val, # actual
positive = "Above",
mode = "everything")
SVM_tetune_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Below Above
## Below 83 4
## Above 1 12
##
## Accuracy : 0.95
## 95% CI : (0.8872, 0.9836)
## No Information Rate : 0.84
## P-Value [Acc > NIR] : 0.0006792
##
## Kappa : 0.7987
##
## Mcnemar's Test P-Value : 0.3710934
##
## Sensitivity : 0.7500
## Specificity : 0.9881
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.9540
## Precision : 0.9231
## Recall : 0.7500
## F1 : 0.8276
## Prevalence : 0.1600
## Detection Rate : 0.1200
## Detection Prevalence : 0.1300
## Balanced Accuracy : 0.8690
##
## 'Positive' Class : Above
##
As shown, we have high overall model performance based on Accuracy and Kappa, but class-level sensitivity/recall is comparatively low. The F1 score, which balances precision and recall, is somewhat higher.
To assess whether the model is balanced, underfitting, or overfitting, we compare performance on the training and testing sets. We can use the cbind() function to compare the confusionMatrix() output side-by-side.
First, we can compare the overall performance on the training and testing sets.
cbind(Training = SVM_trtune_conf$overall,
Testing = SVM_tetune_conf$overall)
## Training Testing
## Accuracy 9.506173e-01 0.9500000000
## Kappa 8.098770e-01 0.7987117552
## AccuracyLower 9.247605e-01 0.8871650889
## AccuracyUpper 9.695784e-01 0.9835681208
## AccuracyNull 8.320988e-01 0.8400000000
## AccuracyPValue 2.822185e-13 0.0006792027
## McnemarPValue 1.390630e-02 0.3710933695
Next, we can compare the class-level performance on the training and testing sets.
cbind(Training = SVM_trtune_conf$byClass,
Testing = SVM_tetune_conf$byClass)
## Training Testing
## Sensitivity 0.7647059 0.7500000
## Specificity 0.9881306 0.9880952
## Pos Pred Value 0.9285714 0.9230769
## Neg Pred Value 0.9541547 0.9540230
## Precision 0.9285714 0.9230769
## Recall 0.7647059 0.7500000
## F1 0.8387097 0.8275862
## Prevalence 0.1679012 0.1600000
## Detection Rate 0.1283951 0.1200000
## Detection Prevalence 0.1382716 0.1300000
## Balanced Accuracy 0.8764182 0.8690476
As shown, we have similar performance on our training and testing sets. For this reason, we would conclude that the model is balanced.