Support Vector Machines

Forecast Bike Rental Demand

You are working as a data scientist for the city government of Washington, D.C. The city currently operates a bike sharing system in which people can rent a bike at one location and return it at another. You are given historical usage patterns with weather data in the CSV file bike.csv, and you are asked to forecast bike rental demand for the Capital Bikeshare program.

Data source: Kaggle, https://www.kaggle.com/c/bike-sharing-demand

Data Dictionary

This dataset contains the following columns:

Variable	Description	Data Type	Rules/Constraints
`datetime`	Date and time of the observation	Timestamp	Must be unique and in ISO 8601 format (YYYY-MM-DD HH:MM:SS).
`season`	Season (1: Spring, 2: Summer, 3: Fall, 4: Winter)	Categorical	Must be an integer between 1 and 4.
`holiday`	Whether the day is a holiday (0: No, 1: Yes)	Binary	Must be 0 or 1.
`workingday`	Whether the day is a working day (0: No, 1: Yes)	Binary	Must be 0 or 1.
`weather`	Weather condition (1 to 4, where 1: Clear, 4: Extreme)	Categorical	Must be an integer between 1 and 4.
`temp`	Temperature in Celsius	Continuous	Must be between 0 and 41.
`atemp`	“Feels like” temperature in Celsius	Continuous	Must be between 0 and 45.455.
`humidity`	Relative humidity (%)	Continuous	Must be between 0 and 100.
`windspeed`	Wind speed in km/h	Continuous	Must be non-negative, and maximum value is approximately 56.9979.
`casual`	Number of casual (non-registered) users	Integer	Must be non-negative, with a maximum of 367.
`registered`	Number of registered users	Integer	Must be non-negative, with a maximum of 886.
`count`	Total number of bike rentals	Integer	Must equal casual + registered, and be non-negative.

Question 1

Load the dataset bike.csv into memory. Convert holiday to a factor using the factor() function. Then split the data into a training set containing 2/3 of the original data and a test set containing the remaining 1/3.

Read the dataset into memory

Bike.df <- read.csv("data/Bike.csv")

Display the dimensions of the data frame (number of rows and columns)

dim(Bike.df)

## [1] 10886    12

Display the column names of the data frame

colnames(Bike.df)

##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"

Display the first six rows of the data frame to understand its structure

head(Bike.df)

##              datetime season holiday workingday weather temp  atemp humidity
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395       81
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635       80
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635       80
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395       75
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395       75
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880       75
##   windspeed casual registered count
## 1    0.0000      3         13    16
## 2    0.0000      8         32    40
## 3    0.0000      5         27    32
## 4    0.0000      3         10    13
## 5    0.0000      0          1     1
## 6    6.0032      0          1     1

Convert categorical variables to factors

# Map each integer code to a descriptive label (labels follow the order of levels)
Bike.df$season <- factor(Bike.df$season,
                         levels = c(1, 2, 3, 4),
                         labels = c("Spring", "Summer", "Fall", "Winter")
)

Bike.df$holiday <- factor(Bike.df$holiday, 
                          levels = c(0,1), 
                          labels = c("Non-Holiday", "Holiday")
)

Bike.df$workingday <- factor(Bike.df$workingday,
                             levels = c(0,1), 
                             labels = c("Non-Workingday", "Workingday")
)

Bike.df$weather <- factor(Bike.df$weather,
                          levels = c(1, 2, 3, 4),
                          labels = c("Clear", "Misty_cloudy",
                                     "Light_snow", "Heavy_rain")
)
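
The way factor() pairs `levels` with `labels` can be seen on a small made-up vector. The sketch below uses toy season codes (not the bike data) and assumes the same label order as the conversion above. It also shows that a code outside `levels` becomes NA, which makes counting NAs a cheap sanity check after conversion.

```r
# Toy integer codes, made up for illustration (not taken from Bike.df)
season_codes <- c(1, 2, 3, 4, 1, 3)

# Each entry of `labels` is paired with the entry of `levels` at the same position
season_factor <- factor(season_codes,
                        levels = c(1, 2, 3, 4),
                        labels = c("Spring", "Summer", "Fall", "Winter"))

print(season_factor)
print(table(season_factor))

# A code that is not listed in `levels` (here 5) is converted to NA
bad_codes <- factor(c(1, 5),
                    levels = c(1, 2, 3, 4),
                    labels = c("Spring", "Summer", "Fall", "Winter"))
n_missing <- sum(is.na(bad_codes))
print(n_missing)
```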

Split the data into a training set (2/3) and test set (1/3)

# Set seed for reproducibility
set.seed(1)

# Create a training index covering 2/3 of the rows (rounded down to a whole number)
trainIdx <- sample(seq_len(nrow(Bike.df)), size = floor(2/3 * nrow(Bike.df)))

# Create the training dataset from "Bike.df"
trainData <- Bike.df[trainIdx, ]

# Create the testing dataset from the remaining rows of "Bike.df"
testData  <- Bike.df[-trainIdx, ]

Check the dimension of training dataset and testing dataset

# Check the dimension of training dataset
dim(trainData)

## [1] 7257   12

# Check the dimension of testing dataset
dim(testData)

## [1] 3629   12
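
The split sizes can be checked by hand. The sketch below does not read the bike data; it only assumes the row count of 10886 reported by dim(Bike.df) and confirms that a 2/3 sample rounded down, plus its complement, gives the two sizes shown above with no shared rows.

```r
n <- 10886                       # row count reported by dim(Bike.df)
train_size <- floor(2/3 * n)     # 2/3 of the rows, rounded down
test_size <- n - train_size      # everything that is not sampled

set.seed(1)
idx <- sample(seq_len(n), size = train_size)

train_rows <- seq_len(n)[idx]    # rows selected for training
test_rows <- seq_len(n)[-idx]    # negative indexing keeps the remaining rows

print(c(train = length(train_rows), test = length(test_rows)))

# The two sets must not overlap and must cover every row exactly once
n_shared <- length(intersect(train_rows, test_rows))
print(n_shared)
```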

Question 2

Build a support vector machine model.

Section A

The response is holiday and the predictors are season, workingday, casual, and registered. Use the svm() function with a radial kernel, gamma = 10, and cost = 100.

# svm() is provided by the e1071 package
library(e1071)

svm.model <- svm(holiday ~ season + workingday + casual + registered,
                 data = trainData,
                 kernel = "radial",
                 gamma = 10,
                 cost = 100)
summary(svm.model)

## 
## Call:
## svm(formula = holiday ~ season + workingday + casual + registered, 
##     data = trainData, kernel = "radial", gamma = 10, cost = 100)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  100 
## 
## Number of Support Vectors:  633
## 
##  ( 428 205 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Non-Holiday Holiday

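The SVM model uses a radial kernel for a binary classification task (Holiday vs. Non-Holiday), with gamma = 10 and cost = 100. A large gamma makes the kernel narrow, so each training point influences only its immediate neighborhood, and a large cost penalizes misclassified training points heavily. Both settings allow a more complex decision boundary, which can improve the fit to the training data but raises the risk of overfitting.

The model relies on 633 support vectors (428 Non-Holiday and 205 Holiday, following the order of the printed levels), about 8.7% of the 7,257 training rows. This suggests that the two classes overlap in the chosen feature space and are not easily separable.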
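
To make the effect of gamma concrete, here is a minimal sketch of the radial kernel in the form documented for e1071, exp(-gamma * |u - v|^2). The function name rbf_kernel and the two points are made up for illustration. Note that svm() scales numeric predictors by default, so in the fitted model the distances are measured in standardized units.

```r
# Radial (RBF) kernel: similarity decays with squared Euclidean distance
# rbf_kernel is a hypothetical helper, not part of e1071
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Two made-up points whose squared distance is 0.5
u <- c(0, 0)
v <- c(0.5, 0.5)

k_gamma1 <- rbf_kernel(u, v, gamma = 1)    # exp(-0.5)
k_gamma10 <- rbf_kernel(u, v, gamma = 10)  # exp(-5)

print(c(gamma_1 = k_gamma1, gamma_10 = k_gamma10))
```

With gamma = 10 the similarity between these two points is close to zero, which is why a large gamma produces a very local, flexible decision boundary.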

Section B

Perform a grid search on the training dataset to find the best radial-kernel model, using candidate cost values 1, 10, 50, and 100 and candidate gamma values 1, 3, and 5.

tuned.results <- tune(svm, holiday ~ season + workingday + casual + registered, 
                     data = trainData, kernel = "radial", 
                     ranges = list(cost = c(1, 10, 50, 100), 
                                   gamma = c(1, 3, 5)
                     ))
summary(tuned.results)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##   100     1
## 
## - best performance: 0.02797245 
## 
## - Detailed performance results:
##    cost gamma      error  dispersion
## 1     1     1 0.02879909 0.003578686
## 2    10     1 0.02866154 0.003164643
## 3    50     1 0.02824812 0.003627397
## 4   100     1 0.02797245 0.004158739
## 5     1     3 0.02921269 0.003096545
## 6    10     3 0.02893702 0.003947184
## 7    50     3 0.03003971 0.004246814
## 8   100     3 0.03003952 0.004041429
## 9     1     5 0.02866154 0.003230569
## 10   10     5 0.03045293 0.004517214
## 11   50     5 0.03059048 0.002954565
## 12  100     5 0.03169260 0.002893269

The lowest cross-validation error (0.02797) occurs at cost = 100 and gamma = 1. With gamma = 1, the error falls steadily as cost rises from 1 to 100. With gamma = 3 or 5, the error at cost = 50 and 100 is higher than at cost = 1, which is consistent with the narrower kernels overfitting as the cost grows. Gamma = 1 is not better at every cost, however: at cost = 1, gamma = 5 (0.02866) has a slightly lower error than gamma = 1 (0.02880).

The differences are small. The errors range from 0.0280 to 0.0317, while the dispersion of each estimate is roughly 0.003 to 0.0045, so the ranking of the configurations is not firmly established, and the selected model has one of the larger dispersions (0.00416). It is still the reasonable choice from this grid, and the next step is to evaluate it on the test set.
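
One way to see how close the configurations are is to compare every error with the best error plus its dispersion. The sketch below retypes the performance table printed above into a data frame (in a live session the same table is available as tuned.results$performances). The one-dispersion threshold is an illustrative rule of thumb chosen here, not something tune() applies.

```r
# Performance table copied from the tune() summary above
perf <- data.frame(
  cost = rep(c(1, 10, 50, 100), times = 3),
  gamma = rep(c(1, 3, 5), each = 4),
  error = c(0.02879909, 0.02866154, 0.02824812, 0.02797245,
            0.02921269, 0.02893702, 0.03003971, 0.03003952,
            0.02866154, 0.03045293, 0.03059048, 0.03169260),
  dispersion = c(0.003578686, 0.003164643, 0.003627397, 0.004158739,
                 0.003096545, 0.003947184, 0.004246814, 0.004041429,
                 0.003230569, 0.004517214, 0.002954565, 0.002893269)
)

# Configuration with the lowest cross-validation error
best <- perf[which.min(perf$error), ]
print(best)

# Illustrative threshold: best error plus the dispersion of the best model
threshold <- best$error + best$dispersion
close_to_best <- perf[perf$error <= threshold, ]
print(nrow(close_to_best))
```

Every configuration in the grid falls under this threshold, which supports reading the grid search result as a weak preference rather than a clear winner.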

Section C

Print out the model results. What are the best model parameters?

best_tuned.model <- tuned.results$best.model
summary(best_tuned.model)

## 
## Call:
## best.tune(METHOD = svm, train.x = holiday ~ season + workingday + 
##     casual + registered, data = trainData, ranges = list(cost = c(1, 
##     10, 50, 100), gamma = c(1, 3, 5)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  100 
## 
## Number of Support Vectors:  645
## 
##  ( 440 205 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Non-Holiday Holiday

The best model predicts holiday from season, workingday, casual, and registered, using a radial basis function (RBF) kernel to handle non-linear patterns. The grid search selected cost = 100 and gamma = 1, the combination with the lowest cross-validation error (0.02797). The summary above prints the cost but not gamma; gamma can be read from best_tuned.model$gamma or from tuned.results$best.parameters.

The final model uses 645 support vectors (440 Non-Holiday, 205 Holiday), which points to a complex decision boundary and substantial overlap between the classes. Holidays are a small minority of the observations, so the model's predictions are likely to favor the majority class.

Section D

Forecast holiday using the test dataset and the best model found in Section C.

predictions <- predict(best_tuned.model, newdata = testData)
summary(predictions)

## Non-Holiday     Holiday 
##        3620           9

The model predicts Holiday for only 9 of the 3,629 test observations (about 0.25%); almost every prediction is “Non-Holiday”.

Section E

Get the true observations of holiday in the test dataset.

# True holiday labels for the test rows
actuals <- testData$holiday
summary(actuals)

## Non-Holiday     Holiday 
##        3532          97

Most test records (3,532 of 3,629, about 97.3%) are Non-Holiday; only 97 (about 2.7%) are Holiday. With classes this imbalanced, a model can reach high accuracy by predicting Non-Holiday almost every time, so accuracy alone is a weak measure of quality here.
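
The baseline implied by this imbalance can be computed directly from the counts above. The sketch below uses only those two numbers and derives the accuracy of a trivial classifier that always predicts the majority class. The variable names are chosen here for illustration.

```r
# Class counts copied from summary(actuals)
actual_counts <- c("Non-Holiday" = 3532, "Holiday" = 97)

# Share of each class in the test set
class_share <- actual_counts / sum(actual_counts)
print(round(class_share, 4))

# Always predicting the majority class is right this often
baseline_accuracy <- max(class_share)
baseline_error <- 1 - baseline_accuracy
print(c(accuracy = baseline_accuracy, error = baseline_error))
```

Any model evaluated on this test set needs to beat an accuracy of about 0.9733 to add value over that trivial rule.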

Section F

Compute the test error by constructing the confusion matrix. Is it a good model?

# confusionMatrix() is provided by the caret package
library(caret)

conf.matrix <- confusionMatrix(predictions, actuals)
print(conf.matrix)

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Non-Holiday Holiday
##   Non-Holiday        3530      90
##   Holiday               2       7
##                                          
##                Accuracy : 0.9746         
##                  95% CI : (0.969, 0.9795)
##     No Information Rate : 0.9733         
##     P-Value [Acc > NIR] : 0.3262         
##                                          
##                   Kappa : 0.1281         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.99943        
##             Specificity : 0.07216        
##          Pos Pred Value : 0.97514        
##          Neg Pred Value : 0.77778        
##              Prevalence : 0.97327        
##          Detection Rate : 0.97272        
##    Detection Prevalence : 0.99752        
##       Balanced Accuracy : 0.53580        
##                                          
##        'Positive' Class : Non-Holiday    
##

Test Error = 1 - Accuracy = 1 - 0.9746 = 0.0254 (92 of the 3,629 test observations are misclassified)

The model is not a good classifier for this task, despite its 97.46% accuracy. That accuracy is essentially the no-information rate (97.33%, the accuracy of always predicting Non-Holiday), and the P-Value [Acc > NIR] of 0.3262 gives no evidence that the model beats this baseline. Because caret treats Non-Holiday as the positive class, the reported sensitivity (99.94%) is the recall for Non-Holiday and the specificity (7.2%) is the recall for Holiday: the model finds only 7 of the 97 actual holidays. The low Kappa (0.1281) and balanced accuracy (53.58%) confirm that the model is heavily biased toward the majority class, which makes it unsuitable when accurate Holiday prediction matters.
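
The headline statistics can be reproduced from the four cells of the confusion matrix. The sketch below retypes the matrix printed above and recomputes accuracy, test error, per-class recall, balanced accuracy, and Cohen's kappa with base R only. The variable names are chosen here for illustration and this is not how caret computes them internally.

```r
# Confusion matrix copied from the output above (rows: prediction, columns: reference)
cm <- matrix(c(3530, 2, 90, 7), nrow = 2,
             dimnames = list(Prediction = c("Non-Holiday", "Holiday"),
                             Reference = c("Non-Holiday", "Holiday")))
n <- sum(cm)

# Overall accuracy and test error
accuracy <- sum(diag(cm)) / n
test_error <- 1 - accuracy

# Recall per class: correct predictions divided by actual members of the class
nonholiday_recall <- cm["Non-Holiday", "Non-Holiday"] / sum(cm[, "Non-Holiday"])
holiday_recall <- cm["Holiday", "Holiday"] / sum(cm[, "Holiday"])
balanced_accuracy <- (nonholiday_recall + holiday_recall) / 2

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

print(round(c(accuracy = accuracy, test_error = test_error,
              holiday_recall = holiday_recall,
              balanced_accuracy = balanced_accuracy,
              kappa = cohen_kappa), 4))
```

The Holiday recall of 7/97 is the number that answers the question most directly: the model misses 90 of the 97 holidays in the test set.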