Support Vector Machines
Forecast Bike Rental Demand
You are working as a data scientist for the government of Washington, D.C. The city currently operates a bike sharing system: people can rent a bike at one location and return it at another. You are given historical usage patterns, together with weather data, in the file bike.csv. You are asked to forecast bike rental demand in the Capital Bikeshare program.
Data Source: from Kaggle at https://www.kaggle.com/c/bike-sharing-demand
Data Dictionary
This dataset contains the following columns:
Variable | Description | Data Type | Rules/Constraints |
---|---|---|---|
datetime | Date and time of the observation | Timestamp | Must be unique and in ISO 8601 format (YYYY-MM-DD HH:MM:SS). |
season | Season (1: Spring, 2: Summer, 3: Fall, 4: Winter) | Categorical | Must be an integer between 1 and 4. |
holiday | Whether the day is a holiday (0: No, 1: Yes) | Binary | Must be 0 or 1. |
workingday | Whether the day is a working day (0: No, 1: Yes) | Binary | Must be 0 or 1. |
weather | Weather condition (1 to 4, where 1: Clear, 4: Extreme) | Categorical | Must be an integer between 1 and 4. |
temp | Temperature in Celsius | Continuous | Must be between 0 and 41. |
atemp | “Feels like” temperature in Celsius | Continuous | Must be between 0 and 45.455. |
humidity | Relative humidity (%) | Continuous | Must be between 0 and 100. |
windspeed | Wind speed in km/h | Continuous | Must be non-negative, with a maximum of approximately 56.9979. |
casual | Number of casual (non-registered) users | Integer | Must be non-negative, with a maximum of 367. |
registered | Number of registered users | Integer | Must be non-negative, with a maximum of 886. |
count | Total number of bike rentals | Integer | Must equal casual + registered, and be non-negative. |
Question 1
Load the dataset bike.csv into memory. Convert holiday to a factor using the factor() function. Then split the data into a training set containing 2/3 of the original data and a test set containing the remaining 1/3.
Read the dataset into memory
Display the dimensions of the data frame (number of rows and columns)
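The loading chunk itself is not echoed in this report, so the following is a minimal sketch of how the data could have been read and sized, assuming bike.csv sits in the working directory and the data frame is named Bike.df (the name used in the later chunks):
# Assumption: bike.csv is in the current working directory
Bike.df <- read.csv("bike.csv")
# Number of rows and columns
dim(Bike.df)
The column names and first six rows displayed below would then come from names(Bike.df) and head(Bike.df).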
## [1] 10886 12
Display the column names of the data frame
## [1] "datetime" "season" "holiday" "workingday" "weather"
## [6] "temp" "atemp" "humidity" "windspeed" "casual"
## [11] "registered" "count"
Display the first six rows of the data frame to understand its structure
## datetime season holiday workingday weather temp atemp humidity
## 1 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81
## 2 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80
## 3 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80
## 4 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75
## 5 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75
## 6 2011-01-01 05:00:00 1 0 0 2 9.84 12.880 75
## windspeed casual registered count
## 1 0.0000 3 13 16
## 2 0.0000 8 32 40
## 3 0.0000 5 27 32
## 4 0.0000 3 10 13
## 5 0.0000 0 1 1
## 6 6.0032 0 1 1
Convert categorical variables to factors
# Recode the integer codes as labelled factors
Bike.df$season <- factor(Bike.df$season,
                         levels = c(1, 2, 3, 4),
                         labels = c("Spring", "Summer", "Fall", "Winter"))
Bike.df$holiday <- factor(Bike.df$holiday,
                          levels = c(0, 1),
                          labels = c("Non-Holiday", "Holiday"))
Bike.df$workingday <- factor(Bike.df$workingday,
                             levels = c(0, 1),
                             labels = c("Non-Workingday", "Workingday"))
Bike.df$weather <- factor(Bike.df$weather,
                          levels = c(1, 2, 3, 4),
                          labels = c("Clear", "Misty_cloudy",
                                     "Light_snow", "Heavy_rain"))
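As a quick sanity check (not part of the original write-up), the recoded columns could be inspected to confirm they are now factors with the labels above:
# Verify the four recoded columns are factors with the expected labels
str(Bike.df[, c("season", "holiday", "workingday", "weather")])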
Split the data into a training set (2/3) and test set (1/3)
# Set seed for reproducibility
set.seed(1)
# Create train index that will contain 2/3 of the dataset
trainIdx <- sample(1:nrow(Bike.df), size = 2/3 * nrow(Bike.df))
# Create a training dataset out of "Bike.df"
trainData <- Bike.df[trainIdx, ]
# Create a testing dataset from the remaining dataset of "Bike.df"
testData <- Bike.df[-trainIdx, ]
Check the dimensions of the training and testing datasets
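The chunk behind the two lines below is hidden; a sketch of the presumed calls:
# Rows and columns in each split
dim(trainData)
dim(testData)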
## [1] 7257 12
## [1] 3629 12
Question 2
Build a support vector machine model.
Section A
The response is holiday and the predictors are season, workingday, casual, and registered. Please use the svm() function with a radial kernel, gamma = 10, and cost = 100.
# svm() is provided by the e1071 package
library(e1071)
svm.model <- svm(holiday ~ season + workingday + casual + registered,
                 data = trainData,
                 kernel = "radial",
                 gamma = 10,
                 cost = 100)
summary(svm.model)
##
## Call:
## svm(formula = holiday ~ season + workingday + casual + registered,
## data = trainData, kernel = "radial", gamma = 10, cost = 100)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 100
##
## Number of Support Vectors: 633
##
## ( 428 205 )
##
##
## Number of Classes: 2
##
## Levels:
## Non-Holiday Holiday
The SVM model uses a radial kernel for a binary classification task (Holiday vs. Non-Holiday), with a gamma of 10 and cost of 100. These parameters increase model complexity, potentially improving fit but also raising the risk of overfitting.
The model relies on 633 support vectors (428 from one class, 205 from the other), indicating that the data may not be easily separable. This could also reflect some class imbalance.
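One way to probe the overfitting concern, not done in the original analysis, is to compare in-sample accuracy with the test-set results in Section F; a large gap would indicate that the high gamma and cost are memorizing the training data:
# Hypothetical check: accuracy on the data the model was trained on
train.pred <- predict(svm.model, newdata = trainData)
mean(train.pred == trainData$holiday)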
Section B
Perform a grid search to find the best model, using candidate cost values 1, 10, 50, and 100 and candidate gamma values 1, 3, and 5, with the radial kernel and the training dataset.
tuned.results <- tune(svm, holiday ~ season + workingday + casual + registered,
data = trainData, kernel = "radial",
ranges = list(cost = c(1, 10, 50, 100),
gamma = c(1, 3, 5)
))
summary(tuned.results)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 100 1
##
## - best performance: 0.02797245
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1 1 0.02879909 0.003578686
## 2 10 1 0.02866154 0.003164643
## 3 50 1 0.02824812 0.003627397
## 4 100 1 0.02797245 0.004158739
## 5 1 3 0.02921269 0.003096545
## 6 10 3 0.02893702 0.003947184
## 7 50 3 0.03003971 0.004246814
## 8 100 3 0.03003952 0.004041429
## 9 1 5 0.02866154 0.003230569
## 10 10 5 0.03045293 0.004517214
## 11 50 5 0.03059048 0.002954565
## 12 100 5 0.03169260 0.002893269
Models with gamma = 1 generally performed best: with gamma = 1, increasing the cost from 1 to 100 steadily lowered the cross-validation error, and the lowest error (about 0.028) was achieved at cost = 100 and gamma = 1. Larger gamma values mostly increased the error, consistent with overfitting and poorer generalization.
Although the winning configuration shows slightly higher error dispersion than some alternatives, the differences are small and the model appears stable. It offers a reasonable balance between accuracy and generalization, so the sensible next step is to refit the final SVM with these parameters and validate it on the test set.
Section C
Print out the model results. What are the best model parameters?
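The printing chunk is hidden. Assuming the tuned object from Section B is still named tuned.results, the best parameters and the refitted best model could be extracted like this:
# Best cost/gamma combination and its cross-validation error
tuned.results$best.parameters
tuned.results$best.performance
# The model refit on the full training set with those parameters
best.svm <- tuned.results$best.model
summary(best.svm)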
##
## Call:
## best.tune(METHOD = svm, train.x = holiday ~ season + workingday +
## casual + registered, data = trainData, ranges = list(cost = c(1,
## 10, 50, 100), gamma = c(1, 3, 5)), kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 100
##
## Number of Support Vectors: 645
##
## ( 440 205 )
##
##
## Number of Classes: 2
##
## Levels:
## Non-Holiday Holiday
The SVM model was designed to predict holidays using features like season, workingday, casual, and registered, employing a radial basis function (RBF) kernel to handle non-linear patterns. Through grid search tuning over various cost and gamma values, the optimal parameters were found to be cost = 100 and gamma = 1, resulting in the lowest cross-validation error.
The final model uses 645 support vectors (440 for Non-Holiday, 205 for Holiday), indicating a complex decision boundary. Because holidays make up only a small fraction of the observations, this pronounced class imbalance may bias predictions toward the majority class.
Section D
Forecast holiday using the test dataset and the best model found in Section C.
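The prediction chunk is not echoed; a sketch of how the forecast could be produced, assuming best.svm is the best model extracted in Section C (pred.holiday is a hypothetical name):
# Predict holiday status for the held-out test set
pred.holiday <- predict(best.svm, newdata = testData)
# Tabulate the predicted classes
table(pred.holiday)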
## Non-Holiday Holiday
## 3620 9
The predictions are heavily skewed toward "Non-Holiday": only 9 of the 3,629 test observations are predicted to be holidays.
Section E
Get the true observations of holiday in the test dataset.
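A sketch of the presumed call behind the table below:
# Distribution of the actual holiday labels in the test set
table(testData$holiday)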
## Non-Holiday Holiday
## 3532 97
The summary suggests that most records (3532 out of 3629) are from “Non-Holiday” periods, while only a small subset of records (97) correspond to Holidays. This imbalance in the data might influence modeling results.
Section F
Compute the test error by constructing the confusion matrix. Is it a good model?
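The output below matches caret::confusionMatrix(); a sketch of the presumed call, reusing the hypothetical pred.holiday vector from Section D:
# confusionMatrix() comes from the caret package; the first factor level
# ("Non-Holiday") is treated as the positive class by default
library(caret)
confusionMatrix(pred.holiday, testData$holiday)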
## Confusion Matrix and Statistics
##
## Reference
## Prediction Non-Holiday Holiday
## Non-Holiday 3530 90
## Holiday 2 7
##
## Accuracy : 0.9746
## 95% CI : (0.969, 0.9795)
## No Information Rate : 0.9733
## P-Value [Acc > NIR] : 0.3262
##
## Kappa : 0.1281
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99943
## Specificity : 0.07216
## Pos Pred Value : 0.97514
## Neg Pred Value : 0.77778
## Prevalence : 0.97327
## Detection Rate : 0.97272
## Detection Prevalence : 0.99752
## Balanced Accuracy : 0.53580
##
## 'Positive' Class : Non-Holiday
##
Test Error = 1 – Accuracy = 1 – 0.9746 = 0.0254
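Equivalently, the test error can be computed directly from the predictions (a sketch reusing the hypothetical pred.holiday vector from Section D):
# Proportion of misclassified test observations: (2 + 90) / 3629 ≈ 0.0254
mean(pred.holiday != testData$holiday)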
The model demonstrates high overall accuracy (97.46%) and excellent sensitivity for identifying Non-Holidays, but it performs poorly in detecting Holidays due to significant class imbalance. With low specificity (7.2%), a poor Kappa score (0.1281), and low balanced accuracy (53.58%), the model is heavily biased toward the majority class. Despite appearing effective at first glance, it fails to reliably identify the minority class (Holidays), making it unsuitable if accurate Holiday prediction is important.