Tree Based Methods
Forecast Bike Rental Demand
You are working as a data scientist for the government of Washington, D.C., which operates a bike-sharing system: people can rent a bike at one location and return it at another. You are given historical usage patterns, together with weather data, in the file bike.csv. Your task is to forecast bike rental demand in the Capital Bikeshare program.
Data Source: Kaggle, https://www.kaggle.com/c/bike-sharing-demand
Data Dictionary
This dataset contains the following columns:
| Variable | Description | Data Type | Rules/Constraints |
|---|---|---|---|
| datetime | Date and time of the observation | Timestamp | Must be unique and in ISO 8601 format (YYYY-MM-DD HH:MM:SS). |
| season | Season (1: Spring, 2: Summer, 3: Fall, 4: Winter) | Categorical | Must be an integer between 1 and 4. |
| holiday | Whether the day is a holiday (0: No, 1: Yes) | Binary | Must be 0 or 1. |
| workingday | Whether the day is a working day (0: No, 1: Yes) | Binary | Must be 0 or 1. |
| weather | Weather condition (1 to 4, where 1: Clear, 4: Extreme) | Categorical | Must be an integer between 1 and 4. |
| temp | Temperature in degrees Celsius | Continuous | Must be between 0 and 41. |
| atemp | "Feels like" temperature in degrees Celsius | Continuous | Must be between 0 and 45.455. |
| humidity | Relative humidity (%) | Continuous | Must be between 0 and 100. |
| windspeed | Wind speed in km/h | Continuous | Must be non-negative, with a maximum of approximately 56.9979. |
| casual | Number of casual (non-registered) users | Integer | Must be non-negative, with a maximum of 367. |
| registered | Number of registered users | Integer | Must be non-negative, with a maximum of 886. |
| count | Total number of bike rentals | Integer | Must equal casual + registered, and be non-negative. |
Question 1:
Load the dataset bike.csv into memory. Then split the data into a training set containing 2/3 of the original data, with the test set containing the remaining 1/3.
Read the dataset into memory
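The loading chunk itself is not shown in the rendered output; a minimal sketch that would reproduce the results below, assuming bike.csv sits in the working directory and using the object name Bike.df from the later chunks:
library(tree)          # tree(), cv.tree(), prune.tree() for Question 2
library(randomForest)  # randomForest() for Question 3
Bike.df <- read.csv("bike.csv")  # read the dataset into a data frame
dim(Bike.df)    # dimensions: number of rows and columns
names(Bike.df)  # column names
head(Bike.df)   # first six rows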
Display the dimensions of the data frame (number of rows and columns)
## [1] 10886 12
Display the column names of the data frame
## [1] "datetime" "season" "holiday" "workingday" "weather"
## [6] "temp" "atemp" "humidity" "windspeed" "casual"
## [11] "registered" "count"
Display the first six rows of the data frame to understand its structure
## datetime season holiday workingday weather temp atemp humidity
## 1 2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81
## 2 2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80
## 3 2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80
## 4 2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75
## 5 2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75
## 6 2011-01-01 05:00:00 1 0 0 2 9.84 12.880 75
## windspeed casual registered count
## 1 0.0000 3 13 16
## 2 0.0000 8 32 40
## 3 0.0000 5 27 32
## 4 0.0000 3 10 13
## 5 0.0000 0 1 1
## 6 6.0032 0 1 1
Convert categorical variables to factors
Bike.df$season <- factor(Bike.df$season,
levels = c(1, 2, 3, 4),
labels = c("Spring", "Summer", "Fall", "Winter")
)
Bike.df$holiday <- factor(Bike.df$holiday,
levels = c(0,1),
labels = c("No", "Yes")
)
Bike.df$workingday <- factor(Bike.df$workingday,
levels = c(0,1),
labels = c("No", "Yes")
)
Bike.df$weather <- factor(Bike.df$weather,
levels = c(1, 2, 3, 4),
labels = c("Clear", "Misty_cloudy",
"Light_snow", "Heavy_rain")
)
Set seed for reproducibility
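The seed call itself is not displayed; a plausible sketch (the value 123 is an assumption, mirroring the seed set explicitly in Question 3):
set.seed(123)  # assumed seed value; the original chunk does not show it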
Split the data into a training set (2/3) and test set (1/3)
# Sample 2/3 of the row indices; floor() guarantees an integer sample size
trainIdx <- sample(1:nrow(Bike.df), size = floor(2/3 * nrow(Bike.df)))
# Create a training dataset out of "Bike.df"
trainData = Bike.df[trainIdx, ]
# Create a testing dataset from the remaining dataset of "Bike.df"
testData <- Bike.df[-trainIdx, ]
Question 2:
Build a tree model using function tree().
Section A
The response is count and the predictors are season, holiday, workingday, temp, atemp, humidity, windspeed, casual, and registered.
# Build a tree model using function tree()
tr.model <- tree(
count ~ season + holiday + workingday + temp +
atemp + humidity + windspeed + casual + registered,
data = trainData
)
# Display the statistical summary of the tree
summary(tr.model)
##
## Regression tree:
## tree(formula = count ~ season + holiday + workingday + temp +
## atemp + humidity + windspeed + casual + registered, data = trainData)
## Variables actually used in tree construction:
## [1] "registered" "casual"
## Number of terminal nodes: 8
## Residual mean deviance: 1875 = 13590000 / 7249
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -158.000 -19.290 -3.983 0.000 18.020 259.700
- The decision tree depends mainly on the registered and casual variables, which appear to be the most direct predictors of bike rental counts.
- With 8 terminal nodes, the model reflects a moderate level of complexity while still managing to avoid overfitting.
- The Residual Mean Deviance (1875) and the distribution of residuals give insight into how well the model performs. Although there are some notable outliers (like -158 and +259.7), the residuals generally cluster around zero.
- Overall, the tree offers a simplified yet understandable view of the data relationships. However, its predictive accuracy could potentially be enhanced through hyperparameter tuning or by exploring more advanced models such as random forests or gradient boosting.
Section B
Perform cross-validation to choose the best tree by calling cv.tree().
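The cross-validation chunk is not displayed; a minimal sketch, using the object name CV_Bike that appears in the Section C plotting code:
# Run cost-complexity cross-validation on the fitted regression tree
CV_Bike <- cv.tree(tr.model)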
Section C
Plot the model results of b) and determine the best size of the optimal tree.
plot(CV_Bike$size, CV_Bike$dev, type = "b",
xlab = "Tree Size",
ylab = "Deviance",
main = "Cross Validation Results")- The plot indicates that the ideal tree size is 4.
- At this point, the deviance reaches its lowest value before leveling off with larger trees. This suggests the model achieves a good balance—complex enough to capture key patterns without being overly complicated or prone to overfitting.
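Rather than reading the optimum off the plot by eye, the size with the lowest cross-validated deviance can be extracted directly; a small sketch:
# Tree size whose cross-validated deviance is smallest
best_size <- CV_Bike$size[which.min(CV_Bike$dev)]
best_size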
Section D
Prune the tree by calling the prune.tree() function with the best size found in c).
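The pruning chunk is not displayed; a sketch consistent with the object name prunedTree used in Section F and the best size of 4 found in Section C:
# Prune the full tree down to the 4 terminal nodes chosen by cross-validation
prunedTree <- prune.tree(tr.model, best = 4)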
Section E
Plot the best tree model.
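The plotting chunk is not shown; the standard way to draw a tree from the tree package, as a sketch:
# Draw the pruned tree and annotate its splits and leaf predictions
plot(prunedTree)
text(prunedTree, pretty = 0)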
- The pruned decision tree, refined to 4 terminal nodes through cross-validation, forecasts a numeric target variable. It begins by splitting on the registered variable, with the initial split at registered < 200.5. Subsequent splits lead to four final predicted values: 40.02, 180.40, 332.80, and 553.20.
- This streamlined version of the tree is more interpretable, reduces the risk of overfitting, and maintains solid predictive power by emphasizing the most impactful decision points.
Section F
Compute the test error using the test data set.
tr.predictions <- predict(prunedTree, newdata = testData)
#compute and display the MSE
tr.mse <- mean((tr.predictions - testData$count)^2)
tr.mse
## [1] 4762.857
- The Mean Squared Error (MSE) of 4762.857 measures the average squared difference between the model's predictions and the actual count values in the test dataset.
- Whether this MSE is considered good or bad depends on the scale of the target variable count. Taking the square root gives an RMSE of roughly 69 rentals, which puts the error on the same scale as count itself.
- For example, if count values usually fall in the thousands, this level of error might be reasonable; but if the values are much smaller, the error could be considered quite large in comparison.
Question 3:
Build a random forest model using function randomForest().
Section A
The response is count and the predictors are season, holiday, workingday, temp, atemp, humidity, windspeed, casual, and registered.
set.seed(123) # Set seed for reproducibility
rf.model <- randomForest(count ~ season + holiday + workingday + temp +
atemp + humidity + windspeed + casual + registered,
data = trainData, importance = TRUE)
# Print the random forest model result
print(rf.model)
##
## Call:
## randomForest(formula = count ~ season + holiday + workingday + temp + atemp + humidity + windspeed + casual + registered, data = trainData, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 130.3723
## % Var explained: 99.6
- The Random Forest model performs exceptionally well, accounting for 99.6% of the variance with a mean of squared residuals of 130.3723. These figures are out-of-bag estimates rather than pure training error, but numbers this strong still warrant validating the model on the held-out test data.
- Using 500 trees and 3 variables at each split, the model strikes a good balance between precision and computational efficiency.
- To determine which features have the greatest impact on predictions, tools like importance() or varImpPlot() can be used to assess variable importance.
Section B
Compute the test error using the test data set.
rf.predictions <- predict(rf.model, newdata = testData)
# Calculate the mean squared error
rf.mse <- mean((rf.predictions - testData$count)^2)
rf.mse
## [1] 177.4572
- The test Mean Squared Error (MSE) of 177.457 reflects strong predictive accuracy on new data, although it is somewhat higher than the out-of-bag error of 130.3723, indicating a minor degree of overfitting.
- Overall, the model demonstrates solid performance, with only a small drop in accuracy from training to testing, which is a normal and expected outcome in most modeling scenarios.
Section C
Extract the variable importance measures using the importance() function.
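The call that produced the table below is not displayed; presumably it is simply:
# Permutation (%IncMSE) and node-purity (IncNodePurity) importance per predictor
importance(rf.model)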
## %IncMSE IncNodePurity
## season 14.332261 2084868.26
## holiday 4.836423 77828.05
## workingday 27.288525 2036925.63
## temp 13.620187 6064088.53
## atemp 16.569101 7734568.86
## humidity 17.822489 5095570.13
## windspeed 7.787657 1215118.78
## casual 34.829490 58165642.18
## registered 117.913863 152055043.04
- The variable registered plays the most significant role in predicting the target variable count, with casual and workingday also contributing notably.
- In contrast, features such as holiday and windspeed have little influence on the model's predictive capability.
- These findings can guide model simplification by removing less impactful variables, and they highlight the main drivers behind the predictions.
Section D
Plot the variable importance using function varImpPlot(). Which are the top 2 important predictors in this model?
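The plotting chunk is not shown; a sketch of the standard call:
# Dot charts of %IncMSE and IncNodePurity for every predictor
varImpPlot(rf.model, main = "Variable Importance")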
- registered and casual are the top predictors, playing a major role in improving the model's accuracy and reducing variance.
- The model relies heavily on these two variables, indicating they are the primary factors affecting the target variable, count.