Tree-Based Methods


Forecast Bike Rental Demand


You are working as a data scientist for the government of Washington, D.C. The city currently operates a bike-sharing system: people can rent a bike at one location and return it at another. You are given historical usage patterns together with weather data in the file bike.csv, and you are asked to forecast bike rental demand in the Capital Bikeshare program.

Data Source: Kaggle, https://www.kaggle.com/c/bike-sharing-demand

Data Dictionary

This dataset contains the following columns:

Variable | Description | Data Type | Rules/Constraints
datetime | Date and time of the observation | Timestamp | Must be unique and in ISO 8601 format (YYYY-MM-DD HH:MM:SS).
season | Season (1: Spring, 2: Summer, 3: Fall, 4: Winter) | Categorical | Must be an integer between 1 and 4.
holiday | Whether the day is a holiday (0: No, 1: Yes) | Binary | Must be 0 or 1.
workingday | Whether the day is a working day (0: No, 1: Yes) | Binary | Must be 0 or 1.
weather | Weather condition (1 to 4, where 1: Clear, 4: Extreme) | Categorical | Must be an integer between 1 and 4.
temp | Temperature in Celsius | Continuous | Must be between 0 and 41.
atemp | "Feels like" temperature in Celsius | Continuous | Must be between 0 and 45.455.
humidity | Relative humidity (%) | Continuous | Must be between 0 and 100.
windspeed | Wind speed in km/h | Continuous | Must be non-negative, with a maximum of approximately 56.9979.
casual | Number of casual (non-registered) users | Integer | Must be non-negative, with a maximum of 367.
registered | Number of registered users | Integer | Must be non-negative, with a maximum of 886.
count | Total number of bike rentals | Integer | Must equal casual + registered, and be non-negative.
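
These rules can be verified programmatically once the file has been read into a data frame (a minimal sketch in base R, assuming the data frame is named Bike.df as in Question 1 and is checked before the categorical columns are converted to factors):

# Sanity-check the data-dictionary rules; stopifnot() errors on any violation
stopifnot(
  all(Bike.df$season %in% 1:4),
  all(Bike.df$holiday %in% c(0, 1)),
  all(Bike.df$workingday %in% c(0, 1)),
  all(Bike.df$weather %in% 1:4),
  all(Bike.df$humidity >= 0 & Bike.df$humidity <= 100),
  all(Bike.df$windspeed >= 0),
  all(Bike.df$count == Bike.df$casual + Bike.df$registered)
)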


Question 1:

Load the dataset bike.csv into memory. Then split the data into a training set containing 2/3 of the original observations and a test set containing the remaining 1/3.

Read the dataset into memory

Bike.df <- read.csv("data/Bike.csv")

Display the dimensions of the data frame (number of rows and columns)

dim(Bike.df)
## [1] 10886    12

Display the column names of the data frame

colnames(Bike.df)
##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"

Display the first six rows of the data frame to understand its structure

head(Bike.df)
##              datetime season holiday workingday weather temp  atemp humidity
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395       81
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635       80
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635       80
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395       75
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395       75
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880       75
##   windspeed casual registered count
## 1    0.0000      3         13    16
## 2    0.0000      8         32    40
## 3    0.0000      5         27    32
## 4    0.0000      3         10    13
## 5    0.0000      0          1     1
## 6    6.0032      0          1     1

Convert categorical variables to factors

Bike.df$season <- factor(Bike.df$season,
                          levels = c(1, 2, 3, 4),
                          labels = c("Spring", "Summer", "Fall", "Winter")
)
Bike.df$holiday <- factor(Bike.df$holiday, 
                           levels = c(0,1), 
                           labels = c("No", "Yes")
)
Bike.df$workingday <- factor(Bike.df$workingday,
                              levels = c(0,1), 
                              labels = c("No", "Yes")
)
Bike.df$weather <- factor(Bike.df$weather,
                           levels = c(1, 2, 3, 4),
                           labels = c("Clear", "Misty_cloudy",
                                      "Light_snow", "Heavy_rain")
)
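
A quick structural check confirms the conversions (str() lists each column with its class):

# season, holiday, workingday, and weather should now appear as factors
str(Bike.df)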

Set seed for reproducibility

set.seed(123)

Split the data into a training set (2/3) and test set (1/3)

# Sample 2/3 of the row indices; floor() makes the integer size explicit
trainIdx <- sample(1:nrow(Bike.df), size = floor(2/3 * nrow(Bike.df)))

# Create the training dataset from "Bike.df"
trainData <- Bike.df[trainIdx, ]

# Create the test dataset from the remaining rows of "Bike.df"
testData  <- Bike.df[-trainIdx, ]
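
A quick sanity check on the split sizes (2/3 * 10886 = 7257.33 truncates to 7257 training rows, leaving 3629 for the test set):

# Confirm the 2/3 vs. 1/3 split covers the whole data set
nrow(trainData)                                    # 7257
nrow(testData)                                     # 3629
nrow(trainData) + nrow(testData) == nrow(Bike.df)  # TRUE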


Question 2:

Build a tree model using the tree() function.

Section A

The response is count and the predictors are season, holiday, workingday, temp, atemp, humidity, windspeed, casual, and registered.

# Load the tree package, which provides tree(), cv.tree(), and prune.tree()
library(tree)

# Build a regression tree using function tree()
tr.model <- tree(
  count ~ season + holiday + workingday + temp +
    atemp + humidity + windspeed + casual + registered,
  data = trainData
)

# Display the statistical summary of the tree
summary(tr.model)
## 
## Regression tree:
## tree(formula = count ~ season + holiday + workingday + temp + 
##     atemp + humidity + windspeed + casual + registered, data = trainData)
## Variables actually used in tree construction:
## [1] "registered" "casual"    
## Number of terminal nodes:  8 
## Residual mean deviance:  1875 = 13590000 / 7249 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -158.000  -19.290   -3.983    0.000   18.020  259.700
  • The decision tree relies mainly on the registered and casual variables, which is unsurprising given that count is defined as casual + registered.
  • With 8 terminal nodes, the model has a moderate level of complexity while remaining simple enough to limit overfitting.
  • The residual mean deviance (1875) and the distribution of residuals indicate how well the model fits the training data. Although there are some notable outliers (residuals of -158 and +259.7), the residuals generally cluster around zero.
  • Overall, the tree offers a simplified yet understandable view of the data relationships (the arithmetic behind the reported residual mean deviance is verified below). Its predictive accuracy could potentially be enhanced through hyperparameter tuning or more advanced models such as random forests or gradient boosting.
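
As a quick arithmetic check, the residual mean deviance is the total residual deviance divided by the residual degrees of freedom (training observations minus terminal nodes, here 7257 - 8 = 7249). Both quantities are available as the dev and df components of the summary object:

# Residual mean deviance = total deviance / residual degrees of freedom
tr.sum <- summary(tr.model)
tr.sum$dev / tr.sum$df   # 13590000 / 7249, approximately 1875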


Section B

Perform cross-validation to choose the best tree by calling cv.tree().

set.seed(123)  # cv.tree() assigns folds at random, so fix the seed for reproducibility
CV_Bike <- cv.tree(tr.model)


Section C

Plot the model results of b) and determine the size of the optimal tree.

plot(CV_Bike$size, CV_Bike$dev, type = "b",
     xlab = "Tree Size",
     ylab = "Deviance",
     main = "Cross Validation Results")

  • The plot indicates that the optimal tree size is 4; the snippet below shows how to read this off programmatically.
  • At size 4 the deviance reaches its lowest value before leveling off for larger trees. This suggests the model achieves a good balance: complex enough to capture key patterns without being overly complicated or prone to overfitting.
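
The same conclusion can be drawn without inspecting the plot by locating the size that minimizes the cross-validated deviance (a quick check using the size and dev components of the cv.tree() result; given the plot above, this should return 4):

# Tree size with the lowest cross-validated deviance
best.size <- CV_Bike$size[which.min(CV_Bike$dev)]
best.size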


Section D

Prune the tree by calling the prune.tree() function with the best size found in c).

prunedTree <- prune.tree(tr.model, best = 4)


Section E

Plot the best tree model.

plot(prunedTree)
text(prunedTree, pretty = 0)

  • The pruned decision tree, reduced to 4 terminal nodes on the basis of the cross-validation results, predicts the numeric response count. It begins by splitting on the registered variable, with the initial split at registered < 200.5. Subsequent splits lead to four leaf predictions: 40.02, 180.40, 332.80, and 553.20 (printing the tree, shown below, lists these split points and leaf values in text form).
  • This streamlined tree is more interpretable, reduces the risk of overfitting, and maintains solid predictive power by emphasizing the most impactful decision points.
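
The plotted structure can also be read as text; printing a tree object lists each split, the number of observations per node, and the terminal-node predictions quoted above:

# Text view of the pruned tree's splits and leaf predictions
print(prunedTree)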


Section F

Compute the test error using the test data set.

tr.predictions <- predict(prunedTree, newdata = testData)

# Compute and display the test MSE
tr.mse <- mean((tr.predictions - testData$count)^2)
tr.mse
## [1] 4762.857
  • The Mean Squared Error (MSE) of 4762.857 measures the average squared difference between the model's predictions and the actual values in the test dataset.
  • Whether this MSE is considered good or bad depends on the scale of the target variable count.
    • Taking the square root puts the error on the original scale: an RMSE of roughly 69 rentals per hour. Relative to hourly counts that range from 1 up to leaf predictions above 550, this is a noticeable but workable error; the check below makes the comparison concrete.
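
Since the MSE is on a squared scale, a more interpretable check is to compare its square root against the spread of the response in the test set (a small sketch using objects already defined above):

# RMSE is on the same scale as count: sqrt(4762.857) is about 69 rentals
sqrt(tr.mse)

# Distribution of the test-set response for comparison
summary(testData$count)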


Question 3:

Build a random forest model using the randomForest() function.

Section A

The response is count and the predictors are season, holiday, workingday, temp, atemp, humidity, windspeed, casual, and registered.

# Load the randomForest package
library(randomForest)

set.seed(123)  # Set seed for reproducibility
rf.model <- randomForest(count ~ season + holiday + workingday + temp +
                           atemp + humidity + windspeed + casual + registered,
                         data = trainData, importance = TRUE)

# Print the random forest model result
print(rf.model)
## 
## Call:
##  randomForest(formula = count ~ season + holiday + workingday +      temp + atemp + humidity + windspeed + casual + registered,      data = trainData, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 130.3723
##                     % Var explained: 99.6
  • The Random Forest model fits the data extremely well, explaining 99.6% of the variance with a mean squared residual of 130.3723. Note that randomForest() computes both figures from out-of-bag (OOB) predictions, so they already provide an internal estimate of generalization error; it is still worth confirming performance on the held-out test set.
  • With 500 trees and 3 variables tried at each split (the regression default of p/3 for nine predictors), the model strikes a good balance between accuracy and computational cost.
  • To determine which features have the greatest impact on predictions, importance() or varImpPlot() can be used to assess variable importance, as done in Sections C and D below.


Section B

Compute the test error using the test data set.

rf.predictions <- predict(rf.model, newdata = testData)

# Calculate the mean squared error
rf.mse <- mean((rf.predictions - testData$count)^2)
rf.mse
## [1] 177.4572
  • The test Mean Squared Error (MSE) of 177.457 reflects strong predictive accuracy on new data. It is somewhat higher than the out-of-bag estimate of 130.3723, indicating only a minor generalization gap.
  • Overall, the model demonstrates solid performance, with a small drop in accuracy from the internal estimate to the test set, which is a normal and expected outcome. It is also a dramatic improvement over the pruned tree's test MSE of 4762.857; the side-by-side comparison below makes this explicit.
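
Placing the two test errors side by side quantifies the gain from the ensemble (using the MSE objects computed in Questions 2f and 3b):

# Test MSE comparison: pruned regression tree vs. random forest
# Expect roughly 4762.9 vs. 177.5 given the results above
c(tree = tr.mse, rf = rf.mse)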


Section C

Extract the variable importance measures using the importance() function.

importance(rf.model)
##               %IncMSE IncNodePurity
## season      14.332261    2084868.26
## holiday      4.836423      77828.05
## workingday  27.288525    2036925.63
## temp        13.620187    6064088.53
## atemp       16.569101    7734568.86
## humidity    17.822489    5095570.13
## windspeed    7.787657    1215118.78
## casual      34.829490   58165642.18
## registered 117.913863  152055043.04
  • The variable registered plays the most significant role in predicting the target variable count, with casual and workingday also contributing notably.
  • In contrast, features such as holiday and windspeed have little influence on the model's predictive capability.
  • These findings can guide model simplification by removing the less impactful variables, and they highlight the main drivers behind the predictions; a sketch of such a refit follows this list.
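
One way to act on these findings is to refit the forest without the weakest predictors and verify that the test error does not degrade (a hedged sketch, not part of the original assignment, reusing the variable names from the earlier code):

# Refit without holiday and windspeed, the two least important predictors
set.seed(123)
rf.reduced <- randomForest(count ~ season + workingday + temp + atemp +
                             humidity + casual + registered,
                           data = trainData)

# Test MSE of the reduced model, for comparison with rf.mse
mean((predict(rf.reduced, newdata = testData) - testData$count)^2)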


Section D

Plot the variable importance using the varImpPlot() function. Which are the top 2 important predictors in this model?

varImpPlot(rf.model)

  • registered and casual are the top two predictors, contributing the most both to accuracy (%IncMSE) and to node purity; the snippet below extracts them programmatically.
  • The model relies heavily on these two variables, indicating they are the primary factors driving the target variable, count.
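
The top two predictors can also be identified programmatically by sorting the importance matrix from Section C:

# Rank predictors by %IncMSE and keep the top two
imp <- importance(rf.model)
head(sort(imp[, "%IncMSE"], decreasing = TRUE), 2)
# Expect registered (about 117.9) and casual (about 34.8), as in Section C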