Predicting Bike Share Rentals


ML Question/Problem: Develop a model that predicts the number of bike share rentals per hour.

Background

Bike shares are a type of bike rental that is fully automated, from membership to rental and return. These systems let users rent a bike at one location and return it at another around the city, and they are part of a larger trend of shared micromobility around the world.

For my project, I chose a dataset on the number of bike share rentals per hour based on factors such as weather, season, day of week, and hour of day. The dataset comes from Fanaee-T and Gama (2014), who combined publicly available ride data from Capital Bikeshare in Washington, D.C. for 2011-2012 with publicly available weather and holiday data.

I picked this dataset in part because of its size, with 17,000+ rows and a good mix of categorical and continuous variables. Bike share data can also tell us about traffic, environmental, and health issues in the city, which makes it an interesting dataset to examine. Furthermore, a working model can be used to identify optimal conditions for more bike share programs, and bike share companies can better allocate their resources by predicting where they will have the most business.

I wanted to know if I could accurately predict the number of bikes rented out based on a number of factors, including weather, season, time of day, and type of day (weekday, workday, holiday).

Evaluation Metric: I will use mean squared error (MSE) as the main evaluation criterion. It measures the average squared difference between the model's predictions and the actual values, and I want to drive it as low as possible. I will also look at the R2 value, which measures the proportion of the total variation in the dependent variable that the independent variables in the model can explain.
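For concreteness, here is a minimal sketch of how these two metrics can be computed; the helper names are my own, not from any package:

mse <- function(actual, predicted) {
  # average squared prediction error
  mean((actual - predicted)^2)
}

r_squared <- function(actual, predicted) {
  # proportion of variance explained, relative to a mean-only baseline
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}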

Other research published using this dataset: https://link.springer.com/article/10.1007/s13748-013-0040-3

Exploratory Analysis

Boxplot: Five-Number Summary of Target Variable

# Annotate the boxplot with the five-number summary of hourly counts
boxplot(bike$cnt)
text(y = fivenum(bike$cnt), labels = fivenum(bike$cnt), x = 1.3)

Distribution of Bike Counts

hist(bike$cnt, col = 'red', xlab='Count', ylab='Frequency', main = 'Bike Count Distribution')

Bikes Rented by Hour

# Average rentals for each hour of the day
bike_tbl <- bike %>%
  group_by(hr) %>%
  summarise(average_count = mean(cnt), .groups = 'drop')

ggplot(bike_tbl, aes(x = hr, y = average_count)) +
  geom_col() + ggtitle("Bikes per Hour")

A count of one bike is by far the most common value. The dataset does not include hours in which no bikes were checked out, which is why the early-morning hours have fewer data points than waking hours. Most hours fall between 40 and 281 rentals; however, there are outliers from large surges in demand.

I think the biggest factor affecting the number of bicycles rented will be the hour of day. Grouping the data by hour and displaying the average count in a bar chart shows the distribution I expect: the rush hours are the most active, the count holds fairly steady through the day, and it drops off later into the night and the following morning.

Initial Model Building

Method: I plan to use a random forest because it reduces over-fitting compared to a single decision tree and helps increase prediction accuracy. It also handles categorical and continuous values well and removes the need for normalization.

param1 <- data.frame("Parameter" = "ntree", "Value"= 1000)
param1[nrow(param1)+1,] <- c("mtry",4)
param1[nrow(param1)+1,] <- c("sampsize",100)
paged_table(param1)
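The chunk that fits the initial model is not shown above. A minimal sketch of how rf1 could be trained with the parameters in the table, assuming the randomForest package, an 80/20 train/test split, and that identifier and leakage columns (e.g. casual, registered) were already dropped so 11 predictors remain; the seed and split proportions are my assumptions:

library(randomForest)

set.seed(42)                                              # assumed seed, for reproducibility
train_idx <- sample(nrow(bike), floor(0.8 * nrow(bike)))  # assumed 80/20 split
train <- bike[train_idx, ]
test  <- bike[-train_idx, ]

# Initial forest; importance = TRUE is needed for the %IncMSE measure used below
rf1 <- randomForest(cnt ~ ., data = train,
                    ntree = 1000, mtry = 4, sampsize = 100,
                    importance = TRUE)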

Initial Model Evaluation

Variable Importance

importances <- as.data.frame(importance(rf1))
importances$names <- rownames(importances)

ggplot(importances, aes(x=names, y=`%IncMSE`)) +
  geom_segment(aes(x=names, xend=names, y=0, yend=`%IncMSE`)) +
  geom_point(aes(size=IncNodePurity)) +
  theme_light() + 
  coord_flip()

%IncMSE is the increase in mean squared error (MSE) of the predictions, estimated on the out-of-bag samples, when the variable's values are randomly shuffled. IncNodePurity is the total decrease in node impurity (residual sum of squares) from splits on that variable, averaged over all trees. The higher the value of either metric, the more important the variable. As seen from the plot, my initial assumption that time of day (the hr variable) would be the most important is correct.

Change in Mean Square Error


rm(fig)
fig <- plot_ly(x = 1:rf1$ntree, y = rf1$mse, type = 'scatter', mode = 'lines') %>%
  layout(title = 'MSE For Each Iteration',
         xaxis = list(title = 'Number of Trees'),
         yaxis = list(title = 'MSE'))
fig

There is not much change after about the first 400 trees, and the MSE is still quite high, only coming down to around 13,000.

Optimization/Tuning Model

ntree was changed to 500, since the error barely improved with more trees.

(1) Alter mtry
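The chunks that fit rf2 and rf3 are not shown. A sketch under the assumption that they reuse the training data above with ntree = 500 and the baseline sampsize = 100, varying only mtry:

# Same forest with larger mtry values (7 candidates, then all 11 predictors)
rf2 <- randomForest(cnt ~ ., data = train, ntree = 500, mtry = 7,
                    sampsize = 100, importance = TRUE)
rf3 <- randomForest(cnt ~ ., data = train, ntree = 500, mtry = 11,
                    sampsize = 100, importance = TRUE)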

rm(fig2)
fig2 <- plot_ly(x = 1:rf2$ntree, y = rf2$mse, type = 'scatter', mode = 'lines') %>%
  layout(title = 'MSE For Each Iteration',
         xaxis = list(title = 'Number of Trees'),
         yaxis = list(title = 'MSE'))
fig2
rm(fig3)
fig3 <- plot_ly(x = 1:rf3$ntree, y = rf3$mse, type = 'scatter', mode = 'lines') %>%
  layout(title = 'MSE For Each Iteration',
         xaxis = list(title = 'Number of Trees'),
         yaxis = list(title = 'MSE'))
fig3
mtry_config <- data.frame("mtry"= 4, "mse"= min(rf1$mse), "rsq"=max(rf1$rsq))
mtry_config[nrow(mtry_config)+1,] <- c(7,min(rf2$mse),max(rf2$rsq))
mtry_config[nrow(mtry_config)+1,] <- c(11,min(rf3$mse),max(rf3$rsq))
paged_table(mtry_config)

(2) Alter Sample Size
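Similarly for rf4 and rf5; a sketch assuming mtry stays at the rf1 baseline of 4 so that only sampsize varies (my assumption, since the table below compares against rf1):

# Vary sampsize while holding mtry at the baseline
rf4 <- randomForest(cnt ~ ., data = train, ntree = 500, mtry = 4,
                    sampsize = 500, importance = TRUE)
rf5 <- randomForest(cnt ~ ., data = train, ntree = 500, mtry = 4,
                    sampsize = 1000, importance = TRUE)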

rm(fig4)
fig4 <- plot_ly(x = 1:rf4$ntree, y = rf4$mse, type = 'scatter', mode = 'lines') %>%
  layout(title = 'MSE For Each Iteration',
         xaxis = list(title = 'Number of Trees'),
         yaxis = list(title = 'MSE'))
fig4
rm(fig5)
fig5 <- plot_ly(x = 1:rf5$ntree, y = rf5$mse, type = 'scatter', mode = 'lines') %>%
  layout(title = 'MSE For Each Iteration',
         xaxis = list(title = 'Number of Trees'),
         yaxis = list(title = 'MSE'))
fig5
sampsize_config <- data.frame("sampsize"= 100,"mse"= min(rf1$mse),"rsq"=max(rf1$rsq))
sampsize_config[nrow(sampsize_config)+1,] <- c(500,min(rf4$mse),max(rf4$rsq))
sampsize_config[nrow(sampsize_config)+1,] <- c(1000,min(rf5$mse),max(rf5$rsq))
paged_table(sampsize_config)

The two evaluation metrics improve with increasing mtry, but mtry=11 would use all eleven variables at every split, removing the randomness that distinguishes a random forest from plain bagged decision trees and defeating the purpose of using it over a single decision tree. Moreover, there is not much improvement from mtry=7 to mtry=11.

The two evaluation metrics also improve with increasing sample size. Since I have a relatively large dataset, I can utilize a larger sample size, and sampsize=1000 greatly improves the metrics. The tradeoff, however, is a higher risk of overfitting the model.

Final Model

Model Summary
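The final fit is not shown either. A sketch using the tuned parameters from the table below, plus rf6, the default-sampsize variant referenced in the test-set results and the conclusion:

# Final tuned forest, and the default-sampsize variant for comparison
rf7 <- randomForest(cnt ~ ., data = train, ntree = 500, mtry = 7,
                    sampsize = 1000, importance = TRUE)
rf6 <- randomForest(cnt ~ ., data = train, ntree = 500, mtry = 7,
                    importance = TRUE)  # sampsize defaults to nrow(train)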

importances7 <- as.data.frame(importance(rf7))
importances7$names <- rownames(importances7)

ggplot(importances7, aes(x=names, y=`%IncMSE`)) +
  geom_segment(aes(x=names, xend=names, y=0, yend=`%IncMSE`)) +
  geom_point(aes(size=IncNodePurity)) +
  theme_light() + 
  coord_flip()

rm(fig7)
fig7 <- plot_ly(x = 1:rf7$ntree, y = rf7$mse, type = 'scatter', mode = 'lines') %>%
  layout(title = 'MSE For Each Iteration',
         xaxis = list(title = 'Number of Trees'),
         yaxis = list(title = 'MSE'))
fig7
paramf <- data.frame("Parameter/Metric" = "ntree", "Value"= 500)
paramf[nrow(paramf)+1,] <- c("mtry",7)
paramf[nrow(paramf)+1,] <- c("sampsize",1000)
paramf[nrow(paramf)+1,] <- c("mse",min(rf7$mse))
paramf[nrow(paramf)+1,] <- c("rsq",max(rf7$rsq))
paged_table(paramf)

Model Prediction on Test Set
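The chunk that computes result_mse7, result_r_squared7, result_mse6, and result_r_squared6 is not shown; a sketch of how they could be derived on the held-out test set:

# Evaluate both final candidates on the held-out test set
pred7 <- predict(rf7, newdata = test)
result_mse7       <- mean((test$cnt - pred7)^2)
result_r_squared7 <- 1 - sum((test$cnt - pred7)^2) /
                         sum((test$cnt - mean(test$cnt))^2)

pred6 <- predict(rf6, newdata = test)
result_mse6       <- mean((test$cnt - pred6)^2)
result_r_squared6 <- 1 - sum((test$cnt - pred6)^2) /
                         sum((test$cnt - mean(test$cnt))^2)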

predval <- data.frame("Metric" = "MSE (sample size 1000)", "Value"= result_mse7)
predval[nrow(predval)+1,] <- c("R^2 (sample size 1000)",result_r_squared7)
predval[nrow(predval)+1,] <- c("MSE (sample size max)",result_mse6)
predval[nrow(predval)+1,] <- c("R^2 (sample size max)",result_r_squared6)

paged_table(predval)

Fairness Assessment

A fairness assessment is not necessary, since the dataset contains no attributes corresponding to protected classes.

Conclusion

The final model was fairly competent, with ~87% of the variance captured by the model. The most important factor is 'hr', which was far more important than any other variable. The R2 value was 0.81 and 0.82 when predicting on the test set, so the model does a decent job of predicting the bike count, with some error.

I did test a model with the default sample size of nrow(x), around 15,000. This resulted in a much higher test R^2 of 0.95 instead of 0.87, but I had concerns about overfitting with a sample size that large.

This model still has limitations. A random forest cannot extrapolate, because its predictions are bounded by the range of target values in the training data. Extrapolation matters for bike sharing in particular, since ridership keeps growing and future counts will exceed the historical maximum; to be fair, extrapolating growth into the future would be hard for any machine learning model. What the model is useful for is identifying the features that matter most to bike rental counts and quantifying how well those counts can be predicted from the factors in the dataset. This can still help with allocating resources across different locations.

A problem with the dataset was that hours with no bike rentals (cnt == 0) were not included, so the model could not capture hours like those.

Future Work

In the future, I could add more features to the dataset, include the zero-count hours, cover a larger timespan by building and cleaning the data directly from Capital Bikeshare, and/or use a different type of model. Moreover, it might make sense to look only at data from specific times of day, such as eliminating overnight hours, since these small values can hurt the model's predictive power and push it toward overfitting to improve the evaluation metric. Furthermore, depending on the business need, it might be worth building a model without time of day, to see whether environmental factors (e.g., weather) alone are good predictors of rental rates, which could be useful when expanding bike share to other areas.

Using machine learning in this way to predict bike rental counts is very promising, given the high R^2 value and the large share of variance explained by the model. Future work along these lines should produce strong results.