Forecast use of a city bikeshare system
*“Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.*
*The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.”*
A closer look at the data
Each data point represents one hour, and all of its variables are recorded for that one-hour time frame.
I have also created a variable called dayofweek. My hypothesis is that demand differs across days of the week. For example, on weekdays, shared bikes are used mostly for commuting to work, while on weekends they are used more for sightseeing and travelling.
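The derivation of hour and dayofweek is not shown in this post; here is a minimal sketch of one way to create them, assuming the raw datetime column is in the standard Kaggle format ("YYYY-MM-DD HH:MM:SS"):
# Sketch (not from the original analysis): derive hour and dayofweek from datetime
bike$datetime <- as.POSIXct(bike$datetime, format = "%Y-%m-%d %H:%M:%S")
bike$hour <- as.integer(format(bike$datetime, "%H"))
# weekdays() is locale-dependent; the levels below assume an English locale
bike$dayofweek <- factor(weekdays(bike$datetime),
                         levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                    "Friday", "Saturday", "Sunday"))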
Also notice that the rental count is broken down into casual (number of non-registered user rentals initiated) and registered (number of registered user rentals initiated). It is interesting to see how different the demand patterns are for these two types of users.
Let’s take a look at the data for bike sharing demand by dayofweek and by hour:
library(ggplot2)
ggplot(bike, aes(x = hour, y = dayofweek)) +
  geom_tile(aes(fill = registered)) +
  scale_fill_gradient(name = "Count of registered rentals", low = "white", high = "red")
The number of rentals by registered users on weekdays peaks at around 8:00 and 17:00, which are also the times when people commute to and from work. Perhaps registered users are mostly working people.
ggplot(bike, aes(x = hour, y = dayofweek)) +
  geom_tile(aes(fill = casual)) +
  scale_fill_gradient(name = "Count of casual rentals", low = "white", high = "red")
Compared to the previous heat map, the colors in this one are much lighter, meaning there is less demand from casual users than from registered users. Rentals mostly occur between 8:00 and 18:00. Towards the weekend, there are more rentals as well as later rentals.
Other than hour, I am very curious to see which variables might affect the demand for shared bikes. Let's explore this using a few of the most common statistical techniques (tree, linear model, random forest), and see how well we can predict the demand.
Note: because there seem to be fundamental differences in what affects demand for shared bikes between casual renters and registered renters, for this exercise I will predict casual and registered separately, rather than the total count.
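The models below are fit on a train set and evaluated on a held-out test set. The split itself is not shown in this post; a minimal sketch of one way to create it, assuming a random 70/30 split of the data:
# Sketch (an assumption, not from the original post): random 70/30 train/test split
set.seed(123)
train.idx <- sample(nrow(bike), round(0.7 * nrow(bike)))
train <- bike[train.idx, ]
test  <- bike[-train.idx, ]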
To understand what really affects demand for bike sharing, let's take a look at a tree model. One advantage a tree model has over a linear model is that it models non-linear relationships among variables better, and is thus often more accurate at modeling human decisions.
1. Building tree to predict variable casual
library(tree)
casual.tree=tree(casual~.-registered-datetime-count,data=train)
Use 10-fold cross validation to choose the best tree size
casual.cv=cv.tree(casual.tree,FUN = prune.tree,K=10)
Prune tree to avoid overfitting
casual.prune=prune.tree(casual.tree,best=casual.cv$size[which.min(casual.cv$dev)])
plot(casual.prune);text(casual.prune,pretty=0)
The tree-building algorithm places the most important variable at the top of the tree. At each node, the left branch means “Yes” to the split criterion while the right branch means “No”. Some interpretations of casual demand can be read directly from this tree. Let's see how well this model does on the test samples:
casual.predtree=predict(casual.prune,test)
SSE = sum((casual.predtree - test$casual)^2)
SST = sum((mean(test$casual) - test$casual)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.7176369
The out-of-sample R2 is 0.7176369. Pretty good.
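Since this out-of-sample R2 calculation is repeated for every model below, it could optionally be wrapped in a small helper function (a sketch of my own, not used in the code that follows):
# Optional helper: out-of-sample R^2 for a vector of predictions
r2 <- function(pred, actual) {
  1 - sum((pred - actual)^2) / sum((mean(actual) - actual)^2)
}
r2(casual.predtree, test$casual)  # reproduces the value above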
2. Building tree to predict variable registered
reg.tree=tree(registered~.-casual-datetime-count,data=train)
Use 10-fold cross validation to choose the best tree size
reg.cv=cv.tree(reg.tree,FUN = prune.tree,K=10)
Prune tree to avoid overfitting
reg.prune=prune.tree(reg.tree,best=reg.cv$size[which.min(reg.cv$dev)])
plot(reg.prune);text(reg.prune,pretty=0)
Some insights about registered users can also be read from this tree.
Let's see how well this model does on the test samples:
reg.predtree=predict(reg.prune,test)
SSE = sum((reg.predtree - test$registered)^2)
SST = sum((mean(test$registered) - test$registered)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.6665475
The R2 is 0.6665475, less than that of the model for casual but still comparable.
Building linear model to predict variable casual
casual.lm=lm(casual~.-registered-datetime-count,data=train)
Building linear model to predict variable registered
reg.lm=lm(registered~.-casual-datetime-count,data=train)
Let’s look at the accuracy (R^2) for the two linear models
summary(casual.lm)$r.squared
## [1] 0.5818426
summary(reg.lm)$r.squared
## [1] 0.6181764
As you can see from the summaries, the in-sample R2 of the linear models is lower than even the out-of-sample R2 of the tree models. This suggests that a linear model is not as good a fit in this situation.
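For a like-for-like comparison, the out-of-sample R2 of the two linear models can be computed the same way as for the trees (a small sketch of my own, using the same test set):
# Out-of-sample R^2 of the linear models on the test set
casual.predlm = predict(casual.lm, test)
1 - sum((casual.predlm - test$casual)^2) / sum((mean(test$casual) - test$casual)^2)
reg.predlm = predict(reg.lm, test)
1 - sum((reg.predlm - test$registered)^2) / sum((mean(test$registered) - test$registered)^2)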
The tree can be extended to a random forest, an algorithm that overfits the training samples less (through bootstrapping) and predicts better when there are many correlated predictors (by considering fewer variables at each split). Let's see if we can improve on our tree model.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
We try different values for the number of trees (ntree) in our forest and for the number of variables considered at each split (mtry); research suggests that between the square root and half of the total number of variables gives the best results.
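As a rough check of that rule of thumb: about 10 predictors remain after dropping casual, registered, datetime, and count (my approximate count), which suggests an mtry of roughly 3 to 5 and motivates the grid below:
p <- 10                              # approximate number of remaining predictors
c(floor(sqrt(p)), ceiling(p / 2))    # rule-of-thumb range for mtry: about 3 to 5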
ntree=c(5,10,50,100,300,500)
mtry = c(3,4,5,6,7)
test.err = matrix(0,nrow=length(ntree),ncol=length(mtry))
for (i in 1:length(ntree)){
  for (m in 1:length(mtry)) {
    fit = randomForest(casual~.-registered-datetime-count, data=train,
                       mtry=mtry[m], ntree=ntree[i])
    # out-of-bag MSE with ntree[i] trees, used as an estimate of test error
    test.err[i,m] = fit$mse[ntree[i]]
  }
}
matplot(ntree,cbind(test.err[,1],test.err[,2],test.err[,3],test.err[,4],test.err[,5]),pch=19,type="b",ylab="MSE")
The lower the MSE (mean squared error), the more accurate the model. In this situation, ntree=500 gives the best results. Let's see, for the larger forests (300 and 500 trees), what number of variables at each split (mtry) gives the best result.
matplot(mtry,cbind(test.err[5,],test.err[6,]),pch=19,type="b",ylab="MSE")
legend("top",legend=c("300 trees","500 trees"),pch=19,col=c("black","red"))
Let’s try to build a random forest with 1000 trees and 5 variables at each split
casual.fit2 = randomForest(casual~.-registered-datetime-count,data=train, mtry=5, ntree=1000)
Compute out-of-sample R^2
casual.predforest=predict(casual.fit2,test)
SSE = sum((casual.predforest - test$casual)^2)
SST = sum((mean(test$casual) - test$casual)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.8751549
This model returns an R2 of 0.8751549, a significant improvement over the tree model.
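The fitted forest also records variable importance, which complements the interpretation from the tree model (a quick sketch; IncNodePurity is reported by default):
# Inspect variable importance of the casual-demand forest
importance(casual.fit2)
varImpPlot(casual.fit2)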
Now let's repeat the same tuning for the registered rentals.
ntree=c(5,10,50,100,300,500)
mtry = c(3,4,5,6,7)
test.err = matrix(0,nrow=length(ntree),ncol=length(mtry))
for (i in 1:length(ntree)){
  for (m in 1:length(mtry)) {
    reg.fit = randomForest(registered~.-casual-datetime-count, data=train,
                           mtry=mtry[m], ntree=ntree[i])
    # out-of-bag MSE with ntree[i] trees, used as an estimate of test error
    test.err[i,m] = reg.fit$mse[ntree[i]]
  }
}
matplot(ntree,cbind(test.err[,1],test.err[,2],test.err[,3],test.err[,4],test.err[,5]),pch=19,type="b",ylab="MSE")
Again, a range of 300 to 500 trees is good enough for this model.
matplot(mtry,cbind(test.err[5,],test.err[6,]),pch=19,type="b",ylab="MSE")
legend("top",legend=c("300 trees","500 trees"),pch=19,col=c("black","red"))
For the final model, I use 1000 trees and 5 variables at each split
reg.fit2 = randomForest(registered~.-casual-datetime-count,data=train, mtry = 5, ntree=1000)
Compute out-of-sample R^2
reg.predforest=predict(reg.fit2,test)
SSE = sum((reg.predforest - test$registered)^2)
SST = sum((mean(test$registered) - test$registered)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.8420388
Again, this model returns an R2 of 0.8420388, a significant improvement over the tree model.
From the tree models, we learned that hour, temp, and workingday are the most important predictors of demand for shared bikes in Washington, D.C. For predicting the demand, however, the random forest delivers much better accuracy than a single tree.
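Finally, since the competition asks for the total rental count, the two forest predictions can be combined into a forecast of total demand (a sketch using the same test set and the fitted models from above):
# Predicted total demand = predicted casual + predicted registered
count.pred <- casual.predforest + reg.predforest
SSE = sum((count.pred - test$count)^2)
SST = sum((mean(test$count) - test$count)^2)
1 - SSE/SST  # out-of-sample R^2 for the total count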