Forecast use of a city bikeshare system
*“Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.*
*The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.”*
A closer look at the data
Each data point represents one hour, and all of its variables are recorded for that one-hour time frame.
I have also created a variable called dayofweek. My hypothesis is that demand differs across days of the week. For example, on weekdays, shared bikes are used mostly for commuting to work, while on weekends they are used more for sightseeing and travelling.
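The derivation of hour and dayofweek is not shown in this post; here is a minimal sketch of one way to create them, assuming the raw datetime column is in the standard Kaggle format ("YYYY-MM-DD HH:MM:SS"):
# Sketch (not from the original analysis): derive hour and dayofweek from datetime
bike$datetime <- as.POSIXct(bike$datetime, format = "%Y-%m-%d %H:%M:%S")
bike$hour <- as.integer(format(bike$datetime, "%H"))
# weekdays() is locale-dependent; the levels below assume an English locale
bike$dayofweek <- factor(weekdays(bike$datetime),
                         levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                    "Friday", "Saturday", "Sunday"))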
Also notice that the rental count is broken down into casual (number of non-registered user rentals initiated) and registered (number of registered user rentals initiated). It is interesting to see how different the demand patterns are for these two types of users.
Let’s take a look at the data for bike sharing demand by dayofweek and by hour:
library(ggplot2)
ggplot(bike, aes(x = hour, y = dayofweek)) +
  geom_tile(aes(fill = registered)) +
  scale_fill_gradient(name = "Count of registered rentals", low = "white", high = "red")
The number of rentals by registered users on weekdays peaks at around 8:00 and 17:00, which are also the times when people commute to and from work. Perhaps registered users are mostly working people.
ggplot(bike, aes(x = hour, y = dayofweek)) +
  geom_tile(aes(fill = casual)) +
  scale_fill_gradient(name = "Count of casual rentals", low = "white", high = "red")
Compared to the previous heat map, the colors in this one are much lighter, meaning there is less demand from casual users than from registered users. Rentals mostly occur between 8:00 and 18:00. Towards the weekend, there are more rentals as well as later rentals.
Other than hour, I am very curious to see which variables might affect the demand for shared bikes. Let's explore this using a few of the most common statistical techniques (tree, linear model, random forest), and see how well we can predict the demand.
Note: because there seem to be fundamental differences in what affects demand for shared bikes between casual renters and registered renters, for this exercise I will predict casual and registered separately, rather than the total count.
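The models below are fit on a train set and evaluated on a held-out test set. The split itself is not shown in this post; a minimal sketch of one way to create it, assuming a random 70/30 split of the data:
# Sketch (an assumption, not from the original post): random 70/30 train/test split
set.seed(123)
train.idx <- sample(nrow(bike), round(0.7 * nrow(bike)))
train <- bike[train.idx, ]
test  <- bike[-train.idx, ]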
To understand what really affects demand for bike sharing, let's take a look at a tree model. One advantage a tree model has over a linear model is that it models non-linear relationships among variables better, and is thus often more accurate at modeling human decisions.
1. Building tree to predict variable casual
library(tree)
casual.tree=tree(casual~.-registered-datetime-count,data=train)
Use 10-fold cross validation to choose the best tree size
casual.cv=cv.tree(casual.tree,FUN = prune.tree,K=10)
Prune tree to avoid overfitting
casual.prune=prune.tree(casual.tree,best=casual.cv$size[which.min(casual.cv$dev)])
plot(casual.prune);text(casual.prune,pretty=0)
The tree-building algorithm places the most important variable at the top of the tree. At each node, the left branch means “Yes” to the split criterion while the right branch means “No”. Some interpretations of casual demand can be read directly from this tree. Let's see how well this model does on the test samples:
casual.predtree=predict(casual.prune,test)
SSE = sum((casual.predtree - test$casual)^2)
SST = sum((mean(test$casual) - test$casual)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.7176369
The out-of-sample R2 is 0.7176369. Pretty good.
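Since this out-of-sample R2 calculation is repeated for every model below, it could optionally be wrapped in a small helper function (a sketch of my own, not used in the code that follows):
# Optional helper: out-of-sample R^2 for a vector of predictions
r2 <- function(pred, actual) {
  1 - sum((pred - actual)^2) / sum((mean(actual) - actual)^2)
}
r2(casual.predtree, test$casual)  # reproduces the value above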
2. Building tree to predict variable registered
reg.tree=tree(registered~.-casual-datetime-count,data=train)
Use 10-fold cross validation to choose the best tree size
reg.cv=cv.tree(reg.tree,FUN = prune.tree,K=10)
Prune tree to avoid overfitting
reg.prune=prune.tree(reg.tree,best=reg.cv$size[which.min(reg.cv$dev)])
plot(reg.prune);text(reg.prune,pretty=0)
Some insights about registered users can also be read from this tree.
Let's see how well this model does on the test samples:
reg.predtree=predict(reg.prune,test)
SSE = sum((reg.predtree - test$registered)^2)
SST = sum((mean(test$registered) - test$registered)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.6665475
The R2 is 0.6665475, less than that of the model for casual but still comparable.
Building linear model to predict variable casual
casual.lm=lm(casual~.-registered-datetime-count,data=train)
Building linear model to predict variable registered
reg.lm=lm(registered~.-casual-datetime-count,data=train)
Let’s look at the accuracy (R^2) for the two linear models
summary(casual.lm)$r.squared
## [1] 0.5818426
summary(reg.lm)$r.squared
## [1] 0.6181764
As you can see from the summaries, the in-sample R2 of the linear models is lower than even the out-of-sample R2 of the tree models. This suggests that a linear model is not as good a fit in this situation.
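For a like-for-like comparison, the out-of-sample R2 of the two linear models can be computed the same way as for the trees (a small sketch of my own, using the same test set):
# Out-of-sample R^2 of the linear models on the test set
casual.predlm = predict(casual.lm, test)
1 - sum((casual.predlm - test$casual)^2) / sum((mean(test$casual) - test$casual)^2)
reg.predlm = predict(reg.lm, test)
1 - sum((reg.predlm - test$registered)^2) / sum((mean(test$registered) - test$registered)^2)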
The tree can be extended to a random forest, an algorithm that overfits the training samples less (through bootstrapping) and predicts better when there are many correlated predictors (by considering fewer variables at each split). Let's see if we can improve on our tree model.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
We try different values for the number of trees (ntree) in our forest and for the number of variables considered at each split (mtry); research suggests that between the square root and half of the total number of variables gives the best results.
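As a rough check of that rule of thumb: about 10 predictors remain after dropping casual, registered, datetime, and count (my approximate count), which suggests an mtry of roughly 3 to 5 and motivates the grid below:
p <- 10                              # approximate number of remaining predictors
c(floor(sqrt(p)), ceiling(p / 2))    # rule-of-thumb range for mtry: about 3 to 5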
ntree=c(5,10,50,100,300,500)
mtry = c(3,4,5,6,7)
test.err = matrix(0,nrow=length(ntree),ncol=length(mtry))
for (i in 1:length(ntree)){
  for (m in 1:length(mtry)) {
    fit = randomForest(casual~.-registered-datetime-count, data=train,
                       mtry=mtry[m], ntree=ntree[i])
    # out-of-bag MSE with ntree[i] trees, used as an estimate of test error
    test.err[i,m] = fit$mse[ntree[i]]
  }
}
matplot(ntree,cbind(test.err[,1],test.err[,2],test.err[,3],test.err[,4],test.err[,5]),pch=19,type="b",ylab="MSE")
The lower the MSE (mean squared error), the more accurate the model. In this situation, ntree=500 gives the best results. Let's see, for the larger forests (300 and 500 trees), what number of variables at each split (mtry) gives the best result.
matplot(mtry,cbind(test.err[5,],test.err[6,]),pch=19,type="b",ylab="MSE")
legend("top",legend=c("300 trees","500 trees"),pch=19,col=c("black","red"))
Let’s try to build a random forest with 1000 trees and 5 variables at each split
casual.fit2 = randomForest(casual~.-registered-datetime-count,data=train, mtry=5, ntree=1000)
Compute out-of-sample R^2
casual.predforest=predict(casual.fit2,test)
SSE = sum((casual.predforest - test$casual)^2)
SST = sum((mean(test$casual) - test$casual)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.8751549
This model returns an R2 of 0.8751549, a significant improvement over the tree model.
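The fitted forest also records variable importance, which complements the interpretation from the tree model (a quick sketch; IncNodePurity is reported by default):
# Inspect variable importance of the casual-demand forest
importance(casual.fit2)
varImpPlot(casual.fit2)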
Now let's repeat the same tuning for the registered rentals.
ntree=c(5,10,50,100,300,500)
mtry = c(3,4,5,6,7)
test.err = matrix(0,nrow=length(ntree),ncol=length(mtry))
for (i in 1:length(ntree)){
  for (m in 1:length(mtry)) {
    reg.fit = randomForest(registered~.-casual-datetime-count, data=train,
                           mtry=mtry[m], ntree=ntree[i])
    # out-of-bag MSE with ntree[i] trees, used as an estimate of test error
    test.err[i,m] = reg.fit$mse[ntree[i]]
  }
}
matplot(ntree,cbind(test.err[,1],test.err[,2],test.err[,3],test.err[,4],test.err[,5]),pch=19,type="b",ylab="MSE")
Again, a range of 300 to 500 trees is good enough for this model.
matplot(mtry,cbind(test.err[5,],test.err[6,]),pch=19,type="b",ylab="MSE")
legend("top",legend=c("300 trees","500 trees"),pch=19,col=c("black","red"))
For the final model, I use 1000 trees and 5 variables at each split
reg.fit2 = randomForest(registered~.-casual-datetime-count,data=train, mtry = 5, ntree=1000)
Compute out-of-sample R^2
reg.predforest=predict(reg.fit2,test)
SSE = sum((reg.predforest - test$registered)^2)
SST = sum((mean(test$registered) - test$registered)^2)
R2 = 1 - SSE/SST
R2
## [1] 0.8420388
Again, this model returns an R2 of 0.8420388, a significant improvement over the tree model.
From the tree models, we learned that hour, temp, and workingday are the most important predictors of demand for shared bikes in Washington, D.C. For predicting the demand, however, the random forest delivers much better accuracy than a single tree.
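Finally, since the competition asks for the total rental count, the two forest predictions can be combined into a forecast of total demand (a sketch using the same test set and the fitted models from above):
# Predicted total demand = predicted casual + predicted registered
count.pred <- casual.predforest + reg.predforest
SSE = sum((count.pred - test$count)^2)
SST = sum((mean(test$count) - test$count)^2)
1 - SSE/SST  # out-of-sample R^2 for the total count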