April 05, 2016


1 - Introduction

This is an extension of my final project for the course MATH664: Methods for Statistical Consulting. The original post and project background can be found here.

Briefly, due to limited time and resources, when my team and I worked on this project, we only built simple generalized linear models to explore the patterns in the data set. Poisson regression, linear regression with a log transformation, and negative binomial regression were each applied. However, the predictive results were not very good, which is part of the reason the original post was named Exploratory Data Analysis of Bike Rental System in Washington, D.C.

This project follows the procedure of this link, whose exploratory analysis is very similar to what I did before, but whose feature engineering and predictive analysis are completely different.

So in this updated post, I…

  • redid all my earlier work, recreating all the visualizations with the ggplot2 package;
  • created new features using decision trees;
  • split the original training data set into new training and testing sets, and applied the random forest method to predict bike usage;
  • evaluated my models and drew conclusions.

2 - Exploratory Analysis

First, I combined the training and testing data sets to understand the distribution of the explanatory variables.
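The structure below was printed with glimpse() from the dplyr package. As a minimal sketch of the setup (assuming the standard train.csv and test.csv files from the Kaggle competition; the test rows simply get NA for casual, registered, and count):

library(dplyr)

## read both Kaggle files and stack them so the explanatory variables
## can be inspected across the full two-year period
data_train   <- read.csv("train.csv")
data_test    <- read.csv("test.csv")
data.combine <- bind_rows(data_train, data_test)
glimpse(data.combine)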

Observations: 17,379
Variables: 12
$ datetime   (fctr) 2011-01-01 00:00:00, 2011-01-01 01:00:00, 2011-01-...
$ season     (fctr) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ holiday    (fctr) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ workingday (fctr) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ weather    (fctr) 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,...
$ temp       (dbl) 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.8...
$ atemp      (dbl) 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13....
$ humidity   (int) 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77,...
$ windspeed  (dbl) 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0...
$ casual     (dbl) 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 4...
$ registered (dbl) 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 7...
$ count      (dbl) 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, ...

The structure of the data set shows a date & time variable; four categorical variables: season, weather, holiday, and workingday; and four numerical variables: temperature, "feels like" temperature, humidity, and windspeed. The total count is the sum of casual and registered.

The prediction task is actually to predict the numbers of casual and registered bike users separately; summing them up gives the total.


2.1 - Distribution of Categorical Variables

From the plots above, we can see that:

  1. The four seasons in the season variable have nearly equal proportions.
Proportion Table of ‘season’ Variable
Spring    Summer    Fall      Winter
24.41 %   25.37 %   25.87 %   24.35 %
  2. Weather 1 (clear weather) is the most common category of the weather variable, and weather 4 (terrible weather) is the least common, which means that for most of the two-year period, the weather conditions were good.
Proportion Table of ‘weather’ Variable (weather conditions worsen from 1 to 4)
WeatherCondition_1   WeatherCondition_2   WeatherCondition_3   WeatherCondition_4
65.67 %              26.15 %              8.17 %               0.02 %
  3. Most of the days are not holidays (of course… = .=!)
Proportion Table of ‘holiday’ Variable
Not a Holiday   Holiday
97.12 %         2.88 %
  4. Most of the days are working days (sad… T_T)
Proportion Table of ‘workingday’ Variable
Not a Workingday   Workingday
31.73 %            68.27 %

2.2 - Distribution of Numerical Variables

It seems these numerical variables are distributed quite naturally.


2.3 - Bike Usage vs. Time

It is common sense to treat time as an important variable when predicting rental bike usage, so I extracted time information from the date & time variable to analyze hourly, weekly, monthly, and yearly trends.
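A minimal sketch of this extraction with the lubridate package (an assumption on my part; the hour, weekday, month, and year variables created here reappear as predictors in the models later):

library(lubridate)

## parse the datetime column, then pull out the components to analyze
data.combine$datetime <- ymd_hms(as.character(data.combine$datetime))
data.combine$hour     <- hour(data.combine$datetime)
data.combine$weekday  <- wday(data.combine$datetime, label = TRUE)
data.combine$month    <- month(data.combine$datetime)
data.combine$year     <- year(data.combine$datetime)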

As we can see clearly, the overall trend is quite obvious:

  1. Each day, there is a peak in the morning at about 8 am and another in the afternoon at about 5 pm - 6 pm.
  2. Each week, bike usage on weekdays is higher than on weekends on average.
  3. Each year, the highest bike usage is between June and July (warm?), while the lowest is from January to March (cold?).
  4. From 2011 to 2012, overall bike usage increased.

Besides the overall analysis, I also performed the same analysis for registered and casual users respectively, and found something interesting.

The hourly trend of the registered bike users matches the overall trend, while there are no morning and afternoon peaks for casual users. This is a huge difference between registered and casual users, and because of it, separate hour bins will be created for the two groups later in feature engineering.

The weekly trend of the registered bike users also matches the overall trend, but the casual users show the opposite pattern: there are more casual users on weekends than on weekdays. One possible explanation is that registered users ride rental bikes to commute, while casual users ride them for leisure. Based on this assumption, a weekend variable will also be created later.


2.4 - Bike Usage vs. Categorical Variables

From the plots above we can see that the effect of season and weather on rental bike usage is quite clear:

  1. In summer and fall, there are more users than in spring and winter;
  2. As the weather conditions become worse, the number of bike users decreases (there is only one record for weather condition 4; although its value is almost the same as the median of weather condition 1, it is not representative).

On the other hand, the effect of holiday and workingday is not so obvious. It seems there are more users when it is not a holiday, and more on working days. To better understand these two variables, I repeated the analysis for registered and casual users respectively.

Again, the trends for registered and casual users are exactly opposite, which further supports the assumption I made earlier.


2.5 - Bike Usage vs. Numerical Variables

We can see that bike usage has a positive relationship with temperature and a negative relationship with humidity. There also seems to be a positive relationship between bike usage and windspeed, but it is not as strong as the one with temperature.

From the correlation matrix below we can draw the same conclusions, and we can also see that temp and atemp are highly correlated.
Correlation Matrix

             count  registered  casual   temp  atemp  humidity  windspeed
count         1.00        0.97    0.69   0.39   0.39     -0.32       0.10
registered    0.97        1.00    0.50   0.32   0.31     -0.27       0.09
casual        0.69        0.50    1.00   0.47   0.46     -0.35       0.09
temp          0.39        0.32    0.47   1.00   0.98     -0.06      -0.02
atemp         0.39        0.31    0.46   0.98   1.00     -0.04      -0.06
humidity     -0.32       -0.27   -0.35  -0.06  -0.04      1.00      -0.32
windspeed     0.10        0.09    0.09  -0.02  -0.06     -0.32       1.00
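For reference, a matrix like the one above can be produced directly in base R (a sketch; it uses the training rows only, since the test rows lack the three count columns):

## numeric columns included in the correlation matrix
num_vars <- c("count", "registered", "casual", "temp",
              "atemp", "humidity", "windspeed")
round(cor(data_train[, num_vars]), 2)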

3 - Feature Engineering

In this section, I’ll create five new features from the given features to help improve the prediction results.


3.1 - Hour Bins

As demonstrated earlier, the hour of the day is a significant variable. Now I’ll create a new feature called “hour bin” based on the time of day, separately for registered and casual users.

I’ll use a decision tree to help define the hour bins, as sketched below.
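A minimal sketch of the idea with the rpart package (assuming the tree is fit on the training rows, where the response is known):

library(rpart)

## a shallow regression tree of registered count on hour; the split
## points it prints suggest the hour-bin boundaries used below
train_rows <- !is.na(data.combine$registered)
tree_hour_reg <- rpart(registered ~ hour, data = data.combine[train_rows, ])
print(tree_hour_reg)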

Create the hourBin variables as follows:

## create hour bucket for registered users
data.combine$hourBin_reg = 0
data.combine$hourBin_reg[data.combine$hour < 7.5] = 1
data.combine$hourBin_reg[data.combine$hour >= 22] = 2
data.combine$hourBin_reg[data.combine$hour >= 9.5 & data.combine$hour < 18] = 3
data.combine$hourBin_reg[data.combine$hour >= 7.5 & data.combine$hour < 8.5] = 4
data.combine$hourBin_reg[data.combine$hour >= 8.5 & data.combine$hour < 9.5] = 5
data.combine$hourBin_reg[data.combine$hour >= 20 & data.combine$hour < 22] = 6
data.combine$hourBin_reg[data.combine$hour >= 18 & data.combine$hour < 20] = 7
## create hour bucket for casual users
data.combine$hourBin_cas = 0
data.combine$hourBin_cas[data.combine$hour < 8.5] = 1
data.combine$hourBin_cas[data.combine$hour >= 8.5 & data.combine$hour < 10] = 2
data.combine$hourBin_cas[data.combine$hour >= 20] = 3
data.combine$hourBin_cas[data.combine$hour >= 10 & data.combine$hour < 20] = 4

3.2 - Temperature Bins

We can do the same for the temperature variable, again letting a tree suggest the cut points, as sketched below.
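The analogous tree, here for casual users (same assumption as the hour-bin sketch above):

library(rpart)

## a shallow regression tree of casual count on temperature
tree_temp_cas <- rpart(casual ~ temp,
                       data = data.combine[!is.na(data.combine$casual), ])
print(tree_temp_cas)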

Create the tempBin variables as follows:

## create temperature bucket for registered users
data.combine$tempBin_reg = 0
data.combine$tempBin_reg[data.combine$temp < 13] = 1
data.combine$tempBin_reg[data.combine$temp >= 13 & data.combine$temp < 23] = 2
data.combine$tempBin_reg[data.combine$temp >= 23 & data.combine$temp < 30] = 3
data.combine$tempBin_reg[data.combine$temp >= 30] = 4
## create temperature bucket for casual users
data.combine$tempBin_cas = 0
data.combine$tempBin_cas[data.combine$temp < 15] = 1
data.combine$tempBin_cas[data.combine$temp >= 15 & data.combine$temp < 23] = 2
data.combine$tempBin_cas[data.combine$temp >= 23 & data.combine$temp < 30] = 3
data.combine$tempBin_cas[data.combine$temp >= 30] = 4

3.3 - Quarter

Based on what we’ve seen so far, rental bike usage is increasing over time and has a seasonal trend, so I will create a new variable based on the quarter of the year to capture this pattern.

data.combine$quarter = 0
data.combine$quarter[data.combine$year == '2011'] = 1
data.combine$quarter[data.combine$year == '2011' & data.combine$month > 3] = 2
data.combine$quarter[data.combine$year == '2011' & data.combine$month > 6] = 3
data.combine$quarter[data.combine$year == '2011' & data.combine$month > 9] = 4
data.combine$quarter[data.combine$year == '2012'] = 5
data.combine$quarter[data.combine$year == '2012' & data.combine$month > 3] = 6
data.combine$quarter[data.combine$year == '2012' & data.combine$month > 6] = 7
data.combine$quarter[data.combine$year == '2012' & data.combine$month > 9] = 8

Eight bins have been created for the two-year period, and the numbers of observations in the bins are close to each other.

QuarterBin_1 QuarterBin_2 QuarterBin_3 QuarterBin_4 QuarterBin_5 QuarterBin_6 QuarterBin_7 QuarterBin_8
2067 2183 2192 2203 2176 2182 2208 2168

3.4 - Day Type

This new feature is a combination of holiday and workingday, but it specifically indicates whether a day is a weekend.

data.combine$dayType = ""
data.combine$dayType[data.combine$holiday == 1] = "holiday"
data.combine$dayType[data.combine$holiday == 0 & data.combine$workingday == 0] = "weekend"
data.combine$dayType[data.combine$holiday == 0 & data.combine$workingday == 1] = "workingDay"

The result is as follows:

holiday weekend workingDay
500 5014 11865

3.5 - Weekend

Because registered and casual users show different patterns on weekdays and weekends, it is necessary to create a dedicated weekend variable.

data.combine$weekend = 0
data.combine$weekend[data.combine$dayType == "weekend"] = 1

In the result, 0 means the day is a weekday, and 1 means it is a weekend.

Not a Weekend Weekend
12365 5014

4 - Modeling

All the hard work above was actually for this final step. Here I will use the random forest method to predict bike usage. To see the effect of the new features I just created, my first model will use the original features only; in my second model, I will incorporate the new features to see how big a difference the feature engineering makes.


4.1 - Distribution of the Response Variable

Now it is a good time to take a look at the distribution of the response variables, registered and casual.

It is clear that the distributions are skewed to the right. To deal with this, I’ll apply a log transformation to the response variables.

## log transformation for dependent variable
data_train$logReg <- log(data_train$registered + 1)
data_train$logCas <- log(data_train$casual + 1)

After the log transformation, the distributions are still not perfectly normal, but they are much better than before.
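A quick way to compare the two distributions (a ggplot2 sketch, not the original figure code):

library(ggplot2)

## histograms of registered counts before and after the log transform
ggplot(data_train, aes(x = registered)) + geom_histogram(bins = 30)
ggplot(data_train, aes(x = logReg)) + geom_histogram(bins = 30)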


4.2 - Random Forest Regression and Model Evaluation

All the preparations are done; now I will build my first random forest model for the problem.

First, I will partition the data into training and testing sets using the caret package: 80% of the observations are assigned to the training set, and the rest to the testing set.

set.seed(123)
trainIndex = createDataPartition(data_train$count, p = 0.80, list = FALSE)
training = data_train[trainIndex, ]
testing = data_train[-trainIndex, ]

Next, I fit my models with only the original predictors.

model_fit.reg <- randomForest(logReg ~ season + holiday + workingday + weather +
                                temp + atemp + humidity + windspeed +
                                hour + weekday + month + year,
                              data = training, importance = TRUE, ntree = 250)
model_fit.cas <- randomForest(logCas ~ season + holiday + workingday + weather +
                                temp + atemp + humidity + windspeed +
                                hour + weekday + month + year,
                              data = training, importance = TRUE, ntree = 250)
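Since the models are fit with importance = TRUE, the relative importance of the predictors can also be inspected along the way (a quick optional check using randomForest’s built-in plot):

## variable importance for the two fitted models
varImpPlot(model_fit.reg)
varImpPlot(model_fit.cas)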

After the models are fit, we can use them to predict bike usage.

pred.reg <- predict(model_fit.reg, testing)
pred.cas <- predict(model_fit.cas, testing)

testing$registered_pred <- round(exp(pred.reg) - 1, 0)
testing$casual_pred <- round(exp(pred.cas) - 1, 0)
testing$count_pred <- testing$registered_pred + testing$casual_pred
Prediction Results

Looking at the prediction vs. observation plots, we can see that as a first model, it is doing surprisingly well.

Registered Users Prediction

Casual Users Prediction

Total Users Prediction

Model Evaluation

The R-squared value for the total count is 0.8993, which means 89.93 % of the variance of the response variable can be explained by the model, and the RMSE is 160618.37.
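For reference, metrics like these can be computed with postResample() from the caret package (a sketch only; the original computation, in particular how the reported RMSE was obtained, may differ):

library(caret)

## RMSE and R-squared of predicted vs. observed total count
postResample(pred = testing$count_pred, obs = testing$count)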

But I’m still curious how much improvement my new features can bring, so next I will add the new features to the models and fit the data again.

## add new features
model_fit.reg2 <- randomForest(logReg ~ season + holiday + workingday + weather +
                                 temp + atemp + humidity + windspeed +
                                 hour + weekday + month + year +
                                 hourBin_reg + tempBin_reg + quarter + dayType + weekend,
                               data = training, importance = TRUE, ntree = 250)
model_fit.cas2 <- randomForest(logCas ~ season + holiday + workingday + weather +
                                 temp + atemp + humidity + windspeed +
                                 hour + weekday + month + year +
                                 hourBin_cas + tempBin_cas + quarter + dayType + weekend,
                               data = training, importance = TRUE, ntree = 250)

After the new models are fit, we predict again.

pred.reg <- predict(model_fit.reg2, testing)
pred.cas <- predict(model_fit.cas2, testing)

testing$registered_pred <- round(exp(pred.reg) - 1, 0)
testing$casual_pred <- round(exp(pred.cas) - 1, 0)
testing$count_pred <- testing$registered_pred + testing$casual_pred
Prediction Results with New Features

Let’s see if the new features improve the prediction results.

Registered Users Prediction

Casual Users Prediction

Total Users Prediction

New Model Evaluation

The R-squared value for the total count increased to 0.9404, which means 94.04 % of the variance of the response variable can now be explained by the model, and the RMSE decreased to 94966.51. The new features indeed improved the prediction model quite a lot.


5 - Summary

In this project, I practiced data visualization with the ggplot2 package, feature engineering using decision trees with the rpart package, and predictive analysis using random forests with the caret and randomForest packages. The final report was created in R Markdown with the knitr package.