April 05, 2016
This is an extended work of my final project for the course MATH664: Methods for Statistical Consulting. The original post and project background can be found here.
Briefly, due to limited time and resources, when my team and I worked on this project last time, we only built simple generalized linear models to explore the patterns in the data set. Poisson regression, linear regression with log transformation, and negative binomial regression were applied respectively. However, the predictive results were not very good, which is part of the reason why the original post was named Exploratory Data Analysis of Bike Rental System in Washington, D.C..
This project follows the steps and procedure of this link, whose exploratory analysis is very similar to what I did before; the feature engineering and predictive analysis parts, however, are completely different.
So in this updated post, I will perform data visualization with the ggplot2 package, feature engineering with the help of decision trees, and predictive analysis using random forests.
First, I combined the training and testing data sets together to understand the distribution of the explanatory variables.
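For reference, here is a minimal sketch of this combination step; the file names and the NA padding are assumptions, since the original code is not shown:

```r
## read the Kaggle-style data files (file names are assumed)
data_train <- read.csv("train.csv")
data_test  <- read.csv("test.csv")

## the test set has no response columns, so pad them with NA before stacking
data_test$casual <- data_test$registered <- data_test$count <- NA
data.combine <- rbind(data_train, data_test)

## inspect the combined structure
dplyr::glimpse(data.combine)
```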
Observations: 17,379
Variables: 12
$ datetime (fctr) 2011-01-01 00:00:00, 2011-01-01 01:00:00, 2011-01-...
$ season (fctr) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ holiday (fctr) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ workingday (fctr) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ weather (fctr) 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,...
$ temp (dbl) 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.8...
$ atemp (dbl) 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13....
$ humidity (int) 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77,...
$ windspeed (dbl) 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0...
$ casual (dbl) 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 4...
$ registered (dbl) 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 7...
$ count (dbl) 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, ...
The structure of the data set shows a date & time variable; four categorical variables: season, weather, holiday, and workingday; and four numerical variables: temperature, "feels like" temperature, humidity, and windspeed. The total count is the sum of casual and registered.
The prediction task is actually to predict the casual and registered bike users respectively; summing them up gives the total.
From the plots above, we can see that:
The season variable has an almost equal distribution across the four seasons.

| Spring | Summer | Fall | Winter |
|---|---|---|---|
| 24.41 % | 25.37 % | 25.87 % | 24.35 % |
Weather condition 1 (good weather) is the most common, and weather condition 4 (terrible weather) is the least common, which means that most of the time during the two-year period the weather was good.

| WeatherCondition_1 | WeatherCondition_2 | WeatherCondition_3 | WeatherCondition_4 |
|---|---|---|---|
| 65.67 % | 26.15 % | 8.17 % | 0.02 % |
The vast majority of days in the data set are not holidays.

| Not a Holiday | Holiday |
|---|---|
| 97.12 % | 2.88 % |
About two thirds of the days are working days.

| Not a Workingday | Workingday |
|---|---|
| 31.73 % | 68.27 % |
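Percentages like the ones above can be computed with a simple frequency table; here is a sketch for the season variable (the original code is not shown):

```r
## share of observations in each season, in percent
round(prop.table(table(data.combine$season)) * 100, 2)
```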
These numerical variables seem to be reasonably distributed, with no obvious anomalies.
It is common sense to consider time an important variable when it comes to predicting rental bike usage, so I extracted the time information from the date & time variable to analyze the hourly, weekly, monthly, and yearly trends.
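A minimal sketch of this extraction step, assuming the lubridate package (the original code is not shown):

```r
library(lubridate)

## parse the datetime factor and pull out the time components
data.combine$datetime <- ymd_hms(as.character(data.combine$datetime))
data.combine$hour     <- hour(data.combine$datetime)
data.combine$weekday  <- wday(data.combine$datetime, label = TRUE)
data.combine$month    <- month(data.combine$datetime)
data.combine$year     <- year(data.combine$datetime)
```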
As we can see clearly from the plots, the overall trend is quite obvious.
Besides the overall analysis, I also performed the same analysis for registered and casual users respectively, where I found something interesting.
The hourly trend of the registered bike users matches the overall trend, while there is no peak in the morning and afternoon for casual users. This is a huge difference between registered and casual users, and because of it, different hour bins will be created later in feature engineering.
The weekly trend of the registered bike users also matches the overall trend, but the casual users actually show the opposite trend: there are more casual users on weekends than on weekdays. One possible explanation could be that registered users use rental bikes as commuting tools, while casual users ride them for leisure. Based on this assumption, a variable for weekends will also be created later.
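As an illustration, here is a sketch of how such an hourly comparison can be plotted with ggplot2; this is an assumption, not the original plotting code, and it assumes the hour column extracted above has been carried into data_train:

```r
library(ggplot2)
library(reshape2)

## average usage per hour, separately for registered and casual users
hourly <- aggregate(cbind(registered, casual) ~ hour, data = data_train, FUN = mean)
hourly_long <- melt(hourly, id.vars = "hour", variable.name = "userType")

ggplot(hourly_long, aes(x = hour, y = value, colour = userType)) +
  geom_line() +
  labs(y = "average number of users")
```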
From the plots above, we can see that the effect of season and weather on rental bike usage is quite clear.
On the other hand, the effect of holiday and workingday is not so obvious. It seems there are more users when it is not a holiday, and also on working days. In order to better understand these two variables, I repeated the analysis for registered and casual users respectively.
Again, the trends for registered and casual users are exactly opposite, which further supports the assumption I made earlier.
We can see that bike usage has a positive relationship with temperature and a negative relationship with humidity. There also seems to be a positive relationship between bike usage and windspeed, but it is not as strong as the one with temperature.
temp and atemp are highly correlated.
**Correlation Matrix**

|   | count | registered | casual | temp | atemp | humidity | windspeed |
|---|---|---|---|---|---|---|---|
| count | 1 | 0.97 | 0.69 | 0.39 | 0.39 | -0.32 | 0.1 |
| registered | 0.97 | 1 | 0.5 | 0.32 | 0.31 | -0.27 | 0.09 |
| casual | 0.69 | 0.5 | 1 | 0.47 | 0.46 | -0.35 | 0.09 |
| temp | 0.39 | 0.32 | 0.47 | 1 | 0.98 | -0.06 | -0.02 |
| atemp | 0.39 | 0.31 | 0.46 | 0.98 | 1 | -0.04 | -0.06 |
| humidity | -0.32 | -0.27 | -0.35 | -0.06 | -0.04 | 1 | -0.32 |
| windspeed | 0.1 | 0.09 | 0.09 | -0.02 | -0.06 | -0.32 | 1 |
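A sketch of how this matrix can be computed on the training rows (an assumption, since the original code is not shown):

```r
## pairwise correlations between the numerical variables
num_vars <- c("count", "registered", "casual", "temp", "atemp", "humidity", "windspeed")
round(cor(data_train[, num_vars]), 2)
```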
In this section, I'll create five new features from the given features to help improve the prediction results.
As demonstrated earlier, the hour of the day is a significant variable. Now I'll create a new feature called "hour bin" based on different times of day, for registered and casual users respectively.
I'll use a decision tree to help define the hour bins.
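A minimal sketch of this step with the rpart package, assuming the trees are grown on the training rows where the responses are available (the original code is not shown):

```r
library(rpart)
library(rpart.plot)

## grow a shallow regression tree of usage on hour; its split points
## suggest the cut-offs used for the hour bins below
tree.reg <- rpart(registered ~ hour, data = data_train)
rpart.plot(tree.reg)

tree.cas <- rpart(casual ~ hour, data = data_train)
rpart.plot(tree.cas)
```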
Create the hourBin variables as follows:
## create hour bucket for registered users
data.combine$hourBin_reg = 0
data.combine$hourBin_reg[data.combine$hour < 7.5] = 1
data.combine$hourBin_reg[data.combine$hour >= 22] = 2
data.combine$hourBin_reg[data.combine$hour >= 9.5 & data.combine$hour < 18] = 3
data.combine$hourBin_reg[data.combine$hour >= 7.5 & data.combine$hour < 8.5] = 4
data.combine$hourBin_reg[data.combine$hour >= 8.5 & data.combine$hour < 9.5] = 5
data.combine$hourBin_reg[data.combine$hour >= 20 & data.combine$hour < 22] = 6
data.combine$hourBin_reg[data.combine$hour >= 18 & data.combine$hour < 20] = 7
## create hour bucket for casual users
data.combine$hourBin_cas = 0
data.combine$hourBin_cas[data.combine$hour < 8.5] = 1
data.combine$hourBin_cas[data.combine$hour >= 8.5 & data.combine$hour < 10] = 2
data.combine$hourBin_cas[data.combine$hour >= 20] = 3
data.combine$hourBin_cas[data.combine$hour >= 10 & data.combine$hour < 20] = 4

We can do the same for the temperature variable.
Create the tempBin variables as follows:
## create temperature bucket for registered users
data.combine$tempBin_reg = 0
data.combine$tempBin_reg[data.combine$temp < 13] = 1
data.combine$tempBin_reg[data.combine$temp >= 13 & data.combine$temp < 23] = 2
data.combine$tempBin_reg[data.combine$temp >= 23 & data.combine$temp < 30] = 3
data.combine$tempBin_reg[data.combine$temp >= 30] = 4
## create temperature bucket for casual users
data.combine$tempBin_cas = 0
data.combine$tempBin_cas[data.combine$temp < 15] = 1
data.combine$tempBin_cas[data.combine$temp >= 15 & data.combine$temp < 23] = 2
data.combine$tempBin_cas[data.combine$temp >= 23 & data.combine$temp < 30] = 3
data.combine$tempBin_cas[data.combine$temp >= 30] = 4

Based on what we've seen so far, rental bike usage is increasing over time and has a seasonal trend, so I will create a new variable based on the quarter of the year to capture this pattern.
data.combine$quarter = 0
data.combine$quarter[data.combine$year == '2011'] = 1
data.combine$quarter[data.combine$year == '2011' & data.combine$month > 3] = 2
data.combine$quarter[data.combine$year == '2011' & data.combine$month > 6] = 3
data.combine$quarter[data.combine$year == '2011' & data.combine$month > 9] = 4
data.combine$quarter[data.combine$year == '2012'] = 5
data.combine$quarter[data.combine$year == '2012' & data.combine$month > 3] = 6
data.combine$quarter[data.combine$year == '2012' & data.combine$month > 6] = 7
data.combine$quarter[data.combine$year == '2012' & data.combine$month > 9] = 8

Eight bins have been created for the two-year period, and the numbers of observations in the bins are close to each other.
| QuarterBin_1 | QuarterBin_2 | QuarterBin_3 | QuarterBin_4 | QuarterBin_5 | QuarterBin_6 | QuarterBin_7 | QuarterBin_8 |
|---|---|---|---|---|---|---|---|
| 2067 | 2183 | 2192 | 2203 | 2176 | 2182 | 2208 | 2168 |
This new feature is actually a combination of holiday and workingday, but it specifically indicates whether a given day is a weekend.
data.combine$dayType = ""
data.combine$dayType[data.combine$holiday == 1] = "holiday"
data.combine$dayType[data.combine$holiday == 0 & data.combine$workingday == 0] = "weekend"
data.combine$dayType[data.combine$holiday == 0 & data.combine$workingday == 1] = "workingDay"

The result is as follows:
| holiday | weekend | workingDay |
|---|---|---|
| 500 | 5014 | 11865 |
Because registered and casual users show different patterns on weekdays and weekends, it is necessary to create a variable specifically for weekends.
data.combine$weekend = 0
data.combine$weekend[data.combine$dayType == "weekend"] = 1

In the result, 0 means the day is a weekday, and 1 means it is a weekend.
| Not a Weekend | Weekend |
|---|---|
| 12365 | 5014 |
All the hard work above is actually for the purpose of this final step. Here I will use the random forest method to predict bike usage. In order to see the effect of the new features I just created, my first model will use the original features only; in my second model, I will incorporate the new features to see how big a difference my feature engineering can make.
Now it is a good time to take a look at the distribution of the response variables, registered and casual.
It is clear that the distributions are skewed to the right. To deal with this, I'll perform a log transformation on the response variables.
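A sketch of how these distributions can be inspected with ggplot2 (the original plotting code is not shown):

```r
library(ggplot2)

## right-skewed distributions of the registered- and casual-user counts
ggplot(data_train, aes(x = registered)) + geom_histogram(bins = 30)
ggplot(data_train, aes(x = casual)) + geom_histogram(bins = 30)
```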
## log transformation for dependent variable
data_train$logReg <- log(data_train$registered + 1)
data_train$logCas <- log(data_train$casual + 1)

After the log transformation, the distributions are still not perfectly normal, but they are better than before.
All the preparations are done; now I will build my first random forest model for the problem.
First, I will create partitions for the training and testing data sets using the caret package: 80% of the observations are assigned to the training set, and the rest form the testing set.
set.seed(123)
trainIndex = createDataPartition(data_train$count, p = 0.80, list = FALSE)
training = data_train[trainIndex, ]
testing = data_train[-trainIndex, ]

Next, I fit my model with only the original predictors.
model_fit.reg <- randomForest(logReg ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + weekday + month + year, data = training, importance = TRUE, ntree = 250)
model_fit.cas <- randomForest(logCas ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + weekday + month + year, data = training, importance = TRUE, ntree = 250)

After the model fit is complete, we can use it to predict bike usage.
pred.reg <- predict(model_fit.reg, testing)
pred.cas <- predict(model_fit.cas, testing)
testing$registered_pred <- round(exp(pred.reg) - 1, 0)
testing$casual_pred <- round(exp(pred.cas) - 1, 0)
testing$count_pred <- testing$registered_pred + testing$casual_pred

Taking a look at the prediction vs. observation plot, we can see that, as a first model, it is doing surprisingly well.
The R-squared value for the total count is 0.8993, which means 89.93% of the variance in the response variable can be explained by the model, and the RMSE is 160618.37.
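These statistics can be computed, for example, with caret's postResample; this is a sketch of one way to do it, not the original code:

```r
library(caret)

## returns RMSE, R-squared, and MAE for predictions vs. observations
postResample(pred = testing$count_pred, obs = testing$count)
```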
BUT I'm still curious how much improvement my new features can bring. So next, I will add the new features to the model and fit the data again.
## add new features
model_fit.reg2 <- randomForest(logReg ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + weekday + month + year + hourBin_reg + tempBin_reg + quarter + dayType + weekend, data = training, importance = TRUE, ntree = 250)
model_fit.cas2 <- randomForest(logCas ~ season + holiday + workingday + weather + temp + atemp + humidity + windspeed + hour + weekday + month + year + hourBin_cas + tempBin_cas + quarter + dayType + weekend, data = training, importance = TRUE, ntree = 250)

After the model fit is complete, we can make another prediction.
pred.reg <- predict(model_fit.reg2, testing)
pred.cas <- predict(model_fit.cas2, testing)
testing$registered_pred <- round(exp(pred.reg) - 1, 0)
testing$casual_pred <- round(exp(pred.cas) - 1, 0)
testing$count_pred <- testing$registered_pred + testing$casual_pred

Let's see if the new features improve the prediction results.
The R-squared value for the total count increased to 0.9404, which means that 94.04% of the variance in the response variable can now be explained by the model, and the RMSE decreased to 94966.51. The new features indeed improved the prediction model quite a lot.
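Since importance = TRUE was set when fitting, we can also inspect which features matter most; a sketch using randomForest's built-in importance tools:

```r
## variable importance for the registered-user model with new features
varImpPlot(model_fit.reg2)
importance(model_fit.reg2)
```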
In this project, I practiced data visualization with the ggplot2 package, feature engineering using decision trees with the rpart package, and predictive analysis using random forests with the caret and randomForest packages. The final report was created in R Markdown with the knitr package.