To solve the bike sharing problem, I initially tried numerous model variations using linear regression, but the RMSLE (the evaluation metric for this problem) remained on the higher side. Hence, I decided to let go of linear regression and focus on time series modelling.
As I had never dealt with time series modelling before, I went through this excellent online textbook, Forecasting: Principles and Practice - https://www.otexts.org/fpp
The most logical way forward was to try out some of the “benchmark models” before moving on to more advanced time series techniques.
As a result, I started off by trying the Average Method, where the forecasts of all future values are equal to the mean of the historical data.
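In FPP's notation, the average method forecasts every horizon h with the historical mean:

$$\hat{y}_{T+h|T} = \bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t$$

My version below is really a grouped variant of this: instead of a single global mean, I compute a separate mean for each Year x Month x Hour combination.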
Below is how I did this in R:
I had already created all the required variables (refer to the previous markdown document), so I'm just loading that dataset:
bs_av <- read.csv("bs_v1.2.csv")
In the code below, I'm going to get the mean of bike rentals at the Year x Month x Hour level.
For example, what is the mean of bike rentals from 8:00 AM to 9:00 AM in January 2011? This is the information I'm going to obtain:
meanagg <- aggregate(count~year+month+hour_of_day,bs_av,mean)
dim(meanagg)
## [1] 576 4
class(meanagg)
## [1] "data.frame"
The result is stored in the data frame meanagg. As you can see, there are 576 rows in this dataset - one for each combination of 2 years x 12 months x 24 hours. I'm going to be using these 576 values to forecast the 10,000-odd values in the test set.
Next, I'm going to create a primary key that is just the concatenation of the year, month and hour-of-day values, separated by _ . For example, 8:00 AM in January 2011 becomes the key "11_1_8" (assuming the year was stored as a two-digit number, as it is for the test set below).
meanagg$YrMonHod <- paste(meanagg$year,meanagg$month,meanagg$hour_of_day, sep="_")
bs_av$YrMonHod <- paste(bs_av$year,bs_av$month,bs_av$hour_of_day,sep="_")
Merge the two data frames:
bs_av_merge <- merge(bs_av,meanagg[,c("YrMonHod","count")],by="YrMonHod",all.x=T)
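Since the merge uses all.x=T, any Year x Month x Hour key in bs_av with no counterpart in meanagg would leave an NA behind, so a quick sanity check is worthwhile (my addition; note that because both data frames contain a count column, merge renames them to count.x and count.y):

sum(is.na(bs_av_merge$count.y)) # should be 0, since meanagg was built from this very data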
Next, split the train set into train and validation sets. For this, we need to create another field called “dom” - day of the month.
bs_av_merge$dom <- strftime(bs_av_merge$datetime,"%d")
bs_av_merge$dom <- as.numeric(bs_av_merge$dom)
bs_ts <- subset(bs_av_merge, bs_av_merge$dom <= 14) ### Since this is time series data, splitting by day of the month keeps every hour of the day represented in both sets
bs_vs <- subset(bs_av_merge, bs_av_merge$dom >14)
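As a quick check on the split (my addition): the Kaggle training data covers the first 19 days of each month, so days 1-14 go to train and days 15-19 to validation.

nrow(bs_ts) # rows from days 1-14 of each month
nrow(bs_vs) # rows from days 15-19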
Let's see what RMSLE value we are able to get for this. After the merge, count.x holds the actual rental counts and count.y holds the forecasts (the grouped means):
library(Metrics)
rmsle(round(bs_vs$count.y),round(bs_vs$count.x))
## [1] 0.60276
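For reference, here is a minimal hand-rolled version of what Metrics::rmsle computes (my sketch). The formula is symmetric in its two arguments, which is why calling it above with the forecast first, rather than in the documented (actual, predicted) order, gives the same value:

# RMSLE: square root of the mean squared difference between log(1 + x) values
rmsle_manual <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}
rmsle_manual(round(bs_vs$count.x),round(bs_vs$count.y)) # same result as above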
This is a welcome figure for me, because all my linear regression models yielded an RMSLE of around 0.7 and above. A simple benchmark model is able to give a better result than that!
Now let's try this on the test data and upload the file on Kaggle.
Load and create the same variables in the test set:
bs_test <- read.csv("test.csv")
bs_test$year<-strftime(bs_test$datetime,"%y")
bs_test$month<-strftime(bs_test$datetime,"%m")
bs_test$hour_of_day<- strftime(bs_test$datetime,"%H")
bs_test$year <- as.numeric(bs_test$year)
bs_test$month <- as.numeric(bs_test$month)
bs_test$hour_of_day <- as.numeric(bs_test$hour_of_day)
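As an aside, since the train and test sets need identical features for the keys to match, one way to avoid duplicating this logic would be a small helper function (a sketch of my own, not part of the original workflow):

# Add the same three time fields to any data frame with a datetime column
add_time_fields <- function(df) {
  df$year <- as.numeric(strftime(df$datetime,"%y"))
  df$month <- as.numeric(strftime(df$datetime,"%m"))
  df$hour_of_day <- as.numeric(strftime(df$datetime,"%H"))
  df
}
bs_test <- add_time_fields(bs_test) # equivalent to the six lines above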
Create the primary key in the test set:
bs_test$YrMonHod <- paste(bs_test$year,bs_test$month,bs_test$hour_of_day,sep="_")
Merge the tables:
bs_test_merge <- merge(bs_test,meanagg[,c("YrMonHod","count")],by="YrMonHod",all.x=T)
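Before writing the submission, it's worth verifying that every test row found a matching mean (my addition; since bs_test has no count column of its own, the merged mean simply comes through as count):

sum(is.na(bs_test_merge$count)) # should be 0: every Year x Month x Hour combination in the test set also appears in the train set
# If any keys had gone unmatched, a reasonable fallback would be the overall training mean:
bs_test_merge$count[is.na(bs_test_merge$count)] <- mean(bs_av$count)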
I need the output file to contain just the datetime and count columns:
write.csv(bs_test_merge[,c("datetime","count")],"bstsv1.csv", row.names=FALSE) # selecting the columns by name rather than position
On submitting this, the RMSLE was 0.72, which is not all that bad compared to what I might have obtained through my linear regression models. This score now serves as my benchmark, and whatever models I create henceforth should at least be better than this.