This is an attempt to forecast the number of bike rentals that would happen for each hour using time series analysis.
bs <- read.csv("train.csv")
Variables in Bike Sharing dataset
colnames(bs)
## [1] "datetime" "season" "holiday" "workingday" "weather"
## [6] "temp" "atemp" "humidity" "windspeed" "casual"
## [11] "registered" "count"
It's time for some profiling.
1: SPRING, 2:SUMMER, 3:FALL, 4:WINTER
bs$season <- as.factor(bs$season)
levels(bs$season)
## [1] "1" "2" "3" "4"
aggregate(count~season, data=bs, summary)
## season count.Min. count.1st Qu. count.Median count.Mean count.3rd Qu.
## 1 1 1.0 24.0 78.0 116.3 164.0
## 2 2 1.0 49.0 172.0 215.3 321.0
## 3 3 1.0 68.0 195.0 234.4 347.0
## 4 4 1.0 51.0 161.0 199.0 294.0
## count.Max.
## 1 801.0
## 2 873.0
## 3 977.0
## 4 948.0
table(bs$season)
##
## 1 2 3 4
## 2686 2733 2733 2734
Spring season attracts fewer rentals than the remaining seasons on a first glimpse.
But there are two types of counts, CASUAL and REGISTERED.
par(mfrow=c(1,3))
plot(bs$season,bs$count, main="Overall Count")
plot(bs$season,bs$casual, main="Casual")
plot(bs$season,bs$registered, main="Registered")
CASUAL - SPRING AND WINTER LOW
Definition in Kaggle: whether the day is considered a holiday
bs$holiday <- as.factor(bs$holiday)
table(bs$holiday)
##
## 0 1
## 10575 311
aggregate(count~holiday,data=bs, summary)
## holiday count.Min. count.1st Qu. count.Median count.Mean count.3rd Qu.
## 1 0 1.0 43.0 145.0 191.7 283.0
## 2 1 1.0 38.5 133.0 185.9 308.0
## count.Max.
## 1 977.0
## 2 712.0
Totally 311 holidays in train.
par(mfrow=c(1,3))
plot(bs$holiday,bs$count, main="Overall Count")
plot(bs$holiday,bs$casual, main="Casual")
plot(bs$holiday,bs$registered, main="Registered")
Definition in Kaggle: whether the day is neither a weekend nor holiday
bs$workingday <- as.factor(bs$workingday)
table(bs$workingday)
##
## 0 1
## 3474 7412
aggregate(count~workingday, data=bs, summary)
## workingday count.Min. count.1st Qu. count.Median count.Mean
## 1 0 1.0 44.0 128.0 188.5
## 2 1 1.0 41.0 151.0 193.0
## count.3rd Qu. count.Max.
## 1 304.0 783.0
## 2 277.0 977.0
par(mfrow=c(1,3))
plot(bs$workingday,bs$count, main="Overall Count")
plot(bs$workingday,bs$casual, main="Casual")
plot(bs$workingday,bs$registered, main="Registered")
Workingday - Mean(casual) goes down, Mean(registered) up.
1: Clear, Few clouds, Partly cloudy, Partly cloudy 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
bs$weather<-as.factor(bs$weather)
table(bs$weather)
##
## 1 2 3 4
## 7192 2834 859 1
There is only 1 row in weather 4 (Heavy rain+thunderstorm etc..). So I'm going to recode it into 3, which is the closest and most logical.
bs$weather <- ifelse(bs$weather=="4",3, bs$weather)
bs$weather<-as.factor(bs$weather)
table(bs$weather)
##
## 1 2 3
## 7192 2834 860
par(mfrow=c(1,3))
plot(bs$weather,bs$count, main="Overall Count")
plot(bs$weather,bs$casual, main="Casual")
plot(bs$weather,bs$registered, main="Registered")
COunts decrease with bad weather.
Definition in Kaggle: temperature in Celsius
reg1 <- lm(count~temp, data=bs)
plot(bs$temp, bs$count)
abline(reg1)
cor(bs$temp, bs$count)
## [1] 0.3944536
Try the same for atemp
Definition in Kaggle: “feels like” temperature in Celsius
reg2 <- lm(count~atemp, data=bs)
plot(bs$atemp, bs$count)
abline(reg2)
cor(bs$atemp, bs$count)
## [1] 0.3897844
Definition in Kaggle: relative humidity
summary(bs$humidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 47.00 62.00 61.89 77.00 100.00
reg3 <- lm(count~humidity, data=bs)
plot(bs$humidity, bs$count)
abline(reg2)
cor(bs$humidity, bs$count)
## [1] -0.3173715
There are a few zero values in humidity. At the moment I'm not sure what to do with it.
length(which(bs$humidity==0))
## [1] 22
summary(bs$windspeed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.002 13.000 12.800 17.000 57.000
reg3 <- lm(count~windspeed, data=bs)
plot(bs$windspeed, bs$count)
abline(reg2)
cor(bs$windspeed, bs$count)
## [1] 0.1013695
Now it's time to decimate the timestamp variable:
Day of the week
library(lubridate)
bs$dow<-wday(bs$datetime,label=TRUE,abbr=TRUE)
par(mfrow=c(1,3))
plot(bs$dow, bs$count, main="Count")
plot(bs$dow, bs$casual, main="Casual")
plot(bs$dow, bs$registered, main="Registered")
Casual peaks in weekends, registered dips in weekends
Hour of day
### Hour of the day
bs$hour_of_day<- strftime(bs$datetime,"%H")
par(mfrow=c(1,3))
plot(bs$hour_of_day, bs$count, main="Count")
plot(bs$hour_of_day, bs$casual, main="Casual")
plot(bs$hour_of_day, bs$registered, main="Registered")
The trend is different for casual and registered. For registered, there is a peak during the ofice hours, but in casual, the rentals peak only during noon.
### Month
bs$month<-strftime(bs$datetime,"%m")
bs$month <- as.factor(bs$month)
par(mfrow=c(1,3))
plot(bs$month, bs$count, main="Count")
plot(bs$month, bs$casual, main="Casual")
plot(bs$month, bs$registered, main="Registered")
Normal distribution in Casual, peaks till July, and decreases from Aug-Dec.
### Year
bs$year<-strftime(bs$datetime,"%y")
bs$year <- as.factor(bs$year)
par(mfrow=c(1,3))
plot(bs$year, bs$count, main="Count")
plot(bs$year, bs$casual, main="Casual")
plot(bs$year, bs$registered, main="Registered")
Bike rentals picked up in 2012
Let's plot month trend by year. Start with 2011
par(mfrow=c(1,3))
plot(bs$month[bs$year=="11"], bs$count[bs$year=="11"], main="Count 2011")
plot(bs$month[bs$year=="11"], bs$casual[bs$year=="11"], main="Casual 2011")
plot(bs$month[bs$year=="11"], bs$registered[bs$year=="11"], main="Registered 2011")
Repeat it for 2012
par(mfrow=c(1,3))
plot(bs$month[bs$year=="12"], bs$count[bs$year=="12"], main="Count 2012")
plot(bs$month[bs$year=="12"], bs$casual[bs$year=="12"], main="Casual 2012")
plot(bs$month[bs$year=="12"], bs$registered[bs$year=="12"], main="Registered 2012")
Keeping track of changes
write.csv(bs,"bsv1.1.csv")
Short summary: