Bike Sharing Demand - Part 1 - Profiling

This is an attempt to forecast the number of bike rentals that would happen for each hour using time series analysis.

bs <- read.csv("train.csv")

Variables in Bike Sharing dataset

colnames(bs)
##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"

It's time for some profiling.

1. SEASON

1: SPRING, 2:SUMMER, 3:FALL, 4:WINTER

bs$season <- as.factor(bs$season)
levels(bs$season)
## [1] "1" "2" "3" "4"
aggregate(count~season, data=bs, summary)
##   season count.Min. count.1st Qu. count.Median count.Mean count.3rd Qu.
## 1      1        1.0          24.0         78.0      116.3         164.0
## 2      2        1.0          49.0        172.0      215.3         321.0
## 3      3        1.0          68.0        195.0      234.4         347.0
## 4      4        1.0          51.0        161.0      199.0         294.0
##   count.Max.
## 1      801.0
## 2      873.0
## 3      977.0
## 4      948.0
table(bs$season)
## 
##    1    2    3    4 
## 2686 2733 2733 2734

Spring season attracts fewer rentals than the remaining seasons on a first glimpse.

But there are two types of counts, CASUAL and REGISTERED.

par(mfrow=c(1,3))
plot(bs$season,bs$count, main="Overall Count")
plot(bs$season,bs$casual, main="Casual")
plot(bs$season,bs$registered, main="Registered")

plot of chunk unnamed-chunk-4

CASUAL - SPRING AND WINTER LOW

2. HOLIDAY

Definition in Kaggle: whether the day is considered a holiday

bs$holiday <- as.factor(bs$holiday)
table(bs$holiday)
## 
##     0     1 
## 10575   311
aggregate(count~holiday,data=bs, summary)
##   holiday count.Min. count.1st Qu. count.Median count.Mean count.3rd Qu.
## 1       0        1.0          43.0        145.0      191.7         283.0
## 2       1        1.0          38.5        133.0      185.9         308.0
##   count.Max.
## 1      977.0
## 2      712.0

Totally 311 holidays in train.

par(mfrow=c(1,3))
plot(bs$holiday,bs$count, main="Overall Count")
plot(bs$holiday,bs$casual, main="Casual")
plot(bs$holiday,bs$registered, main="Registered")

plot of chunk unnamed-chunk-6

3. WORKINGDAY

Definition in Kaggle: whether the day is neither a weekend nor holiday

bs$workingday <- as.factor(bs$workingday)
table(bs$workingday)
## 
##    0    1 
## 3474 7412
aggregate(count~workingday, data=bs, summary)
##   workingday count.Min. count.1st Qu. count.Median count.Mean
## 1          0        1.0          44.0        128.0      188.5
## 2          1        1.0          41.0        151.0      193.0
##   count.3rd Qu. count.Max.
## 1         304.0      783.0
## 2         277.0      977.0
par(mfrow=c(1,3))
plot(bs$workingday,bs$count, main="Overall Count")
plot(bs$workingday,bs$casual, main="Casual")
plot(bs$workingday,bs$registered, main="Registered")

plot of chunk unnamed-chunk-8

Workingday - Mean(casual) goes down, Mean(registered) up.

4. WEATHER

1: Clear, Few clouds, Partly cloudy, Partly cloudy 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

bs$weather<-as.factor(bs$weather)
table(bs$weather)
## 
##    1    2    3    4 
## 7192 2834  859    1

There is only 1 row in weather 4 (Heavy rain+thunderstorm etc..). So I'm going to recode it into 3, which is the closest and most logical.

bs$weather <- ifelse(bs$weather=="4",3, bs$weather)
bs$weather<-as.factor(bs$weather)
table(bs$weather)
## 
##    1    2    3 
## 7192 2834  860
par(mfrow=c(1,3))
plot(bs$weather,bs$count, main="Overall Count")
plot(bs$weather,bs$casual, main="Casual")
plot(bs$weather,bs$registered, main="Registered")

plot of chunk unnamed-chunk-11

COunts decrease with bad weather.

5. TEMP

Definition in Kaggle: temperature in Celsius

reg1 <- lm(count~temp, data=bs)
plot(bs$temp, bs$count)
abline(reg1)

plot of chunk unnamed-chunk-12

cor(bs$temp, bs$count)
## [1] 0.3944536

Try the same for atemp

6. ATEMP

Definition in Kaggle: “feels like” temperature in Celsius

reg2 <- lm(count~atemp, data=bs)
plot(bs$atemp, bs$count)
abline(reg2)

plot of chunk unnamed-chunk-13

cor(bs$atemp, bs$count)
## [1] 0.3897844

7. Humidity

Definition in Kaggle: relative humidity

summary(bs$humidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   47.00   62.00   61.89   77.00  100.00
reg3 <- lm(count~humidity, data=bs)
plot(bs$humidity, bs$count)
abline(reg2)

plot of chunk unnamed-chunk-14

cor(bs$humidity, bs$count)
## [1] -0.3173715

There are a few zero values in humidity. At the moment I'm not sure what to do with it.

length(which(bs$humidity==0))
## [1] 22

8. WINDSPEED:

summary(bs$windspeed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.002  13.000  12.800  17.000  57.000
reg3 <- lm(count~windspeed, data=bs)
plot(bs$windspeed, bs$count)
abline(reg2)

plot of chunk unnamed-chunk-16

cor(bs$windspeed, bs$count)
## [1] 0.1013695

9. TIMESTAMP

Now it's time to decimate the timestamp variable:

Day of the week

library(lubridate)
bs$dow<-wday(bs$datetime,label=TRUE,abbr=TRUE)
par(mfrow=c(1,3))
plot(bs$dow, bs$count, main="Count")
plot(bs$dow, bs$casual, main="Casual")
plot(bs$dow, bs$registered, main="Registered")

plot of chunk unnamed-chunk-18

Casual peaks in weekends, registered dips in weekends

Hour of day

### Hour of the day
bs$hour_of_day<- strftime(bs$datetime,"%H")
par(mfrow=c(1,3))
plot(bs$hour_of_day, bs$count, main="Count")
plot(bs$hour_of_day, bs$casual, main="Casual")
plot(bs$hour_of_day, bs$registered, main="Registered")

plot of chunk unnamed-chunk-20

The trend is different for casual and registered. For registered, there is a peak during the ofice hours, but in casual, the rentals peak only during noon.

### Month
bs$month<-strftime(bs$datetime,"%m")
bs$month <- as.factor(bs$month)
par(mfrow=c(1,3))
plot(bs$month, bs$count, main="Count")
plot(bs$month, bs$casual, main="Casual")
plot(bs$month, bs$registered, main="Registered")

plot of chunk unnamed-chunk-22

Normal distribution in Casual, peaks till July, and decreases from Aug-Dec.

### Year
bs$year<-strftime(bs$datetime,"%y")
bs$year <- as.factor(bs$year)
par(mfrow=c(1,3))
plot(bs$year, bs$count, main="Count")
plot(bs$year, bs$casual, main="Casual")
plot(bs$year, bs$registered, main="Registered")

plot of chunk unnamed-chunk-24

Bike rentals picked up in 2012

Let's plot month trend by year. Start with 2011

par(mfrow=c(1,3))
plot(bs$month[bs$year=="11"], bs$count[bs$year=="11"], main="Count 2011")
plot(bs$month[bs$year=="11"], bs$casual[bs$year=="11"], main="Casual 2011")
plot(bs$month[bs$year=="11"], bs$registered[bs$year=="11"], main="Registered 2011")

plot of chunk unnamed-chunk-25

Repeat it for 2012

par(mfrow=c(1,3))
plot(bs$month[bs$year=="12"], bs$count[bs$year=="12"], main="Count 2012")
plot(bs$month[bs$year=="12"], bs$casual[bs$year=="12"], main="Casual 2012")
plot(bs$month[bs$year=="12"], bs$registered[bs$year=="12"], main="Registered 2012")

plot of chunk unnamed-chunk-26

Keeping track of changes

write.csv(bs,"bsv1.1.csv")

Short summary: