Bike Sharing Demand - Part 1 - Profiling

This is an attempt to forecast the number of bike rentals that would happen for each hour using time series analysis.

bs <- read.csv("train.csv")

Variables in Bike Sharing dataset

colnames(bs)

##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"

It's time for some profiling.

1. SEASON

1: SPRING, 2:SUMMER, 3:FALL, 4:WINTER

bs$season <- as.factor(bs$season)
levels(bs$season)

## [1] "1" "2" "3" "4"

aggregate(count~season, data=bs, summary)

##   season count.Min. count.1st Qu. count.Median count.Mean count.3rd Qu.
## 1      1        1.0          24.0         78.0      116.3         164.0
## 2      2        1.0          49.0        172.0      215.3         321.0
## 3      3        1.0          68.0        195.0      234.4         347.0
## 4      4        1.0          51.0        161.0      199.0         294.0
##   count.Max.
## 1      801.0
## 2      873.0
## 3      977.0
## 4      948.0

table(bs$season)

## 
##    1    2    3    4 
## 2686 2733 2733 2734

Spring season attracts fewer rentals than the remaining seasons on a first glimpse.

But there are two types of counts, CASUAL and REGISTERED.

par(mfrow=c(1,3))
plot(bs$season,bs$count, main="Overall Count")
plot(bs$season,bs$casual, main="Casual")
plot(bs$season,bs$registered, main="Registered")

plot of chunk unnamed-chunk-4

CASUAL - SPRING AND WINTER LOW

2. HOLIDAY

Definition in Kaggle: whether the day is considered a holiday

bs$holiday <- as.factor(bs$holiday)
table(bs$holiday)

## 
##     0     1 
## 10575   311

aggregate(count~holiday,data=bs, summary)

##   holiday count.Min. count.1st Qu. count.Median count.Mean count.3rd Qu.
## 1       0        1.0          43.0        145.0      191.7         283.0
## 2       1        1.0          38.5        133.0      185.9         308.0
##   count.Max.
## 1      977.0
## 2      712.0

Totally 311 holidays in train.

par(mfrow=c(1,3))
plot(bs$holiday,bs$count, main="Overall Count")
plot(bs$holiday,bs$casual, main="Casual")
plot(bs$holiday,bs$registered, main="Registered")

plot of chunk unnamed-chunk-6

3. WORKINGDAY

Definition in Kaggle: whether the day is neither a weekend nor holiday

bs$workingday <- as.factor(bs$workingday)
table(bs$workingday)

## 
##    0    1 
## 3474 7412

aggregate(count~workingday, data=bs, summary)

##   workingday count.Min. count.1st Qu. count.Median count.Mean
## 1          0        1.0          44.0        128.0      188.5
## 2          1        1.0          41.0        151.0      193.0
##   count.3rd Qu. count.Max.
## 1         304.0      783.0
## 2         277.0      977.0

par(mfrow=c(1,3))
plot(bs$workingday,bs$count, main="Overall Count")
plot(bs$workingday,bs$casual, main="Casual")
plot(bs$workingday,bs$registered, main="Registered")

plot of chunk unnamed-chunk-8

Workingday - Mean(casual) goes down, Mean(registered) up.

4. WEATHER

1: Clear, Few clouds, Partly cloudy, Partly cloudy 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

bs$weather<-as.factor(bs$weather)
table(bs$weather)

## 
##    1    2    3    4 
## 7192 2834  859    1

There is only 1 row in weather 4 (Heavy rain+thunderstorm etc..). So I'm going to recode it into 3, which is the closest and most logical.

bs$weather <- ifelse(bs$weather=="4",3, bs$weather)
bs$weather<-as.factor(bs$weather)
table(bs$weather)

## 
##    1    2    3 
## 7192 2834  860

par(mfrow=c(1,3))
plot(bs$weather,bs$count, main="Overall Count")
plot(bs$weather,bs$casual, main="Casual")
plot(bs$weather,bs$registered, main="Registered")

plot of chunk unnamed-chunk-11

COunts decrease with bad weather.

5. TEMP

Definition in Kaggle: temperature in Celsius

reg1 <- lm(count~temp, data=bs)
plot(bs$temp, bs$count)
abline(reg1)

plot of chunk unnamed-chunk-12

cor(bs$temp, bs$count)

## [1] 0.3944536

Try the same for atemp

6. ATEMP

Definition in Kaggle: “feels like” temperature in Celsius

reg2 <- lm(count~atemp, data=bs)
plot(bs$atemp, bs$count)
abline(reg2)

plot of chunk unnamed-chunk-13

cor(bs$atemp, bs$count)

## [1] 0.3897844

7. Humidity

Definition in Kaggle: relative humidity

summary(bs$humidity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   47.00   62.00   61.89   77.00  100.00

reg3 <- lm(count~humidity, data=bs)
plot(bs$humidity, bs$count)
abline(reg2)

plot of chunk unnamed-chunk-14

cor(bs$humidity, bs$count)

## [1] -0.3173715

There are a few zero values in humidity. At the moment I'm not sure what to do with it.

length(which(bs$humidity==0))

## [1] 22

8. WINDSPEED:

summary(bs$windspeed)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   7.002  13.000  12.800  17.000  57.000

reg3 <- lm(count~windspeed, data=bs)
plot(bs$windspeed, bs$count)
abline(reg2)

plot of chunk unnamed-chunk-16

cor(bs$windspeed, bs$count)

## [1] 0.1013695

9. TIMESTAMP

Now it's time to decimate the timestamp variable:

Day of the week

library(lubridate)
bs$dow<-wday(bs$datetime,label=TRUE,abbr=TRUE)

par(mfrow=c(1,3))
plot(bs$dow, bs$count, main="Count")
plot(bs$dow, bs$casual, main="Casual")
plot(bs$dow, bs$registered, main="Registered")

plot of chunk unnamed-chunk-18

Casual peaks in weekends, registered dips in weekends

Hour of day

### Hour of the day
bs$hour_of_day<- strftime(bs$datetime,"%H")

par(mfrow=c(1,3))
plot(bs$hour_of_day, bs$count, main="Count")
plot(bs$hour_of_day, bs$casual, main="Casual")
plot(bs$hour_of_day, bs$registered, main="Registered")

plot of chunk unnamed-chunk-20

The trend is different for casual and registered. For registered, there is a peak during the ofice hours, but in casual, the rentals peak only during noon.

### Month
bs$month<-strftime(bs$datetime,"%m")
bs$month <- as.factor(bs$month)

par(mfrow=c(1,3))
plot(bs$month, bs$count, main="Count")
plot(bs$month, bs$casual, main="Casual")
plot(bs$month, bs$registered, main="Registered")

plot of chunk unnamed-chunk-22

Normal distribution in Casual, peaks till July, and decreases from Aug-Dec.

### Year
bs$year<-strftime(bs$datetime,"%y")
bs$year <- as.factor(bs$year)

par(mfrow=c(1,3))
plot(bs$year, bs$count, main="Count")
plot(bs$year, bs$casual, main="Casual")
plot(bs$year, bs$registered, main="Registered")

plot of chunk unnamed-chunk-24

Bike rentals picked up in 2012

Let's plot month trend by year. Start with 2011

par(mfrow=c(1,3))
plot(bs$month[bs$year=="11"], bs$count[bs$year=="11"], main="Count 2011")
plot(bs$month[bs$year=="11"], bs$casual[bs$year=="11"], main="Casual 2011")
plot(bs$month[bs$year=="11"], bs$registered[bs$year=="11"], main="Registered 2011")

plot of chunk unnamed-chunk-25

Repeat it for 2012

par(mfrow=c(1,3))
plot(bs$month[bs$year=="12"], bs$count[bs$year=="12"], main="Count 2012")
plot(bs$month[bs$year=="12"], bs$casual[bs$year=="12"], main="Casual 2012")
plot(bs$month[bs$year=="12"], bs$registered[bs$year=="12"], main="Registered 2012")

plot of chunk unnamed-chunk-26

Keeping track of changes

write.csv(bs,"bsv1.1.csv")

Short summary:

Rentals dip during spring season
Casual count dips on weekdays when compared to weekends, whereas registered dips during weekends when compared to weekdays.
The counts peak during office hours for registered, whereas the count peaks towards noon for casual.
2012 registered more rentals than 2011 for all the months.
Counts reduce when weather is bad.
Temp and atemp are highly correlated.