The problem is to forecast the bike rental demand. The train data set has hourly data for first 19 days of each month while we need to predict for 20th to the month end.

library(ggplot2)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.2.3
library(xtable)
setwd("D:/personal/R/R_pracs/bike sharing")
train <- read.csv("train.csv")
test <- read.csv("test.csv")
str(train)
## 'data.frame':    10886 obs. of  12 variables:
##  $ datetime  : Factor w/ 10886 levels "2011-01-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num  0 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0

The train dataset has 12 variables while the test data has 9 since the casual, registered and total count of the users are to be predicted. It can also be observed that there is no missing data.

Explanation of the Variables:

As seen above, there are 12 variables in the train dataset and can be categorized in to dependent and independent variables. Dependent variables are the 3 user count variables while the other 9 are independent variables.

  1. datetime - date and hour in “mm/dd/yyyy hh:mm” format
  2. season - 4 levels; 1 = spring, 2 = summer, 3 = fall, 4 = winter
  3. holiday - whether the day is considered a holiday
  4. workingday - whether the day is neither a weekend nor holiday
  5. weather -
  1. temp - temperature in Celsius
  2. atemp - “feels like” temperature in Celsius
  3. humidity - relative humidity
  4. windspeed - wind speed
  5. casual - number of non-registered user rentals initiated
  6. registered - number of registered user rentals initiated
  7. count - number of total rentals

We can understand the frequency of each variable by plotting histograms.

hist(train$season,main="Histogram for Season", 
     xlab="Season", col = "grey")

hist(train$holiday, main="Histogram for Holiday", 
     xlab="holiday", col = "grey")

hist(train$workingday, main="Histogram for Working Days", 
     xlab="working day", col = "grey")

hist(train$weather, main="Histogram for Weather", 
     xlab="weather", col = "grey")

hist(train$temp, main="Histogram for Temperature", 
     xlab="temperature", col = "grey")

hist(train$windspeed, main="Histogram for Wind Speed", 
     xlab="wind speed", col = "grey")

Most of the frequencies are as expected. Seanson frequency is almost the same. The number of holidays are less and working days are more. Most of the days have clear weather.

Hourly Trend:

train$hour <- hour(ymd_hms(train$datetime))
ggplot(train,aes(factor(hour),count)) + geom_boxplot()

We can observe that the peak hours are in between 7 to 9 & 16 to 19 hours. Now we can analyse the hourly trend with respect to the weekday

train$day <- wday(ymd_hms(train$datetime), label = TRUE)
ggplot(train, aes(hour, count,color = day)) + geom_smooth(ce=FALSE, fill=NA) + ggtitle("Hourly Trend of bikes rented on weekdays and weekends") + scale_x_continuous( breaks = seq(0,24,by=2))
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

We can observe that the count peaks in mornings (7 to 9) and evenings (16 to 20) for weekdays while for weekends the count is more during the day.

We can check the impact of weather and season on the total bikes rented.

ggplot(train, aes(factor(weather),count)) + geom_boxplot()

ggplot(train, aes(factor(season), count)) + geom_boxplot()

As expected, the number of bikes rented is more on days with clear weather or few clouds. Also, the number of bikes rented is more in summer and fall season.

We can correlate the temperature, weather, humidity, windspeed.

subtrain <- data.frame(train$count,train$temp,train$atemp,train$humidity,train$windspeed)
cor(subtrain)
##                 train.count  train.temp train.atemp train.humidity
## train.count       1.0000000  0.39445364  0.38978444    -0.31737148
## train.temp        0.3944536  1.00000000  0.98494811    -0.06494877
## train.atemp       0.3897844  0.98494811  1.00000000    -0.04353571
## train.humidity   -0.3173715 -0.06494877 -0.04353571     1.00000000
## train.windspeed   0.1013695 -0.01785201 -0.05747300    -0.31860699
##                 train.windspeed
## train.count          0.10136947
## train.temp          -0.01785201
## train.atemp         -0.05747300
## train.humidity      -0.31860699
## train.windspeed      1.00000000

It can be observed that temperature and windspeed are positively correlated with the total bikes rented.

Features

featuring <- function(data){
  features <- c("season","holiday","workingday","weather","temp","atemp","humidity","windspeed","hour","month","weekend")
  data$hour <- hour(ymd_hms(data$datetime))
  data$month <- month(ymd_hms(data$datetime))
  data$day <- wday(ymd_hms(data$datetime), label = TRUE)
  data$weekend <- 0
 data$weekend[data$day == "Sat" | data$day == "Sun"] <- 1
 #converting to factor variables
 data$season <- as.factor(data$season)
 data$holiday <- as.factor(data$holiday)
 data$workingday <- as.factor(data$workingday)
 data$weather <- as.factor(data$weather)
 data$hour <- as.factor(data$hour)
 data$month <- as.factor(data$month)
  return(data[,features])
}

fit <- randomForest(featuring(train),train$count,ntree = 100, importance = TRUE)
varImpPlot(fit)

imp <- importance(fit)
imp
##             %IncMSE IncNodePurity
## season     13.93597     9190478.2
## holiday    12.57817      966539.8
## workingday 22.64050    10040833.8
## weather    21.07409     7065125.7
## temp       16.32514    28355764.4
## atemp      18.39338    34270565.2
## humidity   27.51423    30606103.0
## windspeed  17.54768    11288265.4
## hour       80.39216   178409246.2
## month      19.85469    19118401.9
## weekend    21.94247     8893824.6
imp_features <- data.frame(features = row.names(imp),importance = imp[,1])
ggplot(imp_features,aes(x=reorder(features,importance),y = importance) ) + geom_bar(stat="identity") + coord_flip()  + ggtitle("Relative Importance of features") + ylab("Features") + xlab("Importance")

pred <- predict(fit,featuring(test))

test$count <- pred
final <- data.frame(test$datetime,test$count)
names(final) <- c("datetime","count")
write.csv(final,file = "submission.csv", row.names = FALSE)