The problem is to forecast the bike rental demand. The train data set has hourly data for first 19 days of each month while we need to predict for 20th to the month end.
library(ggplot2)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.2.3
library(xtable)
setwd("D:/personal/R/R_pracs/bike sharing")
train <- read.csv("train.csv")
test <- read.csv("test.csv")
str(train)
## 'data.frame': 10886 obs. of 12 variables:
## $ datetime : Factor w/ 10886 levels "2011-01-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 9.84 9.02 9.02 9.84 9.84 ...
## $ atemp : num 14.4 13.6 13.6 14.4 14.4 ...
## $ humidity : int 81 80 80 75 75 75 80 86 75 76 ...
## $ windspeed : num 0 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ count : int 16 40 32 13 1 1 2 3 8 14 ...
sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0
The train dataset has 12 variables while the test data has 9 since the casual, registered and total count of the users are to be predicted. It can also be observed that there is no missing data.
As seen above, there are 12 variables in the train dataset and can be categorized in to dependent and independent variables. Dependent variables are the 3 user count variables while the other 9 are independent variables.
We can understand the frequency of each variable by plotting histograms.
hist(train$season,main="Histogram for Season",
xlab="Season", col = "grey")
hist(train$holiday, main="Histogram for Holiday",
xlab="holiday", col = "grey")
hist(train$workingday, main="Histogram for Working Days",
xlab="working day", col = "grey")
hist(train$weather, main="Histogram for Weather",
xlab="weather", col = "grey")
hist(train$temp, main="Histogram for Temperature",
xlab="temperature", col = "grey")
hist(train$windspeed, main="Histogram for Wind Speed",
xlab="wind speed", col = "grey")
Most of the frequencies are as expected. Seanson frequency is almost the same. The number of holidays are less and working days are more. Most of the days have clear weather.
train$hour <- hour(ymd_hms(train$datetime))
ggplot(train,aes(factor(hour),count)) + geom_boxplot()
We can observe that the peak hours are in between 7 to 9 & 16 to 19 hours. Now we can analyse the hourly trend with respect to the weekday
train$day <- wday(ymd_hms(train$datetime), label = TRUE)
ggplot(train, aes(hour, count,color = day)) + geom_smooth(ce=FALSE, fill=NA) + ggtitle("Hourly Trend of bikes rented on weekdays and weekends") + scale_x_continuous( breaks = seq(0,24,by=2))
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
We can observe that the count peaks in mornings (7 to 9) and evenings (16 to 20) for weekdays while for weekends the count is more during the day.
We can check the impact of weather and season on the total bikes rented.
ggplot(train, aes(factor(weather),count)) + geom_boxplot()
ggplot(train, aes(factor(season), count)) + geom_boxplot()
As expected, the number of bikes rented is more on days with clear weather or few clouds. Also, the number of bikes rented is more in summer and fall season.
We can correlate the temperature, weather, humidity, windspeed.
subtrain <- data.frame(train$count,train$temp,train$atemp,train$humidity,train$windspeed)
cor(subtrain)
## train.count train.temp train.atemp train.humidity
## train.count 1.0000000 0.39445364 0.38978444 -0.31737148
## train.temp 0.3944536 1.00000000 0.98494811 -0.06494877
## train.atemp 0.3897844 0.98494811 1.00000000 -0.04353571
## train.humidity -0.3173715 -0.06494877 -0.04353571 1.00000000
## train.windspeed 0.1013695 -0.01785201 -0.05747300 -0.31860699
## train.windspeed
## train.count 0.10136947
## train.temp -0.01785201
## train.atemp -0.05747300
## train.humidity -0.31860699
## train.windspeed 1.00000000
It can be observed that temperature and windspeed are positively correlated with the total bikes rented.
featuring <- function(data){
features <- c("season","holiday","workingday","weather","temp","atemp","humidity","windspeed","hour","month","weekend")
data$hour <- hour(ymd_hms(data$datetime))
data$month <- month(ymd_hms(data$datetime))
data$day <- wday(ymd_hms(data$datetime), label = TRUE)
data$weekend <- 0
data$weekend[data$day == "Sat" | data$day == "Sun"] <- 1
#converting to factor variables
data$season <- as.factor(data$season)
data$holiday <- as.factor(data$holiday)
data$workingday <- as.factor(data$workingday)
data$weather <- as.factor(data$weather)
data$hour <- as.factor(data$hour)
data$month <- as.factor(data$month)
return(data[,features])
}
fit <- randomForest(featuring(train),train$count,ntree = 100, importance = TRUE)
varImpPlot(fit)
imp <- importance(fit)
imp
## %IncMSE IncNodePurity
## season 13.93597 9190478.2
## holiday 12.57817 966539.8
## workingday 22.64050 10040833.8
## weather 21.07409 7065125.7
## temp 16.32514 28355764.4
## atemp 18.39338 34270565.2
## humidity 27.51423 30606103.0
## windspeed 17.54768 11288265.4
## hour 80.39216 178409246.2
## month 19.85469 19118401.9
## weekend 21.94247 8893824.6
imp_features <- data.frame(features = row.names(imp),importance = imp[,1])
ggplot(imp_features,aes(x=reorder(features,importance),y = importance) ) + geom_bar(stat="identity") + coord_flip() + ggtitle("Relative Importance of features") + ylab("Features") + xlab("Importance")
pred <- predict(fit,featuring(test))
test$count <- pred
final <- data.frame(test$datetime,test$count)
names(final) <- c("datetime","count")
write.csv(final,file = "submission.csv", row.names = FALSE)