NIE Institute of Technology

Introduction

Bike sharing systems are a means of renting bicycles in which the process of obtaining membership, renting, and returning a bike is automated via a network of kiosk locations throughout a city. Using these systems, people can rent a bike from one location and return it to a different one on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive to researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network that can be used to study mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.

Data

For this study, I have collected the dataset from https://www.kaggle.com/datasets. Each row of hour.csv describes one hour of rentals and contains the following fields:

dteday - Date of the day

season - 1 = spring, 2 = summer, 3 = fall, 4 = winter

yr - year (0 = 2011, 1 = 2012)

mnth - month in numerals (1-12)

hr - hour of the day (0-23)

holiday - whether the day is considered a holiday

weekday - Weekday in numerals, 0 being Sunday and 6 being Saturday

workingday - whether the day is neither a weekend nor a holiday

weathersit - 1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

temp - temperature in Celsius, normalized to [0, 1] (see the rescaling sketch after this list)

atemp - “feels like” temperature in Celsius, normalized to [0, 1]

hum - relative humidity, normalized to [0, 1]

windspeed - wind speed, normalized to [0, 1]

casual - number of non-registered user rentals initiated

registered - number of registered user rentals initiated

cnt - number of total rentals
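
The summary output further below confirms that temp, atemp, hum, and windspeed are all stored on a 0-1 scale. If values in their original units are needed for interpretation, they can be rescaled; the helper below is a hypothetical sketch that assumes the normalization constants documented for the UCI Bike Sharing Dataset (41, 50, 100, and 67), which are not verified from this file.

# Hypothetical helper: map the normalized weather columns back to their
# original units, assuming the UCI Bike Sharing Dataset constants.
denormalize_weather <- function(df) {
  df$temp_c   <- df$temp * 41        # temperature in degrees Celsius
  df$atemp_c  <- df$atemp * 50       # "feels like" temperature in Celsius
  df$hum_pct  <- df$hum * 100        # relative humidity in percent
  df$wind_raw <- df$windspeed * 67   # wind speed in its original units
  df
}
# Example (after hour.csv has been read below):
# summary(denormalize_weather(my_data)[, c("temp_c", "atemp_c", "hum_pct", "wind_raw")])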

Reading the data files

my_data <- read.csv("hour.csv")

Splitting the data into train and test

train <- my_data[1:10887,]
test <- my_data[10888:17379,]
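
As a quick sanity check (not part of the original script), the date ranges of the two pieces can be compared to confirm that the 10,887/6,492 row split is chronological; this assumes the rows of hour.csv are already ordered by date and hour, as the instant column suggests.

# Date range covered by each piece of the split
range(as.character(train$dteday))
range(as.character(test$dteday))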

Viewing and binding the data

View(train)
View(test)
data=rbind(train,test)

Analyzing the combined data

str(data)
## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
summary(data)
##     instant             dteday          season            yr        
##  Min.   :    1   2011-01-01:   24   Min.   :1.000   Min.   :0.0000  
##  1st Qu.: 4346   2011-01-08:   24   1st Qu.:2.000   1st Qu.:0.0000  
##  Median : 8690   2011-01-09:   24   Median :3.000   Median :1.0000  
##  Mean   : 8690   2011-01-10:   24   Mean   :2.502   Mean   :0.5026  
##  3rd Qu.:13034   2011-01-13:   24   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :17379   2011-01-15:   24   Max.   :4.000   Max.   :1.0000  
##                  (Other)   :17235                                   
##       mnth              hr           holiday           weekday     
##  Min.   : 1.000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
##  1st Qu.: 4.000   1st Qu.: 6.00   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 7.000   Median :12.00   Median :0.00000   Median :3.000  
##  Mean   : 6.538   Mean   :11.55   Mean   :0.02877   Mean   :3.004  
##  3rd Qu.:10.000   3rd Qu.:18.00   3rd Qu.:0.00000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :23.00   Max.   :1.00000   Max.   :6.000  
##                                                                    
##    workingday       weathersit         temp           atemp       
##  Min.   :0.0000   Min.   :1.000   Min.   :0.020   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.340   1st Qu.:0.3333  
##  Median :1.0000   Median :1.000   Median :0.500   Median :0.4848  
##  Mean   :0.6827   Mean   :1.425   Mean   :0.497   Mean   :0.4758  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.660   3rd Qu.:0.6212  
##  Max.   :1.0000   Max.   :4.000   Max.   :1.000   Max.   :1.0000  
##                                                                   
##       hum           windspeed          casual         registered   
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Min.   :  0.0  
##  1st Qu.:0.4800   1st Qu.:0.1045   1st Qu.:  4.00   1st Qu.: 34.0  
##  Median :0.6300   Median :0.1940   Median : 17.00   Median :115.0  
##  Mean   :0.6272   Mean   :0.1901   Mean   : 35.68   Mean   :153.8  
##  3rd Qu.:0.7800   3rd Qu.:0.2537   3rd Qu.: 48.00   3rd Qu.:220.0  
##  Max.   :1.0000   Max.   :0.8507   Max.   :367.00   Max.   :886.0  
##                                                                    
##       cnt       
##  Min.   :  1.0  
##  1st Qu.: 40.0  
##  Median :142.0  
##  Mean   :189.5  
##  3rd Qu.:281.0  
##  Max.   :977.0  
## 

Converting some variables from numeric to factor

data$weathersit=as.factor(data$weathersit)
data$holiday=as.factor(data$holiday)
data$workingday=as.factor(data$workingday)
data$hr=as.factor(data$hr)
data$yr=as.factor(data$yr)

# Converting the hr factor back to integer recodes the hours from 0-23 to the
# factor level indices 1-24; the hour-based bins defined later use this 1-24 scale.
data$hr=as.integer(data$hr)
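
For reference, if the original 0-23 hour values were ever needed back, the usual idiom is to convert the factor through as.character() first; this is shown only as an aside and is not used in the pipeline below, which relies on the 1-24 coding.

x <- as.factor(c(0, 5, 23))
as.integer(x)                 # level indices: 1 2 3
as.integer(as.character(x))   # original values: 0 5 23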

Splitting into train and test

train <- data[1:10887,]
test <- data[10888:17379,]

Using decision trees for binning some variables

library(rpart)
library(rattle) # rattle, together with rpart.plot and RColorBrewer below, provides fancyRpartPlot() for a readable plot of the decision tree
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
library(RColorBrewer)

d=rpart(registered~hr,data=train)
fancyRpartPlot(d)

d=rpart(casual~hr,data=train)
fancyRpartPlot(d)
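
The hour bins used in the feature engineering below are read off these fitted trees. Besides the plots, the split points can also be inspected in text form; this is a convenience check and not part of the original script.

print(d)     # text view of the fitted tree (here casual ~ hr)
d$splits     # matrix of split variables and their threshold values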

Binding the data

data=rbind(train,test)

Feature Engineering

# Hour-of-day bins for registered users (hr is on the 1-24 scale here)
data$dp_reg=0
data$dp_reg[data$hr<8]=1
data$dp_reg[data$hr>=22]=2
data$dp_reg[data$hr>9 & data$hr<18]=3
data$dp_reg[data$hr==8]=4
data$dp_reg[data$hr==9]=5
data$dp_reg[data$hr==20 | data$hr==21]=6
data$dp_reg[data$hr==19 | data$hr==18]=7

# Hour-of-day bins for casual users (again on the 1-24 scale)
data$dp_cas=0
data$dp_cas[data$hr<=8]=1
data$dp_cas[data$hr==9]=2
data$dp_cas[data$hr>=10 & data$hr<=19]=3
data$dp_cas[data$hr>19]=4
#View(data)
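
A quick way to confirm that every hour falls into exactly one bin is a cross-tabulation of the hour against each new variable; the output is omitted here.

table(data$hr, data$dp_reg)
table(data$hr, data$dp_cas)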

train <- data[1:10887,]
test <- data[10888:17379,]

f=rpart(registered~temp,data=train)
fancyRpartPlot(f)

f=rpart(casual~temp,data=train)
fancyRpartPlot(f)

# Temperature bins for registered users
data$temp_reg=0
data$temp_reg[data$temp<13]=1
data$temp_reg[data$temp>=13 & data$temp<23]=2
data$temp_reg[data$temp>=23 & data$temp<30]=3
data$temp_reg[data$temp>=30]=4

# Temperature bins for casual users
data$temp_cas=0
data$temp_cas[data$temp<15]=1
data$temp_cas[data$temp>=15 & data$temp<23]=2
data$temp_cas[data$temp>=23 & data$temp<30]=3
data$temp_cas[data$temp>=30]=4
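
As with the hour bins, the resulting temperature bins can be tabulated as a quick check of how the thresholds behave on this (normalized) temp column; the output is not shown here.

table(data$temp_reg)
table(data$temp_cas)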

# year_part: quarter of the two-year period (1-4 within 2011, 5-8 within 2012)
data$year_part[data$yr=='0']=1
data$year_part[data$yr=='0' & data$mnth>3]=2
data$year_part[data$yr=='0' & data$mnth>6]=3
data$year_part[data$yr=='0' & data$mnth>9]=4
data$year_part[data$yr=='1']=5
data$year_part[data$yr=='1' & data$mnth>3]=6
data$year_part[data$yr=='1' & data$mnth>6]=7
data$year_part[data$yr=='1' & data$mnth>9]=8
table(data$year_part)
## 
##    1    2    3    4    5    6    7    8 
## 2067 2183 2192 2203 2176 2182 2208 2168

Creating another variable, day_type, which may improve accuracy since weekends, holidays, and working days show different rental patterns

data$day_type=0

data$day_type[data$holiday==0 & data$workingday==0]="weekend"
data$day_type[data$holiday==1]="holiday"
data$day_type[data$holiday==0 & data$workingday==1]="working day"

train <- data[1:10887,]
test <- data[10888:17379,]

plot(train$temp,train$cnt)

data=rbind(train,test)
#View(data)

data$weekend=0
data$weekend[data$workingday=="0" | data$weekday=="6"]=1   # flag non-working days and Saturdays
#View(data)
str(data)
## 'data.frame':    17379 obs. of  24 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ holiday   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weathersit: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
##  $ dp_reg    : num  1 1 1 1 1 1 1 4 5 3 ...
##  $ dp_cas    : num  1 1 1 1 1 1 1 1 2 3 ...
##  $ temp_reg  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ temp_cas  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ year_part : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day_type  : chr  "weekend" "weekend" "weekend" "weekend" ...
##  $ weekend   : num  1 1 1 1 1 1 1 1 1 1 ...

Converting all relevant categorical variables into factors to feed to our random forest model

data$season=as.factor(data$season)
data$holiday=as.factor(data$holiday)
data$workingday=as.factor(data$workingday)
data$weathersit=as.factor(data$weathersit)
data$hr=as.factor(data$hr)
data$mnth=as.factor(data$mnth)
data$dp_cas=as.factor(data$dp_cas)
data$dp_reg=as.factor(data$dp_reg)
data$day_type=as.factor(data$day_type)
data$weekday=as.factor(data$weekday)
data$temp_cas=as.factor(data$temp_cas)
data$temp_reg=as.factor(data$temp_reg)

train <- data[1:10887,]
test <- data[10888:17379,]

Log-transforming the skewed count variables; the skew can be seen in their distributions (see the histogram sketch after the code below)

train$reg1=train$registered+1   # add 1 so that zero counts can be log-transformed
train$cas1=train$casual+1
train$logcas=log(train$cas1)
train$logreg=log(train$reg1)
test$logreg=0
test$logcas=0
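
A minimal sketch of the skew that motivates the transformation; these plots are an addition and are not reproduced in the original output.

par(mfrow=c(1,2))
hist(train$registered, main="registered (raw)", xlab="rentals per hour")
hist(train$logreg, main="registered (log)", xlab="log(registered + 1)")
par(mfrow=c(1,1))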

boxplot(train$logreg~train$weathersit, xlab="weathersit", ylab="log(registered + 1)")

boxplot(train$logreg~train$season, xlab="season", ylab="log(registered + 1)")

Final model building using random forest

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
set.seed(415)
fit1 <- randomForest(logreg ~ hr + workingday + weekday + holiday + day_type + temp_reg + hum + atemp + windspeed + season + weathersit + dp_reg + weekend + yr + year_part,
                     data=train, importance=TRUE, ntree=250)

pred1=predict(fit1,test)
test$logreg=pred1

set.seed(415)
fit2 <- randomForest(logcas ~ hr + day_type + weekday + hum + atemp + temp_cas + windspeed + season + weathersit + holiday + workingday + dp_cas + weekend + yr + year_part,
                     data=train, importance=TRUE, ntree=250)

pred2=predict(fit2,test)
test$logcas=pred2
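
Because importance=TRUE was set on both fits, the variable importance measures can be inspected directly; this is a supplementary check whose output is not part of the original report.

varImpPlot(fit1, main="Importance for the registered-user model")
varImpPlot(fit2, main="Importance for the casual-user model")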

Creating the final submission file

(New variables count, reg_predict, and cas_predict are created in the test data frame so that the predicted values can be compared with the actual values using a t-test.)

test$reg_predict=exp(test$logreg)-1   # back-transform from the log scale
test$cas_predict=exp(test$logcas)-1
test$count=test$cas_predict+test$reg_predict
View(test)

Using a t-test to compare the predicted and actual values

t.test(test$cnt,test$count,var.equal=TRUE, paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  test$cnt and test$count
## t = 14.029, df = 12982, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  41.85533 55.45098
## sample estimates:
## mean of x mean of y 
##  256.8527  208.1996
t.test(test$registered,test$reg_predict,var.equal=TRUE, paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  test$registered and test$reg_predict
## t = 15.368, df = 12982, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  38.65829 49.96187
## sample estimates:
## mean of x mean of y 
##  207.7497  163.4396
t.test(test$casual,test$cas_predict,var.equal=TRUE, paired=FALSE)
## 
##  Two Sample t-test
## 
## data:  test$casual and test$cas_predict
## t = 4.47, df = 12982, p-value = 7.887e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.438595 6.247553
## sample estimates:
## mean of x mean of y 
##  49.10305  44.75998

Root mean squared error (RMSE) can also be used to compare the predicted and actual counts; it is left here as an optional check.

#rmse <- function(error) {
#  sqrt(mean(error^2))
#}
#error <- test$cnt - test$count
#rmse(error)
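
Since the related Kaggle bike sharing competition is scored on root mean squared logarithmic error (RMSLE), a sketch of that metric is included as well; it is an addition to the original analysis and uses the same test columns as above.

# RMSLE between actual and predicted hourly counts (added check)
rmsle <- function(actual, predicted) {
  sqrt(mean((log(actual + 1) - log(predicted + 1))^2))
}
rmsle(test$cnt, test$count)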

Summary

The mean of the actual count (cnt) is 256.8527, and the mean of the predicted count is 208.1996.

The mean of the actual registered value is 207.7497, and the mean of the predicted registered value is 163.4396.

The mean of the actual casual value is 49.10305, and the mean of the predicted casual value is 44.75998.

The difference between the predicted and actual values can have many causes, such as overfitting. Overfitting can be addressed by slicing the training data, for example with cross-validation, before settling on the final model (a sketch follows below).
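
A minimal sketch of such slicing, using simple 5-fold cross-validation on the training rows for the registered-user model. The fold count, the reduced ntree, and the use of the rmsle() helper defined above are illustrative choices, not part of the original pipeline, and a time-aware split would arguably suit this hourly data better.

# 5-fold cross-validation sketch for the registered-user model
# (assumes train, randomForest, and rmsle() from above are available)
set.seed(415)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))
cv_scores <- numeric(k)
for (i in 1:k) {
  cv_train <- train[folds != i, ]
  cv_valid <- train[folds == i, ]
  cv_fit <- randomForest(logreg ~ hr + workingday + weekday + holiday + day_type +
                           temp_reg + hum + atemp + windspeed + season + weathersit +
                           dp_reg + weekend + yr + year_part,
                         data = cv_train, ntree = 100)
  cv_pred <- exp(predict(cv_fit, cv_valid)) - 1            # back to the count scale
  cv_scores[i] <- rmsle(cv_valid$registered, cv_pred)
}
cv_scores
mean(cv_scores)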