NIE Institute of Technology
Introduction
Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able rent a bike from a one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
Data
dteday - Date of the day
season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
yr - ‘0’-2011;‘1’-2012
mnth - Months in numerals(1-12)
hr - Hour of the day
holiday - whether the day is considered a holiday
weekday - Weekday in numerals, 0 being Sunday and 6 being Saturday
workingday - whether the day is neither a weekend nor holiday
weathersit - 1: Clear, Few clouds, Partly cloudy, Partly cloudy 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp - temperature in Celsius
atemp - “feels like” temperature in Celsius
hum - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
cnt - number of total rentals
Reading the data files
my_data<-read.csv(paste("hour.csv",sep=""))
Splitting the data into train and test
train <- my_data[1:10887,]
test <- my_data[10888:17379,]
Viewing and binding the data
View(train)
View(test)
data=rbind(train,test)
Analyzing the combined data
str(data)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
summary(data)
## instant dteday season yr
## Min. : 1 2011-01-01: 24 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 2011-01-08: 24 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 2011-01-09: 24 Median :3.000 Median :1.0000
## Mean : 8690 2011-01-10: 24 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 2011-01-13: 24 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 2011-01-15: 24 Max. :4.000 Max. :1.0000
## (Other) :17235
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
##
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
##
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
##
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
##
Factoring some variables from numeric
data$weathersit=as.factor(data$weathersit)
data$holiday=as.factor(data$holiday)
data$workingday=as.factor(data$workingday)
data$hr=as.factor(data$hr)
data$yr=as.factor(data$yr)
data$hr=as.integer(data$hr)
Splitting into train and test
train <- data[1:10887,]
test <- data[10888:17379,]
Using decision trees for binning some variables
library(rpart)
library(rattle) #these libraries are used to get a good visual plot for the decision tree model.
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
library(RColorBrewer)
d=rpart(registered~hr,data=train)
fancyRpartPlot(d)

d=rpart(casual~hr,data=train)
fancyRpartPlot(d)

Binding the data
data=rbind(train,test)
Feature Engineering
data$dp_reg=0
data$dp_reg[data$hr<8]=1
data$dp_reg[data$hr>=22]=2
data$dp_reg[data$hr>9 & data$hr<18]=3
data$dp_reg[data$hr==8]=4
data$dp_reg[data$hr==9]=5
data$dp_reg[data$hr==20 | data$hr==21]=6
data$dp_reg[data$hr==19 | data$hr==18]=7
###train$dp_reg=0
###train$dp_reg[train$hr<8]=1
###train$dp_reg[train$hr>=22]=2
###train$dp_reg[train$hr>9 & train$hr<18]=3
###train$dp_reg[train$hr==8]=4
###train$dp_reg[train$hr==9]=5
###train$dp_reg[train$hr==20 | train$hr==21]=6
###train$dp_reg[train$hr==19 | train$hr==18]=7
###View(train)
data$dp_cas=0
data$dp_cas[data$hr<=8]=1
data$dp_cas[data$hr==9]=2
data$dp_cas[data$hr>=10 & data$hr<=19]=3
data$dp_cas[data$hr>19]=4
#View(data)
#test$dp_cas=0
#test$dp_cas[test$hr<=8]=1
#test$dp_cas[test$hr==9]=2
#test$dp_cas[test$hr>=10 & test$hr<=19]=3
#test$dp_cas[test$hr>19]=4
train <- data[1:10887,]
test <- data[10888:17379,]
f=rpart(registered~temp,data=train)
fancyRpartPlot(f)

f=rpart(casual~temp,data=train)
fancyRpartPlot(f)

data$temp_reg=0
data$temp_reg[data$temp<13]=1
data$temp_reg[data$temp>=13 & data$temp<23]=2
data$temp_reg[data$temp>=23 & data$temp<30]=3
data$temp_reg[data$temp>=30]=4
#train$temp_reg=0
#train$temp_reg[train$temp<13]=1
#train$temp_reg[train$temp>=13 & train$temp<23]=2
#train$temp_reg[train$temp>=23 & train$temp<30]=3
#train$temp_reg[train$temp>=30]=4
data$temp_cas=0
data$temp_cas[data$temp<15]=1
data$temp_cas[data$temp>=15 & data$temp<23]=2
data$temp_cas[data$temp>=23 & data$temp<30]=3
data$temp_cas[data$temp>=30]=4
#test$temp_cas=0
#test$temp_cas[test$temp<15]=1
#test$temp_cas[test$temp>=15 & test$temp<23]=2
#test$temp_cas[test$temp>=23 & test$temp<30]=3
#test$temp_cas[test$temp>=30]=4
data$year_part[data$yr=='0']=1
data$year_part[data$yr=='0' & data$mnth>3]=2
data$year_part[data$yr=='0' & data$mnth>6]=3
data$year_part[data$yr=='0' & data$mnth>9]=4
data$year_part[data$yr=='1']=5
data$year_part[data$yr=='1' & data$mnth>3]=6
data$year_part[data$yr=='1' & data$mnth>6]=7
data$year_part[data$yr=='1' & data$mnth>9]=8
table(data$year_part)
##
## 1 2 3 4 5 6 7 8
## 2067 2183 2192 2203 2176 2182 2208 2168
Creating another variable day_type which may affect our accuracy as weekends and weekdays are important in deciding rentals
data$day_type=0
data$day_type[data$holiday==0 & data$workingday==0]="weekend"
data$day_type[data$holiday==1]="holiday"
data$day_type[data$holiday==0 & data$workingday==1]="working day"
train <- data[1:10887,]
test <- data[10888:17379,]
plot(train$temp,train$cnt)

data=rbind(train,test)
#View(data)
data$weekend=0
data$weekend[data$workingday=="0" | data$day=="6"]=1
#View(data)
str(data)
## 'data.frame': 17379 obs. of 24 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 1 2 3 4 5 6 7 8 9 10 ...
## $ holiday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weathersit: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
## $ dp_reg : num 1 1 1 1 1 1 1 4 5 3 ...
## $ dp_cas : num 1 1 1 1 1 1 1 1 2 3 ...
## $ temp_reg : num 1 1 1 1 1 1 1 1 1 1 ...
## $ temp_cas : num 1 1 1 1 1 1 1 1 1 1 ...
## $ year_part : num 1 1 1 1 1 1 1 1 1 1 ...
## $ day_type : chr "weekend" "weekend" "weekend" "weekend" ...
## $ weekend : num 1 1 1 1 1 1 1 1 1 1 ...
Converting all relevant categorical variables into factors to feed to our random forest model
data$season=as.factor(data$season)
data$holiday=as.factor(data$holiday)
data$workingday=as.factor(data$workingday)
data$weathersit=as.factor(data$weathersit)
data$hr=as.factor(data$hr)
data$mnth=as.factor(data$mnth)
data$dp_cas=as.factor(data$dp_cas)
data$dp_reg=as.factor(data$dp_reg)
data$day_type=as.factor(data$day_type)
data$weekday=as.factor(data$weekday)
data$temp_cas=as.factor(data$temp_cas)
data$temp_reg=as.factor(data$temp_reg)
train <- data[1:10887,]
test <- data[10888:17379,]
Final model building using random forest
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
set.seed(415)
fit1 <- randomForest(logreg ~ hr +workingday+weekday+holiday+ day_type +temp_reg+hum+atemp+windspeed+season+weathersit+dp_reg+weekend+yr+year_part, data=train,importance=TRUE, ntree=250)
pred1=predict(fit1,test)
test$logreg=pred1
set.seed(415)
fit2 <- randomForest(logcas ~hr + day_type+weekday+hum+atemp+temp_cas+windspeed+season+weathersit+holiday+workingday+dp_cas+weekend+yr+year_part, data=train,importance=TRUE, ntree=250)
pred2=predict(fit2,test)
test$logcas=pred2
Creating the final submission file
( I have created new variable count,reg_predict,cas_predict in test dataframe to analyze the predicted value and actual value using t-test)
test$reg_predict=exp(test$logreg)-1
test$cas_predict=exp(test$logcas)-1
test$count=test$cas_predict+test$reg_predict
View(test)
Using t-test to analyze our predicted model
t.test(test$cnt,test$count,var.equal=TRUE, paired=FALSE)
##
## Two Sample t-test
##
## data: test$cnt and test$count
## t = 14.029, df = 12982, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 41.85533 55.45098
## sample estimates:
## mean of x mean of y
## 256.8527 208.1996
t.test(test$registered,test$reg_predict,var.equal=TRUE, paired=FALSE)
##
## Two Sample t-test
##
## data: test$registered and test$reg_predict
## t = 15.368, df = 12982, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 38.65829 49.96187
## sample estimates:
## mean of x mean of y
## 207.7497 163.4396
t.test(test$casual,test$cas_predict,var.equal=TRUE, paired=FALSE)
##
## Two Sample t-test
##
## data: test$casual and test$cas_predict
## t = 4.47, df = 12982, p-value = 7.887e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.438595 6.247553
## sample estimates:
## mean of x mean of y
## 49.10305 44.75998
### Using Root Mean Square Error to analyze the predicted and actual values
#rmse<-function(error)
#{
# sqrt(mean(error^2))
#}
#error<-test$cnt-test$count
#rmse(error)
Summary
Mean of actual count(cnt) is 256.8527 and mean of predicted count(count) value is 208.1996
Mean of actual registered value is 207.7479 and mean of predicted registered value is 163.4396
Mean of actual casual value is 49.10305 and mean of predicted casual value is 44.75998
The difference between predicted and actual value can be due to many reasons like overfitting. Overfitting can be addressed by slicing the training data before feeding it to the prediction model.