Background:

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles.

Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.

Dataset:

Bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors. The core data set is related to the two-year historical log corresponding to years 2011 and 2012 from Washingtown D.C.Bikeshare system, USA which is publicly available through this link.

Abstract:

This article aims to present an initial analysis on the demand for rental bicycles . To do that, I made the following studies:

- Temporary visualization and Annual analysis
- Hourly averaged demand
- Prediction of demand according to Time Series Data
- Prediction of demand according to weather conditions

So the programming is going to be shown step by step in order to spread a new case of \(The\) \(Art\) \(of\) \(Programming\) \(with\) \(R\).

First, setting not to show messages or warnings:

knitr::opts_chunk$set(message= FALSE)
knitr::opts_chunk$set(warning= FALSE)

Downloading dataset:

setwd("~/Analisis de Datos/Practicas en R/Bicing sharing")
datos <- read.csv("hour.csv", header=TRUE, sep=",", quote= "\"'" , stringsAsFactors= FALSE, dec=".")
datos <- within(datos, rm(instant, yr,mnth,holiday,weekday,atemp,cnt))
conversion <- c(41,100,67)
for(n in 6:8){ datos[,n] <- round(datos[,n]*conversion[n-5]/max(datos[,n]-min(datos[,n])),0)}
datos$casual <- as.numeric(datos$casual)
datos$season <- as.numeric(datos$season)
datos$workingday <- as.numeric(datos$workingday)
datos$registered <- as.numeric(datos$registered)
datos$weathersit <- as.factor(datos$weathersit)

Calling the packages:

#I recommend downloading first plyr and then dplyr
library(plyr)
library(dplyr)
library(ggplot2)
library(DT)
library(extrafont)
library(TTR)
library(forecast)
library(party)
library(plotrix)

Showing Input values:

datatable(datos, class= "cell-border stripe")

Attribute:	Description:
dteday	Date: year-month-day
season	1:Springer; 2:Summer; 3:Fall; 4:Winter
hr	Hour:0 to 23
workingday	if day is neither weekend nor holiday is 1, otherwise is 0.
weathersit	1: Clear,.., Partly cloudy/ 2: Mist,..,Mist + Cloudy/ 3: Light Snow, rain,Scattered clouds/ 4: Heavy Rain,…, Snow + Fog
temp	Temperature in Celsius.The values are divided to 41 (max)
hum	Humidity. The values are divided to 100 (max)
windspreed	Normalized wind speed. The values are divided to 67 (max)
casual	Count of casual users
registered	Count of registered users

1. Temporary visualization and Annual analysis:

First visualization:

I am making a comparison in the evolution of hourly demand for casual, registered users and total demand.

datetime <- paste(datos$dteday,paste(if_else(datos$hr<10,paste("0",datos$hr,sep=""),
                                             as.character(datos$hr)),":00:00",sep="") )
datos <- cbind(datetime,datos)
datos$datetime <- as.POSIXct(datos$datetime, "%Y-%m-%d %H:%M:%S")
ggplot(datos, aes(datetime)) + geom_line(aes(y= casual + registered,colour= "total")) +
geom_line(aes(y= registered, colour= "registered")) + geom_line(aes(y= casual, colour= "casual")) +
scale_colour_manual("", values= c("casual"= "deepskyblue3","registered"= " darkorange","total"= "firebrick4"))+
ggtitle("Hourly demand") + scale_x_datetime(date_labels = "%b %Y") + theme_bw() +
theme( axis.line = element_line(size=1, colour= "black"),plot.title = element_text(hjust = 0.5), 
       panel.border = element_blank())+
theme(plot.title = element_text(family= "Comic Sans MS"), text= element_text(family = "Comic Sans MS")) +
theme(axis.text.x = element_text(colour="black",size= 10), axis.text.y = element_text(colour="black",size=10),       legend.key = element_rect(fill="white",colour="white")) +
theme(legend.position = "bottom",legend.direction = "horizontal",legend.text = element_text(colour = "black",        size= 12)) +
theme(panel.grid.major = element_line(colour= "gray80"), panel.grid.minor = element_line(colour= "gray80")) + ylab("Nº users")

There is a high demand in 2012
There is a peak demand between March and October a year

Then quantifyng the demand per year.

Annual analysis:

datos$año <- format(datos$datetime,"%Y")
Tusers2011 <- colSums(filter(datos[,-(1:2)], año == 2011)[,8:9])
Tusers2012 <- colSums(filter(datos[,-(1:2)], año == 2012)[,8:9])
Tanual <- data.frame(año = c(rep(2011,2),rep(2012,2) ),Count= c(Tusers2011,Tusers2012), 
                     User= c(rep(c("casual","registered"),2)) )
Panual <- data.frame(año = Tanual$año, label= Tanual$Count)
Panual$label <- c(sort(Panual$label[1:2],decreasing = TRUE),sort(Panual$label[3:4],decreasing = TRUE))
Panual <- ddply(Panual,.(año),transform,pos= cumsum(label)- 0.5*label)
Panual$label <- round(Panual$label*100/c(rep(sum(Tanual$Count[1:2]),2), rep(sum(Tanual$Count[3:4]),2)),0)
Panual$label <- paste(Panual$label,"%",sep="")
totales <- data.frame(año= c(2011,2012), label= c(sum(Tusers2011), sum(Tusers2012)), 
                      pos= c(sum(Tusers2011),sum(Tusers2012))+100000) 
Panual <- rbind(Panual,totales)
#Plotting:
colores <- c("deepskyblue3", "darkorange")
ggplot() + geom_bar(data=  Tanual, aes(x= año, y= Count, fill= User),stat= "Identity" )+
geom_text(data= Panual, aes(x= año,y= pos, label= label),size= 4.5,family= "Comic Sans MS")+
theme_bw() + scale_x_continuous(breaks = 2011:2012) + ggtitle("Users")+
scale_fill_manual(values= colores) + theme(legend.title = element_blank()) +
theme(plot.title = element_text(hjust = 0.5), axis.line = element_line(size= 1, colour= "black")) +
theme(panel.border = element_blank(), panel.grid.minor = element_blank()) +
theme(plot.title = element_text(family = "Comic Sans MS"), text = element_text(family= "Comic Sans MS")) +
theme(axis.text.x = element_text(colour="black",size=10),axis.text.y = element_text(colour="black", size=7)) +
theme(legend.text = element_text(colour= "black",size=11)) + xlab("year")

The demand rate between casual and registered users is 1 to 4
In 2012, there was an increasing of 50% in demand

2. Hourly averaged demand:

The goal is to show the demand on average per hour in a day of the year.To choose the year whose result is more similar to this year.So I take as a reference the year 2012, being that it is the closest year whose data are available.

Initial coding:

datos2012 <- filter(datos[,-1],año == "2012")
datos2012 <- within(datos2012, rm(dteday,weathersit,temp,hum,windspeed,año))
i <- 1
for(S in 1:4){ season <- filter(datos2012, season == S)
              for (H in 0:23) { hora <- filter(season, hr == H)
                                for(wk in 0:1){workingday <- filter(hora, workingday == wk)
                                if ( i ==1){ datosprom2012 <-  round(colMeans(workingday),0)
                                i <- 2}
                                else { datosprom2012 <- rbind(datosprom2012,round(colMeans(workingday),0))}
                                              }
                              } 
             }
datosprom2012 <- as.data.frame(datosprom2012)
rownames(datosprom2012) <- NULL 
datosprom2012$season <- as.factor(datosprom2012$season)

Plotting:

datosprom2012$workingday <- ifelse(datosprom2012$workingday == 0, "h","w")
names <- list( '1'="Spring",'2'="Summer",'3'="Fall",'4'="Winter",'w'="weekday",'h'="holiday")
labeller <- function(variable,value){ return(names[value])}
ggplot(datosprom2012, aes(hr)) + geom_line (aes(y= casual, colour="casual")) +
geom_line(aes(y= registered,colour="registered")) + geom_line(aes(y= casual + registered, colour= "total")) +
scale_colour_manual("",values= c("casual"= "deepskyblue3","registered"= "darkorange","total"= "firebrick4")) +
facet_grid(workingday ~ season, labeller = labeller, scale= "free_y" ) + theme_bw() +
theme(strip.background = element_rect(colour = "paleturquoise", fill = "paleturquoise")) +
theme(legend.position = "bottom",legend.direction = "horizontal",legend.title = element_blank()) +
theme(plot.title = element_text(hjust = 0.5), axis.line = element_line(size= 1, colour= "black")) + 
theme(plot.title = element_text(family = "Comic Sans MS"), text = element_text(family= "Comic Sans MS")) +
theme(axis.text.x = element_text(colour="black",size=10),axis.text.y = element_text(colour="black", size=10)) +
theme(legend.text = element_text(colour= "black",size=13)) + theme(panel.grid.minor = element_blank()) +
ggtitle("Average demand-2012") + ylab("count") + xlab("hours")

On holidays, the graphics of demand for users both casual and registered resemble the Gauss Bell, yet both functions have similar peak hour at midday.
On weekdays, the demand between both types of users varies drastically, the demand on casual users has a lower preponderance that shows an increasing and decreasing function with slopes nearly flat where the highest point comes to be at 15 hours. While the demand on registered users resembles two obtused triangles next to each other where the two highest points are registered at 9 and 18 hours.
There is a lower demand for bikes in Spring, especially on holiday.

3. Prediction of demand according to Time Series Data:

I made two analysis:

Time series: It comes to be a study of additive descomposition of total time data in function of seasonal, trend and random components.
Forecasting the demand: According to the total time data, I have estimated the monthly demand for next year.

Last, the demand for casual, registered users and total demand have been everyone ejecuted separately.

Previous coding:

datosmensual <- within(datos, rm(datetime,season,hr,workingday,weathersit,temp,hum,windspeed))
datosmensual$mes <- format(as.Date(datosmensual$dteday, "%Y-%m-%d"),"%m")
datosmensual$año <- as.numeric(datosmensual$año)
datosmensual$mes <- as.numeric(datosmensual$mes)
demandamensual <- matrix(c(rep(0,24*2)),24,2)
colnames(demandamensual) <- c("casual","registered")
i <- 1
for(y in 2011:2012){datoaño <- filter(datosmensual,año == y)
                    for(m in 1:12){datomes <- filter(datoaño, mes == m) 
                                   demandamensual[i,] <-  colSums(datomes[,2:3])
                                   i <- i +1
                                  }
                    }
demandamensual <- as.data.frame(demandamensual)
demandamensual$total <- demandamensual$casual + demandamensual$registered

Casual Users:

casual <- ts(demandamensual$casual,frequency = 12,start= c(2011,1))
casualcomponents <- decompose(casual)
plot(casualcomponents)

There is no randomness in demand
The trend is growing

casualforescast <- HoltWinters(casual)
casualforescast2 <- forecast(casualforescast, h=12)
plot(casualforescast2,main= "Forecast of casual demand from HoltWinters")

There will be a higher demand, where peak demand would be in the middle of the year

Registered Users:

registered <- ts(demandamensual$registered,frequency = 12,start= c(2011,1))
registeredcomponents <- decompose(registered)
plot(registeredcomponents)

There is randomnes in demand
The trend is growing

registeredforescast <- HoltWinters(registered)
registeredforescast2 <- forecast(registeredforescast, h=12)
plot(registeredforescast2,main= "Forecast of registered demand from HoltWinters")

Similar to the case of demand for casual users, there will be a higher demand where the peak demand would be mid-year

Total users:

total <- ts(demandamensual$total,frequency = 12,start= c(2011,1))
totalcomponents <- decompose(total)
plot(totalcomponents)

This shows identical characteristics respect to the demand for registered users

totalforescast <- HoltWinters(total)
totalforescast2 <- forecast(totalforescast, h=12)
plot(totalforescast2,main= "Forecast of total demand from HoltWinters")

It shows identical characteristics respect to the demand for registered users

4. Prediction of demand according to weather conditions:

Because of a growing in user demand of almost 50% in 2012 compared to 2011 and for a most reliable demand prediction design ,so it is preferible using data from last year. Also for practical purposes, I am only considering the total of users to avoid multiple decisions tree for each target.

Piechart of the demand on weekday and holiday:

D2012 <- filter(datos[,-1],año == "2012")
D2012 <- within(D2012, rm(dteday,año))
D2012$total <- D2012$casual+ D2012$registered
D2012 <- within(D2012, rm(casual,registered))
wd <- sum(filter(D2012, workingday == 1)$total)
nwd <- sum(filter(D2012, workingday == 0)$total)
pwd <- round(wd*100/(wd+ nwd),0)
pnwd <- 100 - pwd
slices <- c(pwd,pnwd)
dia <- c( "weekday:", "holiday:")  
lbls <- paste( dia,paste(slices,"%",sep=""))
pie3D(slices,labels=lbls, col = c( "blue", "green"),explode=0.0,theta=0.9, 
      mar=c(1,1,1,1), labelcex=1,main="Users 2012")

So there will apparently not be much difference in the complexity of decisions tree both on weekdays and on holidays regardless of the season.

Applying Decision Tree :

Weekdays:

First, let’s test the model for Winter season:

D20121 <- filter(D2012, season == 4 & workingday == 1 )
model1 <- ctree( total~., data= D20121)
plot(model1, main= "2012 Decision Tree of Winter's weekday")

Holiday:

Now, testing the model for Spring season :

D20122 <- filter(D2012, season == 1 & workingday == 0)
model2 <- ctree( total~., data= D20122)
plot(model2, main= "2012 Decision Tree of Spring's holiday")

There is no need to build more Decisions Tree.
The models are not wrong built, rather aport too much information that it is not useful from the point of view for business.

Study of demand for rental bikes

By Juan Pablo Delzo

March 1st, 2018