Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles.
Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.
Bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors. The core data set is related to the two-year historical log corresponding to years 2011 and 2012 from Washingtown D.C.Bikeshare system, USA which is publicly available through this link.
This article aims to present an initial analysis on the demand for rental bicycles . To do that, I made the following studies:
- Temporary visualization and Annual analysis
- Hourly averaged demand
- Prediction of demand according to Time Series Data
- Prediction of demand according to weather conditions
So the programming is going to be shown step by step in order to spread a new case of \(The\) \(Art\) \(of\) \(Programming\) \(with\) \(R\).
First, setting not to show messages or warnings:
knitr::opts_chunk$set(message= FALSE)
knitr::opts_chunk$set(warning= FALSE)
setwd("~/Analisis de Datos/Practicas en R/Bicing sharing")
datos <- read.csv("hour.csv", header=TRUE, sep=",", quote= "\"'" , stringsAsFactors= FALSE, dec=".")
datos <- within(datos, rm(instant, yr,mnth,holiday,weekday,atemp,cnt))
conversion <- c(41,100,67)
for(n in 6:8){ datos[,n] <- round(datos[,n]*conversion[n-5]/max(datos[,n]-min(datos[,n])),0)}
datos$casual <- as.numeric(datos$casual)
datos$season <- as.numeric(datos$season)
datos$workingday <- as.numeric(datos$workingday)
datos$registered <- as.numeric(datos$registered)
datos$weathersit <- as.factor(datos$weathersit)
#I recommend downloading first plyr and then dplyr
library(plyr)
library(dplyr)
library(ggplot2)
library(DT)
library(extrafont)
library(TTR)
library(forecast)
library(party)
library(plotrix)
datatable(datos, class= "cell-border stripe")
| Attribute: | Description: |
|---|---|
| dteday | Date: year-month-day |
| season | 1:Springer; 2:Summer; 3:Fall; 4:Winter |
| hr | Hour:0 to 23 |
| workingday | if day is neither weekend nor holiday is 1, otherwise is 0. |
| weathersit | 1: Clear,.., Partly cloudy/ 2: Mist,..,Mist + Cloudy/ 3: Light Snow, rain,Scattered clouds/ 4: Heavy Rain,…, Snow + Fog |
| temp | Temperature in Celsius.The values are divided to 41 (max) |
| hum | Humidity. The values are divided to 100 (max) |
| windspreed | Normalized wind speed. The values are divided to 67 (max) |
| casual | Count of casual users |
| registered | Count of registered users |
I am making a comparison in the evolution of hourly demand for casual, registered users and total demand.
datetime <- paste(datos$dteday,paste(if_else(datos$hr<10,paste("0",datos$hr,sep=""),
as.character(datos$hr)),":00:00",sep="") )
datos <- cbind(datetime,datos)
datos$datetime <- as.POSIXct(datos$datetime, "%Y-%m-%d %H:%M:%S")
ggplot(datos, aes(datetime)) + geom_line(aes(y= casual + registered,colour= "total")) +
geom_line(aes(y= registered, colour= "registered")) + geom_line(aes(y= casual, colour= "casual")) +
scale_colour_manual("", values= c("casual"= "deepskyblue3","registered"= " darkorange","total"= "firebrick4"))+
ggtitle("Hourly demand") + scale_x_datetime(date_labels = "%b %Y") + theme_bw() +
theme( axis.line = element_line(size=1, colour= "black"),plot.title = element_text(hjust = 0.5),
panel.border = element_blank())+
theme(plot.title = element_text(family= "Comic Sans MS"), text= element_text(family = "Comic Sans MS")) +
theme(axis.text.x = element_text(colour="black",size= 10), axis.text.y = element_text(colour="black",size=10), legend.key = element_rect(fill="white",colour="white")) +
theme(legend.position = "bottom",legend.direction = "horizontal",legend.text = element_text(colour = "black", size= 12)) +
theme(panel.grid.major = element_line(colour= "gray80"), panel.grid.minor = element_line(colour= "gray80")) + ylab("Nº users")
Then quantifyng the demand per year.
datos$año <- format(datos$datetime,"%Y")
Tusers2011 <- colSums(filter(datos[,-(1:2)], año == 2011)[,8:9])
Tusers2012 <- colSums(filter(datos[,-(1:2)], año == 2012)[,8:9])
Tanual <- data.frame(año = c(rep(2011,2),rep(2012,2) ),Count= c(Tusers2011,Tusers2012),
User= c(rep(c("casual","registered"),2)) )
Panual <- data.frame(año = Tanual$año, label= Tanual$Count)
Panual$label <- c(sort(Panual$label[1:2],decreasing = TRUE),sort(Panual$label[3:4],decreasing = TRUE))
Panual <- ddply(Panual,.(año),transform,pos= cumsum(label)- 0.5*label)
Panual$label <- round(Panual$label*100/c(rep(sum(Tanual$Count[1:2]),2), rep(sum(Tanual$Count[3:4]),2)),0)
Panual$label <- paste(Panual$label,"%",sep="")
totales <- data.frame(año= c(2011,2012), label= c(sum(Tusers2011), sum(Tusers2012)),
pos= c(sum(Tusers2011),sum(Tusers2012))+100000)
Panual <- rbind(Panual,totales)
#Plotting:
colores <- c("deepskyblue3", "darkorange")
ggplot() + geom_bar(data= Tanual, aes(x= año, y= Count, fill= User),stat= "Identity" )+
geom_text(data= Panual, aes(x= año,y= pos, label= label),size= 4.5,family= "Comic Sans MS")+
theme_bw() + scale_x_continuous(breaks = 2011:2012) + ggtitle("Users")+
scale_fill_manual(values= colores) + theme(legend.title = element_blank()) +
theme(plot.title = element_text(hjust = 0.5), axis.line = element_line(size= 1, colour= "black")) +
theme(panel.border = element_blank(), panel.grid.minor = element_blank()) +
theme(plot.title = element_text(family = "Comic Sans MS"), text = element_text(family= "Comic Sans MS")) +
theme(axis.text.x = element_text(colour="black",size=10),axis.text.y = element_text(colour="black", size=7)) +
theme(legend.text = element_text(colour= "black",size=11)) + xlab("year")
The goal is to show the demand on average per hour in a day of the year.To choose the year whose result is more similar to this year.So I take as a reference the year 2012, being that it is the closest year whose data are available.
datos2012 <- filter(datos[,-1],año == "2012")
datos2012 <- within(datos2012, rm(dteday,weathersit,temp,hum,windspeed,año))
i <- 1
for(S in 1:4){ season <- filter(datos2012, season == S)
for (H in 0:23) { hora <- filter(season, hr == H)
for(wk in 0:1){workingday <- filter(hora, workingday == wk)
if ( i ==1){ datosprom2012 <- round(colMeans(workingday),0)
i <- 2}
else { datosprom2012 <- rbind(datosprom2012,round(colMeans(workingday),0))}
}
}
}
datosprom2012 <- as.data.frame(datosprom2012)
rownames(datosprom2012) <- NULL
datosprom2012$season <- as.factor(datosprom2012$season)
datosprom2012$workingday <- ifelse(datosprom2012$workingday == 0, "h","w")
names <- list( '1'="Spring",'2'="Summer",'3'="Fall",'4'="Winter",'w'="weekday",'h'="holiday")
labeller <- function(variable,value){ return(names[value])}
ggplot(datosprom2012, aes(hr)) + geom_line (aes(y= casual, colour="casual")) +
geom_line(aes(y= registered,colour="registered")) + geom_line(aes(y= casual + registered, colour= "total")) +
scale_colour_manual("",values= c("casual"= "deepskyblue3","registered"= "darkorange","total"= "firebrick4")) +
facet_grid(workingday ~ season, labeller = labeller, scale= "free_y" ) + theme_bw() +
theme(strip.background = element_rect(colour = "paleturquoise", fill = "paleturquoise")) +
theme(legend.position = "bottom",legend.direction = "horizontal",legend.title = element_blank()) +
theme(plot.title = element_text(hjust = 0.5), axis.line = element_line(size= 1, colour= "black")) +
theme(plot.title = element_text(family = "Comic Sans MS"), text = element_text(family= "Comic Sans MS")) +
theme(axis.text.x = element_text(colour="black",size=10),axis.text.y = element_text(colour="black", size=10)) +
theme(legend.text = element_text(colour= "black",size=13)) + theme(panel.grid.minor = element_blank()) +
ggtitle("Average demand-2012") + ylab("count") + xlab("hours")
I made two analysis:
Last, the demand for casual, registered users and total demand have been everyone ejecuted separately.
datosmensual <- within(datos, rm(datetime,season,hr,workingday,weathersit,temp,hum,windspeed))
datosmensual$mes <- format(as.Date(datosmensual$dteday, "%Y-%m-%d"),"%m")
datosmensual$año <- as.numeric(datosmensual$año)
datosmensual$mes <- as.numeric(datosmensual$mes)
demandamensual <- matrix(c(rep(0,24*2)),24,2)
colnames(demandamensual) <- c("casual","registered")
i <- 1
for(y in 2011:2012){datoaño <- filter(datosmensual,año == y)
for(m in 1:12){datomes <- filter(datoaño, mes == m)
demandamensual[i,] <- colSums(datomes[,2:3])
i <- i +1
}
}
demandamensual <- as.data.frame(demandamensual)
demandamensual$total <- demandamensual$casual + demandamensual$registered
casual <- ts(demandamensual$casual,frequency = 12,start= c(2011,1))
casualcomponents <- decompose(casual)
plot(casualcomponents)
casualforescast <- HoltWinters(casual)
casualforescast2 <- forecast(casualforescast, h=12)
plot(casualforescast2,main= "Forecast of casual demand from HoltWinters")
registered <- ts(demandamensual$registered,frequency = 12,start= c(2011,1))
registeredcomponents <- decompose(registered)
plot(registeredcomponents)
registeredforescast <- HoltWinters(registered)
registeredforescast2 <- forecast(registeredforescast, h=12)
plot(registeredforescast2,main= "Forecast of registered demand from HoltWinters")
total <- ts(demandamensual$total,frequency = 12,start= c(2011,1))
totalcomponents <- decompose(total)
plot(totalcomponents)
totalforescast <- HoltWinters(total)
totalforescast2 <- forecast(totalforescast, h=12)
plot(totalforescast2,main= "Forecast of total demand from HoltWinters")
Because of a growing in user demand of almost 50% in 2012 compared to 2011 and for a most reliable demand prediction design ,so it is preferible using data from last year. Also for practical purposes, I am only considering the total of users to avoid multiple decisions tree for each target.
D2012 <- filter(datos[,-1],año == "2012")
D2012 <- within(D2012, rm(dteday,año))
D2012$total <- D2012$casual+ D2012$registered
D2012 <- within(D2012, rm(casual,registered))
wd <- sum(filter(D2012, workingday == 1)$total)
nwd <- sum(filter(D2012, workingday == 0)$total)
pwd <- round(wd*100/(wd+ nwd),0)
pnwd <- 100 - pwd
slices <- c(pwd,pnwd)
dia <- c( "weekday:", "holiday:")
lbls <- paste( dia,paste(slices,"%",sep=""))
pie3D(slices,labels=lbls, col = c( "blue", "green"),explode=0.0,theta=0.9,
mar=c(1,1,1,1), labelcex=1,main="Users 2012")
So there will apparently not be much difference in the complexity of decisions tree both on weekdays and on holidays regardless of the season.
First, let’s test the model for Winter season:
D20121 <- filter(D2012, season == 4 & workingday == 1 )
model1 <- ctree( total~., data= D20121)
plot(model1, main= "2012 Decision Tree of Winter's weekday")
Now, testing the model for Spring season :
D20122 <- filter(D2012, season == 1 & workingday == 0)
model2 <- ctree( total~., data= D20122)
plot(model2, main= "2012 Decision Tree of Spring's holiday")