This project is about the Capital Bikeshare (CaBi) scheme (http://www.capitalbikeshare.com/) in Washington DC, launched in 2008 by the government of District of Columbia. In this project, we focus on the operations of CaBi in 2011 and 2012. In the Excel spreadsheet data.xlsx, it contains hourly rental data spanning these two years. Two types of rental data were recorded: casual renters and registered renters. A registered renter is a pre-registered member of the bike sharing program, while a casual renter pays for the use of a bike on the spot using a credit card. In addition to the time/date information (including holidays), the spreadsheet also records the weather condition in each hour (e.g., temperature, wind speed and humidity).
We are going to load in R the whole Excel workbook. Since it is composed by three separated sheets, the package XLConnect comes in use. It allow to read the entire workbook, and then load the single worksheets. Please note that XLConnect requires Java to be pre-installed.
library(XLConnect)
work_book <- loadWorkbook("raw data.xlsx")
Since the work_book is quite small (1728 bytes) we are going to load in memory every sheet like a separate object.
object.size(work_book)
## 1728 bytes
note <- readWorksheet(work_book, sheet = "Note")
all_data <- readWorksheet(work_book, sheet = "All data")
testing <- readWorksheet(work_book, sheet = "Testing")
The Excel workbook is composed by three sheets: “Note” is the coodebook, “All data” is the main dataset, “Testing” is a second and smaller dataset.
#summary(work_book) [check!]
It is the codebook. It describes how the content of the dataset is coded. Specifically, it reports how seasons (season) and weekdays (Weekday) are numerically coded in the datasets.
str(note)
## 'data.frame': 22 obs. of 5 variables:
## $ Season : chr "1" "2" "3" "4" ...
## $ Col2 : chr "Spring" "Summer" "Autumn" "Winter" ...
## $ Col3 : logi NA NA NA NA NA NA ...
## $ Weekday: num 1 2 3 4 5 6 7 NA NA NA ...
## $ Col5 : chr "Monay" "Tuesday" "Wednesday" "Thursday" ...
The dataset all_data is composed by 17379 observation with 13 variables. datetime is profitably flanked by variables providing further information on the day of the week, the season and if we are talking about a working or a resting day.
str(all_data)
## 'data.frame': 17379 obs. of 13 variables:
## $ datetime : POSIXct, format: "2011-01-01 00:00:00" "2011-01-01 01:00:00" ...
## $ Weekday : num 6 6 6 6 6 6 6 6 6 6 ...
## $ season : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Holiday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Workingday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : num 1 1 1 1 1 2 1 1 1 1 ...
## $ Temp : num 9.84 9.02 9.02 9.84 9.84 ...
## $ Feel.like.temp : num 14.4 13.6 13.6 14.4 14.4 ...
## $ Humidity : num 81 80 80 75 75 75 80 86 75 76 ...
## $ Windspeed : num 0 0 0 0 0 ...
## $ Casual.renter : num 3 8 5 3 0 0 2 1 1 8 ...
## $ Registered.renter: num 13 32 27 10 1 1 0 2 7 6 ...
## $ Total.renter : num 16 40 32 13 1 1 2 3 8 14 ...
testing is a smaller 688 observations dataset. On this data, we will try do develop a model that could forecast the daily rental demand for both casual and registered renters.
str(testing)
## 'data.frame': 688 obs. of 15 variables:
## $ datetime : POSIXct, format: "2013-01-01 00:00:00" "2013-01-01 01:00:00" ...
## $ Weekday : num 2 2 2 2 2 2 2 2 2 2 ...
## $ season : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Holiday : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Workingday : num 0 0 0 0 0 0 0 0 0 0 ...
## $ weather : num 1 1 1 1 1 2 1 1 1 1 ...
## $ Temp : num 13.1 12.8 11.8 11.5 10.9 ...
## $ Feel.like.temp : num 17.4 16.2 16.2 15.9 14.9 ...
## $ Humidity : num 71 70.7 77.3 79 79 ...
## $ Windspeed : num 0 6 0 0 4 ...
## $ Col11 : logi NA NA NA NA NA NA ...
## $ Date : POSIXct, format: "2013-01-01" "2013-01-02" ...
## $ Casual.renter : chr NA NA NA NA ...
## $ Registered.renter: logi NA NA NA NA NA NA ...
## $ Total.renter : logi NA NA NA NA NA NA ...
Use data visualisation to explore the data as much as possible. The top management also welcomed any interesting insights that you might observe from the data. The top management also indicated that they were interested in knowing more about their customers, including the behavioural differences between casual renters and registered renters. Although you were only given limited data (not including the data by stations), the top management wondered if the data could reveal any information about the purposes of the bike rentals by casual renters and registered renters.
In first place we load the xts package, that will help us in manage the time series. Then we create a new xts object containing the number of bike renters along the time horizon considered, that is, from January 1st 2011 and December 31st 2012.
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
casual_xts<- xts(all_data$Casual.renter, all_data$datetime)
regular_xts<- xts(all_data$Registered.renter, all_data$datetime)
We think that the weather could be could be a relevat incentive to the use of the bike. Hence, we are going to plot the total number of bike renter per day. The plot could be more readable by plotting their number by week, but we choose this time format in order to make the plots in line with the requests of the second part of the assignment.
NOTE: The legend box is too compact to allocate the weather conditions as reported in note:
Good: Clear, Few clouds, Partly cloudy, Partly cloudy
Quite good: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
Quite bad: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
Bad: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
The distribution let us deduce that the casual renters are more interested in enjoying the bike during beautiful days. It let us imagine a recreational use of the rent service.
plot(apply.daily(casual_xts, sum), type = "p", pch = all_data$weather, main = "Total number of causal renters per day, by weather")
legend('topleft', legend=c("Good", "Quite good", "Quite bad","Bad"), pch=1:4, title = "Weather", cex = 0.5, pt.cex = 1)
With registerd users this quality is less evident, and the overall distribution more compact. It suggest that the rent service represents a real daily transport mode, to be used even with non-optimal weather conditions.
plot(apply.daily(regular_xts, sum), type = "p", pch = all_data$weather, main = "Total number of regular renters per day, by weather")
legend('topleft', legend=c("Good", "Quite good", "Quite bad","Bad"), pch=1:4, title = "Weather", cex = 0.5, pt.cex = 1)
The bike renting follows, unsurprisingly, a seasonal trend. At the same time, the weather condition is likely to be a strong element concurring in determining the amount of rents.
Such an hypothesis is supported by a simple observation of the frequency of the rents according to the weather conditions. Even if the numbers between the registred users are greater, good weather conditions reveal a remakbly higher number of bike rents.
all_data$weather2 <- all_data$weather
all_data$weather2 <- factor(all_data$weather2, levels = c(1,2,3,4), labels = c("good", "quite_good", "quite_bad", "bad"))
# Total number of Casual renters by weather conditions
aggregate(all_data$Casual.renter,by=list(Category=all_data$weather2), FUN=sum)
## Category x
## 1 good 456624
## 2 quite_good 139107
## 3 quite_bad 35529
## 4 bad 16
# Total number of Registered renters by weather conditions
aggregate(all_data$Registered.renter ,by=list(Category=all_data$weather2), FUN=sum)
## Category x
## 1 good 1862136
## 2 quite_good 670905
## 3 quite_bad 171894
## 4 bad 243
At the same time, could be interesting an observation of the number of rents according to the day of the week.
all_data$Weekday2 <- all_data$Weekday
all_data$Weekday2 <- factor(all_data$Weekday2, levels = c(1,2,3,4,5,6,7), labels = c("monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"))
# Total number of Casual renters by weekday
aggregate(all_data$Casual.renter,by=list(Category=all_data$Weekday2), FUN=sum)
## Category x
## 1 monday 71580
## 2 tuesday 56901
## 3 wednesday 56888
## 4 thursday 60736
## 5 friday 78610
## 6 saturday 163371
## 7 sunday 143190
# Total number of Registered renters by weekday
aggregate(all_data$Registered.renter,by=list(Category=all_data$Weekday2), FUN=sum)
## Category x
## 1 monday 392051
## 2 tuesday 408325
## 3 wednesday 412762
## 4 thursday 428263
## 5 friday 416272
## 6 saturday 338152
## 7 sunday 309353
We can see that there is a relevant peak if rents during the week end. Even if there are many outliers, the result is indicative of the different behaviour between registered and casual users. Registerd users show a stronger activities during the workind days of the week. It is possible that the rent bike represents the daily vehicle for the home - office travel. Causal users show a sigtly superior use of the bike during the weekend. It suggest an use of the bike of touristic nature.
par(mfrow = c(1, 2))
with(all_data, plot(Weekday2, Casual.renter, main = "Casual renters by weekday", ylim=c(0, 800)))
with(all_data, plot(Weekday2, Registered.renter, main = "Registered renters by weekday"))
At the same time, the weather conditions still suggests us a major role in determining the rent of a bike by casual bikers.
coplot(Casual.renter ~ Weekday2 | weather2, data = all_data, rows=1, xlab = c("Causal renters per day", "Weather conditions"), ylim=c(0, 800), ylab = "Weekday", xlim = c(0, 8), panel = function(x, y, ...){boxplot(y ~ x, add=TRUE)})
Similarly, the weather conditions affect the behaviour of the registred users.
coplot(Registered.renter ~ Weekday2 | weather2, data = all_data, rows=1, xlab = c("Registered renters per day", "Weather conditions"), ylab = "Weekday", xlim = c(0, 8), panel = function(x, y, ...){boxplot(y ~ x, add=TRUE)})