INTRODUCTION

Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Today there is great interest in these systems because of their important role in traffic, environmental, and health issues.

Apart from their interesting real-world applications, bike-sharing systems generate data whose characteristics make them attractive for research. Features such as travel duration, departure and arrival positions, and the total number of bikes rented turn a bike-sharing system into a virtual sensor network that can be used to sense mobility in the city. Hence, it is expected that the most important events in the city could be detected by monitoring these data.

Capital Bikeshare has more than 4,300 bikes available at 500 stations across 7 jurisdictions. It provides residents and visitors with a convenient, fun, and affordable way to get from point A to point B. People use Capital Bikeshare to commute to work or school, run errands, get to appointments or social engagements, and more.

DATA UNDERSTANDING

We use the data aggregated on a daily basis, limit the analysis to a single system (Capital Bikeshare), and focus only on the number of bikes rented.

dataset <- read.csv("data_input/day.csv")
head(dataset)
tail(dataset)
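
The daily aggregation mentioned above can be sketched as follows. This is only an illustration under assumptions: the `hourly` data frame and its values are hypothetical toy data, not the real file, and the actual day.csv was already delivered in aggregated form.

```r
# Hypothetical hourly records (toy values, not taken from the real dataset)
hourly <- data.frame(
  dteday = as.Date(c("2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02")),
  hr     = c(0, 1, 0, 1),
  cnt    = c(16, 40, 17, 17)
)

# Sum the hourly counts within each date to get one row per day
daily <- aggregate(cnt ~ dteday, data = hourly, FUN = sum)
print(daily)
```

Each date ends up with a single `cnt` value, which is the shape of the table analysed in the rest of this notebook.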


COLUMNS EXPLANATION

instant: recorded index
dteday: date of transaction
season: number representing season (1:Spring, 2:Summer, 3:Fall, 4:Winter)
yr: number representing year (0:2011, 1:2012)
mnth: number representing Month (1:January to 12:December)
hr: number representing hour (0 to 23); this column exists only in the hourly version of the data, not in day.csv
holiday: number representing holiday status (0:Not Holiday, 1:Holiday)
weekday: number representing Day of the week (0:Sunday, 1:Monday, 2:Tuesday, 3:Wednesday, 4:Thursday, 5:Friday, 6:Saturday)
workingday: Whether Working day or Weekend (0:Weekend/Holiday, 1:Working Day)
weathersit: Weather Condition (1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog)
temp: Normalized temperature in Celsius. The values are divided by 41 (max)
atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users (non-member user)
registered: count of registered users (member user)
cnt: count of total rental bikes including both casual and registered
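
Since `cnt` is described as the sum of `casual` and `registered`, a quick consistency check is possible. The sketch below uses a small illustrative data frame named `toy` (an assumption made for this example) instead of the real file; the same comparison could be run on `dataset`.

```r
# Illustrative rows with the three count columns
toy <- data.frame(
  casual     = c(331, 131, 120),
  registered = c(654, 670, 1229),
  cnt        = c(985, 801, 1349)
)

# TRUE for every row where the total equals casual + registered
rows_ok <- toy$cnt == toy$casual + toy$registered
all(rows_ok)
```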

CHECKING NA’S AND DUPLICATES

CHECKING NA’S

sum(is.na(dataset))
#> [1] 0

CHECKING DUPLICATES

sum(duplicated(dataset))
#> [1] 0

As we can see from the output above, this data frame contains no missing values and no duplicated rows.
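
When the totals are not zero, it helps to know which columns are affected. A minimal sketch, assuming a small hypothetical data frame `toy` with one missing value and one repeated row:

```r
# Toy data: one NA in temp, and rows 3 and 4 are identical
toy <- data.frame(
  temp = c(0.34, NA, 0.20, 0.20),
  cnt  = c(985, 801, 1349, 1349)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that repeat an earlier row
n_duplicates <- sum(duplicated(toy))
print(n_duplicates)
```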

DATATYPES

We need to change several columns before we start to explore the dataset.

library(lubridate)
# The categorical columns are stored as integer codes; below, each code is mapped to a readable label.
# as.character() is applied inside each call because switch() matches labels by name only for character input.

dataset$season <- sapply(X = as.character(dataset$season),
                           FUN = switch, 
                           "1" = "Spring",
                           "2" = "Summer", 
                           "3" = "Fall", 
                           "4" = "Winter")
dataset$holiday <- sapply(X = as.character(dataset$holiday),
                           FUN = switch, 
                           "0" = "No",
                           "1" = "Yes")
dataset$workingday <- sapply(X = as.character(dataset$workingday),
                           FUN = switch, 
                           "0" = "Weekend",
                           "1" = "Weekday")
dataset$weathersit <- sapply(X = as.character(dataset$weathersit),
                           FUN = switch, 
                           "1" = "Clear / Partly Cloudy",
                           "2" = "Mist + Cloudy",
                           "3" = "Light Snow / Light Rain + Thunderstorm",
                           "4" = "Heavy Rain + Ice Pellets + Thunderstorm + Mist / Snow + Fog")


# convert dteday to a Date, then derive the year, month name and weekday name from it
dataset$dteday <- ymd(dataset$dteday)
dataset$yr <- year(dataset$dteday)
dataset$mnth <- month(dataset$dteday, label = TRUE, abbr = FALSE)
dataset$weekday <- wday(dataset$dteday, label = TRUE, abbr = FALSE)


# temp, atemp, hum and windspeed are stored normalized (divided by their maximum);
# for EDA purposes we scale them back to their original units.
dataset$temp <- dataset$temp * 41
dataset$atemp <- dataset$atemp * 50
dataset$hum <- dataset$hum * 100
dataset$windspeed <- dataset$windspeed * 67

head(dataset)
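
As an alternative to `sapply` with `switch`, the same relabelling could be done with `factor()`. This is a sketch of a different technique, not what the code above does; `season_code` is a hypothetical vector standing in for the original integer column.

```r
# Hypothetical integer codes as they appear in the raw file
season_code <- c(1, 2, 3, 4, 1)

# Map codes to labels; levels keep the order given here
season <- factor(season_code,
                 levels = 1:4,
                 labels = c("Spring", "Summer", "Fall", "Winter"))

print(as.character(season))
print(levels(season))
```

One design difference: a factor keeps the level order given in `labels`, so tables would list the seasons in that order, whereas the character columns used in this notebook are tabulated alphabetically.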


EXPLORATORY DATA ANALYSIS

  1. We would like to know how many bikes were rented in 2011 and 2012
# we use the `xtabs` function to show the total number of bikes rented in each year
xtabs(formula = cnt ~ yr, data = dataset)
#> yr
#>    2011    2012 
#> 1243103 2049576

Insight:
- The number of bikes rented in 2012 is greater than in 2011.
- The total grew by about 65%, from 1,243,103 to 2,049,576 rentals.
We hope the increase reflects more people choosing bikes as their means of transportation, but these totals alone cannot confirm the cause.
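
The size of the year-over-year change can be worked out directly from the two totals printed by `xtabs` above. The variable names below are introduced only for this sketch.

```r
# Yearly totals copied from the xtabs output above
total_2011 <- 1243103
total_2012 <- 2049576

# Relative growth from 2011 to 2012, in percent
growth_pct <- (total_2012 / total_2011 - 1) * 100
print(round(growth_pct, 1))
```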

  2. Does the season affect the number of bikes rented in each year?
xtabs(formula = cnt ~ season+yr, data = dataset)
#>         yr
#> season     2011   2012
#>   Fall   419650 641479
#>   Spring 150000 321348
#>   Summer 347316 571273
#>   Winter 326137 515476

Insight:
- The number of bikes rented is highest in Fall, in both 2011 and 2012.
- We assumed that people would rent fewer bikes in Winter than in any other season, because winter weather is not bike-friendly (e.g. strong wind, snowfall, and cold).
However, the data say otherwise: Winter rentals are almost as high as Summer rentals, and the lowest totals in both years are in Spring.
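
Because the yearly totals differ so much, comparing seasons is easier with shares than with raw counts. A minimal sketch, assuming the counts are re-entered by hand from the table above into a matrix called `rentals` (on the real data, `prop.table` could be applied to the `xtabs` result directly):

```r
# Seasonal totals copied from the xtabs output above (columns are years)
rentals <- matrix(
  c(419650, 150000, 347316, 326137,
    641479, 321348, 571273, 515476),
  nrow = 4,
  dimnames = list(season = c("Fall", "Spring", "Summer", "Winter"),
                  yr = c("2011", "2012"))
)

# Share of each season within its year, in percent (each column sums to 100)
share_pct <- round(prop.table(rentals, margin = 2) * 100, 1)
print(share_pct)
```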

  3. Let us check the weather conditions in the season with the highest number of bikes rented

Summary by season:

# total rentals per season, then the number of days per weather situation and season
aggregate(cnt ~ season, data = dataset, FUN = sum)
table(dataset$weathersit, dataset$season)
#>                                         
#>                                          Fall Spring Summer Winter
#>   Clear / Partly Cloudy                   136    111    113    103
#>   Light Snow / Light Rain + Thunderstorm    4      4      3     10
#>   Mist + Cloudy                            48     66     68     65
# average temperature, minimum and maximum wind speed, and average humidity per season
aggregate(temp ~ season, data = dataset, FUN = mean)
aggregate(windspeed ~ season, data = dataset, FUN = min)
aggregate(windspeed ~ season, data = dataset, FUN = max)
aggregate(hum ~ season, data = dataset, FUN = mean)

Insight:
1. The number of bikes rented is highest in Fall, so we look at the weather situation, temperature, wind speed, and humidity in that season.
2. Most Fall days are Clear / Partly Cloudy (136 of 188).
3. Fall has the highest average temperature of all seasons, 28.959 °C.
4. In Fall, the wind speed ranges from 4.293 to 25.167.
5. In Fall, the average humidity is 63.348%.
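
The separate per-season averages can also be computed in a single call by combining the columns with `cbind()` on the left side of the formula. This is a sketch on a hypothetical data frame `toy` with made-up values, not on the real dataset.

```r
# Toy data: two days per season with made-up measurements
toy <- data.frame(
  season    = c("Fall", "Fall", "Winter", "Winter"),
  temp      = c(28, 30, 16, 18),
  hum       = c(60, 66, 65, 71),
  windspeed = c(10, 14, 9, 13)
)

# Mean of each measurement within each season, one row per season
season_means <- aggregate(cbind(temp, hum, windspeed) ~ season,
                          data = toy, FUN = mean)
print(season_means)
```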

ACKNOWLEDGEMENT

Hadi Fanaee-T
Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto INESC Porto
Campus da FEUP Rua Dr. Roberto Frias, 378 4200 - 465 Porto, Portugal

Original dataset : https://www.kaggle.com/c/bike-sharing-demand
Capital Bikeshare trip data : http://capitalbikeshare.com/system-data
Weather Information : https://openweathermap.org/history
Holiday Schedule : http://dchr.dc.gov/page/holiday-schedule

A work by Taufan Anggoro Adhi

tf.anggoro@gmail.com