Bike Sharing System Data Analysis

October 2020

RPubs link information

RPubs allows you to publish your R Markdown documents on the web.
Use this link to access the presentation: www………

Introduction

Improvements in technology and connectivity have changed all types of business, and bike rental is no exception.
Bike-sharing systems are a new generation of bike rentals in which the whole process, from membership to rental and return, is automated.
In other words, a bike-sharing system consists of strong, durable bikes that are available at a network of docking stations throughout a region.

Introduction

Bikes can be unlocked at any station and returned to any other station in the system, which makes them ideal for one-way trips.

Introduction

Capital Bikeshare is metro DC’s bike-share company, with more than 4,300 bikes available at 500 stations across six U.S. jurisdictions.
Opened in September 2010, Capital Bikeshare was the largest bike sharing service in the United States until New York City’s Citi Bike began operations in May 2013.

Problem Statement

The dependent variable of the problem is the number of bicycles rented per day.
Find out which variables are most relevant in determining the number of bike rentals at Capital Bikeshare.
Use data visualization and statistical measures to analyze the relationships between the dataset variables.
Create a multiple linear regression model to examine the importance of the variables.

Data

The dataset contains the daily count of rental bikes in the Capital Bikeshare system for the years 2011 and 2012, with the corresponding weather and seasonal information.
The dataset has 16 columns and 731 rows.
Let's read the dataset and check its structure:

# Read dataset
day <- read.csv('day.csv')

# Check dataset structure
str(day)

## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...

Data

All variables in the dataset are numeric, except for dteday, which is a character variable.
Let's check the definition of the variables:

Variable	Definition
instant	record index
dteday	date
season	1:winter, 2:spring, 3:summer, 4:fall
yr	0:2011, 1:2012
mnth	month (1 to 12)
holiday	whether the day is a holiday or not
weekday	day of the week (0 to 6)
workingday	1 if the day is neither a weekend nor a holiday, otherwise 0
weathersit	1:clear_sky, 2:cloudy, 3:ligth_snow_rain
temp	normalized temperature in Celsius; values are derived via t_min=-8 and t_max=+39
atemp	normalized feeling temperature in Celsius; values are derived via t_min=-16 and t_max=+50
hum	normalized humidity; values are divided by 100 (max)
windspeed	normalized wind speed; values are divided by 67 (max)
casual	count of rentals by casual users
registered	count of rentals by registered users
cnt	count of total rental bikes; the sum of casual and registered
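The normalized weather variables can be mapped back to physical units. The sketch below assumes that temp and atemp were min-max scaled, that is (t - t_min) / (t_max - t_min), which is the usual reading of the t_min and t_max values in the table. The helper denormalize() is a hypothetical name introduced here, and the input values are the first rows printed by str(day).

```r
# Hypothetical helper: invert a min-max scaling (assumed, see lead-in)
denormalize <- function(x, lo, hi) {
  x * (hi - lo) + lo
}

# First three values of each variable, copied from the str(day) output
temp_norm <- c(0.344, 0.363, 0.196)
hum_norm  <- c(0.806, 0.696, 0.437)
wind_norm <- c(0.16, 0.249, 0.248)

# Temperature in Celsius, using t_min = -8 and t_max = +39
temp_celsius <- denormalize(temp_norm, lo = -8, hi = 39)

# Humidity and wind speed were divided by their maximum, so multiply back
hum_percent <- hum_norm * 100
wind_speed  <- wind_norm * 67

print(round(temp_celsius, 2))
print(hum_percent)
print(round(wind_speed, 2))
```

Working in original units can make axis labels and regression coefficients easier to interpret, although it does not change the fit of a linear model.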

Data

To begin, let's check whether there are any missing values in the dataset.
Then, let's transform the following variables in order to produce the data visualizations:
- dteday: convert to date
- season: convert to factor
- weathersit: convert to factor
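When a factor is declared with more levels than actually occur, it can help to count the observations per level. The sketch below uses a small made-up vector of weather codes (not the real data) and the same four labels as the conversion code in this section. The model summary later in the document lists no heavy_snow_rain coefficient, which suggests that level never occurs in the daily data; that reading is an assumption.

```r
# Toy weather codes for illustration only (not taken from day.csv)
toy <- data.frame(weathersit = c(1, 2, 1, 3, 2, 1))

# Same level/label mapping as used for the real dataset
toy$weathersit <- factor(toy$weathersit, levels = 1:4, labels = c(
  "clear_sky",
  "cloudy",
  "ligth_snow_rain",
  "heavy_snow_rain"))

# Count observations per level; unused levels show up with a count of 0
counts <- table(toy$weathersit)
print(counts)

# Optionally drop levels that never occur
toy$weathersit <- droplevels(toy$weathersit)
print(levels(toy$weathersit))
```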

# Check missing values (the data frame is named day)
any(is.na(day))

## [1] FALSE
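A single TRUE/FALSE answer does not say where missing values are. As a minimal sketch on a made-up data frame (the values are illustrative, not from day.csv), a per-column count can be obtained with base R:

```r
# Toy data frame with two deliberately missing values
toy <- data.frame(
  temp = c(0.34, NA, 0.20),
  hum  = c(0.81, 0.70, NA),
  cnt  = c(985, 801, 1349)
)

# Overall check: is anything missing?
has_missing <- any(is.na(toy))

# Per-column count of missing values
missing_per_column <- colSums(is.na(toy))

print(has_missing)
print(missing_per_column)
```

The same two calls can be applied to the real data frame once it has been read.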

# Transform variable dteday as date
day$dteday <- as.Date(day$dteday, format = '%Y-%m-%d')

# Refine variable season
day$season <- factor(day$season, levels = 1:4, labels = c("winter","spring","summer","fall"))

# Refine variable weathersit
day$weathersit <- factor(day$weathersit, levels = 1:4, labels = c(
        "clear_sky",
        "cloudy",
        "ligth_snow_rain",
        "heavy_snow_rain"))

Data Visualisation

Let's plot a time series graph of the variable cnt.

# Load plotting library
library(ggplot2)

# Plot graph
ggplot(day, aes(x = dteday, y = cnt)) + 
    geom_line() + theme_minimal() + labs(x = 'Date', y = 'Number of Bikes Rented') +
    theme(axis.text = element_text(size=14), axis.title = element_text(size=14))

Data Visualisation

The time-series graph shows that the number of bike rentals increased markedly from 2011 to 2012.
The graph also shows a seasonal pattern and strong day-to-day fluctuation in the number of rentals.
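Strong day-to-day fluctuation can hide the underlying trend. One common way to smooth a daily series is a centered 7-day moving average; this is an added illustration, not part of the original analysis. The sketch below applies it to the first ten cnt values printed by str(day).

```r
# First ten daily counts, copied from the str(day) output
cnt_first10 <- c(985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321)

# Centered 7-day moving average; the first and last three days have no
# full window and are returned as NA
weekly_mean <- as.numeric(
  stats::filter(cnt_first10, rep(1 / 7, 7), sides = 2)
)

print(round(weekly_mean, 1))
```

On the full dataset, the smoothed values could be added as a second line on the time series plot.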

Data Visualisation

Let's plot a boxplot of cnt by weathersit.

ggplot(day, aes(y=cnt, fill = weathersit, x = weathersit)) +
    geom_boxplot() + theme_minimal() + labs(x = '', fill = '', y = 'Number of Bikes Rented') +
    theme(axis.title.x = element_blank(), axis.text.x = element_blank(), 
    legend.text = element_text(size=14), axis.title = element_text(size=14),
    axis.text = element_text(size=14))

Data Visualisation

The boxplot shows that the weather influences the number of bikes rented.
As expected, the number of bikes rented tends to be higher in clear-sky weather than in the other weather conditions.

Descriptive Statistics

To start the statistical analysis, let's check some summary statistics of bike rentals for each season.

# Load dplyr for %>%, group_by() and summarise()
library(dplyr)

# Group data by season
season_stats <- day %>% group_by(season) %>% summarise(min = min(cnt), lowerQuantile = quantile(cnt, probs = .25),
                                       median = median(cnt), upperQuantile = quantile(cnt, probs = .75),
                                       max = max(cnt), Mean = mean(cnt), standardDeviation = sd(cnt),
                                       count = n())

# Print table
knitr::kable(season_stats)

season	min	lowerQuantile	median	upperQuantile	max	Mean	standardDeviation	count
winter	431	1538.0	2209.0	3456.00	7836	2604.133	1399.942	181
spring	795	4003.0	4941.5	6377.00	8362	4992.332	1695.977	184
summer	1115	4586.5	5353.5	6929.25	8714	5644.303	1459.800	188
fall	22	3615.5	4634.5	5624.50	8555	4728.163	1699.615	178

The descriptive statistics table shows that the number of bikes rented differs considerably between seasons: the median is lowest in winter (2209.0) and highest in summer (5353.5).
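Descriptive statistics alone do not establish statistical significance. One way to test whether mean rentals differ across seasons is a one-way ANOVA; this technique is swapped in here as an illustration and was not part of the original analysis. Because day.csv is not bundled with this sketch, it runs on simulated data whose group means are taken from the table above; the group size, standard deviation and seed are arbitrary assumptions.

```r
set.seed(42)

# Seasonal means from the descriptive statistics table (rounded)
season_means <- c(winter = 2604, spring = 4992, summer = 5644, fall = 4728)

# Simulated stand-in data: 50 days per season, sd chosen arbitrarily
sim <- data.frame(
  season = factor(rep(names(season_means), each = 50),
                  levels = names(season_means)),
  cnt = rnorm(200, mean = rep(season_means, each = 50), sd = 1500)
)

# One-way ANOVA: does mean cnt differ between seasons?
fit <- aov(cnt ~ season, data = sim)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

print(summary(fit))
```

With the real data, the same call would be aov(cnt ~ season, data = day), run before the columns are dropped for the regression.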

Hypothesis Testing

Regression analysis is a statistical technique for estimating the relationships among variables.
The objective of a linear regression model is to formulate a linear equation relating the dependent variable to the independent variables.
Regression models with more than one independent variable are called multiple linear regression models.
The multiple linear regression equation is formulated as follows:

\[y = B_0 + \sum^n_{i = 1}B_i x_i + E\]

\(y =\) dependent variable ; \(B_0 =\) intercept ; \(B_i =\) parameter

\(x_i =\) independent variable ; \(E =\) error
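To make the equation concrete, the sketch below simulates data from a known linear model with an intercept and two independent variables and checks that lm() recovers the parameters. The true values (2, 3 and -1.5), the noise level and the seed are arbitrary choices made for this illustration.

```r
set.seed(1)
n <- 200

# Two independent variables
x1 <- runif(n)
x2 <- runif(n)

# Dependent variable: intercept + sum of parameter * variable + error
y <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.1)

# Fit the multiple linear regression and extract the estimates
fit <- lm(y ~ x1 + x2)
estimates <- coef(fit)

print(round(estimates, 2))
```

The estimates should land close to the true parameters, with the remaining gap due to the error term.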

Hypothesis Testing

Before fitting the regression model, it is necessary to remove the variables that could negatively affect the model.
The following variables will be excluded from the dataset:
- dteday: other variables (yr, mnth and weekday) already represent the date
- instant: used only for identification
- casual: it is a component of cnt (cnt = casual + registered), so keeping it would leak the target into the predictors
- registered: same reason as casual
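The reason for dropping casual and registered can be checked directly: together they add up to cnt, so a model that includes them would reproduce the target almost trivially. The sketch below verifies this on the first five rows printed by str(day).

```r
# First five rows of the three count columns, copied from str(day)
counts <- data.frame(
  casual     = c(331, 131, 120, 108, 82),
  registered = c(654, 670, 1229, 1454, 1518),
  cnt        = c(985, 801, 1349, 1562, 1600)
)

# cnt is defined as the sum of casual and registered
identity_holds <- all(counts$casual + counts$registered == counts$cnt)

print(identity_holds)
```

On the full dataset the same check is all(day$casual + day$registered == day$cnt), run before the columns are removed.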

# Remove unwanted variables
day <- dplyr::select(day, -c(dteday, instant, casual, registered))

# Create multiple linear regression model
model <- lm(cnt ~ ., data = day)

Hypothesis Testing

# Check model summary
model %>% summary()

## 
## Call:
## lm(formula = cnt ~ ., data = day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3617.0  -370.3    72.4   473.0  3128.9 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1307.97     241.26   5.421 8.09e-08 ***
## seasonspring               1158.76     114.67  10.105  < 2e-16 ***
## seasonsummer                921.46     165.16   5.579 3.43e-08 ***
## seasonfall                 1651.21     153.67  10.745  < 2e-16 ***
## yr                         2018.97      61.02  33.089  < 2e-16 ***
## mnth                        -15.24      16.15  -0.943  0.34574    
## holiday                    -531.53     187.34  -2.837  0.00468 ** 
## weekday                      67.51      15.17   4.449 1.00e-05 ***
## workingday                  116.17      67.10   1.731  0.08385 .  
## weathersitcloudy           -452.17      80.56  -5.613 2.85e-08 ***
## weathersitligth_snow_rain -1954.82     205.44  -9.515  < 2e-16 ***
## temp                       3941.82    1380.69   2.855  0.00443 ** 
## atemp                      1290.20    1507.55   0.856  0.39238    
## hum                       -1198.19     294.51  -4.068 5.26e-05 ***
## windspeed                 -2708.11     429.85  -6.300 5.19e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 812.2 on 716 degrees of freedom
## Multiple R-squared:  0.8276, Adjusted R-squared:  0.8242 
## F-statistic: 245.5 on 14 and 716 DF,  p-value: < 2.2e-16
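The coefficients can be read as terms of the regression equation. As a worked example, the sketch below rebuilds the fitted value for the first day in the dataset (2011-01-01) by hand, using the estimates from the summary above and the rounded inputs printed by str(day). Terms that are zero for this day (winter is the baseline season, yr = 0, holiday = 0, workingday = 0) are left out. Because the inputs are rounded, the result is only approximate.

```r
# Estimates copied from the model summary (non-zero terms for this day)
coefs <- c(intercept = 1307.97, mnth = -15.24, weekday = 67.51,
           weathersitcloudy = -452.17, temp = 3941.82, atemp = 1290.20,
           hum = -1198.19, windspeed = -2708.11)

# Values for 2011-01-01 from str(day); weathersit = 2 means cloudy
x <- c(intercept = 1, mnth = 1, weekday = 6,
       weathersitcloudy = 1, temp = 0.344, atemp = 0.364,
       hum = 0.806, windspeed = 0.16)

# Fitted value = intercept + sum of coefficient * value
predicted <- sum(coefs * x[names(coefs)])

print(round(predicted))
```

The hand-computed value is about 1672 rentals, while the observed cnt for that day is 985, a gap that is within one residual standard error (812.2).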

Hypothesis Testing

To conclude the hypothesis testing, let's check the importance of the variables using the function varImp() from the R package caret.

# Load caret for varImp()
library(caret)

# Create variables importance dataframe
varimport <- as.data.frame(varImp(model))
varimport$variable <- rownames(varimport)

# Print ordered variables importance dataframe
knitr::kable(varimport[order(varimport$Overall, decreasing = TRUE),])

	Overall	variable
yr	33.0893935	yr
seasonfall	10.7454714	seasonfall
seasonspring	10.1053497	seasonspring
weathersitligth_snow_rain	9.5151611	weathersitligth_snow_rain
windspeed	6.3001274	windspeed
weathersitcloudy	5.6127197	weathersitcloudy
seasonsummer	5.5792940	seasonsummer
weekday	4.4490027	weekday
hum	4.0684423	hum
temp	2.8549587	temp
holiday	2.8372380	holiday
workingday	1.7311647	workingday
mnth	0.9434979	mnth
atemp	0.8558282	atemp
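The importance scores in the table match the absolute t values in the model summary (for example, 33.089 for yr), which suggests that for a linear model this importance measure is the absolute t statistic of each coefficient. Under that assumption, the same ranking can be obtained with base R alone. The sketch below uses the built-in mtcars dataset as a stand-in, since day.csv is not bundled here.

```r
# Stand-in model on a built-in dataset (not the bike data)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# t values of all coefficients except the intercept
t_values <- coef(summary(fit))[-1, "t value"]

# Importance as absolute t value, sorted from most to least important
importance <- sort(abs(t_values), decreasing = TRUE)

print(importance)
```

Note that this measure ranks individual coefficients, so a factor such as season appears once per dummy variable rather than as a single predictor.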

Discussion

The multiple linear regression model performed well, with an adjusted R-squared above 0.8.
The most important variable for the number of bikes rented from Capital Bikeshare is the year, followed by season, weather and wind speed.
Beyond that, as the model's t values and p-values show, fall and spring differ more strongly from the winter baseline than summer does.
To improve the analysis, it is recommended to transform the variable weekday into a factor to check which weekdays are most significant in the regression model.
To conclude, season and weather are key determinants of the daily number of bikes rented by Capital Bikeshare customers.
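The weekday recommendation could be sketched as follows. The labels assume that 0 stands for Sunday, which is consistent with 2011-01-01 (a Saturday) being coded 6 in the str(day) output. The example uses the first ten weekday values from that output and shows the dummy variables a regression would receive.

```r
# First ten weekday codes, copied from the str(day) output
weekday_first10 <- c(6, 0, 1, 2, 3, 4, 5, 6, 0, 1)

# Convert to a factor; labels assume 0 = Sunday (see lead-in)
toy <- data.frame(
  weekday = factor(weekday_first10, levels = 0:6,
                   labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
)

# Design matrix: intercept plus one dummy per non-baseline weekday
design <- model.matrix(~ weekday, data = toy)

print(colnames(design))
```

With weekday as a factor, the model would estimate a separate effect for each day relative to the baseline (Sunday) instead of a single linear slope across codes 0 to 6.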
