## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] TRUE
##
## [[11]]
## [1] TRUE
##
## [[12]]
## [1] TRUE
##
## [[13]]
## [1] TRUE
##
## [[14]]
## [1] TRUE
##
## [[15]]
## [1] TRUE
##
## [[16]]
## [1] TRUE
##
## [[17]]
## [1] TRUE
Presentation of the Project Notebook
1.1 Dataset Description
1.2 Dataset settings Adjusments
Descriptive Analysis
2.1 Overview
2.2 How many bikes are rented on a daily basis ?
2.3 What is the most common Rental Method ?
Local Analysis
3.1 What is the variation of bikes available within a day ?
3.2 How many transactions are made in a day ?
3.3 How much sales regarding these transactions ?
Public transportation have made a huge progress these last years especially when it comes to green mobility.Indeed, public city bikes are flourishing in urban areas for offering a fast and environmental friendly way to travel for people. As the demand increases while public powers are putting efforts in developing bike lanes instead of roads. However if it is growing too fast , city centers could face local shortages if every city area is not well equipped with enough city bikes.
This data analysis project is conducted over an hourly updated dataset of Paris citybikes called Velib. That is why in this exploratory analysis shortages per areas (18 totally in Paris) will be showcased for drawing some recommendations on further locations to build.
Paris has more than 1400 stations and more than 400km2 covered with mechanical and ebikes. Data is updated in real time what allows to see the evolution of service in terms of available bikes for each station.
Here are listed below the variables of this dataset :
stationCode : chr
station_id : num
num_bikes_available: int (Number of bikes available)
mechanical : int (Number of mechanicald Bikes available)
ebike : int (Number of Ebikes available)
numBikesAvailable : int (Total Number of bikes available)
num_docks_available: int (Number of docks available)
numDocksAvailable : int
is_installed : int (Wether if the station is installed or not 1-yes 0-no)
is_returning : int (1- bike is returned 0-bike is not)
is_renting : int (1-bike is rented 0-bike is not)
last_reported : int (timestamp when the situation was reported)
name : chr (name of the station)
lat : num (latitude of the station)
lon : num (longitude of the station)
capacity : num (total capacity of the station)
rental_methods : (NA refers to subscription CREDIT CARD to one time payment)
Quick explanation of the dataset : the above aggregated data is directly obtained by GET request from public APIs.Each time the API is called the sent response is capturing the situation of bike trafic in Paris. As bike come and go the number of bikes available and dock available varies depending on the number of bikes actually rented during two time periods. One line equals to one transaction when each time a bike is rented.
For this analysis we will only take in account the transactions made in year 2022.
A good analysis starts with removing outliers. As a matter of fact do you have noticed that some the old transactions were reported in 2018 ? Obviously we can formulate some hypothesis about this. Every user is not peculiar when it comes to ride or park a bike. And some have seen their beloved ones ending in the waters of the Seine or broken by accident. 😢
Life is though for citybikes, but numerous ones are still remaining and are used everyday without being seriously damaged. That is why remaining operational bikes will only be taken in account in this study.
However feel free to play with the transactions that were reported before this year !
There is no precise variable to estimate the number of bikes like a unique id for each bike actually in function.Nevertheless, the total number of bikes can be estimated by the number of docks assuming that there are as many docks as bikes available and this number of docks remains stable over time.
After filtering by date the count for each day analysed the count of stations is successively 1444,1452 and 1452. However this might be to a lack of data sent for one station. Let’s assume that the number of stations count is stable.
The 1452 stations are counting bikes with mechanical and e bikes over the fleet. Mechanical bikes are representing more than 2/3 of the bikes population in Paris !
## Le chargement a nécessité le package : viridisLite
##
## Attachement du package : 'viridis'
## L'objet suivant est masqué depuis 'package:scales':
##
## viridis_pal
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
As we can see there the number of bikes available remains stable from day to day.
However, as you might have noticed the number of bikes available equals the number of mechanical and e bikes lastly reported in the station. But it does not mean that every bike is fully operational. One way to challenge this KPI is to compare for every bike station how many bikes are rented for one day (ie counting the number of transaction lines) and the number of bikes available.
#create a new column for grouping the dates per hour and analyse the number of bikes rented during the day 8th December.
rented <- velib %>%
filter(last_reported > as.POSIXct('2022-12-08 00:00:00')) %>%
group_by(Hour = floor_date(last_reported,"1 hour")) %>%
mutate(rented_count = sum(is_renting))
rented_plot <- ggplot(rented,aes(x = Hour, y = rented_count)) +
geom_line() +
labs(title = "Count of rented bikes over time",
x = 'Date',
y = 'Count of Rented bikes')
rented_plot
City bikes can be easily used with city transportation card with a significant discount on all subscription. So we might think that bikes are more rented by subscription rather than with Credit Card which means that the user is unlocking the bike for a single ride.
payment <- dbGetQuery(conn, 'SELECT rental_methods, count(*) as subscription
FROM Velib_Data
WHERE rental_methods IS NULL
UNION
SELECT rental_methods, count(*) as subscription
FROM Velib_Data
WHERE rental_methods == "CREDITCARD"')
payment
## Warning: Ignoring unknown parameters: stat
Unlike what could be assumed the single ride payment is the most used rental method. In proportion it is more than 88% ! Numerous reasons can be given for explaining such gap but maybe locals are not using as many as citybikes as tourists, or they are not aware that a monthly subscription will cost as much as the same as a single ride.
In order to analyse which stations are lacking city bikes, first step consists in seeing how many bikes are available within a day to analyse peak hours. As each line refers to one transaction let’s create a table with all the transaction done within one day.
trafic <- velib %>%
group_by(hour = format(
as.POSIXct(
floor_date(last_reported,"1 hour")),
format = '%H')) %>%
#for one station it could have several transactions per hour
distinct(hour,station_id, .keep_all = T) %>%
summarize(total_availability = sum(num_bikes_available)) %>%
#removing the outliers from the hours not recorded during the night
filter(hour > '09')
trafic
trafic_plot <- ggplot(trafic, aes(x = hour, y = total_availability)) +
geom_col()+
labs(title = 'Number of bikes available over time',
x = 'hour',
y = 'number of bikes')
trafic_plot
As the data frame append lines when a new transaction is reported, we would want to monitor the available number of bikes from one station to another but limit the number of points entered. For doing this, one way is to rank all stations per number of transactions and show this KPI.
# counting the number of transactions per citybike stations
#as the calls were not made hour per hour regularly on one day get rid of the date to only have the hour. We assume here that the trafic does not vary from one day to another excepting week-ends.
transaction_count <- velib %>%
group_by(hour = format(
as.POSIXct(
floor_date(last_reported,"1 hour")),
format = '%H')) %>%
filter(is_renting == 1) %>%
count(station_id, wt = is_renting) %>%
summarise(total_transactions = sum(n))
transaction_count
transaction_count_plot <- ggplot(transaction_count, aes(x = hour, y = total_transactions)) +
geom_col() +
labs (title = "Total transactions per hour",
x = "hour",
y = "count")
transaction_count_plot
According the graph the number of transactions is peaking between 19:00 and 20:00 at 12606. Considering the total amount of bikes available in the city that is to say the number of docks equals the number of bikes in available on station plus those rented is 29755, more than 42% of bikes in the city are rented in this time frame !
Assuming this data, it could be great to see how much turnover is doing Velib’ on single ride rentals. Nevertheless, it is not possible to correctly know how much it is for subscriptions as different subscription methods are existing at different fees.
In the following part of the study we will assume that payment with credit card will only be one time payment that don’t exceed 45 min (time when you get charged extra fee). For the need of the study we will timeframe the day in 3 parts : morning (from 00:00 AM to 12:00AM), afternoon (from 12:01 AM to 6:00PM) and evening (from 6:01 PM to 00:00 AM).
A single ride is costing 3€ for a mechanical or an ebike.
#take the transaction count and filtering by rental_methods
turnover_over_date <- velib %>%
group_by(hour = format(
as.POSIXct(
floor_date(last_reported,"1 hour")),
format = '%H')) %>%
#filterig by is_renting otherwise the user is not debited of his ride
filter(is_renting == 1,
rental_methods == "CREDITCARD") %>%
count(station_id, wt = is_renting) %>%
summarise(turnover = sum(n)*3)
turnover_morning <- turnover_over_date %>%
filter(hour < '12' & hour > '00') %>%
summarise(total = sum(turnover))
turnover_after_noon <- turnover_over_date %>%
filter(hour < '18' & hour > '12') %>%
summarise(total = sum(turnover))
turnover_evening <- turnover_over_date %>%
filter(hour < '24' & hour > '18') %>%
summarise(total = sum(turnover))
turnover_morning
turnover_after_noon
turnover_evening
#display the turnover for one day
turnover_plot <- ggplot(turnover_over_date, aes(x = hour, y = turnover)) +
geom_col() +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
labs(title = 'turnover over time by creditcard payment segment',
x = 'Date',
y = 'turnover')
turnover_plot
Interesting ! The turnover made between 20:00 and 21:00 is twice as much as the turnover made between 22:00 and 23:00 and peaks at 34000 €. In the evening the computed turnover made in the evening is 102534 € ! Comparing to the afternoon it is 3 times bigger !
This exploratory data analysis has led us to understand more indepth which are the rental trends in terms of trafic over the day, prefered rental methods by users and how bike availability can be impacted. Obviously all the variables did not have all been used. Here are some additional ideas for some exploratory analysis about citybikes.
Seizing the number of stations in each Parisian disctrict and computing the number of bikes in order to see which district is in tension compared to the others
Extend this analysis to scooters
Predict the most recurring pathways traveled by users between bike stations. This analysis will imply tracking bikes id with other tools like instant tracking of scooters for example.