The data we are using has been made available by Motivate International Inc. under this license: https://ride.divvybikes.com/data-license-agreement. The latest version is available at https://divvy-tripdata.s3.amazonaws.com/index.html.
The data describe a bike-sharing business in Chicago. They contain
historical information on each trip: the start and end date/time,
the docking stations, the type of member, and the rideable type
(classic bike, electric bike or docked bike).
We will try to understand how members (those who buy an annual plan)
and casual riders (those who buy a pass per trip or per day)
use the service differently.
Load the necessary libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.1
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## Warning: package 'readr' was built under R version 4.2.1
## Warning: package 'forcats' was built under R version 4.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
library(skimr)
## Warning: package 'skimr' was built under R version 4.2.2
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.2.1
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.2.1
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
Import all the data sets. The data are stored in one file per month.
jan_data <- read_csv("Case_Study/data/202201_tripdata.csv")
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fev_data <- read_csv("Case_Study/data/202202_tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar_data <- read_csv("Case_Study/data/202203_tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr_data <- read_csv("Case_Study/data/202204_tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_data <- read_csv("Case_Study/data/202205_tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun_data <- read_csv("Case_Study/data/202206_tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul_data <- read_csv("Case_Study/data/202207_tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aou_data <- read_csv("Case_Study/data/202208_tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep_data <- read_csv("Case_Study/data/202209_tripdata.csv")
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct_data <- read_csv("Case_Study/data/202210_tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov_data <- read_csv("Case_Study/data/202211_tripdata.csv")
## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Combine the data into a single data set. Note that `oct_data` (the 202210 file, 558685 rows) is not passed to `bind_rows()` below, so the combined data set does not include the October file.
trip_data_2022 <- bind_rows(jan_data, fev_data, mar_data, apr_data, may_data, jun_data, jul_data, aou_data, sep_data, nov_data)
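Listing every monthly object by hand makes it easy to leave one out. As a hedged alternative, the sketch below reads every matching file in a folder and binds them in one step, keeping the file name so that the rows per file can be checked. The temporary folder and the two tiny CSV files are hypothetical stand-ins for the real monthly files, and `source_file` is a column name introduced only for this illustration.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for the data folder: two tiny monthly files
data_dir <- tempfile("trip_demo_")
dir.create(data_dir)
write_csv(tibble(ride_id = c("a", "b"), member_casual = c("member", "casual")),
          file.path(data_dir, "202201_tripdata.csv"))
write_csv(tibble(ride_id = "c", member_casual = "member"),
          file.path(data_dir, "202202_tripdata.csv"))

# Read every file matching the naming pattern, then bind them together
files <- list.files(data_dir, pattern = "_tripdata\\.csv$", full.names = TRUE)
monthly <- lapply(files, read_csv, show_col_types = FALSE)
names(monthly) <- basename(files)

# .id records which file each row came from
combined <- bind_rows(monthly, .id = "source_file")

# Rows per file: a quick way to confirm that no month is missing
print(count(combined, source_file))
```

Comparing the per-file counts with the `Rows:` lines printed by `read_csv()` is one simple way to confirm that every month made it into the combined data.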
Get a preview of the data
head(trip_data_2022)
A quick summary with glimpse
glimpse(trip_data_2022)
## Rows: 4,927,226
## Columns: 13
## $ ride_id <chr> "C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF7…
## $ rideable_type <chr> "electric_bike", "electric_bike", "classic_bike", "…
## $ started_at <dttm> 2022-01-13 11:59:47, 2022-01-10 08:41:56, 2022-01-…
## $ ended_at <dttm> 2022-01-13 12:02:44, 2022-01-10 08:46:17, 2022-01-…
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A…
## $ start_station_id <chr> "525", "525", "TA1306000016", "KA1504000151", "TA13…
## $ end_station_name <chr> "Clark St & Touhy Ave", "Clark St & Touhy Ave", "Gr…
## $ end_station_id <chr> "RP-007", "RP-007", "TA1307000001", "TA1309000021",…
## $ start_lat <dbl> 42.01280, 42.01276, 41.92560, 41.98359, 41.87785, 4…
## $ start_lng <dbl> -87.66591, -87.66597, -87.65371, -87.66915, -87.624…
## $ end_lat <dbl> 42.01256, 42.01256, 41.92533, 41.96151, 41.88462, 4…
## $ end_lng <dbl> -87.67437, -87.67437, -87.66580, -87.67139, -87.627…
## $ member_casual <chr> "casual", "casual", "member", "casual", "member", "…
Compute the trip duration in seconds, the day of the week, and the month. A year variable is unnecessary here because all the data come from a single year.
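trip_data_2022 <- trip_data_2022 %>%
  # explicit units: plain subtraction lets R choose the unit (secs, mins, ...) automatically
  mutate(ride_len_sec = as.numeric(difftime(ended_at, started_at, units = "secs")),
         day_of_week = wday(started_at, label = TRUE),
         month_ride = month(started_at, label = TRUE))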
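A point worth checking when computing durations: subtracting two date-time vectors returns a `difftime` whose unit R picks automatically, so the same expression can yield seconds on one data set and minutes on another. The small sketch below uses the first two rides shown in the `glimpse()` output; the `UTC` time zone is an assumption made only to keep the example reproducible.

```r
# First two rides from the glimpse() output (time zone assumed for the example)
started_at <- as.POSIXct(c("2022-01-13 11:59:47", "2022-01-10 08:41:56"), tz = "UTC")
ended_at   <- as.POSIXct(c("2022-01-13 12:02:44", "2022-01-10 08:46:17"), tz = "UTC")

# Plain subtraction: the unit is chosen automatically
auto_diff <- ended_at - started_at
units(auto_diff)   # "mins" for these two rides

# Explicit units give plain seconds whatever the data look like
ride_len_sec <- as.numeric(difftime(ended_at, started_at, units = "secs"))
ride_len_sec       # 177 261
```

Fixing the unit explicitly keeps later thresholds such as 60 or 86400 meaningful regardless of which unit R would have chosen.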
The company website states that rides shorter than 60 seconds are “potentially false starts or users trying to re-dock a bike to ensure it was secure”. So let’s remove rides shorter than 60 seconds from our data frame
trip_data_2022 <- trip_data_2022 %>%
  filter(ride_len_sec >= 60)
Let us also take a look at trips that lasted more than 24 hours, that is, 86 400 seconds
temp <- trip_data_2022 %>%
  filter(ride_len_sec > 86400)
head(temp)
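Before dropping or keeping rides, it can help to count how many fall into each duration range. The sketch below does this on a small made-up data frame; the five durations and the bucket labels are invented for illustration and are not taken from the trip data.

```r
library(dplyr)

# Made-up durations in seconds: negative, short, borderline, typical, over a day
rides <- tibble(
  ride_id      = c("r1", "r2", "r3", "r4", "r5"),
  ride_len_sec = c(-30, 45, 60, 900, 90000)
)

# Label each ride with the rule it would fall under
duration_check <- rides %>%
  mutate(bucket = case_when(
    ride_len_sec < 60    ~ "under 60 s",
    ride_len_sec > 86400 ~ "over 24 h",
    TRUE                 ~ "kept"
  )) %>%
  count(bucket)

print(duration_check)
```

A table like this shows at a glance how much data each rule affects, including any negative durations that the 60-second filter also removes.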
Honestly, it is quite odd to have trips this long, but we don’t have enough information to decide what to do with them, so let’s proceed.
Compute the ride length in minutes
trip_data_2022 <- trip_data_2022 %>%
  mutate(ride_len_min = as.numeric(ride_len_sec / 60))
# Some analysis to compare members vs. casual riders
First, let’s compute the number of trips and the average ride length (in minutes) per type of rider
trip_data_2022 %>%
  group_by(member = member_casual) %>%
  # n() counts rows, i.e. trips, not distinct riders
  summarise(number_of_trips = n(), average_ride_len = mean(ride_len_min)) %>%
  kbl() %>% kable_styling()
| member | number_of_trips | average_ride_len |
|---|---|---|
| casual | 2026191 | 30.17630 |
| member | 2797831 | 13.18012 |
The table above shows that casual riders take longer trips than members. Let’s try to understand this further by looking at the average trip duration per day of the week for each type of rider
trip_data_2022 %>%
  group_by(day_of_week, member_casual) %>%
  summarise(Average_ride_len = mean(ride_len_min)) %>%
  ggplot(mapping = aes(x = day_of_week, y = Average_ride_len, fill = day_of_week)) +
  geom_col() +
  facet_wrap(~member_casual) +
  labs(title = "Average ride duration per type of subscribers", x = "day of the week", y = "ride duration (min)") +
  theme(legend.position = "none")
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
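Two small refinements may be worth considering for this summary. Because rides longer than 24 hours were kept, the mean can be pulled up by a few extreme trips, so the median is a useful companion; and passing `.groups = "drop"` returns an ungrouped result, which is what the message above refers to. The sketch below shows both on made-up data; the values are invented for illustration.

```r
library(dplyr)

# Made-up rides: one very long casual trip on Saturday
toy <- tibble(
  day_of_week   = c("Sat", "Sat", "Sat", "Sat", "Mon", "Mon", "Mon"),
  member_casual = c("casual", "casual", "casual", "member", "casual", "member", "member"),
  ride_len_min  = c(20, 30, 1500, 12, 25, 10, 14)
)

duration_summary <- toy %>%
  group_by(day_of_week, member_casual) %>%
  summarise(
    mean_len   = mean(ride_len_min),
    median_len = median(ride_len_min),
    .groups    = "drop"   # return an ungrouped tibble
  )

print(duration_summary)
```

In this toy example the Saturday casual mean is far above the median because of the single 1500-minute ride, which is the kind of gap that would signal skew in the real data.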
To see whether casual riders use the bikes differently across the week, let’s compute the number of trips per day of the week for each type of rider
ggplot(data = trip_data_2022, mapping = aes(x = day_of_week, fill = day_of_week)) +
  geom_bar() +
  facet_wrap(~member_casual) +
  labs(title = 'Number of rides per day of the week', x = "day of the week", y = 'Number of rides') +
  theme(legend.position = "none")
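The weekday versus weekend pattern in a plot like this can also be condensed into one number per rider type: the share of rides that start on a Saturday or Sunday. The sketch below is a minimal illustration on six made-up rides; `is_weekend` and `weekend_share` are names introduced for this example. It uses the numeric form of `wday()` with `week_start = 1` (Monday = 1) so that it does not depend on the locale of the day labels.

```r
library(dplyr)
library(lubridate)

# Made-up rides: 2022-01-10 is a Monday, 2022-01-15 a Saturday
toy <- tibble(
  started_at = as.POSIXct(c("2022-01-13 09:00:00", "2022-01-15 14:00:00",
                            "2022-01-16 11:00:00", "2022-01-10 08:00:00",
                            "2022-01-11 08:30:00", "2022-01-15 10:00:00"),
                          tz = "UTC"),
  member_casual = c("casual", "casual", "casual", "member", "member", "member")
)

weekend_share <- toy %>%
  # with week_start = 1, Saturday is 6 and Sunday is 7
  mutate(is_weekend = wday(started_at, week_start = 1) >= 6) %>%
  group_by(member_casual) %>%
  summarise(weekend_share = mean(is_weekend))

print(weekend_share)
```

If rides were spread evenly over the week, the share would be about 2/7, so values well above or below that indicate a weekend or weekday preference.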
It appears that members use the bikes mostly on working days and casual riders mostly on weekends. One possible explanation (this is only a hypothesis) is that members mostly use the bikes for commuting to work or to a bus or train station, while casual riders use them mostly for leisure or sport. We would need more data to verify that.
Let’s now see whether the two groups have different preferences with regard to the type of bike. We will plot rideable_type by member_casual
ggplot(data = trip_data_2022, mapping = aes(x = rideable_type, fill = member_casual)) +
  geom_bar() +
  facet_wrap(~member_casual) +
  labs(title = 'Number of trips per type of bike', x = "type of bike", y = 'Number of trips', fill = "type of member")
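A bar chart can hide very small counts, so a cross-tabulation is a useful complement when checking whether one group uses a bike type at all. The sketch below builds such a table on made-up rides; the six rows are invented for illustration and say nothing about the real counts.

```r
library(dplyr)
library(tidyr)

# Made-up rides covering the three rideable types
toy <- tibble(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  rideable_type = c("classic_bike", "docked_bike", "electric_bike",
                    "classic_bike", "electric_bike", "electric_bike")
)

bike_counts <- toy %>%
  count(member_casual, rideable_type) %>%
  # one column per rider type; combinations that never occur become 0
  pivot_wider(names_from = member_casual, values_from = n, values_fill = 0)

print(bike_counts)
```

An explicit 0 in the table makes an absent combination unambiguous, which a missing bar in a faceted plot does not.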
Riders who have a membership do not use docked bikes. The hypothesis stated above may explain this, but it needs further analysis.
Let’s get some insight into rides per month
ggplot(data = trip_data_2022, mapping = aes(x = month_ride, fill = month_ride)) +
  geom_bar() +
  facet_wrap(~member_casual) +
  labs(title = 'Number of trips per month', x = 'Month', y = 'Number of trips') +
  theme(legend.position = "none")
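Because the two groups differ in total number of rides, comparing raw monthly counts mixes volume with seasonality. One way to compare the shapes alone is to compute, for each rider type, the share of its rides that falls in each month. The sketch below does this on made-up data; the values and the `share` column are invented for illustration.

```r
library(dplyr)

# Made-up rides in two months
toy <- tibble(
  member_casual = c(rep("casual", 4), rep("member", 4)),
  month_ride    = c("Jan", "Jul", "Jul", "Jul", "Jan", "Jan", "Jul", "Jul")
)

monthly_share <- toy %>%
  count(member_casual, month_ride) %>%
  group_by(member_casual) %>%
  mutate(share = n / sum(n)) %>%   # shares sum to 1 within each rider type
  ungroup()

print(monthly_share)
```

Plotting `share` instead of the raw count would put both rider types on the same scale.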
The two types of riders seem to have much the same distribution across months.
The main insights we get from this analysis are:
- Casual riders take longer trips than members (more than twice as long on average: 30.18 vs 13.18 minutes);
- Members use Cyclistic services mostly on working days, whereas casual riders use them mostly on weekends;
- Members made about 38% more rides than casual riders over the period covered (2,797,831 vs 2,026,191).
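The headline figures can be re-derived directly from the summary table shown earlier (2026191 casual and 2797831 member trips, with average lengths of 30.17630 and 13.18012 minutes). The short check below only reuses those four numbers; the variable names are introduced for this illustration.

```r
# Figures copied from the summary table
trips   <- c(casual = 2026191, member = 2797831)
avg_min <- c(casual = 30.17630, member = 13.18012)

# How much longer is the average casual ride?
duration_ratio <- unname(avg_min["casual"] / avg_min["member"])
round(duration_ratio, 2)      # 2.29

# How many more rides do members make, relative to casual riders?
extra_member_rides <- unname(trips["member"] / trips["casual"] - 1)
round(extra_member_rides, 2)  # 0.38

# Share of all rides made by each group
round(trips / sum(trips), 2)  # casual 0.42, member 0.58
```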
Next, we will explore how to build an ML model to predict the number of riders per month!