Cleaning and data transformation

Loading the necessary libraries

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.1

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.2

## Warning: package 'readr' was built under R version 4.2.1

## Warning: package 'forcats' was built under R version 4.2.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(readr)
library(skimr)

## Warning: package 'skimr' was built under R version 4.2.2

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.2.1

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)
library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.2.1

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Import all the data sets. The data are stored in one file per month.

jan_data <- read_csv("Case_Study/data/202201_tripdata.csv")

## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fev_data <- read_csv("Case_Study/data/202202_tripdata.csv")

## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mar_data <- read_csv("Case_Study/data/202203_tripdata.csv")

## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

apr_data <- read_csv("Case_Study/data/202204_tripdata.csv")

## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

may_data <- read_csv("Case_Study/data/202205_tripdata.csv")

## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jun_data <- read_csv("Case_Study/data/202206_tripdata.csv")

## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jul_data <- read_csv("Case_Study/data/202207_tripdata.csv")

## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

aou_data <- read_csv("Case_Study/data/202208_tripdata.csv")

## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

oct_data <- read_csv("Case_Study/data/202209_tripdata.csv")

## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sep_data <- read_csv("Case_Study/data/202210_tripdata.csv")

## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nov_data <- read_csv("Case_Study/data/202211_tripdata.csv")

## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

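Combine all the data into a single data set. Note that oct_data holds the September file (202209) and sep_data holds the October file (202210), and that sep_data is not passed to bind_rows() below, so the combined data set and every result derived from it exclude October.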

trip_data_2022 <- bind_rows(jan_data, fev_data, mar_data, apr_data, may_data, jun_data, jul_data, aou_data, oct_data, nov_data)
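Loading each month by hand makes it easy to mislabel or leave out a file. As a minimal sketch of an alternative, assuming readr 2.0 or later (where `read_csv()` accepts a vector of paths and an `id` argument), all monthly files can be listed and read in one call. The demo folder, the tiny files, and the names `demo_dir`, `files`, `all_trips`, and `source_file` are illustrative assumptions, not part of the original notebook.

```r
library(readr)

# Hypothetical demo folder with two tiny monthly files (stand-ins for the real CSVs)
demo_dir <- file.path(tempdir(), "tripdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(ride_id = c("a", "b"), member_casual = c("member", "casual")),
          file.path(demo_dir, "202201_tripdata.csv"))
write_csv(data.frame(ride_id = "c", member_casual = "member"),
          file.path(demo_dir, "202202_tripdata.csv"))

# List every monthly file, then read and row-bind them in a single call
files <- list.files(demo_dir, pattern = "_tripdata\\.csv$", full.names = TRUE)
all_trips <- read_csv(files, id = "source_file", show_col_types = FALSE)

# One row count per source file makes a missing month easy to spot
table(basename(all_trips$source_file))
```

With the real data, `demo_dir` would presumably be replaced by "Case_Study/data"; the per-file counts can then be compared with the row counts printed during import.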

Get a preview of the data

head(trip_data_2022)

A quick summary with glimpse

glimpse(trip_data_2022)

## Rows: 4,927,226
## Columns: 13
## $ ride_id            <chr> "C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF7…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "classic_bike", "…
## $ started_at         <dttm> 2022-01-13 11:59:47, 2022-01-10 08:41:56, 2022-01-…
## $ ended_at           <dttm> 2022-01-13 12:02:44, 2022-01-10 08:46:17, 2022-01-…
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A…
## $ start_station_id   <chr> "525", "525", "TA1306000016", "KA1504000151", "TA13…
## $ end_station_name   <chr> "Clark St & Touhy Ave", "Clark St & Touhy Ave", "Gr…
## $ end_station_id     <chr> "RP-007", "RP-007", "TA1307000001", "TA1309000021",…
## $ start_lat          <dbl> 42.01280, 42.01276, 41.92560, 41.98359, 41.87785, 4…
## $ start_lng          <dbl> -87.66591, -87.66597, -87.65371, -87.66915, -87.624…
## $ end_lat            <dbl> 42.01256, 42.01256, 41.92533, 41.96151, 41.88462, 4…
## $ end_lng            <dbl> -87.67437, -87.67437, -87.66580, -87.67139, -87.627…
## $ member_casual      <chr> "casual", "casual", "member", "casual", "member", "…

Compute the trip duration in seconds, the day of the week, and the month. Computing a year variable is unnecessary here because we only have data for one year.

trip_data_2022 <- trip_data_2022 %>%
  mutate(ride_len_sec = as.numeric(difftime(ended_at, started_at, units = "secs")),  # duration in seconds, units made explicit
         day_of_week = wday(started_at, label = TRUE),
         month_ride = month(started_at, label = TRUE))
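To make the derived columns concrete, here is a small sketch on a single ride, using the start and end times of the first row shown in the glimpse output. It assumes lubridate is installed; the variable names `started`, `ended`, and `ride_len_sec` are only for illustration. Passing `units = "secs"` to `difftime()` avoids relying on the unit that a plain subtraction chooses automatically.

```r
library(lubridate)

# First ride from the glimpse output above
started <- ymd_hms("2022-01-13 11:59:47")
ended   <- ymd_hms("2022-01-13 12:02:44")

# Duration in seconds, with the unit stated explicitly
ride_len_sec <- as.numeric(difftime(ended, started, units = "secs"))
ride_len_sec  # 177

# Day of week and month as labelled factors (label text depends on the locale)
wday(started, label = TRUE)
month(started, label = TRUE)
```

The numeric forms `wday(started)` and `month(started)` do not depend on the locale, which can be convenient when the labels would otherwise appear in another language.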

We learn from the company website that rides shorter than 60 seconds are “potentially false starts or users trying to re-dock a bike to ensure it was secure”. So let’s remove rides shorter than 60 seconds from our data frame.

trip_data_2022 <- trip_data_2022 %>%
  filter(ride_len_sec>=60)

Let us also take a look at trips that lasted more than 24 hours, that is, 86,400 seconds.

temp <- trip_data_2022 %>%
  filter(ride_len_sec>86400)
head(temp)
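Before deciding what to do with very short or very long rides, it can help to count how many fall into each group. The following is a sketch on made-up durations, assuming dplyr is installed; the `rides` tibble, the `flag` column, and its labels are hypothetical.

```r
library(dplyr)

# Made-up ride durations in seconds
rides <- tibble(
  ride_id      = c("r1", "r2", "r3", "r4"),
  ride_len_sec = c(12, 60, 1800, 90000)
)

# Label each ride using the two thresholds discussed above
flag_counts <- rides %>%
  mutate(flag = case_when(
    ride_len_sec < 60    ~ "under_60s",
    ride_len_sec > 86400 ~ "over_24h",
    TRUE                 ~ "ok"
  )) %>%
  count(flag)

flag_counts
```

A ride of exactly 60 seconds is kept, which matches the `>= 60` filter used above.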

Honestly, it is quite weird to have this kind of trip, but we don’t have more information to decide what to do with them, so let’s proceed. Compute the ride length in minutes.

trip_data_2022 <- trip_data_2022 %>%
  mutate(ride_len_min=as.numeric(ride_len_sec/60))

Some analysis to compare member vs. casual riders

First, let’s compute the number of trips and the average ride length per rider type.

trip_data_2022 %>%
  group_by(member=member_casual) %>%
  summarise(number_of_trips=n(), average_ride_len=mean(ride_len_min)) %>%  # n() counts trips, not subscribers
  kbl() %>% kable_styling()

member	number_of_trips	average_ride_len
casual	2026191	30.17630
member	2797831	13.18012

We see from the table above that casual riders make longer trips than members. Let’s try to understand this further by looking at the average trip duration per day of the week for each rider type.

trip_data_2022 %>%
  group_by(day_of_week,member_casual) %>%
  summarise(Average_ride_len=mean(ride_len_min)) %>% 
  ggplot(mapping = aes(x =day_of_week, y=Average_ride_len, fill=day_of_week))+
  geom_col()+
  facet_wrap(~member_casual)+
  labs(title="Average ride duration per type of subscribers", x="day of the week", y="ride duration (min)")+
  theme(legend.position = "none")

## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
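The message above is informational: after `summarise()`, the result is still grouped by the first grouping variable. A minimal sketch, assuming dplyr 1.0 or later, shows how `.groups = "drop"` returns an ungrouped result and silences the message. The `rides` tibble and its values are made up for illustration.

```r
library(dplyr)

# Made-up rides: two days, two rider types
rides <- tibble(
  day_of_week   = c("Mon", "Mon", "Sat", "Sat"),
  member_casual = c("member", "casual", "member", "casual"),
  ride_len_min  = c(10, 20, 14, 34)
)

# .groups = "drop" removes all grouping from the summarised result
avg_by_day <- rides %>%
  group_by(day_of_week, member_casual) %>%
  summarise(Average_ride_len = mean(ride_len_min), .groups = "drop")

avg_by_day
```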

To see whether casual riders use the bikes differently over the week, let’s compute the number of trips per day of the week for each rider type.

ggplot(data=trip_data_2022, mapping = aes(x =day_of_week, fill=day_of_week))+
  geom_bar()+
  facet_wrap(~member_casual)+
  labs(title='Number of rides per day of the week', x="day of the week", y='Number of rides')+
  theme(legend.position = "none")
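The plot can be complemented by a single number per rider type: the share of rides that start on a weekend. This is a sketch on made-up timestamps, assuming dplyr and lubridate are installed; `rides`, `is_weekend`, and `weekend_share` are hypothetical names. It relies on lubridate's default numbering, where Sunday is 1 and Saturday is 7.

```r
library(dplyr)
library(lubridate)

# Made-up rides in June 2022 (the 4th and 11th are Saturdays, the 5th and 12th Sundays)
rides <- tibble(
  started_at = ymd_hms(c(
    "2022-06-04 10:00:00", "2022-06-05 11:00:00",
    "2022-06-09 18:00:00", "2022-06-11 15:00:00",
    "2022-06-06 08:00:00", "2022-06-07 08:00:00",
    "2022-06-08 17:30:00", "2022-06-12 09:00:00"
  )),
  member_casual = rep(c("casual", "member"), each = 4)
)

# Share of rides starting on Saturday or Sunday, per rider type
weekend_share <- rides %>%
  mutate(is_weekend = wday(started_at) %in% c(1, 7)) %>%
  group_by(member_casual) %>%
  summarise(weekend_share = mean(is_weekend), .groups = "drop")

weekend_share
```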

It appears that members use the bikes mostly on working days and casual riders on weekends. One hypothesis (and it is only a hypothesis) is that members mostly use the bikes to commute to work or to a bus or train station, while casual riders use them mostly for leisure or sport. We would need more data to verify that.

Let’s see whether the two groups have preferences regarding the type of bike. We will plot rideable_type by member_casual.

ggplot(data=trip_data_2022, mapping = aes(x =rideable_type, fill=member_casual))+
  geom_bar()+
  facet_wrap(~member_casual)+
  labs(title='Number of trips per type of bike', x="type of bike",  y='Number of trips', fill="type of member")
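A cross-tabulation gives the same information as the bar chart in numeric form, and unlike a grouped count it shows empty combinations as explicit zeros. The sketch below uses base R on made-up rows; the `rides` data frame and `bike_by_rider` are hypothetical names, and the counts are not taken from the real data.

```r
# Made-up rides, one row per trip
rides <- data.frame(
  member_casual = c("member", "member", "casual", "casual", "casual"),
  rideable_type = c("classic_bike", "electric_bike",
                    "classic_bike", "docked_bike", "electric_bike")
)

# Rows are rider types, columns are bike types; absent pairs appear as 0
bike_by_rider <- table(rides$member_casual, rides$rideable_type)
bike_by_rider
```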

Riders who have a membership do not use docked bikes. The hypothesis stated above may explain this, but it needs further analysis. Let’s get some insight into rides per month.

ggplot(data=trip_data_2022, mapping = aes(x =month_ride, fill=month_ride))+
  geom_bar()+
  facet_wrap(~member_casual)+
  labs(title='Number of trip per month', x='Month',  y='Number of trips')+
  theme(legend.position = "none")

The two types of riders seem to have much the same distribution over the months. The main insights we get from this analysis are:

Casual riders take longer trips than members, more than twice as long on average (30.18 vs. 13.18 minutes);
Members use Cyclistic services mostly on working days, whereas casual riders use them mostly on weekends;
Members made about 38% more rides than casual riders over the period (2,797,831 vs. 2,026,191).
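The two headline figures in this list can be recomputed directly from the summary table shown earlier. This is a small arithmetic check in base R; the vector names `trips` and `avg_min` are only for illustration.

```r
# Values copied from the summary table above
trips   <- c(casual = 2026191, member = 2797831)
avg_min <- c(casual = 30.17630, member = 13.18012)

# How many percent more trips members made than casual riders
pct_more <- unname(100 * (trips["member"] / trips["casual"] - 1))
round(pct_more)  # 38

# How many times longer the average casual ride is
ratio <- unname(avg_min["casual"] / avg_min["member"])
round(ratio, 2)  # 2.29
```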

Next, we will explore how to build an ML model to predict the number of riders per month.
