This is a documentation for my first case study “CYCLISTIC BAKE-SHARE”.
What is Cyclistic?
Cyclistic is a bike share company in Chicago, it’s a program that feature more than 5800 bicycles and 600 docking stations.
What do we want to know from this case study?
Analyze the differences in usage between annual members and casual riders to identify opportunities to convert casual riders to annual members.
library(knitr)
library(kableExtra)
cyclistic_data <- data.frame(
Category = c(
"Data source",
"Organization",
"Bias/credibility (ROCCC)",
"Licensing, privacy, security and accessibility",
"Data integrity verification",
"Limitation",
"Tools chosen"
),
Description = c(
"Coursera – Cyclistic Case Study",
"Tabular format with ride ID, start/end times, start/end station names and IDs, user types, bike type, and the longitude and latitude of each station",
"Reliable, original, comprehensive, current, cited but some missing and inconsistent data exist",
"Data publicly released and accessible for analysis",
"Checked for unrealistic trip durations, null values, duplicates and standardized the characters type",
"Lack of demographic information",
"Google Sheets,SQL server management studio, Tableau, and R"
),
stringsAsFactors = FALSE
)
# Create a formatted table with adjusted column widths
kable(cyclistic_data, caption = "Cyclistic Data Metadata") %>%
kable_styling(full_width = TRUE) %>%
column_spec(1, width = "5cm") %>%
column_spec(2, width = "12cm")
| Category | Description |
|---|---|
| Data source | Coursera – Cyclistic Case Study |
| Organization | Tabular format with ride ID, start/end times, start/end station names and IDs, user types, bike type, and the longitude and latitude of each station |
| Bias/credibility (ROCCC) | Reliable, original, comprehensive, current, cited but some missing and inconsistent data exist |
| Licensing, privacy, security and accessibility | Data publicly released and accessible for analysis |
| Data integrity verification | Checked for unrealistic trip durations, null values, duplicates and standardized the characters type |
| Limitation | Lack of demographic information |
| Tools chosen | Google Sheets,SQL server management studio, Tableau, and R |
In this section, i will document the entire cleaning process in Google sheets, SQL server management studio and RStudio.
Important note : Our dataset is too large to import it all in a single spreadsheet so im going to explain how the cleaning process is executed in google sheet for documentation sake then i will translate my work to SQL so i can work with the data i was provided and continue my analysis.
In order to help us answer the business task, we need to add two new columns to our dataset :
I - ride_length : Duration of the Trip
II - day_of_week : Day of the Trip
We will use WEEKDAY() in a new column to get the day of the week as a number from 1 (Sunday) to 7 (Saturday).
Let’s continue the cleaning for the rest of the data we have :
When looking at the data, we can notice empty cells in the station names and ids, let’s correct them :
Finally, let’s remove the rows that contain the same ride_id because its supposed to be unique for each ride, it is easily done by going to Data → Data cleanup → Remove duplicates.
This is the end of the section about google sheet, we will now work on SQL server management studio.
First, we need to import the datasets to the SQL server management studio (SSMS). We have 12 CSV files.
I will start by creating a temporary table, all_trips_tmp, where i would import all the data using CREATE TABLE:
Why a temporary table?
When i tried to import the second file of the data, i got an error related to the started_at column so i made a temporary table that will hold all our data with NVARCHAR type.
-- Create temporary table all_trips_tmp
CREATE TABLE all_trips_tmp (
ride_id NVARCHAR(100),
rideable_type NVARCHAR(100),
started_at NVARCHAR(50),
ended_at NVARCHAR(50),
start_station_name NVARCHAR(255),
start_station_id NVARCHAR(50),
end_station_name NVARCHAR(255),
end_station_id NVARCHAR(50),
start_lat NVARCHAR(50),
start_lng NVARCHAR(50),
end_lat NVARCHAR(50),
end_lng NVARCHAR(50),
member_casual NVARCHAR(50)
);
Now i will import the data using BULK INSERT ( this query will help import the files one by one so i can make sure it was imported correctly)
-- Import data
BULK INSERT all_trips_tmp
FROM 'File_path\202101-divvy-tripdata.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
TABLOCK,
CODEPAGE = '65001' -- handles UTF-8 if needed
);
BULK INSERT all_trips_tmp
FROM 'File_path\202102-divvy-tripdata.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0a',
TABLOCK,
CODEPAGE = '65001'
); -- ...We will execute the same query for the 12 files we have.
Now let’s create the table, all_trips That we will work on for cleaning with the proper data type :
-- Create all_trips table with the proper data type
CREATE TABLE all_trips (
ride_id NVARCHAR(100),
rideable_type NVARCHAR(100),
started_at DATETIME,
ended_at DATETIME,
start_station_name NVARCHAR(255),
start_station_id NVARCHAR(50),
end_station_name NVARCHAR(255),
end_station_id NVARCHAR(50),
start_lat FLOAT,
start_lng FLOAT,
end_lat FLOAT,
end_lng FLOAT,
member_casual NVARCHAR(50)
);
Let’s insert the data into all_trips by using INSERT INTO :
-- Insert the data into our table
INSERT INTO all_trips
SELECT
ride_id,
rideable_type,
TRY_CONVERT(DATETIME, started_at) AS started_at,
TRY_CONVERT(DATETIME, ended_at) AS ended_at,
start_station_name,
start_station_id,
end_station_name,
end_station_id,
TRY_CONVERT(FLOAT, start_lat) AS start_lat,
TRY_CONVERT(FLOAT, start_lng) AS start_lng,
TRY_CONVERT(FLOAT, end_lat) AS end_lat,
TRY_CONVERT(FLOAT, end_lng) AS end_lng,
member_casual
FROM all_trips_tmp;
We will check the data by comparing the count of the rows between all_trips and all_trips_tmp, then we will check a sample from our table :
-- Compare how many rows we have between all_trips and all_trips_tmp SELECT COUNT(*) AS total_trips FROM all_trips; SELECT COUNT(*) AS total_temp_rows FROM all_trips_tmp; -- Check a sample of the data SELECT TOP 10 * FROM all_trips;
Now, it’s time to start the cleaning process, we can notice that there are a few null values on our table, we will use DELETE FROM to get rid of them, we will also standardize the station names so they will be easy to clean by using UPPER(LTRIM(RTRIM())) and we will remove the duplicates rides :
-- Delete null values from started_at and ended_at
DELETE FROM all_trips
WHERE started_at IS NULL OR ended_at IS NULL;
-- Delete where ride_id is null
DELETE FROM all_trips
WHERE LTRIM(RTRIM(ride_id)) = '';
-- Standardize names : remove extra spaces and capitalize
UPDATE all_trips
SET start_station_name = UPPER(LTRIM(RTRIM(start_station_name))),
end_station_name = UPPER(LTRIM(RTRIM(end_station_name)));
-- Remove duplicates
WITH CTE AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ride_id ORDER BY ride_id) AS rn
FROM all_trips
)
DELETE FROM CTE WHERE rn > 1;
-- Delete empty station names
DELETE FROM all_trips
WHERE start_station_name = '' OR end_station_name = '';
Like we did in the spreadsheet section, we are going to add two new columns that will help us answer the business task, ride_length and day_of_week by using ALTER TABLE /ADD to create our new columns and UPDATE /SET to set the values :
-- Add a new column for ride_length in seconds ALTER TABLE all_trips ADD ride_length_sec INT; -- Calculate ride_length as the difference between started_at and ended_at UPDATE all_trips SET ride_length_sec = DATEDIFF(SECOND, started_at, ended_at);
Like previously, when checking the new column, we will find some negative values and unrealistic durations, so let’s clean our new column :
-- Convert negative durations to positive UPDATE all_trips SET ride_length_sec = ABS(ride_length_sec); -- Remove unrealistic durations DELETE FROM all_trips WHERE ride_length_sec <= 60 -- delete rides shorter than 1 minute OR ride_length_sec > 86400; -- delete rides longer than 1 day
Let’s add our second new column day_of_week and also day_name :
-- Add a new column for day_of_week ALTER TABLE all_trips ADD day_of_week INT; -- Extract the day from the column started_at UPDATE all_trips SET day_of_week = DATEPART(WEEKDAY, started_at); -- Add a new column for day_name ALTER TABLE all_trips ADD day_name NVARCHAR(10); -- Extract the day name from the column started_at UPDATE all_trips SET day_name = DATENAME(WEEKDAY, started_at);
Let’s do more verification of the data :
-- Check for start station names inconsistencies SELECT start_station_name, COUNT(*) AS trips FROM all_trips GROUP BY start_station_name ORDER BY start_station_name ; -- There is a station name 351 that contain one trip ride that isn't consistent with the rest of the data so i will delete it -- Delete station name 351 DELETE FROM all_trips WHERE start_station_name = '351'; -- Check for end station names inconsistencies SELECT end_station_name, COUNT(*) AS trips FROM all_trips GROUP BY end_station_name ORDER BY end_station_name ; -- Final check of the data for null values SELECT COUNT(*) AS total_rows, COUNT(CASE WHEN ride_id IS NULL OR LTRIM(RTRIM(ride_id)) = '' THEN 1 END) AS empty_ride_id, COUNT(CASE WHEN start_station_name IS NULL OR LTRIM(RTRIM(start_station_name)) = '' THEN 1 END) AS empty_start_station_name, COUNT(CASE WHEN end_station_name IS NULL OR LTRIM(RTRIM(end_station_name)) = '' THEN 1 END) AS empty_end_station_name, COUNT(CASE WHEN rideable_type IS NULL OR LTRIM(RTRIM(rideable_type)) = '' THEN 1 END) AS empty_rideable_type, COUNT(CASE WHEN member_casual IS NULL OR LTRIM(RTRIM(member_casual)) = '' THEN 1 END) AS empty_member_casual FROM all_trips; -- See a preview of the data SELECT TOP 10 * FROM all_trips;
After a final review, the dataset is now clean and ready for analysis.
I used the cleaned data from the table all_trips in Tableau for visualization ( you can import the data from sql directly to tableau without the need to export it).
These are the dashboards i made :
_ Bike
type overview : members vs casuals : See the riders
activity based on the type of bikes.
_ Seasonal
approach : See the riders activity by seasons.
_ Weekends
vs weekdays : See the difference in activity between the
weekends and weekdays by the rider type.
_ Riders
activity : See the stations where we have the most traffic
between casual riders and members.
Note : The final analysis will be documented in the R section.
This section will cover the work on R, i will use the same 12 csv files i worked with in SQL.
Before importing our data and start to clean, let’s install and call the package we need for these tasks : “tidyverse” will help us wrangle the data :
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
After calling the tidyverse package we have some conflicts, so we use the conflicted package to manage them and we will set dplyr::filter and dplyr::lag as the default choices :
library(conflicted)
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
Let’s begin, the first step is to import our data, readr library will help us import our csv files :
library(readr)
X202101_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202101-divvy-tripdata.csv")
## Rows: 96834 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202102_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202102-divvy-tripdata.csv")
## Rows: 49623 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202103_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202103-divvy-tripdata.csv")
## Rows: 228496 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202104_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202104-divvy-tripdata.csv")
## Rows: 337230 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202105_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202105-divvy-tripdata.csv")
## Rows: 531633 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202106_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202106-divvy-tripdata.csv")
## Rows: 729595 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202107_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202107-divvy-tripdata.csv")
## Rows: 822410 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202108_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202108-divvy-tripdata.csv")
## Rows: 804352 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202109_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202110_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202111_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202112_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now we need to combine them into a single table we can work with, but before that let’s make sure the data contained into the columns have the same type :
str(X202101_divvy_tripdata)
## spc_tbl_ [96,834 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:96834] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:96834] NA NA NA NA ...
## $ end_station_id : chr [1:96834] NA NA NA NA ...
## $ start_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:96834] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(X202102_divvy_tripdata)
## spc_tbl_ [49,623 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:49623] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
## $ rideable_type : chr [1:49623] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr [1:49623] "2021-02-12 16:14:56" "2021-02-14 17:52:38" "2021-02-09 19:10:18" "2021-02-02 17:49:41" ...
## $ ended_at : chr [1:49623] "2021-02-12 16:21:43" "2021-02-14 18:12:09" "2021-02-09 19:19:10" "2021-02-02 17:54:06" ...
## $ start_station_name: chr [1:49623] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr [1:49623] "525" "525" "KA1503000012" "637" ...
## $ end_station_name : chr [1:49623] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
## $ end_station_id : chr [1:49623] "660" "16806" "TA1305000029" "TA1305000034" ...
## $ start_lat : chr [1:49623] "42.012701" "42.012701" "41.88579466666667" "41.895634" ...
## $ start_lng : chr [1:49623] "-87.666058" "-87.666058" "-87.63110066666667" "-87.672069" ...
## $ end_lat : chr [1:49623] "42.004583" "42.019537" "41.884866" "41.903119" ...
## $ end_lng : chr [1:49623] "-87.661406" "-87.669563" "-87.62749766666667" "-87.673935" ...
## $ member_casual : chr [1:49623] "member" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_character(),
## .. ended_at = col_character(),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_character(),
## .. start_lng = col_character(),
## .. end_lat = col_character(),
## .. end_lng = col_character(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
library(knitr)
library(kableExtra)
# Create a data frame with column names and types for two files
column_info <- data.frame(
Column = c("ride_id", "rideable_type", "started_at", "ended_at",
"start_station_name", "start_station_id", "end_station_name",
"end_station_id", "start_lat", "start_lng", "end_lat",
"end_lng", "member_casual"),
File1_Type = c("col_character", "col_character", "col_datetime", "col_datetime",
"col_character", "col_character", "col_character", "col_character",
"col_double", "col_double", "col_double", "col_double", "col_character"),
File2_Type = c("col_character", "col_character", "col_character", "col_character",
"col_character", "col_character", "col_character", "col_character",
"col_character", "col_character", "col_character", "col_character", "col_character"),
stringsAsFactors = FALSE
)
# Create a formatted table with adjusted column widths
kable(column_info, caption = "Data Types") %>%
kable_styling(full_width = TRUE) %>%
column_spec(1, width = "5cm") %>%
column_spec(2, width = "5cm") %>%
column_spec(3, width = "5cm")
| Column | File1_Type | File2_Type |
|---|---|---|
| ride_id | col_character | col_character |
| rideable_type | col_character | col_character |
| started_at | col_datetime | col_character |
| ended_at | col_datetime | col_character |
| start_station_name | col_character | col_character |
| start_station_id | col_character | col_character |
| end_station_name | col_character | col_character |
| end_station_id | col_character | col_character |
| start_lat | col_double | col_character |
| start_lng | col_double | col_character |
| end_lat | col_double | col_character |
| end_lng | col_double | col_character |
| member_casual | col_character | col_character |
From the two files, we can notice the difference in the data types so lets correct it before joining all our data, we need to convert these columns to the correct type, this is what i am going to do :
library(dplyr)
dataframe_list <- list(X202101_divvy_tripdata, X202102_divvy_tripdata, X202103_divvy_tripdata, X202104_divvy_tripdata, X202105_divvy_tripdata, X202106_divvy_tripdata, X202107_divvy_tripdata, X202108_divvy_tripdata, X202109_divvy_tripdata, X202110_divvy_tripdata, X202111_divvy_tripdata, X202112_divvy_tripdata)
# Function to standardize column types
clean_types <- function(df) {
df %>%
mutate(
ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type),
started_at = as.POSIXct(started_at),
ended_at = as.POSIXct(ended_at),
start_station_name = as.character(start_station_name),
start_station_id = as.character(start_station_id),
end_station_name = as.character(end_station_name),
end_station_id = as.character(end_station_id),
start_lat = as.numeric(start_lat),
start_lng = as.numeric(start_lng),
end_lat = as.numeric(end_lat),
end_lng = as.numeric(end_lng),
member_casual = as.character(member_casual)
)
}
# Apply to each dataframe
dataframe_list <- lapply(dataframe_list, clean_types)
Using this method i got an error concerning the started_at data which means some of the started_at / ended_at values are characters in a format that as.POSIXct() doesn’t automatically recognize. We will use the library lubridate for inconsistent datetime parsing and we will also use a function parse_datetime_safe to detect a different date format for example we can have in one csv file “YYYY-MM-DD” and in another “MM/DD/YYYY” :
library(dplyr)
library(lubridate)
# A safe datetime parser
parse_datetime_safe <- function(x) {
suppressWarnings(ymd_hms(x, quiet = TRUE)) %>%
if_else(is.na(.), mdy_hms(x, quiet = TRUE), .)
}
# Function to clean column types
clean_types <- function(df) {
df %>%
mutate(
ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type),
started_at = parse_datetime_safe(as.character(started_at)),
ended_at = parse_datetime_safe(as.character(ended_at)),
start_station_name = as.character(start_station_name),
start_station_id = as.character(start_station_id),
end_station_name = as.character(end_station_name),
end_station_id = as.character(end_station_id),
start_lat = as.numeric(start_lat),
start_lng = as.numeric(start_lng),
end_lat = as.numeric(end_lat),
end_lng = as.numeric(end_lng),
member_casual = as.character(member_casual)
)
}
Let’s reapply the clean_types function :
dataframe_list <- lapply(dataframe_list, clean_types)
## Warning: There were 4 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `start_lat = as.numeric(start_lat)`.
## Caused by warning:
## ! NAs introduits lors de la conversion automatique
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.
We can now join all the csv files in a single table together that we will call all_trips :
library(dplyr)
all_trips <- bind_rows(dataframe_list)
Important note : using this method i got a warning which is due to NA and null values, we will clean them in the next step :
str(all_trips)
## tibble [5,595,064 × 13] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5595064] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:5595064] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5595064], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct[1:5595064], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:5595064] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:5595064] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:5595064] NA NA NA NA ...
## $ end_station_id : chr [1:5595064] NA NA NA NA ...
## $ start_lat : num [1:5595064] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5595064] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5595064] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5595064] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:5595064] "member" "member" "member" "member" ...
all_trips <- all_trips %>%
filter(!is.na(start_station_name),
!is.na(start_station_id),
!is.na(end_station_name),
!is.na(end_station_id),
!is.na(start_lat),
!is.na(start_lng),
!is.na(end_lat),
!is.na(end_lng))
head(all_trips)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 B9F73448DFBE0D45 classic_bike 2021-01-24 19:15:38 2021-01-24 19:22:51
## 2 457C7F4B5D3DA135 electric_bike 2021-01-23 12:57:38 2021-01-23 13:02:10
## 3 57C750326F9FDABE electric_bike 2021-01-09 15:28:04 2021-01-09 15:37:51
## 4 4D518C65E338D070 electric_bike 2021-01-09 15:28:57 2021-01-09 15:37:54
## 5 9D08A3AFF410474D classic_bike 2021-01-24 15:56:59 2021-01-24 16:07:08
## 6 49FCE1F8598F12C6 electric_bike 2021-01-22 15:15:28 2021-01-22 15:36:01
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
Like in google sheet and SQL, let’s create the two columns that will help us in solving the business task, we will start by creating the column ride_length in seconds :
all_trips <- all_trips %>%
mutate(
ride_length = as.numeric(difftime(ended_at, started_at, units = "secs"))
)
summary(all_trips$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -3354 417 732 1309 1327 3356649 44
We can notice there are negative values and unrealistic durations, let’s clean the ride_length column :
all_trips <- all_trips %>%
mutate(
ride_length = abs(ride_length)
)
all_trips <- all_trips %>%
filter(
!is.na(ride_length),
ride_length >= 60, # remove rides shorter than 60 seconds
ride_length <= 86400 # remove rides longer than 1 day
)
summary(all_trips$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60 427 742 1208 1338 86362
Now, we will create the week_of_day column :
library(lubridate)
library(dplyr)
all_trips <- all_trips %>%
mutate(
day_of_week = weekdays(started_at)
)
Note : This gave me day_of_week in french so i will change that because i want it in english ( it’s due to my system’s settings) :
library(dplyr)
library(lubridate)
# Create English day names manually
english_days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
all_trips <- all_trips %>%
mutate(
# wday returns 1=Sunday, 2=Monday, ...
day_of_week = english_days[wday(started_at)]
) %>%
# make it ordered
mutate(
day_of_week = factor(day_of_week,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
)
head(all_trips)
## # A tibble: 6 × 15
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 B9F73448DFBE0D45 classic_bike 2021-01-24 19:15:38 2021-01-24 19:22:51
## 2 457C7F4B5D3DA135 electric_bike 2021-01-23 12:57:38 2021-01-23 13:02:10
## 3 57C750326F9FDABE electric_bike 2021-01-09 15:28:04 2021-01-09 15:37:51
## 4 4D518C65E338D070 electric_bike 2021-01-09 15:28:57 2021-01-09 15:37:54
## 5 9D08A3AFF410474D classic_bike 2021-01-24 15:56:59 2021-01-24 16:07:08
## 6 49FCE1F8598F12C6 electric_bike 2021-01-22 15:15:28 2021-01-22 15:36:01
## # ℹ 11 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # ride_length <dbl>, day_of_week <fct>
library(dplyr)
library(stringr)
all_trips <- all_trips %>%
mutate(across(where(is.character),
~ str_trim(.) %>% # remove leading/trailing spaces
str_to_lower() # convert to lowercase
))
Let’s inspect the new table all_trips that we have created :
sum(is.na(all_trips))
## [1] 0
colnames(all_trips)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
nrow(all_trips)
## [1] 4527665
dim(all_trips)
## [1] 4527665 15
head(all_trips)
## # A tibble: 6 × 15
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 b9f73448dfbe0d45 classic_bike 2021-01-24 19:15:38 2021-01-24 19:22:51
## 2 457c7f4b5d3da135 electric_bike 2021-01-23 12:57:38 2021-01-23 13:02:10
## 3 57c750326f9fdabe electric_bike 2021-01-09 15:28:04 2021-01-09 15:37:51
## 4 4d518c65e338d070 electric_bike 2021-01-09 15:28:57 2021-01-09 15:37:54
## 5 9d08a3aff410474d classic_bike 2021-01-24 15:56:59 2021-01-24 16:07:08
## 6 49fce1f8598f12c6 electric_bike 2021-01-22 15:15:28 2021-01-22 15:36:01
## # ℹ 11 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # ride_length <dbl>, day_of_week <fct>
str(all_trips)
## tibble [4,527,665 × 15] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:4527665] "b9f73448dfbe0d45" "457c7f4b5d3da135" "57c750326f9fdabe" "4d518c65e338d070" ...
## $ rideable_type : chr [1:4527665] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:4527665], format: "2021-01-24 19:15:38" "2021-01-23 12:57:38" ...
## $ ended_at : POSIXct[1:4527665], format: "2021-01-24 19:22:51" "2021-01-23 13:02:10" ...
## $ start_station_name: chr [1:4527665] "california ave & cortez st" "california ave & cortez st" "california ave & cortez st" "california ave & cortez st" ...
## $ start_station_id : chr [1:4527665] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:4527665] "wood st & augusta blvd" "california ave & north ave" "wood st & augusta blvd" "wood st & augusta blvd" ...
## $ end_station_id : chr [1:4527665] "657" "13258" "657" "657" ...
## $ start_lat : num [1:4527665] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:4527665] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:4527665] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:4527665] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:4527665] "member" "member" "casual" "casual" ...
## $ ride_length : num [1:4527665] 433 272 587 537 609 ...
## $ day_of_week : Factor w/ 7 levels "Monday","Tuesday",..: 7 6 6 6 7 5 2 6 3 5 ...
summary(all_trips)
## ride_id rideable_type started_at
## Length:4527665 Length:4527665 Min. :2021-01-01 00:02:24
## Class :character Class :character 1st Qu.:2021-06-04 17:20:07
## Mode :character Mode :character Median :2021-07-28 17:36:29
## Mean :2021-07-25 05:36:58
## 3rd Qu.:2021-09-18 15:18:46
## Max. :2021-12-31 23:59:48
##
## ended_at start_station_name start_station_id
## Min. :2021-01-01 00:08:39 Length:4527665 Length:4527665
## 1st Qu.:2021-06-04 17:40:03 Class :character Class :character
## Median :2021-07-28 17:53:36 Mode :character Mode :character
## Mean :2021-07-25 05:57:07
## 3rd Qu.:2021-09-18 15:45:50
## Max. :2022-01-01 12:17:58
##
## end_station_name end_station_id start_lat start_lng
## Length:4527665 Length:4527665 Min. :41.65 Min. :-87.83
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.64
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.06 Max. :-87.53
##
## end_lat end_lng member_casual ride_length
## Min. :41.65 Min. :-87.83 Length:4527665 Min. : 60
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.: 427
## Median :41.90 Median :-87.64 Mode :character Median : 742
## Mean :41.90 Mean :-87.64 Mean : 1208
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 1338
## Max. :42.17 Max. :-87.52 Max. :86362
##
## day_of_week
## Monday :567614
## Tuesday :595203
## Wednesday:607888
## Thursday :590003
## Friday :647249
## Saturday :814546
## Sunday :705162
We will add more columns that will provide us with additional opportunities to aggregate the data which are date, month, day and year for each ride :
all_trips$date<-as.Date(all_trips$started_at)
all_trips$month<-format(as.Date(all_trips$date),"%m")
all_trips$day<-format(as.Date(all_trips$date),"%d")
all_trips$year<-format(as.Date(all_trips$date),"%y")
Now we’re all set, we have the necessary data to start our analysis, let’s save our table with saveRDS() so we dont rerun the same code chunks everytime we want to use the file :
saveRDS(all_trips, "all_trips_clean.rds")
We can call our table all_trips with readRDS()
all_trips <- readRDS("all_trips_clean.rds")
Let’s start making our visualization, we will begin by making a pie chart that shows us the preference of the bike type by member vs casuals
library(dplyr)
library(tidyr)
bike_pref <- all_trips %>%
group_by(member_casual, rideable_type) %>%
summarise(count = n(), .groups = "drop") %>%
complete(member_casual, rideable_type, fill = list(count = 0)) %>% # fill missing types with 0
group_by(member_casual) %>%
mutate(perc = count / sum(count) * 100,
label = ifelse(count > 0, paste0(round(perc, 1), "%"), "")) %>%
ungroup()
ggplot(bike_pref, aes(x = "", y = perc, fill = rideable_type)) +
geom_col(width = 1, color= "black") + coord_polar(theta = "y") + # makes it a pie chart
geom_text(aes(label = label),
position = position_stack(vjust = 0.5),
size = 4,fontface = "bold", color = "white") + # labels inside slices
facet_wrap(~ member_casual) + # separate pies for member vs casual
labs(title = "Bike Type Preference : Member vs Casual", fill = "Bike Type") +
theme_void() + # clean theme without axes
theme(
plot.title = element_text(hjust = 0.5, size = 20, face = "bold"), # centered title
strip.text = element_text(size = 14) # facet labels
) +
scale_fill_brewer(palette = "Set2")
Observation 1 : Classic bikes are popular for the two type of riders and casual riders use the docked bikes unlike the members.
We will see now the usage of bikes by the hour of the days :
library(dplyr)
library(lubridate)
# Extract hour from started_at
all_trips <- all_trips %>%
mutate(hour = hour(started_at))
# Aggregate trips by hour, rideable_type, and rider type
hourly_usage <- all_trips %>%
group_by(hour, rideable_type, member_casual) %>%
summarise(nbtrips = n(), .groups = "drop")
library(ggplot2)
ggplot(hourly_usage, aes(x = hour, y = nbtrips, color = rideable_type)) +
geom_line(size = 1) +
facet_wrap(~ member_casual) + # separate charts for member / casual
labs(title = "Bike Type Usage by Hour of Day",
x = "Hour of Day",
y = "Number of Trips",
color = "Bike Type") +
theme_minimal() +
scale_x_continuous(breaks = seq(0, 23, by = 2)) + # every 2 hours
theme(
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
plot.title = element_text(size = 18, face = "bold", hjust = 0.5) # centered
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Observation 2 : For both members and casuals their most active time is between 16h and 18h.
Let’s see a scatter plot that shows us the correlation between the number of trips and the average ride duration by bike type and rider type.
library(dplyr)
library(lubridate)
scatter_data <- all_trips %>%
group_by(member_casual, rideable_type, day_of_week) %>%
summarise(
nbtrips = n(),
avg_duration = mean(ride_length, na.rm = TRUE), # in seconds
.groups = "drop"
)
scatter_data <- scatter_data %>%
mutate(day_index = as.numeric(day_of_week)) # Monday=1, Tuesday=2, ...
ggplot(scatter_data, aes(
x = avg_duration,
y = nbtrips,
size = day_index, # numeric now
shape = rideable_type,
color = member_casual
)) +
geom_point(alpha = 0.7) +
labs(
title = "Bike Usage: Number of Trips vs Average Duration",
x = "Average Duration (seconds)",
y = "Number of Trips",
color = "Rider Type",
shape = "Bike Type",
size = "Day of Week"
) +
theme_minimal()+
theme(
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5) # centered
)
Observation 3 : The docked bikes are only used by casuals for long trips.
Observation 4 : Casual riders take longer trips than members and members take more trips than casuals
Let’s see the riders activity by the day of the week :
library(dplyr)
library(lubridate)
# Make sure day_of_week is ordered correctly (Monday → Sunday)
all_trips <- all_trips %>%
mutate(day_of_week = factor(day_of_week,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"),
ordered = TRUE))
# Aggregate number of trips per day of week and rider type
weekly_usage <- all_trips %>%
group_by(day_of_week, member_casual) %>%
summarise(nbtrips = n(), .groups = "drop")
library(ggplot2)
ggplot(weekly_usage, aes(x = day_of_week, y = nbtrips, color = member_casual, group = member_casual)) +
geom_line(size = 1) +
geom_point(size = 2) +
labs(
title = "Number of Trips by Day of Week",
x = "Day of Week",
y = "Number of Trips",
color = "Rider Type"
) +
theme_minimal()+
theme(
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5) # centered
)+
theme(plot.background = element_rect(fill = "white", color = NA))
ggsave("C:/Users/Seifallah raach/Desktop/ridesbydays.png", width = 8, height = 6, dpi = 300)
Observation 5 : Casual riders are more active on weekends.
library(dplyr)
library(lubridate)
library(ggplot2)
# Make sure day_of_week is ordered correctly (Monday → Sunday)
all_trips <- all_trips %>%
mutate(day_of_week = factor(day_of_week,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"),
ordered = TRUE))
# Aggregate average ride duration (convert seconds → minutes)
weekly_avg_duration <- all_trips %>%
group_by(day_of_week, member_casual) %>%
summarise(avg_duration = mean(ride_length, na.rm = TRUE) / 60, .groups = "drop")
# Plot
ggplot(weekly_avg_duration, aes(x = day_of_week, y = avg_duration,
color = member_casual, group = member_casual)) +
geom_line(size = 1) +
geom_point(size = 2) +
labs(
title = "Average Ride Duration by Day of Week",
x = "Day of Week",
y = "Average Ride Duration (minutes)",
color = "Rider Type"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # centered
plot.background = element_rect(fill = "white", color = NA)
)
ggsave("C:/Users/Seifallah raach/Desktop/averageridebyday.png", width = 8, height = 6, dpi = 300)
Observation 6 : Casual riders and members take longer trips on weekends
Let’s see which paths are preferred by the riders :
library(dplyr)
# Create a route column for convenience
all_trips <- all_trips %>%
mutate(route = paste(start_station_name, "→", end_station_name))
# Aggregate number of trips per route and rider type
popular_routes <- all_trips %>%
group_by(member_casual, route) %>%
summarise(nbtrips = n(), .groups = "drop") %>%
arrange(member_casual, desc(nbtrips))
top_routes <- popular_routes %>%
group_by(member_casual) %>%
slice_max(nbtrips, n = 10) %>%
ungroup()
top_routes
## # A tibble: 20 × 3
## member_casual route nbtrips
## <chr> <chr> <int>
## 1 casual streeter dr & grand ave → streeter dr & grand ave 10953
## 2 casual millennium park → millennium park 5764
## 3 casual michigan ave & oak st → michigan ave & oak st 5588
## 4 casual lake shore dr & monroe st → lake shore dr & monroe st 4471
## 5 casual streeter dr & grand ave → millennium park 3308
## 6 casual buckingham fountain → buckingham fountain 3302
## 7 casual theater on the lake → theater on the lake 2987
## 8 casual dusable lake shore dr & monroe st → dusable lake shore… 2939
## 9 casual millennium park → streeter dr & grand ave 2925
## 10 casual montrose harbor → montrose harbor 2906
## 11 member ellis ave & 60th st → ellis ave & 55th st 4082
## 12 member ellis ave & 55th st → ellis ave & 60th st 3652
## 13 member ellis ave & 60th st → university ave & 57th st 3109
## 14 member university ave & 57th st → ellis ave & 60th st 3010
## 15 member calumet ave & 33rd st → state st & 33rd st 1989
## 16 member state st & 33rd st → calumet ave & 33rd st 1954
## 17 member loomis st & lexington st → morgan st & polk st 1860
## 18 member morgan st & polk st → loomis st & lexington st 1653
## 19 member mlk jr dr & 29th st → state st & 33rd st 1422
## 20 member state st & 33rd st → mlk jr dr & 29th st 1392
library(dplyr)
library(formattable)
library(RColorBrewer)
all_trips <- all_trips %>%
mutate(route = paste(start_station_name, "→", end_station_name))
popular_routes <- all_trips %>%
group_by(member_casual, route) %>%
summarise(nbtrips = n(), .groups = "drop") %>%
arrange(member_casual, desc(nbtrips))
top_routes <- popular_routes %>%
group_by(member_casual) %>%
slice_max(nbtrips, n = 10) %>%
ungroup()
# Create a **consistent color mapping** for all routes
unique_routes <- unique(top_routes$route)
palette_colors <- brewer.pal(min(length(unique_routes), 12), "Set3")
route_colors <- setNames(rep(palette_colors, length.out = length(unique_routes)), unique_routes)
# Display table with colors applied **based on route**, consistent across members
formattable(top_routes, list(
route = formatter("span",
style = ~ style(display = "block",
`background-color` = route_colors[route],
color = "black"))
))
| member_casual | route | nbtrips |
|---|---|---|
| casual | streeter dr & grand ave → streeter dr & grand ave | 10953 |
| casual | millennium park → millennium park | 5764 |
| casual | michigan ave & oak st → michigan ave & oak st | 5588 |
| casual | lake shore dr & monroe st → lake shore dr & monroe st | 4471 |
| casual | streeter dr & grand ave → millennium park | 3308 |
| casual | buckingham fountain → buckingham fountain | 3302 |
| casual | theater on the lake → theater on the lake | 2987 |
| casual | dusable lake shore dr & monroe st → dusable lake shore dr & monroe st | 2939 |
| casual | millennium park → streeter dr & grand ave | 2925 |
| casual | montrose harbor → montrose harbor | 2906 |
| member | ellis ave & 60th st → ellis ave & 55th st | 4082 |
| member | ellis ave & 55th st → ellis ave & 60th st | 3652 |
| member | ellis ave & 60th st → university ave & 57th st | 3109 |
| member | university ave & 57th st → ellis ave & 60th st | 3010 |
| member | calumet ave & 33rd st → state st & 33rd st | 1989 |
| member | state st & 33rd st → calumet ave & 33rd st | 1954 |
| member | loomis st & lexington st → morgan st & polk st | 1860 |
| member | morgan st & polk st → loomis st & lexington st | 1653 |
| member | mlk jr dr & 29th st → state st & 33rd st | 1422 |
| member | state st & 33rd st → mlk jr dr & 29th st | 1392 |
library(dplyr)
library(formattable)
library(RColorBrewer)
# Create route column
all_trips <- all_trips %>%
mutate(route = paste(start_station_name, "→", end_station_name))
# Calculate average duration per route and member type in minutes
avg_duration_routes <- all_trips %>%
group_by(member_casual, route) %>%
summarise(avg_duration = mean(ride_length, na.rm = TRUE)/60, # convert seconds to minutes
.groups = "drop")
# Keep top 10 by average duration per member type
top_avg_routes <- avg_duration_routes %>%
group_by(member_casual) %>%
slice_max(avg_duration, n = 10) %>%
ungroup()
# Create consistent color palette for routes
unique_routes <- unique(top_avg_routes$route)
palette_colors <- brewer.pal(min(length(unique_routes), 12), "Set3")
route_colors <- setNames(rep(palette_colors, length.out = length(unique_routes)), unique_routes)
# Display colored table
formattable(top_avg_routes, list(
route = formatter("span",
style = ~ style(display = "block",
`background-color` = route_colors[route],
color = "black"))
))
| member_casual | route | avg_duration |
|---|---|---|
| casual | wood st & chicago ave → base - 2132 w hubbard warehouse | 1439.3667 |
| casual | museum of science and industry → calumet ave & 33rd st | 1425.7500 |
| casual | halsted st & dickens ave → avers ave & belmont ave | 1424.7000 |
| casual | kedzie ave & chicago ave → state st & van buren st | 1422.2500 |
| casual | south shore dr & 71st st → orleans st & elm st | 1418.4833 |
| casual | ashland ave & lake st → richmond st & diversey ave | 1415.6167 |
| casual | central ave & chicago ave → damen ave & madison st | 1410.5667 |
| casual | halsted st & 96th st → elizabeth st & 92nd st | 1396.7167 |
| casual | sheridan rd & greenleaf ave → dodge ave & church st | 1394.2167 |
| casual | halsted st & willow st → base - 2132 w hubbard warehouse | 1387.0167 |
| member | green st & randolph st → halsted st & archer ave | 1328.3167 |
| member | racine ave & 18th st → wacker dr & washington st | 1306.0500 |
| member | wabash ave & adams st → base - 2132 w hubbard warehouse | 1224.9500 |
| member | state st & pearson st → central park blvd & 5th ave | 1192.2000 |
| member | delano ct & roosevelt rd → base - 2132 w hubbard warehouse | 1137.3167 |
| member | state st & kinzie st → museum of science and industry | 1091.7167 |
| member | clinton st & madison st → base - 2132 w hubbard warehouse | 1053.9500 |
| member | clark st & chicago ave → ravenswood ave & lawrence ave | 987.9500 |
| member | morgan st & 31st st → rush st & superior st | 974.4500 |
| member | lakefront trail & wilson ave → claremont ave & hirsch st | 958.2667 |
Observation 7 : These are the top 3 routes more frequented by
the casual riders :
_ streeter dr & grand ave →
streeter dr & grand ave
_ millennium park →
millennium park
_ michigan ave & oak st →
michigan ave & oak st
Let’s have a seasonal approach to see how riders activity is influenced by the seasons, this first chart will show us the number of trips by seasons :
library(dplyr)
library(lubridate)
library(ggplot2)
all_trips <- all_trips %>%
mutate(
month = month(started_at),
season = case_when(
month %in% c(12, 1, 2) ~ "Winter",
month %in% c(3, 4, 5) ~ "Spring",
month %in% c(6, 7, 8) ~ "Summer",
month %in% c(9, 10, 11) ~ "Fall"
),
season = factor(season, levels = c("Fall", "Winter", "Spring", "Summer"), ordered = TRUE)
)
season_usage_by_type <- all_trips %>%
group_by(season, member_casual) %>%
summarise(nbtrips = n(), .groups = "drop")
ggplot(season_usage_by_type, aes(x = season, y = nbtrips, color = member_casual, group = member_casual)) +
geom_line(size = 1) +
geom_point(size = 3) +
labs(
title = "Number of Trips by Season and Rider Type",
x = "Season",
y = "Number of Trips",
color = "Rider Type"
) +
theme_minimal()+
theme(
axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # centered
plot.background = element_rect(fill = "white", color = NA)
)
ggsave("C:/Users/Seifallah raach/Desktop/nbtripsbyseason.png", width = 8, height = 6, dpi = 300)
Observation 8 : The riders are more active in spring-summer-fall especially summer, casual riders are more active in the summer than members
This second chart will show us the average duration of trips by season :
library(dplyr)
library(ggplot2)
library(lubridate)
# Add month and season
all_trips <- all_trips %>%
mutate(
month = month(started_at),
season = case_when(
month %in% c(12, 1, 2) ~ "Winter",
month %in% c(3, 4, 5) ~ "Spring",
month %in% c(6, 7, 8) ~ "Summer",
month %in% c(9, 10, 11) ~ "Fall"
),
season = factor(season, levels = c("Fall", "Winter", "Spring", "Summer"), ordered = TRUE)
)
# Aggregate average ride duration by season and rider type (in minutes)
season_usage_by_type <- all_trips %>%
group_by(season, member_casual) %>%
summarise(avg_duration = mean(ride_length, na.rm = TRUE)/60, .groups = "drop") # convert seconds to minutes
# Plot
ggplot(season_usage_by_type, aes(x = season, y = avg_duration, color = member_casual, group = member_casual)) +
geom_line(size = 1) +
geom_point(size = 3) +
labs(
title = "Average Ride Duration by Season and Rider Type",
x = "Season",
y = "Average Duration (minutes)",
color = "Rider Type"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"))+
theme(plot.background = element_rect(fill = "white", color = NA))
ggsave("C:/Users/Seifallah raach/Desktop/averageridebyseason0.png", width = 8, height = 6, dpi = 300)
Observation 9 : Casual riders take longer trips in the spring , members are more consistent in the trip duration with a small increase in spring
Let’s summarize all our observations :
library(knitr)
library(kableExtra)
# Create a data frame with observations
observations <- data.frame(
Observation = paste0("Observation ", 1:9),
Description = c(
"Classic bikes are popular for both rider types, and casual riders use docked bikes unlike members.",
"For both members and casuals, the most active time is between 16h and 18h.",
"Docked bikes are only used by casuals for long trips.",
"Casual riders take longer trips than members, whereas members take more trips than casuals.",
"Casual riders are more active on weekends, unlike members whose number of trips remains mostly consistent.",
"Casual riders and members take longer trips on weekends.",
"Top 3 routes more frequented by casual riders: Streeter Dr & Grand Ave → Streeter Dr & Grand Ave; Millennium Park → Millennium Park; Michigan Ave & Oak St → Michigan Ave & Oak St.",
"Riders are more active in spring, summer, and fall, especially summer; casual riders are more active in summer than members.",
"Casual riders take longer trips in spring, while members are more consistent in trip duration with a small increase in spring."
),
stringsAsFactors = FALSE
)
# Create a formatted table with adjusted column widths
kable(observations, caption = "Cyclistic Riders Observations") %>%
kable_styling(full_width = TRUE) %>%
column_spec(1, width = "3cm") %>%
column_spec(2, width = "12cm")
| Observation | Description |
|---|---|
| Observation 1 | Classic bikes are popular for both rider types, and casual riders use docked bikes unlike members. |
| Observation 2 | For both members and casuals, the most active time is between 16h and 18h. |
| Observation 3 | Docked bikes are only used by casuals for long trips. |
| Observation 4 | Casual riders take longer trips than members, whereas members take more trips than casuals. |
| Observation 5 | Casual riders are more active on weekends, unlike members whose number of trips remains mostly consistent. |
| Observation 6 | Casual riders and members take longer trips on weekends. |
| Observation 7 | Top 3 routes more frequented by casual riders: Streeter Dr & Grand Ave → Streeter Dr & Grand Ave; Millennium Park → Millennium Park; Michigan Ave & Oak St → Michigan Ave & Oak St. |
| Observation 8 | Riders are more active in spring, summer, and fall, especially summer; casual riders are more active in summer than members. |
| Observation 9 | Casual riders take longer trips in spring, while members are more consistent in trip duration with a small increase in spring. |
We reached the end of our case study, based on the observations i
made these are my top recommendations :
1. Promote cost saving for frequent casual riders who rides
on weekends and during spring and have longer trips.
2. Targeted marketing campaigns during summer and spring for
casual riders with ads (mail, phone..).
3. Targeted
marketing campaign on the most popular routes frequented by the casual
riders on rush hours (flyers, posters..).
3. Offer
promotions during the peak seasons and weekends.
4.
Promote the docked bikes and what they offer as benefits by being a
member for using them.
5. Showcase the convenience
of a membership to casual riders for longer trips.
6. Showcase member benefits when taking longer
trips.
I will wrap up this study by sharing the link to my presentation, i hope my work was informative : Final presentation
Thank you for reading
Seifallah
Raach