CASE STUDY : CYCLISTIC BIKE-SHARE

This is a documentation for my first case study “CYCLISTIC BAKE-SHARE”.

Business understanding

What is Cyclistic?

Cyclistic is a bike share company in Chicago, it’s a program that feature more than 5800 bicycles and 600 docking stations.

What do we want to know from this case study?

  • Understand how casual riders and annual members use cyclistic bikes differently.
  • Design a new marketing strategy to convert casual riders into annual members.

Guiding questions

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

Business Task

Analyze the differences in usage between annual members and casual riders to identify opportunities to convert casual riders to annual members.

Data understanding

library(knitr)
library(kableExtra)

cyclistic_data <- data.frame(
  Category = c(
    "Data source",
    "Organization",
    "Bias/credibility (ROCCC)",
    "Licensing, privacy, security and accessibility",
    "Data integrity verification",
    "Limitation",
    "Tools chosen"
  ),
  Description = c(
    "Coursera – Cyclistic Case Study",
    "Tabular format with ride ID, start/end times, start/end station names and IDs, user types, bike type, and the longitude and latitude of each station",
    "Reliable, original, comprehensive, current, cited but some missing and inconsistent data exist",
    "Data publicly released and accessible for analysis",
    "Checked for unrealistic trip durations, null values, duplicates and standardized the characters type",
    "Lack of demographic information",
    "Google Sheets,SQL server management studio, Tableau, and R"
  ),
  stringsAsFactors = FALSE
)

# Create a formatted table with adjusted column widths
kable(cyclistic_data, caption = "Cyclistic Data Metadata") %>%
  kable_styling(full_width = TRUE) %>%
  column_spec(1, width = "5cm") %>%
  column_spec(2, width = "12cm")
Cyclistic Data Metadata
Category Description
Data source Coursera – Cyclistic Case Study
Organization Tabular format with ride ID, start/end times, start/end station names and IDs, user types, bike type, and the longitude and latitude of each station
Bias/credibility (ROCCC) Reliable, original, comprehensive, current, cited but some missing and inconsistent data exist
Licensing, privacy, security and accessibility Data publicly released and accessible for analysis
Data integrity verification Checked for unrealistic trip durations, null values, duplicates and standardized the characters type
Limitation Lack of demographic information
Tools chosen Google Sheets,SQL server management studio, Tableau, and R

Prepare data for analysis

In this section, i will document the entire cleaning process in Google sheets, SQL server management studio and RStudio.

Spreadsheets : Google Sheets

Important note : Our dataset is too large to import it all in a single spreadsheet so im going to explain how the cleaning process is executed in google sheet for documentation sake then i will translate my work to SQL so i can work with the data i was provided and continue my analysis.

Data manipulation

Add columns ride_length, day_of_week and clean them

In order to help us answer the business task, we need to add two new columns to our dataset :

I - ride_length : Duration of the Trip

  1. Format columns: Convert started_at and ended_at columns to Date time format.
  2. Calculate ride_length: Subtract started_at from ended_at to get the trip duration.
  3. Format duration: Convert the resulting ride_length into duration format.
  4. Verify results: we will notice some negative values and values that are either too small or too large to be considered a correct values for a trip duration, let’s clean them :
    • Correct negative values: we will apply ABS() to convert negative durations to positive ones.
    • Remove unrealistic values: we will delete records where the duration is greater than 1 day, and where the duration is lesser than 60 seconds by using the filtering function of google sheet.
  5. Final verification: we will perform summary checks and spot-check sample records to confirm cleanliness.

II - day_of_week : Day of the Trip

We will use WEEKDAY() in a new column to get the day of the week as a number from 1 (Sunday) to 7 (Saturday).

Let’s continue the cleaning for the rest of the data we have :

Station names cleaning

When looking at the data, we can notice empty cells in the station names and ids, let’s correct them :

  1. We will delete empty start station names/ ids, and end station names/ ids using the filtering function.
  2. We will standardize station names using PROPER(TRIM()) to remove extra spaces and capitalize correctly.
  3. we will Check for spelling errors: we will create a pivot table with station names as rows and count of station names as values.
Remove duplicates

Finally, let’s remove the rows that contain the same ride_id because its supposed to be unique for each ride, it is easily done by going to Data → Data cleanup → Remove duplicates.

This is the end of the section about google sheet, we will now work on SQL server management studio.

SQL : SQL Server Management Studio

Data manipulation

First, we need to import the datasets to the SQL server management studio (SSMS). We have 12 CSV files.

I will start by creating a temporary table, all_trips_tmp, where i would import all the data using CREATE TABLE:

Why a temporary table?

When i tried to import the second file of the data, i got an error related to the started_at column so i made a temporary table that will hold all our data with NVARCHAR type.


-- Create temporary table all_trips_tmp

CREATE TABLE all_trips_tmp (
    ride_id NVARCHAR(100),
    rideable_type NVARCHAR(100),
    started_at NVARCHAR(50),
    ended_at NVARCHAR(50),
    start_station_name NVARCHAR(255),
    start_station_id NVARCHAR(50),
    end_station_name NVARCHAR(255),
    end_station_id NVARCHAR(50),
    start_lat NVARCHAR(50),
    start_lng NVARCHAR(50),
    end_lat NVARCHAR(50),
    end_lng NVARCHAR(50),
    member_casual NVARCHAR(50)
);

Now i will import the data using BULK INSERT ( this query will help import the files one by one so i can make sure it was imported correctly)


-- Import data

BULK INSERT all_trips_tmp
FROM 'File_path\202101-divvy-tripdata.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0a',
    TABLOCK,
    CODEPAGE = '65001'  -- handles UTF-8 if needed
);

BULK INSERT all_trips_tmp
FROM 'File_path\202102-divvy-tripdata.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0a',
    TABLOCK,
    CODEPAGE = '65001'
); -- ...We will execute the same query for the 12 files we have.


Now let’s create the table, all_trips That we will work on for cleaning with the proper data type :


-- Create all_trips table with the proper data type

CREATE TABLE all_trips (
    ride_id NVARCHAR(100),
    rideable_type NVARCHAR(100),
    started_at DATETIME,
    ended_at DATETIME,
    start_station_name NVARCHAR(255),
    start_station_id NVARCHAR(50),
    end_station_name NVARCHAR(255),
    end_station_id NVARCHAR(50),
    start_lat FLOAT,
    start_lng FLOAT,
    end_lat FLOAT,
    end_lng FLOAT,
    member_casual NVARCHAR(50)
);

Let’s insert the data into all_trips by using INSERT INTO :


-- Insert the data into our table 
INSERT INTO all_trips
SELECT
    ride_id,
    rideable_type,
    TRY_CONVERT(DATETIME, started_at) AS started_at,
    TRY_CONVERT(DATETIME, ended_at) AS ended_at,
    start_station_name,
    start_station_id,
    end_station_name,
    end_station_id,
    TRY_CONVERT(FLOAT, start_lat) AS start_lat,
    TRY_CONVERT(FLOAT, start_lng) AS start_lng,
    TRY_CONVERT(FLOAT, end_lat) AS end_lat,
    TRY_CONVERT(FLOAT, end_lng) AS end_lng,
    member_casual
FROM all_trips_tmp;

We will check the data by comparing the count of the rows between all_trips and all_trips_tmp, then we will check a sample from our table :


-- Compare how many rows we have between all_trips and all_trips_tmp

SELECT COUNT(*) AS total_trips FROM all_trips;
SELECT COUNT(*) AS total_temp_rows FROM all_trips_tmp;

-- Check a sample of the data

SELECT TOP 10 * FROM all_trips;


Now, it’s time to start the cleaning process, we can notice that there are a few null values on our table, we will use DELETE FROM to get rid of them, we will also standardize the station names so they will be easy to clean by using UPPER(LTRIM(RTRIM())) and we will remove the duplicates rides :


-- Delete null values from started_at and ended_at

DELETE FROM all_trips
WHERE started_at IS NULL OR ended_at IS NULL;

-- Delete where ride_id is null

DELETE FROM all_trips
WHERE LTRIM(RTRIM(ride_id)) = '';

-- Standardize names : remove extra spaces and capitalize

UPDATE all_trips
SET start_station_name = UPPER(LTRIM(RTRIM(start_station_name))),
    end_station_name   = UPPER(LTRIM(RTRIM(end_station_name)));

-- Remove duplicates

WITH CTE AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ride_id ORDER BY ride_id) AS rn
    FROM all_trips
)
DELETE FROM CTE WHERE rn > 1;

--  Delete empty station names

DELETE FROM all_trips
WHERE start_station_name = '' OR end_station_name = '';


Like we did in the spreadsheet section, we are going to add two new columns that will help us answer the business task, ride_length and day_of_week by using ALTER TABLE /ADD to create our new columns and UPDATE /SET to set the values :


-- Add a new column for ride_length in seconds

ALTER TABLE all_trips
ADD ride_length_sec INT;

-- Calculate ride_length as the difference between started_at and ended_at

UPDATE all_trips
SET ride_length_sec = DATEDIFF(SECOND, started_at, ended_at);


Like previously, when checking the new column, we will find some negative values and unrealistic durations, so let’s clean our new column :


-- Convert negative durations to positive

UPDATE all_trips
SET ride_length_sec = ABS(ride_length_sec);

--  Remove unrealistic durations

DELETE FROM all_trips
WHERE ride_length_sec <= 60            -- delete rides shorter than 1 minute
   OR ride_length_sec > 86400;         -- delete rides longer than 1 day


Let’s add our second new column day_of_week and also day_name :


-- Add a new column for day_of_week

ALTER TABLE all_trips
ADD day_of_week INT;

-- Extract the day from the column started_at

UPDATE all_trips
SET day_of_week = DATEPART(WEEKDAY, started_at);

-- Add a new column for day_name

ALTER TABLE all_trips
ADD day_name NVARCHAR(10);

-- Extract the day name from the column started_at

UPDATE all_trips
SET day_name = DATENAME(WEEKDAY, started_at);


Let’s do more verification of the data :


-- Check for start station names inconsistencies

SELECT start_station_name, COUNT(*) AS trips
FROM all_trips
GROUP BY start_station_name
ORDER BY start_station_name ;

-- There is a station name 351 that contain one trip ride that isn't consistent with the rest of the data so i will delete it
-- Delete station name 351

DELETE FROM all_trips
WHERE start_station_name = '351';

-- Check for end station names inconsistencies

SELECT end_station_name, COUNT(*) AS trips
FROM all_trips
GROUP BY end_station_name
ORDER BY end_station_name ;

-- Final check of the data for null values  

SELECT
  COUNT(*) AS total_rows,
  COUNT(CASE WHEN ride_id IS NULL OR LTRIM(RTRIM(ride_id)) = '' THEN 1 END) AS empty_ride_id,
  COUNT(CASE WHEN start_station_name IS NULL OR LTRIM(RTRIM(start_station_name)) = '' THEN 1 END) AS empty_start_station_name,
  COUNT(CASE WHEN end_station_name IS NULL OR LTRIM(RTRIM(end_station_name)) = '' THEN 1 END) AS empty_end_station_name,
  COUNT(CASE WHEN rideable_type IS NULL OR LTRIM(RTRIM(rideable_type)) = '' THEN 1 END) AS empty_rideable_type,
  COUNT(CASE WHEN member_casual IS NULL OR LTRIM(RTRIM(member_casual)) = '' THEN 1 END) AS empty_member_casual
FROM all_trips;

-- See a preview of the data

SELECT TOP 10 * FROM all_trips;


After a final review, the dataset is now clean and ready for analysis.

Next Steps : Tableau

I used the cleaned data from the table all_trips in Tableau for visualization ( you can import the data from sql directly to tableau without the need to export it).

These are the dashboards i made :

_ Bike type overview : members vs casuals : See the riders activity based on the type of bikes.
_ Seasonal approach : See the riders activity by seasons.
_ Weekends vs weekdays : See the difference in activity between the weekends and weekdays by the rider type.
_ Riders activity : See the stations where we have the most traffic between casual riders and members.

Note : The final analysis will be documented in the R section.

R

This section will cover the work on R, i will use the same 12 csv files i worked with in SQL.

Before importing our data and start to clean, let’s install and call the package we need for these tasks : “tidyverse” will help us wrangle the data :

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

After calling the tidyverse package we have some conflicts, so we use the conflicted package to manage them and we will set dplyr::filter and dplyr::lag as the default choices :

library(conflicted)

conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.

Data manipulation

STEP 1 : Import the data

Let’s begin, the first step is to import our data, readr library will help us import our csv files :

library(readr)

X202101_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202101-divvy-tripdata.csv")
## Rows: 96834 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202102_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202102-divvy-tripdata.csv")
## Rows: 49623 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202103_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202103-divvy-tripdata.csv")
## Rows: 228496 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202104_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202104-divvy-tripdata.csv")
## Rows: 337230 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202105_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202105-divvy-tripdata.csv")
## Rows: 531633 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202106_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202106-divvy-tripdata.csv")
## Rows: 729595 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202107_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202107-divvy-tripdata.csv")
## Rows: 822410 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202108_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202108-divvy-tripdata.csv")
## Rows: 804352 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202109_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202110_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202111_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
X202112_divvy_tripdata <- read_csv("DATASETS/Coursera datasets/Cyclistic BIKE-SHARE/202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
STEP 2 : Wrangle data and combine it into a single file

Now we need to combine them into a single table we can work with, but before that let’s make sure the data contained into the columns have the same type :

str(X202101_divvy_tripdata)
## spc_tbl_ [96,834 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:96834] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:96834] NA NA NA NA ...
##  $ end_station_id    : chr [1:96834] NA NA NA NA ...
##  $ start_lat         : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:96834] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(X202102_divvy_tripdata)
## spc_tbl_ [49,623 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:49623] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
##  $ rideable_type     : chr [1:49623] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : chr [1:49623] "2021-02-12 16:14:56" "2021-02-14 17:52:38" "2021-02-09 19:10:18" "2021-02-02 17:49:41" ...
##  $ ended_at          : chr [1:49623] "2021-02-12 16:21:43" "2021-02-14 18:12:09" "2021-02-09 19:19:10" "2021-02-02 17:54:06" ...
##  $ start_station_name: chr [1:49623] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
##  $ start_station_id  : chr [1:49623] "525" "525" "KA1503000012" "637" ...
##  $ end_station_name  : chr [1:49623] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
##  $ end_station_id    : chr [1:49623] "660" "16806" "TA1305000029" "TA1305000034" ...
##  $ start_lat         : chr [1:49623] "42.012701" "42.012701" "41.88579466666667" "41.895634" ...
##  $ start_lng         : chr [1:49623] "-87.666058" "-87.666058" "-87.63110066666667" "-87.672069" ...
##  $ end_lat           : chr [1:49623] "42.004583" "42.019537" "41.884866" "41.903119" ...
##  $ end_lng           : chr [1:49623] "-87.661406" "-87.669563" "-87.62749766666667" "-87.673935" ...
##  $ member_casual     : chr [1:49623] "member" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_character(),
##   ..   start_lng = col_character(),
##   ..   end_lat = col_character(),
##   ..   end_lng = col_character(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
library(knitr)
library(kableExtra)

# Create a data frame with column names and types for two files

column_info <- data.frame(
  Column = c("ride_id", "rideable_type", "started_at", "ended_at",
             "start_station_name", "start_station_id", "end_station_name",
             "end_station_id", "start_lat", "start_lng", "end_lat",
             "end_lng", "member_casual"),
  File1_Type = c("col_character", "col_character", "col_datetime", "col_datetime",
                 "col_character", "col_character", "col_character", "col_character",
                 "col_double", "col_double", "col_double", "col_double", "col_character"),
  File2_Type = c("col_character", "col_character", "col_character", "col_character",
                 "col_character", "col_character", "col_character", "col_character",
                 "col_character", "col_character", "col_character", "col_character", "col_character"),
  stringsAsFactors = FALSE
)

# Create a formatted table with adjusted column widths

kable(column_info, caption = "Data Types") %>%
  kable_styling(full_width = TRUE) %>%
  column_spec(1, width = "5cm") %>%
  column_spec(2, width = "5cm") %>%
  column_spec(3, width = "5cm")
Data Types
Column File1_Type File2_Type
ride_id col_character col_character
rideable_type col_character col_character
started_at col_datetime col_character
ended_at col_datetime col_character
start_station_name col_character col_character
start_station_id col_character col_character
end_station_name col_character col_character
end_station_id col_character col_character
start_lat col_double col_character
start_lng col_double col_character
end_lat col_double col_character
end_lng col_double col_character
member_casual col_character col_character

From the two files, we can notice the difference in the data types so lets correct it before joining all our data, we need to convert these columns to the correct type, this is what i am going to do :

  1. Put the data into a list dataframe_list
  2. Standardize column types with the function clean_types
library(dplyr)

dataframe_list <- list(X202101_divvy_tripdata, X202102_divvy_tripdata, X202103_divvy_tripdata, X202104_divvy_tripdata, X202105_divvy_tripdata, X202106_divvy_tripdata, X202107_divvy_tripdata, X202108_divvy_tripdata, X202109_divvy_tripdata, X202110_divvy_tripdata, X202111_divvy_tripdata, X202112_divvy_tripdata)

# Function to standardize column types

clean_types <- function(df) {
  df %>%
    mutate(
      ride_id            = as.character(ride_id),
      rideable_type      = as.character(rideable_type),
      started_at         = as.POSIXct(started_at),
      ended_at           = as.POSIXct(ended_at),
      start_station_name = as.character(start_station_name),
      start_station_id   = as.character(start_station_id),
      end_station_name   = as.character(end_station_name),
      end_station_id     = as.character(end_station_id),
      start_lat          = as.numeric(start_lat),
      start_lng          = as.numeric(start_lng),
      end_lat            = as.numeric(end_lat),
      end_lng            = as.numeric(end_lng),
      member_casual      = as.character(member_casual)
    )
}

# Apply to each dataframe

dataframe_list <- lapply(dataframe_list, clean_types)


Using this method i got an error concerning the started_at data which means some of the started_at / ended_at values are characters in a format that as.POSIXct() doesn’t automatically recognize. We will use the library lubridate for inconsistent datetime parsing and we will also use a function parse_datetime_safe to detect a different date format for example we can have in one csv file “YYYY-MM-DD” and in another “MM/DD/YYYY” :

library(dplyr)
library(lubridate)

# A safe datetime parser
parse_datetime_safe <- function(x) {
  suppressWarnings(ymd_hms(x, quiet = TRUE)) %>%
    if_else(is.na(.), mdy_hms(x, quiet = TRUE), .)
}

# Function to clean column types
clean_types <- function(df) {
  df %>%
    mutate(
      ride_id            = as.character(ride_id),
      rideable_type      = as.character(rideable_type),
      started_at         = parse_datetime_safe(as.character(started_at)),
      ended_at           = parse_datetime_safe(as.character(ended_at)),
     start_station_name  = as.character(start_station_name),
      start_station_id   = as.character(start_station_id),
      end_station_name   = as.character(end_station_name),
      end_station_id     = as.character(end_station_id),
      start_lat          = as.numeric(start_lat),
      start_lng          = as.numeric(start_lng),
      end_lat            = as.numeric(end_lat),
      end_lng            = as.numeric(end_lng),
      member_casual      = as.character(member_casual)
    )
}

Let’s reapply the clean_types function :

dataframe_list <- lapply(dataframe_list, clean_types)
## Warning: There were 4 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `start_lat = as.numeric(start_lat)`.
## Caused by warning:
## ! NAs introduits lors de la conversion automatique
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 3 remaining warnings.

We can now join all the csv files in a single table together that we will call all_trips :

library(dplyr)

all_trips <- bind_rows(dataframe_list)

Important note : using this method i got a warning which is due to NA and null values, we will clean them in the next step :

str(all_trips)
## tibble [5,595,064 × 13] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5595064] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:5595064] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5595064], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct[1:5595064], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:5595064] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:5595064] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:5595064] NA NA NA NA ...
##  $ end_station_id    : chr [1:5595064] NA NA NA NA ...
##  $ start_lat         : num [1:5595064] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5595064] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5595064] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5595064] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:5595064] "member" "member" "member" "member" ...
all_trips <- all_trips %>%
  filter(!is.na(start_station_name),
         !is.na(start_station_id),
         !is.na(end_station_name),
         !is.na(end_station_id),
         !is.na(start_lat),
         !is.na(start_lng),
         !is.na(end_lat),
         !is.na(end_lng))
head(all_trips)
## # A tibble: 6 × 13
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 B9F73448DFBE0D45 classic_bike  2021-01-24 19:15:38 2021-01-24 19:22:51
## 2 457C7F4B5D3DA135 electric_bike 2021-01-23 12:57:38 2021-01-23 13:02:10
## 3 57C750326F9FDABE electric_bike 2021-01-09 15:28:04 2021-01-09 15:37:51
## 4 4D518C65E338D070 electric_bike 2021-01-09 15:28:57 2021-01-09 15:37:54
## 5 9D08A3AFF410474D classic_bike  2021-01-24 15:56:59 2021-01-24 16:07:08
## 6 49FCE1F8598F12C6 electric_bike 2021-01-22 15:15:28 2021-01-22 15:36:01
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
STEP 3 : create two new columns ride_length and day_of_week

Like in google sheet and SQL, let’s create the two columns that will help us in solving the business task, we will start by creating the column ride_length in seconds :

all_trips <- all_trips %>%
  mutate(
    ride_length = as.numeric(difftime(ended_at, started_at, units = "secs"))
  )
summary(all_trips$ride_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   -3354     417     732    1309    1327 3356649      44

We can notice there are negative values and unrealistic durations, let’s clean the ride_length column :

all_trips <- all_trips %>%
  mutate(
    ride_length = abs(ride_length)
  )
all_trips <- all_trips %>%
  filter(
    !is.na(ride_length),           
    ride_length >= 60,             # remove rides shorter than 60 seconds
    ride_length <= 86400           # remove rides longer than 1 day 
  )
summary(all_trips$ride_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      60     427     742    1208    1338   86362

Now, we will create the week_of_day column :

library(lubridate)
library(dplyr)

all_trips <- all_trips %>%
  mutate(
    day_of_week = weekdays(started_at)
  )

Note : This gave me day_of_week in french so i will change that because i want it in english ( it’s due to my system’s settings) :

library(dplyr)
library(lubridate)

# Create English day names manually

english_days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

all_trips <- all_trips %>%
  mutate(
    # wday returns 1=Sunday, 2=Monday, ...
    day_of_week = english_days[wday(started_at)]
  ) %>%
  # make it ordered 
  mutate(
    day_of_week = factor(day_of_week,
                         levels = c("Monday", "Tuesday", "Wednesday", 
                                    "Thursday", "Friday", "Saturday", "Sunday"))
  )
head(all_trips)
## # A tibble: 6 × 15
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 B9F73448DFBE0D45 classic_bike  2021-01-24 19:15:38 2021-01-24 19:22:51
## 2 457C7F4B5D3DA135 electric_bike 2021-01-23 12:57:38 2021-01-23 13:02:10
## 3 57C750326F9FDABE electric_bike 2021-01-09 15:28:04 2021-01-09 15:37:51
## 4 4D518C65E338D070 electric_bike 2021-01-09 15:28:57 2021-01-09 15:37:54
## 5 9D08A3AFF410474D classic_bike  2021-01-24 15:56:59 2021-01-24 16:07:08
## 6 49FCE1F8598F12C6 electric_bike 2021-01-22 15:15:28 2021-01-22 15:36:01
## # ℹ 11 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   ride_length <dbl>, day_of_week <fct>
STEP 4 : standardize the texts
library(dplyr)
library(stringr)

all_trips <- all_trips %>%
  mutate(across(where(is.character),
                ~ str_trim(.) %>%      # remove leading/trailing spaces
                  str_to_lower()       # convert to lowercase
  ))
STEP 5 : Data verification

Let’s inspect the new table all_trips that we have created :

sum(is.na(all_trips))
## [1] 0
colnames(all_trips)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"      "ride_length"        "day_of_week"
nrow(all_trips)
## [1] 4527665
dim(all_trips)
## [1] 4527665      15
head(all_trips)
## # A tibble: 6 × 15
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 b9f73448dfbe0d45 classic_bike  2021-01-24 19:15:38 2021-01-24 19:22:51
## 2 457c7f4b5d3da135 electric_bike 2021-01-23 12:57:38 2021-01-23 13:02:10
## 3 57c750326f9fdabe electric_bike 2021-01-09 15:28:04 2021-01-09 15:37:51
## 4 4d518c65e338d070 electric_bike 2021-01-09 15:28:57 2021-01-09 15:37:54
## 5 9d08a3aff410474d classic_bike  2021-01-24 15:56:59 2021-01-24 16:07:08
## 6 49fce1f8598f12c6 electric_bike 2021-01-22 15:15:28 2021-01-22 15:36:01
## # ℹ 11 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   ride_length <dbl>, day_of_week <fct>
str(all_trips)
## tibble [4,527,665 × 15] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:4527665] "b9f73448dfbe0d45" "457c7f4b5d3da135" "57c750326f9fdabe" "4d518c65e338d070" ...
##  $ rideable_type     : chr [1:4527665] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:4527665], format: "2021-01-24 19:15:38" "2021-01-23 12:57:38" ...
##  $ ended_at          : POSIXct[1:4527665], format: "2021-01-24 19:22:51" "2021-01-23 13:02:10" ...
##  $ start_station_name: chr [1:4527665] "california ave & cortez st" "california ave & cortez st" "california ave & cortez st" "california ave & cortez st" ...
##  $ start_station_id  : chr [1:4527665] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:4527665] "wood st & augusta blvd" "california ave & north ave" "wood st & augusta blvd" "wood st & augusta blvd" ...
##  $ end_station_id    : chr [1:4527665] "657" "13258" "657" "657" ...
##  $ start_lat         : num [1:4527665] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:4527665] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:4527665] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:4527665] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:4527665] "member" "member" "casual" "casual" ...
##  $ ride_length       : num [1:4527665] 433 272 587 537 609 ...
##  $ day_of_week       : Factor w/ 7 levels "Monday","Tuesday",..: 7 6 6 6 7 5 2 6 3 5 ...
summary(all_trips)
##    ride_id          rideable_type        started_at                 
##  Length:4527665     Length:4527665     Min.   :2021-01-01 00:02:24  
##  Class :character   Class :character   1st Qu.:2021-06-04 17:20:07  
##  Mode  :character   Mode  :character   Median :2021-07-28 17:36:29  
##                                        Mean   :2021-07-25 05:36:58  
##                                        3rd Qu.:2021-09-18 15:18:46  
##                                        Max.   :2021-12-31 23:59:48  
##                                                                     
##     ended_at                   start_station_name start_station_id  
##  Min.   :2021-01-01 00:08:39   Length:4527665     Length:4527665    
##  1st Qu.:2021-06-04 17:40:03   Class :character   Class :character  
##  Median :2021-07-28 17:53:36   Mode  :character   Mode  :character  
##  Mean   :2021-07-25 05:57:07                                        
##  3rd Qu.:2021-09-18 15:45:50                                        
##  Max.   :2022-01-01 12:17:58                                        
##                                                                     
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:4527665     Length:4527665     Min.   :41.65   Min.   :-87.83  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.64  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.06   Max.   :-87.53  
##                                                                        
##     end_lat         end_lng       member_casual       ride_length   
##  Min.   :41.65   Min.   :-87.83   Length:4527665     Min.   :   60  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:  427  
##  Median :41.90   Median :-87.64   Mode  :character   Median :  742  
##  Mean   :41.90   Mean   :-87.64                      Mean   : 1208  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.: 1338  
##  Max.   :42.17   Max.   :-87.52                      Max.   :86362  
##                                                                     
##     day_of_week    
##  Monday   :567614  
##  Tuesday  :595203  
##  Wednesday:607888  
##  Thursday :590003  
##  Friday   :647249  
##  Saturday :814546  
##  Sunday   :705162

We will add more columns that will provide us with additional opportunities to aggregate the data which are date, month, day and year for each ride :

all_trips$date<-as.Date(all_trips$started_at)
all_trips$month<-format(as.Date(all_trips$date),"%m")
all_trips$day<-format(as.Date(all_trips$date),"%d")
all_trips$year<-format(as.Date(all_trips$date),"%y")

Now we’re all set, we have the necessary data to start our analysis, let’s save our table with saveRDS() so we dont rerun the same code chunks everytime we want to use the file :

saveRDS(all_trips, "all_trips_clean.rds")

We can call our table all_trips with readRDS()

 all_trips <- readRDS("all_trips_clean.rds")
STEP 6 Descriptive analysis

Let’s start making our visualization, we will begin by making a pie chart that shows us the preference of the bike type by member vs casuals

library(dplyr)
library(tidyr)

bike_pref <- all_trips %>%
  group_by(member_casual, rideable_type) %>%
  summarise(count = n(), .groups = "drop") %>%
  complete(member_casual, rideable_type, fill = list(count = 0)) %>%  # fill missing types with 0
  group_by(member_casual) %>%
  mutate(perc = count / sum(count) * 100,
         label = ifelse(count > 0, paste0(round(perc, 1), "%"), "")) %>%
  ungroup()

ggplot(bike_pref, aes(x = "", y = perc, fill = rideable_type)) +
  geom_col(width = 1, color= "black") + coord_polar(theta = "y") + # makes it a pie chart
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5),
            size = 4,fontface = "bold", color = "white") +  # labels inside slices
  facet_wrap(~ member_casual) + # separate pies for member vs casual
  labs(title = "Bike Type Preference : Member vs Casual", fill = "Bike Type") +
  theme_void() + # clean theme without axes
  theme(
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),  # centered title
    strip.text = element_text(size = 14)  # facet labels
  ) +
    scale_fill_brewer(palette = "Set2")

Observation 1 : Classic bikes are popular for the two type of riders and casual riders use the docked bikes unlike the members.

We will see now the usage of bikes by the hour of the days :

library(dplyr)
library(lubridate)

# Extract hour from started_at
all_trips <- all_trips %>%
  mutate(hour = hour(started_at))

# Aggregate trips by hour, rideable_type, and rider type
hourly_usage <- all_trips %>%
  group_by(hour, rideable_type, member_casual) %>%
  summarise(nbtrips = n(), .groups = "drop")

library(ggplot2)

ggplot(hourly_usage, aes(x = hour, y = nbtrips, color = rideable_type)) +
  geom_line(size = 1) +
  facet_wrap(~ member_casual) +                 # separate charts for member / casual
  labs(title = "Bike Type Usage by Hour of Day",
       x = "Hour of Day",
       y = "Number of Trips",
       color = "Bike Type") +
  theme_minimal() +
  scale_x_continuous(breaks = seq(0, 23, by = 2)) +  # every 2 hours
 theme(
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5)  # centered
  )
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Observation 2 : For both members and casuals their most active time is between 16h and 18h.

Let’s see a scatter plot that shows us the correlation between the number of trips and the average ride duration by bike type and rider type.

library(dplyr)
library(lubridate)

scatter_data <- all_trips %>%
  group_by(member_casual, rideable_type, day_of_week) %>%
  summarise(
    nbtrips = n(),
    avg_duration = mean(ride_length, na.rm = TRUE),  # in seconds
    .groups = "drop"
  )

scatter_data <- scatter_data %>%
  mutate(day_index = as.numeric(day_of_week))  # Monday=1, Tuesday=2, ...

ggplot(scatter_data, aes(
  x = avg_duration,
  y = nbtrips,
  size = day_index,       # numeric now
  shape = rideable_type,
  color = member_casual
)) +
  geom_point(alpha = 0.7) +
  labs(
    title = "Bike Usage: Number of Trips vs Average Duration",
    x = "Average Duration (seconds)",
    y = "Number of Trips",
    color = "Rider Type",
    shape = "Bike Type",
    size = "Day of Week"
  ) +
  theme_minimal()+
   theme(
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5)  # centered
  )

Observation 3 : The docked bikes are only used by casuals for long trips.

Observation 4 : Casual riders take longer trips than members and members take more trips than casuals

Let’s see the riders activity by the day of the week :

library(dplyr)
library(lubridate)

# Make sure day_of_week is ordered correctly (Monday → Sunday)
all_trips <- all_trips %>%
  mutate(day_of_week = factor(day_of_week,
                              levels = c("Monday", "Tuesday", "Wednesday",
                                         "Thursday", "Friday", "Saturday", "Sunday"),
                              ordered = TRUE))

# Aggregate number of trips per day of week and rider type
weekly_usage <- all_trips %>%
  group_by(day_of_week, member_casual) %>%
  summarise(nbtrips = n(), .groups = "drop")
library(ggplot2)

ggplot(weekly_usage, aes(x = day_of_week, y = nbtrips, color = member_casual, group = member_casual)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(
    title = "Number of Trips by Day of Week",
    x = "Day of Week",
    y = "Number of Trips",
    color = "Rider Type"
  ) +
  theme_minimal()+
   theme(
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5)  # centered
  )+
    theme(plot.background = element_rect(fill = "white", color = NA))

ggsave("C:/Users/Seifallah raach/Desktop/ridesbydays.png", width = 8, height = 6, dpi = 300)

Observation 5 : Casual riders are more active on weekends.

library(dplyr)
library(lubridate)
library(ggplot2)

# Make sure day_of_week is ordered correctly (Monday → Sunday)
all_trips <- all_trips %>%
  mutate(day_of_week = factor(day_of_week,
                              levels = c("Monday", "Tuesday", "Wednesday",
                                         "Thursday", "Friday", "Saturday", "Sunday"),
                              ordered = TRUE))

# Aggregate average ride duration (convert seconds → minutes)
weekly_avg_duration <- all_trips %>%
  group_by(day_of_week, member_casual) %>%
  summarise(avg_duration = mean(ride_length, na.rm = TRUE) / 60, .groups = "drop")

# Plot
ggplot(weekly_avg_duration, aes(x = day_of_week, y = avg_duration,
                                color = member_casual, group = member_casual)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(
    title = "Average Ride Duration by Day of Week",
    x = "Day of Week",
    y = "Average Ride Duration (minutes)",
    color = "Rider Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # centered
    plot.background = element_rect(fill = "white", color = NA)
  )

ggsave("C:/Users/Seifallah raach/Desktop/averageridebyday.png", width = 8, height = 6, dpi = 300)

Observation 6 : Casual riders and members take longer trips on weekends

Let’s see which paths are preferred by the riders :

library(dplyr)

# Create a route column for convenience
all_trips <- all_trips %>%
  mutate(route = paste(start_station_name, "→", end_station_name))

# Aggregate number of trips per route and rider type
popular_routes <- all_trips %>%
  group_by(member_casual, route) %>%
  summarise(nbtrips = n(), .groups = "drop") %>%
  arrange(member_casual, desc(nbtrips))

top_routes <- popular_routes %>%
  group_by(member_casual) %>%
  slice_max(nbtrips, n = 10) %>%
  ungroup()

top_routes
## # A tibble: 20 × 3
##    member_casual route                                                   nbtrips
##    <chr>         <chr>                                                     <int>
##  1 casual        streeter dr & grand ave → streeter dr & grand ave         10953
##  2 casual        millennium park → millennium park                          5764
##  3 casual        michigan ave & oak st → michigan ave & oak st              5588
##  4 casual        lake shore dr & monroe st → lake shore dr & monroe st      4471
##  5 casual        streeter dr & grand ave → millennium park                  3308
##  6 casual        buckingham fountain → buckingham fountain                  3302
##  7 casual        theater on the lake → theater on the lake                  2987
##  8 casual        dusable lake shore dr & monroe st → dusable lake shore…    2939
##  9 casual        millennium park → streeter dr & grand ave                  2925
## 10 casual        montrose harbor → montrose harbor                          2906
## 11 member        ellis ave & 60th st → ellis ave & 55th st                  4082
## 12 member        ellis ave & 55th st → ellis ave & 60th st                  3652
## 13 member        ellis ave & 60th st → university ave & 57th st             3109
## 14 member        university ave & 57th st → ellis ave & 60th st             3010
## 15 member        calumet ave & 33rd st → state st & 33rd st                 1989
## 16 member        state st & 33rd st → calumet ave & 33rd st                 1954
## 17 member        loomis st & lexington st → morgan st & polk st             1860
## 18 member        morgan st & polk st → loomis st & lexington st             1653
## 19 member        mlk jr dr & 29th st → state st & 33rd st                   1422
## 20 member        state st & 33rd st → mlk jr dr & 29th st                   1392
library(dplyr)
library(formattable)
library(RColorBrewer)

all_trips <- all_trips %>%
  mutate(route = paste(start_station_name, "→", end_station_name))

popular_routes <- all_trips %>%
  group_by(member_casual, route) %>%
  summarise(nbtrips = n(), .groups = "drop") %>%
  arrange(member_casual, desc(nbtrips))

top_routes <- popular_routes %>%
  group_by(member_casual) %>%
  slice_max(nbtrips, n = 10) %>%
  ungroup()

# Create a **consistent color mapping** for all routes
unique_routes <- unique(top_routes$route)
palette_colors <- brewer.pal(min(length(unique_routes), 12), "Set3")
route_colors <- setNames(rep(palette_colors, length.out = length(unique_routes)), unique_routes)

# Display table with colors applied **based on route**, consistent across members
formattable(top_routes, list(
  route = formatter("span",
                    style = ~ style(display = "block",
                                    `background-color` = route_colors[route],
                                    color = "black"))
))
member_casual route nbtrips
casual streeter dr & grand ave → streeter dr & grand ave 10953
casual millennium park → millennium park 5764
casual michigan ave & oak st → michigan ave & oak st 5588
casual lake shore dr & monroe st → lake shore dr & monroe st 4471
casual streeter dr & grand ave → millennium park 3308
casual buckingham fountain → buckingham fountain 3302
casual theater on the lake → theater on the lake 2987
casual dusable lake shore dr & monroe st → dusable lake shore dr & monroe st 2939
casual millennium park → streeter dr & grand ave 2925
casual montrose harbor → montrose harbor 2906
member ellis ave & 60th st → ellis ave & 55th st 4082
member ellis ave & 55th st → ellis ave & 60th st 3652
member ellis ave & 60th st → university ave & 57th st 3109
member university ave & 57th st → ellis ave & 60th st 3010
member calumet ave & 33rd st → state st & 33rd st 1989
member state st & 33rd st → calumet ave & 33rd st 1954
member loomis st & lexington st → morgan st & polk st 1860
member morgan st & polk st → loomis st & lexington st 1653
member mlk jr dr & 29th st → state st & 33rd st 1422
member state st & 33rd st → mlk jr dr & 29th st 1392
library(dplyr)
library(formattable)
library(RColorBrewer)

# Create route column
all_trips <- all_trips %>%
  mutate(route = paste(start_station_name, "→", end_station_name))

# Calculate average duration per route and member type in minutes
avg_duration_routes <- all_trips %>%
  group_by(member_casual, route) %>%
  summarise(avg_duration = mean(ride_length, na.rm = TRUE)/60,  # convert seconds to minutes
            .groups = "drop")

# Keep top 10 by average duration per member type
top_avg_routes <- avg_duration_routes %>%
  group_by(member_casual) %>%
  slice_max(avg_duration, n = 10) %>%
  ungroup()

# Create consistent color palette for routes
unique_routes <- unique(top_avg_routes$route)
palette_colors <- brewer.pal(min(length(unique_routes), 12), "Set3")
route_colors <- setNames(rep(palette_colors, length.out = length(unique_routes)), unique_routes)

# Display colored table
formattable(top_avg_routes, list(
  route = formatter("span",
                    style = ~ style(display = "block",
                                    `background-color` = route_colors[route],
                                    color = "black"))
))
member_casual route avg_duration
casual wood st & chicago ave → base - 2132 w hubbard warehouse 1439.3667
casual museum of science and industry → calumet ave & 33rd st 1425.7500
casual halsted st & dickens ave → avers ave & belmont ave 1424.7000
casual kedzie ave & chicago ave → state st & van buren st 1422.2500
casual south shore dr & 71st st → orleans st & elm st 1418.4833
casual ashland ave & lake st → richmond st & diversey ave 1415.6167
casual central ave & chicago ave → damen ave & madison st 1410.5667
casual halsted st & 96th st → elizabeth st & 92nd st 1396.7167
casual sheridan rd & greenleaf ave → dodge ave & church st 1394.2167
casual halsted st & willow st → base - 2132 w hubbard warehouse 1387.0167
member green st & randolph st → halsted st & archer ave 1328.3167
member racine ave & 18th st → wacker dr & washington st 1306.0500
member wabash ave & adams st → base - 2132 w hubbard warehouse 1224.9500
member state st & pearson st → central park blvd & 5th ave 1192.2000
member delano ct & roosevelt rd → base - 2132 w hubbard warehouse 1137.3167
member state st & kinzie st → museum of science and industry 1091.7167
member clinton st & madison st → base - 2132 w hubbard warehouse 1053.9500
member clark st & chicago ave → ravenswood ave & lawrence ave 987.9500
member morgan st & 31st st → rush st & superior st 974.4500
member lakefront trail & wilson ave → claremont ave & hirsch st 958.2667

Observation 7 : These are the top 3 routes more frequented by the casual riders :
_ streeter dr & grand ave → streeter dr & grand ave
_ millennium park → millennium park
_ michigan ave & oak st → michigan ave & oak st

Let’s have a seasonal approach to see how riders activity is influenced by the seasons, this first chart will show us the number of trips by seasons :

library(dplyr)
library(lubridate)
library(ggplot2)

all_trips <- all_trips %>%
  mutate(
    month = month(started_at),
    season = case_when(
      month %in% c(12, 1, 2)  ~ "Winter",
      month %in% c(3, 4, 5)   ~ "Spring",
      month %in% c(6, 7, 8)   ~ "Summer",
      month %in% c(9, 10, 11) ~ "Fall"
    ),
        season = factor(season, levels = c("Fall", "Winter", "Spring", "Summer"), ordered = TRUE)
  )
season_usage_by_type <- all_trips %>%
  group_by(season, member_casual) %>%
  summarise(nbtrips = n(), .groups = "drop")

ggplot(season_usage_by_type, aes(x = season, y = nbtrips, color = member_casual, group = member_casual)) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  labs(
    title = "Number of Trips by Season and Rider Type",
    x = "Season",
    y = "Number of Trips",
    color = "Rider Type"
  ) +
  theme_minimal()+
 theme(
    axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5), # centered
    plot.background = element_rect(fill = "white", color = NA)
  )

ggsave("C:/Users/Seifallah raach/Desktop/nbtripsbyseason.png", width = 8, height = 6, dpi = 300)

Observation 8 : The riders are more active in spring-summer-fall especially summer, casual riders are more active in the summer than members

This second chart will show us the average duration of trips by season :

library(dplyr)
library(ggplot2)
library(lubridate)

# Add month and season
all_trips <- all_trips %>%
  mutate(
    month = month(started_at),
    season = case_when(
      month %in% c(12, 1, 2)  ~ "Winter",
      month %in% c(3, 4, 5)   ~ "Spring",
      month %in% c(6, 7, 8)   ~ "Summer",
      month %in% c(9, 10, 11) ~ "Fall"
    ),
    season = factor(season, levels = c("Fall", "Winter", "Spring", "Summer"), ordered = TRUE)
  )

# Aggregate average ride duration by season and rider type (in minutes)
season_usage_by_type <- all_trips %>%
  group_by(season, member_casual) %>%
  summarise(avg_duration = mean(ride_length, na.rm = TRUE)/60, .groups = "drop")  # convert seconds to minutes

# Plot
ggplot(season_usage_by_type, aes(x = season, y = avg_duration, color = member_casual, group = member_casual)) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  labs(
    title = "Average Ride Duration by Season and Rider Type",
    x = "Season",
    y = "Average Duration (minutes)",
    color = "Rider Type"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"))+
  theme(plot.background = element_rect(fill = "white", color = NA))

ggsave("C:/Users/Seifallah raach/Desktop/averageridebyseason0.png", width = 8, height = 6, dpi = 300)

Observation 9 : Casual riders take longer trips in the spring , members are more consistent in the trip duration with a small increase in spring

Data analysis

Let’s summarize all our observations :

library(knitr)
library(kableExtra)

# Create a data frame with observations
observations <- data.frame(
  Observation = paste0("Observation ", 1:9),
  Description = c(
    "Classic bikes are popular for both rider types, and casual riders use docked bikes unlike members.",
    "For both members and casuals, the most active time is between 16h and 18h.",
    "Docked bikes are only used by casuals for long trips.",
    "Casual riders take longer trips than members, whereas members take more trips than casuals.",
    "Casual riders are more active on weekends, unlike members whose number of trips remains mostly consistent.",
    "Casual riders and members take longer trips on weekends.",
    "Top 3 routes more frequented by casual riders: Streeter Dr & Grand Ave → Streeter Dr & Grand Ave; Millennium Park → Millennium Park; Michigan Ave & Oak St → Michigan Ave & Oak St.",
    "Riders are more active in spring, summer, and fall, especially summer; casual riders are more active in summer than members.",
    "Casual riders take longer trips in spring, while members are more consistent in trip duration with a small increase in spring."
  ),
  stringsAsFactors = FALSE
)

# Create a formatted table with adjusted column widths
kable(observations, caption = "Cyclistic Riders Observations") %>%
  kable_styling(full_width = TRUE) %>%
  column_spec(1, width = "3cm") %>%
  column_spec(2, width = "12cm")
Cyclistic Riders Observations
Observation Description
Observation 1 Classic bikes are popular for both rider types, and casual riders use docked bikes unlike members.
Observation 2 For both members and casuals, the most active time is between 16h and 18h.
Observation 3 Docked bikes are only used by casuals for long trips.
Observation 4 Casual riders take longer trips than members, whereas members take more trips than casuals.
Observation 5 Casual riders are more active on weekends, unlike members whose number of trips remains mostly consistent.
Observation 6 Casual riders and members take longer trips on weekends.
Observation 7 Top 3 routes more frequented by casual riders: Streeter Dr & Grand Ave → Streeter Dr & Grand Ave; Millennium Park → Millennium Park; Michigan Ave & Oak St → Michigan Ave & Oak St. 
Observation 8 Riders are more active in spring, summer, and fall, especially summer; casual riders are more active in summer than members.
Observation 9 Casual riders take longer trips in spring, while members are more consistent in trip duration with a small increase in spring.

Top recommandations

We reached the end of our case study, based on the observations i made these are my top recommendations :

1. Promote cost saving for frequent casual riders who rides on weekends and during spring and have longer trips.
2. Targeted marketing campaigns during summer and spring for casual riders with ads (mail, phone..).
3. Targeted marketing campaign on the most popular routes frequented by the casual riders on rush hours (flyers, posters..).
3. Offer promotions during the peak seasons and weekends.
4. Promote the docked bikes and what they offer as benefits by being a member for using them.
5. Showcase the convenience of a membership to casual riders for longer trips.
6. Showcase member benefits when taking longer trips.

Presentation

I will wrap up this study by sharing the link to my presentation, i hope my work was informative : Final presentation

Thank you for reading
Seifallah Raach