This case study is my capstone project for the Google Data Analytics Certificate. It involves analysis of historical data for a fictional company, Cyclistic, a bike sharing company in Chicago. Although the company and scenario are fictitious, the data used for this project are real data collected between August 2023 – July 2024 from a bike share program in Chicago. In this project I am assuming the role of the junior analyst. I use R studio desktop environment for my analysis.

Scenario

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistics marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistics finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, the director of marketing believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, the director of marketing believes there is a solid opportunity to convert casual riders into members.

The director of marketing has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. My role here is to analyze the Cyclistic historical bike trip data to identify trends.

Step 1: Ask

The first step is to identify the business task and consider the stakeholders.

Business Tasks

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

Stakeholders

  1. Cyclistic Executive Team
  2. Cyclistic Marketing Director
  3. Cyclistic Finance Analysts

Step 2: Prepare

The second step is to prepare data needed, and identify how it is organized, and determine credibility of the data.

Data Source

The data has been made available here by Motivate International Inc. under this license. For privacy purposes riders personally identifiable information has been excluded from the data.

Data Used

I am using the most recent dataset covering the past 12 months, from August 2023 to July 2024.

-202407-divvy-tripdata.csv
-202406-divvy-tripdata.csv
-202405-divvy-tripdata.csv
-202404-divvy-tripdata.csv
-202403-divvy-tripdata.csv
-202402-divvy-tripdata.csv
-202401-divvy-tripdata.csv
-202312-divvy-tripdata.csv
-202311-divvy-tripdata.csv
-202310-divvy-tripdata.csv
-202309-divvy-tripdata.csv
-202308-divvy-tripdata.csv

Is the data ROCCC? (Reliable, Original, Comprehensive, Current, and Cited)

  1. Reliability: Yes. The dataset comes from the City of Chicago’s Divvy program, which is managed by a publicly traded company called Lyft Bikes and Scooters, LLC.
  2. Originality: Yes. The data is original and collected directly from the first party.
  3. Comprehensiveness: Yes. The data is comprehensive to achieve the business task, and answer the questions at hand.
  4. Current: Yes. The data is current. It has been collected in the past 12 months.
  5. Cited: Yes. The data has been made available by Motivate International Inc. under the license. Cited by Mobility on Demand (MOD) Sandbox Demonstration.

Preparing RStudio - load required packages.

library(tidyverse)
library(lubridate)
library(skimr)
library(scales)

Reading CSV files into Data Frames.

tripdata202407 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202407-divvy-tripdata.csv")
tripdata202406 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202406-divvy-tripdata.csv")
tripdata202405 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202405-divvy-tripdata.csv")
tripdata202404 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202404-divvy-tripdata.csv")
tripdata202403 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202403-divvy-tripdata.csv")
tripdata202402 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202402-divvy-tripdata.csv")
tripdata202401 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202401-divvy-tripdata.csv")
tripdata202312 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202312-divvy-tripdata.csv")
tripdata202311 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202311-divvy-tripdata.csv")
tripdata202310 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202310-divvy-tripdata.csv")
tripdata202309 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202309-divvy-tripdata.csv")
tripdata202308 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202308-divvy-tripdata.csv")

Each .CSV file contains data organized into rows and columns. Each row represents a single trip, identified by a unique ride_id. The dataset includes 13 columns for each trip, providing comprehensive details on the ride.

glimpse(tripdata202407)
## Rows: 748,962
## Columns: 13
## $ ride_id            <chr> "2658E319B13141F9", "B2176315168A47CE", "C2A9D33DF7…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2024-07-11 08:15:14.784", "2024-07-11 15:45:07.851…
## $ ended_at           <chr> "2024-07-11 08:17:56.335", "2024-07-11 16:06:04.243…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "California Ave & M…
## $ start_station_id   <chr> "", "", "", "", "", "", "", "", "13084", "", "", ""…
## $ end_station_name   <chr> "", "", "", "", "", "", "", "", "California Ave & M…
## $ end_station_id     <chr> "", "", "", "", "", "", "", "", "13084", "", "", ""…
## $ start_lat          <dbl> 41.80000, 41.79000, 41.79000, 41.88000, 41.95000, 4…
## $ start_lng          <dbl> -87.59000, -87.60000, -87.59000, -87.64000, -87.640…
## $ end_lat            <dbl> 41.79000, 41.80000, 41.79000, 41.90000, 41.91000, 4…
## $ end_lng            <dbl> -87.59000, -87.59000, -87.60000, -87.67000, -87.620…
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "…

Step 3: Process

The third step is to inspect the data for errors, clean, and transform it to enable effective analysis.
I am checking each Data Frame for consistency to ensure the column counts and data types match before merging them.

glimpse(tripdata202308)
## Rows: 771,693
## Columns: 13
## $ ride_id            <chr> "903C30C2D810A53B", "F2FB18A98E110A2B", "D0DEC7C94E…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <chr> "2023-08-19 15:41:53", "2023-08-18 15:30:18", "2023…
## $ ended_at           <chr> "2023-08-19 15:53:36", "2023-08-18 15:45:25", "2023…
## $ start_station_name <chr> "LaSalle St & Illinois St", "Clark St & Randolph St…
## $ start_station_id   <chr> "13430", "TA1305000030", "TA1305000030", "KA1504000…
## $ end_station_name   <chr> "Clark St & Elm St", "", "", "", "", "", "", "", ""…
## $ end_station_id     <chr> "TA1307000039", "", "", "", "", "", "", "", "", "",…
## $ start_lat          <dbl> 41.89072, 41.88451, 41.88498, 41.90310, 41.88555, 4…
## $ start_lng          <dbl> -87.63148, -87.63155, -87.63079, -87.63467, -87.632…
## $ end_lat            <dbl> 41.90297, 41.93000, 41.91000, 41.90000, 41.89000, 4…
## $ end_lng            <dbl> -87.63128, -87.64000, -87.63000, -87.62000, -87.680…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
glimpse(tripdata202309)
## Rows: 666,371
## Columns: 13
## $ ride_id            <chr> "011C1903BF4E2E28", "87DB80E048A1BF9F", "7C2EB7AF66…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "electric_bike", "c…
## $ started_at         <chr> "2023-09-23 00:27:50", "2023-09-02 09:26:43", "2023…
## $ ended_at           <chr> "2023-09-23 00:33:27", "2023-09-02 09:38:19", "2023…
## $ start_station_name <chr> "Halsted St & Wrightwood Ave", "Clark St & Drummond…
## $ start_station_id   <chr> "TA1309000061", "TA1307000142", "SL-010", "TA130700…
## $ end_station_name   <chr> "Sheffield Ave & Wellington Ave", "Racine Ave & Ful…
## $ end_station_id     <chr> "TA1307000052", "TA1306000026", "13304", "TA1308000…
## $ start_lat          <dbl> 41.92914, 41.93125, 41.87506, 41.93125, 41.92914, 4…
## $ start_lng          <dbl> -87.64908, -87.64434, -87.63314, -87.64434, -87.649…
## $ end_lat            <dbl> 41.93625, 41.92557, 41.86127, 41.93974, 41.92557, 4…
## $ end_lng            <dbl> -87.65266, -87.65842, -87.65663, -87.65887, -87.658…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…

Consistency of File Structure

Using the method above, I verified that all 12 data sets have consistent column names, types, and counts. I then combined them into a single Data Frame called all_data.

all_data <- bind_rows(tripdata202407,tripdata202406,tripdata202405,tripdata202404,tripdata202403,tripdata202402,tripdata202401, tripdata202312,tripdata202311,tripdata202310,tripdata202309, tripdata202308)

The data frame is evaluated.

skim_without_charts(all_data)
Data summary
Name all_data
Number of rows 5715693
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 5715482 0
rideable_type 0 1 11 13 0 3 0
started_at 0 1 19 23 0 5100086 0
ended_at 0 1 19 23 0 5107236 0
start_station_name 0 1 0 64 947025 1706 0
start_station_id 0 1 0 14 947025 1670 0
end_station_name 0 1 0 64 989476 1720 0
end_station_id 0 1 0 36 989476 1682 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.63 41.88 41.90 41.93 42.07
start_lng 0 1 -87.65 0.03 -87.94 -87.66 -87.64 -87.63 -87.46
end_lat 7756 1 41.90 0.05 0.00 41.88 41.90 41.93 87.96
end_lng 7756 1 -87.65 0.05 -144.05 -87.66 -87.64 -87.63 0.00

Clean Duplication

There is a discrepancy between the total row count and the unique ride ID count: 5,715,693 rows versus 5,715,482 unique ride IDs. This suggests there may be duplicated trips in the all_data data set.

duplicates <- all_data %>% 
  group_by(ride_id) %>% 
  mutate(dupe = n()>1) %>% 
  filter(dupe == 'TRUE')

nrow(duplicates)
## [1] 422

A total of 422 records are duplicated. I’ll examine an example to identify the underlying cause.

all_data %>% 
  filter(ride_id =='3B5CE4D8B3EE6ED8')
##            ride_id rideable_type              started_at
## 1 3B5CE4D8B3EE6ED8 electric_bike 2024-05-31 23:50:04.153
## 2 3B5CE4D8B3EE6ED8 electric_bike     2024-05-31 23:50:04
##                  ended_at start_station_name start_station_id end_station_name
## 1 2024-06-01 00:06:08.273                                                     
## 2     2024-06-01 00:06:08                                                     
##   end_station_id start_lat start_lng end_lat end_lng member_casual
## 1                    41.98    -87.67   41.98  -87.67        casual
## 2                    41.98    -87.67   41.98  -87.67        casual

The started_at date is in May, while the ended_at date is in June. Upon reviewing the list of duplicates, all records show a similar pattern: the trip starts in one month and ends in the next.

tripdata202405 %>% 
  filter(ride_id =='3B5CE4D8B3EE6ED8')
##            ride_id rideable_type          started_at            ended_at
## 1 3B5CE4D8B3EE6ED8 electric_bike 2024-05-31 23:50:04 2024-06-01 00:06:08
##   start_station_name start_station_id end_station_name end_station_id start_lat
## 1                                                                         41.98
##   start_lng end_lat end_lng member_casual
## 1    -87.67   41.98  -87.67        casual
tripdata202406 %>% 
  filter(ride_id =='3B5CE4D8B3EE6ED8')
##            ride_id rideable_type              started_at
## 1 3B5CE4D8B3EE6ED8 electric_bike 2024-05-31 23:50:04.153
##                  ended_at start_station_name start_station_id end_station_name
## 1 2024-06-01 00:06:08.273                                                     
##   end_station_id start_lat start_lng end_lat end_lng member_casual
## 1                    41.98    -87.67   41.98  -87.67        casual

This confirms that the same ride_id appears in two different source files: one corresponding to the month when the trip started and the other corresponding to the month when the trip ended.

Eliminating the duplicated trips.

all_data_2 <- distinct(all_data, ride_id,.keep_all = TRUE)
skim_without_charts(all_data_2)
Data summary
Name all_data_2
Number of rows 5715482
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 5715482 0
rideable_type 0 1 11 13 0 3 0
started_at 0 1 19 23 0 5099911 0
ended_at 0 1 19 23 0 5107031 0
start_station_name 0 1 0 64 947002 1706 0
start_station_id 0 1 0 14 947002 1670 0
end_station_name 0 1 0 64 989396 1720 0
end_station_id 0 1 0 36 989396 1682 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.63 41.88 41.90 41.93 42.07
start_lng 0 1 -87.65 0.03 -87.94 -87.66 -87.64 -87.63 -87.46
end_lat 7716 1 41.90 0.05 0.00 41.88 41.90 41.93 87.96
end_lng 7716 1 -87.65 0.05 -144.05 -87.66 -87.64 -87.63 0.00

Now, the total number of rows and the unique count for ride_id both match at 5,715,482. The duplicates we identified initially totaled 422 records. After removing these duplicates, 211 duplicated entries have been eliminated, which aligns with the expected result: 5,715,693 total rows minus 211 equals 5,715,482.

Add Duration Field

Next, I want to analyze the date, time, and duration of trips. To do this, I first convert the started_at and ended_at fields from character format to date format. Then, I create a new column, duration_trip, which calculates the difference between ended_at and started_at.

str(all_data_2)
## 'data.frame':    5715482 obs. of  13 variables:
##  $ ride_id           : chr  "2658E319B13141F9" "B2176315168A47CE" "C2A9D33DF7EBB422" "8BFEA406DF01D8AD" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2024-07-11 08:15:14.784" "2024-07-11 15:45:07.851" "2024-07-11 08:24:48.192" "2024-07-11 08:46:06.864" ...
##  $ ended_at          : chr  "2024-07-11 08:17:56.335" "2024-07-11 16:06:04.243" "2024-07-11 08:28:05.237" "2024-07-11 09:14:11.664" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.8 41.8 41.8 41.9 42 ...
##  $ start_lng         : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  41.8 41.8 41.8 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
all_data_2 <- all_data_2 %>%
  mutate(started_time = ymd_hms(started_at), ended_time = ymd_hms(ended_at)) %>%
  mutate(duration_trip = difftime(ended_time, started_time, units='mins' )) %>% 
  arrange(duration_trip)

str(all_data_2)
## 'data.frame':    5715482 obs. of  16 variables:
##  $ ride_id           : chr  "F584D47AE67FD388" "AE046C379C20B7CA" "A21D6507DA3C5AD4" "DEC5EF8DE27398A0" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2023-11-05 21:08:17" "2023-11-05 20:46:59" "2023-11-05 16:41:54" "2023-11-05 11:56:19" ...
##  $ ended_at          : chr  "2023-10-25 07:31:46" "2023-10-25 07:31:46" "2023-10-25 07:31:46" "2023-10-25 07:31:46" ...
##  $ start_station_name: chr  "Sheffield Ave & Waveland Ave" "Sheridan Rd & Irving Park Rd" "Pine Grove Ave & Irving Park Rd" "Pine Grove Ave & Irving Park Rd" ...
##  $ start_station_id  : chr  "TA1307000126" "13063" "TA1308000022" "TA1308000022" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 42 42 42 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  42 42 42 42 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "member" "member" "casual" ...
##  $ started_time      : POSIXct, format: "2023-11-05 21:08:17" "2023-11-05 20:46:59" ...
##  $ ended_time        : POSIXct, format: "2023-10-25 07:31:46" "2023-10-25 07:31:46" ...
##  $ duration_trip     : 'difftime' num  -16656.5166666667 -16635.2166666667 -16390.1333333333 -16104.55 ...
##   ..- attr(*, "units")= chr "mins"

Errors in Durations

I will examine the top 30 trip durations in both ascending and descending order. Start by checking the top 30 longest durations in descending order.

all_data_2%>% 
  select(ride_id,duration_trip) %>% 
  arrange(desc(duration_trip)) %>% 
  head(., 30)
##             ride_id duration_trip
## 1  992B647C22E22513 98489.07 mins
## 2  24154619AEE96E2C 92569.92 mins
## 3  9EC4ACDAA96E516F 83382.58 mins
## 4  ED31EE7C8024D292 79775.02 mins
## 5  1A572575C18B600B 64171.70 mins
## 6  07DD54EC457E4395 64009.25 mins
## 7  9DA133B7B8988662 62867.10 mins
## 8  0219824575E2861E 56194.53 mins
## 9  36871DD19ACB6213 55070.43 mins
## 10 6154C42892409B22 49928.55 mins
## 11 546DC24A21A6F971 41362.72 mins
## 12 26F68D0429B6BF91 40567.23 mins
## 13 50FF52776B3FE2F4 37943.72 mins
## 14 06CAB713FD37C365 37333.05 mins
## 15 B1D738A1392D2ABF 36839.50 mins
## 16 E64733C6B18FD768 36825.88 mins
## 17 915085EDE8057FBA 36235.98 mins
## 18 62FC1E17C20B5116 35203.85 mins
## 19 C5CE2D735E80F85A 33839.85 mins
## 20 41D1CC391FC73736 33740.62 mins
## 21 19AE36E06971290E 32327.80 mins
## 22 305EFFAC8E5FF16C 32290.33 mins
## 23 7406AFEC7CD42296 29431.98 mins
## 24 0C46B573756F7492 27519.60 mins
## 25 3F41803650E03489 27329.65 mins
## 26 99DBC39AA8909D49 24998.60 mins
## 27 C44274716AF61542 24322.65 mins
## 28 C3A4B0708604BA46 23850.43 mins
## 29 417895DBFBF67726 23802.68 mins
## 30 FE924E5E15D781CB 23740.68 mins

The top records indicate trip durations exceeding 2 months. Since most ride-sharing companies require bikes to be returned within 24 hours (1440 minutes), I will remove any trips longer than 24 hours to exclude non-trip records, such as those related to repairs or maintenance.

counts_more_one_day <-all_data_2 %>% 
  filter(duration_trip > 1440) 

nrow(counts_more_one_day)
## [1] 7958

The total of rows that is longer than 24 hr is 7,958. Removing those.

nrow(all_data_2)
## [1] 5715482
all_data_3 <- all_data_2 %>% 
  filter(duration_trip <= 1440)

nrow(all_data_3)
## [1] 5707524
nrow(all_data_2)-nrow(all_data_3)
## [1] 7958

The reduction is 7958. This matches with the expected reduction in counts.

Next, I review the top 30 shortest trip durations, sorted in ascending order.

all_data_3%>% 
  select(ride_id,duration_trip) %>% 
  arrange(duration_trip) %>% 
  head(., 30)
##             ride_id     duration_trip
## 1  F584D47AE67FD388 -16656.51667 mins
## 2  AE046C379C20B7CA -16635.21667 mins
## 3  A21D6507DA3C5AD4 -16390.13333 mins
## 4  DEC5EF8DE27398A0 -16104.55000 mins
## 5  7850F6E2343BF766 -10626.40000 mins
## 6  5A5DDAFFF234FB69 -10475.75000 mins
## 7  9A4225B190E96360  -2748.31667 mins
## 8  0F93FAE9410E4A13  -1283.13333 mins
## 9  A9CF0518718BC255   -160.03333 mins
## 10 EE37F58076FC6659   -115.00000 mins
## 11 D8D9D4D695F852EA    -81.68333 mins
## 12 3CFD92C1661FD4BE    -72.68333 mins
## 13 8A5539BF7A770638    -72.63333 mins
## 14 ACFF55150DC33066    -69.66667 mins
## 15 998B2907B57E46C8    -69.33333 mins
## 16 2EF61FB1465E9D97    -68.61667 mins
## 17 40405DE1BEB5A318    -68.51667 mins
## 18 348C9F7C40B1549A    -67.86667 mins
## 19 F75D8921B3431AA5    -67.76667 mins
## 20 97E948FC84B894C9    -67.51667 mins
## 21 BE71C467B8FD1BCE    -67.23333 mins
## 22 5AB3D70A92A6F3FB    -67.01667 mins
## 23 2D45B11B6F2AC720    -66.80000 mins
## 24 3896F65BC5DCB40F    -66.50000 mins
## 25 16A9D3A073BA48C8    -64.96667 mins
## 26 342C65E863A728CB    -64.70000 mins
## 27 16703BADCB2D82C6    -64.18333 mins
## 28 2A616CC146F5E254    -63.50000 mins
## 29 3A8A13C546C52780    -62.65000 mins
## 30 DE66BFB1053661E6    -62.33333 mins

Some records have a negative duration for trips. I will remove these, along with any trips lasting 1 minute or less, to eliminate potential errors and non-trip records.

counts_less_one_min <-all_data_3 %>% 
  filter(duration_trip <= 1) 

nrow(counts_less_one_min)
## [1] 132261

The total of rows that has less than 1 min is 132,261. Removing those.

nrow(all_data_3)
## [1] 5707524
all_data_4 <- all_data_3 %>% 
  filter(duration_trip > 1)

nrow(all_data_4)
## [1] 5575263
nrow(all_data_3)-nrow(all_data_4)
## [1] 132261

The reduction is 132,261. This matches with the expected reduction in counts.

Rideable Types Error

Checking rideable_type in data set.

check_rideable_type <- all_data_4 %>% 
  group_by(rideable_type) %>% 
  summarize(n())

print(check_rideable_type)
## # A tibble: 3 × 2
##   rideable_type   `n()`
##   <chr>           <int>
## 1 classic_bike  2822057
## 2 docked_bike     15459
## 3 electric_bike 2737747

There are 15,459 records listed under docked_bike, which is not a valid bike type. I will remove these records.

nrow(all_data_4)
## [1] 5575263
all_data_5 <- all_data_4 %>% 
  filter(rideable_type !='docked_bike')
nrow(all_data_5)
## [1] 5559804
nrow(all_data_4)-nrow(all_data_5)
## [1] 15459

The reduction is 15,459. This matches with the expected reduction in counts.

Transform

To analyze the time of day, day of the week, and month of the year for trips, I will add new fields to the data set. Additionally, I will remove any fields that are no longer relevant to the study. I will retain the following fields: ‘ride_id’, ‘rideable_type’, member_casual, started_time, ended_time, duration_trip, and start_station_name for my analysis.

all_data_6 <- all_data_5 %>% 
  select(ride_id, rideable_type, member_casual, started_time, ended_time, duration_trip, start_station_name) %>% 
  mutate(day_of_week = wday(started_time, label =TRUE, week_start=1)) %>%
  mutate(time_of_day = hour(started_time)) %>%
  mutate(month_of_year=month(started_time)) 

head(all_data_6)
##            ride_id rideable_type member_casual        started_time
## 1 40BE0A8329F9D037 electric_bike        casual 2024-07-01 13:52:47
## 2 9E7FFA938C011405 electric_bike        casual 2024-07-19 18:48:30
## 3 3D205C0CBDA67EB2 electric_bike        casual 2024-07-05 11:44:25
## 4 24FA4E059A078BFC electric_bike        member 2024-07-12 18:00:20
## 5 C48702E18A51D9F2 electric_bike        casual 2024-06-22 15:54:08
## 6 D86A3D7B47129D63 electric_bike        casual 2024-06-10 13:18:06
##            ended_time duration_trip                 start_station_name
## 1 2024-07-01 13:53:47 1.000017 mins                                   
## 2 2024-07-19 18:49:30 1.000050 mins DuSable Lake Shore Dr & North Blvd
## 3 2024-07-05 11:45:25 1.000083 mins                                   
## 4 2024-07-12 18:01:20 1.000100 mins               Halsted St & 18th St
## 5 2024-06-22 15:55:08 1.000150 mins                                   
## 6 2024-06-10 13:19:06 1.000200 mins                                   
##   day_of_week time_of_day month_of_year
## 1         Mon          13             7
## 2         Fri          18             7
## 3         Fri          11             7
## 4         Fri          18             7
## 5         Sat          15             6
## 6         Mon          13             6

Step 4: Analyze

Data is cleaned and organized, and ready for analysis. Total of 5,559,804 trip records are used for the analysis. In this step, I will find patterns, relationships and trends of trips.

Trips by Member Type

I want to check number of trips for annual members compared to casual riders.

ggplot(data = all_data_6) +
  geom_bar(mapping = aes(x = member_casual,fill=member_casual))+
  ggtitle("Number of Trips by Member Type")+
  scale_y_continuous(labels = label_comma())+
  ylab("count of trips")

The counts of annual members trips are much greater.

Type of Bikes Used

How do different types of bikes get used by annual members and casual riders?

ggplot(data = all_data_6) +
  geom_bar(mapping = aes(x = member_casual, fill=rideable_type))+
  ggtitle("Number of Trips - Rideable Types")+
  scale_y_continuous(labels = label_comma())+
  ylab("count of trips")

It shows that both annual members and casual riders use classic and electric bikes approximately equally.

Weekly Riding Patterns

How do riders take trips throughout the week?

ggplot(data=all_data_6) +
  geom_bar(mapping = aes(x = day_of_week,fill=member_casual)) +
  facet_wrap(~member_casual)+
  ggtitle("Number of Trips by Day of Week")+
  scale_y_continuous(labels = label_comma())+
  ylab("count of trips")

It shows annual members takes more trips during weekdays, and casual riders ride the most trips on weekends.

Monthly Riding Patterns

Next, I will analyze the monthly riding patterns of annual members versus casual riders. To achieve this, I will plot the number of trips for each month.

ggplot(data=all_data_6) +
  geom_bar(mapping = aes(x = month_of_year,fill =member_casual)) +
  facet_wrap(~member_casual)+
  scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12))+
  ggtitle("Number of Trips by Month")+
  scale_y_continuous(labels = label_comma())+
  ylab("count of trips")

The data reveals that both types of riders experience peak usage during June, July, and August.

Daily Riding Patterns

Next, I want to analyze trips by time of day.

ggplot(data=all_data_6, aes(x=time_of_day, fill=member_casual)) +
  geom_histogram(bins = 24, color="black") +
  facet_wrap(~member_casual)+
  scale_x_continuous(breaks=c(0,3,6,9,12,15,18,21,24))+
  ggtitle("Number of Trips by Time of Day")+
  scale_y_continuous(labels = label_comma())+
  ylab("count of trips")

Since annual members mostly ride during weekdays and casual riders primarily on weekends, I will separately analyze trips for weekdays and weekends.

all_data_6 %>%
  filter(day_of_week== 'Mon' | day_of_week =='Tue'| day_of_week =='Wed' | day_of_week =='Thu' |day_of_week == 'Fri') %>% 
  ggplot(aes(x=time_of_day, fill = member_casual)) +
  geom_histogram(bins = 24, color="black")+
  scale_x_continuous(breaks=c(0,3,6,9,12,15,18,21,24))+
  facet_wrap(~member_casual)+
  ggtitle("Number of Trips - Weekdays")+
  scale_y_continuous(labels = label_comma())+
  ylab("count of trips")

The data shows that on weekdays, annual members experience peak usage between 7–8 AM and 4–6 PM, while casual riders peak around 5 PM.

Next, let’s plot the same data for weekends.

all_data_6 %>%
  filter(day_of_week== 'Sat' | day_of_week =='Sun') %>% 
  ggplot(aes(x=time_of_day, fill=member_casual))+
  geom_histogram(bins = 24, color="black")+
  scale_x_continuous(breaks=c(0,3,6,9,12,15,18,21,24))+
  facet_wrap(~member_casual)+
  ggtitle("Number of Trips - Weekends")+
  scale_y_continuous(labels = label_comma())+
  ylab("count of trips")

On weekends, both types of riders tend to ride throughout the day, from 10 AM to 7 PM.

Duration and Count of Trips

How does the duration of trips differ between annual members and casual riders? Let’s get the average duration, the count of trips, and the percentage of trip counts for annual members and casual riders.

summary_trips <- all_data_6 %>% 
  group_by(member_casual) %>% 
  summarize(counts_usage = n(), average_duration = mean(duration_trip)) 


summary_trips_percentage<- summary_trips %>% 
  mutate(perc_counts = counts_usage/sum(counts_usage)*100)

print(summary_trips_percentage)
## # A tibble: 2 × 4
##   member_casual counts_usage average_duration perc_counts
##   <chr>                <int> <drtn>                 <dbl>
## 1 casual             1960930 21.56300 mins           35.3
## 2 member             3598874 12.56288 mins           64.7

The average trip duration by casual riders is 22 mins vs. the average trip duration by annual members is 13 mins. 65% of the trip are made by annual members.

Frequently Used Stations

What are the frequently used stations by annual members and casual riders? First, I am showing the top 30 stations used by annual members.

all_data_6 %>% 
  filter(start_station_name != "") %>% 
  filter(member_casual =="member") %>% 
  group_by(start_station_name) %>% 
  summarize(counts = n()) %>% 
  arrange(desc(counts)) %>% 
  top_n(10)
## Selecting by counts
## # A tibble: 10 × 2
##    start_station_name           counts
##    <chr>                         <int>
##  1 Clinton St & Washington Blvd  27524
##  2 Kingsbury St & Kinzie St      26666
##  3 Clark St & Elm St             24115
##  4 Clinton St & Madison St       22884
##  5 Wells St & Concord Ln         20299
##  6 Wells St & Elm St             20031
##  7 University Ave & 57th St      18849
##  8 Clinton St & Jackson Blvd     18039
##  9 State St & Chicago Ave        17949
## 10 Dearborn St & Erie St         17685

Next, I am showing the top 30 stations used by casual riders.

all_data_6 %>% 
  filter(start_station_name != "") %>% 
  filter(member_casual =="casual") %>% 
  group_by(start_station_name) %>% 
  summarize(counts = n()) %>% 
  arrange(desc(counts)) %>% 
  top_n(10)
## Selecting by counts
## # A tibble: 10 × 2
##    start_station_name                 counts
##    <chr>                               <int>
##  1 Streeter Dr & Grand Ave             46036
##  2 DuSable Lake Shore Dr & Monroe St   30318
##  3 Michigan Ave & Oak St               23106
##  4 DuSable Lake Shore Dr & North Blvd  20905
##  5 Millennium Park                     19502
##  6 Shedd Aquarium                      19021
##  7 Dusable Harbor                      16960
##  8 Theater on the Lake                 15784
##  9 Adler Planetarium                   12588
## 10 Michigan Ave & 8th St               12138

The stations used frequently by annual members and casual riders are quite different.

Step 5: Share

There are the key findings.

Key Findings

-Annual members ride more trips (65%) than casual riders (35%).
-Casual riders ride longer trips (av 22 mins) than annual members (av 13 mins).
-Casual riders ride more on weekends.
-Annual members ride more during the commute hours, 7-8am and 4-6pm on weekdays .
-The peak season for both annual members and casual riders are in June, July and Aug.
-Frequently used stations by annual member and casual riders are different.

Step 6: Act

Based on the key findings, these are my top 3 recommendations.

My top 3 recommendations

  1. Introduce an annual membership with weekend discounts: Casual riders tend to ride most frequently during the weekends. By offering an annual membership pass that includes a weekend discount, we can encourage these casual riders to switch to annual members.

  2. Create a seasonal membership. Casual riders, along with annual members, tend to ride most frequently during June, July, and August. Offering a seasonal membership pass specifically for these peak months could attract casual riders to become members.

  3. Target advertising at key stations for casual riders. The top 10 stations most frequented by casual riders differ significantly from those used by annual members, providing Cyclistic with a strategic opportunity to convert casual riders into members. By focusing advertising efforts at these high-traffic stations, Cyclistic can effectively reach its target audience and promote membership benefits. The top 10 stations by casual rider usage are:

    1. Streeter Dr & Grand Ave 47170
    2. DuSable Lake Shore Dr & Monroe St 31101
    3. Michigan Ave & Oak St 23579
    4. DuSable Lake Shore Dr & North Blvd 21278
    5. Millennium Park 20049
    6. Shedd Aquarium 19472
    7. Dusable Harbor 17364
    8. Theater on the Lake 16062
    9. Adler Planetarium 12884
    10. Michigan Ave & 8th St 12440