This case study is my capstone project for the Google Data Analytics Certificate. It involves analysis of historical data for a fictional company, Cyclistic, a bike sharing company in Chicago. Although the company and scenario are fictitious, the data used for this project are real data collected between August 2023 – July 2024 from a bike share program in Chicago. In this project I am assuming the role of the junior analyst. I use R studio desktop environment for my analysis.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistics marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistics finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, the director of marketing believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, the director of marketing believes there is a solid opportunity to convert casual riders into members.
The director of marketing has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. My role here is to analyze the Cyclistic historical bike trip data to identify trends.
The first step is to identify the business task and consider the stakeholders.
Business Tasks
Stakeholders
The second step is to prepare data needed, and identify how it is organized, and determine credibility of the data.
Data Source
The data has been made available here by Motivate International Inc. under this license. For privacy purposes riders personally identifiable information has been excluded from the data.
Data Used
I am using the most recent dataset covering the past 12 months, from August 2023 to July 2024.
-202407-divvy-tripdata.csv
-202406-divvy-tripdata.csv
-202405-divvy-tripdata.csv
-202404-divvy-tripdata.csv
-202403-divvy-tripdata.csv
-202402-divvy-tripdata.csv
-202401-divvy-tripdata.csv
-202312-divvy-tripdata.csv
-202311-divvy-tripdata.csv
-202310-divvy-tripdata.csv
-202309-divvy-tripdata.csv
-202308-divvy-tripdata.csv
Is the data ROCCC? (Reliable, Original, Comprehensive, Current, and Cited)
Preparing RStudio - load required packages.
library(tidyverse)
library(lubridate)
library(skimr)
library(scales)
Reading CSV files into Data Frames.
tripdata202407 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202407-divvy-tripdata.csv")
tripdata202406 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202406-divvy-tripdata.csv")
tripdata202405 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202405-divvy-tripdata.csv")
tripdata202404 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202404-divvy-tripdata.csv")
tripdata202403 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202403-divvy-tripdata.csv")
tripdata202402 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202402-divvy-tripdata.csv")
tripdata202401 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202401-divvy-tripdata.csv")
tripdata202312 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202312-divvy-tripdata.csv")
tripdata202311 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202311-divvy-tripdata.csv")
tripdata202310 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202310-divvy-tripdata.csv")
tripdata202309 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202309-divvy-tripdata.csv")
tripdata202308 = read.csv("C:\\Users\\asosa\\OneDrive\\Documents\\Data_Projects\\Case_Study\\Data\\202308-divvy-tripdata.csv")
Each .CSV file contains data organized into rows and columns. Each row represents a single trip, identified by a unique ride_id. The dataset includes 13 columns for each trip, providing comprehensive details on the ride.
glimpse(tripdata202407)
## Rows: 748,962
## Columns: 13
## $ ride_id <chr> "2658E319B13141F9", "B2176315168A47CE", "C2A9D33DF7…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <chr> "2024-07-11 08:15:14.784", "2024-07-11 15:45:07.851…
## $ ended_at <chr> "2024-07-11 08:17:56.335", "2024-07-11 16:06:04.243…
## $ start_station_name <chr> "", "", "", "", "", "", "", "", "California Ave & M…
## $ start_station_id <chr> "", "", "", "", "", "", "", "", "13084", "", "", ""…
## $ end_station_name <chr> "", "", "", "", "", "", "", "", "California Ave & M…
## $ end_station_id <chr> "", "", "", "", "", "", "", "", "13084", "", "", ""…
## $ start_lat <dbl> 41.80000, 41.79000, 41.79000, 41.88000, 41.95000, 4…
## $ start_lng <dbl> -87.59000, -87.60000, -87.59000, -87.64000, -87.640…
## $ end_lat <dbl> 41.79000, 41.80000, 41.79000, 41.90000, 41.91000, 4…
## $ end_lng <dbl> -87.59000, -87.59000, -87.60000, -87.67000, -87.620…
## $ member_casual <chr> "casual", "casual", "casual", "casual", "casual", "…
The third step is to inspect the data for errors, clean, and
transform it to enable effective analysis.
I am checking each Data Frame for consistency to ensure the column
counts and data types match before merging them.
glimpse(tripdata202308)
## Rows: 771,693
## Columns: 13
## $ ride_id <chr> "903C30C2D810A53B", "F2FB18A98E110A2B", "D0DEC7C94E…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <chr> "2023-08-19 15:41:53", "2023-08-18 15:30:18", "2023…
## $ ended_at <chr> "2023-08-19 15:53:36", "2023-08-18 15:45:25", "2023…
## $ start_station_name <chr> "LaSalle St & Illinois St", "Clark St & Randolph St…
## $ start_station_id <chr> "13430", "TA1305000030", "TA1305000030", "KA1504000…
## $ end_station_name <chr> "Clark St & Elm St", "", "", "", "", "", "", "", ""…
## $ end_station_id <chr> "TA1307000039", "", "", "", "", "", "", "", "", "",…
## $ start_lat <dbl> 41.89072, 41.88451, 41.88498, 41.90310, 41.88555, 4…
## $ start_lng <dbl> -87.63148, -87.63155, -87.63079, -87.63467, -87.632…
## $ end_lat <dbl> 41.90297, 41.93000, 41.91000, 41.90000, 41.89000, 4…
## $ end_lng <dbl> -87.63128, -87.64000, -87.63000, -87.62000, -87.680…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
glimpse(tripdata202309)
## Rows: 666,371
## Columns: 13
## $ ride_id <chr> "011C1903BF4E2E28", "87DB80E048A1BF9F", "7C2EB7AF66…
## $ rideable_type <chr> "classic_bike", "classic_bike", "electric_bike", "c…
## $ started_at <chr> "2023-09-23 00:27:50", "2023-09-02 09:26:43", "2023…
## $ ended_at <chr> "2023-09-23 00:33:27", "2023-09-02 09:38:19", "2023…
## $ start_station_name <chr> "Halsted St & Wrightwood Ave", "Clark St & Drummond…
## $ start_station_id <chr> "TA1309000061", "TA1307000142", "SL-010", "TA130700…
## $ end_station_name <chr> "Sheffield Ave & Wellington Ave", "Racine Ave & Ful…
## $ end_station_id <chr> "TA1307000052", "TA1306000026", "13304", "TA1308000…
## $ start_lat <dbl> 41.92914, 41.93125, 41.87506, 41.93125, 41.92914, 4…
## $ start_lng <dbl> -87.64908, -87.64434, -87.63314, -87.64434, -87.649…
## $ end_lat <dbl> 41.93625, 41.92557, 41.86127, 41.93974, 41.92557, 4…
## $ end_lng <dbl> -87.65266, -87.65842, -87.65663, -87.65887, -87.658…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
Using the method above, I verified that all 12 data sets have consistent column names, types, and counts. I then combined them into a single Data Frame called all_data.
all_data <- bind_rows(tripdata202407,tripdata202406,tripdata202405,tripdata202404,tripdata202403,tripdata202402,tripdata202401, tripdata202312,tripdata202311,tripdata202310,tripdata202309, tripdata202308)
The data frame is evaluated.
skim_without_charts(all_data)
| Name | all_data |
| Number of rows | 5715693 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5715482 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 23 | 0 | 5100086 | 0 |
| ended_at | 0 | 1 | 19 | 23 | 0 | 5107236 | 0 |
| start_station_name | 0 | 1 | 0 | 64 | 947025 | 1706 | 0 |
| start_station_id | 0 | 1 | 0 | 14 | 947025 | 1670 | 0 |
| end_station_name | 0 | 1 | 0 | 64 | 989476 | 1720 | 0 |
| end_station_id | 0 | 1 | 0 | 36 | 989476 | 1682 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.63 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.94 | -87.66 | -87.64 | -87.63 | -87.46 |
| end_lat | 7756 | 1 | 41.90 | 0.05 | 0.00 | 41.88 | 41.90 | 41.93 | 87.96 |
| end_lng | 7756 | 1 | -87.65 | 0.05 | -144.05 | -87.66 | -87.64 | -87.63 | 0.00 |
There is a discrepancy between the total row count and the unique ride ID count: 5,715,693 rows versus 5,715,482 unique ride IDs. This suggests there may be duplicated trips in the all_data data set.
duplicates <- all_data %>%
group_by(ride_id) %>%
mutate(dupe = n()>1) %>%
filter(dupe == 'TRUE')
nrow(duplicates)
## [1] 422
A total of 422 records are duplicated. I’ll examine an example to identify the underlying cause.
all_data %>%
filter(ride_id =='3B5CE4D8B3EE6ED8')
## ride_id rideable_type started_at
## 1 3B5CE4D8B3EE6ED8 electric_bike 2024-05-31 23:50:04.153
## 2 3B5CE4D8B3EE6ED8 electric_bike 2024-05-31 23:50:04
## ended_at start_station_name start_station_id end_station_name
## 1 2024-06-01 00:06:08.273
## 2 2024-06-01 00:06:08
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 41.98 -87.67 41.98 -87.67 casual
## 2 41.98 -87.67 41.98 -87.67 casual
The started_at date is in May, while the ended_at date is in June. Upon reviewing the list of duplicates, all records show a similar pattern: the trip starts in one month and ends in the next.
tripdata202405 %>%
filter(ride_id =='3B5CE4D8B3EE6ED8')
## ride_id rideable_type started_at ended_at
## 1 3B5CE4D8B3EE6ED8 electric_bike 2024-05-31 23:50:04 2024-06-01 00:06:08
## start_station_name start_station_id end_station_name end_station_id start_lat
## 1 41.98
## start_lng end_lat end_lng member_casual
## 1 -87.67 41.98 -87.67 casual
tripdata202406 %>%
filter(ride_id =='3B5CE4D8B3EE6ED8')
## ride_id rideable_type started_at
## 1 3B5CE4D8B3EE6ED8 electric_bike 2024-05-31 23:50:04.153
## ended_at start_station_name start_station_id end_station_name
## 1 2024-06-01 00:06:08.273
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 41.98 -87.67 41.98 -87.67 casual
This confirms that the same ride_id appears in two different source files: one corresponding to the month when the trip started and the other corresponding to the month when the trip ended.
Eliminating the duplicated trips.
all_data_2 <- distinct(all_data, ride_id,.keep_all = TRUE)
skim_without_charts(all_data_2)
| Name | all_data_2 |
| Number of rows | 5715482 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5715482 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 23 | 0 | 5099911 | 0 |
| ended_at | 0 | 1 | 19 | 23 | 0 | 5107031 | 0 |
| start_station_name | 0 | 1 | 0 | 64 | 947002 | 1706 | 0 |
| start_station_id | 0 | 1 | 0 | 14 | 947002 | 1670 | 0 |
| end_station_name | 0 | 1 | 0 | 64 | 989396 | 1720 | 0 |
| end_station_id | 0 | 1 | 0 | 36 | 989396 | 1682 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.63 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.94 | -87.66 | -87.64 | -87.63 | -87.46 |
| end_lat | 7716 | 1 | 41.90 | 0.05 | 0.00 | 41.88 | 41.90 | 41.93 | 87.96 |
| end_lng | 7716 | 1 | -87.65 | 0.05 | -144.05 | -87.66 | -87.64 | -87.63 | 0.00 |
Now, the total number of rows and the unique count for ride_id both match at 5,715,482. The duplicates we identified initially totaled 422 records. After removing these duplicates, 211 duplicated entries have been eliminated, which aligns with the expected result: 5,715,693 total rows minus 211 equals 5,715,482.
Next, I want to analyze the date, time, and duration of trips. To do this, I first convert the started_at and ended_at fields from character format to date format. Then, I create a new column, duration_trip, which calculates the difference between ended_at and started_at.
str(all_data_2)
## 'data.frame': 5715482 obs. of 13 variables:
## $ ride_id : chr "2658E319B13141F9" "B2176315168A47CE" "C2A9D33DF7EBB422" "8BFEA406DF01D8AD" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2024-07-11 08:15:14.784" "2024-07-11 15:45:07.851" "2024-07-11 08:24:48.192" "2024-07-11 08:46:06.864" ...
## $ ended_at : chr "2024-07-11 08:17:56.335" "2024-07-11 16:06:04.243" "2024-07-11 08:28:05.237" "2024-07-11 09:14:11.664" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.8 41.8 41.8 41.9 42 ...
## $ start_lng : num -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num 41.8 41.8 41.8 41.9 41.9 ...
## $ end_lng : num -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
all_data_2 <- all_data_2 %>%
mutate(started_time = ymd_hms(started_at), ended_time = ymd_hms(ended_at)) %>%
mutate(duration_trip = difftime(ended_time, started_time, units='mins' )) %>%
arrange(duration_trip)
str(all_data_2)
## 'data.frame': 5715482 obs. of 16 variables:
## $ ride_id : chr "F584D47AE67FD388" "AE046C379C20B7CA" "A21D6507DA3C5AD4" "DEC5EF8DE27398A0" ...
## $ rideable_type : chr "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2023-11-05 21:08:17" "2023-11-05 20:46:59" "2023-11-05 16:41:54" "2023-11-05 11:56:19" ...
## $ ended_at : chr "2023-10-25 07:31:46" "2023-10-25 07:31:46" "2023-10-25 07:31:46" "2023-10-25 07:31:46" ...
## $ start_station_name: chr "Sheffield Ave & Waveland Ave" "Sheridan Rd & Irving Park Rd" "Pine Grove Ave & Irving Park Rd" "Pine Grove Ave & Irving Park Rd" ...
## $ start_station_id : chr "TA1307000126" "13063" "TA1308000022" "TA1308000022" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 42 42 42 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ end_lat : num 42 42 42 42 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr "casual" "member" "member" "casual" ...
## $ started_time : POSIXct, format: "2023-11-05 21:08:17" "2023-11-05 20:46:59" ...
## $ ended_time : POSIXct, format: "2023-10-25 07:31:46" "2023-10-25 07:31:46" ...
## $ duration_trip : 'difftime' num -16656.5166666667 -16635.2166666667 -16390.1333333333 -16104.55 ...
## ..- attr(*, "units")= chr "mins"
I will examine the top 30 trip durations in both ascending and descending order. Start by checking the top 30 longest durations in descending order.
all_data_2%>%
select(ride_id,duration_trip) %>%
arrange(desc(duration_trip)) %>%
head(., 30)
## ride_id duration_trip
## 1 992B647C22E22513 98489.07 mins
## 2 24154619AEE96E2C 92569.92 mins
## 3 9EC4ACDAA96E516F 83382.58 mins
## 4 ED31EE7C8024D292 79775.02 mins
## 5 1A572575C18B600B 64171.70 mins
## 6 07DD54EC457E4395 64009.25 mins
## 7 9DA133B7B8988662 62867.10 mins
## 8 0219824575E2861E 56194.53 mins
## 9 36871DD19ACB6213 55070.43 mins
## 10 6154C42892409B22 49928.55 mins
## 11 546DC24A21A6F971 41362.72 mins
## 12 26F68D0429B6BF91 40567.23 mins
## 13 50FF52776B3FE2F4 37943.72 mins
## 14 06CAB713FD37C365 37333.05 mins
## 15 B1D738A1392D2ABF 36839.50 mins
## 16 E64733C6B18FD768 36825.88 mins
## 17 915085EDE8057FBA 36235.98 mins
## 18 62FC1E17C20B5116 35203.85 mins
## 19 C5CE2D735E80F85A 33839.85 mins
## 20 41D1CC391FC73736 33740.62 mins
## 21 19AE36E06971290E 32327.80 mins
## 22 305EFFAC8E5FF16C 32290.33 mins
## 23 7406AFEC7CD42296 29431.98 mins
## 24 0C46B573756F7492 27519.60 mins
## 25 3F41803650E03489 27329.65 mins
## 26 99DBC39AA8909D49 24998.60 mins
## 27 C44274716AF61542 24322.65 mins
## 28 C3A4B0708604BA46 23850.43 mins
## 29 417895DBFBF67726 23802.68 mins
## 30 FE924E5E15D781CB 23740.68 mins
The top records indicate trip durations exceeding 2 months. Since most ride-sharing companies require bikes to be returned within 24 hours (1440 minutes), I will remove any trips longer than 24 hours to exclude non-trip records, such as those related to repairs or maintenance.
counts_more_one_day <-all_data_2 %>%
filter(duration_trip > 1440)
nrow(counts_more_one_day)
## [1] 7958
The total of rows that is longer than 24 hr is 7,958. Removing those.
nrow(all_data_2)
## [1] 5715482
all_data_3 <- all_data_2 %>%
filter(duration_trip <= 1440)
nrow(all_data_3)
## [1] 5707524
nrow(all_data_2)-nrow(all_data_3)
## [1] 7958
The reduction is 7958. This matches with the expected reduction in counts.
Next, I review the top 30 shortest trip durations, sorted in ascending order.
all_data_3%>%
select(ride_id,duration_trip) %>%
arrange(duration_trip) %>%
head(., 30)
## ride_id duration_trip
## 1 F584D47AE67FD388 -16656.51667 mins
## 2 AE046C379C20B7CA -16635.21667 mins
## 3 A21D6507DA3C5AD4 -16390.13333 mins
## 4 DEC5EF8DE27398A0 -16104.55000 mins
## 5 7850F6E2343BF766 -10626.40000 mins
## 6 5A5DDAFFF234FB69 -10475.75000 mins
## 7 9A4225B190E96360 -2748.31667 mins
## 8 0F93FAE9410E4A13 -1283.13333 mins
## 9 A9CF0518718BC255 -160.03333 mins
## 10 EE37F58076FC6659 -115.00000 mins
## 11 D8D9D4D695F852EA -81.68333 mins
## 12 3CFD92C1661FD4BE -72.68333 mins
## 13 8A5539BF7A770638 -72.63333 mins
## 14 ACFF55150DC33066 -69.66667 mins
## 15 998B2907B57E46C8 -69.33333 mins
## 16 2EF61FB1465E9D97 -68.61667 mins
## 17 40405DE1BEB5A318 -68.51667 mins
## 18 348C9F7C40B1549A -67.86667 mins
## 19 F75D8921B3431AA5 -67.76667 mins
## 20 97E948FC84B894C9 -67.51667 mins
## 21 BE71C467B8FD1BCE -67.23333 mins
## 22 5AB3D70A92A6F3FB -67.01667 mins
## 23 2D45B11B6F2AC720 -66.80000 mins
## 24 3896F65BC5DCB40F -66.50000 mins
## 25 16A9D3A073BA48C8 -64.96667 mins
## 26 342C65E863A728CB -64.70000 mins
## 27 16703BADCB2D82C6 -64.18333 mins
## 28 2A616CC146F5E254 -63.50000 mins
## 29 3A8A13C546C52780 -62.65000 mins
## 30 DE66BFB1053661E6 -62.33333 mins
Some records have a negative duration for trips. I will remove these, along with any trips lasting 1 minute or less, to eliminate potential errors and non-trip records.
counts_less_one_min <-all_data_3 %>%
filter(duration_trip <= 1)
nrow(counts_less_one_min)
## [1] 132261
The total of rows that has less than 1 min is 132,261. Removing those.
nrow(all_data_3)
## [1] 5707524
all_data_4 <- all_data_3 %>%
filter(duration_trip > 1)
nrow(all_data_4)
## [1] 5575263
nrow(all_data_3)-nrow(all_data_4)
## [1] 132261
The reduction is 132,261. This matches with the expected reduction in counts.
Checking rideable_type in data set.
check_rideable_type <- all_data_4 %>%
group_by(rideable_type) %>%
summarize(n())
print(check_rideable_type)
## # A tibble: 3 × 2
## rideable_type `n()`
## <chr> <int>
## 1 classic_bike 2822057
## 2 docked_bike 15459
## 3 electric_bike 2737747
There are 15,459 records listed under docked_bike, which is not a valid bike type. I will remove these records.
nrow(all_data_4)
## [1] 5575263
all_data_5 <- all_data_4 %>%
filter(rideable_type !='docked_bike')
nrow(all_data_5)
## [1] 5559804
nrow(all_data_4)-nrow(all_data_5)
## [1] 15459
The reduction is 15,459. This matches with the expected reduction in counts.
To analyze the time of day, day of the week, and month of the year for trips, I will add new fields to the data set. Additionally, I will remove any fields that are no longer relevant to the study. I will retain the following fields: ‘ride_id’, ‘rideable_type’, member_casual, started_time, ended_time, duration_trip, and start_station_name for my analysis.
all_data_6 <- all_data_5 %>%
select(ride_id, rideable_type, member_casual, started_time, ended_time, duration_trip, start_station_name) %>%
mutate(day_of_week = wday(started_time, label =TRUE, week_start=1)) %>%
mutate(time_of_day = hour(started_time)) %>%
mutate(month_of_year=month(started_time))
head(all_data_6)
## ride_id rideable_type member_casual started_time
## 1 40BE0A8329F9D037 electric_bike casual 2024-07-01 13:52:47
## 2 9E7FFA938C011405 electric_bike casual 2024-07-19 18:48:30
## 3 3D205C0CBDA67EB2 electric_bike casual 2024-07-05 11:44:25
## 4 24FA4E059A078BFC electric_bike member 2024-07-12 18:00:20
## 5 C48702E18A51D9F2 electric_bike casual 2024-06-22 15:54:08
## 6 D86A3D7B47129D63 electric_bike casual 2024-06-10 13:18:06
## ended_time duration_trip start_station_name
## 1 2024-07-01 13:53:47 1.000017 mins
## 2 2024-07-19 18:49:30 1.000050 mins DuSable Lake Shore Dr & North Blvd
## 3 2024-07-05 11:45:25 1.000083 mins
## 4 2024-07-12 18:01:20 1.000100 mins Halsted St & 18th St
## 5 2024-06-22 15:55:08 1.000150 mins
## 6 2024-06-10 13:19:06 1.000200 mins
## day_of_week time_of_day month_of_year
## 1 Mon 13 7
## 2 Fri 18 7
## 3 Fri 11 7
## 4 Fri 18 7
## 5 Sat 15 6
## 6 Mon 13 6
Data is cleaned and organized, and ready for analysis. Total of 5,559,804 trip records are used for the analysis. In this step, I will find patterns, relationships and trends of trips.
I want to check number of trips for annual members compared to casual riders.
ggplot(data = all_data_6) +
geom_bar(mapping = aes(x = member_casual,fill=member_casual))+
ggtitle("Number of Trips by Member Type")+
scale_y_continuous(labels = label_comma())+
ylab("count of trips")
The counts of annual members trips are much greater.
How do different types of bikes get used by annual members and casual riders?
ggplot(data = all_data_6) +
geom_bar(mapping = aes(x = member_casual, fill=rideable_type))+
ggtitle("Number of Trips - Rideable Types")+
scale_y_continuous(labels = label_comma())+
ylab("count of trips")
It shows that both annual members and casual riders use classic and electric bikes approximately equally.
How do riders take trips throughout the week?
ggplot(data=all_data_6) +
geom_bar(mapping = aes(x = day_of_week,fill=member_casual)) +
facet_wrap(~member_casual)+
ggtitle("Number of Trips by Day of Week")+
scale_y_continuous(labels = label_comma())+
ylab("count of trips")
It shows annual members takes more trips during weekdays, and casual riders ride the most trips on weekends.
Next, I will analyze the monthly riding patterns of annual members versus casual riders. To achieve this, I will plot the number of trips for each month.
ggplot(data=all_data_6) +
geom_bar(mapping = aes(x = month_of_year,fill =member_casual)) +
facet_wrap(~member_casual)+
scale_x_continuous(breaks=c(1,2,3,4,5,6,7,8,9,10,11,12))+
ggtitle("Number of Trips by Month")+
scale_y_continuous(labels = label_comma())+
ylab("count of trips")
The data reveals that both types of riders experience peak usage during June, July, and August.
Next, I want to analyze trips by time of day.
ggplot(data=all_data_6, aes(x=time_of_day, fill=member_casual)) +
geom_histogram(bins = 24, color="black") +
facet_wrap(~member_casual)+
scale_x_continuous(breaks=c(0,3,6,9,12,15,18,21,24))+
ggtitle("Number of Trips by Time of Day")+
scale_y_continuous(labels = label_comma())+
ylab("count of trips")
Since annual members mostly ride during weekdays and casual riders primarily on weekends, I will separately analyze trips for weekdays and weekends.
all_data_6 %>%
filter(day_of_week== 'Mon' | day_of_week =='Tue'| day_of_week =='Wed' | day_of_week =='Thu' |day_of_week == 'Fri') %>%
ggplot(aes(x=time_of_day, fill = member_casual)) +
geom_histogram(bins = 24, color="black")+
scale_x_continuous(breaks=c(0,3,6,9,12,15,18,21,24))+
facet_wrap(~member_casual)+
ggtitle("Number of Trips - Weekdays")+
scale_y_continuous(labels = label_comma())+
ylab("count of trips")
The data shows that on weekdays, annual members experience peak usage between 7–8 AM and 4–6 PM, while casual riders peak around 5 PM.
Next, let’s plot the same data for weekends.
all_data_6 %>%
filter(day_of_week== 'Sat' | day_of_week =='Sun') %>%
ggplot(aes(x=time_of_day, fill=member_casual))+
geom_histogram(bins = 24, color="black")+
scale_x_continuous(breaks=c(0,3,6,9,12,15,18,21,24))+
facet_wrap(~member_casual)+
ggtitle("Number of Trips - Weekends")+
scale_y_continuous(labels = label_comma())+
ylab("count of trips")
On weekends, both types of riders tend to ride throughout the day, from 10 AM to 7 PM.
How does the duration of trips differ between annual members and casual riders? Let’s get the average duration, the count of trips, and the percentage of trip counts for annual members and casual riders.
summary_trips <- all_data_6 %>%
group_by(member_casual) %>%
summarize(counts_usage = n(), average_duration = mean(duration_trip))
summary_trips_percentage<- summary_trips %>%
mutate(perc_counts = counts_usage/sum(counts_usage)*100)
print(summary_trips_percentage)
## # A tibble: 2 × 4
## member_casual counts_usage average_duration perc_counts
## <chr> <int> <drtn> <dbl>
## 1 casual 1960930 21.56300 mins 35.3
## 2 member 3598874 12.56288 mins 64.7
The average trip duration by casual riders is 22 mins vs. the average trip duration by annual members is 13 mins. 65% of the trip are made by annual members.
What are the frequently used stations by annual members and casual riders? First, I am showing the top 30 stations used by annual members.
all_data_6 %>%
filter(start_station_name != "") %>%
filter(member_casual =="member") %>%
group_by(start_station_name) %>%
summarize(counts = n()) %>%
arrange(desc(counts)) %>%
top_n(10)
## Selecting by counts
## # A tibble: 10 × 2
## start_station_name counts
## <chr> <int>
## 1 Clinton St & Washington Blvd 27524
## 2 Kingsbury St & Kinzie St 26666
## 3 Clark St & Elm St 24115
## 4 Clinton St & Madison St 22884
## 5 Wells St & Concord Ln 20299
## 6 Wells St & Elm St 20031
## 7 University Ave & 57th St 18849
## 8 Clinton St & Jackson Blvd 18039
## 9 State St & Chicago Ave 17949
## 10 Dearborn St & Erie St 17685
Next, I am showing the top 30 stations used by casual riders.
all_data_6 %>%
filter(start_station_name != "") %>%
filter(member_casual =="casual") %>%
group_by(start_station_name) %>%
summarize(counts = n()) %>%
arrange(desc(counts)) %>%
top_n(10)
## Selecting by counts
## # A tibble: 10 × 2
## start_station_name counts
## <chr> <int>
## 1 Streeter Dr & Grand Ave 46036
## 2 DuSable Lake Shore Dr & Monroe St 30318
## 3 Michigan Ave & Oak St 23106
## 4 DuSable Lake Shore Dr & North Blvd 20905
## 5 Millennium Park 19502
## 6 Shedd Aquarium 19021
## 7 Dusable Harbor 16960
## 8 Theater on the Lake 15784
## 9 Adler Planetarium 12588
## 10 Michigan Ave & 8th St 12138
The stations used frequently by annual members and casual riders are quite different.
Based on the key findings, these are my top 3 recommendations.
My top 3 recommendations
Introduce an annual membership with weekend discounts: Casual riders tend to ride most frequently during the weekends. By offering an annual membership pass that includes a weekend discount, we can encourage these casual riders to switch to annual members.
Create a seasonal membership. Casual riders, along with annual members, tend to ride most frequently during June, July, and August. Offering a seasonal membership pass specifically for these peak months could attract casual riders to become members.
Target advertising at key stations for casual riders. The top 10 stations most frequented by casual riders differ significantly from those used by annual members, providing Cyclistic with a strategic opportunity to convert casual riders into members. By focusing advertising efforts at these high-traffic stations, Cyclistic can effectively reach its target audience and promote membership benefits. The top 10 stations by casual rider usage are: