Welcome to the Cyclistic bike-share analysis case study! This case study is a part of my capstone project for Google Data Analytics Professional Certificate. In this scenario, I am a part of a the Cyclistic team, a fictional company, along with key team members. To address the business questions, I will follow the data analysis process, which involves the following steps: Ask, Prepare, Process, Analyze, Share, and Act.
In this phase, we will define the business task and the goals of the analysis. Furthermore, we will determine what we want to achieve with the data analysis such as understanding how different customer types use Cyclistic bikes and designing a new marketing strategy.
Business Task
To analyze Cyclistic’s past bike trip records and understand the distinct usage patterns between annual members and casual riders, and use these findings to devise an updated marketing approach targeting casual riders, with the aim of converting them into annual members.
Key Stakeholders
Lily Moreno: The director of marketing and manager at Cyclistic and is responsible for developing campaigns and initiatives to promote the bike-share program.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
Cyclistic executive team: The well-known detail-oriented executive team that will decide whether to approve the recommended marketing program.
We will gather and organize the data needed for analysis. This involves locating the data, ensuring its integrity, and addressing any private or security concerns. For this case study, I have downloaded the previous 12 months of bike trip data, which is public and anonymized to protect user privacy.
Data Source
The data has been provided by Motivate International Inc. under a specific license. This data is publicly available and can be used to analyze different customer types on Cyclistic bikes.
Data Limitation
It’s important to note that data-privacy issues prevent the use of personally identifiable information of riders. This limitation means that it is prohibited to link pass purchases to credit card numbers to determine whether casual riders reside in Cyclistic’s service area or have purchased multiple single passes.
Library Used
> library(tidyverse)
> library(conflicted)
> library(ggplot2)
> library(scales)
> library(dplyr)
>
> # Set dplyr::filter and dplyr::lag as the default choices
> conflict_prefer("filter", "dplyr")
> conflict_prefer("lag", "dplyr")
Collect Data
> q1_2019 <- read_csv("C:/Users/acer/Downloads/Divvy_Trips_2019_Q1.csv")
> q2_2019 <- read_csv("C:/Users/acer/Downloads/Divvy_Trips_2019_Q2.csv")
> q3_2019 <- read_csv("C:/Users/acer/Downloads/Divvy_Trips_2019_Q3.csv")
> q4_2019 <- read_csv("C:/Users/acer/Downloads/Divvy_Trips_2019_Q4.csv")
Sort and Filter Data
> colnames(q1_2019)
[1] "trip_id" "start_time" "end_time"
[4] "bikeid" "tripduration" "from_station_id"
[7] "from_station_name" "to_station_id" "to_station_name"
[10] "usertype" "gender" "birthyear"
> colnames(q2_2019)
[1] "01 - Rental Details Rental ID"
[2] "01 - Rental Details Local Start Time"
[3] "01 - Rental Details Local End Time"
[4] "01 - Rental Details Bike ID"
[5] "01 - Rental Details Duration In Seconds Uncapped"
[6] "03 - Rental Start Station ID"
[7] "03 - Rental Start Station Name"
[8] "02 - Rental End Station ID"
[9] "02 - Rental End Station Name"
[10] "User Type"
[11] "Member Gender"
[12] "05 - Member Details Member Birthday Year"
> colnames(q3_2019)
[1] "trip_id" "start_time" "end_time"
[4] "bikeid" "tripduration" "from_station_id"
[7] "from_station_name" "to_station_id" "to_station_name"
[10] "usertype" "gender" "birthyear"
> colnames(q4_2019)
[1] "trip_id" "start_time" "end_time"
[4] "bikeid" "tripduration" "from_station_id"
[7] "from_station_name" "to_station_id" "to_station_name"
[10] "usertype" "gender" "birthyear"
We find out that the column names from each data is different from another. To solve this issue, we will rename column names to make them consistent
> (q1_2019 <- rename(q1_2019
+ ,ride_id = trip_id
+ ,rideable_type = bikeid
+ ,started_at = start_time
+ ,ended_at = end_time
+ ,start_station_name = from_station_name
+ ,start_station_id = from_station_id
+ ,end_station_name = to_station_name
+ ,end_station_id = to_station_id
+ ,member_casual = usertype
+ ))
# A tibble: 365,069 × 12
ride_id started_at ended_at rideable_type tripduration
<dbl> <dttm> <dttm> <dbl> <dbl>
1 21742443 2019-01-01 00:04:37 2019-01-01 00:11:07 2167 390
2 21742444 2019-01-01 00:08:13 2019-01-01 00:15:34 4386 441
3 21742445 2019-01-01 00:13:23 2019-01-01 00:27:12 1524 829
4 21742446 2019-01-01 00:13:45 2019-01-01 00:43:28 252 1783
5 21742447 2019-01-01 00:14:52 2019-01-01 00:20:56 1170 364
6 21742448 2019-01-01 00:15:33 2019-01-01 00:19:09 2437 216
7 21742449 2019-01-01 00:16:06 2019-01-01 00:19:03 2708 177
8 21742450 2019-01-01 00:18:41 2019-01-01 00:20:21 2796 100
9 21742451 2019-01-01 00:18:43 2019-01-01 00:47:30 6205 1727
10 21742452 2019-01-01 00:19:18 2019-01-01 00:24:54 3939 336
# ℹ 365,059 more rows
# ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
# end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
# gender <chr>, birthyear <dbl>
> (q2_2019 <- rename(q2_2019
+ ,ride_id = '01 - Rental Details Rental ID'
+ ,rideable_type = '01 - Rental Details Bike ID'
+ ,started_at = '01 - Rental Details Local Start Time'
+ ,ended_at = '01 - Rental Details Local End Time'
+ ,start_station_name = '03 - Rental Start Station Name'
+ ,start_station_id = '03 - Rental Start Station ID'
+ ,end_station_name = '02 - Rental End Station Name'
+ ,end_station_id = '02 - Rental End Station ID'
+ ,member_casual = 'User Type'
+ ,gender = 'Member Gender'
+ ,birthyear = '05 - Member Details Member Birthday Year'
+ ,duration = '01 - Rental Details Duration In Seconds Uncapped'
+ ))
# A tibble: 1,108,163 × 12
ride_id started_at ended_at rideable_type duration
<dbl> <dttm> <dttm> <dbl> <dbl>
1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48 6251 446
2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30 6226 1048
3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19 5649 252
4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58 4151 357
5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13 3270 1007
6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56 3123 257
7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41 6418 548
8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11 4513 383
9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44 3280 2137
10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39 5534 2120
# ℹ 1,108,153 more rows
# ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
# end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
# gender <chr>, birthyear <dbl>
> (q3_2019 <- rename(q3_2019
+ ,ride_id = trip_id
+ ,rideable_type = bikeid
+ ,started_at = start_time
+ ,ended_at = end_time
+ ,start_station_name = from_station_name
+ ,start_station_id = from_station_id
+ ,end_station_name = to_station_name
+ ,end_station_id = to_station_id
+ ,member_casual = usertype
+ ))
# A tibble: 1,640,718 × 12
ride_id started_at ended_at rideable_type tripduration
<dbl> <dttm> <dttm> <dbl> <dbl>
1 23479388 2019-07-01 00:00:27 2019-07-01 00:20:41 3591 1214
2 23479389 2019-07-01 00:01:16 2019-07-01 00:18:44 5353 1048
3 23479390 2019-07-01 00:01:48 2019-07-01 00:27:42 6180 1554
4 23479391 2019-07-01 00:02:07 2019-07-01 00:27:10 5540 1503
5 23479392 2019-07-01 00:02:13 2019-07-01 00:22:26 6014 1213
6 23479393 2019-07-01 00:02:21 2019-07-01 00:07:31 4941 310
7 23479394 2019-07-01 00:02:24 2019-07-01 00:23:12 3770 1248
8 23479395 2019-07-01 00:02:26 2019-07-01 00:28:16 5442 1550
9 23479396 2019-07-01 00:02:34 2019-07-01 00:28:57 2957 1583
10 23479397 2019-07-01 00:02:45 2019-07-01 00:29:14 6091 1589
# ℹ 1,640,708 more rows
# ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
# end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
# gender <chr>, birthyear <dbl>
> (q4_2019 <- rename(q4_2019
+ ,ride_id = trip_id
+ ,rideable_type = bikeid
+ ,started_at = start_time
+ ,ended_at = end_time
+ ,start_station_name = from_station_name
+ ,start_station_id = from_station_id
+ ,end_station_name = to_station_name
+ ,end_station_id = to_station_id
+ ,member_casual = usertype
+ ))
# A tibble: 704,054 × 12
ride_id started_at ended_at rideable_type tripduration
<dbl> <dttm> <dttm> <dbl> <dbl>
1 25223640 2019-10-01 00:01:39 2019-10-01 00:17:20 2215 940
2 25223641 2019-10-01 00:02:16 2019-10-01 00:06:34 6328 258
3 25223642 2019-10-01 00:04:32 2019-10-01 00:18:43 3003 850
4 25223643 2019-10-01 00:04:32 2019-10-01 00:43:43 3275 2350
5 25223644 2019-10-01 00:04:34 2019-10-01 00:35:42 5294 1867
6 25223645 2019-10-01 00:04:38 2019-10-01 00:10:51 1891 373
7 25223646 2019-10-01 00:04:52 2019-10-01 00:22:45 1061 1072
8 25223647 2019-10-01 00:04:57 2019-10-01 00:29:16 1274 1458
9 25223648 2019-10-01 00:05:20 2019-10-01 00:29:18 6011 1437
10 25223649 2019-10-01 00:05:20 2019-10-01 02:23:46 2957 8306
# ℹ 704,044 more rows
# ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
# end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
# gender <chr>, birthyear <dbl>
> str(q1_2019)
spc_tbl_ [365,069 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ride_id : num [1:365069] 21742443 21742444 21742445 21742446 21742447 ...
$ started_at : POSIXct[1:365069], format: "2019-01-01 00:04:37" "2019-01-01 00:08:13" ...
$ ended_at : POSIXct[1:365069], format: "2019-01-01 00:11:07" "2019-01-01 00:15:34" ...
$ rideable_type : num [1:365069] 2167 4386 1524 252 1170 ...
$ tripduration : num [1:365069] 390 441 829 1783 364 ...
$ start_station_id : num [1:365069] 199 44 15 123 173 98 98 211 150 268 ...
$ start_station_name: chr [1:365069] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
$ end_station_id : num [1:365069] 84 624 644 176 35 49 49 142 148 141 ...
$ end_station_name : chr [1:365069] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
$ member_casual : chr [1:365069] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
$ gender : chr [1:365069] "Male" "Female" "Female" "Male" ...
$ birthyear : num [1:365069] 1989 1990 1994 1993 1994 ...
- attr(*, "spec")=
.. cols(
.. trip_id = col_double(),
.. start_time = col_datetime(format = ""),
.. end_time = col_datetime(format = ""),
.. bikeid = col_double(),
.. tripduration = col_number(),
.. from_station_id = col_double(),
.. from_station_name = col_character(),
.. to_station_id = col_double(),
.. to_station_name = col_character(),
.. usertype = col_character(),
.. gender = col_character(),
.. birthyear = col_double()
.. )
- attr(*, "problems")=<externalptr>
> str(q2_2019)
spc_tbl_ [1,108,163 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ride_id : num [1:1108163] 22178529 22178530 22178531 22178532 22178533 ...
$ started_at : POSIXct[1:1108163], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
$ ended_at : POSIXct[1:1108163], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
$ rideable_type : num [1:1108163] 6251 6226 5649 4151 3270 ...
$ duration : num [1:1108163] 446 1048 252 357 1007 ...
$ start_station_id : num [1:1108163] 81 317 283 26 202 420 503 260 211 211 ...
$ start_station_name: chr [1:1108163] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
$ end_station_id : num [1:1108163] 56 59 174 133 129 426 500 499 211 211 ...
$ end_station_name : chr [1:1108163] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
$ member_casual : chr [1:1108163] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
$ gender : chr [1:1108163] "Male" "Female" "Male" "Male" ...
$ birthyear : num [1:1108163] 1975 1984 1990 1993 1992 ...
- attr(*, "spec")=
.. cols(
.. `01 - Rental Details Rental ID` = col_double(),
.. `01 - Rental Details Local Start Time` = col_datetime(format = ""),
.. `01 - Rental Details Local End Time` = col_datetime(format = ""),
.. `01 - Rental Details Bike ID` = col_double(),
.. `01 - Rental Details Duration In Seconds Uncapped` = col_number(),
.. `03 - Rental Start Station ID` = col_double(),
.. `03 - Rental Start Station Name` = col_character(),
.. `02 - Rental End Station ID` = col_double(),
.. `02 - Rental End Station Name` = col_character(),
.. `User Type` = col_character(),
.. `Member Gender` = col_character(),
.. `05 - Member Details Member Birthday Year` = col_double()
.. )
- attr(*, "problems")=<externalptr>
> str(q3_2019)
spc_tbl_ [1,640,718 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ride_id : num [1:1640718] 23479388 23479389 23479390 23479391 23479392 ...
$ started_at : POSIXct[1:1640718], format: "2019-07-01 00:00:27" "2019-07-01 00:01:16" ...
$ ended_at : POSIXct[1:1640718], format: "2019-07-01 00:20:41" "2019-07-01 00:18:44" ...
$ rideable_type : num [1:1640718] 3591 5353 6180 5540 6014 ...
$ tripduration : num [1:1640718] 1214 1048 1554 1503 1213 ...
$ start_station_id : num [1:1640718] 117 381 313 313 168 300 168 313 43 43 ...
$ start_station_name: chr [1:1640718] "Wilton Ave & Belmont Ave" "Western Ave & Monroe St" "Lakeview Ave & Fullerton Pkwy" "Lakeview Ave & Fullerton Pkwy" ...
$ end_station_id : num [1:1640718] 497 203 144 144 62 232 62 144 195 195 ...
$ end_station_name : chr [1:1640718] "Kimball Ave & Belmont Ave" "Western Ave & 21st St" "Larrabee St & Webster Ave" "Larrabee St & Webster Ave" ...
$ member_casual : chr [1:1640718] "Subscriber" "Customer" "Customer" "Customer" ...
$ gender : chr [1:1640718] "Male" NA NA NA ...
$ birthyear : num [1:1640718] 1992 NA NA NA NA ...
- attr(*, "spec")=
.. cols(
.. trip_id = col_double(),
.. start_time = col_datetime(format = ""),
.. end_time = col_datetime(format = ""),
.. bikeid = col_double(),
.. tripduration = col_number(),
.. from_station_id = col_double(),
.. from_station_name = col_character(),
.. to_station_id = col_double(),
.. to_station_name = col_character(),
.. usertype = col_character(),
.. gender = col_character(),
.. birthyear = col_double()
.. )
- attr(*, "problems")=<externalptr>
> str(q4_2019)
spc_tbl_ [704,054 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ ride_id : num [1:704054] 25223640 25223641 25223642 25223643 25223644 ...
$ started_at : POSIXct[1:704054], format: "2019-10-01 00:01:39" "2019-10-01 00:02:16" ...
$ ended_at : POSIXct[1:704054], format: "2019-10-01 00:17:20" "2019-10-01 00:06:34" ...
$ rideable_type : num [1:704054] 2215 6328 3003 3275 5294 ...
$ tripduration : num [1:704054] 940 258 850 2350 1867 ...
$ start_station_id : num [1:704054] 20 19 84 313 210 156 84 156 156 336 ...
$ start_station_name: chr [1:704054] "Sheffield Ave & Kingsbury St" "Throop (Loomis) St & Taylor St" "Milwaukee Ave & Grand Ave" "Lakeview Ave & Fullerton Pkwy" ...
$ end_station_id : num [1:704054] 309 241 199 290 382 226 142 463 463 336 ...
$ end_station_name : chr [1:704054] "Leavitt St & Armitage Ave" "Morgan St & Polk St" "Wabash Ave & Grand Ave" "Kedzie Ave & Palmer Ct" ...
$ member_casual : chr [1:704054] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
$ gender : chr [1:704054] "Male" "Male" "Female" "Male" ...
$ birthyear : num [1:704054] 1987 1998 1991 1990 1987 ...
- attr(*, "spec")=
.. cols(
.. trip_id = col_double(),
.. start_time = col_datetime(format = ""),
.. end_time = col_datetime(format = ""),
.. bikeid = col_double(),
.. tripduration = col_number(),
.. from_station_id = col_double(),
.. from_station_name = col_character(),
.. to_station_id = col_double(),
.. to_station_name = col_character(),
.. usertype = col_character(),
.. gender = col_character(),
.. birthyear = col_double()
.. )
- attr(*, "problems")=<externalptr>
> q1_2019 <- mutate(q1_2019, ride_id = as.character(ride_id)
+ ,rideable_type = as.character(rideable_type))
> q2_2019 <- mutate(q2_2019, ride_id = as.character(ride_id)
+ ,rideable_type = as.character(rideable_type))
> q3_2019 <- mutate(q3_2019, ride_id = as.character(ride_id)
+ ,rideable_type = as.character(rideable_type))
> q4_2019 <- mutate(q4_2019, ride_id = as.character(ride_id)
+ ,rideable_type = as.character(rideable_type))
> all_trips <- bind_rows(q1_2019,q2_2019, q3_2019, q4_2019)
> all_trips <- all_trips %>%
+ select(-c(birthyear, gender, duration, "tripduration"))
In this phase we will clean and transform the data to make it suitable for analysis. This step includes handling missing values, removing duplicates, and formatting data into a usable format. We might also need to merge or join different datasets if necessary.
Check data for errors
> colnames(all_trips)
[1] "ride_id" "started_at" "ended_at"
[4] "rideable_type" "start_station_id" "start_station_name"
[7] "end_station_id" "end_station_name" "member_casual"
> nrow(all_trips)
[1] 3818004
> dim(all_trips)
[1] 3818004 9
> head(all_trips)
# A tibble: 6 × 9
ride_id started_at ended_at rideable_type start_station_id
<chr> <dttm> <dttm> <chr> <dbl>
1 217424… 2019-01-01 00:04:37 2019-01-01 00:11:07 2167 199
2 217424… 2019-01-01 00:08:13 2019-01-01 00:15:34 4386 44
3 217424… 2019-01-01 00:13:23 2019-01-01 00:27:12 1524 15
4 217424… 2019-01-01 00:13:45 2019-01-01 00:43:28 252 123
5 217424… 2019-01-01 00:14:52 2019-01-01 00:20:56 1170 173
6 217424… 2019-01-01 00:15:33 2019-01-01 00:19:09 2437 98
# ℹ 4 more variables: start_station_name <chr>, end_station_id <dbl>,
# end_station_name <chr>, member_casual <chr>
> str(all_trips)
tibble [3,818,004 × 9] (S3: tbl_df/tbl/data.frame)
$ ride_id : chr [1:3818004] "21742443" "21742444" "21742445" "21742446" ...
$ started_at : POSIXct[1:3818004], format: "2019-01-01 00:04:37" "2019-01-01 00:08:13" ...
$ ended_at : POSIXct[1:3818004], format: "2019-01-01 00:11:07" "2019-01-01 00:15:34" ...
$ rideable_type : chr [1:3818004] "2167" "4386" "1524" "252" ...
$ start_station_id : num [1:3818004] 199 44 15 123 173 98 98 211 150 268 ...
$ start_station_name: chr [1:3818004] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
$ end_station_id : num [1:3818004] 84 624 644 176 35 49 49 142 148 141 ...
$ end_station_name : chr [1:3818004] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
$ member_casual : chr [1:3818004] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
> summary(all_trips)
ride_id started_at
Length:3818004 Min. :2019-01-01 00:04:37.00
Class :character 1st Qu.:2019-05-29 15:49:26.50
Mode :character Median :2019-07-25 17:50:54.00
Mean :2019-07-19 21:47:37.11
3rd Qu.:2019-09-15 06:48:05.75
Max. :2019-12-31 23:57:17.00
ended_at rideable_type start_station_id
Min. :2019-01-01 00:11:07.00 Length:3818004 Min. : 1.0
1st Qu.:2019-05-29 16:09:28.25 Class :character 1st Qu.: 77.0
Median :2019-07-25 18:12:23.00 Mode :character Median :174.0
Mean :2019-07-19 22:11:47.56 Mean :201.7
3rd Qu.:2019-09-15 08:30:13.25 3rd Qu.:289.0
Max. :2020-01-21 13:54:35.00 Max. :673.0
start_station_name end_station_id end_station_name member_casual
Length:3818004 Min. : 1.0 Length:3818004 Length:3818004
Class :character 1st Qu.: 77.0 Class :character Class :character
Mode :character Median :174.0 Mode :character Mode :character
Mean :202.6
3rd Qu.:291.0
Max. :673.0
There are a few problems we will need to fix:
Data Manipulation
> all_trips <- all_trips %>%
+ mutate(member_casual = recode(member_casual
+ ,"Subscriber" = "member"
+ ,"Customer" = "casual"))
> table(all_trips$member_casual)
casual member
880637 2937367
> all_trips$date <- as.Date(all_trips$started_at)
> all_trips$month <- format(as.Date(all_trips$date), "%m")
> all_trips$day <- format(as.Date(all_trips$date), "%d")
> all_trips$year <- format(as.Date(all_trips$date), "%Y")
> all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
> all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
> str(all_trips)
tibble [3,818,004 × 15] (S3: tbl_df/tbl/data.frame)
$ ride_id : chr [1:3818004] "21742443" "21742444" "21742445" "21742446" ...
$ started_at : POSIXct[1:3818004], format: "2019-01-01 00:04:37" "2019-01-01 00:08:13" ...
$ ended_at : POSIXct[1:3818004], format: "2019-01-01 00:11:07" "2019-01-01 00:15:34" ...
$ rideable_type : chr [1:3818004] "2167" "4386" "1524" "252" ...
$ start_station_id : num [1:3818004] 199 44 15 123 173 98 98 211 150 268 ...
$ start_station_name: chr [1:3818004] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
$ end_station_id : num [1:3818004] 84 624 644 176 35 49 49 142 148 141 ...
$ end_station_name : chr [1:3818004] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
$ member_casual : chr [1:3818004] "member" "member" "member" "member" ...
$ date : Date[1:3818004], format: "2019-01-01" "2019-01-01" ...
$ month : chr [1:3818004] "01" "01" "01" "01" ...
$ day : chr [1:3818004] "01" "01" "01" "01" ...
$ year : chr [1:3818004] "2019" "2019" "2019" "2019" ...
$ day_of_week : chr [1:3818004] "Tuesday" "Tuesday" "Tuesday" "Tuesday" ...
$ ride_length : 'difftime' num [1:3818004] 6.5 7.35 13.8166666666667 29.7166666666667 ...
..- attr(*, "units")= chr "mins"
> is.factor(all_trips$ride_length)
[1] FALSE
> all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
> is.numeric(all_trips$ride_length)
[1] TRUE
Data Cleaning
> all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
We will perform Exploratory Data Analysis (EDA) and apply statistical methods to uncover patterns, trends, and insights in the data. This step helps us to answer business task and identify key findings, such as trends in bike usage by customer type
Descriptive Analysis on ride_length by User
Type
> # Filter for casual users
> casual_trips <- all_trips_v2 %>%
+ filter(member_casual == "casual")
>
> # Filter for member users
> member_trips <- all_trips_v2 %>%
+ filter(member_casual == "member")
>
> # Descriptive analysis on ride_length (all figures in minutes) for casual and member users
> summary(casual_trips$ride_length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.02 15.25 25.83 57.02 45.30 177200.37
> summary(member_trips$ride_length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.02 6.03 9.80 14.33 16.12 150943.90
The result suggest that, on average, casual users tend to take longer rides compared to member users.
User Type Distribution
> # Count the number of casual and member customers
> customer_distribution <- all_trips_v2 %>%
+ count(member_casual)
>
> # Calculate the percentage
> customer_distribution <- customer_distribution %>%
+ mutate(percentage = n / sum(n) * 100)
>
> # Plotting
> ggplot(customer_distribution, aes(x = "", y = n, fill = member_casual)) +
+ geom_bar(stat = "identity", width = 1) +
+ coord_polar("y") +
+ geom_text(aes(label = paste0(round(percentage), "%")), position = position_stack(vjust = 0.5)) +
+ labs(title = "Distribution of Casual and Member Users",
+ fill = "Customer Type") +
+ theme_void() +
+ scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))
Number of Rides Comparison
> # Count the number of rides for each user type
> ride_counts <- all_trips_v2 %>%
+ group_by(member_casual) %>%
+ summarise(num_rides = n())
>
> ggplot(ride_counts, aes(x = member_casual, y = num_rides, fill = member_casual)) +
+ geom_bar(stat = "identity") +
+ labs(title = "Number of Rides by User Type", x = "User Type", y = "Number of Rides") +
+ scale_y_continuous(labels = scales::comma) + # Format y-axis labels with commas
+ theme_minimal() +
+ scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))
Average ride length comparison
> # Calculate average ride length for each user type
> ride_duration <- all_trips_v2 %>%
+ group_by(member_casual) %>%
+ summarise(avg_ride_length = mean(ride_length, na.rm = TRUE))
>
> # Bar plot of average ride length by user type
> ggplot(ride_duration, aes(x = member_casual, y = avg_ride_length, fill = member_casual)) +
+ geom_bar(stat = "identity") +
+ labs(title = "Average Ride Length by User Type", x = "User Type", y = "Average Ride Length", fill = "User Type") +
+ theme_minimal() +
+ scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))
Activity pattern of member riders
> activity_by_day_time_member <- all_trips_v2 %>%
+ filter(member_casual == "member") %>%
+ mutate(hour = lubridate::hour(started_at)) %>%
+ group_by(member_casual, day_of_week, hour, .drop = FALSE) %>%
+ summarise(num_rides = n()) %>%
+ arrange(member_casual, day_of_week, hour)
>
> # Plotting
> ggplot(activity_by_day_time_member, aes(x = hour, y = day_of_week, fill = num_rides)) +
+ geom_tile() +
+ labs(title = "Member Users Ride Activity by Day of Week and Time of Day",
+ x = "Hour of Day", y = "Day of Week", fill = "Number of Rides") +
+ theme_minimal() +
+ scale_fill_viridis_c()
Activity pattern of casual riders
> activity_by_day_time_casual <- all_trips_v2 %>%
+ filter(member_casual == "casual") %>%
+ mutate(hour = lubridate::hour(started_at)) %>%
+ group_by(member_casual, day_of_week, hour, .drop = FALSE) %>%
+ summarise(num_rides = n()) %>%
+ arrange(member_casual, day_of_week, hour)
>
> ggplot(activity_by_day_time_casual, aes(x = hour, y = day_of_week, fill = num_rides)) +
+ geom_tile() +
+ labs(title = "Casual Users Ride Activity by Day of Week and Time of Day",
+ x = "Hour of Day", y = "Day of Week", fill = "Number of Rides") +
+ theme_minimal() +
+ scale_fill_viridis_c()
User type activity by day of the week
> # Extract day of week from started_at
> all_trips_v2$day_of_week <- factor(weekdays(all_trips_v2$started_at, abbreviate = TRUE),
+ levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
>
> # Calculate number of rows by user type and day of week
> rows_by_user_type_day <- all_trips_v2 %>%
+ group_by(day_of_week, member_casual) %>%
+ summarise(num_rows = n())
>
> # Plotting
> ggplot(rows_by_user_type_day, aes(x = day_of_week, y = num_rows, fill = member_casual)) +
+ geom_bar(stat = "identity", position = "dodge") +
+ labs(title = "Ride Counts by User Type During a Week",
+ x = "Day of Week", y = "Ride Counts", fill = "User Type") +
+ scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k")) +
+ theme_minimal()+
+ scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))
User type activity by month of the year
> # Calculate number of rows by user type and month
> rows_by_user_type_month <- all_trips_v2 %>%
+ group_by(month, member_casual) %>%
+ summarise(num_rows = n())
>
> # Plotting
> ggplot(rows_by_user_type_month, aes(x = factor(month, labels = month.name), y = num_rows, fill = member_casual)) +
+ geom_bar(stat = "identity", position = "dodge") +
+ labs(title = "Ride Counts by User Type During a Year",
+ x = "Month", y = "Ride Counts", fill = "User Type") +
+ theme_minimal() +
+ theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
+ scale_y_continuous(labels = scales::comma_format()) +
+ scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))
We will suggest action based on the analysis to achieve the business task. This step could involve implementing a new marketing strategy, optimizing bike availability, or making other operational changes based on our findings.
Based on the insights gathered:
a. Application of Insights: Implement targeted marketing campaigns based on the identified user behaviors and preferences. Use the insights to tailor promotions, messaging, and user experience to better appeal to casual riders and encourage them to become annual members.
b. Next Steps: Initiate a comprehensive marketing strategy that includes promotional offers, engagement campaigns, and user experience enhancements. Develop a timeline for implementation and establish key performance indicators (KPIs) to measure the success of the strategy.
c. Additional Data: Explore additional data sources, such as user surveys or feedback, to gain deeper insights into the reasons behind casual riders’ usage patterns and preferences. This data can help refine the marketing strategy and improve its effectiveness in converting casual riders into annual members.
Furthermore, based on the analysis, here are my top three recommendations:
1. Promotional Offers: Create targeted promotional offers aimed at casual riders, such as discounted annual memberships or exclusive benefits for converting to annual membership. Highlight the cost savings and convenience of annual membership compared to pay-as-you-ride options.
2. Engagement Campaigns: Launch engagement campaigns to educate casual riders about the benefits of annual membership, such as unlimited rides, priority access to bikes, and member-exclusive events. Use social media, email marketing, and in-app notifications to reach out to casual riders.
3. User Experience Enhancements: Improve the user experience for both casual and annual members to encourage conversion. This could include streamlining the registration process, enhancing bike availability and maintenance, and offering personalized recommedations based on ride history.
for exploring my analysis! I hope you found my insights valuable and informative.