Overview

Welcome to the Cyclistic bike-share analysis case study! This case study is a part of my capstone project for Google Data Analytics Professional Certificate. In this scenario, I am a part of a the Cyclistic team, a fictional company, along with key team members. To address the business questions, I will follow the data analysis process, which involves the following steps: Ask, Prepare, Process, Analyze, Share, and Act.

Ask Phase

In this phase, we will define the business task and the goals of the analysis. Furthermore, we will determine what we want to achieve with the data analysis such as understanding how different customer types use Cyclistic bikes and designing a new marketing strategy.

Business Task

To analyze Cyclistic’s past bike trip records and understand the distinct usage patterns between annual members and casual riders, and use these findings to devise an updated marketing approach targeting casual riders, with the aim of converting them into annual members.

Key Stakeholders

Lily Moreno: The director of marketing and manager at Cyclistic and is responsible for developing campaigns and initiatives to promote the bike-share program.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
Cyclistic executive team: The well-known detail-oriented executive team that will decide whether to approve the recommended marketing program.

Prepare Phase

We will gather and organize the data needed for analysis. This involves locating the data, ensuring its integrity, and addressing any private or security concerns. For this case study, I have downloaded the previous 12 months of bike trip data, which is public and anonymized to protect user privacy.

Data Source

The data has been provided by Motivate International Inc. under a specific license. This data is publicly available and can be used to analyze different customer types on Cyclistic bikes.

Data Limitation

It’s important to note that data-privacy issues prevent the use of personally identifiable information of riders. This limitation means that it is prohibited to link pass purchases to credit card numbers to determine whether casual riders reside in Cyclistic’s service area or have purchased multiple single passes.

Library Used

> library(tidyverse)
> library(conflicted)
> library(ggplot2)
> library(scales)
> library(dplyr)
> 
> # Set dplyr::filter and dplyr::lag as the default choices
> conflict_prefer("filter", "dplyr")
> conflict_prefer("lag", "dplyr")

Collect Data

> q1_2019 <- read_csv("C:/Users/acer/Downloads/Divvy_Trips_2019_Q1.csv")
> q2_2019 <- read_csv("C:/Users/acer/Downloads/Divvy_Trips_2019_Q2.csv")
> q3_2019 <- read_csv("C:/Users/acer/Downloads/Divvy_Trips_2019_Q3.csv")
> q4_2019 <- read_csv("C:/Users/acer/Downloads/Divvy_Trips_2019_Q4.csv")

Sort and Filter Data

Check column names consistency

> colnames(q1_2019)
 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"        
> colnames(q2_2019)
 [1] "01 - Rental Details Rental ID"                   
 [2] "01 - Rental Details Local Start Time"            
 [3] "01 - Rental Details Local End Time"              
 [4] "01 - Rental Details Bike ID"                     
 [5] "01 - Rental Details Duration In Seconds Uncapped"
 [6] "03 - Rental Start Station ID"                    
 [7] "03 - Rental Start Station Name"                  
 [8] "02 - Rental End Station ID"                      
 [9] "02 - Rental End Station Name"                    
[10] "User Type"                                       
[11] "Member Gender"                                   
[12] "05 - Member Details Member Birthday Year"        
> colnames(q3_2019)
 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"        
> colnames(q4_2019)
 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"

We find out that the column names from each data is different from another. To solve this issue, we will rename column names to make them consistent

> (q1_2019 <- rename(q1_2019
+                    ,ride_id = trip_id
+                    ,rideable_type = bikeid
+                    ,started_at = start_time
+                    ,ended_at = end_time
+                    ,start_station_name = from_station_name
+                    ,start_station_id = from_station_id
+                    ,end_station_name = to_station_name
+                    ,end_station_id = to_station_id
+                    ,member_casual = usertype
+ ))
# A tibble: 365,069 × 12
    ride_id started_at          ended_at            rideable_type tripduration
      <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
 1 21742443 2019-01-01 00:04:37 2019-01-01 00:11:07          2167          390
 2 21742444 2019-01-01 00:08:13 2019-01-01 00:15:34          4386          441
 3 21742445 2019-01-01 00:13:23 2019-01-01 00:27:12          1524          829
 4 21742446 2019-01-01 00:13:45 2019-01-01 00:43:28           252         1783
 5 21742447 2019-01-01 00:14:52 2019-01-01 00:20:56          1170          364
 6 21742448 2019-01-01 00:15:33 2019-01-01 00:19:09          2437          216
 7 21742449 2019-01-01 00:16:06 2019-01-01 00:19:03          2708          177
 8 21742450 2019-01-01 00:18:41 2019-01-01 00:20:21          2796          100
 9 21742451 2019-01-01 00:18:43 2019-01-01 00:47:30          6205         1727
10 21742452 2019-01-01 00:19:18 2019-01-01 00:24:54          3939          336
# ℹ 365,059 more rows
# ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
#   end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
#   gender <chr>, birthyear <dbl>

> (q2_2019 <- rename(q2_2019
+                    ,ride_id = '01 - Rental Details Rental ID'
+                    ,rideable_type = '01 - Rental Details Bike ID'
+                    ,started_at = '01 - Rental Details Local Start Time'
+                    ,ended_at = '01 - Rental Details Local End Time'
+                    ,start_station_name = '03 - Rental Start Station Name'
+                    ,start_station_id = '03 - Rental Start Station ID'
+                    ,end_station_name = '02 - Rental End Station Name'
+                    ,end_station_id = '02 - Rental End Station ID'
+                    ,member_casual = 'User Type'
+                    ,gender = 'Member Gender'
+                    ,birthyear = '05 - Member Details Member Birthday Year'
+                    ,duration = '01 - Rental Details Duration In Seconds Uncapped'
+ ))
# A tibble: 1,108,163 × 12
    ride_id started_at          ended_at            rideable_type duration
      <dbl> <dttm>              <dttm>                      <dbl>    <dbl>
 1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48          6251      446
 2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30          6226     1048
 3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19          5649      252
 4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58          4151      357
 5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13          3270     1007
 6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56          3123      257
 7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41          6418      548
 8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11          4513      383
 9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44          3280     2137
10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39          5534     2120
# ℹ 1,108,153 more rows
# ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
#   end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
#   gender <chr>, birthyear <dbl>

> (q3_2019 <- rename(q3_2019
+                    ,ride_id = trip_id
+                    ,rideable_type = bikeid
+                    ,started_at = start_time
+                    ,ended_at = end_time
+                    ,start_station_name = from_station_name
+                    ,start_station_id = from_station_id
+                    ,end_station_name = to_station_name
+                    ,end_station_id = to_station_id
+                    ,member_casual = usertype
+ ))
# A tibble: 1,640,718 × 12
    ride_id started_at          ended_at            rideable_type tripduration
      <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
 1 23479388 2019-07-01 00:00:27 2019-07-01 00:20:41          3591         1214
 2 23479389 2019-07-01 00:01:16 2019-07-01 00:18:44          5353         1048
 3 23479390 2019-07-01 00:01:48 2019-07-01 00:27:42          6180         1554
 4 23479391 2019-07-01 00:02:07 2019-07-01 00:27:10          5540         1503
 5 23479392 2019-07-01 00:02:13 2019-07-01 00:22:26          6014         1213
 6 23479393 2019-07-01 00:02:21 2019-07-01 00:07:31          4941          310
 7 23479394 2019-07-01 00:02:24 2019-07-01 00:23:12          3770         1248
 8 23479395 2019-07-01 00:02:26 2019-07-01 00:28:16          5442         1550
 9 23479396 2019-07-01 00:02:34 2019-07-01 00:28:57          2957         1583
10 23479397 2019-07-01 00:02:45 2019-07-01 00:29:14          6091         1589
# ℹ 1,640,708 more rows
# ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
#   end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
#   gender <chr>, birthyear <dbl>

> (q4_2019 <- rename(q4_2019
+                    ,ride_id = trip_id
+                    ,rideable_type = bikeid
+                    ,started_at = start_time
+                    ,ended_at = end_time
+                    ,start_station_name = from_station_name
+                    ,start_station_id = from_station_id
+                    ,end_station_name = to_station_name
+                    ,end_station_id = to_station_id
+                    ,member_casual = usertype
+ ))
# A tibble: 704,054 × 12
    ride_id started_at          ended_at            rideable_type tripduration
      <dbl> <dttm>              <dttm>                      <dbl>        <dbl>
 1 25223640 2019-10-01 00:01:39 2019-10-01 00:17:20          2215          940
 2 25223641 2019-10-01 00:02:16 2019-10-01 00:06:34          6328          258
 3 25223642 2019-10-01 00:04:32 2019-10-01 00:18:43          3003          850
 4 25223643 2019-10-01 00:04:32 2019-10-01 00:43:43          3275         2350
 5 25223644 2019-10-01 00:04:34 2019-10-01 00:35:42          5294         1867
 6 25223645 2019-10-01 00:04:38 2019-10-01 00:10:51          1891          373
 7 25223646 2019-10-01 00:04:52 2019-10-01 00:22:45          1061         1072
 8 25223647 2019-10-01 00:04:57 2019-10-01 00:29:16          1274         1458
 9 25223648 2019-10-01 00:05:20 2019-10-01 00:29:18          6011         1437
10 25223649 2019-10-01 00:05:20 2019-10-01 02:23:46          2957         8306
# ℹ 704,044 more rows
# ℹ 7 more variables: start_station_id <dbl>, start_station_name <chr>,
#   end_station_id <dbl>, end_station_name <chr>, member_casual <chr>,
#   gender <chr>, birthyear <dbl>

Inspect the dataframes and look for incongruencies

> str(q1_2019)
spc_tbl_ [365,069 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : num [1:365069] 21742443 21742444 21742445 21742446 21742447 ...
 $ started_at        : POSIXct[1:365069], format: "2019-01-01 00:04:37" "2019-01-01 00:08:13" ...
 $ ended_at          : POSIXct[1:365069], format: "2019-01-01 00:11:07" "2019-01-01 00:15:34" ...
 $ rideable_type     : num [1:365069] 2167 4386 1524 252 1170 ...
 $ tripduration      : num [1:365069] 390 441 829 1783 364 ...
 $ start_station_id  : num [1:365069] 199 44 15 123 173 98 98 211 150 268 ...
 $ start_station_name: chr [1:365069] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
 $ end_station_id    : num [1:365069] 84 624 644 176 35 49 49 142 148 141 ...
 $ end_station_name  : chr [1:365069] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
 $ member_casual     : chr [1:365069] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
 $ gender            : chr [1:365069] "Male" "Female" "Female" "Male" ...
 $ birthyear         : num [1:365069] 1989 1990 1994 1993 1994 ...
 - attr(*, "spec")=
  .. cols(
  ..   trip_id = col_double(),
  ..   start_time = col_datetime(format = ""),
  ..   end_time = col_datetime(format = ""),
  ..   bikeid = col_double(),
  ..   tripduration = col_number(),
  ..   from_station_id = col_double(),
  ..   from_station_name = col_character(),
  ..   to_station_id = col_double(),
  ..   to_station_name = col_character(),
  ..   usertype = col_character(),
  ..   gender = col_character(),
  ..   birthyear = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
> str(q2_2019)
spc_tbl_ [1,108,163 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : num [1:1108163] 22178529 22178530 22178531 22178532 22178533 ...
 $ started_at        : POSIXct[1:1108163], format: "2019-04-01 00:02:22" "2019-04-01 00:03:02" ...
 $ ended_at          : POSIXct[1:1108163], format: "2019-04-01 00:09:48" "2019-04-01 00:20:30" ...
 $ rideable_type     : num [1:1108163] 6251 6226 5649 4151 3270 ...
 $ duration          : num [1:1108163] 446 1048 252 357 1007 ...
 $ start_station_id  : num [1:1108163] 81 317 283 26 202 420 503 260 211 211 ...
 $ start_station_name: chr [1:1108163] "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
 $ end_station_id    : num [1:1108163] 56 59 174 133 129 426 500 499 211 211 ...
 $ end_station_name  : chr [1:1108163] "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
 $ member_casual     : chr [1:1108163] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
 $ gender            : chr [1:1108163] "Male" "Female" "Male" "Male" ...
 $ birthyear         : num [1:1108163] 1975 1984 1990 1993 1992 ...
 - attr(*, "spec")=
  .. cols(
  ..   `01 - Rental Details Rental ID` = col_double(),
  ..   `01 - Rental Details Local Start Time` = col_datetime(format = ""),
  ..   `01 - Rental Details Local End Time` = col_datetime(format = ""),
  ..   `01 - Rental Details Bike ID` = col_double(),
  ..   `01 - Rental Details Duration In Seconds Uncapped` = col_number(),
  ..   `03 - Rental Start Station ID` = col_double(),
  ..   `03 - Rental Start Station Name` = col_character(),
  ..   `02 - Rental End Station ID` = col_double(),
  ..   `02 - Rental End Station Name` = col_character(),
  ..   `User Type` = col_character(),
  ..   `Member Gender` = col_character(),
  ..   `05 - Member Details Member Birthday Year` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
> str(q3_2019)
spc_tbl_ [1,640,718 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : num [1:1640718] 23479388 23479389 23479390 23479391 23479392 ...
 $ started_at        : POSIXct[1:1640718], format: "2019-07-01 00:00:27" "2019-07-01 00:01:16" ...
 $ ended_at          : POSIXct[1:1640718], format: "2019-07-01 00:20:41" "2019-07-01 00:18:44" ...
 $ rideable_type     : num [1:1640718] 3591 5353 6180 5540 6014 ...
 $ tripduration      : num [1:1640718] 1214 1048 1554 1503 1213 ...
 $ start_station_id  : num [1:1640718] 117 381 313 313 168 300 168 313 43 43 ...
 $ start_station_name: chr [1:1640718] "Wilton Ave & Belmont Ave" "Western Ave & Monroe St" "Lakeview Ave & Fullerton Pkwy" "Lakeview Ave & Fullerton Pkwy" ...
 $ end_station_id    : num [1:1640718] 497 203 144 144 62 232 62 144 195 195 ...
 $ end_station_name  : chr [1:1640718] "Kimball Ave & Belmont Ave" "Western Ave & 21st St" "Larrabee St & Webster Ave" "Larrabee St & Webster Ave" ...
 $ member_casual     : chr [1:1640718] "Subscriber" "Customer" "Customer" "Customer" ...
 $ gender            : chr [1:1640718] "Male" NA NA NA ...
 $ birthyear         : num [1:1640718] 1992 NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   trip_id = col_double(),
  ..   start_time = col_datetime(format = ""),
  ..   end_time = col_datetime(format = ""),
  ..   bikeid = col_double(),
  ..   tripduration = col_number(),
  ..   from_station_id = col_double(),
  ..   from_station_name = col_character(),
  ..   to_station_id = col_double(),
  ..   to_station_name = col_character(),
  ..   usertype = col_character(),
  ..   gender = col_character(),
  ..   birthyear = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
> str(q4_2019)
spc_tbl_ [704,054 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : num [1:704054] 25223640 25223641 25223642 25223643 25223644 ...
 $ started_at        : POSIXct[1:704054], format: "2019-10-01 00:01:39" "2019-10-01 00:02:16" ...
 $ ended_at          : POSIXct[1:704054], format: "2019-10-01 00:17:20" "2019-10-01 00:06:34" ...
 $ rideable_type     : num [1:704054] 2215 6328 3003 3275 5294 ...
 $ tripduration      : num [1:704054] 940 258 850 2350 1867 ...
 $ start_station_id  : num [1:704054] 20 19 84 313 210 156 84 156 156 336 ...
 $ start_station_name: chr [1:704054] "Sheffield Ave & Kingsbury St" "Throop (Loomis) St & Taylor St" "Milwaukee Ave & Grand Ave" "Lakeview Ave & Fullerton Pkwy" ...
 $ end_station_id    : num [1:704054] 309 241 199 290 382 226 142 463 463 336 ...
 $ end_station_name  : chr [1:704054] "Leavitt St & Armitage Ave" "Morgan St & Polk St" "Wabash Ave & Grand Ave" "Kedzie Ave & Palmer Ct" ...
 $ member_casual     : chr [1:704054] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
 $ gender            : chr [1:704054] "Male" "Male" "Female" "Male" ...
 $ birthyear         : num [1:704054] 1987 1998 1991 1990 1987 ...
 - attr(*, "spec")=
  .. cols(
  ..   trip_id = col_double(),
  ..   start_time = col_datetime(format = ""),
  ..   end_time = col_datetime(format = ""),
  ..   bikeid = col_double(),
  ..   tripduration = col_number(),
  ..   from_station_id = col_double(),
  ..   from_station_name = col_character(),
  ..   to_station_id = col_double(),
  ..   to_station_name = col_character(),
  ..   usertype = col_character(),
  ..   gender = col_character(),
  ..   birthyear = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Convert ride_id and rideable_type to character so that they can stack correctly

> q1_2019 <-  mutate(q1_2019, ride_id = as.character(ride_id)
+                    ,rideable_type = as.character(rideable_type)) 
> q2_2019 <-  mutate(q2_2019, ride_id = as.character(ride_id)
+                    ,rideable_type = as.character(rideable_type)) 
> q3_2019 <-  mutate(q3_2019, ride_id = as.character(ride_id)
+                    ,rideable_type = as.character(rideable_type)) 
> q4_2019 <-  mutate(q4_2019, ride_id = as.character(ride_id)
+                    ,rideable_type = as.character(rideable_type))

Merge individual quarter’s data frames into one big data frame

> all_trips <- bind_rows(q1_2019,q2_2019, q3_2019, q4_2019)

Remove lat, long, birthyear, and gender fields as this data was dropped beginning in 2020

> all_trips <- all_trips %>%  
+   select(-c(birthyear, gender, duration,  "tripduration"))

Process Phase

In this phase we will clean and transform the data to make it suitable for analysis. This step includes handling missing values, removing duplicates, and formatting data into a usable format. We might also need to merge or join different datasets if necessary.

Check data for errors

Inspect the new table that has been created

> colnames(all_trips)  
[1] "ride_id"            "started_at"         "ended_at"          
[4] "rideable_type"      "start_station_id"   "start_station_name"
[7] "end_station_id"     "end_station_name"   "member_casual"     
> nrow(all_trips)  
[1] 3818004
> dim(all_trips)  
[1] 3818004       9
> head(all_trips)  
# A tibble: 6 × 9
  ride_id started_at          ended_at            rideable_type start_station_id
  <chr>   <dttm>              <dttm>              <chr>                    <dbl>
1 217424… 2019-01-01 00:04:37 2019-01-01 00:11:07 2167                       199
2 217424… 2019-01-01 00:08:13 2019-01-01 00:15:34 4386                        44
3 217424… 2019-01-01 00:13:23 2019-01-01 00:27:12 1524                        15
4 217424… 2019-01-01 00:13:45 2019-01-01 00:43:28 252                        123
5 217424… 2019-01-01 00:14:52 2019-01-01 00:20:56 1170                       173
6 217424… 2019-01-01 00:15:33 2019-01-01 00:19:09 2437                        98
# ℹ 4 more variables: start_station_name <chr>, end_station_id <dbl>,
#   end_station_name <chr>, member_casual <chr>
> str(all_trips)  
tibble [3,818,004 × 9] (S3: tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:3818004] "21742443" "21742444" "21742445" "21742446" ...
 $ started_at        : POSIXct[1:3818004], format: "2019-01-01 00:04:37" "2019-01-01 00:08:13" ...
 $ ended_at          : POSIXct[1:3818004], format: "2019-01-01 00:11:07" "2019-01-01 00:15:34" ...
 $ rideable_type     : chr [1:3818004] "2167" "4386" "1524" "252" ...
 $ start_station_id  : num [1:3818004] 199 44 15 123 173 98 98 211 150 268 ...
 $ start_station_name: chr [1:3818004] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
 $ end_station_id    : num [1:3818004] 84 624 644 176 35 49 49 142 148 141 ...
 $ end_station_name  : chr [1:3818004] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
 $ member_casual     : chr [1:3818004] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
> summary(all_trips)  
   ride_id            started_at                    
 Length:3818004     Min.   :2019-01-01 00:04:37.00  
 Class :character   1st Qu.:2019-05-29 15:49:26.50  
 Mode  :character   Median :2019-07-25 17:50:54.00  
                    Mean   :2019-07-19 21:47:37.11  
                    3rd Qu.:2019-09-15 06:48:05.75  
                    Max.   :2019-12-31 23:57:17.00  
    ended_at                      rideable_type      start_station_id
 Min.   :2019-01-01 00:11:07.00   Length:3818004     Min.   :  1.0   
 1st Qu.:2019-05-29 16:09:28.25   Class :character   1st Qu.: 77.0   
 Median :2019-07-25 18:12:23.00   Mode  :character   Median :174.0   
 Mean   :2019-07-19 22:11:47.56                      Mean   :201.7   
 3rd Qu.:2019-09-15 08:30:13.25                      3rd Qu.:289.0   
 Max.   :2020-01-21 13:54:35.00                      Max.   :673.0   
 start_station_name end_station_id  end_station_name   member_casual     
 Length:3818004     Min.   :  1.0   Length:3818004     Length:3818004    
 Class :character   1st Qu.: 77.0   Class :character   Class :character  
 Mode  :character   Median :174.0   Mode  :character   Mode  :character  
                    Mean   :202.6                                        
                    3rd Qu.:291.0                                        
                    Max.   :673.0

There are a few problems we will need to fix:

In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.
The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data -- such as day, month, year -- that provide additional opportunities to aggregate the data.
We will want to add a calculated field for length of ride since the 2020Q1 data did not have the “tripduration” column. We will add “ride_length” to the entire dataframe for consistency.
There are some rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.

Data Manipulation

Reassign to the desired values (we will go with the current 2020 labels)

> all_trips <-  all_trips %>% 
+   mutate(member_casual = recode(member_casual
+                                 ,"Subscriber" = "member"
+                                 ,"Customer" = "casual"))

Check to make sure the proper number of observations were reassigned

> table(all_trips$member_casual)

 casual  member 
 880637 2937367

Add columns that list the date, month, day, and year of each ride

> all_trips$date <- as.Date(all_trips$started_at)
> all_trips$month <- format(as.Date(all_trips$date), "%m")
> all_trips$day <- format(as.Date(all_trips$date), "%d")
> all_trips$year <- format(as.Date(all_trips$date), "%Y")
> all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

Add a “ride_length” calculation to all_trips (in seconds)

> all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

Inspect the structure of the columns

> str(all_trips)
tibble [3,818,004 × 15] (S3: tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:3818004] "21742443" "21742444" "21742445" "21742446" ...
 $ started_at        : POSIXct[1:3818004], format: "2019-01-01 00:04:37" "2019-01-01 00:08:13" ...
 $ ended_at          : POSIXct[1:3818004], format: "2019-01-01 00:11:07" "2019-01-01 00:15:34" ...
 $ rideable_type     : chr [1:3818004] "2167" "4386" "1524" "252" ...
 $ start_station_id  : num [1:3818004] 199 44 15 123 173 98 98 211 150 268 ...
 $ start_station_name: chr [1:3818004] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
 $ end_station_id    : num [1:3818004] 84 624 644 176 35 49 49 142 148 141 ...
 $ end_station_name  : chr [1:3818004] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
 $ member_casual     : chr [1:3818004] "member" "member" "member" "member" ...
 $ date              : Date[1:3818004], format: "2019-01-01" "2019-01-01" ...
 $ month             : chr [1:3818004] "01" "01" "01" "01" ...
 $ day               : chr [1:3818004] "01" "01" "01" "01" ...
 $ year              : chr [1:3818004] "2019" "2019" "2019" "2019" ...
 $ day_of_week       : chr [1:3818004] "Tuesday" "Tuesday" "Tuesday" "Tuesday" ...
 $ ride_length       : 'difftime' num [1:3818004] 6.5 7.35 13.8166666666667 29.7166666666667 ...
  ..- attr(*, "units")= chr "mins"

Convert “ride_length” from Factor to numeric so we can run calculations on the data

> is.factor(all_trips$ride_length)
[1] FALSE
> all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
> is.numeric(all_trips$ride_length)
[1] TRUE

Data Cleaning

The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative. We will create a new version of the dataframe (v2) since data is being removed

> all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

Analyze Phase

We will perform Exploratory Data Analysis (EDA) and apply statistical methods to uncover patterns, trends, and insights in the data. This step helps us to answer business task and identify key findings, such as trends in bike usage by customer type

Descriptive Analysis on ride_length by User Type

> # Filter for casual users
> casual_trips <- all_trips_v2 %>%
+   filter(member_casual == "casual")
> 
> # Filter for member users
> member_trips <- all_trips_v2 %>%
+   filter(member_casual == "member")
> 
> # Descriptive analysis on ride_length (all figures in minutes) for casual and member users
> summary(casual_trips$ride_length)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
     1.02     15.25     25.83     57.02     45.30 177200.37 
> summary(member_trips$ride_length)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
     1.02      6.03      9.80     14.33     16.12 150943.90

Key Insights

The result suggest that, on average, casual users tend to take longer rides compared to member users.

User Type Distribution

> # Count the number of casual and member customers
> customer_distribution <- all_trips_v2 %>%
+   count(member_casual)
> 
> # Calculate the percentage
> customer_distribution <- customer_distribution %>%
+   mutate(percentage = n / sum(n) * 100)
> 
> # Plotting
> ggplot(customer_distribution, aes(x = "", y = n, fill = member_casual)) +
+   geom_bar(stat = "identity", width = 1) +
+   coord_polar("y") +
+   geom_text(aes(label = paste0(round(percentage), "%")), position = position_stack(vjust = 0.5)) +
+   labs(title = "Distribution of Casual and Member Users",
+        fill = "Customer Type") +
+   theme_void() +
+   scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))

Key insights
There are significantly more member users compared to casual users.

Number of Rides Comparison

> # Count the number of rides for each user type
> ride_counts <- all_trips_v2 %>% 
+   group_by(member_casual) %>% 
+   summarise(num_rides = n())
> 
> ggplot(ride_counts, aes(x = member_casual, y = num_rides, fill = member_casual)) +
+   geom_bar(stat = "identity") +
+   labs(title = "Number of Rides by User Type", x = "User Type", y = "Number of Rides") +
+   scale_y_continuous(labels = scales::comma) +  # Format y-axis labels with commas
+   theme_minimal() +
+   scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))

Key insights
The graph suggest that most of the bike rides were made up by member users.

Average ride length comparison

> # Calculate average ride length for each user type
> ride_duration <- all_trips_v2 %>% 
+   group_by(member_casual) %>% 
+   summarise(avg_ride_length = mean(ride_length, na.rm = TRUE))
> 
> # Bar plot of average ride length by user type
> ggplot(ride_duration, aes(x = member_casual, y = avg_ride_length, fill = member_casual)) +
+   geom_bar(stat = "identity") +
+   labs(title = "Average Ride Length by User Type", x = "User Type", y = "Average Ride Length", fill = "User Type") +
+   theme_minimal() +
+   scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))

Key insights
Casual members ride the longest between user types. This is interesting because from previous information, we found out that most rides were made by member users.

Activity pattern of member riders

> activity_by_day_time_member <- all_trips_v2 %>%
+   filter(member_casual == "member") %>%
+   mutate(hour = lubridate::hour(started_at)) %>%
+   group_by(member_casual, day_of_week, hour, .drop = FALSE) %>%
+   summarise(num_rides = n()) %>%
+   arrange(member_casual, day_of_week, hour)
> 
> # Plotting
> ggplot(activity_by_day_time_member, aes(x = hour, y = day_of_week, fill = num_rides)) +
+   geom_tile() +
+   labs(title = "Member Users Ride Activity by Day of Week and Time of Day",
+        x = "Hour of Day", y = "Day of Week", fill = "Number of Rides") +
+   theme_minimal() +
+   scale_fill_viridis_c()

Key insights

Highest Activity: Higher levels of activity (represented by warmer colors, like yellow) by member users are visible during peak times. These are roughly around 8 AM and 5 PM, which likely correspond to typical commuting hours for work
Day of the Week Patterns: The activity levels vary across the days of the week. It seems that weekdays have more consistent activity during peak hours than weekends.
Weekend Trends: During the weekends, the activity distribution is less peaked around traditional commuting hours and more spread out throughout the day.
Midday and Late Evening: On weekdays, there is a notable decrease in activity during the late evening and midday hours.
Least Activity: The least activity seems to occur in the early morning hours, around 0 to 5 AM across all day of the week.

Activity pattern of casual riders

> activity_by_day_time_casual <- all_trips_v2 %>%
+   filter(member_casual == "casual") %>%
+   mutate(hour = lubridate::hour(started_at)) %>%
+   group_by(member_casual, day_of_week, hour, .drop = FALSE) %>%
+   summarise(num_rides = n()) %>%
+   arrange(member_casual, day_of_week, hour)
> 
> ggplot(activity_by_day_time_casual, aes(x = hour, y = day_of_week, fill = num_rides)) +
+   geom_tile() +
+   labs(title = "Casual Users Ride Activity by Day of Week and Time of Day",
+        x = "Hour of Day", y = "Day of Week", fill = "Number of Rides") +
+   theme_minimal() +
+   scale_fill_viridis_c()

Key insights

Highest Activity: The highest activity occurs in the middle of the day, particularly around midday to early afternoon. This is indicated by the bright yellow color centered around 12-15 hours (around noon to 3 PM) on Sunday.
Weekend vs. Weekday Activity: Casual users exhibit higher overall activity on weekends compared to weekdays, indicating a preference for leisure activities or errands during weekends.
Weekday Patterns: While weekdays show lower activity level overall, there are noticeable spikes during typical rush hours in the morning and evening. This suggest that some casual users use Cyclistic’s bike for commuting purposes.
Day-to-Day Variations: Monday shows the least activity among casual users, with consistently low ride activity throughout the day.
Early Morning Inactivity: Across all days, the early morning hours (0 to around 5) exhibit lowest activity levels, represented by dark blue colors.

User type activity by day of the week

> # Extract day of week from started_at
> all_trips_v2$day_of_week <- factor(weekdays(all_trips_v2$started_at, abbreviate = TRUE),
+                                    levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
> 
> # Calculate number of rows by user type and day of week
> rows_by_user_type_day <- all_trips_v2 %>%
+   group_by(day_of_week, member_casual) %>%
+   summarise(num_rows = n())
> 
> # Plotting
> ggplot(rows_by_user_type_day, aes(x = day_of_week, y = num_rows, fill = member_casual)) +
+   geom_bar(stat = "identity", position = "dodge") +
+   labs(title = "Ride Counts by User Type During a Week",
+        x = "Day of Week", y = "Ride Counts", fill = "User Type") +
+   scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "k")) +
+   theme_minimal()+
+   scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))

Key insights

Member users tend to use Cyclistic’s service more during workweek, with a peak in usage on wednesday and a decline towards the weekend.
Casual users, on the other hand, show a different trend, with lower usage on Monday and increasing usage throughout the week, peaking on Saturday.
This patterns suggests that member users may be using the service for commuting. But casual members may be using the service for leisure activities.

User type activity by month of the year

> # Calculate number of rows by user type and month
> rows_by_user_type_month <- all_trips_v2 %>%
+   group_by(month, member_casual) %>%
+   summarise(num_rows = n())
> 
> # Plotting
> ggplot(rows_by_user_type_month, aes(x = factor(month, labels = month.name), y = num_rows, fill = member_casual)) +
+   geom_bar(stat = "identity", position = "dodge") +
+   labs(title = "Ride Counts by User Type During a Year",
+        x = "Month", y = "Ride Counts", fill = "User Type") +
+   theme_minimal() +
+   theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
+   scale_y_continuous(labels = scales::comma_format()) +
+   scale_fill_manual(values = c("casual" = "#54bebe", "member" = "#d7658b"))

Key Insights

Both casual and member users exhibit seasonal patterns in bike usage, with usage starting to rise in March, peaking in July, and then declining through December and January. This pattern likely reflects seasonal weather changes, with higher usage during warmer months.
The peak in bike usage occurs during the summer month of June until August for both user types. This indicates a preference during warmer weather, potentially for recreational or commuting purposes.
Casual users show more variability in their usage patterns. Their usage increases significantly from winter to summer, suggesting that their biking habits are more influenced by seasonal factors like weather.
In contrast, member users demonstrate more consistent usage levels across all months, indicating that their bike usage may be more tied to regular commuting or consistent travel needs.

Share Phase

In this step we will communicate your findings and insights through visualizations, reports, or presentations. This step involves selecting the most relevant and impactful way to present our analysis to stakeholders.

Here are a few key findings on the difference between member and casual users:
1. Differences in Ride Duration: On average, casual users tend to take longer rides compared to member users. This could indicate different usage patterns or preferences between the two user types.

2. User Distribution: There is a significant difference in the number of member users compared to casual users.

3. Rides Usage: Most of the bike rides were made up by member users

4. Longest Rides: Casual users ride the longest compared to other user types. This finding is interesting because it contrasts with the fact that most rides were made by member users, suggesting different behavior patterns between the two groups.

5. Member Users Activity Patterns:

Member users show higher activity during peak commuting times, around 8 AM and 5 PM, indicating a pattern of commuting to and from work.
Activity levels vary across the days of the week, with more consistent activity during peak hours on weekdays compared to weekends.
On weekends, activity is less peaked around commuting hours, suggesting different usage patterns on weekends.
Noticeable decrease in activity during late evening and midday hours on weekdays.
The least activity seems to occur in the early morning hours, around 0 to 5 AM across all day of the week.

6. Casual Users Activity Patterns:

Casual users are most active in the middle of the day, particularly around midday to early afternoon, indicating a preference for daytime leisure activities.
Higher overall activity on weekends compared to weekdays, suggesting leisure activities or errands during weekends.
There are spikes in activity during typical rush hours on weekdays, indicating some casual users use Cyclistic service for commuting.
Monday shows the least activity among casual users, with consistently low ride activity throughout the day.
And the lowest activity levels in the early morning hours across all days.

7. Seasonal Trends: Both casual and member users exhibit seasonal patterns in bike usage, with higher usage during warmer months from March to July, and declining through December and January. This suggests that weather significantly impacts bike usage.

8. Summer Popularity: There is a peak in bike usage during summer months of June to August for both user types, indicating a preference for biking during warmer weather.

Act Phase

We will suggest action based on the analysis to achieve the business task. This step could involve implementing a new marketing strategy, optimizing bike availability, or making other operational changes based on our findings.

Based on the insights gathered:

a. Application of Insights: Implement targeted marketing campaigns based on the identified user behaviors and preferences. Use the insights to tailor promotions, messaging, and user experience to better appeal to casual riders and encourage them to become annual members.

b. Next Steps: Initiate a comprehensive marketing strategy that includes promotional offers, engagement campaigns, and user experience enhancements. Develop a timeline for implementation and establish key performance indicators (KPIs) to measure the success of the strategy.

c. Additional Data: Explore additional data sources, such as user surveys or feedback, to gain deeper insights into the reasons behind casual riders’ usage patterns and preferences. This data can help refine the marketing strategy and improve its effectiveness in converting casual riders into annual members.

Furthermore, based on the analysis, here are my top three recommendations:

1. Promotional Offers: Create targeted promotional offers aimed at casual riders, such as discounted annual memberships or exclusive benefits for converting to annual membership. Highlight the cost savings and convenience of annual membership compared to pay-as-you-ride options.

2. Engagement Campaigns: Launch engagement campaigns to educate casual riders about the benefits of annual membership, such as unlimited rides, priority access to bikes, and member-exclusive events. Use social media, email marketing, and in-app notifications to reach out to casual riders.

3. User Experience Enhancements: Improve the user experience for both casual and annual members to encourage conversion. This could include streamlining the registration process, enhancing bike availability and maintenance, and offering personalized recommedations based on ride history.

Thank you

for exploring my analysis! I hope you found my insights valuable and informative.

Cyclistic Bike-Share Analysis Case Study

Kayla Aisya Zahra

2023-12-10

Overview

Ask Phase

Prepare Phase

Process Phase

Analyze Phase

Act Phase

Thank you