Google Data Analytics: Cyclistic Case Study

Phase 1: Ask

Business task:

The goal of this analysis is to understand how annual members and casual riders use Cyclistic bikes differently, and to identify any insights that can inform a marketing strategy aimed at converting casual riders to annual members. By understanding the differences between these two customer segments, we can better target our marketing efforts to ultimately drive future growth for the company.

Key stakeholders:

Cyclistic executive team: The executive team will be responsible for deciding whether to approve the recommended marketing program.
Lily Moreno: The director of marketing who is responsible for the development of campaigns and initiatives to promote the bike-share program.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.

Phase 2: Prepare

Description of data source:

We will use Cyclistic’s historical bike trip data to analyze and identify trends. We will download the previous 12 months of Cyclistic trip data (January 2022 - December 2022) located here. (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable us to address the key business task. The data has been made available by Motivate International Inc under this license).

The 12 monthly datasets are organized as separate files in comma-delimited (.CSV) format and stored on my local drive. Each file consists of 13 columns (listed below) that contain information related to each bike trip that was recorded during that particular month.

ride_id
rideable_type
started_at
ended_at
start_station_name
start_station_id
end_station_name
end_station_id
start_lat
start_lng
end_lat
end_lng
member_casual

In order to understand how casual riders and annual members use Cyclistic bikes differently, we will need to analyze and compare various usage metrics between the two groups, including:

Ride length
Distance traveled
Hourly, weekly, and monthly trends and patterns
Popular bikes, stations and routes

After an initial review of the data, we will need to add some calculated fields for length and distance of rides, and also extract from start times the month of year, day of week, and hour of day to provide us with more opportunities to aggregate the data.

Before conducting any analysis, however, we first need to verify the data’s integrity! In the process phase, we will perform several pre-cleaning activities to ensure the overall accuracy, consistency, and completeness of the data.

Phase 3: Process

For this case study, I used R in RStudio because the combined data set is large (more than 5 million rows and over 1 GB), which is beyond what a spreadsheet can handle comfortably; an Excel worksheet, for example, is limited to 1,048,576 rows. R provides an accessible language for organizing, modifying, and cleaning data frames, and for creating data visualizations. RStudio, an integrated development environment (IDE), also makes the work easy to reproduce: once the code is written, you can load a new data set and run the same scripts again.

Documentation of data cleaning and manipulation:

Install required packages:

install.packages("tidyverse")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("ggmap")
install.packages("scales")
install.packages("geosphere")

Load packages:

library(tidyverse)
library(lubridate) 
library(ggplot2) 
library(ggmap)
library(scales)
library(geosphere)

Import the 12 monthly data sets (csv files) into RStudio:

m1 <- read_csv("cyclistic_data/202201-divvy-tripdata.csv")
m2 <- read_csv("cyclistic_data/202202-divvy-tripdata.csv")
m3 <- read_csv("cyclistic_data/202203-divvy-tripdata.csv")
m4 <- read_csv("cyclistic_data/202204-divvy-tripdata.csv")
m5 <- read_csv("cyclistic_data/202205-divvy-tripdata.csv")
m6 <- read_csv("cyclistic_data/202206-divvy-tripdata.csv")
m7 <- read_csv("cyclistic_data/202207-divvy-tripdata.csv")
m8 <- read_csv("cyclistic_data/202208-divvy-tripdata.csv")
m9 <- read_csv("cyclistic_data/202209-divvy-tripdata.csv")
m10 <- read_csv("cyclistic_data/202210-divvy-tripdata.csv")
m11 <- read_csv("cyclistic_data/202211-divvy-tripdata.csv")
m12 <- read_csv("cyclistic_data/202212-divvy-tripdata.csv")

Combine the individual monthly data sets into one data frame. The column names in each file must match exactly, because bind_rows() matches columns by name:

tripdata <- bind_rows(m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12)
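The column-name check can be done in code before binding. A minimal sketch in base R, using two small made-up data frames (`jan` and `feb` are hypothetical stand-ins for the monthly files, not the real data):

```r
# Two toy "monthly" data frames standing in for the real files (hypothetical data)
jan <- data.frame(ride_id = c("A1", "A2"), member_casual = c("casual", "member"))
feb <- data.frame(ride_id = "B1", member_casual = "member")

monthly <- list(jan = jan, feb = feb)

# TRUE for every data frame whose column names match the first one exactly
same_cols <- vapply(
  monthly,
  function(df) identical(names(df), names(monthly[[1]])),
  logical(1)
)
stopifnot(all(same_cols))

# Stack the frames once the names are confirmed to match
combined <- do.call(rbind, monthly)
nrow(combined)   # 3
```

The check is worth doing because bind_rows() fills columns that are missing from one input with NA rather than stopping, so a renamed column would otherwise go unnoticed.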

Rename columns for easier understanding and analysis:

tripdata <- rename(tripdata, 
                      bike_type = rideable_type,
                      customer_type = member_casual)

Inspect the new df:

colnames(tripdata) #check column names

##  [1] "ride_id"            "bike_type"          "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "customer_type"

str(tripdata) #check data types for each column

## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ bike_type         : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ customer_type     : chr [1:5667717] "casual" "casual" "member" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Add columns that list the date, month, day of week, and start time of each ride that will allow us to perform our analysis:

tripdata$date <- as.Date(tripdata$started_at) #extract date (YYYY-MM-DD)
tripdata$month <- format(as.Date(tripdata$date), "%B") #extract month (full name)
tripdata$day_of_week <- format(as.Date(tripdata$date), "%A") #extract day of week (full name)
#extract start hour of day (00-23) as text; format() keeps the time zone stored on started_at
tripdata$start_time <- format(tripdata$started_at, "%H")
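A small sketch of these extractions on a single timestamp taken from the first row of the data. It uses format(), which keeps the time zone stored on the value, whereas strftime() converts to the session's local time zone unless `tz` is supplied. The variable names are illustrative only:

```r
# One timestamp stored in UTC, which is how readr's read_csv() parses datetimes by default
started_at <- as.POSIXct("2022-01-13 11:59:47", tz = "UTC")

ride_date   <- as.Date(started_at, tz = "UTC")   # calendar date in UTC
ride_hour   <- format(started_at, "%H")          # two-digit hour as text
ride_hour_n <- as.integer(ride_hour)             # numeric hour, 0 to 23

ride_date     # 2022-01-13
ride_hour     # "11"
ride_hour_n   # 11
```

Month and weekday names produced by "%B" and "%A" depend on the session locale, so they are left out of this sketch.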

Add a calculated column ‘ride length’ for all trips (in seconds):

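tripdata$ride_length <- difftime(tripdata$ended_at, tripdata$started_at, units = "secs") #fix units to seconds instead of relying on the automatic choice
is.factor(tripdata$ride_length)

## [1] FALSE

tripdata$ride_length <- as.numeric(tripdata$ride_length) #convert difftime to plain numeric seconds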
is.numeric(tripdata$ride_length)

## [1] TRUE
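By default difftime() picks its units automatically, so the same calculation can come back in seconds, minutes, hours, or days depending on the values. A short sketch using the first trip's timestamps shows why fixing the units matters:

```r
start <- as.POSIXct("2022-01-13 11:59:47", tz = "UTC")
end   <- as.POSIXct("2022-01-13 12:02:44", tz = "UTC")

auto_units <- difftime(end, start)                   # units chosen automatically
secs_units <- difftime(end, start, units = "secs")   # units fixed to seconds

units(auto_units)        # "mins"
as.numeric(secs_units)   # 177
```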

Add a calculated column ‘ride distance’ for all trips (in km):

tripdata$ride_distance <- distGeo(matrix(c(tripdata$start_lng, tripdata$start_lat), ncol = 2), matrix(c(tripdata$end_lng, tripdata$end_lat), ncol = 2))
tripdata$ride_distance <- tripdata$ride_distance/1000 #distance in km
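For intuition about what the distance column measures, here is a hedged sketch of the haversine great-circle formula in base R. It is not what distGeo() computes: distGeo() returns metres on the WGS84 ellipsoid, while this helper (the name `haversine_km` and the 6371 km radius are assumptions for illustration) treats the Earth as a sphere, so the two differ slightly.

```r
# Great-circle distance in km on a sphere of radius 6371 km (an approximation)
haversine_km <- function(lng1, lat1, lng2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical trip: one degree of latitude due north along the same meridian
haversine_km(-87.65, 41.0, -87.65, 42.0)   # about 111.2 km
```

Both approaches measure straight-line distance between start and end points, not the route actually ridden, so a round trip has a distance of 0 km regardless of how far the rider went.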

Check summary statistics of added columns:

summary(tripdata)

##    ride_id           bike_type           started_at                    
##  Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00  
##                                        Mean   :2022-07-20 07:21:18.74  
##                                        3rd Qu.:2022-09-16 07:21:29.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-01-01 00:01:48.00   Length:5667717     Length:5667717    
##  1st Qu.:2022-05-28 19:43:07.00   Class :character   Class :character  
##  Median :2022-07-22 15:24:44.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 07:40:45.33                                        
##  3rd Qu.:2022-09-16 07:39:03.00                                        
##  Max.   :2023-01-02 04:56:45.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       customer_type           date           
##  Min.   : 0.00   Min.   :-88.14   Length:5667717     Min.   :2022-01-01  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2022-05-28  
##  Median :41.90   Median :-87.64   Mode  :character   Median :2022-07-22  
##  Mean   :41.90   Mean   :-87.65                      Mean   :2022-07-19  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2022-09-16  
##  Max.   :42.37   Max.   :  0.00                      Max.   :2022-12-31  
##  NA's   :5858    NA's   :5858                                            
##     month           day_of_week         start_time         ride_length     
##  Length:5667717     Length:5667717     Length:5667717     Min.   :-621201  
##  Class :character   Class :character   Class :character   1st Qu.:    349  
##  Mode  :character   Mode  :character   Mode  :character   Median :    617  
##                                                           Mean   :   1167  
##                                                           3rd Qu.:   1108  
##                                                           Max.   :2483235  
##                                                                            
##  ride_distance     
##  Min.   :   0.000  
##  1st Qu.:   0.873  
##  Median :   1.575  
##  Mean   :   2.140  
##  3rd Qu.:   2.781  
##  Max.   :9817.319  
##  NA's   :5858

There are a few “bad data” problems that need to be cleaned:

1. Trips where ‘ride length’ is 60 seconds or less, or 9 hours (32,400 seconds) or more
2. Trips that started or ended at “repair” stations
3. Trips where ‘ride distance’ is NA, 0.1 km or less, or 100 km or more

tripdata_clean <- tripdata %>%
  filter(ride_length > 60 & ride_length < 32400) %>%
  filter(!grepl("repair", start_station_name, ignore.case = TRUE)) %>%
  filter(!grepl("repair", end_station_name, ignore.case = TRUE)) %>%
  filter(!is.na(ride_distance)) %>%
  filter(ride_distance > 0.1 & ride_distance < 100)
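To see how many rows each rule removes, the same conditions can be evaluated separately. A minimal base R sketch on made-up trips (the station names and values are hypothetical, and only the start station is checked to keep it short):

```r
# Toy trips (made-up values) covering each cleaning rule
trips <- data.frame(
  ride_length        = c(30, 600, 40000, 900, 1200, 700),   # seconds
  start_station_name = c("A St", "A St", "B St", "Base - Repair Shop", "C St", NA),
  ride_distance      = c(1.2, 2.5, 3.0, 1.1, 0.0, 0.8)      # km
)

keep_length   <- trips$ride_length > 60 & trips$ride_length < 32400
keep_station  <- !grepl("repair", trips$start_station_name, ignore.case = TRUE)
keep_distance <- !is.na(trips$ride_distance) &
  trips$ride_distance > 0.1 & trips$ride_distance < 100

# Rows failing each rule, then the rows that pass all of them
removed <- c(length   = sum(!keep_length),
             station  = sum(!keep_station),
             distance = sum(!keep_distance))
removed             # length 2, station 1, distance 1
trips_clean <- trips[keep_length & keep_station & keep_distance, ]
nrow(trips_clean)   # 2
```

Note that the trip with a missing station name survives the station rule, because grepl() returns FALSE for NA.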

Check summary statistics of clean df:

summary(tripdata_clean)

##    ride_id           bike_type           started_at                    
##  Length:5232703     Length:5232703     Min.   :2022-01-01 00:01:00.00  
##  Class :character   Class :character   1st Qu.:2022-05-29 00:56:11.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 20:38:59.00  
##                                        Mean   :2022-07-20 15:59:30.03  
##                                        3rd Qu.:2022-09-16 16:55:10.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##     ended_at                     start_station_name start_station_id  
##  Min.   :2022-01-01 00:04:39.0   Length:5232703     Length:5232703    
##  1st Qu.:2022-05-29 01:16:09.5   Class :character   Class :character  
##  Median :2022-07-22 20:56:14.0   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 16:14:58.7                                        
##  3rd Qu.:2022-09-16 17:11:26.0                                        
##  Max.   :2023-01-01 03:06:36.0                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5232703     Length:5232703     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##     end_lat         end_lng       customer_type           date           
##  Min.   :41.55   Min.   :-88.14   Length:5232703     Min.   :2022-01-01  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2022-05-29  
##  Median :41.90   Median :-87.64   Mode  :character   Median :2022-07-22  
##  Mean   :41.90   Mean   :-87.65                      Mean   :2022-07-20  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2022-09-16  
##  Max.   :42.37   Max.   :-87.30                      Max.   :2022-12-31  
##     month           day_of_week         start_time         ride_length     
##  Length:5232703     Length:5232703     Length:5232703     Min.   :   61.0  
##  Class :character   Class :character   Class :character   1st Qu.:  369.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :  626.0  
##                                                           Mean   :  928.8  
##                                                           3rd Qu.: 1089.0  
##                                                           Max.   :32386.0  
##  ride_distance   
##  Min.   : 0.100  
##  1st Qu.: 1.032  
##  Median : 1.694  
##  Mean   : 2.297  
##  3rd Qu.: 2.926  
##  Max.   :42.383

Phase 4: Analyze

Now that the Cyclistic data is stored appropriately and has been prepared for analysis, let’s start putting it to work. Key tasks in the analysis phase include:

Aggregate your data so it’s useful and accessible
Organize and format your data
Perform calculations
Identify trends and relationships

Let’s compare total number of trips and percentage distribution by customer type:

tripdata_clean %>% 
  group_by(customer_type) %>% 
  summarise(ride_count = length(ride_id), ride_percentage = (length(ride_id) / nrow(tripdata_clean)) * 100);

## # A tibble: 2 × 3
##   customer_type ride_count ride_percentage
##   <chr>              <int>           <dbl>
## 1 casual           2079861            39.7
## 2 member           3152842            60.3

Cyclistic members made up 60.3% of total trips while casual riders made up 39.7%, a gap of about 20 percentage points. In relative terms, members made roughly 52% more trips than casual riders in the 2022 calendar year.
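The gap between the two groups can be stated either in percentage points or as a relative difference, and the two numbers are not interchangeable. A quick check using the ride counts from the tibble above:

```r
casual <- 2079861
member <- 3152842
total  <- casual + member

share_gap_pp <- (member - casual) / total * 100   # gap between shares, in percentage points
relative_pct <- (member / casual - 1) * 100       # member trips relative to casual trips

round(share_gap_pp, 1)   # 20.5
round(relative_pct, 1)   # 51.6
```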

Let’s analyze total number of trips via e-bike and percentage distribution of all trips:

electric_trips <- tripdata_clean %>%
  filter(bike_type == "electric_bike")

(nrow(electric_trips) / nrow(tripdata_clean)) * 100

## [1] 50.71633

Check summary statistics of ride length:

summary(tripdata_clean$ride_length);

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.0   369.0   626.0   928.8  1089.0 32386.0

Compare summary statistics of ride length by customer type:

tripdata_clean %>%
  group_by(customer_type) %>% 
  summarise(min_ride_length = min(ride_length)/60,
    median_ride_length = median(ride_length)/60,
    avg_ride_length = mean(ride_length)/60,
    max_ride_length = max(ride_length)/60);

## # A tibble: 2 × 5
##   customer_type min_ride_length median_ride_length avg_ride_length max_ride_le…¹
##   <chr>                   <dbl>              <dbl>           <dbl>         <dbl>
## 1 casual                   1.02              13.0             20.2          540.
## 2 member                   1.02               9.05            12.3          540.
## # … with abbreviated variable name ¹max_ride_length

The average ride length of casual riders in 2022 was 20.2 minutes, about 64% longer than the 12.3 minutes for annual members (the medians are 13.0 and 9.05 minutes). We can hypothesize that casual riders use Cyclistic for leisure, such as exploring the city, whereas members use Cyclistic for non-recreational activities, such as commuting to work or school, running errands, etc.

Let’s analyze our usage metrics by days of the week for both customer types. First order the days of the week from Monday to Sunday:

tripdata_clean$day_of_week <- ordered(tripdata_clean$day_of_week, 
                                      levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"));

Compare total trips by customer type in each day of the week:

tripdata_clean %>% 
  group_by(day_of_week, customer_type) %>%
  summarise(number_of_rides = n()) %>%
  arrange(day_of_week, customer_type);

## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 3
## # Groups:   day_of_week [7]
##    day_of_week customer_type number_of_rides
##    <ord>       <chr>                   <int>
##  1 Monday      casual                 245924
##  2 Monday      member                 445756
##  3 Tuesday     casual                 237535
##  4 Tuesday     member                 490873
##  5 Wednesday   casual                 247997
##  6 Wednesday   member                 495727
##  7 Thursday    casual                 279847
##  8 Thursday    member                 503149
##  9 Friday      casual                 302127
## 10 Friday      member                 440151
## 11 Saturday    casual                 423211
## 12 Saturday    member                 415613
## 13 Sunday      casual                 343220
## 14 Sunday      member                 361573

Compare average ride length by customer type in each day of the week:

tripdata_clean %>% 
  group_by(day_of_week, customer_type) %>%
  summarise(avg_ride_length = mean(ride_length)) %>%
  arrange(day_of_week, customer_type);

## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 3
## # Groups:   day_of_week [7]
##    day_of_week customer_type avg_ride_length
##    <ord>       <chr>                   <dbl>
##  1 Monday      casual                  1225.
##  2 Monday      member                   712.
##  3 Tuesday     casual                  1078.
##  4 Tuesday     member                   702.
##  5 Wednesday   casual                  1047.
##  6 Wednesday   member                   708.
##  7 Thursday    casual                  1081.
##  8 Thursday    member                   718.
##  9 Friday      casual                  1151.
## 10 Friday      member                   731.
## 11 Saturday    casual                  1378.
## 12 Saturday    member                   826.
## 13 Sunday      casual                  1383.
## 14 Sunday      member                   819.

Let’s order the months of the year from January to December:

tripdata_clean$month <- ordered(tripdata_clean$month, levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"));

Now compare total rides by customer type in each month of the year:

tripdata_clean %>% 
  group_by(customer_type, month) %>%  
  summarise(number_of_rides = n(), .groups="drop") %>% 
  arrange(customer_type, month) %>%
  print(n = 24);

## # A tibble: 24 × 3
##    customer_type month     number_of_rides
##    <chr>         <ord>               <int>
##  1 casual        January             16685
##  2 casual        February            19184
##  3 casual        March               79258
##  4 casual        April              112284
##  5 casual        May                248043
##  6 casual        June               330262
##  7 casual        July               361847
##  8 casual        August             322034
##  9 casual        September          268749
## 10 casual        October            189134
## 11 casual        November            91476
## 12 casual        December            40905
## 13 member        January             80322
## 14 member        February            88026
## 15 member        March              181928
## 16 member        April              227024
## 17 member        May                332793
## 18 member        June               379145
## 19 member        July               394190
## 20 member        August             402996
## 21 member        September          382292
## 22 member        October            330212
## 23 member        November           224789
## 24 member        December           129125

The tibble shows us that:

Casual riders logged the most trips in June, July, and August
Annual members logged the most trips in July, August, and September
Both customer types made the fewest number of trips in January, February, and December

Let’s compare average ride length by customer type in each month of the year:

tripdata_clean %>% 
  group_by(month, customer_type) %>%  
  summarise(avg_ride_length = mean(ride_length), .groups="drop") %>% 
  arrange(month, customer_type) %>%
  print(n = 24);

## # A tibble: 24 × 3
##    month     customer_type avg_ride_length
##    <ord>     <chr>                   <dbl>
##  1 January   casual                   935.
##  2 January   member                   647.
##  3 February  casual                  1068.
##  4 February  member                   650.
##  5 March     casual                  1290.
##  6 March     member                   695.
##  7 April     casual                  1265.
##  8 April     member                   681.
##  9 May       casual                  1389.
## 10 May       member                   782.
## 11 June      casual                  1295.
## 12 June      member                   814.
## 13 July      casual                  1302.
## 14 July      member                   808.
## 15 August    casual                  1207.
## 16 August    member                   784.
## 17 September casual                  1127.
## 18 September member                   754.
## 19 October   casual                  1053.
## 20 October   member                   689.
## 21 November  casual                   871.
## 22 November  member                   646.
## 23 December  casual                   747.
## 24 December  member                   621.

Create summary table for top 5 start stations by customer type:

summary_stations <- tripdata_clean %>% 
  drop_na(start_station_name) %>% 
  group_by(start_station_name, customer_type) %>%  
  summarise(number_of_trips = n()) %>%    
  arrange(customer_type, desc(number_of_trips)) %>%
  group_by(customer_type) %>%
  filter(rank(desc(number_of_trips)) <= 5) %>%
  ungroup();

## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.

Top 5 start stations for casual riders:

(1): Streeter Dr & Grand Ave (47,069 trips)
(2): DuSable Lake Shore Dr & Monroe St (25,042 trips)
(3): Millennium Park (21,230 trips)
(4): DuSable Lake Shore Dr & North Blvd (20,967 trips)
(5): Michigan Ave & Oak St (20,542 trips)

Top 5 start stations for annual members:

(1): Kingsbury St & Kinzie St (24,293 trips)
(2): Clark St & Elm St (21,083 trips)
(3): Wells St & Concord Ln (20,667 trips)
(4): Clinton St & Washington Blvd (19,268 trips)
(5): University Ave & 57th St (18,974 trips)

Now let’s create a new data frame that contains information about the most popular bike routes (> 300 trips), categorized by starting and ending coordinates, customer type, and bike type:

popular_routes <- tripdata_clean %>% 
  filter(start_lat != end_lat & start_lng != end_lng) %>%
  group_by(start_lat, start_lng, end_lat, end_lng, customer_type, bike_type) %>%
  summarise(total_rides = n(), .groups ="drop") %>%
  filter(total_rides > 300);
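The first filter keeps a route only when both the latitude and the longitude change. If the aim is simply to exclude trips that start and end at the same point (an assumption about intent, not something stated here), an `|` condition would also retain routes that run along a single latitude or longitude. A toy comparison with made-up coordinates:

```r
# Made-up coordinates: a round trip, a trip along the same latitude,
# and a trip where both coordinates change
start_lat <- c(41.90, 41.90, 41.90)
start_lng <- c(-87.63, -87.63, -87.63)
end_lat   <- c(41.90, 41.90, 41.95)
end_lng   <- c(-87.63, -87.70, -87.60)

keep_and <- start_lat != end_lat & start_lng != end_lng   # both must differ
keep_or  <- start_lat != end_lat | start_lng != end_lng   # at least one differs

keep_and   # FALSE FALSE  TRUE
keep_or    # FALSE  TRUE  TRUE
```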

And create two separate data frames, one for each customer type, that we will use to visualize popular routes over the Chicago map:

casual_riders <- popular_routes %>% filter(customer_type == "casual");
annual_members <- popular_routes %>% filter(customer_type == "member");

Set up ggmap and store map of Chicago (bbox, stamen map):

chicago <- c(left = -87.70, bottom = 41.77, right = -87.55, top = 41.97);
chicago_map <- get_stamenmap(bbox = chicago, zoom = 12, maptype = "terrain");

## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.

Phase 5: Share

Now that we have performed our analysis and gained some insights into our data, we need to create visualizations to share our findings. The visualizations should be clear and polished so that they communicate effectively to the executive team.

Create pie chart to visualize total trips distribution by customer type:

tripdata_clean %>% 
  count(customer_type) %>%
  mutate(ride_percentage = n / sum(n) * 100) %>%
  ggplot(aes(x = "", y = n, fill=customer_type)) + 
  geom_bar(stat = "identity", width= 1) +
  coord_polar("y", start=0) +
  geom_text(
    # y is inherited from ggplot(); position_stack() then centers each label in its slice
    aes(label = paste0(format(n, big.mark = ",", scientific = FALSE), " trips\n", round(ride_percentage, 1), "%")),
    position = position_stack(vjust = 0.5), 
    color = "black", 
    size = 5
  ) +
  theme_void() +
  theme(legend.position = "bottom", legend.title = element_blank(),
        plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 10),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  labs(title = "Cyclistic Trip Distribution 2022", 
       subtitle = "Annual Members vs Casual Riders");

Create stacked bar chart to visualize bike type distribution by customer type:

ggplot(tripdata_clean, aes(x = bike_type, fill = customer_type)) +
  geom_bar(position = "stack") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific=FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  labs(x = "", y = "Number of Trips") +
  labs(title = "Cyclistic Bike Type Distribution 2022",
       subtitle = "Annual Members vs Casual Riders",
       fill = "Customer Type");

Cyclistic members have a slight preference for classic bikes in comparison to electric bikes. Casual riders have a slight preference for electric bikes in comparison to both classic and docked bikes combined.

Create line plot with points to visualize total trips by day of the week for both customer types:

tripdata_clean %>% 
  group_by(customer_type, day_of_week) %>% 
  summarise(number_of_rides = n(), .groups="drop") %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, group = customer_type, color = customer_type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  labs(x = "", y = "") +
  labs(title = "Cyclistic Total Trips by Days of the Week 2022",
       subtitle = "Annual Members vs Casual Riders", 
       color = "Customer Type");

For annual members, bike usage remains high on weekdays (Mon-Fri) and drops significantly on weekends (Sat & Sun).
For casual riders, bike usage remains low on weekdays and increases significantly on weekends.
Bike usage for casual riders peaks on Saturday, which is the only day of the week where casual riders logged more trips than annual members in 2022.
The data suggests that members use Cyclistic bikes to commute to work or school, while casual riders prefer to use Cyclistic for leisure and recreation.

Create line plot with points to visualize total trips by month of the year for both customer types:

tripdata_clean %>%  
  group_by(customer_type, month) %>% 
  summarise(number_of_rides = n(),.groups="drop") %>%
  ggplot(aes(x = month, y = number_of_rides, group = customer_type, color = customer_type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) + 
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm"),
        axis.text.x = element_text(angle = 45, margin = margin(t = 10))) +
  labs(x = "", y = "Number of Trips") +
  labs(title = "Cyclistic Total Trips by Month of the Year 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

Bike usage follows a clear seasonal pattern: the number of trips rises in the summer and falls in the winter, which is consistent with a positive relationship between ridership and temperature in Chicago (temperature itself is not part of this data set).
While there is a significant drop in bike usage during winter months for both customer types, annual members made roughly 3.2 times as many trips as casual riders in December, 4.6 times as many in February, and 4.8 times as many in January.
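The winter ratios can be reproduced directly from the monthly ride counts in the Analyze phase:

```r
# Monthly ride counts copied from the monthly summary table
casual <- c(January = 16685, February = 19184, December = 40905)
member <- c(January = 80322, February = 88026, December = 129125)

winter_ratio <- round(member / casual, 1)
winter_ratio   # January 4.8, February 4.6, December 3.2
```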

Create line plot with points to visualize average ride length by day of the week for both customer types:

tripdata_clean %>% 
  group_by(customer_type, day_of_week) %>% 
  summarise(avg_ride_length = mean(ride_length)/60, .groups = "drop") %>% 
  ggplot(aes(x = day_of_week, y = avg_ride_length, group = customer_type, color = customer_type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  labs(x = "", y = "Avg. Ride Length (minutes)") +
  labs(title = "Cyclistic Average Ride Length by Day of the Week 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

The line plot above shows us that average ride length of casual riders is greater than the average ride length of current members on every day of the week.
Additionally, average ride length for both customer types increases on the weekends when people tend to ride bikes for leisure.

Create line plot with points to visualize average ride length by month of the year for both customer types:

tripdata_clean %>%  
  group_by(customer_type, month) %>% 
  summarise(avg_ride_length = mean(ride_length),.groups="drop") %>%
  ggplot(aes(x = month, y = avg_ride_length/60, group = customer_type, color = customer_type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm"),
        axis.text.x = element_text(angle = 45, margin = margin(t = 10))) +
  labs(x = "", y = "Average Ride Length (minutes)") +
  labs(title = "Cyclistic Average Ride Length by Month of the Year 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

Casual riders on average ride for longer times than annual members, regardless of month of the year.
Average ride length for both customer types tends to increase in the summer and decrease in the winter. This behavior is far more elastic for casual riders than it is for annual members.
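Average ride length for both customer types was actually higher in February than in November. The shorter month is unlikely to explain this by itself: fewer days reduce total ride time and total rides together, so the mean (total ride length / total rides) does not rise just because the denominator is smaller. A more likely factor is that, with fewer trips overall, a small number of unusually long rides may have more influence on the mean.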
Average ride length for casual riders is relatively high in March compared to the summer months. This spike may be caused by spring break vacation when people have more leisure time to explore the city.
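The effect of month length and trip volume on a mean can be checked on toy numbers. The sketch below uses invented ride lengths (in minutes) and base R only; none of the values come from the Cyclistic data.

```r
# Hypothetical ride lengths in minutes for one typical day (illustration only)
daily_rides <- c(8, 12, 15, 10, 20)

# Same mix of rides repeated over a 28-day and a 30-day month
mean_28 <- mean(rep(daily_rides, times = 28))
mean_30 <- mean(rep(daily_rides, times = 30))

# A quiet month (100 rides) and a busy month (10,000 rides) with the same mix
quiet_month <- rep(daily_rides, times = 20)
busy_month  <- rep(daily_rides, times = 2000)

# Add one 10-hour ride (e.g. a bike that was not docked) to each month
long_ride <- 600
shift_quiet <- mean(c(quiet_month, long_ride)) - mean(quiet_month)
shift_busy  <- mean(c(busy_month, long_ride)) - mean(busy_month)

# The median of the quiet month is unaffected by the single long ride
median_shift_quiet <- median(c(quiet_month, long_ride)) - median(quiet_month)

print(c(mean_28 = mean_28, mean_30 = mean_30))
print(c(shift_quiet = shift_quiet, shift_busy = shift_busy))
```

The two monthly means are identical, while the single long ride moves the mean of the quiet month far more than the mean of the busy month. Comparing `median(ride_length)` by month alongside the mean would be one way to see whether a winter bump is driven by a few long rides.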

Create bar chart to visualize average ride distance for each customer type:

tripdata_clean %>% 
  drop_na(ride_distance) %>%  # drop only rides with a missing distance, not rows with any NA
  group_by(customer_type) %>%
  summarise(avg_ride_distance = mean(ride_distance)) %>%
  ggplot() +
  geom_col(mapping = aes(x = customer_type, y = avg_ride_distance, fill = customer_type)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(x ="", y = "Average Distance (km)") +
  labs(title = "Cyclistic Average Ride Distance in 2022",
       subtitle = "Annual Members vs Casual Riders", 
       fill = "Customer Type");

The average ride distance of casual riders is slightly greater than the average ride distance of members. This is expected as casual riders who use Cyclistic for leisure, such as exploring the city, may travel greater distances than someone who is simply commuting to their place of work or school.
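When interpreting these averages, it helps to keep in mind what a distance derived from coordinates measures. This section does not show how ride_distance was calculated, so the following is only a sketch under the assumption that it is a straight-line distance between the start and end coordinates. The helper `haversine_km` is hypothetical and written in base R.

```r
# Hypothetical helper: great-circle distance in km between two points,
# assuming a spherical Earth with a mean radius of 6371 km.
haversine_km <- function(lat1, lng1, lat2, lng2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# One degree of latitude is roughly 111 km
one_degree_north <- haversine_km(41, -87.6, 42, -87.6)

# A ride that starts and ends at the same point has a distance of zero,
# no matter how far the rider actually travelled
round_trip <- haversine_km(41.88, -87.62, 41.88, -87.62)

print(c(one_degree_north = one_degree_north, round_trip = round_trip))
```

Under this assumption, round trips contribute a distance of zero, so a straight-line average may understate how far leisure riders actually travel.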

Create line plot to visualize average ride distance by day of the week for both customer types:

tripdata_clean %>%
  group_by(day_of_week, customer_type) %>%
  summarise(avg_ride_distance = mean(ride_distance), .groups = "drop") %>%
  ggplot(aes(x = day_of_week, y = avg_ride_distance, group = customer_type, color = customer_type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(x = "", y = "Average Ride Distance (km)") +
  labs(title = "Cyclistic Average Ride Distance by Day of the Week 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

Create a line plot to visualize average ride distance by month of the year for both customer types:

tripdata_clean %>%
  group_by(month, customer_type) %>%
  summarise(avg_ride_distance = mean(ride_distance), .groups = "drop") %>%
  ggplot(aes(x = month, y = avg_ride_distance, group = customer_type, color = customer_type)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(x = "", y = "Average Ride Distance (km)") +
  labs(title = "Cyclistic Average Ride Distance by Month 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

Create dodged bar chart to visualize popular start times by hour of day for both customer types:

tripdata_clean %>%
  group_by(start_time, customer_type) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  ggplot(aes(x = start_time, y = number_of_rides, fill = customer_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(x = "Hour of the Day (24 hours)", y = "Number of Trips") +
  labs(title = "Cyclistic Popular Start Times by Hour of Day 2022",
       subtitle = "Annual Members vs Casual Riders", 
       fill = "Customer Type",
       color = "Customer Type");

The bar chart above visualizes popular start times by hour of day for both customer types. For annual members, we see two peaks in bike usage, each preceded by a sharp increase: the first from 6am to 8am and the second from 3pm to 5pm. This suggests that members use Cyclistic bikes to commute to work or school. For casual riders, we see a single-peaked distribution of start times, with usage rising gradually from 5am to 5pm.
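The peak hours read off the chart can also be extracted directly from the hourly counts. This is a minimal sketch on invented counts; the `toy_hourly` table, the numeric `start_time` hours, and the `"member"` / `"casual"` labels are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical hourly trip counts (illustration only, not Cyclistic data)
toy_hourly <- tibble(
  start_time = rep(c(7, 8, 12, 17), times = 2),
  customer_type = rep(c("member", "casual"), each = 4),
  number_of_rides = c(300, 420, 250, 510, 60, 90, 200, 310)
)

# Keep the two busiest start hours for each customer type
peak_hours <- toy_hourly %>%
  group_by(customer_type) %>%
  slice_max(number_of_rides, n = 2) %>%
  ungroup() %>%
  arrange(customer_type, start_time)

print(peak_hours)
```

In this toy table the busiest member hours fall in the morning and late afternoon, while the busiest casual hours fall at midday and late afternoon.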

Create dodged bar chart to visualize popular start times by day of the week for both customer types:

tripdata_clean %>%
  group_by(start_time, customer_type, day_of_week) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  ggplot(aes(x = start_time, y = number_of_rides, fill = customer_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_wrap(~day_of_week) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        axis.text.x = element_blank()) +
  labs(x = "Start Time (24 hours)", y = "Trips") +
  labs(title = "Cyclistic Popular Start Times by Hour and Day of the Week 2022",
       subtitle = "Annual Members vs Casual Riders", 
       fill = "Customer Type");

There is a noticeable difference in start times between weekdays (Mon-Fri) and weekends (Sat & Sun) for both customer types.
Again we see that bike usage for members spikes from 6am-8am and 3pm-5pm on weekdays, which is consistent with using Cyclistic bikes to commute to and from work or school.
What’s interesting is that we can now also see these two peaks for casual riders on weekdays, albeit at a much smaller scale, suggesting that some casual riders may also rely on Cyclistic to commute to work or school.
We can also see there is a significant increase in volume of casual riders on the weekends.
On the weekends, start times for both customer types tend to center around 12pm-5pm with 3pm being the peak start time.
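The weekday and weekend contrast can be summarised as the share of each customer type's trips that fall on a weekend. The sketch below uses invented counts; the `toy_daily` table and the `"Sat"` / `"Sun"` day labels are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical trip counts for one weekday and one weekend day
# (illustration only, not Cyclistic data)
toy_daily <- tibble(
  day_of_week = rep(c("Mon", "Sat"), times = 2),
  customer_type = rep(c("member", "casual"), each = 2),
  number_of_rides = c(500, 300, 150, 350)
)

# Share of each customer type's trips that start on a weekend day
weekend_share <- toy_daily %>%
  mutate(day_type = if_else(day_of_week %in% c("Sat", "Sun"), "weekend", "weekday")) %>%
  group_by(customer_type) %>%
  summarise(
    weekend_share = sum(number_of_rides[day_type == "weekend"]) / sum(number_of_rides),
    .groups = "drop"
  )

print(weekend_share)
```

On a full week of data, trips spread evenly across all seven days would give a weekend share of 2/7 (about 29%), so a share well above that baseline indicates a weekend-heavy customer type.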

Visualize popular routes by casual riders and bike type on the Chicago map:

ggmap(chicago_map) +
  geom_point(casual_riders, mapping = aes(x = start_lng, y = start_lat, color = bike_type), size = 2) +
  theme(panel.spacing = unit(3,"lines"),
        plot.title = element_text(size = 16, face = "bold"),
        strip.text = element_blank()) +
  labs(x = NULL, y = NULL, color = "Bike Type") +
  labs(title = "Cyclistic Popular Start Stations by Casual Riders in 2022");

## Warning: Removed 2 rows containing missing values (`geom_point()`).

Visualize popular routes by annual members and bike type on the Chicago map:

ggmap(chicago_map) +
  geom_point(annual_members, mapping = aes(x = start_lng, y = start_lat, color = bike_type), size = 2) +
  theme(panel.spacing = unit(3,"lines"),
        plot.title = element_text(size = 16, face = "bold"),
        strip.text = element_blank()) +
  labs(x = NULL, y = NULL, color = "Bike Type") +
  labs(title = "Cyclistic Popular Start Stations by Annual Members in 2022");

## Warning: Removed 13 rows containing missing values (`geom_point()`).

The most popular routes for casual riders are concentrated in the downtown area of Chicago (Streeterville, Millennium Park, Navy Pier) and in Hyde Park, whereas the most popular routes for annual members are more scattered throughout the city.

Phase 6: Act

After analyzing Cyclistic’s historical trip data in 2022, it is clear that annual members and casual riders use Cyclistic bikes differently. Annual members made up a majority of all trips (60%) and they tend to use the bikes more during weekdays, especially during commuting hours. Casual riders, on the other hand, prefer to ride on the weekends, and they tend to use the bikes more for leisure and recreation.

Another key difference between the two customer types is that casual riders tend to ride for longer periods and distances, which suggests that they are more likely to use the bikes for sightseeing or exploring the city. This difference is most noticeable in late spring and early summer.

The data also shows that bike usage is positively correlated with temperature in Chicago, with more trips being made during the warmer summer months and fewer trips during the colder winter months. This finding suggests that Cyclistic may want to adjust its marketing strategy to promote bike usage during the late spring and early summer months.

Furthermore, the data shows that casual riders tend to make more trips in downtown Chicago (Millennium Park, Navy Pier) and in Hyde Park, whereas annual members are more scattered throughout the city. This insight may be useful in developing targeted marketing campaigns to convert casual riders into annual members, such as advertising at docking stations frequently used by casual riders.

Deliverable:

My top three recommendations based on my analysis:

(1): Offer a weekend-only annual membership at a lower rate than the standard annual membership: Annual memberships are more profitable for Cyclistic, and they are also more cost-effective for frequent riders. For example, an annual membership provides unlimited 45-minute rides for a flat fee, while casual riders pay per ride or per day. Since casual riders ride mostly on the weekends, a weekend-only annual membership offered at a lower price point would encourage them to switch to an annual plan while still reaping the cost-saving benefits.
(2): Offer targeted promotions or discounts to frequent riders: Since casual riders are already familiar with the Cyclistic program, offering targeted promotions and discounts to frequent riders can encourage them to upgrade to an annual membership. For example, Cyclistic could provide a discount on annual memberships to casual riders who have taken a certain number of rides in a given month, or offer a free month of membership to those who sign up during a specific promotion period. Additionally, Cyclistic should use social media, email, and other online channels to target casual riders, preferably right before spring break (when there is a spike in casual riders) and during the peak summer season.
(3): Implement a referral program: A referral program can be an effective way to attract new customers while also retaining current ones. Cyclistic could offer current members incentives for referring casual riders to join the program, such as free ride credits or even discounts for the usage of electric bikes, which made up over 50% of total trips in 2022. Based on our analysis, both casual riders and annual members are also more likely to ride bikes during the summer months. Cyclistic could leverage this insight by increasing its social media presence during the summer months to promote the program and reach potential customers.

Note: Additional data we could use to expand on our findings

Rider-level identifiers: All ride IDs in the data set are unique, so we cannot tell whether the same person made multiple trips.
Pricing details for casual riders and annual members: These would help optimize the cost structure and offer discounts without affecting the margin.
Data on bike availability and demand levels across stations: Based on the analysis, casual riders tend to make more trips in downtown Chicago and Hyde Park, while annual members are more scattered throughout the city. Cyclistic could consider redistributing bikes from less used stations to high-demand areas to ensure that bikes are available when and where they are needed most. Cyclistic could also consider expanding to other areas within the state of Illinois or even to other states, depending on the demand for bike-share programs in those areas.
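The point about ride IDs can be checked with a simple uniqueness test. This is a minimal base R sketch on an invented table; the `toy_rides` data frame and its ID values are hypothetical.

```r
# Hypothetical ride records (illustration only, not Cyclistic data)
toy_rides <- data.frame(
  ride_id = c("A1", "B2", "C3", "D4"),
  customer_type = c("member", "casual", "casual", "member")
)

# anyDuplicated() returns 0 when no value is repeated
ids_are_unique <- anyDuplicated(toy_rides$ride_id) == 0

# With one ID per trip, the number of distinct IDs equals the number of trips,
# so the IDs say nothing about how many distinct riders there are
n_trips <- nrow(toy_rides)
n_ids <- length(unique(toy_rides$ride_id))

print(c(ids_are_unique = ids_are_unique, n_trips = n_trips, n_ids = n_ids))
```

When every ID is unique, ride_id identifies a trip rather than a rider, which is why a separate rider-level identifier would be needed to count repeat customers.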

Google Data Analytics: Cyclistic Case Study

Eric Uchoa

2023-02-22
