Case Study: How Does a Bike-Share Navigate Speedy Success?

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Phase 1: Ask

Business task:

The goal of this analysis is to understand how annual members and casual riders use Cyclistic bikes differently, and to identify any insights that can inform a marketing strategy aimed at converting casual riders to annual members. By understanding the differences between these two customer segments, we can better target our marketing efforts to ultimately drive future growth for the company.

Key stakeholders:

Phase 2: Prepare

Description of data source:

We will use Cyclistic’s historical bike trip data to analyze and identify trends. We will download the previous 12 months of Cyclistic trip data (January 2022 - December 2022) located here. (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable us to address the key business task. The data has been made available by Motivate International Inc under this license).

The 12 monthly datasets are organized as separate files in comma-delimited (.CSV) format and stored on my local drive. Each file consists of 13 columns (listed below) that contain information related to each bike trip that was recorded during that particular month.

In order to understand how casual riders and annual members use Cyclistic bikes differently, we will need to analyze and compare various usage metrics between the two groups, including:

After an initial review of the data, we will need to add some calculated fields for length and distance of rides, and also extract from start times the month of year, day of week, and hour of day to provide us with more opportunities to aggregate the data.

Before conducting any analysis, however, we first need to verify the data’s integrity! In the process phase, we will perform several pre-cleaning activities to ensure the overall accuracy, consistency, and completeness of the data.

Phase 3: Process

For this particular case study, I decided to use R programming in RStudio as we are working with extremely large data sets (+5M rows of data, +1 GB in size), which spreadsheets might not be able to handle as well. The primary advantage for using R is that it provides an accessible language to organize, modify, and clean data frames, and also create insightful data visualizations. RStudio, an integrated development environment (IDE), makes it easy to reproduce your work on different data sets. For instance, when you input your code, it’s simple to just load a new data set and run your scripts again.

Documentation of data cleaning and manipulation:

Install required packages:

install.packages("tidyverse")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("ggmap")
install.packages("scales")
install.packages("geosphere")

Load packages:

library(tidyverse)
library(lubridate) 
library(ggplot2) 
library(ggmap)
library(scales)
library(geosphere)

Import the 12 monthly data sets (csv files) into Rstudio:

m1 <- read_csv("cyclistic_data/202201-divvy-tripdata.csv")
m2 <- read_csv("cyclistic_data/202202-divvy-tripdata.csv")
m3 <- read_csv("cyclistic_data/202203-divvy-tripdata.csv")
m4 <- read_csv("cyclistic_data/202204-divvy-tripdata.csv")
m5 <- read_csv("cyclistic_data/202205-divvy-tripdata.csv")
m6 <- read_csv("cyclistic_data/202206-divvy-tripdata.csv")
m7 <- read_csv("cyclistic_data/202207-divvy-tripdata.csv")
m8 <- read_csv("cyclistic_data/202208-divvy-tripdata.csv")
m9 <- read_csv("cyclistic_data/202209-divvy-tripdata.csv")
m10 <- read_csv("cyclistic_data/202210-divvy-tripdata.csv")
m11 <- read_csv("cyclistic_data/202211-divvy-tripdata.csv")
m12 <- read_csv("cyclistic_data/202212-divvy-tripdata.csv")

Organize individual monthly data sets into one big data frame. Make sure column names in each file match perfectly before we can use a command to join them into one file:

tripdata <- bind_rows(m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12)

Rename columns for easier understanding and analysis:

tripdata <- rename(tripdata, 
                      bike_type = rideable_type,
                      customer_type = member_casual)

Inspect the new df:

colnames(tripdata) #check column names
##  [1] "ride_id"            "bike_type"          "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "customer_type"
str(tripdata) #check data types for each column
## spc_tbl_ [5,667,717 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5667717] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ bike_type         : chr [1:5667717] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5667717], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:5667717], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:5667717] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:5667717] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:5667717] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:5667717] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:5667717] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:5667717] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ customer_type     : chr [1:5667717] "casual" "casual" "member" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Add columns that list the date, month, day of week, and start time of each ride that will allow us to perform our analysis:

tripdata$date <- as.Date(tripdata$started_at) #extract date (YYYY-MM-DD)
tripdata$month <- format(as.Date(tripdata$date), "%B") #extract month (full name)
tripdata$day_of_week <- format(as.Date(tripdata$date), "%A") #extract day of week (full name)
tripdata$start_time <- strftime(tripdata$started_at, "%H") #extract start time in decimal hours (24 hour)

Add a calculated column ‘ride length’ for all trips (in seconds):

tripdata$ride_length <- difftime(tripdata$ended_at, tripdata$started_at)
is.factor(tripdata$ride_length)
## [1] FALSE
tripdata$ride_length <- as.numeric(as.character(tripdata$ride_length))
is.numeric(tripdata$ride_length)
## [1] TRUE

Add a calculated column ‘ride distance’ for all trips (in km):

tripdata$ride_distance <- distGeo(matrix(c(tripdata$start_lng, tripdata$start_lat), ncol = 2), matrix(c(tripdata$end_lng, tripdata$end_lat), ncol = 2))
tripdata$ride_distance <- tripdata$ride_distance/1000 #distance in km

Check summary statistics of added columns:

summary(tripdata)
##    ride_id           bike_type           started_at                    
##  Length:5667717     Length:5667717     Min.   :2022-01-01 00:00:05.00  
##  Class :character   Class :character   1st Qu.:2022-05-28 19:21:05.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 15:03:59.00  
##                                        Mean   :2022-07-20 07:21:18.74  
##                                        3rd Qu.:2022-09-16 07:21:29.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-01-01 00:01:48.00   Length:5667717     Length:5667717    
##  1st Qu.:2022-05-28 19:43:07.00   Class :character   Class :character  
##  Median :2022-07-22 15:24:44.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 07:40:45.33                                        
##  3rd Qu.:2022-09-16 07:39:03.00                                        
##  Max.   :2023-01-02 04:56:45.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5667717     Length:5667717     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       customer_type           date           
##  Min.   : 0.00   Min.   :-88.14   Length:5667717     Min.   :2022-01-01  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2022-05-28  
##  Median :41.90   Median :-87.64   Mode  :character   Median :2022-07-22  
##  Mean   :41.90   Mean   :-87.65                      Mean   :2022-07-19  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2022-09-16  
##  Max.   :42.37   Max.   :  0.00                      Max.   :2022-12-31  
##  NA's   :5858    NA's   :5858                                            
##     month           day_of_week         start_time         ride_length     
##  Length:5667717     Length:5667717     Length:5667717     Min.   :-621201  
##  Class :character   Class :character   Class :character   1st Qu.:    349  
##  Mode  :character   Mode  :character   Mode  :character   Median :    617  
##                                                           Mean   :   1167  
##                                                           3rd Qu.:   1108  
##                                                           Max.   :2483235  
##                                                                            
##  ride_distance     
##  Min.   :   0.000  
##  1st Qu.:   0.873  
##  Median :   1.575  
##  Mean   :   2.140  
##  3rd Qu.:   2.781  
##  Max.   :9817.319  
##  NA's   :5858

There are few “bad data” problems that will need to be cleaned:

tripdata_clean <- tripdata %>%
  filter(ride_length > 60 & ride_length < 32400) %>%
  filter(!grepl("repair", start_station_name, ignore.case = TRUE)) %>%
  filter(!grepl("repair", end_station_name, ignore.case = TRUE)) %>%
  filter(!is.na(ride_distance)) %>%
  filter(ride_distance > 0.1 & ride_distance < 100)

Check summary statistics of clean df:

summary(tripdata_clean)
##    ride_id           bike_type           started_at                    
##  Length:5232703     Length:5232703     Min.   :2022-01-01 00:01:00.00  
##  Class :character   Class :character   1st Qu.:2022-05-29 00:56:11.00  
##  Mode  :character   Mode  :character   Median :2022-07-22 20:38:59.00  
##                                        Mean   :2022-07-20 15:59:30.03  
##                                        3rd Qu.:2022-09-16 16:55:10.00  
##                                        Max.   :2022-12-31 23:59:26.00  
##     ended_at                     start_station_name start_station_id  
##  Min.   :2022-01-01 00:04:39.0   Length:5232703     Length:5232703    
##  1st Qu.:2022-05-29 01:16:09.5   Class :character   Class :character  
##  Median :2022-07-22 20:56:14.0   Mode  :character   Mode  :character  
##  Mean   :2022-07-20 16:14:58.7                                        
##  3rd Qu.:2022-09-16 17:11:26.0                                        
##  Max.   :2023-01-01 03:06:36.0                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5232703     Length:5232703     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##     end_lat         end_lng       customer_type           date           
##  Min.   :41.55   Min.   :-88.14   Length:5232703     Min.   :2022-01-01  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2022-05-29  
##  Median :41.90   Median :-87.64   Mode  :character   Median :2022-07-22  
##  Mean   :41.90   Mean   :-87.65                      Mean   :2022-07-20  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2022-09-16  
##  Max.   :42.37   Max.   :-87.30                      Max.   :2022-12-31  
##     month           day_of_week         start_time         ride_length     
##  Length:5232703     Length:5232703     Length:5232703     Min.   :   61.0  
##  Class :character   Class :character   Class :character   1st Qu.:  369.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :  626.0  
##                                                           Mean   :  928.8  
##                                                           3rd Qu.: 1089.0  
##                                                           Max.   :32386.0  
##  ride_distance   
##  Min.   : 0.100  
##  1st Qu.: 1.032  
##  Median : 1.694  
##  Mean   : 2.297  
##  3rd Qu.: 2.926  
##  Max.   :42.383

Phase 4: Analyze

Now that the Cyclistic data is stored appropriately and has been prepared for analysis, let’s start putting it to work. Key tasks in the analysis phase include:

Let’s compare total number of trips and percentage distribution by customer type:

tripdata_clean %>% 
  group_by(customer_type) %>% 
  summarise(ride_count = length(ride_id), ride_percentage = (length(ride_id) / nrow(tripdata_clean)) * 100);
## # A tibble: 2 × 3
##   customer_type ride_count ride_percentage
##   <chr>              <int>           <dbl>
## 1 casual           2079861            39.7
## 2 member           3152842            60.3

Cyclistic members made up 60.3% of total trips while casual riders made up 39.7%. Cyclistic members made about ~20% more trips than casual riders in the 2022 calendar year.

Let’s analyze total number of trips via e-bike and percentage distribution of all trips:

electric_trips <- tripdata_clean %>%
  filter(bike_type == "electric_bike")

(nrow(electric_trips) / nrow(tripdata_clean)) * 100
## [1] 50.71633

Check summary statistics of ride length:

summary(tripdata_clean$ride_length);
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.0   369.0   626.0   928.8  1089.0 32386.0

Compare summary statistics of ride length by customer type:

tripdata_clean %>%
  group_by(customer_type) %>% 
  summarise(min_ride_length = min(ride_length)/60,
    median_ride_length = median(ride_length)/60,
    avg_ride_length = mean(ride_length)/60,
    max_ride_length = max(ride_length)/60);
## # A tibble: 2 × 5
##   customer_type min_ride_length median_ride_length avg_ride_length max_ride_le…¹
##   <chr>                   <dbl>              <dbl>           <dbl>         <dbl>
## 1 casual                   1.02              13.0             20.2          540.
## 2 member                   1.02               9.05            12.3          540.
## # … with abbreviated variable name ¹​max_ride_length

The average ride length of casual riders was 20.2 minutes (2022) – nearly double that of annual members at 12.3 minutes. We can hypothesize that casual riders use Cyclistic for leisure, such as exploring the city, whereas members use Cyclistic for non-recreational activities, such as commuting to work or school, running errands, etc.

Let’s analyze our usage metrics by days of the week for both customer types. First order the days of the week from Monday to Sunday:

tripdata_clean$day_of_week <- ordered(tripdata_clean$day_of_week, 
                                      levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"));

Compare total trips by customer type in each day of the week:

tripdata_clean %>% 
  group_by(day_of_week, customer_type) %>%
  summarise(number_of_rides = n()) %>%
  arrange(day_of_week, customer_type);
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups:   day_of_week [7]
##    day_of_week customer_type number_of_rides
##    <ord>       <chr>                   <int>
##  1 Monday      casual                 245924
##  2 Monday      member                 445756
##  3 Tuesday     casual                 237535
##  4 Tuesday     member                 490873
##  5 Wednesday   casual                 247997
##  6 Wednesday   member                 495727
##  7 Thursday    casual                 279847
##  8 Thursday    member                 503149
##  9 Friday      casual                 302127
## 10 Friday      member                 440151
## 11 Saturday    casual                 423211
## 12 Saturday    member                 415613
## 13 Sunday      casual                 343220
## 14 Sunday      member                 361573

Compare average ride length by customer type in each day of the week:

tripdata_clean %>% 
  group_by(day_of_week, customer_type) %>%
  summarise(avg_ride_length = mean(ride_length)) %>%
  arrange(day_of_week, customer_type);
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups:   day_of_week [7]
##    day_of_week customer_type avg_ride_length
##    <ord>       <chr>                   <dbl>
##  1 Monday      casual                  1225.
##  2 Monday      member                   712.
##  3 Tuesday     casual                  1078.
##  4 Tuesday     member                   702.
##  5 Wednesday   casual                  1047.
##  6 Wednesday   member                   708.
##  7 Thursday    casual                  1081.
##  8 Thursday    member                   718.
##  9 Friday      casual                  1151.
## 10 Friday      member                   731.
## 11 Saturday    casual                  1378.
## 12 Saturday    member                   826.
## 13 Sunday      casual                  1383.
## 14 Sunday      member                   819.

Let’s order the months of the year from January to December:

tripdata_clean$month <- ordered(tripdata_clean$month, levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"));

Now compare total rides by customer type in each month of the year:

tripdata_clean %>% 
  group_by(customer_type, month) %>%  
  summarise(number_of_rides = n(), .groups="drop") %>% 
  arrange(customer_type, month) %>%
  print(n = 24);
## # A tibble: 24 × 3
##    customer_type month     number_of_rides
##    <chr>         <ord>               <int>
##  1 casual        January             16685
##  2 casual        February            19184
##  3 casual        March               79258
##  4 casual        April              112284
##  5 casual        May                248043
##  6 casual        June               330262
##  7 casual        July               361847
##  8 casual        August             322034
##  9 casual        September          268749
## 10 casual        October            189134
## 11 casual        November            91476
## 12 casual        December            40905
## 13 member        January             80322
## 14 member        February            88026
## 15 member        March              181928
## 16 member        April              227024
## 17 member        May                332793
## 18 member        June               379145
## 19 member        July               394190
## 20 member        August             402996
## 21 member        September          382292
## 22 member        October            330212
## 23 member        November           224789
## 24 member        December           129125

The tibble shows us that:

Let’s compare average ride length by customer type in each month of the year:

tripdata_clean %>% 
  group_by(month, customer_type) %>%  
  summarise(avg_ride_length = mean(ride_length), .groups="drop") %>% 
  arrange(month, customer_type) %>%
  print(n = 24);
## # A tibble: 24 × 3
##    month     customer_type avg_ride_length
##    <ord>     <chr>                   <dbl>
##  1 January   casual                   935.
##  2 January   member                   647.
##  3 February  casual                  1068.
##  4 February  member                   650.
##  5 March     casual                  1290.
##  6 March     member                   695.
##  7 April     casual                  1265.
##  8 April     member                   681.
##  9 May       casual                  1389.
## 10 May       member                   782.
## 11 June      casual                  1295.
## 12 June      member                   814.
## 13 July      casual                  1302.
## 14 July      member                   808.
## 15 August    casual                  1207.
## 16 August    member                   784.
## 17 September casual                  1127.
## 18 September member                   754.
## 19 October   casual                  1053.
## 20 October   member                   689.
## 21 November  casual                   871.
## 22 November  member                   646.
## 23 December  casual                   747.
## 24 December  member                   621.

Create summary table for top 5 start stations by customer type:

summary_stations <- tripdata_clean %>% 
  mutate(station = start_station_name) %>%
  drop_na(start_station_name) %>% 
  group_by(start_station_name, customer_type) %>%  
  summarise(number_of_trips = n()) %>%    
  arrange(customer_type, desc(number_of_trips)) %>%
  group_by(customer_type) %>%
  filter(rank(desc(number_of_trips)) <= 5) %>%
  ungroup();
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.

Top 5 start stations for casual riders:

Top 5 start stations for annual members: * (1): Kingsbury St & Kinzie St (24,293 trips) * (2): Clark St & Elm St (21,083 trips) * (3): Wells St & Concord Ln (20,667 trips) * (4): Clinton St & Washington Blvd (19,268 trips) * (5): University Ave & 57th St (18,974 trips)

Now let’s create a new data frame that contains information about the most popular bike routes (> 300 trips), categorized by starting and ending coordinates, customer type, and bike type:

popular_routes <- tripdata_clean %>% 
  filter(start_lat != end_lat & start_lng != end_lng) %>%
  group_by(start_lat, start_lng, end_lat, end_lng, customer_type, bike_type) %>%
  summarise(total_rides = n(), .groups ="drop") %>%
  filter(total_rides > 300);

And create two separate data frames, one for each customer type, that we will use to visualize popular routes over the Chicago map:

casual_riders <- popular_routes %>% filter(customer_type == "casual");
annual_members <- popular_routes %>% filter(customer_type == "member");

Set up ggmap and store map of Chicago (bbox, stamen map):

chicago <- c(left = -87.70, bottom = 41.77, right = -87.55, top = 41.97);
chicago_map <- get_stamenmap(bbox = chicago, zoom = 12, maptype = "terrain");
## ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.

Phase 5: Share

Now that we have performed our analysis and gained some insights into our data, we need to create visualizations to share our findings. Keeping in mind that our data visualizations should be sophisticated and polished in order to effectively communicate to the executive team.

Create pie chart to visualize total trips distribution by customer type:

tripdata_clean %>% 
  count(customer_type) %>%
  mutate(ride_percentage = n / sum(n) * 100) %>%
  ggplot(aes(x = "", y = n, fill=customer_type)) + 
  geom_bar(stat = "identity", width= 1) +
  coord_polar("y", start=0) +
  geom_text(
    aes(y = cumsum(n) - 0.5 * n, label = paste0(format(n, big.mark = ",", scientific = FALSE), " trips\n", round(ride_percentage, 1), "%")),
    position = position_stack(vjust = 0.5), 
    color = "black", 
    size = 5
  ) +
  theme_void() +
  theme(legend.position = "bottom", legend.title = element_blank(),
        plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 10),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  labs(title = "Cyclistic Trip Distribution 2022", 
       subtitle = "Annual Members vs Casual Riders");

Create stacked bar chart to visualize bike type distribution by customer type:

ggplot(tripdata_clean, aes(x = bike_type, fill = customer_type)) +
  geom_bar(position = "stack") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific=FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  labs(x = "", y = "Number of Trips") +
  labs(title = "Cyclistic Bike Type Distribution 2022",
       subtitle = "Annual Members vs Casual Riders",
       fill = "Customer Type");

Cyclistic members have a slight preference for classic bikes in comparison to electric bikes. Casual riders have a slight preference for electric bikes in comparison to both classic and docked bikes combined.

Create line plot with points to visualize total trips by day of the week for both customer types:

tripdata_clean %>% 
  group_by(customer_type, day_of_week) %>% 
  summarise(number_of_rides = n(), .groups="drop") %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, group = customer_type, color = customer_type)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  labs(x = "", y = "") +
  labs(title = "Cyclistic Total Trips by Days of the Week 2022",
       subtitle = "Annual Members vs Casual Riders", 
       color = "Customer Type");
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

Create line plot with points to visualize total trips by month of the year for both customer types:

tripdata_clean %>%  
  group_by(customer_type, month) %>% 
  summarise(number_of_rides = n(),.groups="drop") %>%
  ggplot(aes(x = month, y = number_of_rides, group = customer_type, color = customer_type)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) + 
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm"),
        axis.text.x = element_text(angle = 45, margin = margin(t = 10))) +
  labs(x = "", y = "Number of Trips") +
  labs(title = "Cyclistic Total Trips by Month of the Year 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

Create line plot with points to visualize average ride length by day of the week for both customer types:

tripdata_clean %>% 
  group_by(customer_type, day_of_week) %>% 
  summarise(avg_ride_length = mean(ride_length)/60, .groups = "drop") %>% 
  ggplot(aes(x = day_of_week, y = avg_ride_length, group = customer_type, color = customer_type)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm")) +
  labs(x = "", y = "Avg. Ride Length (minutes)") +
  labs(title = "Cyclistic Average Ride Length by Day of the Week 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

Create line plot with points to visualize average ride length by month of the year for both customer types:

tripdata_clean %>%  
  group_by(customer_type, month) %>% 
  summarise(avg_ride_length = mean(ride_length),.groups="drop") %>%
  ggplot(aes(x = month, y = avg_ride_length/60, group = customer_type, color = customer_type)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        plot.margin = unit(c(1,1,1,1), "cm"),
        axis.text.x = element_text(angle = 45, margin = margin(t = 10))) +
  labs(x = "", y = "Average Ride Length (minutes)") +
  labs(title = "Cyclistic Average Ride Length by Month of the Year 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

Create bar chart to visualize average ride distance for each customer type:

tripdata_clean %>% 
  group_by(customer_type) %>% drop_na() %>%
  summarise(avg_ride_distance = mean(ride_distance)) %>%
  ggplot() +
  (geom_col(mapping = aes(x = customer_type, y = avg_ride_distance, fill = customer_type))) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(x ="", y = "Average Distance (Km)") +
  labs(title = "Cyclistic Average Ride Distance in 2022",
       subtitle = "Annual Members vs Casual Riders", 
       fill = "Customer Type");

The average ride distance of casual riders is slightly greater than the average ride distance of members. This is expected as casual riders who use Cyclistic for leisure, such as exploring the city, may travel greater distances than someone who is simply commuting to their place of work or school.

Create line plot to visualize average ride distance by day of the week for both customer types:

tripdata_clean %>%
  group_by(day_of_week, customer_type) %>%
  summarise(avg_ride_distance = mean(ride_distance), .groups = "drop") %>%
  ggplot(aes(x = day_of_week, y = avg_ride_distance, group = customer_type, color = customer_type)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(x = "", y = "Average Ride Distance (Km)") +
  labs(title = "Cyclistic Average Ride Distance by Day of the Week 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

Create a line plot to visualize average ride distance by month of the year for both customer types:

tripdata_clean %>%
  group_by(month, customer_type) %>%
  summarise(avg_ride_distance = mean(ride_distance), .groups = "drop") %>%
  ggplot(aes(x = month, y = avg_ride_distance, group = customer_type, color = customer_type)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(x = "", y = "Average Ride Distance (Km)") +
  labs(title = "Cyclistic Average Ride Distance by Month 2022",
       subtitle = "Annual Members vs Casual Riders",
       color = "Customer Type");

Create dodged bar chart to visualize popular start times by hour of day for both customer types:

tripdata_clean %>%
  group_by(start_time, customer_type) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  ggplot(aes(x = start_time, y = number_of_rides, fill = customer_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14)) +
  labs(x = "Hour of the Day (24 hours)", y = "Number of Trips") +
  labs(title = "Cyclistic Popular Start Times by Hour of Day 2022",
       subtitle = "Annual Members vs Casual Riders", 
       fill = "Customer Type",
       color = "Customer Type");

The bar chart above visualizes popular start times by hour of day for both customer types. For annual members, we see two inflection points where there is a sharp increase in bike usage: the first is from 6am to 8am and second is from 3pm to 5pm. This suggests that members use Cyclistic bikes to commute to work or school. For casual riders, we see a normal distribution of start times with a gradual increase from 5am to 5pm.

Create dodged bar chart to visualize popular start times by day of the week for both customer types:

tripdata_clean %>%
  group_by(start_time, customer_type, day_of_week) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  ggplot(aes(x = start_time, y = number_of_rides, fill = customer_type)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_wrap(~day_of_week) +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 14),
        axis.text.x = element_blank()) +
  labs(x = "Start Time (24 hours)", y = "Trips") +
  labs(title = "Cyclistic Popular Start Times by Hour of Day 2022",
       subtitle = "Annual Members vs Casual Riders", 
       fill = "Customer Type");

Visualize popular routes by casual riders and bike type on the Chicago map:

ggmap(chicago_map) +
  geom_point(casual_riders, mapping = aes(x = start_lng, y = start_lat, color = bike_type), size = 2) +
  theme(panel.spacing = unit(3,"lines"),
        plot.title = element_text(size = 16, face = "bold"),
        strip.text = element_blank()) +
  labs(x = NULL, y = NULL, color = "Bike Type") +
  labs(title = "Cyclistic Popular Start Stations by Casual Riders in 2022"); 
## Warning: Removed 2 rows containing missing values (`geom_point()`).

Visualize popular routes by annual members and bike type on the Chicago map:

ggmap(chicago_map) +
  geom_point(annual_members, mapping = aes(x = start_lng, y = start_lat, color = bike_type), size = 2) +
  theme(panel.spacing = unit(3,"lines"),
        plot.title = element_text(size = 16, face = "bold"),
        strip.text = element_blank()) +
  labs(x = NULL, y = NULL, color = "Bike Type") +
  labs(title = "Cyclistic Popular Start Stations by Annual Members in 2022");
## Warning: Removed 13 rows containing missing values (`geom_point()`).

The most popular routes for casual riders are in the downtown area of Chicago (Streeterville, Millennium Park, Navy Pier) and in Hyde Park, whereas annual members are more scattered throughout the city.

Phase 6: Act

After analyzing Cyclistic’s historical trip data in 2022, it is clear that annual members and casual riders use Cyclistic bikes differently. Annual members made up a majority of all trips (60%) and they tend to use the bikes more during weekdays, especially during commuting hours. Casual riders, on the other hand, prefer to ride on the weekends, and they tend to use the bikes more for leisure and recreation.

Another key difference between the two customer types is that casual riders tend to ride for longer periods and distances, which suggests that they are more likely to use the bikes for sightseeing or exploring the city. This behavior is also more noticeable in late spring/early summer when casual riders tend to ride for longer periods and distances.

The data also shows that bike usage is positively correlated with temperature in Chicago, with more trips being made during the warmer summer months and fewer trips during the colder winter months. This finding suggests that Cyclistic may want to adjust its marketing strategy to promote bike usage during the late spring and early summer months.

Furthermore, the data shows that casual riders tend to make more trips in downtown Chicago (Millenium Park, Navy Pier) and in Hyde Park, whereas annual members are more scattered throughout the city. This insight may be useful in developing targeted marketing campaigns to convert casual riders into annual members, such as advertising in docking stations frequently used by casual riders.

Deliverable:

My top three recommendations based on my analysis:

Note: Additional data we could use to expand on our findings

All ride ids in the data set are unique so we cannot conclude if the same person made multiple trips. Pricing details for casual riders and annual members in order to optimize cost structure and provide discounts without affecting the margin. Data on bike availability and demand levels across stations. Based on the analysis, casual riders tend to make more trips in downtown Chicago and Hyde Park, while annual members are more scattered throughout the city. Cyclistic could consider redistributing bikes from less used stations to high demand areas to ensure that bikes are available when and where they are needed most. Cyclistic could also consider expanding to other areas within the state of Illinois or even expanding to other states, depending on the demand for bike-share programs in those areas.