Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Characters and teams

Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. Oneapproach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Summary of Analysis

This fictional case study was processed through the six steps outlined in the Google Data Analytics curriculum:

Ask - What is the high-level business task? What specific question am I trying to answer?
Prepare - Where do I get the data for this analysis? How do I ensure its credibility?
Process - Inspect, clean, and organize data. Document the process.
Analyze - Perform statistical analysis, subset data, plot relationships and trends.
Share - Summarize findings in deliverables. Include visualizations to illustrate key insights.
Act - Present analysis and conclusion to decision makers. Develop strategy for next steps.

This report includes code chunks and outputs where the anticipated impact is intended to illustrate the process being carried out or its result.

Ask

The high-level business task for this case study was to design marketing strategies aimed at converting casual riders into annual members. My specific assignment was to determine how annual members and casual riders use Cyclistic bikes differently.

Prepare

I began by importing data sets for each month of 2022. Data was made available under the use license found here, from the download site, https://divvy-tripdata.s3.amazonaws.com/index.html. I imported the twelve monthly *.csv files and consolidated them into a single data set for the year 2022. I prepared the data for cleaning and analysis through the following process.

# combine data into one data frame
tot_rides <- rbind(jan_rides, feb_rides, mar_rides, apr_rides, may_rides,
                   jun_rides, jul_rides, aug_rides, sep_rides, oct_rides,
                   nov_rides, dec_rides)

# cleanup workspace
# remove monthly data frames to ease storage and navigation, raw files remain unaltered
rm(jan_rides, feb_rides, mar_rides, apr_rides, may_rides, jun_rides, 
   jul_rides, aug_rides, sep_rides, oct_rides, nov_rides, dec_rides)

# remove all white space in ride_id to check for duplicates
tot_rides$ride_id <- gsub('\\s+', '', tot_rides$ride_id)

# check for duplicated rows based on ride_id, no duplicates found
sum(duplicated(tot_rides$ride_id))

# remove leading and trailing spaces start_station_name and end_station_name
# to be able to count unique instances
tot_rides$start_station_name <- trimws(tot_rides$start_station_name, 
                                       which = c("both"))

tot_rides$end_station_name <- trimws(tot_rides$end_station_name, 
                                     which = c("both"))

# convert started_at and ended_at columns to date-time format for calculations
tot_rides$started_at <- as.POSIXct(tot_rides$started_at,
                                   format = "%Y-%m-%d %H:%M:%S")
tot_rides$ended_at <- as.POSIXct(tot_rides$ended_at,
                                   format = "%Y-%m-%d %H:%M:%S")

# add column start_time, separate time of day from started_at column
tot_rides$start_time <- format(as.POSIXct(tot_rides$started_at),
                               format = "%H:%M:%S")

# format start_time as hh:mm:ss for calculations
tot_rides$start_time <- hms::as_hms(tot_rides$start_time)

# add column ride_length = ended_at - started_at
tot_rides$ride_length <- (tot_rides$ended_at - tot_rides$started_at)

# format ride_length as hh:mm:ss for calculations
tot_rides$ride_length <- hms::as_hms(tot_rides$ride_length)

# add day_of_week column 
tot_rides$day_of_week <- lubridate::wday(tot_rides$started_at)

# set column order
tot_rides = tot_rides %>% select(ride_id, rideable_type, day_of_week,
                                 start_time, started_at, ended_at, ride_length, 
                                 start_station_name, start_station_id, 
                                 end_station_name, end_station_id, start_lat, 
                                 start_lng, end_lat, end_lng, member_casual)

Process

I chose to work through this case study in R because of the large size of the dataset and its ability to generate plots. This data was made available from its owner, Cyclistic, and was anticipated to be credible. The following steps were taken to clean and verify data integrity as much as possible. Then, descriptive statistics were generated to inspect the data, initially.

## [1] "total records = 5667717"

## [1] "mean ride length = 00:19:27"

## [1] "median ride length = 00:10:17"

## [1] "max ride length = 689:47:15"

## [1] "min ride length = -172:33:21"

These items immediately stood out:

The negative minimum ride length
The dramatic difference between the max ride length and the mean
The mean ride length being nearly double the median

I generated a boxplot of the ride lengths to visualize the data spread and then addressed each item above through the following cleanup process.

Cleanup

Notes:

Records with negative ride lengths or values less than one minute were removed from the dataset to reduce potentially low skewing bias attributed to systematic or bike user errors. 121,089 rows were removed (2.14% of data).
Records with ride lengths outside the 99.9% quantile of remaining data were also removed from the dataset to reduce potentially high skewing bias, again, attributed to systematic or bike user errors. 5,547 rows were removed (0.98% of the original dataset).

Review of the regenerated boxplot showed the cleaned dataset was now a contiguous group without negative or extremely high ride lengths.

Analysis

Once the data was cleaned, the analysis began with regenerating descriptive statistics.

# repeat descriptive statistics
# count ride_ids
print(paste("total records =", nrow(tot_rides)))

## [1] "total records = 5541081"

# mean ride_length
print(paste("mean ride length =",
            hms::round_hms(hms::as_hms(mean(tot_rides$ride_length)), 1)))

## [1] "mean ride length = 00:16:34"

# median ride_length
print(paste("median ride length =",
            hms::round_hms(hms::as_hms(median(tot_rides$ride_length)),1)))

## [1] "median ride length = 00:10:30"

# max ride_length
print(paste("max ride length =", 
            hms::round_hms(hms::as_hms(max(tot_rides$ride_length)),1)))

## [1] "max ride length = 22:01:34"

# min ride_length
print(paste("min ride length =", 
            hms::round_hms(hms::as_hms(min(tot_rides$ride_length)),1)))

## [1] "min ride length = 00:01:00"

# mode day_of_week
# define function to calculate mode
find_mode <- function(x) {
  u <- unique(x)
  tab <- tabulate(match(x, u))
  u[tab == max(tab)]
}

# find mode for day_of_week
busy_day <- find_mode(tot_rides$day_of_week)
days_in_week <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday") 
print(paste("mode = most rides occurred on", days_in_week[busy_day]))

## [1] "mode = most rides occurred on Saturday"

# average ride_length for members
print(paste("avg ride length for members =", hms::round_hms(hms::as_hms(mean(tot_rides[
  tot_rides$member_casual == 'member', 'ride_length'])), 1)))

## [1] "avg ride length for members = 00:12:39"

# average ride length for casual riders
print(paste("avg ride length for casual riders =", hms::round_hms(hms::as_hms(mean(tot_rides[
  tot_rides$member_casual == 'casual', 'ride_length'])), 1)))

## [1] "avg ride length for casual riders = 00:22:13"

# average start time for members
print(paste("avg start time for members =", hms::round_hms(hms::as_hms(mean(hms::as_hms(tot_rides[
  tot_rides$member_casual == 'member', 'started_at']))),1)))

## [1] "avg start time for members = 14:30:02"

# average start time for casual riders
print(paste("avg start time for casual riders =", hms::round_hms(hms::as_hms(mean(hms::as_hms(tot_rides[
  tot_rides$member_casual == 'casual', 'started_at']))),1)))

## [1] "avg start time for casual riders = 15:02:44"

Next, several subsets were generated for plotting various relationships and trends found in the data. They included:

Total User Rides by Day of Week
Weekday	Member Rides	Casual Rides
1	378,138	380,008
2	463,169	271,495
3	507,669	257,772
4	512,521	268,296
5	520,677	302,422
6	456,817	327,057
7	432,740	462,300

User Ride Times by Day of Week
Weekday	Member Rides	Casual Rides
1	00:13:59	00:25:26
2	00:12:13	00:22:42
3	00:12:01	00:19:54
4	00:12:03	00:19:11
5	00:12:14	00:19:48
6	00:12:28	00:20:51
7	00:14:05	00:24:54

Avg User Start Times by Day of Week
Weekday	Member Rides	Casual Rides
1	14:09:08	13:56:57
2	14:35:30	15:06:27
3	14:27:18	15:19:19
4	14:35:34	15:34:43
5	14:35:58	15:35:42
6	14:37:26	15:26:14
7	14:24:10	14:48:37

Total User Rides by Month
Month	Member Rides	Casual Rides
1	83,740	18,086
2	92,045	20,915
3	190,565	88,106
4	239,820	123,787
5	347,238	274,515
6	391,794	361,022
7	407,940	396,843
8	417,113	350,509
9	395,325	289,843
10	341,266	203,873
11	231,628	98,204
12	133,257	43,647

User Ride Times by Month
Month	Member Rides	Casual Rides
1	00:11:45	00:17:42
2	00:11:15	00:19:51
3	00:11:53	00:24:28
4	00:11:31	00:23:30
5	00:13:17	00:25:54
6	00:13:55	00:23:46
7	00:13:43	00:23:34
8	00:13:22	00:21:54
9	00:12:54	00:20:26
10	00:11:47	00:18:50
11	00:11:05	00:15:50
12	00:10:35	00:13:39

Total User Rides by Bike
Bike	Member Rides	Casual Rides
electric_bike	1,588,478	1,219,502
classic_bike	1,683,253	876,014
docked_bike	0	173,834

Top 10 Stations for Members
Station	Member Rides
Kingsbury St & Kinzie St	24,506
Clark St & Elm St	21,605
Wells St & Concord Ln	20,950
University Ave & 57th St	19,526
Clinton St & Washington Blvd	19,375
Ellis Ave & 60th St	19,105
Wells St & Elm St	18,669
Loomis St & Lexington St	18,646
Clinton St & Madison St	18,480
Broadway & Barry Ave	17,489

Top 10 Stations for Casual Riders
Station	Casual Rides
Streeter Dr & Grand Ave	56,938
DuSable Lake Shore Dr & Monroe St	31,237
Millennium Park	24,979
Michigan Ave & Oak St	24,834
DuSable Lake Shore Dr & North Blvd	23,129
Shedd Aquarium	19,832
Theater on the Lake	18,190
Wells St & Concord Ln	15,969
Dusable Harbor	13,828
Clark St & Armitage Ave	13,590

Act

Takeaways

More rides occurred during warmer months by both groups, but casual rides increased significantly more, nearly matching member rides during the peak months.
Member ride times were consistent throughout the year.
Casual riders consistently rode longer than members and tended to ride even longer during warmer months.
The number of daily rides for members increased during the middle of the week, where casual riders peak use was on Saturday and Sunday.
The length of rides for members was relatively consistent throughout the week, but casual riders clearly rode longer on Saturday and Sunday.
Member rides started mainly around 8AM and 5PM, correlating with commuting times. Casual riders saw an increase in start times around noon with a peak around 5PM.
Members didn’t appear to have a significant preference between classic or electric bikes, but casual riders did seem to have a slight preference for electric bikes.

Recommendations

The high-level business task for this case study was to design marketing strategies aimed at converting casual riders into annual members. In consideration of the insights provided from the data and discussed previously, I believe a marketing strategy designed to target the following behaviors may lead toward meeting that goal.

Increased seasonal use likely outweighs off-peak savings
Increased weekend use further detracts from off-peak savings

Alternatively, new, targeted membership options might be more appealing for casual riders than annual memberships. Consider adding membership options such as:

Seasonal memberships or seasonal membership rates (May to Oct)
Weekend user memberships (Fri/Sat/Sun)
Time of day memberships (Noon to Midnight)

Cyclistic Bike-Share: Fictional Case Study

Joel Fay

September 30, 2023