Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Characters and teams

About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. Oneapproach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Summary of Analysis

This fictional case study was processed through the six steps outlined in the Google Data Analytics curriculum:

This report includes code chunks and outputs where the anticipated impact is intended to illustrate the process being carried out or its result.

Ask

The high-level business task for this case study was to design marketing strategies aimed at converting casual riders into annual members. My specific assignment was to determine how annual members and casual riders use Cyclistic bikes differently.

Prepare

I began by importing data sets for each month of 2022. Data was made available under the use license found here, from the download site, https://divvy-tripdata.s3.amazonaws.com/index.html. I imported the twelve monthly *.csv files and consolidated them into a single data set for the year 2022. I prepared the data for cleaning and analysis through the following process.

# combine data into one data frame
tot_rides <- rbind(jan_rides, feb_rides, mar_rides, apr_rides, may_rides,
                   jun_rides, jul_rides, aug_rides, sep_rides, oct_rides,
                   nov_rides, dec_rides)

# cleanup workspace
# remove monthly data frames to ease storage and navigation, raw files remain unaltered
rm(jan_rides, feb_rides, mar_rides, apr_rides, may_rides, jun_rides, 
   jul_rides, aug_rides, sep_rides, oct_rides, nov_rides, dec_rides)

# remove all white space in ride_id to check for duplicates
tot_rides$ride_id <- gsub('\\s+', '', tot_rides$ride_id)

# check for duplicated rows based on ride_id, no duplicates found
sum(duplicated(tot_rides$ride_id))

# remove leading and trailing spaces start_station_name and end_station_name
# to be able to count unique instances
tot_rides$start_station_name <- trimws(tot_rides$start_station_name, 
                                       which = c("both"))

tot_rides$end_station_name <- trimws(tot_rides$end_station_name, 
                                     which = c("both"))

# convert started_at and ended_at columns to date-time format for calculations
tot_rides$started_at <- as.POSIXct(tot_rides$started_at,
                                   format = "%Y-%m-%d %H:%M:%S")
tot_rides$ended_at <- as.POSIXct(tot_rides$ended_at,
                                   format = "%Y-%m-%d %H:%M:%S")

# add column start_time, separate time of day from started_at column
tot_rides$start_time <- format(as.POSIXct(tot_rides$started_at),
                               format = "%H:%M:%S")

# format start_time as hh:mm:ss for calculations
tot_rides$start_time <- hms::as_hms(tot_rides$start_time)

# add column ride_length = ended_at - started_at
tot_rides$ride_length <- (tot_rides$ended_at - tot_rides$started_at)

# format ride_length as hh:mm:ss for calculations
tot_rides$ride_length <- hms::as_hms(tot_rides$ride_length)

# add day_of_week column 
tot_rides$day_of_week <- lubridate::wday(tot_rides$started_at)

# set column order
tot_rides = tot_rides %>% select(ride_id, rideable_type, day_of_week,
                                 start_time, started_at, ended_at, ride_length, 
                                 start_station_name, start_station_id, 
                                 end_station_name, end_station_id, start_lat, 
                                 start_lng, end_lat, end_lng, member_casual)

Process

I chose to work through this case study in R because of the large size of the dataset and its ability to generate plots. This data was made available from its owner, Cyclistic, and was anticipated to be credible. The following steps were taken to clean and verify data integrity as much as possible. Then, descriptive statistics were generated to inspect the data, initially.

## [1] "total records = 5667717"
## [1] "mean ride length = 00:19:27"
## [1] "median ride length = 00:10:17"
## [1] "max ride length = 689:47:15"
## [1] "min ride length = -172:33:21"

These items immediately stood out:

  • The negative minimum ride length
  • The dramatic difference between the max ride length and the mean
  • The mean ride length being nearly double the median

I generated a boxplot of the ride lengths to visualize the data spread and then addressed each item above through the following cleanup process.

Cleanup

Notes:

  • Records with negative ride lengths or values less than one minute were removed from the dataset to reduce potentially low skewing bias attributed to systematic or bike user errors. 121,089 rows were removed (2.14% of data).
  • Records with ride lengths outside the 99.9% quantile of remaining data were also removed from the dataset to reduce potentially high skewing bias, again, attributed to systematic or bike user errors. 5,547 rows were removed (0.98% of the original dataset).

Review of the regenerated boxplot showed the cleaned dataset was now a contiguous group without negative or extremely high ride lengths.

Analysis

Once the data was cleaned, the analysis began with regenerating descriptive statistics.

# repeat descriptive statistics
# count ride_ids
print(paste("total records =", nrow(tot_rides)))
## [1] "total records = 5541081"
# mean ride_length
print(paste("mean ride length =",
            hms::round_hms(hms::as_hms(mean(tot_rides$ride_length)), 1))) 
## [1] "mean ride length = 00:16:34"
# median ride_length
print(paste("median ride length =",
            hms::round_hms(hms::as_hms(median(tot_rides$ride_length)),1)))
## [1] "median ride length = 00:10:30"
# max ride_length
print(paste("max ride length =", 
            hms::round_hms(hms::as_hms(max(tot_rides$ride_length)),1)))
## [1] "max ride length = 22:01:34"
# min ride_length
print(paste("min ride length =", 
            hms::round_hms(hms::as_hms(min(tot_rides$ride_length)),1)))
## [1] "min ride length = 00:01:00"
# mode day_of_week
# define function to calculate mode
find_mode <- function(x) {
  u <- unique(x)
  tab <- tabulate(match(x, u))
  u[tab == max(tab)]
}

# find mode for day_of_week
busy_day <- find_mode(tot_rides$day_of_week)
days_in_week <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday") 
print(paste("mode = most rides occurred on", days_in_week[busy_day]))
## [1] "mode = most rides occurred on Saturday"
# average ride_length for members
print(paste("avg ride length for members =", hms::round_hms(hms::as_hms(mean(tot_rides[
  tot_rides$member_casual == 'member', 'ride_length'])), 1)))
## [1] "avg ride length for members = 00:12:39"
# average ride length for casual riders
print(paste("avg ride length for casual riders =", hms::round_hms(hms::as_hms(mean(tot_rides[
  tot_rides$member_casual == 'casual', 'ride_length'])), 1)))
## [1] "avg ride length for casual riders = 00:22:13"
# average start time for members
print(paste("avg start time for members =", hms::round_hms(hms::as_hms(mean(hms::as_hms(tot_rides[
  tot_rides$member_casual == 'member', 'started_at']))),1)))
## [1] "avg start time for members = 14:30:02"
# average start time for casual riders
print(paste("avg start time for casual riders =", hms::round_hms(hms::as_hms(mean(hms::as_hms(tot_rides[
  tot_rides$member_casual == 'casual', 'started_at']))),1)))
## [1] "avg start time for casual riders = 15:02:44"

Next, several subsets were generated for plotting various relationships and trends found in the data. They included:

Total User Rides by Day of Week
Weekday Member Rides Casual Rides
1 378,138 380,008
2 463,169 271,495
3 507,669 257,772
4 512,521 268,296
5 520,677 302,422
6 456,817 327,057
7 432,740 462,300
User Ride Times by Day of Week
Weekday Member Rides Casual Rides
1 00:13:59 00:25:26
2 00:12:13 00:22:42
3 00:12:01 00:19:54
4 00:12:03 00:19:11
5 00:12:14 00:19:48
6 00:12:28 00:20:51
7 00:14:05 00:24:54
Avg User Start Times by Day of Week
Weekday Member Rides Casual Rides
1 14:09:08 13:56:57
2 14:35:30 15:06:27
3 14:27:18 15:19:19
4 14:35:34 15:34:43
5 14:35:58 15:35:42
6 14:37:26 15:26:14
7 14:24:10 14:48:37
Total User Rides by Month
Month Member Rides Casual Rides
1 83,740 18,086
2 92,045 20,915
3 190,565 88,106
4 239,820 123,787
5 347,238 274,515
6 391,794 361,022
7 407,940 396,843
8 417,113 350,509
9 395,325 289,843
10 341,266 203,873
11 231,628 98,204
12 133,257 43,647
User Ride Times by Month
Month Member Rides Casual Rides
1 00:11:45 00:17:42
2 00:11:15 00:19:51
3 00:11:53 00:24:28
4 00:11:31 00:23:30
5 00:13:17 00:25:54
6 00:13:55 00:23:46
7 00:13:43 00:23:34
8 00:13:22 00:21:54
9 00:12:54 00:20:26
10 00:11:47 00:18:50
11 00:11:05 00:15:50
12 00:10:35 00:13:39
Total User Rides by Bike
Bike Member Rides Casual Rides
electric_bike 1,588,478 1,219,502
classic_bike 1,683,253 876,014
docked_bike 0 173,834
Top 10 Stations for Members
Station Member Rides
Kingsbury St & Kinzie St 24,506
Clark St & Elm St 21,605
Wells St & Concord Ln 20,950
University Ave & 57th St 19,526
Clinton St & Washington Blvd 19,375
Ellis Ave & 60th St 19,105
Wells St & Elm St 18,669
Loomis St & Lexington St 18,646
Clinton St & Madison St 18,480
Broadway & Barry Ave 17,489
Top 10 Stations for Casual Riders
Station Casual Rides
Streeter Dr & Grand Ave 56,938
DuSable Lake Shore Dr & Monroe St 31,237
Millennium Park 24,979
Michigan Ave & Oak St 24,834
DuSable Lake Shore Dr & North Blvd 23,129
Shedd Aquarium 19,832
Theater on the Lake 18,190
Wells St & Concord Ln 15,969
Dusable Harbor 13,828
Clark St & Armitage Ave 13,590

Share

Finally, I plotted several visualizations to illustrate the relationships and trends found in this dataset that might provide insights into the different behaviors exhibited by members and casual riders.

Total monthly rides by user indicated an increase in rides by both groups during the warmer months of the year. Casual riders nearly matched member numbers during the peak summer months. This may indicate an opportunity for a seasonal membership.

Member ride times were pretty consistent throughout the year. Two things stood out about casual ride times, (1) there was a seasonal increase through the warmer months, and (2) they were consistently higher than member ride times. These factors may further support a seasonal membership opportunity. Alternatively, pointing out the costs associated with the added ride time through a marketing campaign may provide an incentive for casual riders to become members.

This plot illustrates opposing behaviors among members and casual riders. Members’ total daily rides increased during the middle of the week, where casual riders’ peak use was on Saturday and Sunday. This may indicate an opportunity for another segmented membership option targeted at weekend riders.

Daily ride times indicated member rides were relatively consistent throughout the week – possibly indicative that their main use of the service was for commuting purposes. Casual riders tended to ride nearly twice as long on average and even longer on Saturday and Sunday. This may strengthen the case for targeting a weekend membership option.

I think this plot adds further evidence that members mainly utilized the service for commuting, indicated by the spikes near 8AM and 5PM. Casual riders seemed to begin increasing rides around noon and spiked during the evening commute time. This could indicate an opportunity for a time-of-day membership, possibly noon to midnight.

This last plot showed members chose classic bikes very slightly more than electric bikes for their rides. Casual riders appeared to prefer the electric bikes a bit more than classic bikes, though. This data doesn’t seem to indicate much of an opportunity in terms of increasing memberships.

Act

Takeaways

  • More rides occurred during warmer months by both groups, but casual rides increased significantly more, nearly matching member rides during the peak months.
  • Member ride times were consistent throughout the year.
  • Casual riders consistently rode longer than members and tended to ride even longer during warmer months.
  • The number of daily rides for members increased during the middle of the week, where casual riders peak use was on Saturday and Sunday.
  • The length of rides for members was relatively consistent throughout the week, but casual riders clearly rode longer on Saturday and Sunday.
  • Member rides started mainly around 8AM and 5PM, correlating with commuting times. Casual riders saw an increase in start times around noon with a peak around 5PM.
  • Members didn’t appear to have a significant preference between classic or electric bikes, but casual riders did seem to have a slight preference for electric bikes.

Recommendations

The high-level business task for this case study was to design marketing strategies aimed at converting casual riders into annual members. In consideration of the insights provided from the data and discussed previously, I believe a marketing strategy designed to target the following behaviors may lead toward meeting that goal.

  • Increased seasonal use likely outweighs off-peak savings
  • Increased weekend use further detracts from off-peak savings

Alternatively, new, targeted membership options might be more appealing for casual riders than annual memberships. Consider adding membership options such as:

  • Seasonal memberships or seasonal membership rates (May to Oct)
  • Weekend user memberships (Fri/Sat/Sun)
  • Time of day memberships (Noon to Midnight)