You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. Oneapproach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
This fictional case study was processed through the six steps outlined in the Google Data Analytics curriculum:
This report includes code chunks and outputs where the anticipated impact is intended to illustrate the process being carried out or its result.
The high-level business task for this case study was to design marketing strategies aimed at converting casual riders into annual members. My specific assignment was to determine how annual members and casual riders use Cyclistic bikes differently.
I began by importing data sets for each month of 2022. Data was made available under the use license found here, from the download site, https://divvy-tripdata.s3.amazonaws.com/index.html. I imported the twelve monthly *.csv files and consolidated them into a single data set for the year 2022. I prepared the data for cleaning and analysis through the following process.
# combine data into one data frame
tot_rides <- rbind(jan_rides, feb_rides, mar_rides, apr_rides, may_rides,
jun_rides, jul_rides, aug_rides, sep_rides, oct_rides,
nov_rides, dec_rides)
# cleanup workspace
# remove monthly data frames to ease storage and navigation, raw files remain unaltered
rm(jan_rides, feb_rides, mar_rides, apr_rides, may_rides, jun_rides,
jul_rides, aug_rides, sep_rides, oct_rides, nov_rides, dec_rides)
# remove all white space in ride_id to check for duplicates
tot_rides$ride_id <- gsub('\\s+', '', tot_rides$ride_id)
# check for duplicated rows based on ride_id, no duplicates found
sum(duplicated(tot_rides$ride_id))
# remove leading and trailing spaces start_station_name and end_station_name
# to be able to count unique instances
tot_rides$start_station_name <- trimws(tot_rides$start_station_name,
which = c("both"))
tot_rides$end_station_name <- trimws(tot_rides$end_station_name,
which = c("both"))
# convert started_at and ended_at columns to date-time format for calculations
tot_rides$started_at <- as.POSIXct(tot_rides$started_at,
format = "%Y-%m-%d %H:%M:%S")
tot_rides$ended_at <- as.POSIXct(tot_rides$ended_at,
format = "%Y-%m-%d %H:%M:%S")
# add column start_time, separate time of day from started_at column
tot_rides$start_time <- format(as.POSIXct(tot_rides$started_at),
format = "%H:%M:%S")
# format start_time as hh:mm:ss for calculations
tot_rides$start_time <- hms::as_hms(tot_rides$start_time)
# add column ride_length = ended_at - started_at
tot_rides$ride_length <- (tot_rides$ended_at - tot_rides$started_at)
# format ride_length as hh:mm:ss for calculations
tot_rides$ride_length <- hms::as_hms(tot_rides$ride_length)
# add day_of_week column
tot_rides$day_of_week <- lubridate::wday(tot_rides$started_at)
# set column order
tot_rides = tot_rides %>% select(ride_id, rideable_type, day_of_week,
start_time, started_at, ended_at, ride_length,
start_station_name, start_station_id,
end_station_name, end_station_id, start_lat,
start_lng, end_lat, end_lng, member_casual)
I chose to work through this case study in R because of the large size of the dataset and its ability to generate plots. This data was made available from its owner, Cyclistic, and was anticipated to be credible. The following steps were taken to clean and verify data integrity as much as possible. Then, descriptive statistics were generated to inspect the data, initially.
## [1] "total records = 5667717"
## [1] "mean ride length = 00:19:27"
## [1] "median ride length = 00:10:17"
## [1] "max ride length = 689:47:15"
## [1] "min ride length = -172:33:21"
These items immediately stood out:
I generated a boxplot of the ride lengths to visualize the data spread and then addressed each item above through the following cleanup process.
Notes:
Review of the regenerated boxplot showed the cleaned dataset was now a contiguous group without negative or extremely high ride lengths.
Once the data was cleaned, the analysis began with regenerating descriptive statistics.
# repeat descriptive statistics
# count ride_ids
print(paste("total records =", nrow(tot_rides)))
## [1] "total records = 5541081"
# mean ride_length
print(paste("mean ride length =",
hms::round_hms(hms::as_hms(mean(tot_rides$ride_length)), 1)))
## [1] "mean ride length = 00:16:34"
# median ride_length
print(paste("median ride length =",
hms::round_hms(hms::as_hms(median(tot_rides$ride_length)),1)))
## [1] "median ride length = 00:10:30"
# max ride_length
print(paste("max ride length =",
hms::round_hms(hms::as_hms(max(tot_rides$ride_length)),1)))
## [1] "max ride length = 22:01:34"
# min ride_length
print(paste("min ride length =",
hms::round_hms(hms::as_hms(min(tot_rides$ride_length)),1)))
## [1] "min ride length = 00:01:00"
# mode day_of_week
# define function to calculate mode
find_mode <- function(x) {
u <- unique(x)
tab <- tabulate(match(x, u))
u[tab == max(tab)]
}
# find mode for day_of_week
busy_day <- find_mode(tot_rides$day_of_week)
days_in_week <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday")
print(paste("mode = most rides occurred on", days_in_week[busy_day]))
## [1] "mode = most rides occurred on Saturday"
# average ride_length for members
print(paste("avg ride length for members =", hms::round_hms(hms::as_hms(mean(tot_rides[
tot_rides$member_casual == 'member', 'ride_length'])), 1)))
## [1] "avg ride length for members = 00:12:39"
# average ride length for casual riders
print(paste("avg ride length for casual riders =", hms::round_hms(hms::as_hms(mean(tot_rides[
tot_rides$member_casual == 'casual', 'ride_length'])), 1)))
## [1] "avg ride length for casual riders = 00:22:13"
# average start time for members
print(paste("avg start time for members =", hms::round_hms(hms::as_hms(mean(hms::as_hms(tot_rides[
tot_rides$member_casual == 'member', 'started_at']))),1)))
## [1] "avg start time for members = 14:30:02"
# average start time for casual riders
print(paste("avg start time for casual riders =", hms::round_hms(hms::as_hms(mean(hms::as_hms(tot_rides[
tot_rides$member_casual == 'casual', 'started_at']))),1)))
## [1] "avg start time for casual riders = 15:02:44"
Next, several subsets were generated for plotting various relationships and trends found in the data. They included:
| Weekday | Member Rides | Casual Rides |
|---|---|---|
| 1 | 378,138 | 380,008 |
| 2 | 463,169 | 271,495 |
| 3 | 507,669 | 257,772 |
| 4 | 512,521 | 268,296 |
| 5 | 520,677 | 302,422 |
| 6 | 456,817 | 327,057 |
| 7 | 432,740 | 462,300 |
| Weekday | Member Rides | Casual Rides |
|---|---|---|
| 1 | 00:13:59 | 00:25:26 |
| 2 | 00:12:13 | 00:22:42 |
| 3 | 00:12:01 | 00:19:54 |
| 4 | 00:12:03 | 00:19:11 |
| 5 | 00:12:14 | 00:19:48 |
| 6 | 00:12:28 | 00:20:51 |
| 7 | 00:14:05 | 00:24:54 |
| Weekday | Member Rides | Casual Rides |
|---|---|---|
| 1 | 14:09:08 | 13:56:57 |
| 2 | 14:35:30 | 15:06:27 |
| 3 | 14:27:18 | 15:19:19 |
| 4 | 14:35:34 | 15:34:43 |
| 5 | 14:35:58 | 15:35:42 |
| 6 | 14:37:26 | 15:26:14 |
| 7 | 14:24:10 | 14:48:37 |
| Month | Member Rides | Casual Rides |
|---|---|---|
| 1 | 83,740 | 18,086 |
| 2 | 92,045 | 20,915 |
| 3 | 190,565 | 88,106 |
| 4 | 239,820 | 123,787 |
| 5 | 347,238 | 274,515 |
| 6 | 391,794 | 361,022 |
| 7 | 407,940 | 396,843 |
| 8 | 417,113 | 350,509 |
| 9 | 395,325 | 289,843 |
| 10 | 341,266 | 203,873 |
| 11 | 231,628 | 98,204 |
| 12 | 133,257 | 43,647 |
| Month | Member Rides | Casual Rides |
|---|---|---|
| 1 | 00:11:45 | 00:17:42 |
| 2 | 00:11:15 | 00:19:51 |
| 3 | 00:11:53 | 00:24:28 |
| 4 | 00:11:31 | 00:23:30 |
| 5 | 00:13:17 | 00:25:54 |
| 6 | 00:13:55 | 00:23:46 |
| 7 | 00:13:43 | 00:23:34 |
| 8 | 00:13:22 | 00:21:54 |
| 9 | 00:12:54 | 00:20:26 |
| 10 | 00:11:47 | 00:18:50 |
| 11 | 00:11:05 | 00:15:50 |
| 12 | 00:10:35 | 00:13:39 |
| Bike | Member Rides | Casual Rides |
|---|---|---|
| electric_bike | 1,588,478 | 1,219,502 |
| classic_bike | 1,683,253 | 876,014 |
| docked_bike | 0 | 173,834 |
| Station | Member Rides |
|---|---|
| Kingsbury St & Kinzie St | 24,506 |
| Clark St & Elm St | 21,605 |
| Wells St & Concord Ln | 20,950 |
| University Ave & 57th St | 19,526 |
| Clinton St & Washington Blvd | 19,375 |
| Ellis Ave & 60th St | 19,105 |
| Wells St & Elm St | 18,669 |
| Loomis St & Lexington St | 18,646 |
| Clinton St & Madison St | 18,480 |
| Broadway & Barry Ave | 17,489 |
| Station | Casual Rides |
|---|---|
| Streeter Dr & Grand Ave | 56,938 |
| DuSable Lake Shore Dr & Monroe St | 31,237 |
| Millennium Park | 24,979 |
| Michigan Ave & Oak St | 24,834 |
| DuSable Lake Shore Dr & North Blvd | 23,129 |
| Shedd Aquarium | 19,832 |
| Theater on the Lake | 18,190 |
| Wells St & Concord Ln | 15,969 |
| Dusable Harbor | 13,828 |
| Clark St & Armitage Ave | 13,590 |
The high-level business task for this case study was to design marketing strategies aimed at converting casual riders into annual members. In consideration of the insights provided from the data and discussed previously, I believe a marketing strategy designed to target the following behaviors may lead toward meeting that goal.
Alternatively, new, targeted membership options might be more appealing for casual riders than annual memberships. Consider adding membership options such as: