In this case study, I play the role of a junior data analyst working within the marketing analytics team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments.
One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders.
Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, the marketing director believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, the company believes there is a very good chance to convert casual riders into members. They noted that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
To design marketing strategies aimed at converting casual riders into annual members. In order to accomplish this, the marketing team needs to better understand how annual members and casual riders differ. The marketing director and their team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
To be able to convert casual riders into members, the analyst must answer the following questions:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
All data retrieved was from Cyclistic’s company database from February 2022 to March 2023. The link can be found here: https://divvy-tripdata.s3.amazonaws.com/index.html. Each month of data was downloaded individually. All data collected was current and deemed reliable for analysis. Cyclistic owns their data, so there is no issue with credibility or data integrity.
Packages for this case study include: ‘tidyverse’, ‘lubridate’, ‘ggplot2’, ‘dplyr’, ‘janitor’, ‘readr’, ‘data.table’, and ‘scales’
Once in R, I set a working directory, then applied the ‘list.files’ function to list out each file in a specific order. This was followed by ‘lapply’ to take in the list of data for input to output the list of .csv files, then ‘rbindlist’ to ultimately input each month of data into a single data frame for processing.
Call list of all column names in data frame (13 total)
colnames(bike_trips)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Renamed columns in data frame for cleaner look (showing just source code to avoid clutter)
bike_trips %>%
dplyr::rename(trip_id = ride_id, bikeid = rideable_type, start_time = started_at, end_time = ended_at, from_station_name = start_station_name, from_station_id = start_station_id, to_station_name = end_station_name, to_station_id = end_station_id, usertype = member_casual)
## trip_id bikeid start_time end_time
## 1: 47EC0A7F82E65D52 classic_bike 2022-03-21 13:45:01 2022-03-21 13:51:18
## 2: 8494861979B0F477 electric_bike 2022-03-16 09:37:16 2022-03-16 09:43:34
## 3: EFE527AF80B66109 classic_bike 2022-03-23 19:52:02 2022-03-23 19:54:48
## 4: 9F446FD9DEE3F389 classic_bike 2022-03-01 19:12:26 2022-03-01 19:22:14
## 5: 431128AD9AFFEDC0 classic_bike 2022-03-21 18:37:01 2022-03-21 19:19:11
## ---
## 5829080: C04510F8EBB5EE8A classic_bike 2023-02-08 21:57:22 2023-02-08 22:08:06
## 5829081: 187BA364FB265C80 electric_bike 2023-02-19 11:29:09 2023-02-19 11:39:11
## 5829082: 46B54F6B417D1B27 electric_bike 2023-02-07 09:01:33 2023-02-07 09:16:53
## 5829083: 335B3CAD59F6C016 electric_bike 2023-02-22 08:33:22 2023-02-22 08:50:11
## 5829084: 03D59518BB151EFA classic_bike 2023-02-01 21:52:17 2023-02-01 22:04:17
## from_station_name from_station_id
## 1: Wabash Ave & Wacker Pl TA1307000131
## 2: Michigan Ave & Oak St 13042
## 3: Broadway & Berwyn Ave 13109
## 4: Wabash Ave & Wacker Pl TA1307000131
## 5: DuSable Lake Shore Dr & North Blvd LF-005
## ---
## 5829080: Clark St & Wrightwood Ave TA1305000014
## 5829081: Ogden Ave & Roosevelt Rd KA1504000101
## 5829082: Clark St & Wrightwood Ave TA1305000014
## 5829083: Clark St & Wrightwood Ave TA1305000014
## 5829084: Clark St & Winnemac Ave TA1309000035
## to_station_name to_station_id start_lat start_lng
## 1: Kingsbury St & Kinzie St KA1503000043 41.88688 -87.62603
## 2: Orleans St & Chestnut St (NEXT Apts) 620 41.90100 -87.62375
## 3: Broadway & Ridge Ave 15578 41.97835 -87.65975
## 4: Franklin St & Jackson Blvd TA1305000025 41.88688 -87.62603
## 5: Loomis St & Jackson Blvd 13206 41.91172 -87.62680
## ---
## 5829080: Sheffield Ave & Waveland Ave TA1307000126 41.92955 -87.64312
## 5829081: Delano Ct & Roosevelt Rd KA1706005007 41.86650 -87.68448
## 5829082: Canal St & Madison St 13341 41.92953 -87.64325
## 5829083: Canal St & Madison St 13341 41.92961 -87.64312
## 5829084: Sheffield Ave & Waveland Ave TA1307000126 41.97335 -87.66786
## end_lat end_lng usertype
## 1: 41.88918 -87.63851 member
## 2: 41.89820 -87.63754 member
## 3: 41.98404 -87.66027 member
## 4: 41.87771 -87.63532 member
## 5: 41.87794 -87.66201 member
## ---
## 5829080: 41.94940 -87.65453 member
## 5829081: 41.86749 -87.63219 member
## 5829082: 41.88241 -87.63977 casual
## 5829083: 41.88241 -87.63977 casual
## 5829084: 41.94940 -87.65453 member
Call list of all rows in data frame
nrow(bike_trips)
## [1] 5829084
Call list of all dimensions in data frame
dim(bike_trips)
## [1] 5829084 13
View first 6 rows of data frame
head(bike_trips)
## ride_id rideable_type started_at ended_at
## 1: 47EC0A7F82E65D52 classic_bike 2022-03-21 13:45:01 2022-03-21 13:51:18
## 2: 8494861979B0F477 electric_bike 2022-03-16 09:37:16 2022-03-16 09:43:34
## 3: EFE527AF80B66109 classic_bike 2022-03-23 19:52:02 2022-03-23 19:54:48
## 4: 9F446FD9DEE3F389 classic_bike 2022-03-01 19:12:26 2022-03-01 19:22:14
## 5: 431128AD9AFFEDC0 classic_bike 2022-03-21 18:37:01 2022-03-21 19:19:11
## 6: 9AA8A13AF7A85325 classic_bike 2022-03-07 17:10:22 2022-03-07 17:15:04
## start_station_name start_station_id
## 1: Wabash Ave & Wacker Pl TA1307000131
## 2: Michigan Ave & Oak St 13042
## 3: Broadway & Berwyn Ave 13109
## 4: Wabash Ave & Wacker Pl TA1307000131
## 5: DuSable Lake Shore Dr & North Blvd LF-005
## 6: Bissell St & Armitage Ave 13059
## end_station_name end_station_id start_lat start_lng
## 1: Kingsbury St & Kinzie St KA1503000043 41.88688 -87.62603
## 2: Orleans St & Chestnut St (NEXT Apts) 620 41.90100 -87.62375
## 3: Broadway & Ridge Ave 15578 41.97835 -87.65975
## 4: Franklin St & Jackson Blvd TA1305000025 41.88688 -87.62603
## 5: Loomis St & Jackson Blvd 13206 41.91172 -87.62680
## 6: Southport Ave & Clybourn Ave TA1309000030 41.91802 -87.65218
## end_lat end_lng member_casual
## 1: 41.88918 -87.63851 member
## 2: 41.89820 -87.63754 member
## 3: 41.98404 -87.66027 member
## 4: 41.87771 -87.63532 member
## 5: 41.87794 -87.66201 member
## 6: 41.92077 -87.66371 member
View last 6 rows of data frame
tail(bike_trips)
## ride_id rideable_type started_at ended_at
## 1: B60EA061E2123F62 electric_bike 2023-02-04 17:52:34 2023-02-04 17:59:57
## 2: C04510F8EBB5EE8A classic_bike 2023-02-08 21:57:22 2023-02-08 22:08:06
## 3: 187BA364FB265C80 electric_bike 2023-02-19 11:29:09 2023-02-19 11:39:11
## 4: 46B54F6B417D1B27 electric_bike 2023-02-07 09:01:33 2023-02-07 09:16:53
## 5: 335B3CAD59F6C016 electric_bike 2023-02-22 08:33:22 2023-02-22 08:50:11
## 6: 03D59518BB151EFA classic_bike 2023-02-01 21:52:17 2023-02-01 22:04:17
## start_station_name start_station_id end_station_name
## 1: Clark St & Wrightwood Ave TA1305000014 Sheffield Ave & Waveland Ave
## 2: Clark St & Wrightwood Ave TA1305000014 Sheffield Ave & Waveland Ave
## 3: Ogden Ave & Roosevelt Rd KA1504000101 Delano Ct & Roosevelt Rd
## 4: Clark St & Wrightwood Ave TA1305000014 Canal St & Madison St
## 5: Clark St & Wrightwood Ave TA1305000014 Canal St & Madison St
## 6: Clark St & Winnemac Ave TA1309000035 Sheffield Ave & Waveland Ave
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1: TA1307000126 41.92960 -87.64315 41.94940 -87.65453 member
## 2: TA1307000126 41.92955 -87.64312 41.94940 -87.65453 member
## 3: KA1706005007 41.86650 -87.68448 41.86749 -87.63219 member
## 4: 13341 41.92953 -87.64325 41.88241 -87.63977 casual
## 5: 13341 41.92961 -87.64312 41.88241 -87.63977 casual
## 6: TA1307000126 41.97335 -87.66786 41.94940 -87.65453 member
View statistical summary of data frame
summary(bike_trips)
## ride_id rideable_type started_at
## Length:5829084 Length:5829084 Min. :2022-03-01 00:00:19.00
## Class :character Class :character 1st Qu.:2022-06-08 07:55:40.00
## Mode :character Mode :character Median :2022-08-02 12:43:52.50
## Mean :2022-08-08 08:42:54.79
## 3rd Qu.:2022-10-01 06:46:10.75
## Max. :2023-02-28 23:59:31.00
##
## ended_at start_station_name start_station_id
## Min. :2022-03-01 00:04:30.00 Length:5829084 Length:5829084
## 1st Qu.:2022-06-08 08:10:03.00 Class :character Class :character
## Median :2022-08-02 13:02:33.50 Mode :character Mode :character
## Mean :2022-08-08 09:02:07.83
## 3rd Qu.:2022-10-01 07:11:25.75
## Max. :2023-03-06 15:09:53.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5829084 Length:5829084 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual
## Min. : 0.00 Min. :-88.14 Length:5829084
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
## NA's :5938 NA's :5938
Adding columns that list date, month, day, and year of each ride This allows for aggregating ride data for each month, day, or year Utilized link: (https://www.statmethods.net/input/dates.html for date format outputs)
bike_trips$date <- as.Date(bike_trips$started_at) # default format is yyyy-mm-dd
bike_trips$month <- format(as.Date(bike_trips$date), "%m")
bike_trips$day <- format(as.Date(bike_trips$date), "%d")
bike_trips$year <- format(as.Date(bike_trips$date), "%Y")
bike_trips$day_of_week <- format(as.Date(bike_trips$date), "%A")
View structure of all columns
str(bike_trips)
## Classes 'data.table' and 'data.frame': 5829084 obs. of 18 variables:
## $ ride_id : chr "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ...
## $ rideable_type : chr "classic_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct, format: "2022-03-21 13:45:01" "2022-03-16 09:37:16" ...
## $ ended_at : POSIXct, format: "2022-03-21 13:51:18" "2022-03-16 09:43:34" ...
## $ start_station_name: chr "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ...
## $ start_station_id : chr "TA1307000131" "13042" "13109" "TA1307000131" ...
## $ end_station_name : chr "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ...
## $ end_station_id : chr "KA1503000043" "620" "15578" "TA1305000025" ...
## $ start_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ end_lng : num -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
## $ date : Date, format: "2022-03-21" "2022-03-16" ...
## $ month : chr "03" "03" "03" "03" ...
## $ day : chr "21" "16" "23" "01" ...
## $ year : chr "2022" "2022" "2022" "2022" ...
## $ day_of_week : chr "Monday" "Wednesday" "Wednesday" "Tuesday" ...
## - attr(*, ".internal.selfref")=<externalptr>
Converting the “started at” and “ended at” times to Date & Time using the lubridate package
bike_trips$started_at <- lubridate::ymd_hms(bike_trips$started_at)
bike_trips$ended_at <- lubridate::ymd_hms(bike_trips$ended_at)
Converted the time to hours, mins, secs
bike_trips$start_time <- lubridate::hms(bike_trips$started_at)
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings failed to
## parse, or all strings are NAs
bike_trips$end_time <- lubridate::hms(bike_trips$ended_at)
## Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings failed to
## parse, or all strings are NAs
Want to know how many hours were ridden per individual
bike_trips$start_hour <- lubridate::hour(bike_trips$started_at)
bike_trips$end_hour <- lubridate::hour(bike_trips$ended_at)
Removing columns that were discontinued but never removed from the data set
bike_trips <- bike_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
bike_trips <- bike_trips %>%
select(-c(start_time, end_time))
Adding a “ride_length” calculation for bike_trips Utilized the time intervals of difftime in base R help documentation for reference
bike_trips$ride_length <-difftime(bike_trips$ended_at, bike_trips$started_at)
Converted ride_length time to mins
bike_trips$ride_length <-difftime(bike_trips$ended_at, bike_trips$started_at, units = "mins")
Converted “ride_length” from Factor to numeric to run calculations on the data using the is.factor() function
is.factor(bike_trips$ride_length)
## [1] FALSE
bike_trips$ride_length <- as.numeric(as.character(bike_trips$ride_length))
is.numeric(bike_trips$ride_length)
## [1] TRUE
Removed all bad data (data frame includes entries where the quality of bikes are called into question as well as zero ride time accounted for)…New data frame created since data is being removed
bike_trips_V2 <- bike_trips[!(bike_trips$start_station_name=="HQ QR" | bike_trips$ride_length<0)]
Descriptive analysis ran for ride_length (all figures in mins)
mean(bike_trips_V2$ride_length) # straight average (total ride length / rides)
## [1] 19.22003
median(bike_trips_V2$ride_length) #mid number in ascending array of data frame
## [1] 10.15
max(bike_trips_V2$ride_length) #maximum ride length
## [1] 41387.25
min(bike_trips_V2$ride_length) #minimum ride length
## [1] 0
Summarizing descriptive analysis above for ride_length
summary(bike_trips_V2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.73 10.15 19.22 18.23 41387.25
Comparing the user type (member vs casual in the data frame)
aggregate(bike_trips_V2$ride_length ~ bike_trips_V2$member_casual, FUN = mean)
## bike_trips_V2$member_casual bike_trips_V2$ride_length
## 1 casual 28.94447
## 2 member 12.58045
aggregate(bike_trips_V2$ride_length ~ bike_trips_V2$member_casual, FUN = median)
## bike_trips_V2$member_casual bike_trips_V2$ride_length
## 1 casual 12.833333
## 2 member 8.733333
aggregate(bike_trips_V2$ride_length ~ bike_trips_V2$member_casual, FUN = max)
## bike_trips_V2$member_casual bike_trips_V2$ride_length
## 1 casual 41387.25
## 2 member 1559.90
aggregate(bike_trips_V2$ride_length ~ bike_trips_V2$member_casual, FUN = min)
## bike_trips_V2$member_casual bike_trips_V2$ride_length
## 1 casual 0
## 2 member 0
Seeking average ride time by each day for members vs casual bike users
aggregate(bike_trips_V2$ride_length ~ bike_trips_V2$member_casual + bike_trips_V2$day_of_week, FUN = mean)
## bike_trips_V2$member_casual bike_trips_V2$day_of_week
## 1 casual Friday
## 2 member Friday
## 3 casual Monday
## 4 member Monday
## 5 casual Saturday
## 6 member Saturday
## 7 casual Sunday
## 8 member Sunday
## 9 casual Thursday
## 10 member Thursday
## 11 casual Tuesday
## 12 member Tuesday
## 13 casual Wednesday
## 14 member Wednesday
## bike_trips_V2$ride_length
## 1 27.88992
## 2 12.39832
## 3 28.90804
## 4 12.15400
## 5 32.56783
## 6 14.04522
## 7 33.91457
## 8 13.93516
## 9 25.28131
## 10 12.15608
## 11 25.53509
## 12 11.93440
## 13 24.40719
## 14 11.97117
Days of the week in data frame need to be in order
bike_trips_V2$day_of_week <- ordered(bike_trips_V2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
Finding the average ride time by each day for members vs casual bike users
aggregate(bike_trips_V2$ride_length ~ bike_trips_V2$member_casual + bike_trips_V2$day_of_week, FUN = mean)
## bike_trips_V2$member_casual bike_trips_V2$day_of_week
## 1 casual Sunday
## 2 member Sunday
## 3 casual Monday
## 4 member Monday
## 5 casual Tuesday
## 6 member Tuesday
## 7 casual Wednesday
## 8 member Wednesday
## 9 casual Thursday
## 10 member Thursday
## 11 casual Friday
## 12 member Friday
## 13 casual Saturday
## 14 member Saturday
## bike_trips_V2$ride_length
## 1 33.91457
## 2 13.93516
## 3 28.90804
## 4 12.15400
## 5 25.53509
## 6 11.93440
## 7 24.40719
## 8 11.97117
## 9 25.28131
## 10 12.15608
## 11 27.88992
## 12 12.39832
## 13 32.56783
## 14 14.04522
Analyzing ride data by type and weekday
bike_trips_V2 %>%
mutate(weekday = lubridate::wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by usertype and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday) # sorts
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 398622 33.9
## 2 casual Mon 283323 28.9
## 3 casual Tue 272307 25.5
## 4 casual Wed 279897 24.4
## 5 casual Thu 313639 25.3
## 6 casual Fri 338803 27.9
## 7 casual Sat 478474 32.6
## 8 member Sun 402854 13.9
## 9 member Mon 488626 12.2
## 10 member Tue 546612 11.9
## 11 member Wed 542431 12.0
## 12 member Thu 547394 12.2
## 13 member Fri 481191 12.4
## 14 member Sat 454810 14.0
Utilized ggplot to create two data visuals: Column charts for ride comparison by user type and average ride duration
bike_trips_V2 %>%
mutate(weekday = lubridate::wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),
.groups = "drop_last",
average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
# Applied the scale_y_cont.() to remove sci. notation on y-axis
scale_y_continuous(label=comma) +
labs(title = "Day of Week Ride Count by User Type",x="Weekday",y="Number of Rides" ) +
# centered the text title
theme(plot.title=element_text(hjust=0.5)) +
# changing name of legend using labs(fill)
labs(fill='User Type') +
geom_col(position = "dodge") +
scale_fill_manual(values = c("casual" = "#52b64a",
"member" = "#e79b5e"))
bike_trips_V2 %>%
mutate(weekday = lubridate::wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),
.groups = "drop_last"
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
labs(title = "Average Ride Duration Per Day (in Mins)",x="Weekday",y="Average Ride Duration") +
# need to center the text tile
theme(plot.title=element_text(hjust=0.5)) +
# changing name of legend using labs(fill)
labs(fill='User Type') +
geom_col(position = "dodge")+
scale_fill_manual(values = c("casual" = "#52b64a",
"member" = "#e79b5e"))
The following analysis indicates that casual riders use Cyclistic bikes less frequently minus Saturday, but for a longer period of time than it’s members do. As a result of the ride data, target marketing can be generated to those who ride especially on the weekends since this is when both ride count and ride duration are at their peak.
Casual riders would potentially buy Cyclistic annual memberships to have a broad selection of bikes to choose from when deciding to ride. If there was an incentive attached to the new member conversion process, memberships could spike from the subset of casual riders.
Cyclistic can use digital media to sway potential cycle enthusiasts with a strategy centered around convenience and choice when it comes to their daily biking options.