Introduction
This is an analytical case study for an imaginary bike-sharing company - Cyclistic, in Chicago, USA. The scenario is as mentioned below -
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers. The director of marketing believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers. She believes there is a very good chance to convert casual riders into members.
With the data provided, I’ll try to answer the below two points -
- How do annual members and casual riders use Cyclistic bikes differently?
- Why would casual riders buy Cyclistic annual memberships?
Data source / Notes:
- I have used the latest 12 months’ data for this analysis.
- Data can be downloaded from here.
- Note: The datasets have a different name because Cyclistic is a fictional company. The data has been made available by Motivate International Inc. under this license.
- This is public data that you can also use to explore how different customer types are using Cyclistic bikes. But note that data-privacy issues prohibit you from using riders’ personally identifiable information. This means that you won’t be able to connect past purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.
With that out of the way let’s get started.
First look at the data
For this analysis, the last 12 months’ data was used. Each month’s data was given as a separate file. Primarily I approached the data on a spreadsheet. But the data turned out to be too big, so I moved to R.
Data is separated by month. So let’s collate the data into one data frame and create a backup before we start the analysis.
# Helper func
data_read_and_collate <- function(df, new_df_path)
{
df_tmp <- read.csv(new_df_path, na.strings = "")
df <- rbind(df, df_tmp)
rm(df_tmp)
return(df)
}
#202205
#'data_path_202205' has the path for the 2022-May data
df_202205 <- read.csv(data_path_202205, na.strings = "")
df_data <- df_202205 #creating the 1st data frame
rm(df_202205)
#202204
#'data_path_202204' has the path for the 2022-April data
#using the helper to read and append the data to 'df_data'
df_data <- data_read_and_collate(df_data, data_path_202204)
# Similarly collate the last 12 months' data into 'df_data'.......
#.....................# Backing up our data
df_data_backup <- df_data
# Let us view our data
skim_without_charts(df_data)| Name | df_data |
| Number of rows | 5860776 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5860776 | 0 |
| rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1.00 | 19 | 19 | 0 | 4896834 | 0 |
| ended_at | 0 | 1.00 | 19 | 19 | 0 | 4893478 | 0 |
| start_station_name | 823167 | 0.86 | 3 | 53 | 0 | 1105 | 0 |
| start_station_id | 823164 | 0.86 | 3 | 44 | 0 | 1063 | 0 |
| end_station_name | 878338 | 0.85 | 9 | 53 | 0 | 1112 | 0 |
| end_station_id | 878338 | 0.85 | 3 | 44 | 0 | 1068 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
| end_lat | 5036 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 |
| end_lng | 5036 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 |
We can see that there are 13 columns and over 5.8 million rows. Let’s take a look at the data columns.
Data Descriptions -
- “ride_id” - Unique ID per row / primary key.
- “rideable_type”- type cycle used for the ride.
- “started_at”- Start time of the ride.
- “ended_at”- End time of the ride.
- “start_station_name” & “start_station_id”- Ride staring station name and its ID.
- “end_station_name” & “end_station_id”- Ride ending station name and its ID.
- “start_lat” & “start_lng”- Starting station geographical location.
- “end_lat” & “end_lng”- Ending station geographical location.
- member_casual- Classifier for the ride being “member” or “casual”.
Few points that we need to consider. There is no column with ride length, but we can calculate that using start and end time. There are few missing values in “start_station_name” & “start_station_id”, “end_station_name” & “end_station_id” and geographical location values. The geographical location values also seem to be not complete need crosscheck that with Google Maps.
With that in mind let us start with the data processing -
# change start and end date format to DateTime
# add the trip duration
# add weekday/month column to analyze the data by weekday/month
df_data <- df_data %>%
mutate(started_at = ymd_hms(started_at),
ended_at = ymd_hms(ended_at),
trip_duration = difftime(ended_at,started_at,units="mins"),
weekdays = weekdays.POSIXt(started_at),
months = months.POSIXt(started_at))skim_without_charts(df_data)| Name | df_data |
| Number of rows | 5860776 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| difftime | 1 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5860776 | 0 |
| rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 823167 | 0.86 | 3 | 53 | 0 | 1105 | 0 |
| start_station_id | 823164 | 0.86 | 3 | 44 | 0 | 1063 | 0 |
| end_station_name | 878338 | 0.85 | 9 | 53 | 0 | 1112 | 0 |
| end_station_id | 878338 | 0.85 | 3 | 44 | 0 | 1068 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
| weekdays | 0 | 1.00 | 6 | 9 | 0 | 7 | 0 |
| months | 0 | 1.00 | 3 | 9 | 0 | 12 | 0 |
Variable type: difftime
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| trip_duration | 0 | 1 | -58.03 mins | 55944.15 mins | 11.33 mins | 24942 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
| end_lat | 5036 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 |
| end_lng | 5036 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2021-06-01 00:00:38 | 2022-05-31 23:59:56 | 2021-09-23 17:33:23 | 4896834 |
| ended_at | 0 | 1 | 2021-06-01 00:06:22 | 2022-06-02 11:35:01 | 2021-09-23 17:49:29 | 4893478 |
As we can see the trip duration ranges from negative values to very high values ( above 30 days ). Looking more closely we can see that the negative value rows have an ending time that is before the starting time. This can be an issue in the data collection where the starting and ending times got exchanged, but as we can’t prove that, it is best to remove those rows altogether.
Trip duration varies in range a lot. We create bins to better visualize the distributions.
# creating bins for- below 0 min, 0-1 min, 1-5 min, 5 min - 1 hour
# 1-2 hour, 2-5 hour, 5 hour - 1 day, 1-2 day, above 2 day
df_data <- df_data %>%
mutate(trip_duration_bin = cut(as.double(trip_duration),
breaks = c(-Inf, 0, 1, 5, 60, 120, 300, 1440, 2880, Inf)))
ggplot(df_data, aes( x = trip_duration_bin, color = member_casual)) +
geom_bar(position="dodge", fill = "white")+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))It is quite clear that 5 min - 1 hour trips are the most common ones.
Let us create a few summary tables for analyzing and visualizing the data. But before we do that let us drop the rows with negative trip duration. Trips that are less than 1 min, I am considering them as a mistake from the customers for this analysis. As you can see below the mean and median trip duration for this bin is near 30-40 sec.
df_data %>% filter(trip_duration_bin %in% c("(-Inf,0]", "(0,1]")) %>%
group_by(trip_duration_bin, member_casual) %>%
summarise(mean = mean(trip_duration), n = n(),
median = median(trip_duration),sd = sd(trip_duration),
total_travel = sum(trip_duration)) %>% We could make the same argument for a trip lasting over 2 days, as the max travel time is around 38 days. But I am keeping those values as their number is very low (700-800 count). Maybe we can explore them in future analyses.
So let’s drop the rows with a trip duration of less than 1 min and start with our analysis phase.
There are null values in station data also, but we will handle those when we are creating summary tables with them.
# dropping rows
df_data <- df_data %>% filter(!trip_duration_bin %in% c("(-Inf,0]","(0,1]"))Analyzing data
Will analyze the data with respect to weekdays, months, trip duration bins, and also location-wise with the station data.
Yearly Trend
# getting the count of trips per month
df_month_count <- df_data %>% group_by(months, member_casual) %>% summarise(count = n())This graph is made with Tableau. To see Tableau Viz check the appendix.
Weekly Trend
# getting count, average, median, sd values - grouping by weekdays, member_casual
df_summarise2 <- df_data %>% group_by(weekdays, member_casual) %>%
summarise(count = n(),
total_trip_duration = sum(trip_duration),
mean = mean(trip_duration),
median = median(trip_duration),
sd = sd(trip_duration))This graph is made with Tableau. To see Tableau Viz check the appendix.
# getting count, average, median, sd values- grouping by weekdays,trip_duration_bin,member_casual
df_summarise3 <- df_data %>%
group_by(weekdays,trip_duration_bin, member_casual) %>%
summarise(count = n(),
total_trip_duration = sum(trip_duration),
mean = mean(trip_duration),
median = median(trip_duration),
sd = sd(trip_duration))This graph is made with Tableau. To see Tableau Viz check the appendix.
Locations
# count per staring station
df_station_count <- df_data %>% filter(!is.na(start_station_name)) %>%
group_by(start_station_name, member_casual) %>%
summarise(count = n())
# count per staring station with weekday
df_station_count_wkd <- df_data %>%
filter(!is.na(start_station_name)) %>%
mutate(weekd = weekdays.POSIXt(started_at)) %>%
group_by(start_station_name,weekd, member_casual) %>%
summarise(count = n())
# getting the top paths
df_path_wkd <- df_data %>% filter(!is.na(start_station_name) & !is.na(end_station_name)) %>%
mutate(path = paste(start_station_name,end_station_name, sep = " to " )) %>%
group_by(path, weekdays, member_casual) %>% summarise(count = n())These data were saved and only the top 20 per member_casual were taken for further analysis. The longitude and latitude in the data do not seem to be perfect, so I extracted those values from Google Maps (for only the top 20 as mentioned above) and then plotted them with Tableau.
This graph is made with Tableau. To see Tableau Viz check the appendix.
Summing up
For the location data frames, the data was saved, and using Google Maps longitude and latitude were recorded for only the top 20 stations for members and casual rides.
- The number of trips increases significantly in the summer times - May-October. Casual riders’ usage shoots up from May and again drops down in November. But the member riders’ have consistent usage except for January and February month. The marketing campaign needs to focus on this time frame (May-October) to be more successful with the casual riders.
- Both on weekdays and weekends the Casual rides are mostly near tourist/weekend destinations. This might be an opportunity to boost the total customer base by increasing stations near tourist spots.
- The Casual customers that use the bikes in a similar manner to that of the Members, should be targeted as they would be more inclined to get the annual membership.
This graph is made with Tableau. To see Tableau Viz check the appendix.
After the analysis, it is clear that the usage is very different between the members and casuals. As casuals are using bikes near tourist stops an argument can be made that, a subset of the riders’ are from out of town mainly tourists. As this bike-sharing company is in Chicago only those customers can never be converted to members. To increase members by converting casual users, the marketing campaign needs to hyper-focus on the subset of casual riders that are using the bikes in a similar manner as the members are, like which casual riders’ are taking the same routes as the members, how often those routes are taken by the casual riders’. Which stations are very popular with the members and target the casual riders’ that are using those same stations for the same purpose.
Note: Further analysis can be done if customer details or demographic data are available. Details on stations/station data can also be very useful.
Appendix
- Data can be downloaded from here.
- Note: The data sets have a different name because Cyclistic is a fictional company. The data has been made available by Motivate International Inc. under this license.
- Tableau Viz