Introduction

This is an analytical case study of Cyclistic, a fictional bike-sharing company in Chicago, USA. The scenario is as follows:

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders, even though the pricing flexibility helps Cyclistic attract more customers. The director of marketing believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, she believes there is a very good chance to convert casual riders into members.

With the data provided, I’ll try to answer the following two questions:

1. How do annual members and casual riders use Cyclistic bikes differently?
2. How can casual riders be converted into annual members?

Data source / Notes:

With that out of the way, let’s get started.

First look at the data

For this analysis, the last 12 months’ data was used, with each month’s data given as a separate file. I first approached the data in a spreadsheet, but it turned out to be too big, so I moved to R.

Since the data is separated by month, let’s collate it into one data frame and create a backup before we start the analysis.

# Loading the libraries used throughout this analysis
library(tidyverse) # dplyr, ggplot2, the %>% pipe, etc.
library(lubridate) # date-time parsing (ymd_hms)
library(skimr)     # data summaries (skim_without_charts)

# Helper function: read a monthly CSV and append it to the running data frame
data_read_and_collate <- function(df, new_df_path) {
  df_tmp <- read.csv(new_df_path, na.strings = "")
  df <- rbind(df, df_tmp)
  rm(df_tmp)
  return(df)
}

#202205
#'data_path_202205' has the path for the 2022-May data
df_202205 <- read.csv(data_path_202205, na.strings = "") 

df_data <- df_202205  # creating the first data frame
rm(df_202205)

#202204
#'data_path_202204' has the path for the 2022-April data
#using the helper to read and append the data to 'df_data'
df_data <- data_read_and_collate(df_data, data_path_202204) 

# Similarly collate the last 12 months' data into 'df_data'.......
#.....................
# Backing up our data
df_data_backup <- df_data
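
As an aside, the twelve reads could also be collapsed into one step. A minimal sketch, assuming the twelve file paths are collected in a character vector (here hypothetically named 'data_paths'):

# Alternative: read every monthly file and bind the rows in one go
# 'data_paths' is a hypothetical character vector of the 12 file paths
df_data <- dplyr::bind_rows(lapply(data_paths, read.csv, na.strings = ""))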
# Let us view our data
skim_without_charts(df_data)
Data summary
Name df_data
Number of rows 5860776
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 5860776 0
rideable_type 0 1.00 11 13 0 3 0
started_at 0 1.00 19 19 0 4896834 0
ended_at 0 1.00 19 19 0 4893478 0
start_station_name 823167 0.86 3 53 0 1105 0
start_station_id 823164 0.86 3 44 0 1063 0
end_station_name 878338 0.85 9 53 0 1112 0
end_station_id 878338 0.85 3 44 0 1068 0
member_casual 0 1.00 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80
end_lat 5036 1 41.90 0.05 41.39 41.88 41.90 41.93 42.17
end_lng 5036 1 -87.65 0.03 -88.97 -87.66 -87.64 -87.63 -87.49

We can see that there are 13 columns and over 5.8 million rows. Let’s take a look at the data columns.

Data Descriptions -

A few points need to be considered. There is no column with the ride length, but we can calculate it from the start and end times. There are missing values in “start_station_name”/“start_station_id”, “end_station_name”/“end_station_id”, and in the geographic coordinates. The coordinates also don’t look entirely accurate; we need to cross-check them against Google Maps.
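
For example, we can count how many start coordinates fall outside the city. A minimal sketch, assuming a rough bounding box for Chicago (the bounds below are my assumption, not exact city limits):

# count start coordinates outside a rough Chicago bounding box
# (the bounds are assumed for illustration, not exact city limits)
df_data %>% 
  filter(start_lat < 41.6 | start_lat > 42.1 | 
         start_lng < -88.0 | start_lng > -87.5) %>% 
  count()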

With that in mind, let’s start the data processing:

# change start and end date format to DateTime
# add the trip duration
# add weekday/month column to analyze the data by weekday/month

df_data <- df_data %>% 
  mutate(started_at = ymd_hms(started_at), 
         ended_at = ymd_hms(ended_at), 
         trip_duration = difftime(ended_at, started_at, units = "mins"),
         weekdays = weekdays(started_at),
         months = months(started_at))
skim_without_charts(df_data)
Data summary
Name df_data
Number of rows 5860776
Number of columns 16
_______________________
Column type frequency:
character 9
difftime 1
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 5860776 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 823167 0.86 3 53 0 1105 0
start_station_id 823164 0.86 3 44 0 1063 0
end_station_name 878338 0.85 9 53 0 1112 0
end_station_id 878338 0.85 3 44 0 1068 0
member_casual 0 1.00 6 6 0 2 0
weekdays 0 1.00 6 9 0 7 0
months 0 1.00 3 9 0 12 0

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
trip_duration 0 1 -58.03 mins 55944.15 mins 11.33 mins 24942

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80
end_lat 5036 1 41.90 0.05 41.39 41.88 41.90 41.93 42.17
end_lng 5036 1 -87.65 0.03 -88.97 -87.66 -87.64 -87.63 -87.49

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2021-06-01 00:00:38 2022-05-31 23:59:56 2021-09-23 17:33:23 4896834
ended_at 0 1 2021-06-01 00:06:22 2022-06-02 11:35:01 2021-09-23 17:49:29 4893478

As we can see, the trip duration ranges from negative values to very high ones (over 38 days). Looking more closely, the negative-value rows have an ending time that is before the starting time. This could be a data-collection issue where the start and end times got swapped, but as we can’t prove that, it is best to remove those rows altogether.
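
As a quick check (a sketch), we can count the rows where the recorded end time precedes the start time:

# count rides whose end time is earlier than their start time
df_data %>% filter(ended_at < started_at) %>% count(member_casual)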

Trip duration varies over a very wide range, so let’s create bins to better visualize the distribution.

# creating bins for: below 0 min, 0-1 min, 1-5 min, 5 min - 1 hour,
# 1-2 hours, 2-5 hours, 5 hours - 1 day, 1-2 days, above 2 days
df_data <- df_data %>% 
  mutate(trip_duration_bin = cut(as.double(trip_duration), 
                                 breaks = c(-Inf, 0, 1, 5, 60, 120, 300, 1440, 2880, Inf)))

ggplot(df_data, aes(x = trip_duration_bin, color = member_casual)) + 
  geom_bar(position = "dodge", fill = "white") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

It is quite clear that trips of 5 minutes to 1 hour are by far the most common.

Let us create a few summary tables for analyzing and visualizing the data. But before we do that, let us drop the rows with negative trip durations. I am also treating trips shorter than 1 minute as customer mistakes (false starts) for this analysis; as you can see below, the mean and median trip duration for this bin is near 30-40 seconds.

df_data %>% filter(trip_duration_bin %in% c("(-Inf,0]", "(0,1]")) %>% 
  group_by(trip_duration_bin, member_casual) %>% 
  summarise(mean = mean(trip_duration), n = n(), 
            median = median(trip_duration), sd = sd(trip_duration), 
            total_travel = sum(trip_duration))

We could make the same argument for trips lasting over 2 days, as the maximum travel time is around 38 days. But I am keeping those values, as there are very few of them (700-800 rows). Maybe we can explore them in a future analysis.
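
Again as a quick check (a sketch), we can count the rides that exceed 2 days (2880 minutes):

# count rides longer than 2 days per rider type
df_data %>% filter(as.double(trip_duration) > 2880) %>% count(member_casual)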

So let’s drop the rows with a trip duration of less than 1 minute and move on to the analysis phase.

There are also null values in the station columns, but we will handle those when we create the summary tables that use them.

# dropping the rows with negative or sub-1-minute trip durations

df_data <- df_data %>% filter(!trip_duration_bin %in% c("(-Inf,0]", "(0,1]"))
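
As a sanity check (a sketch), the two lowest bins should now be empty:

# the "(-Inf,0]" and "(0,1]" bins should now show zero rows
table(df_data$trip_duration_bin)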

Analyzing the data

We will analyze the data by weekday, month, and trip-duration bin, and also location-wise using the station data.

Yearly Trend

# getting the count of trips per month
df_month_count <- df_data %>% group_by(months, member_casual) %>% summarise(count = n())
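
One caveat: the months column holds plain text, so any tool will sort it alphabetically by default. A minimal sketch for enforcing calendar order, using base R’s built-in month.name constant (this assumes an English locale, where months() produces the same labels):

# put the month labels in calendar order rather than alphabetical order
df_month_count <- df_month_count %>% 
  mutate(months = factor(months, levels = month.name))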

This graph was made with Tableau; to see the Tableau viz, check the appendix.


Weekly Trend

# getting count, average, median, sd values - grouping by weekdays, member_casual
df_summarise2 <- df_data %>% group_by(weekdays, member_casual) %>% 
  summarise(count = n(),
            total_trip_duration = sum(trip_duration),
            mean = mean(trip_duration),
            median = median(trip_duration),
            sd = sd(trip_duration))
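
The same alphabetical-sorting caveat applies to the weekdays column; a minimal sketch (again assuming English weekday labels):

# put the weekday labels in Monday-Sunday order
df_summarise2 <- df_summarise2 %>% 
  mutate(weekdays = factor(weekdays, 
                           levels = c("Monday", "Tuesday", "Wednesday", "Thursday", 
                                      "Friday", "Saturday", "Sunday")))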

This graph was made with Tableau; to see the Tableau viz, check the appendix.


# getting count, average, median, sd values - grouping by weekdays, trip_duration_bin, member_casual
df_summarise3 <- df_data %>% 
  group_by(weekdays,trip_duration_bin, member_casual) %>% 
  summarise(count = n(),
            total_trip_duration = sum(trip_duration),
            mean = mean(trip_duration),
            median = median(trip_duration),
            sd = sd(trip_duration))

This graph was made with Tableau; to see the Tableau viz, check the appendix.


Locations

# count per starting station
df_station_count <- df_data %>% filter(!is.na(start_station_name)) %>% 
  group_by(start_station_name, member_casual) %>% 
  summarise(count = n())

# count per starting station by weekday (reusing the existing 'weekdays' column)
df_station_count_wkd <- df_data %>% 
  filter(!is.na(start_station_name)) %>% 
  group_by(start_station_name, weekdays, member_casual) %>% 
  summarise(count = n())

# getting the top paths (start station to end station)
df_path_wkd <- df_data %>% 
  filter(!is.na(start_station_name) & !is.na(end_station_name)) %>% 
  mutate(path = paste(start_station_name, end_station_name, sep = " to ")) %>% 
  group_by(path, weekdays, member_casual) %>% 
  summarise(count = n())

These data frames were saved, and only the top 20 per member_casual were kept for further analysis. The longitudes and latitudes in the data do not seem accurate, so I looked the values up on Google Maps (for the top 20 only) and then plotted them with Tableau.
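
For example, the top 20 starting stations per rider type could be extracted and saved like this (a sketch; slice_max() is from dplyr, and the file name is my assumption):

# keep the 20 busiest starting stations per rider type and save them for Tableau
df_station_top20 <- df_station_count %>% 
  group_by(member_casual) %>% 
  slice_max(count, n = 20)
write.csv(df_station_top20, "station_count_top20.csv", row.names = FALSE)  # assumed file name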

This graph was made with Tableau; to see the Tableau viz, check the appendix.


Summing up

For the location data frames, the data was saved, and the longitude and latitude were recorded from Google Maps for only the top 20 stations for members and casual riders.

This graph was made with Tableau; to see the Tableau viz, check the appendix.


After the analysis, it is clear that usage is very different between members and casual riders. As casual riders use the bikes near tourist stops, an argument can be made that a subset of them are from out of town, mainly tourists. Since this bike-sharing company operates only in Chicago, those customers can never be converted to members. To increase membership by converting casual users, the marketing campaign needs to hyper-focus on the subset of casual riders who use the bikes in a similar manner to the members: which casual riders take the same routes as the members, how often they take those routes, and which stations that are very popular with members are also used by casual riders for the same purposes.
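
As a concrete starting point for that targeting, the popular member routes could be cross-referenced against casual usage. A minimal sketch building on 'df_path_wkd' (the object names here are illustrative):

# find the 20 routes most popular with members
top_member_paths <- df_path_wkd %>% 
  filter(member_casual == "member") %>% 
  group_by(path) %>% 
  summarise(count = sum(count)) %>% 
  slice_max(count, n = 20)

# then measure how heavily casual riders use those same routes
casual_on_member_paths <- df_path_wkd %>% 
  filter(member_casual == "casual", path %in% top_member_paths$path) %>% 
  group_by(path) %>% 
  summarise(casual_count = sum(count))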

Note: Further analysis could be done if customer details or demographic data were available. More details on the stations/station data would also be very useful.

Appendix