Introduction

This is an analytical case study of Cyclistic, a fictional bike-sharing company in Chicago, USA. The scenario is as follows:

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders, even though the pricing flexibility helps Cyclistic attract more customers. The director of marketing believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, she believes there is a very good chance to convert casual riders into members.

With the data provided, I’ll try to answer the following two questions:

1. How do annual members and casual riders use Cyclistic bikes differently?
2. How can casual riders be converted into annual members?

Data source / Notes:

With that out of the way, let’s get started.

First look at the data

For this analysis, the last 12 months’ data was used, with each month’s data given as a separate file. I first approached the data in a spreadsheet, but it turned out to be too big, so I moved to R.

Since the data is separated by month, let’s collate it into one data frame and create a backup before we start the analysis.

# Loading the libraries used throughout this analysis
library(tidyverse) # dplyr, ggplot2, the %>% pipe, etc.
library(lubridate) # date-time parsing (ymd_hms)
library(skimr)     # data summaries (skim_without_charts)

# Helper function: read a monthly CSV and append it to the running data frame
data_read_and_collate <- function(df, new_df_path) {
  df_tmp <- read.csv(new_df_path, na.strings = "")
  df <- rbind(df, df_tmp)
  rm(df_tmp)
  return(df)
}

#202205
#'data_path_202205' has the path for the 2022-May data
df_202205 <- read.csv(data_path_202205, na.strings = "") 

df_data <- df_202205  # creating the first data frame
rm(df_202205)

#202204
#'data_path_202204' has the path for the 2022-April data
#using the helper to read and append the data to 'df_data'
df_data <- data_read_and_collate(df_data, data_path_202204) 

# Similarly collate the last 12 months' data into 'df_data'.......
#.....................
# Backing up our data
df_data_backup <- df_data
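
As an aside, the twelve reads could also be collapsed into one step. A minimal sketch, assuming the twelve file paths are collected in a character vector (here hypothetically named 'data_paths'):

# Alternative: read every monthly file and bind the rows in one go
# 'data_paths' is a hypothetical character vector of the 12 file paths
df_data <- dplyr::bind_rows(lapply(data_paths, read.csv, na.strings = ""))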
# Let us view our data
skim_without_charts(df_data)
Data summary
Name df_data
Number of rows 5860776
Number of columns 13
_______________________
Column type frequency:
character 9
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 5860776 0
rideable_type 0 1.00 11 13 0 3 0
started_at 0 1.00 19 19 0 4896834 0
ended_at 0 1.00 19 19 0 4893478 0
start_station_name 823167 0.86 3 53 0 1105 0
start_station_id 823164 0.86 3 44 0 1063 0
end_station_name 878338 0.85 9 53 0 1112 0
end_station_id 878338 0.85 3 44 0 1068 0
member_casual 0 1.00 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80
end_lat 5036 1 41.90 0.05 41.39 41.88 41.90 41.93 42.17
end_lng 5036 1 -87.65 0.03 -88.97 -87.66 -87.64 -87.63 -87.49

We can see that there are 13 columns and over 5.8 million rows. Let’s take a look at the data columns.

Data Descriptions -

A few points need to be considered. There is no column with the ride length, but we can calculate it from the start and end times. There are missing values in “start_station_name”/“start_station_id”, “end_station_name”/“end_station_id”, and in the geographic coordinates. The coordinates also don’t look entirely accurate; we need to cross-check them against Google Maps.
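
For example, we can count how many start coordinates fall outside the city. A minimal sketch, assuming a rough bounding box for Chicago (the bounds below are my assumption, not exact city limits):

# count start coordinates outside a rough Chicago bounding box
# (the bounds are assumed for illustration, not exact city limits)
df_data %>% 
  filter(start_lat < 41.6 | start_lat > 42.1 | 
         start_lng < -88.0 | start_lng > -87.5) %>% 
  count()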

With that in mind, let’s start the data processing:

# change start and end date format to DateTime
# add the trip duration
# add weekday/month column to analyze the data by weekday/month

df_data <- df_data %>% 
  mutate(started_at = ymd_hms(started_at), 
         ended_at = ymd_hms(ended_at), 
         trip_duration = difftime(ended_at, started_at, units = "mins"),
         weekdays = weekdays(started_at),
         months = months(started_at))
skim_without_charts(df_data)
Data summary
Name df_data
Number of rows 5860776
Number of columns 16
_______________________
Column type frequency:
character 9
difftime 1
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 5860776 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 823167 0.86 3 53 0 1105 0
start_station_id 823164 0.86 3 44 0 1063 0
end_station_name 878338 0.85 9 53 0 1112 0
end_station_id 878338 0.85 3 44 0 1068 0
member_casual 0 1.00 6 6 0 2 0
weekdays 0 1.00 6 9 0 7 0
months 0 1.00 3 9 0 12 0

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
trip_duration 0 1 -58.03 mins 55944.15 mins 11.33 mins 24942

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80
end_lat 5036 1 41.90 0.05 41.39 41.88 41.90 41.93 42.17
end_lng 5036 1 -87.65 0.03 -88.97 -87.66 -87.64 -87.63 -87.49

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2021-06-01 00:00:38 2022-05-31 23:59:56 2021-09-23 17:33:23 4896834
ended_at 0 1 2021-06-01 00:06:22 2022-06-02 11:35:01 2021-09-23 17:49:29 4893478

As we can see, the trip duration ranges from negative values to very high ones (over 38 days). Looking more closely, the negative-value rows have an ending time that is before the starting time. This could be a data-collection issue where the start and end times got swapped, but as we can’t prove that, it is best to remove those rows altogether.
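
As a quick check (a sketch), we can count the rows where the recorded end time precedes the start time:

# count rides whose end time is earlier than their start time
df_data %>% filter(ended_at < started_at) %>% count(member_casual)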

Trip duration varies over a very wide range, so let’s create bins to better visualize the distribution.

# creating bins for: below 0 min, 0-1 min, 1-5 min, 5 min - 1 hour,
# 1-2 hours, 2-5 hours, 5 hours - 1 day, 1-2 days, above 2 days
df_data <- df_data %>% 
  mutate(trip_duration_bin = cut(as.double(trip_duration), 
                                 breaks = c(-Inf, 0, 1, 5, 60, 120, 300, 1440, 2880, Inf)))

ggplot(df_data, aes(x = trip_duration_bin, color = member_casual)) + 
  geom_bar(position = "dodge", fill = "white") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

It is quite clear that trips of 5 minutes to 1 hour are by far the most common.

Let us create a few summary tables for analyzing and visualizing the data. But before we do that, let us drop the rows with negative trip durations. I am also treating trips shorter than 1 minute as customer mistakes (false starts) for this analysis; as you can see below, the mean and median trip duration for this bin is near 30-40 seconds.

df_data %>% filter(trip_duration_bin %in% c("(-Inf,0]", "(0,1]")) %>% 
  group_by(trip_duration_bin, member_casual) %>% 
  summarise(mean = mean(trip_duration), n = n(), 
            median = median(trip_duration), sd = sd(trip_duration), 
            total_travel = sum(trip_duration))

We could make the same argument for trips lasting over 2 days, as the maximum travel time is around 38 days. But I am keeping those values, as there are very few of them (700-800 rows). Maybe we can explore them in a future analysis.
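
Again as a quick check (a sketch), we can count the rides that exceed 2 days (2880 minutes):

# count rides longer than 2 days per rider type
df_data %>% filter(as.double(trip_duration) > 2880) %>% count(member_casual)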

So let’s drop the rows with a trip duration of less than 1 minute and move on to the analysis phase.

There are also null values in the station columns, but we will handle those when we create the summary tables that use them.

# dropping the rows with negative or sub-1-minute trip durations

df_data <- df_data %>% filter(!trip_duration_bin %in% c("(-Inf,0]", "(0,1]"))
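
As a sanity check (a sketch), the two lowest bins should now be empty:

# the "(-Inf,0]" and "(0,1]" bins should now show zero rows
table(df_data$trip_duration_bin)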

Analyzing the data

We will analyze the data by weekday, month, and trip-duration bin, and also location-wise using the station data.

Yearly Trend

# getting the count of trips per month
df_month_count <- df_data %>% group_by(months, member_casual) %>% summarise(count = n())
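
One caveat: the months column holds plain text, so any tool will sort it alphabetically by default. A minimal sketch for enforcing calendar order, using base R’s built-in month.name constant (this assumes an English locale, where months() produces the same labels):

# put the month labels in calendar order rather than alphabetical order
df_month_count <- df_month_count %>% 
  mutate(months = factor(months, levels = month.name))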

This graph was made with Tableau; to see the Tableau viz, check the appendix.


Weekly Trend

# getting count, average, median, sd values - grouping by weekdays, member_casual
df_summarise2 <- df_data %>% group_by(weekdays, member_casual) %>% 
  summarise(count = n(),
            total_trip_duration = sum(trip_duration),
            mean = mean(trip_duration),
            median = median(trip_duration),
            sd = sd(trip_duration))
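
The same alphabetical-sorting caveat applies to the weekdays column; a minimal sketch (again assuming English weekday labels):

# put the weekday labels in Monday-Sunday order
df_summarise2 <- df_summarise2 %>% 
  mutate(weekdays = factor(weekdays, 
                           levels = c("Monday", "Tuesday", "Wednesday", "Thursday", 
                                      "Friday", "Saturday", "Sunday")))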

This graph was made with Tableau; to see the Tableau viz, check the appendix.


# getting count, average, median, sd values - grouping by weekdays, trip_duration_bin, member_casual
df_summarise3 <- df_data %>% 
  group_by(weekdays,trip_duration_bin, member_casual) %>% 
  summarise(count = n(),
            total_trip_duration = sum(trip_duration),
            mean = mean(trip_duration),
            median = median(trip_duration),
            sd = sd(trip_duration))

This graph was made with Tableau; to see the Tableau viz, check the appendix.


Locations

# count per starting station
df_station_count <- df_data %>% filter(!is.na(start_station_name)) %>% 
  group_by(start_station_name, member_casual) %>% 
  summarise(count = n())

# count per starting station by weekday (reusing the existing 'weekdays' column)
df_station_count_wkd <- df_data %>% 
  filter(!is.na(start_station_name)) %>% 
  group_by(start_station_name, weekdays, member_casual) %>% 
  summarise(count = n())

# getting the top paths (start station to end station)
df_path_wkd <- df_data %>% 
  filter(!is.na(start_station_name) & !is.na(end_station_name)) %>% 
  mutate(path = paste(start_station_name, end_station_name, sep = " to ")) %>% 
  group_by(path, weekdays, member_casual) %>% 
  summarise(count = n())

These data frames were saved, and only the top 20 per member_casual were kept for further analysis. The longitudes and latitudes in the data do not seem accurate, so I looked the values up on Google Maps (for the top 20 only) and then plotted them with Tableau.
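
For example, the top 20 starting stations per rider type could be extracted and saved like this (a sketch; slice_max() is from dplyr, and the file name is my assumption):

# keep the 20 busiest starting stations per rider type and save them for Tableau
df_station_top20 <- df_station_count %>% 
  group_by(member_casual) %>% 
  slice_max(count, n = 20)
write.csv(df_station_top20, "station_count_top20.csv", row.names = FALSE)  # assumed file name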

This graph was made with Tableau; to see the Tableau viz, check the appendix.


Summing up

For the location data frames, the data was saved, and the longitude and latitude were recorded from Google Maps for only the top 20 stations for members and casual riders.

This graph was made with Tableau; to see the Tableau viz, check the appendix.


After the analysis, it is clear that usage is very different between members and casual riders. As casual riders use the bikes near tourist stops, an argument can be made that a subset of them are from out of town, mainly tourists. Since this bike-sharing company operates only in Chicago, those customers can never be converted to members. To increase membership by converting casual users, the marketing campaign needs to hyper-focus on the subset of casual riders who use the bikes in a similar manner to the members: which casual riders take the same routes as the members, how often they take those routes, and which stations that are very popular with members are also used by casual riders for the same purposes.
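
As a concrete starting point for that targeting, the popular member routes could be cross-referenced against casual usage. A minimal sketch building on 'df_path_wkd' (the object names here are illustrative):

# find the 20 routes most popular with members
top_member_paths <- df_path_wkd %>% 
  filter(member_casual == "member") %>% 
  group_by(path) %>% 
  summarise(count = sum(count)) %>% 
  slice_max(count, n = 20)

# then measure how heavily casual riders use those same routes
casual_on_member_paths <- df_path_wkd %>% 
  filter(member_casual == "casual", path %in% top_member_paths$path) %>% 
  group_by(path) %>% 
  summarise(casual_count = sum(count))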

Note: Further analysis could be done if customer details or demographic data were available. More details on the stations/station data would also be very useful.

Appendix