Cyclistic, a fictional bike-share company out of Chicago, plans to update its marketing strategy, previously built on generating general awareness and appeal to broad consumer markets. The director of marketing, Lily Moreno, now believes the company’s future success depends on maximizing the number of annual memberships, due to the conclusion that annual members are much more profitable than casual riders.
Using Cyclistic’s historical trip data from the previous 12 months, the business task is to find out how annual members and casual riders use Cyclistic bikes differently in 2023 (latest update of dataset). Understanding the differences in usages of the two types of riders can help deliver insights into how the marketing team can increase the number of annual memberships purchased.
This data was made available by Motivate International Inc. and is open to the public for download. (Note: The dataset has a different name because Cyclisitc is a fictional company. For the purposes of this case study, the dataset is appropriate for the business task).
The data being used to conduct this analysis is a csv file containing trip data from 2023:
202301-divvy-tripdata.csv * 202302-divvy-tripdata.csv * 202303-divvy-tripdata.csv * 202304-divvy-tripdata.csv * 202305-divvy-tripdata.csv * 202306-divvy-tripdata.csv * 202307-divvy-tripdata.csv * 202308-divvy-tripdata.csv * 202309-divvy-tripdata.csv * 202310-divvy-tripdata.csv * 202311-divvy-tripdata.csv * 202312-divvy-tripdata.csv
After downloading and unzipping the csv file, using R and Rstudio, the dataset were read into their own data frames and merged into one data frame representing the whole year. From here we can easily view and manipulate the data:
library(tidyverse)
# create data frame for each months data set
data_01_df <- read.csv("new_data/202301-divvy-tripdata.csv")
data_02_df <- read.csv("new_data/202302-divvy-tripdata.csv")
data_03_df <- read.csv("new_data/202303-divvy-tripdata.csv")
data_04_df <- read.csv("new_data/202304-divvy-tripdata.csv")
data_05_df <- read.csv("new_data/202305-divvy-tripdata.csv")
data_06_df <- read.csv("new_data/202306-divvy-tripdata.csv")
data_07_df <- read.csv("new_data/202307-divvy-tripdata.csv")
data_08_df <- read.csv("new_data/202308-divvy-tripdata.csv")
data_09_df <- read.csv("new_data/202309-divvy-tripdata.csv")
data_10_df <- read.csv("new_data/202310-divvy-tripdata.csv")
data_11_df <- read.csv("new_data/202311-divvy-tripdata.csv")
data_12_df <- read.csv("new_data/202312-divvy-tripdata.csv")
# merge data frames into one data set for the whole year
trip_data_df <- rbind(data_01_df, data_02_df, data_03_df, data_04_df,
data_05_df, data_06_df, data_07_df, data_08_df, data_09_df,
data_10_df, data_11_df, data_12_df)
The data is organized such that each row represents a single bike trip containing the following measurements and attributes:
glimpse(trip_data_df)
Rows: 5,719,877
Columns: 13
$ ride_id <chr> "F96D5A74A3E41399", "13CB7EB698CEDB88", "BD88A2E6…
$ rideable_type <chr> "electric_bike", "classic_bike", "electric_bike",…
$ started_at <chr> "2023-01-21 20:05:42", "2023-01-10 15:37:36", "20…
$ ended_at <chr> "2023-01-21 20:16:33", "2023-01-10 15:46:05", "20…
$ start_station_name <chr> "Lincoln Ave & Fullerton Ave", "Kimbark Ave & 53r…
$ start_station_id <chr> "TA1309000058", "TA1309000037", "RP-005", "TA1309…
$ end_station_name <chr> "Hampden Ct & Diversey Ave", "Greenwood Ave & 47t…
$ end_station_id <chr> "202480.0", "TA1308000002", "599", "TA1308000002"…
$ start_lat <dbl> 41.92407, 41.79957, 42.00857, 41.79957, 41.79957,…
$ start_lng <dbl> -87.64628, -87.59475, -87.69048, -87.59475, -87.5…
$ end_lat <dbl> 41.93000, 41.80983, 42.03974, 41.80983, 41.80983,…
$ end_lng <dbl> -87.64000, -87.59938, -87.69941, -87.59938, -87.5…
$ member_casual <chr> "member", "member", "casual", "member", "member",…
This includes data for ride IDs, bike types, start and end times for trips, to and from station names and IDs, start and end latitudes and longitudes, and user types (either a member or causal rider).
Being that the data is first-party data and is provided by the bike-share company, for the sake of this case study we are going to assume that there is no bias or credibility concerns with our data set. However, in a real setting as a data analyst when working with outside data, you should always be sure that there is no bias and that your data is reliable before continuing.
The data set is provided under a Data License Agreement, which permits users to access, reproduce, copy, modify, analyze, and distribute the data so long as it is used as source material in analyses, reports, and/or studies. It prohibits users from host, stream, publish, distribute, sublicense, or selling the data as a stand-alone dataset (as stated here).
Privacy issues have been addressed by the removal of any personally identifiable information in the dataset.
The dataset is securely managed my Motivate International Inc. employees and is stored using a well-known cloud services provider in Amazon Web Services.
Cyclisics bike-share data is open to the public, meaning it is accessible to anyone so long as the license agreement is being abided by.
The data’s integrity will be verified in the cleaning process of this analysis. Issues such as missing or duplicate values, incorrect values, and inconsistencies will be addressed.
The dataset contains all of the information we will need to determine how annual members and casual riders use Cyclistic bikes differently. Data such as rider type, start and end stations, latitudes and longitudes, start and end times, and bike types will help us differentiate the usage in many ways in order to provide valuable insights to our stakeholders as to how Cyclistic can maximize the number of annual memberships being purchased.
Potential problems could include inconsistencies in the data, missing values, outliers of ride times, or duplicate records for rides.
I started off using Google Sheets but I began experiencing performance issues because of how large the dataset is. I then switched to using Rstudio and the R programming language to carry out my analysis. R is a great tool for this case study because it can handle large datasets (or as big what can fit in RAM comfortably). It also has a package called “tidyverse”, which is a collection of packages specifically designed for working with data. This includes data cleaning, manipulation, exploration, and visualization.
First, we’ll start by making a copy of the dataset and storing it in the data frame trip_data_clean_df. It is always good practice to not modify the original dataset unless given specific instructions to do so. This is so if anything goes wrong, we can always refer back to the original data without worrying about compromising it. We’ll also start by removing any null values that a record may contain. Since each column attribute is something we will use in our analysis, we need to make sure each cell contains a value.
# Make copy of original data set for cleaning and removing null values in records
trip_data_clean_df <- trip_data_df %>%
drop_na()
# Examine data set
head(trip_data_clean_df)
At first glance I noticed some of the columns names are not very descriptive, which include started_at, ended_at, and member_casual. Changing these names to start_time, end_time, and user_type would make them less ambiguous and clearer as to what they represent.
trip_data_clean_df <- trip_data_clean_df %>%
rename(start_time = started_at, end_time = ended_at, user_type = member_casual)
colnames(trip_data_clean_df)
[1] "ride_id" "rideable_type" "start_time"
[4] "end_time" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "user_type"
To make we are not using duplicate records in our analysis we can use the duplicated function nested inside the sum function which will give us the total number of duplicates in the dataset.
# Number of duplicate records
sum(duplicated(trip_data_clean_df))
[1] 0
We got a sum of zero duplicates so that means were are not using repeated data in our analysis.
The start and end times are of a character format in the form ‘YYYY-MM-DD H:M:S UTC’. To make them easier to work with in terms of manipulation and calculation of dates and times, we are going to covert them to class POSIXct. This data type makes working with date-time objects easier, as opposed to if they were left as character strings.
# Converting start and end times to POSIXct class
trip_data_clean_df <- trip_data_clean_df %>%
mutate(start_time = as.POSIXct(trip_data_clean_df$start_time, format = "%Y-%m-%d %H:%M:%S")) %>%
mutate(end_time = as.POSIXct(trip_data_clean_df$end_time, format = "%Y-%m-%d %H:%M:%S"))
From here we can now filter out rows in which the start time of bike rides is greater than or equal to end times. This is important because we do not want ride times that are have a time of zero or negative time, which is impossible, when the new column for ride times in seconds is created. We can also create a column for the day of the week in which rides were taken, extracted from the start time column.
trip_data_clean_df <- trip_data_clean_df %>%
filter(end_time > start_time) %>% # filter out start_time >= end_time
mutate(ride_time_sec = difftime(end_time, start_time)) %>% # calculate ride time in seconds
mutate(day_of_week = weekdays(start_time)) # get day of week ride was taken
trip_data_clean_df %>%
select(ride_time_sec, day_of_week)
While searching for errors in the data, I noticed that some station names have more than one ID associated with them. I verified which stations, how many IDs were associated with them, and what the IDs were with this:
# checks which start_station have more than one unique ids and lists the ids associated with them
trip_data_clean_df %>%
group_by(start_station_name) %>%
summarise(num_unique_ids = n_distinct(start_station_id), id_list=paste(unique(start_station_id), collapse = ", ")) %>%
filter(num_unique_ids > 1)
This could be a problem if we needed to extract information about certain stations with their ID, however for this analysis, we are not too concerned with accessing a station’s ID so it will be left unchanged. If I were to fix these station’s IDs, I would further investigate each station to determine which ID is used more often and ensure that each record that contains that station will have the more frequent ID associated with it.
Now that we have a clean dataset, we can perfrom exploratory analysis to identify trends and relationships among the data to give us insights that could help answer our business question.
The aim of this analysis is to identify the difference in ride times of casual and member riders:
library(ggplot2)
# User type vs mean ride time
mean_ride_times <- trip_data_clean_df %>%
group_by(user_type) %>%
summarise(mean_ride_time_sec = mean(ride_time_sec), max_ride_time=max(ride_time_sec),
min_ride_time=min(ride_time_sec))
mean_ride_times
# Causal Rider: just over 20 min 30 sec average ride time
# Member Rider: just over 12 min average ride time
ggplot(data=mean_ride_times) +
geom_col(mapping=aes(x=user_type, y=mean_ride_time_sec,fill=user_type)) +
geom_text(aes(x=user_type, y=mean_ride_time_sec, label=round(mean_ride_time_sec, 2)),vjust=1) +
labs(title = "Average Ride Times by Usertype", x="Usertype",y="Avg. Ride Time(Sec)")
Analysis:
To get a better sense of when casual and member riders use bikes, we can count the number of rides each type of rider took for each weekday of the year. While aggregating my data into a useful data frame, I changed the type of the day_of_week column from a character string to an ordinal factor. This made it easier to sort, organize, and count the data.
# Aggregate and organize data by week day counts for each type of rider
weekday_rides <- trip_data_clean_df %>%
mutate(day_of_week = wday(start_time, label=TRUE)) %>%
group_by(day_of_week, user_type) %>%
summarise(.groups="drop",ride_count=n()) %>%
arrange(day_of_week)
ggplot(data = weekday_rides) +
geom_col(mapping = aes(x=day_of_week, y=ride_count,fill=user_type),position = 'dodge') +
labs(title = "Ride Count Per Day of Week", x="Day of Week",y="Number of Rides") +
scale_y_continuous(labels = scales::comma)
Analysis:
# which months do rides occur the most
trip_data_clean_df %>%
mutate(month_of_ride = month(start_time, label=TRUE)) %>%
group_by(month_of_ride, user_type) %>%
summarise(.groups="drop", rides_per_month = n()) %>%
arrange(month_of_ride) %>%
ggplot() +
geom_line(mapping=aes(x=month_of_ride,y=rides_per_month,
color=user_type, group=user_type)) +
labs(title="Rides per Month by Usertype",x="Month",y="Number of Rides") +
scale_y_continuous(labels = scales::comma)
Analysis:
To further investigate how casual and member riders use bikes differently, we can see which type of bike each prefer based on which type of bikes were used for rides in the past 12 months. (Note: When organizing data for casual riders, I noticed there was a third type of bike, docked_bike, which is just a bike docked at a station so I excluded it from my analysis)
# Analysing rideable bike types per usertype
# Calculating the number of rides and percent of total rides taken with each
# type of bike for casual riders
c_rideable <- trip_data_clean_df %>%
filter(user_type == "casual") %>%
filter(rideable_type != "docked_bike") %>%
group_by(rideable_type) %>%
summarise(num_rides = n()) %>%
ungroup() %>%
mutate(percent_total_rides = num_rides/sum(num_rides) * 100)
# Pie chart displaying percentages of each bike used for casual riders
casual_pie <- ggplot(data = c_rideable, aes(x = "", y = percent_total_rides, fill = rideable_type)) +
geom_bar(stat = "identity", width = 1) +
geom_label(aes(label = paste0(round(percent_total_rides, 1), "%")),
position = position_stack(vjust = 0.5),show.legend = FALSE) +
coord_polar(theta = "y") +
labs(title = "Distribution of Rides by Rideable Type (Casual)",
fill = "Rideable Type") +
theme_void() +
theme(legend.position = "right")
# Calculating the number of rides and percent of total rides taken with each
# type of bike for casual riders
m_rideable <- trip_data_clean_df %>%
filter(user_type == "member") %>%
group_by(rideable_type) %>%
summarise(num_rides = n()) %>%
ungroup() %>%
mutate(percent_total_rides = num_rides/sum(num_rides)*100)
# Pie chart displaying percentages of each bike used for member riders
member_pie <- ggplot(data = m_rideable, aes(x = "", y = percent_total_rides, fill = rideable_type)) +
geom_bar(stat = "identity", width = 1) +
geom_label(aes(label = paste0(round(percent_total_rides, 1), "%")),
position = position_stack(vjust = 0.5),show.legend = FALSE) +
coord_polar(theta = "y") +
labs(title = "Distribution of Rides by Rideable Type (Member)",
fill = "Rideable Type") +
theme_void() +
theme(legend.position = "right")
casual_pie
member_pie
Analysis:
It is important to identify the most popular start stations riders go to for their bike rides. Knowing the difference in where casual riders like to go for their bikes and where members go can give insights to stakeholders in determining which area would be best to target when trying to convert casual riders to annual members.
######## Identify the top 5 starting stations for each usertype
# selecting user_type, start_station_name, and rideable_type from and storing
# in a data frame station_data
station_data <- trip_data_clean_df %>%
filter(rideable_type != "docked_bike") %>%
select(user_type, start_station_name, rideable_type) %>%
drop_na()
# Group data by user_type, start_station_name, and rideable type, calculate
# number of rides for each combination
station_summary <- station_data %>%
filter(start_station_name != "") %>%
group_by(user_type, start_station_name, rideable_type) %>%
summarise(.groups="drop",num_rides = n()) %>%
arrange(user_type, start_station_name)
# Transform data to a wide format to create separate columns for each rideable
# bike type
station_summary_pivot <- station_summary %>%
pivot_wider(names_from = rideable_type, values_from = num_rides, values_fill = 0)
# Calculate the total number of rides and how much number of rides with classic
# and electric bike types for each combination or user_type and
# start_station_name
station_summary_final <- station_summary_pivot %>%
group_by(user_type, start_station_name) %>%
summarise(.groups="drop",num_rides = sum(classic_bike)+sum(electric_bike),
num_rides_w_classic = sum(classic_bike),
num_rides_w_electric = sum(electric_bike))
# Get the top 5 stations with the most num_rides for each user_type
top_5_stations <- station_summary_final %>%
group_by(user_type) %>%
top_n(5, num_rides) %>%
arrange(user_type, desc(num_rides))
top_5_stations
Casual Riders Top Start Stations:
## Bar graph for Top 5 start stations by num rides (casual rider)
top_5_stations %>%
filter(user_type == "casual") %>%
ggplot(aes(x=start_station_name, y=num_rides,fill="All Stations")) +
geom_bar(stat = "identity",) +
labs(title = "Top 5 Start Stations by Number of Rides (Casual)",
x = "Start Station", y = "Total Number of Rides") +
geom_text(aes(x=start_station_name,y=num_rides,label=num_rides),vjust=1.5,
color="white", size=6)+
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
legend.position = "none",plot.title = element_text(size = 12)) +
scale_fill_manual(values = "brown")
Annual Members Top Start Stations:
## Bar graph for Top 5 start stations by num rides (member rider)
top_5_stations %>%
filter(user_type == "member") %>%
ggplot(aes(x=start_station_name, y=num_rides,fill="All Stations")) +
geom_bar(stat = "identity",) +
labs(title = "Top 5 Start Stations by Number of Rides (Member)",
x = "Start Station", y = "Total Number of Rides") +
geom_text(aes(x=start_station_name,y=num_rides,label=num_rides),vjust=1.5,
color="white", size=6) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
legend.position = "none",plot.title = element_text(size = 12)) +
scale_fill_manual(values = "brown")
Analysis:
Both Casual and Member Riders:
Annual Membership Riders:
Casual Riders: