This case study, part of the Google Data Analytics Professional Certificate program, examines the steps needed to grow a bike-share business quickly. It follows the six phases of the data analysis process taught in the program: ask, prepare, process, analyze, share, and act.
Cyclistic launched its bike-share service in 2016. The program has since grown to a fleet of 5,824 bicycles docked at 692 stations across Chicago. Riders can unlock a bike at one station and return it to any other station in the network.
The primary aim is to design marketing initiatives that convert casual riders into annual members. That requires understanding how annual members and casual riders use the service differently, what would motivate casual riders to buy a membership, and how digital media can influence that decision. Lily Moreno, Director of Marketing, leads this effort and has asked for an analysis of historical bike trip data to identify patterns and trends in the Cyclistic user base.
Cyclistic offers several pricing options: single-ride passes, full-day passes, and annual memberships. Customers who buy single-ride or full-day passes are treated as casual riders, while customers with annual memberships are Cyclistic’s committed members.
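The analysis uses 12 months of trip data, from April 2020 to March 2021. The key questions follow from the goals above: how annual members and casual riders use Cyclistic bikes differently, why casual riders would buy an annual membership, and how digital media can help convert casual riders into members.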
These questions set the stage for the subsequent phases of preparation, processing, analysis, and action, in support of Cyclistic’s growth in the bike-share market.
First, we define a function that reads every CSV file in a directory, combines them into a single data frame, and adds the derived columns used later. We then call it and assign the result to processed_data.
library(tidyverse)
library(geosphere)
library(wordcloud)
# Define a function to process the collected data
process_data <- function(file_path) {
# Read every CSV file in file_path and combine them into one data frame.
# Columns are read as character first so the files bind cleanly, then re-typed.
# Note that pattern is a regular expression, not a shell glob.
result <- list.files(path = file_path, pattern = "\\.csv$", full.names = TRUE) %>%
purrr::map_dfr(~ read.csv(.x) %>% mutate(across(everything(), as.character))) %>%
readr::type_convert()
# Adding columns for month, year, and day_of_week
result$month <- format(as.Date(result$started_at), "%b")
result$year <- format(as.Date(result$started_at), "%Y")
result$day_of_week <- format(as.Date(result$started_at), "%A")
# Creating hour column (hour of day at which the ride ended)
result$hour <- strftime(result$ended_at, "%H")
# Creating ride length column (in minutes)
result$ride_length <- as.numeric(difftime(result$ended_at, result$started_at, units = "mins"))
# Creating ride distance column (in km); distGeo() returns meters, so divide by 1000 once
result$ride_distance <- geosphere::distGeo(matrix(c(result$start_lng, result$start_lat), ncol = 2),
matrix(c(result$end_lng, result$end_lat), ncol = 2)) / 1000
# Ordering the day_of_week column (misspelled levels would turn those days into NA)
result$day_of_week <- ordered(result$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
#Ordering the Month Column
result$month <- ordered(result$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
# Renaming columns 2 and 13 (rideable_type and member_casual in the standard Divvy trip files)
names(result)[2] <- "bike_model"
names(result)[13] <- "member_type"
# Remove rows with NA
result <- drop_na(result)
# Removing rides shorter than one minute (this also drops negative durations)
result <- result[result$ride_length >= 1, ]
# Removing rides longer than one day (1440 minutes)
result <- result[result$ride_length <= 1440, ]
return(result)
}
processed_data <- process_data("./")
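The consolidation and ride-length steps can be tried on a small scale before running them on the full data. The following is a minimal sketch in base R only, so it runs without the tidyverse. The demo directory, file names, and ride values are made up for illustration and are not part of the Cyclistic data.

```r
# Hypothetical demo data: two tiny CSV files stand in for the monthly trip files.
demo_dir <- file.path(tempdir(), "cyclistic_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id    = c("a1", "a2"),
                     started_at = c("2020-04-01 08:00:00", "2020-04-01 09:00:00"),
                     ended_at   = c("2020-04-01 08:30:00", "2020-04-01 09:00:20")),
          file.path(demo_dir, "2020-04.csv"), row.names = FALSE)
write.csv(data.frame(ride_id    = "b1",
                     started_at = "2020-05-02 10:00:00",
                     ended_at   = "2020-05-02 10:45:00"),
          file.path(demo_dir, "2020-05.csv"), row.names = FALSE)

# Read every CSV in the directory and stack the rows into one data frame.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
trips <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))

# Parse the timestamps and compute the ride length in minutes.
trips$started_at  <- as.POSIXct(trips$started_at, tz = "UTC")
trips$ended_at    <- as.POSIXct(trips$ended_at, tz = "UTC")
trips$ride_length <- as.numeric(difftime(trips$ended_at, trips$started_at, units = "mins"))

# Keep rides between one minute and one day, as process_data does.
kept <- trips[trips$ride_length >= 1 & trips$ride_length <= 1440, ]
print(kept)
```

In this demo the 20-second ride is dropped and the other two rides are kept.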
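With processed_data in place, we derive the summary tables used in the charts: the share of rides by rider type and by bike type, the same shares broken down by day of week and by month, total ride length by rider type and by bike type, ride counts by hour, and ride counts by start station for each rider type.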
users <- processed_data %>%
group_by(member_type) %>%
summarize(total = n()) %>%
mutate(all_total = sum(total)) %>%
group_by(member_type) %>%
summarize(precentage = total/all_total * 100)
bikes <- processed_data %>%
group_by(bike_model) %>%
summarize(total = n()) %>%
mutate(all_total = sum(total)) %>%
group_by(bike_model) %>%
summarize(precentage = total/all_total * 100)
bike_model_user_type_casual <- processed_data %>%
filter(member_type == "casual") %>%
group_by(bike_model) %>%
summarize(total= n()) %>%
mutate(all_total = sum(total)) %>%
group_by(bike_model) %>%
summarize(precentage = total/all_total * 100)
bike_model_user_type_member <- processed_data %>%
filter(member_type == "member") %>%
group_by(bike_model) %>%
summarize(total= n()) %>%
mutate(all_total = sum(total)) %>%
group_by(bike_model) %>%
summarize(precentage = total/all_total * 100)
# Share of each rider type within each day of the week
user_day_rel <- processed_data %>%
group_by(day_of_week, member_type) %>%
summarize(total = n()) %>%
mutate(all_total = sum(total)) %>%
group_by(day_of_week, member_type) %>%
summarize(precentage = total/all_total * 100)
# Share of each bike model within each day of the week
bike_model_day_rel <- processed_data %>%
group_by(day_of_week, bike_model) %>%
summarize(total = n()) %>%
mutate(all_total = sum(total)) %>%
group_by(day_of_week, bike_model) %>%
summarize(precentage = total/all_total * 100)
# Share of each rider type within each month
user_month_rel <- processed_data %>%
group_by(month, member_type) %>%
summarize(total = n()) %>%
mutate(all_total = sum(total)) %>%
group_by(month, member_type) %>%
summarize(precentage = total/all_total * 100)
# Share of each bike model within each month
bike_model_month_rel <- processed_data %>%
group_by(month, bike_model) %>%
summarize(total = n()) %>%
mutate(all_total = sum(total)) %>%
group_by(month, bike_model) %>%
summarize(precentage = total/all_total * 100)
user_ride_length_rel <- processed_data %>%
group_by(member_type)%>%
summarize(ride_length = sum(ride_length))
user_ride_length_month_rel <- processed_data %>%
group_by(member_type, month)%>%
summarize(ride_length = sum(ride_length))
hours <- processed_data %>%
group_by(member_type, hour)%>%
summarize(number_of_rides = n(), .groups = "drop")%>%
arrange(hour)
# Despite its name, this table holds the total (not average) ride length per bike model and month
avg_bike_rides_month <- processed_data %>%
group_by(bike_model, month)%>%
summarize(ride_length = sum(ride_length))
start_station_users_casual <- processed_data%>%
filter(member_type == "casual")%>%
group_by(start_station_name)%>%
summarize(total = n())
start_station_users_member <- processed_data%>%
filter(member_type == "member")%>%
group_by(start_station_name)%>%
summarize(total = n())
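Most of the summaries above repeat the same count-then-percentage pattern. As an illustration only, that pattern could be captured in one small helper. The sketch below uses base R so it runs on its own; share_by and the demo data frame are hypothetical names introduced here, not part of the original analysis.

```r
# Hypothetical helper: count rows per group and express each count as a
# percentage of all rows. The output column keeps the name "precentage"
# so it matches the summary tables above.
share_by <- function(data, column) {
  counts <- table(data[[column]])
  data.frame(group      = names(counts),
             total      = as.integer(counts),
             precentage = as.numeric(counts) / sum(counts) * 100)
}

# Made-up demo data: three member rides and one casual ride.
demo <- data.frame(member_type = c("member", "member", "member", "casual"))
shares <- share_by(demo, "member_type")
print(shares)
```

With the demo data, casual rides make up 25% and member rides 75% of the total.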
With the summary tables ready, we move on to the charts. We start by defining the color palettes, labels, and caption text that are reused across charts.
# Reused Variables
two_color_pallate <- c("#FF204E", "#A0153E")
three_color_pallate <- c("#FF204E", "#A0153E","#5D0E41")
user_types_original <- c("casual", "member")
user_types_chart <- c("Casual Member", "Annual Member")
bike_types_original <- c("classic_bike", "docked_bike", "electric_bike")
bike_types_chart <- c("Classic", "Docked", "Electrical")
footer_text <- "Data: Motivate International"
Next, we define three chart functions: plot_distribution draws a pie chart of percentage shares, stackbar_plot draws a dodged bar chart with percentage labels, and stackbar_plot_lenght draws a stacked bar chart of totals.
# Function to use for the charts
plot_distribution <- function(data, x_value, y_value, fill_value, legend, colors, top_name, sections_data, labels_text, main_title, subtitle_text, caption_text) {
ggplot(data, aes(x={{x_value}}, y={{y_value}}, fill={{fill_value}}))+
geom_bar(stat= "identity", width = 1)+
coord_polar(theta = "y", start = 0)+
geom_text(aes(label = scales :: percent(round({{y_value}}) / 100)), position = position_stack(vjust = 0.5), size = 5, fontface = "bold", color = "#FFFFFF") +
scale_fill_manual(values = {{colors}}, name = {{top_name}}, breaks = {{sections_data}}, labels = {{labels_text}})+
labs(title = {{main_title}}, subtitle = {{subtitle_text}}, caption = {{caption_text}}, fill = legend)+
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "#00224D"),
plot.subtitle = element_text(hjust = 0.5, size = 12, face = "bold", color = "grey20"),
plot.caption = element_text(size = 4, color = "grey35"),
legend.title = element_text(size = 12, face = "bold", color = "#00224D"),
legend.text = element_text(size = 10, color = "grey20"))
}
stackbar_plot <- function(data, x_value, y_value, fill_value, legend, colors, top_name, sections_data, labels_text, main_title, subtitle_text, caption_text){
ggplot(data, aes(x={{x_value}}, y={{y_value}}, fill={{fill_value}}))+
geom_bar(position = "dodge", stat="identity")+
geom_text(aes(label = scales :: percent(round({{y_value}}) / 100)), position = position_dodge(width =0.9),vjust=-0.5, size = 3, fontface = "bold", color = "#000000")+
scale_fill_manual(values = {{colors}}, name={{top_name}}, breaks = {{sections_data}}, labels= {{labels_text}})+
scale_y_continuous(labels = scales::comma)+
labs(title={{main_title}}, subtitle = {{subtitle_text}}, caption = {{caption_text}}, fill=legend)+
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "#00224D"),
plot.subtitle = element_text(hjust = 0.5, size = 12, face = "bold", color = "grey20"),
plot.caption = element_text(size = 4, color = "grey35"),
legend.title = element_text(size = 12, face = "bold", color = "#00224D"),
legend.text = element_text(size = 10, color = "grey20"))
}
stackbar_plot_lenght <- function(data, x_value, y_value, fill_value, legend, colors, top_name, sections_data, labels_text, main_title, subtitle_text, caption_text){
ggplot(data, aes(x={{x_value}}, y={{y_value}}, fill={{fill_value}}))+
geom_bar(position = "stack", stat="identity")+
scale_fill_manual(values = {{colors}}, name={{top_name}}, breaks = {{sections_data}}, labels= {{labels_text}})+
scale_y_continuous(labels = scales::comma)+
labs(title={{main_title}}, subtitle = {{subtitle_text}}, caption = {{caption_text}}, fill=legend)+
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", color = "#00224D"),
plot.subtitle = element_text(hjust = 0.5, size = 12, face = "bold", color = "grey20"),
plot.caption = element_text(size = 4, color = "grey35"),
legend.title = element_text(size = 12, face = "bold", color = "#00224D"),
legend.text = element_text(size = 10, color = "grey20"))
}
The first chart shows each rider type as a percentage of all rides. The majority of rides are taken by annual members.
# Charts to display
plot_distribution(users, "", precentage, member_type, "Members", two_color_pallate, "Types of riders",
user_types_original, user_types_chart, "Distribution of Riders", "What percentage of riders are using the Cyclistic?",
footer_text)
Next, we look at the distribution of rides across bike types. Docked bikes are the most frequently used.
plot_distribution(bikes, "", precentage, bike_model, "Bikes", three_color_pallate,
"Types of Bikes", bike_types_original, bike_types_chart,
"Distribution of Bikes", "What percentage of all riders are using the Bikes?",
footer_text)
We then look at bike usage among annual members only. Docked bikes again account for the largest share of their rides.
plot_distribution(bike_model_user_type_member, "", precentage, bike_model, "Bikes", three_color_pallate,
"Types of Bikes", bike_types_original, bike_types_chart,
"Member Bikes", "What percentage of member riders are using the Bikes?",
footer_text)
The same chart for casual riders shows the same pattern: docked bikes remain the most used.
plot_distribution(bike_model_user_type_casual, "", precentage, bike_model, "Bikes", three_color_pallate,
"Types of Bikes", bike_types_original, bike_types_chart,
"Casual Bikes", "What percentage of casual riders are using the Bikes?",
footer_text)
This chart compares the share of rides by rider type in each month. Casual riders show the higher share, particularly in January, while the two rider types are closer to each other around June and July.
stackbar_plot(user_month_rel, month, precentage, member_type, "Members", two_color_pallate,
"Members Type", user_types_original, user_types_chart,
"Members usage throughout Months", "Comparison of the users throughout the months",
footer_text)
Comparing days of the week, Saturdays and Sundays show a higher share of rides by annual members.
stackbar_plot(user_day_rel, day_of_week, precentage, member_type, "Members", two_color_pallate,
"Members Type", user_types_original, user_types_chart,
"Members usage throughout Days of Week", "Comparison of the users throughout Days of Week",
footer_text)
The monthly breakdown by bike type shows a clear trend: as temperatures rise, the share of docked bikes increases, while classic bikes see a higher share in the colder months.
stackbar_plot(bike_model_month_rel, month, precentage, bike_model, "Bikes", three_color_pallate,
"Types of Bikes", bike_types_original, bike_types_chart,
"Bikes usage throughout Months", "Comparison of the bikes throughout the months",
footer_text)
Across days of the week, docked bikes are consistently the most used bike type.
stackbar_plot(bike_model_day_rel, day_of_week, precentage, bike_model, "Bikes", three_color_pallate,
"Types of Bikes", bike_types_original, bike_types_chart,
"Bikes usage throughout Days of Week", "Comparison of the bikes throughout Days of Week",
footer_text)
This chart compares total ride length by rider type in each month. Casual riders accumulate more ride time than annual members, and total ride time increases in the warmer months.
stackbar_plot_lenght(user_ride_length_month_rel, month, ride_length, member_type, "Members", two_color_pallate,
"Members Type", user_types_original, user_types_chart,
"Members Ride Length throughout Months", "Comparison of the user Ride Length throughout the months",
footer_text)
Over the whole period, total ride length is markedly higher for casual riders than for annual members.
stackbar_plot_lenght(user_ride_length_rel, "", ride_length, member_type, "Members", two_color_pallate,
"Members Type", user_types_original, user_types_chart,
"Members Total Ride Length", "Comparison of the total Ride Length by user type",
footer_text)
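Total ride length depends on how many rides each group takes as well as on how long each ride lasts, so a per-ride average is a useful companion figure. The sketch below uses made-up numbers and base R to show how the two measures can point in different directions. It illustrates the idea and is not a result from the Cyclistic data.

```r
# Made-up demo data: casual riders take fewer but longer rides.
demo <- data.frame(
  member_type = c("casual", "casual", "member", "member", "member", "member"),
  ride_length = c(40, 50, 20, 25, 30, 25)  # minutes
)

# Total ride length per rider type (what the chart above plots).
totals <- aggregate(ride_length ~ member_type, data = demo, FUN = sum)

# Mean ride length per rider type (independent of the number of rides).
means <- aggregate(ride_length ~ member_type, data = demo, FUN = mean)

print(totals)
print(means)
```

Here members have the larger total (100 versus 90 minutes), while casual riders have the longer average ride (45 versus 25 minutes).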
Total ride length by bike type shows the same recurring pattern: docked bikes consistently account for the most ride time of all bike types.
stackbar_plot_lenght(avg_bike_rides_month, month, ride_length, bike_model, "Bikes", three_color_pallate,
"Types of Bikes", bike_types_original, bike_types_chart,
"Bikes Ride Length throughout Months", "Comparison of the bike Ride Length throughout the months",
footer_text)
This chart shows the number of rides by hour of day for each rider type. Most rides fall between 16:00 and 21:00.
stackbar_plot_lenght(hours, hour, number_of_rides, member_type, "Members", two_color_pallate,
"Members Type", user_types_original, user_types_chart,
"Members Rides by Hour", "Comparison of the number of rides throughout the day",
footer_text)
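The busiest hour can also be read off numerically rather than from the chart. The following is a minimal base R sketch that uses a handful of invented timestamps. It mirrors how the hour column is derived from the ride end time, but the values are illustrative only.

```r
# Made-up ride end times for illustration.
ended_at <- as.POSIXct(c("2020-07-04 16:05:00",
                         "2020-07-04 17:40:00",
                         "2020-07-04 17:55:00",
                         "2020-07-05 08:10:00"), tz = "UTC")

# Extract the hour of day as a two-digit string, then count rides per hour.
hour <- format(ended_at, "%H")
rides_per_hour <- table(hour)

# The name of the hour with the highest ride count.
peak_hour <- names(rides_per_hour)[which.max(rides_per_hour)]
print(rides_per_hour)
print(peak_hour)
```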
Looking at the most used start stations among casual riders, Millennium Park stands out as the favorite.
wordcloud(words = start_station_users_casual$start_station_name, freq = start_station_users_casual$total, min.freq = 1, max.words = 200, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
Looking at the most used start stations among annual members, Wells St & Elm St stands out as the favorite.
wordcloud(words = start_station_users_member$start_station_name, freq = start_station_users_member$total, min.freq = 1, max.words = 200, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
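A word cloud gives a quick visual impression, but a ranked table makes the top stations easier to compare exactly. The sketch below shows one way to list the top stations by ride count in base R. The station names and counts are placeholders, not values from the analysis.

```r
# Placeholder counts shaped like the start_station_users_* tables above.
station_counts <- data.frame(
  start_station_name = c("Station A", "Station B", "Station C"),
  total              = c(120, 340, 75)
)

# Sort by ride count in descending order and keep the top two stations.
top_stations <- head(station_counts[order(-station_counts$total), ], 2)
print(top_stations)
```

The same two lines could be applied to start_station_users_casual or start_station_users_member with a larger cutoff.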
Based on the analysis conducted, the following recommendations are proposed: