Introduction

As part of the Google Data Analytics Professional Certificate program, participants have the option to complete a capstone project. Google provides two recommended case studies, both of which are based on open-source data sets.

I chose to work on the first case study, which focuses on a bike-sharing service in Chicago. The data used in this project is licensed by Bikeshare, an LLC operated by Lyft Bikes and Scooters, under the City of Chicago’s Divvy bicycle sharing service. To make the project more engaging, the case study uses the fictional company name ‘Cyclistic’.

The data analysis process will consist of the following steps:

ASK - identify the business task at hand and determine the key stakeholders involved.
PREPARE - collect the data and assess its organization and credibility.
PROCESS - choose the appropriate tool for cleaning the data, thoroughly check for any errors, and document the entire cleaning process.
ANALYZE - analyze the data by organizing, aggregating, calculating, and identifying trends.
SHARE - apply design thinking and data-driven storytelling to visualize and present the findings, ensuring that the analysis addresses the business task.
ACT - present the final conclusion and recommendations based on the analysis.

About the Company

Cyclistic, a bike-sharing company in Chicago, launched its bike-share program in 2016 and now boasts a fleet of 5,824 geo-tracked bicycles across 692 stations in the city. The bikes can be rented and returned at any station in the system, making them a convenient transportation option for Chicagoans.

Cyclistic offers a range of pricing options, including single-ride passes, full-day passes, and annual memberships. Casual riders purchase single-ride or full-day passes, while annual members enjoy the benefits of unlimited rides throughout the year. Cyclistic’s finance analysts have determined that annual members are more profitable than casual riders, making it crucial to increase the number of annual memberships for future growth.

Scenario

In this scenario, I am a junior data analyst on the marketing team at Cyclistic, a bike-sharing company in Chicago. I am tasked with understanding how casual riders and annual members use Cyclistic bikes differently. The director of marketing believes that the company’s future success depends on maximizing the number of annual memberships, and our team wants to help achieve this goal by designing a new marketing strategy that converts casual riders into annual members. By analyzing the data, we hope to uncover insights that will inform our marketing decisions and help Cyclistic achieve its goals.

ASK

Business Task

Cyclistic’s bike-share is currently focused on converting casual riders into annual members by analyzing the distinct usage patterns of both groups. To accomplish this, I will analyze historical data and provide recommendations on how to facilitate this conversion.

Stakeholders

Lily Moreno: The director of marketing and my manager.

Cyclistic executive team: A detail-oriented executive team who will decide whether to approve the recommended marketing program.

Cyclistic marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Cyclistic’s marketing strategy.

Questions to be answered.

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

Tools: I used Microsoft Excel for an initial look at the data, R for data cleaning, analysis, and visualization, and Tableau for the dashboard.

Dataset: Cyclistic’s historical trip data from May 2022 to April 2023 which can be found here

PREPARE

To analyze and identify trends for this project, I’ll be using Cyclistic’s historical trip data, which has been made available under license by Motivate International Inc. Because it comes directly from the operator, I treat it as a reliable source.

I downloaded the ZIP files containing the CSV files from the above link. For my analysis I use the CSV files from May 2022 to April 2023.

PROCESS

To get an initial look at the data, I used Microsoft Excel. Each month’s data is contained in a separate CSV file, which includes information about each ride such as the ride id, rideable type, start and end time, start and end station, and latitude and longitude of the start and end stations.

Here I did some minor cleaning, formatted the columns, and saved the results in a new folder to keep my original files intact.

I then moved to R for further cleaning and for the analyze phase.

ANALYZE

To analyze the large amount of data available at the company, I chose to use R, which handles datasets of this size comfortably. Below is a brief summary of the steps I took to analyze the data; the full process, including calculations, filtering, and more, is on my GitHub page.

Loaded the necessary libraries.
Imported the data for each month into separate data frames.
Combined all the monthly data frames into a single data frame for the entire year, named all_trips.

# Stack the twelve monthly data frames into one (columns must match)
all_trips <- rbind(May_2022, June_2022, July_2022, August_2022,
                   September_2022, October_2022, November_2022, December_2022,
                   January_2023, February_2023, March_2023, April_2023)
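The import and combine steps above could also be written without naming each month by hand. The following is only a sketch in base R, not the code I actually ran: the helper read_all_trips, the folder layout, and the demo file names are all hypothetical, and the two tiny CSV files are made up so the example runs on its own.

```r
# Hypothetical helper: read every CSV in a folder and stack the results.
# Assumes all files share the same columns, as the monthly trip files do.
read_all_trips <- function(data_dir) {
  files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
  monthly <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, monthly)
}

# Made-up demo data: two small "monthly" files in a temporary folder.
demo_dir <- file.path(tempdir(), "cyclistic_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a1", "a2"),
                     member_casual = c("member", "casual")),
          file.path(demo_dir, "202205-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "b1", member_casual = "casual"),
          file.path(demo_dir, "202206-tripdata.csv"), row.names = FALSE)

# One combined data frame with a row per ride across both files.
all_trips_demo <- read_all_trips(demo_dir)
```

The advantage of this approach is that adding another month only requires dropping a new file into the folder.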

Created a copy of the all_trips data frame to perform all calculations. The copy was renamed twice along the way, and the final version is called “riders”.
Removed columns not needed for my analysis.

all_trips <- all_trips %>%
    select(-c(start_lat, start_lng, end_lat, end_lng))

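Calculated the ride_length column by subtracting the start time (started_at) from the end time (ended_at).

# Both timestamps must come from the same data frame so the rows line up
all_trips_clean$ride_length <- difftime(all_trips_clean$ended_at, all_trips_clean$started_at)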
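To make the ride_length step concrete, here is a minimal sketch on three made-up rides. It assumes the timestamps arrive as text in "YYYY-MM-DD HH:MM:SS" form and that rides with a zero or negative duration should be dropped; that filter is my assumption for illustration, and trips_demo and trips_demo_clean are hypothetical names.

```r
# Made-up sample rows in the same layout as the trip data.
trips_demo <- data.frame(
  ride_id    = c("r1", "r2", "r3"),
  started_at = c("2022-05-01 08:00:00", "2022-05-01 09:30:00", "2022-05-02 10:00:00"),
  ended_at   = c("2022-05-01 08:15:00", "2022-05-01 09:30:00", "2022-05-02 09:55:00"),
  stringsAsFactors = FALSE
)

# Parse the text timestamps into date-times (time zone fixed for reproducibility).
trips_demo$started_at <- as.POSIXct(trips_demo$started_at, tz = "UTC")
trips_demo$ended_at   <- as.POSIXct(trips_demo$ended_at, tz = "UTC")

# End minus start, with explicit units so the result is always in seconds.
trips_demo$ride_length <- as.numeric(
  difftime(trips_demo$ended_at, trips_demo$started_at, units = "secs")
)

# Assumed cleaning rule: keep only rides with a positive duration.
trips_demo_clean <- trips_demo[trips_demo$ride_length > 0, ]
```

Setting units explicitly matters because difftime otherwise picks a unit based on the size of the differences, which can make averages hard to interpret.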

Created new columns to provide more detailed information about the data, including date, month, year, and day.
Calculated the average ride time for each day of the week for both customer types.
Calculated the total number of rides.

nrow(riders)
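The date-based columns mentioned earlier (date, month, year, day), together with the day_of_week column used in the plots that follow, could be derived roughly like this. It is a sketch on made-up timestamps, assuming started_at has already been parsed as a date-time; trips_demo and day_names are illustrative names, and my actual code may differ in detail.

```r
# Made-up start times, already parsed as date-times.
trips_demo <- data.frame(
  started_at = as.POSIXct(c("2022-05-01 08:00:00", "2023-01-16 17:45:00"), tz = "UTC")
)

# Calendar date of each ride, then its components as text.
trips_demo$date  <- as.Date(trips_demo$started_at)
trips_demo$month <- format(trips_demo$date, "%m")
trips_demo$day   <- format(trips_demo$date, "%d")
trips_demo$year  <- format(trips_demo$date, "%Y")

# Day of week built from the numeric weekday (0 = Sunday) so the labels
# do not depend on the machine's language settings.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
trips_demo$day_of_week <- factor(
  day_names[as.POSIXlt(trips_demo$date)$wday + 1],
  levels = day_names
)
```

Storing day_of_week as an ordered factor keeps the days in calendar order on plot axes instead of alphabetical order.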

Summarised rides by customer type.

# Ride count and share of all rides for each customer type
riders %>%
  group_by(member_casual) %>%
  summarise(count = n(),
            "%" = (n() / nrow(riders)) * 100)

R

ggplot(riders, aes(member_casual, fill = member_casual)) + 
  geom_bar() + labs(title = "Total rides by Customer Type", x = "Customer Type")+
  scale_fill_manual("legend", values = c("casual" = "orange", "member" = "blue"))

Based on the graph above, casual riders took more rides than members over the past 12 months, as indicated by the ride counts and the corresponding percentages in the data. Note that the data records rides, not unique riders.

# One average per customer type, so each bar shows a single overall value
riders %>% 
  group_by(member_casual) %>% 
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length), .groups = "drop") %>% 
  ggplot(aes(x = member_casual, y = average_duration, fill = member_casual)) +
  geom_col() +
  labs(title = "Average Duration by Customer Type", x = "Customer Type") +
  scale_fill_manual("legend", values = c("casual" = "orange", "member" = "blue"))

From the above graph, we can see that members’ average ride duration is longer than that of casual riders, even though casual riders account for more rides in total. Let us take a closer look to see whether there is still a way to convert casual riders into members.

riders %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(number_of_rides = n(), .groups = "drop") %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  # Side-by-side columns for the two customer types
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Total rides per Customer Type by Day of week") +
  # Show full ride counts on the y axis instead of scientific notation
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  scale_fill_manual("legend", values = c("casual" = "orange", "member" = "blue"))

Based on the visualization, it is evident that both casual and member riders have the highest number of rides during the weekends, specifically on Saturdays and Sundays. Interestingly, Wednesdays show the lowest number of rides for casual riders, while members seem to remain consistent throughout the week. Additionally, both customer types share similar behaviors over the weekend. This information suggests that there may be an opportunity to convert weekend casual riders into members.

riders %>%
  group_by(member_casual, day_of_week) %>% 
  summarise(average_trip_duration = mean(ride_length), .groups = "drop") %>%
  ggplot(aes(x = day_of_week, y = average_trip_duration, fill = member_casual)) +
  # Side-by-side columns for the two customer types
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) + 
  labs(title = "Average trip duration per Customer type by Day of week") +
  scale_fill_manual("legend", values = c("casual" = "orange", "member" = "blue"))

Based on the above graph, member riders have the longer average trip duration on every day of the week, with the highest durations on Saturdays and Sundays. In contrast, casual riders have shorter trip durations across the week. This pattern suggests that although casual riders account for more rides, they tend to ride for shorter periods of time. This may be a contributing factor to why they choose to remain casual riders.

Now, let’s examine the bike types and how they are being utilized by the riders.

riders %>%
  ggplot(aes(rideable_type, fill = member_casual)) +
  geom_bar()+
  labs(x="Bike Type", title= "Total rides by Bike Type")+
  scale_fill_manual("legend", values = c("casual" = "orange", "member" = "blue"))

Based on the graph above, it is clear that both customer types have a preference for classic and electric bikes over docked bikes. Additionally, it appears that docked bikes are primarily being used by casual riders, as member riders are not utilizing them.

Before displaying the visualizations in Tableau, let’s examine the most frequently used starting station for the average casual rider.

riders %>%
  group_by(start_station_name, member_casual) %>%
  summarise(number_of_ride = n(), .groups = 'drop') %>%
  filter(start_station_name != "", member_casual != 'member') %>%
  arrange(-number_of_ride) %>%
  head(n=10) %>%
  select(-member_casual)

TABLEAU

To create a more comprehensive and visually appealing dashboard, I used Tableau to visualize the data. Below is a brief overview of my process (you can also find the complete code for this section here).

I made some changes to the data so that it would visualize properly in Tableau. The steps are in the code linked above.

To prepare the data for visualization in Tableau, I made some minor changes to the cyclistic_data data frame using a separate R script.
First, I created a copy of the data frame called riders_tableau where all my calculations would take place.
I created new columns for date, month, day, year, hour, quarter, and time_of_day.
To make the month column more readable, I changed it to display the month name (e.g. January) instead of its corresponding number (e.g. 1).
Finally, I exported the cleaned data frame as a .csv file and converted it to .xlsx for use in Tableau.
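The preparation steps above can be sketched as follows. This is an illustration on three made-up rides rather than the script I ran: riders_demo stands in for the real data frame, the time_of_day boundaries (night before 06:00, morning before 12:00, afternoon before 18:00, evening after that) are assumed for the example, and the output path is a temporary file.

```r
# Made-up rides standing in for the full data frame.
riders_demo <- data.frame(
  ride_id    = c("r1", "r2", "r3"),
  started_at = as.POSIXct(c("2022-05-01 08:00:00", "2022-11-20 14:30:00",
                            "2023-01-16 22:10:00"), tz = "UTC")
)

# Work on a copy so the original data frame stays untouched.
riders_tableau <- riders_demo

# Date parts; month is stored as a name (e.g. "January") instead of a number.
riders_tableau$date    <- as.Date(riders_tableau$started_at)
riders_tableau$month   <- month.name[as.integer(format(riders_tableau$date, "%m"))]
riders_tableau$day     <- as.integer(format(riders_tableau$date, "%d"))
riders_tableau$year    <- as.integer(format(riders_tableau$date, "%Y"))
riders_tableau$hour    <- as.integer(format(riders_tableau$started_at, "%H"))
riders_tableau$quarter <- quarters(riders_tableau$date)

# Assumed buckets: [0,6) Night, [6,12) Morning, [12,18) Afternoon, [18,24) Evening.
riders_tableau$time_of_day <- cut(
  riders_tableau$hour,
  breaks = c(0, 6, 12, 18, 24),
  labels = c("Night", "Morning", "Afternoon", "Evening"),
  right = FALSE
)

# Export for Tableau; the temporary path is a placeholder for a real file name.
out_file <- file.path(tempdir(), "riders_tableau.csv")
write.csv(riders_tableau, out_file, row.names = FALSE)
```

Tableau can also read .csv files directly, so the conversion to .xlsx is optional if the file is small enough to load as text.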

I went on to create the following graphs in Tableau.

[Figure: four Tableau charts]

I developed an interactive dashboard that displays all the graphs and pertinent information from my analysis. You can view and interact with the complete dashboard by following this link.

You can find an image of the dashboard below

ACT

The final step of the project.

INSIGHT

Over 80% of Cyclistic rides in the analyzed data were taken by casual riders, who typically ride for shorter durations than member riders. This may explain why they are less likely to subscribe. Additionally, casual riders do not have a strong preference for specific bike types like member riders do.

RECOMMENDATION

Cyclistic could encourage casual riders to become members by offering a discount upon registration once they have accumulated a certain number of minutes on their rides.
Since both casual and member riders tend to ride more during the summer, Cyclistic could offer an annual sale to incentivize casual riders to become members during this time.
Placing advertisements at the top 10 stations where casual riders begin their rides could be an effective way to increase visibility and attract more members.

Google Data Analytics Capstone Project

Sophia Ateb

2023-06-12

Cyclistic Bike Share Analysis
