This project is part of the Google Data Analytics Professional Certificate, taught through Coursera. As my capstone project, I was tasked with completing the following analysis.
Chicago-based bike share company Cyclistic is designing a marketing campaign to convert casual riders (those who purchase single-ride or all-day passes) into annual members. They are requesting data analysis to determine the best strategies for an effective campaign.
The main questions to be answered are:
– How do annual members
and casual riders use Cyclistic bikes differently?
– Why would
casual riders buy Cyclistic annual memberships?
– How can Cyclistic
use digital media to influence casual riders to become members?
To answer these questions, I will use a combination of spreadsheets, R, and Tableau to analyize and visualize the various ways in which members and casual riders differ, and offer a few ideas as to how Cyclistic can proceed with their marketing campaign.
To begin the analysis, I downloaded from Cyclistic’s secure database the last twelve months of accrued trip data (stored as zipped csv files). I then unzipped each file, making a new folder with copies of the originals to keep the original data separate and secure. I then opened each file in Excel to get a sense of the datasets. I immediately noticed that I was dealing with what was collectively a very large dataset with millions of rows.
I noted several general observations of the data:First order of business was to prepare the data for analysis by cleaning up the spreadsheets and adding a few data points. First, I formatted the latitude and longitude columns to 0.000 to aid in the most consistent accuracy of location data. Then, I formatted the datetime columns to include to-the-second data (mdy hh:mm:ss). Interestingly enough, some of the spreadsheets that were missing :ss data suddenly included missing (or perhaps hidden) :ss data upon formatting. Next, I added a new column, ride_length, calculating the length of each ride in seconds. Finally, I added another column, calculating the day of the week for each ride.
Moving to RStudio, after installing and librarying tidyverse and lubridate, I set up my working directory.
setwd("C:/Users/dylan/OneDrive/DA Certificate/Portfolio - Case Studies/2021_06_2022_05_csv/2021_06_2022_05_tripdata_csv/cleaned")
I then loaded the csv files for each month:
m6_2021 <- read_csv("2021_06_divvy_tripdata.csv")
m7_2021 <- read_csv("2021_07_divvy_tripdata.csv")
m8_2021 <- read_csv("2021_08_divvy_tripdata.csv")
m9_2021 <- read_csv("2021_09_divvy_tripdata.csv")
m10_2021 <- read_csv("2021_10_divvy_tripdata.csv")
m11_2021 <- read_csv("2021_11_divvy_tripdata.csv")
m12_2021 <- read_csv("2021_12_divvy_tripdata.csv")
m1_2022 <- read_csv("2022_01_divvy_tripdata.csv")
m2_2022 <- read_csv("2022_02_divvy_tripdata.csv")
m3_2022 <- read_csv("2022_03_divvy_tripdata.csv")
m4_2022 <- read_csv("2022_04_divvy_tripdata.csv")
m5_2022 <- read_csv("2022_05_divvy_tripdata.csv")
Checking out the data with functions like View(), glimpse(), and str(), it looks like everything loaded correctly.
Beginning to clean the data, I dropped ride_id, start_station_name, start_station_id, end_station_name, and end_station_id, as they contain many null and inconsistent data. For the eventual purposes of mapping, the latitude and longitude coordinates can suffice:
m6_2021 <- m6_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m7_2021 <- m7_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m8_2021 <- m8_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m9_2021 <- m9_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m10_2021 <- m10_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m11_2021 <- m11_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m12_2021 <- m12_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m1_2022 <- m1_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m2_2022 <- m2_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m3_2022 <- m3_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m4_2022 <- m4_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m5_2022 <- m5_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
I then combined all the dataframes into one:
all_trips <- bind_rows(m6_2021, m7_2021, m8_2021, m9_2021, m10_2021, m11_2021, m12_2021, m1_2022, m2_2022, m3_2022, m4_2022, m5_2022)
Using functions like glimpse(), View(), and summary(), the all_trips dataframe looks good, but still needs a bit of cleaning. Using colSums(), I found that a number of the columns have null data:
colSums(is.na(all_trips))
I then dropped the rows with missing data, as they only accounted for a fraction of a percent of the rides logged:
all_trips_cleaned <- all_trips[complete.cases(all_trips), ]
Now that the dataframe is clean, I ran a few calculations to get some descriptive data to use for analysis and the business tasks. First, I changed ride_length to a numeric so I could run calculations:
all_trips_cleaned$ride_length <- as.numeric(as.character(all_trips_cleaned$ride_length))
is.numeric(all_trips_cleaned$ride_length)
Since I typically think of the length of a bike ride in terms of minutes rather than seconds, many of the following calculations are divided by 60:
## summarizing ride length calculations
summary((all_trips_cleaned$ride_length)/60)
## member v. casual - number of rides
all_trips_cleaned %>%
group_by(member_casual) %>%
summarise(number_of_rides = n())
## rideable_type - number of rides
all_trips_cleaned %>%
group_by(rideable_type) %>%
summarise(number_of_rides = n())
## members v. casual - ride length
all_trips_cleaned %>%
group_by(member_casual) %>%
summarise(average_duration = (mean(ride_length)/60))
## members v. casual - rideable type
all_trips_cleaned %>%
group_by(member_casual, rideable_type) %>%
summarise(number_of_rides = n())
These calculations returned the following:
– Median ride length
is 11.3 minutes, mean ride length is 18.36 min.
– Maximum ride
length is a hair short of exactly 24 hours, which is perhaps an anomaly.
– There were approximately 30% more member rides than casual rides
(3,299,876 and 2,555,728, respectively).
– There were approximately
35% more classic bike rides than electric bike rides, and docked bikes
represented an insignificant share of the total number of rides.
–
Casual riders ride for approximately twice as long as members, on
average.
– Casual riders use classic bikes and electric bikes at an
almost equal rate.
– Members tend to use classic bikes more than
electric.
I then ran a few more comparative operations to explore further:
## members v. casual - ride length and day of week
all_trips_cleaned$day_of_week <- ordered(all_trips_cleaned$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
rldow <- as_tibble(aggregate(all_trips_cleaned$ride_length/60 ~ all_trips_cleaned$member_casual + all_trips_cleaned$day_of_week, FUN = mean))
| Member/Casual | Day of Week | Average Ride Length(min) |
|---|---|---|
| Casual | Sunday | 29.3 |
| Member | Sunday | 14.4 |
| Casual | Monday | 26.0 |
| Member | Monday | 12.4 |
| Casual | Tuesday | 22.9 |
| Member | Tuesday | 12.0 |
| Casual | Wednesday | 22.4 |
| Member | Wednesday | 12.1 |
| Casual | Thursday | 22.7 |
| Member | Thursday | 12.2 |
| Casual | Friday | 24.0 |
| Member | Friday | 12.5 |
| Casual | Saturday | 28.0 |
| Member | Saturday | 14.3 |
## members v. casual - day of week and number of rides
all_trips_cleaned %>%
group_by(day_of_week, member_casual) %>%
summarise(number_of_rides = n(), average_duration = (mean(ride_length))/60)
| Member/Casual | Day of Week | Number of Rides |
|---|---|---|
| Casual | Sunday | 469307 |
| Member | Sunday | 394535 |
| Casual | Monday | 301587 |
| Member | Monday | 465994 |
| Casual | Tuesday | 286618 |
| Member | Tuesday | 524661 |
| Casual | Wednesday | 285396 |
| Member | Wednesday | 512491 |
| Casual | Thursday | 308131 |
| Member | Thursday | 501664 |
| Casual | Friday | 359455 |
| Member | Friday | 459649 |
| Casual | Saturday | 545234 |
| Member | Saturday | 440882 |
Based on these findings, I noted a few observations:
– casual
riders are more apt to ride and to ride for longer on the weekends
–
member riders typically ride for significantly less time than casual
riders
– members account for higher ridership in general
–
members are more consistently reliable riders, while casual rider
numbers vary widely
– casual riders show significantly reduced
ridership during the week
Moving on to some visualizations using ggplot(), I compared members and casual riders in various ways:
all_trips_cleaned %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n()) %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) +
labs(x = "Day of Week", y = "Number of Rides", fill = "Member/Casual", title = "Member v. Casual - Average Number of Rides by Day of Week")
all_trips_cleaned %>%
group_by(member_casual, day_of_week) %>%
summarise(average_duration = mean(ride_length)/60) %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
labs(x = "Day of Week", y = "Average Duration (min)", fill = "Member/Casual", title = "Member v. Casual - Average Duration of Ride by Day of Week")
all_trips_cleaned %>%
group_by(member_casual, rideable_type) %>%
summarise(number_of_rides = n()) %>%
ggplot(aes(x = rideable_type, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) +
labs(x = "Type of Bicycle", y = "Number of Rides", fill = "Member/Casual", title = "Member v. Casual - Bicycle Types by Rider Type")
For an added layer of analysis, I wanted to see trips over 30 minutes to determine the areas of Chicago from which the longest trips originated. My thinking was that the longer trips generate more revenue and therefore deserve special attention. I made and saved a dataframe called long_trips using the following code:
long_trips <- all_trips_cleaned %>% filter(ride_length > 1800)
I then moved to Tableau to create this visualization:
Synthesizing all of the data and data visualizations above, I would
report to my stakeholders the following:
– Members and casual riders
differ primarily in regards to weekday v. weekend ridership and average
length of ride. Members, who most likely use the service to commute to
and from work, use more bikes during the week. Casual riders beat out
members on ridership on the weekends.
– Casual riders, on every day
of the week, ride for a longer time on average, and most likely spend
more money doing so.
– While members take more rides, their
geographic distribution is more centrally located around the center of
the city. Casual riders are more spread throughout the metropolitan
area.
– Casual members would become members for financial and
convenience incentives. An introductory ‘Weekday Member’ rate could act
as an enticing incentive, or a pricing model similar to Amazon Prime (a
yearly membership fee that comes with perks, like differing
weekday/weekend rates to boost weekday ridership).
– Casual riders
could further be incentivized by a discounted all-day rate, since they
typically ride for much longer times.
– Using digital media,
Cyclistic can use targeted advertisements around the city center to draw
in short-ride commuters, either through social media, or through
physical advertisements strategically placed around high-traffic areas.
– Cyclistic could further reach potential members by making an
in-app-only rate that requires signing up as a member, or by making the
standard member rate available to casual riders during commuting hours,
noting so in their digital media and marketing campaigns.