Cyclistic Bikeshare Case Study

Introduction

This project is part of the Google Data Analytics Professional Certificate, taught through Coursera. As my capstone project, I was tasked with completing the following analysis.

Chicago-based bike share company Cyclistic is designing a marketing campaign to convert casual riders (those who purchase single-ride or all-day passes) into annual members. They are requesting data analysis to determine the best strategies for an effective campaign.

The main questions to be answered are:
– How do annual members and casual riders use Cyclistic bikes differently?
– Why would casual riders buy Cyclistic annual memberships?
– How can Cyclistic use digital media to influence casual riders to become members?

To answer these questions, I will use a combination of spreadsheets, R, and Tableau to analyize and visualize the various ways in which members and casual riders differ, and offer a few ideas as to how Cyclistic can proceed with their marketing campaign.

Description of Data

To begin the analysis, I downloaded from Cyclistic’s secure database the last twelve months of accrued trip data (stored as zipped csv files). I then unzipped each file, making a new folder with copies of the originals to keep the original data separate and secure. I then opened each file in Excel to get a sense of the datasets. I immediately noticed that I was dealing with what was collectively a very large dataset with millions of rows.

I noted several general observations of the data:
– Winter months had significantly fewer rides logged, for fairly obvious reasons (I have been to Chicago in the winter, and I wouldn’t want to ride a bike, either!).
– The column ride_id seems to be a randomly generated alphanumeric code and is likely not useful for the present analysis.
– The columns started_at and ended_at are in mdy 00:00 format but some of the months have 00:00:00 data.
– The Station ID and Station Name columns contain potentially inconsistent data, as many rows are missing said data or contain discrepancies.
– The lack of uniformity in the station data might not be problem, as the geographic coordinates seem to be very much intact and can potentially be used in Tableau.
– Some simple calculations can be done in Excel before moving on to R.

Initial Cleaning in Excel

First order of business was to prepare the data for analysis by cleaning up the spreadsheets and adding a few data points. First, I formatted the latitude and longitude columns to 0.000 to aid in the most consistent accuracy of location data. Then, I formatted the datetime columns to include to-the-second data (mdy hh:mm:ss). Interestingly enough, some of the spreadsheets that were missing :ss data suddenly included missing (or perhaps hidden) :ss data upon formatting. Next, I added a new column, ride_length, calculating the length of each ride in seconds. Finally, I added another column, calculating the day of the week for each ride.

R

Moving to RStudio, after installing and librarying tidyverse and lubridate, I set up my working directory.

setwd("C:/Users/dylan/OneDrive/DA Certificate/Portfolio - Case Studies/2021_06_2022_05_csv/2021_06_2022_05_tripdata_csv/cleaned")

I then loaded the csv files for each month:

m6_2021 <- read_csv("2021_06_divvy_tripdata.csv")
m7_2021 <- read_csv("2021_07_divvy_tripdata.csv")
m8_2021 <- read_csv("2021_08_divvy_tripdata.csv")
m9_2021 <- read_csv("2021_09_divvy_tripdata.csv")
m10_2021 <- read_csv("2021_10_divvy_tripdata.csv")
m11_2021 <- read_csv("2021_11_divvy_tripdata.csv")
m12_2021 <- read_csv("2021_12_divvy_tripdata.csv")
m1_2022 <- read_csv("2022_01_divvy_tripdata.csv")
m2_2022 <- read_csv("2022_02_divvy_tripdata.csv")
m3_2022 <- read_csv("2022_03_divvy_tripdata.csv")
m4_2022 <- read_csv("2022_04_divvy_tripdata.csv")
m5_2022 <- read_csv("2022_05_divvy_tripdata.csv")

Checking out the data with functions like View(), glimpse(), and str(), it looks like everything loaded correctly.

Beginning to clean the data, I dropped ride_id, start_station_name, start_station_id, end_station_name, and end_station_id, as they contain many null and inconsistent data. For the eventual purposes of mapping, the latitude and longitude coordinates can suffice:

m6_2021 <- m6_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m7_2021 <- m7_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m8_2021 <- m8_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m9_2021 <- m9_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m10_2021 <- m10_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m11_2021 <- m11_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m12_2021 <- m12_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m1_2022 <- m1_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m2_2022 <- m2_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m3_2022 <- m3_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m4_2022 <- m4_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m5_2022 <- m5_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))

I then combined all the dataframes into one:

all_trips <- bind_rows(m6_2021, m7_2021, m8_2021, m9_2021, m10_2021, m11_2021, m12_2021, m1_2022, m2_2022, m3_2022, m4_2022, m5_2022)

Using functions like glimpse(), View(), and summary(), the all_trips dataframe looks good, but still needs a bit of cleaning. Using colSums(), I found that a number of the columns have null data:

colSums(is.na(all_trips))

I then dropped the rows with missing data, as they only accounted for a fraction of a percent of the rides logged:

all_trips_cleaned <- all_trips[complete.cases(all_trips), ]

Now that the dataframe is clean, I ran a few calculations to get some descriptive data to use for analysis and the business tasks. First, I changed ride_length to a numeric so I could run calculations:

all_trips_cleaned$ride_length <- as.numeric(as.character(all_trips_cleaned$ride_length))

is.numeric(all_trips_cleaned$ride_length)

Since I typically think of the length of a bike ride in terms of minutes rather than seconds, many of the following calculations are divided by 60:

## summarizing ride length calculations

summary((all_trips_cleaned$ride_length)/60)

## member v. casual - number of rides

all_trips_cleaned %>%
  group_by(member_casual) %>%
  summarise(number_of_rides = n())

## rideable_type - number of rides

all_trips_cleaned %>%
  group_by(rideable_type) %>%
  summarise(number_of_rides = n())

## members v. casual - ride length

all_trips_cleaned %>%
  group_by(member_casual) %>%
  summarise(average_duration = (mean(ride_length)/60))

## members v. casual - rideable type

all_trips_cleaned %>%
  group_by(member_casual, rideable_type) %>%
  summarise(number_of_rides = n())

These calculations returned the following:
– Median ride length is 11.3 minutes, mean ride length is 18.36 min.
– Maximum ride length is a hair short of exactly 24 hours, which is perhaps an anomaly.
– There were approximately 30% more member rides than casual rides (3,299,876 and 2,555,728, respectively).
– There were approximately 35% more classic bike rides than electric bike rides, and docked bikes represented an insignificant share of the total number of rides.
– Casual riders ride for approximately twice as long as members, on average.
– Casual riders use classic bikes and electric bikes at an almost equal rate.
– Members tend to use classic bikes more than electric.

I then ran a few more comparative operations to explore further:

## members v. casual - ride length and day of week

all_trips_cleaned$day_of_week <- ordered(all_trips_cleaned$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

rldow <- as_tibble(aggregate(all_trips_cleaned$ride_length/60 ~ all_trips_cleaned$member_casual + all_trips_cleaned$day_of_week, FUN = mean))

Member/Casual	Day of Week	Average Ride Length(min)
Casual	Sunday	29.3
Member	Sunday	14.4
Casual	Monday	26.0
Member	Monday	12.4
Casual	Tuesday	22.9
Member	Tuesday	12.0
Casual	Wednesday	22.4
Member	Wednesday	12.1
Casual	Thursday	22.7
Member	Thursday	12.2
Casual	Friday	24.0
Member	Friday	12.5
Casual	Saturday	28.0
Member	Saturday	14.3

## members v. casual - day of week and number of rides

all_trips_cleaned %>%
  group_by(day_of_week, member_casual) %>%
  summarise(number_of_rides = n(), average_duration = (mean(ride_length))/60)

Member/Casual	Day of Week	Number of Rides
Casual	Sunday	469307
Member	Sunday	394535
Casual	Monday	301587
Member	Monday	465994
Casual	Tuesday	286618
Member	Tuesday	524661
Casual	Wednesday	285396
Member	Wednesday	512491
Casual	Thursday	308131
Member	Thursday	501664
Casual	Friday	359455
Member	Friday	459649
Casual	Saturday	545234
Member	Saturday	440882

Based on these findings, I noted a few observations:
– casual riders are more apt to ride and to ride for longer on the weekends
– member riders typically ride for significantly less time than casual riders
– members account for higher ridership in general
– members are more consistently reliable riders, while casual rider numbers vary widely
– casual riders show significantly reduced ridership during the week

Visualizations

Moving on to some visualizations using ggplot(), I compared members and casual riders in various ways:

all_trips_cleaned %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) +
  labs(x = "Day of Week", y = "Number of Rides", fill = "Member/Casual", title = "Member v. Casual - Average Number of Rides by Day of Week")

all_trips_cleaned %>%
  group_by(member_casual, day_of_week) %>%
  summarise(average_duration = mean(ride_length)/60) %>%
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Day of Week", y = "Average Duration (min)", fill = "Member/Casual", title = "Member v. Casual - Average Duration of Ride by Day of Week")

all_trips_cleaned %>%
  group_by(member_casual, rideable_type) %>%
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = rideable_type, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) +
  labs(x = "Type of Bicycle", y = "Number of Rides", fill = "Member/Casual", title = "Member v. Casual - Bicycle Types by Rider Type")

Tableau

For an added layer of analysis, I wanted to see trips over 30 minutes to determine the areas of Chicago from which the longest trips originated. My thinking was that the longer trips generate more revenue and therefore deserve special attention. I made and saved a dataframe called long_trips using the following code:

long_trips <- all_trips_cleaned %>% filter(ride_length > 1800)

I then moved to Tableau to create this visualization:

Findings, Observations, and Recommendations

Synthesizing all of the data and data visualizations above, I would report to my stakeholders the following:
– Members and casual riders differ primarily in regards to weekday v. weekend ridership and average length of ride. Members, who most likely use the service to commute to and from work, use more bikes during the week. Casual riders beat out members on ridership on the weekends.
– Casual riders, on every day of the week, ride for a longer time on average, and most likely spend more money doing so.
– While members take more rides, their geographic distribution is more centrally located around the center of the city. Casual riders are more spread throughout the metropolitan area.
– Casual members would become members for financial and convenience incentives. An introductory ‘Weekday Member’ rate could act as an enticing incentive, or a pricing model similar to Amazon Prime (a yearly membership fee that comes with perks, like differing weekday/weekend rates to boost weekday ridership).
– Casual riders could further be incentivized by a discounted all-day rate, since they typically ride for much longer times.
– Using digital media, Cyclistic can use targeted advertisements around the city center to draw in short-ride commuters, either through social media, or through physical advertisements strategically placed around high-traffic areas.
– Cyclistic could further reach potential members by making an in-app-only rate that requires signing up as a member, or by making the standard member rate available to casual riders during commuting hours, noting so in their digital media and marketing campaigns.