Introduction

This project is part of the Google Data Analytics Professional Certificate, taught through Coursera. As my capstone project, I was tasked with completing the following analysis.

Chicago-based bike share company Cyclistic is designing a marketing campaign to convert casual riders (those who purchase single-ride or all-day passes) into annual members. They are requesting data analysis to determine the best strategies for an effective campaign.

The main questions to be answered are:
– How do annual members and casual riders use Cyclistic bikes differently?
– Why would casual riders buy Cyclistic annual memberships?
– How can Cyclistic use digital media to influence casual riders to become members?

To answer these questions, I will use a combination of spreadsheets, R, and Tableau to analyize and visualize the various ways in which members and casual riders differ, and offer a few ideas as to how Cyclistic can proceed with their marketing campaign.

Description of Data

To begin the analysis, I downloaded from Cyclistic’s secure database the last twelve months of accrued trip data (stored as zipped csv files). I then unzipped each file, making a new folder with copies of the originals to keep the original data separate and secure. I then opened each file in Excel to get a sense of the datasets. I immediately noticed that I was dealing with what was collectively a very large dataset with millions of rows.

I noted several general observations of the data:
– Winter months had significantly fewer rides logged, for fairly obvious reasons (I have been to Chicago in the winter, and I wouldn’t want to ride a bike, either!).
– The column ride_id seems to be a randomly generated alphanumeric code and is likely not useful for the present analysis.
– The columns started_at and ended_at are in mdy 00:00 format but some of the months have 00:00:00 data.
– The Station ID and Station Name columns contain potentially inconsistent data, as many rows are missing said data or contain discrepancies.
– The lack of uniformity in the station data might not be problem, as the geographic coordinates seem to be very much intact and can potentially be used in Tableau.
– Some simple calculations can be done in Excel before moving on to R.

Initial Cleaning in Excel

First order of business was to prepare the data for analysis by cleaning up the spreadsheets and adding a few data points. First, I formatted the latitude and longitude columns to 0.000 to aid in the most consistent accuracy of location data. Then, I formatted the datetime columns to include to-the-second data (mdy hh:mm:ss). Interestingly enough, some of the spreadsheets that were missing :ss data suddenly included missing (or perhaps hidden) :ss data upon formatting. Next, I added a new column, ride_length, calculating the length of each ride in seconds. Finally, I added another column, calculating the day of the week for each ride.

R

Moving to RStudio, after installing and librarying tidyverse and lubridate, I set up my working directory.

setwd("C:/Users/dylan/OneDrive/DA Certificate/Portfolio - Case Studies/2021_06_2022_05_csv/2021_06_2022_05_tripdata_csv/cleaned")

I then loaded the csv files for each month:

m6_2021 <- read_csv("2021_06_divvy_tripdata.csv")
m7_2021 <- read_csv("2021_07_divvy_tripdata.csv")
m8_2021 <- read_csv("2021_08_divvy_tripdata.csv")
m9_2021 <- read_csv("2021_09_divvy_tripdata.csv")
m10_2021 <- read_csv("2021_10_divvy_tripdata.csv")
m11_2021 <- read_csv("2021_11_divvy_tripdata.csv")
m12_2021 <- read_csv("2021_12_divvy_tripdata.csv")
m1_2022 <- read_csv("2022_01_divvy_tripdata.csv")
m2_2022 <- read_csv("2022_02_divvy_tripdata.csv")
m3_2022 <- read_csv("2022_03_divvy_tripdata.csv")
m4_2022 <- read_csv("2022_04_divvy_tripdata.csv")
m5_2022 <- read_csv("2022_05_divvy_tripdata.csv")

Checking out the data with functions like View(), glimpse(), and str(), it looks like everything loaded correctly.

Beginning to clean the data, I dropped ride_id, start_station_name, start_station_id, end_station_name, and end_station_id, as they contain many null and inconsistent data. For the eventual purposes of mapping, the latitude and longitude coordinates can suffice:

m6_2021 <- m6_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m7_2021 <- m7_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m8_2021 <- m8_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m9_2021 <- m9_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m10_2021 <- m10_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m11_2021 <- m11_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m12_2021 <- m12_2021 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m1_2022 <- m1_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m2_2022 <- m2_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m3_2022 <- m3_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m4_2022 <- m4_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))
m5_2022 <- m5_2022 %>% select(-c(ride_id, start_station_name, start_station_id, end_station_name, end_station_id))

I then combined all the dataframes into one:

all_trips <- bind_rows(m6_2021, m7_2021, m8_2021, m9_2021, m10_2021, m11_2021, m12_2021, m1_2022, m2_2022, m3_2022, m4_2022, m5_2022)

Using functions like glimpse(), View(), and summary(), the all_trips dataframe looks good, but still needs a bit of cleaning. Using colSums(), I found that a number of the columns have null data:

colSums(is.na(all_trips))

I then dropped the rows with missing data, as they only accounted for a fraction of a percent of the rides logged:

all_trips_cleaned <- all_trips[complete.cases(all_trips), ]

Now that the dataframe is clean, I ran a few calculations to get some descriptive data to use for analysis and the business tasks. First, I changed ride_length to a numeric so I could run calculations:

all_trips_cleaned$ride_length <- as.numeric(as.character(all_trips_cleaned$ride_length))

is.numeric(all_trips_cleaned$ride_length)

Since I typically think of the length of a bike ride in terms of minutes rather than seconds, many of the following calculations are divided by 60:

## summarizing ride length calculations

summary((all_trips_cleaned$ride_length)/60)

## member v. casual - number of rides

all_trips_cleaned %>%
  group_by(member_casual) %>%
  summarise(number_of_rides = n())

## rideable_type - number of rides

all_trips_cleaned %>%
  group_by(rideable_type) %>%
  summarise(number_of_rides = n())

## members v. casual - ride length

all_trips_cleaned %>%
  group_by(member_casual) %>%
  summarise(average_duration = (mean(ride_length)/60))

## members v. casual - rideable type

all_trips_cleaned %>%
  group_by(member_casual, rideable_type) %>%
  summarise(number_of_rides = n())

These calculations returned the following:
– Median ride length is 11.3 minutes, mean ride length is 18.36 min.
– Maximum ride length is a hair short of exactly 24 hours, which is perhaps an anomaly.
– There were approximately 30% more member rides than casual rides (3,299,876 and 2,555,728, respectively).
– There were approximately 35% more classic bike rides than electric bike rides, and docked bikes represented an insignificant share of the total number of rides.
– Casual riders ride for approximately twice as long as members, on average.
– Casual riders use classic bikes and electric bikes at an almost equal rate.
– Members tend to use classic bikes more than electric.

I then ran a few more comparative operations to explore further:

## members v. casual - ride length and day of week

all_trips_cleaned$day_of_week <- ordered(all_trips_cleaned$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

rldow <- as_tibble(aggregate(all_trips_cleaned$ride_length/60 ~ all_trips_cleaned$member_casual + all_trips_cleaned$day_of_week, FUN = mean))
Member/Casual Day of Week Average Ride Length(min)
Casual Sunday 29.3
Member Sunday 14.4
Casual Monday 26.0
Member Monday 12.4
Casual Tuesday 22.9
Member Tuesday 12.0
Casual Wednesday 22.4
Member Wednesday 12.1
Casual Thursday 22.7
Member Thursday 12.2
Casual Friday 24.0
Member Friday 12.5
Casual Saturday 28.0
Member Saturday 14.3
## members v. casual - day of week and number of rides

all_trips_cleaned %>%
  group_by(day_of_week, member_casual) %>%
  summarise(number_of_rides = n(), average_duration = (mean(ride_length))/60)
Member/Casual Day of Week Number of Rides
Casual Sunday 469307
Member Sunday 394535
Casual Monday 301587
Member Monday 465994
Casual Tuesday 286618
Member Tuesday 524661
Casual Wednesday 285396
Member Wednesday 512491
Casual Thursday 308131
Member Thursday 501664
Casual Friday 359455
Member Friday 459649
Casual Saturday 545234
Member Saturday 440882

Based on these findings, I noted a few observations:
– casual riders are more apt to ride and to ride for longer on the weekends
– member riders typically ride for significantly less time than casual riders
– members account for higher ridership in general
– members are more consistently reliable riders, while casual rider numbers vary widely
– casual riders show significantly reduced ridership during the week

Visualizations

Moving on to some visualizations using ggplot(), I compared members and casual riders in various ways:

all_trips_cleaned %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) +
  labs(x = "Day of Week", y = "Number of Rides", fill = "Member/Casual", title = "Member v. Casual - Average Number of Rides by Day of Week")

all_trips_cleaned %>%
  group_by(member_casual, day_of_week) %>%
  summarise(average_duration = mean(ride_length)/60) %>%
  ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "Day of Week", y = "Average Duration (min)", fill = "Member/Casual", title = "Member v. Casual - Average Duration of Ride by Day of Week")

all_trips_cleaned %>%
  group_by(member_casual, rideable_type) %>%
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = rideable_type, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) +
  labs(x = "Type of Bicycle", y = "Number of Rides", fill = "Member/Casual", title = "Member v. Casual - Bicycle Types by Rider Type")

Tableau

For an added layer of analysis, I wanted to see trips over 30 minutes to determine the areas of Chicago from which the longest trips originated. My thinking was that the longer trips generate more revenue and therefore deserve special attention. I made and saved a dataframe called long_trips using the following code:

long_trips <- all_trips_cleaned %>% filter(ride_length > 1800)

I then moved to Tableau to create this visualization:

Findings, Observations, and Recommendations

Synthesizing all of the data and data visualizations above, I would report to my stakeholders the following:
– Members and casual riders differ primarily in regards to weekday v. weekend ridership and average length of ride. Members, who most likely use the service to commute to and from work, use more bikes during the week. Casual riders beat out members on ridership on the weekends.
– Casual riders, on every day of the week, ride for a longer time on average, and most likely spend more money doing so.
– While members take more rides, their geographic distribution is more centrally located around the center of the city. Casual riders are more spread throughout the metropolitan area.
– Casual members would become members for financial and convenience incentives. An introductory ‘Weekday Member’ rate could act as an enticing incentive, or a pricing model similar to Amazon Prime (a yearly membership fee that comes with perks, like differing weekday/weekend rates to boost weekday ridership).
– Casual riders could further be incentivized by a discounted all-day rate, since they typically ride for much longer times.
– Using digital media, Cyclistic can use targeted advertisements around the city center to draw in short-ride commuters, either through social media, or through physical advertisements strategically placed around high-traffic areas.
– Cyclistic could further reach potential members by making an in-app-only rate that requires signing up as a member, or by making the standard member rate available to casual riders during commuting hours, noting so in their digital media and marketing campaigns.

And lastly – wear a helmet and obey traffic laws!