Case Study: How do annual members and casual riders use Cyclistic bikes differently?

Business Task:

Obtain, examine, and clean the available data to generate actionable insights into how members of the (fictional) bike share company Cyclistic differ from casual riders.

Available Data:

Reliable, original data was obtained from the City of Chicago’s DivvyBike records hosted by the company’s Amazon AWS site.
* At the time of this case study, data was available from 2013 to August 2022.
* The data has become more sophisticated over time. For example, in records from 2013, rides were simply numbered, but from 2020 on, each ride was given a unique identifier and consistent data points were captured relating to rider type, start time and location, and end time and location.
* To ensure currency, a rolling 12 months of data was selected from Sept 2021 to Aug 2022 in the form of 12 CSV files.
* The data is comprehensive and cited. However, data on riders themselves is not included. This study is therefore unable to compare repeated trips by individuals or take demographic differences into account.

Processing the Data:

I examined the 12 CSVs in Excel to confirm that the data attributes were consistently ordered, named, and categorized as the correct data types. I then wanted to extract a proxy measure of ride distance by comparing the start and end points of each ride.
* Note: this proxy would not equal ride distance in all cases. A user may have taken an indirect route between stations, or may have ridden around the city and then docked back at the original station. However, I considered it valuable in the absence of other data on ride distance.

Although Excel was not the ideal tool for the size of the data (each CSV contained hundreds of thousands of records), it had a formula for calculating the difference between two points with longitude and latitude expressed in separate columns:
* =ACOS(COS(RADIANS(90-[start_lat]))*COS(RADIANS(90-[end_lat]))+SIN(RADIANS(90-[start_lat]))*SIN(RADIANS(90-[end_lat]))*COS(RADIANS([start_long]- [end_long])))*3959
* I used this to create a calculated column within each of my 12 CSVs labeled ‘distance_start_to_end’.

Transforming and Preparing the Data with SQL:

Initial Transformation

Once the above step was complete, I transitioned to BigQuery as a better tool to handle the large dataset.
The full data cleaning and transformation process in SQL has been documented separately and is available upon request. In brief:
* I joined the 12 CSVs into a single table with the UNION ALL operator.
* I reviewed the schema for the new table and found that my ‘distance_start_to_end’ variable had been incorrectly cast as a string; I transformed this with SAFE_CAST to a new column ‘ride_distance’.
* I extracted the Month, Date, and Day of Week from the start times, and calculated the time between the start and end of each ride as ‘ride_time’.
* I noticed a possible issue with the bike types “classic_bike” versus “docked_bike”, and reached out to the company who confirmed that these are physically and functionally the same; I used a SET WHERE command to replace the lesser-used term “docked_bike” with “classic_bike”.
* I extracted the hours and minutes from my ‘ride_time’ column, multiplied hours by 60, and added the minutes to create a combined column showing the ride time in minutes.

Removing Outliers

The company’s documentation advised that there were certain types of records that most likely were not true rides.
* It was explained that entries with rides less than 1 minute or more than 24 hours were probably riders who had forgotten or were attempting to redock bikes. I removed these with a SELECT * WHERE AND statement.
* I also removed rides with end times earlier than start times.
* This removed 71,019 records, about 1.21% of the dataset.

Final Tidying

My last step was to make my data more intuitive and remove duplicates.
* I used SELECT CASE queries to convert my month and day of week data from integers to strings.
* I extracted the hours from the ride start times to enable me to look for peaks and lulls.
* I noticed that there was a slightly-higher count of ride IDs than of unique ride IDs, indicating some mistaken or duplicate entries. I used a ROW_NUMBER () OVER (PARTITION BY) query to select only the first of each ride IDs. This dropped 521 values, or .012% of the dataset.
* Finally, I used CREATE OR REPLACE TABLE to generate a table that sorted rides in ascending chronological order.

The result was a clean, consistent, chronological table with usable data on 5,811,503 rides. Each record contained information on the month, day of week, hour of day (for ride start), time, distance, and rider type (member or casual).

Analyzing the Data:

Still in BigQuery, I calculated the average and median ride times in minutes and distances in miles for members versus casual riders. I found that casual riders took longer rides by all measures.
* For medians, I used a PERCENTILE_CONT(variable, 0.5) OVER PARTITION BY query.

Next, I looked at the distributions of casual riders versus members across days of the week, months, and time of day (as measured by the hour of the start time). I ran a series of COUNT queries, grouping by the rider type.
* Note: I felt examining the distribution of rides over months would provide insights into how seasonality affects each rider type. However, it should be noted that because I was looking at only one year of data, other factors could have also influenced the results. For example, many workplaces instituted ‘return to work’ policies throughout 2022, which could drive significantly increased ridership among commuters in August 2022 compared to September 2021.

I combined the average ride time and distance across the distributions of rides into 3 summary tables, which I exported to Excel.

Analyzing and Sharing with RStudio:

For my final analysis, I imported the summary tables into RStudio to plot the number of rides taken by casual versus member riders across days of the week, time of the day, and months of the year.
* RStudio reordered my X labels alphabetically instead of chronologically. To get around this, I adjusted the tables in Excel (using numeric values for days of the week and months), reimported them into RStudio, and converted my x labels to a factor so I could use the scale_x_discrete(labels=c()) function to add my own axis labels.
* I also used the scale_y_continuous(labels=scale::comma) command to have my y-axis labels as standard numbers rather than scientific notation.
* Note: One graph with full code is shown below as a sample. The remaining graphs appear in the Conclusion with code omitted for brevity.

  library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
  library(readxl)
  
  AvTD_by_M <- read_excel("C:/Users/Master/Desktop/Summary_Tables/Avg-time-dist_by-month_14-Jan-23.xlsx")

  AvTD_by_M %>% 
  ggplot(AvTD_by_M, mapping=aes(x=factor(month_num), y=number_of_rides,fill=month_num))+
  geom_bar(stat="identity", show.legend = FALSE) +
  facet_wrap(~member_casual) +
  theme(axis.text.x = element_text(size = 7), axis.title.x = element_blank(), axis.text.y = element_text(size=7)) +
  labs(title = "Number of Rides by Month", caption = "Rides from Sep 1, 2021 to Aug 31, 2022",y="") +
  scale_x_discrete(labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")) +
  scale_y_continuous(labels=scales::comma)

I then compared average ride time in minutes and average ride distance in miles for casual and member riders, again across days, time, and months.
* As above, one example of the code and resulting graph is shown.

ylim.prim <- c(0, 30)   # set bounds of first Y axis for ride time in minutes
ylim.sec <- c(0, 5)    # set bounds of second Y axis for ride distance in miles

b <- diff(ylim.prim)/diff(ylim.sec)
a <- ylim.prim[1] - b*ylim.sec[1] # create two separate y axes

# plot time and distance on the same graph:
AvTD_by_Hr <- read_excel("C:/Users/Master/Desktop/Summary_Tables/Avg-time-dist_by-hour_14-Jan-23.xlsx")


ggplot(data = AvTD_by_Hr) +
  geom_bar(mapping = aes(x = factor(ride_hour), y = average_ride_time, fill = average_ride_time), stat = "identity") +
  geom_line(mapping = aes(x = ride_hour, y = a + average_ride_distance*b, color = 'darkred'), size = 2) +
  facet_wrap(~member_casual) +
  theme(axis.text.x = element_text(size = 7), axis.title.x = element_blank(), axis.text.y = element_text(size=5)) +
  scale_x_discrete(labels=c("12A","","","3A","","","6A","","","9A","","","12P","","","3P","","","6P","","","9P","","")) +
  scale_y_continuous("Avg. Ride Time", sec.axis = sec_axis(~ (. - a)/b, name = "Avg. Ride Distance")) +
  labs(title = "Average Ride Time & Distance by Hour of Day", color = "Distance (miles)", fill = "Time (minutes)") +
  scale_color_manual(values = 'darkred', labels = c(''))

Findings

The graphs show some similarities in the two groups of riders:
* Both groups use bikes more in summer months than in winter months - unsurprising, given the Chicago climate.
* Both groups use bikes more in the second half of the day during daylight hours, with a peak around 4-5PM.

However, the graphs highlight key differences as well.

Members: Shorter, Consistent Commutes

Although members, like casual riders, took more trips on certain months, days, or times, they showed shorter and less variable average ride times and distances across all three factors than did casual riders. Other notable differences include:
* Highest number of trips taken on weekdays, rather than weekends
* Peaks coinciding with the morning rush hour of 7-8AM and evening rush hour of 4-6PM
* Very little variability in ride time or distance, and especially steady ride lengths for weekdays (Monday through Friday)

This suggests that many members are using the bikes for commuting purposes, leading to short, predictable rides and higher weekday usage.

Casual Riders: Longer, More ‘Opportunistic’ Rides

Conversely, casual riders showed longer rides and greater variability on all measures: number of rides and average ride time or distance by day, month, or hour. Usage also suggested that rides may correspond with leisure activities:
* Highest number of rides and longer rides taken on the weekends
* Markedly higher ridership during Chicago’s temperate season from May to October; while members also showed higher ridership during this time, the variation is much more pronounced among casual riders, whose rides dropped dramatically in number and significantly in length between November and February
* A preference for afternoons and evenings, with ridership climbing throughout the day and tapering to its lowest around 3-6AM

Interestingly, while members typically take more rides, casual riders overtake members in the number of rides on Saturdays and Sundays and between the hours of 10PM-2AM. This may suggest that some casual riders use the bikes late into the night to take advantage of Chicago’s vibrant dining and entertainment scene, and would be worth exploring further.

Limitations

As mentioned earlier, while data is available for individual rides, none is available for individual riders. Visibility into how many trips a rider takes or how riders differ by demographics may lend valuable insights for marketing to or converting casual riders into members.

Additionally, the average ride distance measure was strongly correlated with average ride time when looking at days of the week, but less strongly correlated when looking across months or hours of the day. It’s possible that some adept riders travel a long distance in a short period of time, but also that the proxy ride distance metric was influenced by another factor such as the company physically redistributing bikes. Given this uncertainty, average ride time is the better measure of ride length for purposes of determining user habits.

Author’s Note: This case study was completed as part of the Google Data Analytics Certificate. I have chosen to share my findings as an RMarkdown document, rather than a business-friendly output such as a PowerPoint or Tableau dashboard, to test and expand my skills in R programming. Thank you for reading!