Obtain, examine, and clean the available data to generate actionable insights into how members of the (fictional) bike share company Cyclistic differ from casual riders.
Reliable, original data was obtained from the City of Chicago’s
DivvyBike records hosted by the company’s Amazon AWS site.
* At the time of this case study, data was available from 2013 to August
2022.
* The data has become more sophisticated over time. For example, in
records from 2013, rides were simply numbered, but from 2020 on, each
ride was given a unique identifier and consistent data points were
captured relating to rider type, start time and location, and end time
and location.
* To ensure currency, a rolling 12 months of data was selected from Sept
2021 to Aug 2022 in the form of 12 CSV files.
* The data is comprehensive and cited. However, data on riders
themselves is not included. This study is therefore unable to compare
repeated trips by individuals or take demographic differences into
account.
I examined the 12 CSVs in Excel to confirm that the data attributes
were consistently ordered, named, and categorized as the correct data
types. I then wanted to extract a proxy measure of ride distance by
comparing the start and end points of each ride.
* Note: this proxy would not equal ride distance in all cases. A
user may have taken an indirect route between stations, or may have
ridden around the city and then docked back at the original station.
However, I considered it valuable in the absence of other data on ride
distance.
Although Excel was not the ideal tool for the size of the data (each
CSV contained hundreds of thousands of records), it had a formula for
calculating the difference between two points with longitude and
latitude expressed in separate columns:
* =ACOS(COS(RADIANS(90-[start_lat]))*COS(RADIANS(90-[end_lat]))+SIN(RADIANS(90-[start_lat]))*SIN(RADIANS(90-[end_lat]))*COS(RADIANS([start_long]-
[end_long])))*3959
* I used this to create a calculated column within each of my 12 CSVs
labeled ‘distance_start_to_end’.
Once the above step was complete, I transitioned to BigQuery as a
better tool to handle the large dataset.
The full data cleaning and transformation process in SQL has been
documented separately and is available upon request. In brief:
* I joined the 12 CSVs into a single table with the UNION ALL operator.
* I reviewed the schema for the new table and found that my ‘distance_start_to_end’ variable had been
incorrectly cast as a string; I transformed this with SAFE_CAST to a new column ‘ride_distance’.
* I extracted the Month, Date, and Day of Week from the start times, and
calculated the time between the start and end of each ride as ‘ride_time’.
* I noticed a possible issue with the bike types “classic_bike”
versus “docked_bike”, and reached out to the company who
confirmed that these are physically and functionally the same; I used a
SET WHERE command to replace the
lesser-used term “docked_bike” with
“classic_bike”.
* I extracted the hours and minutes from my ‘ride_time’ column, multiplied hours by 60, and
added the minutes to create a combined column showing the ride time in
minutes.
The company’s documentation advised that there were certain types of
records that most likely were not true rides.
* It was explained that entries with rides less than 1 minute or more
than 24 hours were probably riders who had forgotten or were attempting
to redock bikes. I removed these with a SELECT
* WHERE AND statement.
* I also removed rides with end times earlier than start times.
* This removed 71,019 records, about 1.21% of the dataset.
My last step was to make my data more intuitive and remove
duplicates.
* I used SELECT CASE queries to convert
my month and day of week data from integers to strings.
* I extracted the hours from the ride start times to enable me to look
for peaks and lulls.
* I noticed that there was a slightly-higher count of ride IDs than of
unique ride IDs, indicating some mistaken or duplicate entries. I used a
ROW_NUMBER () OVER (PARTITION BY) query
to select only the first of each ride IDs. This dropped 521 values, or
.012% of the dataset.
* Finally, I used CREATE OR REPLACE
TABLE to generate a table that sorted rides in ascending
chronological order.
The result was a clean, consistent, chronological table with usable data on 5,811,503 rides. Each record contained information on the month, day of week, hour of day (for ride start), time, distance, and rider type (member or casual).
Still in BigQuery, I calculated the average and median ride times in
minutes and distances in miles for members versus casual riders. I found
that casual riders took longer rides by all measures.
* For medians, I used a PERCENTILE_CONT(variable, 0.5) OVER PARTITION
BY query.
Next, I looked at the distributions of casual riders versus members
across days of the week, months, and time of day (as measured by the
hour of the start time). I ran a series of COUNT queries, grouping by the rider
type.
* Note: I felt examining the distribution of rides over months would
provide insights into how seasonality affects each rider type. However,
it should be noted that because I was looking at only one year of data,
other factors could have also influenced the results. For example, many
workplaces instituted ‘return to work’ policies throughout 2022, which
could drive significantly increased ridership among commuters in August
2022 compared to September 2021.
I combined the average ride time and distance across the distributions of rides into 3 summary tables, which I exported to Excel.
For my final analysis, I imported the summary tables into RStudio to
plot the number of rides taken by casual versus member riders across
days of the week, time of the day, and months of the year.
* RStudio reordered my X labels alphabetically instead of
chronologically. To get around this, I adjusted the tables in Excel
(using numeric values for days of the week and months), reimported them
into RStudio, and converted my x labels to a factor so I could use the
scale_x_discrete(labels=c()) function to
add my own axis labels.
* I also used the scale_y_continuous(labels=scale::comma)
command to have my y-axis labels as standard numbers rather than
scientific notation.
* Note: One graph with full code is shown below as a sample. The
remaining graphs appear in the Conclusion with code omitted for
brevity.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readxl)
AvTD_by_M <- read_excel("C:/Users/Master/Desktop/Summary_Tables/Avg-time-dist_by-month_14-Jan-23.xlsx")
AvTD_by_M %>%
ggplot(AvTD_by_M, mapping=aes(x=factor(month_num), y=number_of_rides,fill=month_num))+
geom_bar(stat="identity", show.legend = FALSE) +
facet_wrap(~member_casual) +
theme(axis.text.x = element_text(size = 7), axis.title.x = element_blank(), axis.text.y = element_text(size=7)) +
labs(title = "Number of Rides by Month", caption = "Rides from Sep 1, 2021 to Aug 31, 2022",y="") +
scale_x_discrete(labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")) +
scale_y_continuous(labels=scales::comma)
I then compared average ride time in minutes and average ride
distance in miles for casual and member riders, again across days, time,
and months.
* As above, one example of the code and resulting graph is
shown.
ylim.prim <- c(0, 30) # set bounds of first Y axis for ride time in minutes
ylim.sec <- c(0, 5) # set bounds of second Y axis for ride distance in miles
b <- diff(ylim.prim)/diff(ylim.sec)
a <- ylim.prim[1] - b*ylim.sec[1] # create two separate y axes
# plot time and distance on the same graph:
AvTD_by_Hr <- read_excel("C:/Users/Master/Desktop/Summary_Tables/Avg-time-dist_by-hour_14-Jan-23.xlsx")
ggplot(data = AvTD_by_Hr) +
geom_bar(mapping = aes(x = factor(ride_hour), y = average_ride_time, fill = average_ride_time), stat = "identity") +
geom_line(mapping = aes(x = ride_hour, y = a + average_ride_distance*b, color = 'darkred'), size = 2) +
facet_wrap(~member_casual) +
theme(axis.text.x = element_text(size = 7), axis.title.x = element_blank(), axis.text.y = element_text(size=5)) +
scale_x_discrete(labels=c("12A","","","3A","","","6A","","","9A","","","12P","","","3P","","","6P","","","9P","","")) +
scale_y_continuous("Avg. Ride Time", sec.axis = sec_axis(~ (. - a)/b, name = "Avg. Ride Distance")) +
labs(title = "Average Ride Time & Distance by Hour of Day", color = "Distance (miles)", fill = "Time (minutes)") +
scale_color_manual(values = 'darkred', labels = c(''))
The graphs show some similarities in the two groups of riders:
* Both groups use bikes more in summer months than in winter months -
unsurprising, given the Chicago climate.
* Both groups use bikes more in the second half of the day during
daylight hours, with a peak around 4-5PM.
However, the graphs highlight key differences as well.
Although members, like casual riders, took more trips on certain
months, days, or times, they showed shorter and less variable average
ride times and distances across all three factors than did casual
riders. Other notable differences include:
* Highest number of trips taken on weekdays, rather than weekends
* Peaks coinciding with the morning rush hour of 7-8AM and evening rush
hour of 4-6PM
* Very little variability in ride time or distance, and especially
steady ride lengths for weekdays (Monday through Friday)
This suggests that many members are using the bikes for commuting purposes, leading to short, predictable rides and higher weekday usage.
Conversely, casual riders showed longer rides and greater variability
on all measures: number of rides and average ride time or distance by
day, month, or hour. Usage also suggested that rides may correspond with
leisure activities:
* Highest number of rides and longer rides taken on the weekends
* Markedly higher ridership during Chicago’s temperate season from May
to October; while members also showed higher ridership during this time,
the variation is much more pronounced among casual riders, whose rides
dropped dramatically in number and significantly in length between
November and February
* A preference for afternoons and evenings, with ridership climbing
throughout the day and tapering to its lowest around 3-6AM
Interestingly, while members typically take more rides, casual riders overtake members in the number of rides on Saturdays and Sundays and between the hours of 10PM-2AM. This may suggest that some casual riders use the bikes late into the night to take advantage of Chicago’s vibrant dining and entertainment scene, and would be worth exploring further.
As mentioned earlier, while data is available for individual rides, none is available for individual riders. Visibility into how many trips a rider takes or how riders differ by demographics may lend valuable insights for marketing to or converting casual riders into members.
Additionally, the average ride distance measure was strongly correlated with average ride time when looking at days of the week, but less strongly correlated when looking across months or hours of the day. It’s possible that some adept riders travel a long distance in a short period of time, but also that the proxy ride distance metric was influenced by another factor such as the company physically redistributing bikes. Given this uncertainty, average ride time is the better measure of ride length for purposes of determining user habits.
Author’s Note: This case study was completed as part of the Google Data Analytics Certificate. I have chosen to share my findings as an RMarkdown document, rather than a business-friendly output such as a PowerPoint or Tableau dashboard, to test and expand my skills in R programming. Thank you for reading!