Introduction

For this project, I assume the role of a junior data analyst working on the marketing team of a fictional Chicago-based bike-sharing company named Cyclistic. Bikes are geotracked through a network of 692 stations across the city, and they can be unlocked from one station to be returned to any other station in the system. Previously, the company’s marketing strategy was based around building awareness to its brand and appealing to broad consumer bases. One of the main ways it accomplished this was through its flexible pricing plans, offering single-ride passes, full-ride passes, and annual memberships. Customers that purchase annual memberships are referred to as annual riders, or Cyclistic members, whereas those that purchase single-ride or full-day passes are referred to as casual riders.

Lily Moreno, the director of marketing at Cyclistic, claims that for future success, the company must maximize the amount of annual memberships purchased, as they are much more profitable for the company than its single-ride or full-day passes. While the company has historically focused on attracting new customers, Moreno now believes it must transition to converting existing casual riders into members. She also notes that casual riders are already aware of Cyclistic and have chosen the company for their mobility needs.

In order to achieve Moreno’s business goal of converting casual riders, we must understand how casual and annual riders differ, why casual riders would buy an annual membership, and how digital media could affect their marketing tactics. In this report, I analyze historical bike trip data to compare rider behavior, identify key differences between rider types, and propose data-driven business recommendations based on these findings. I will also provide the code I used to extract and visualize the provided data with explanations for my methodology.

Methods

Historical trip data is provided publicly by Divvy Bikes, a real bike-sharing company founded in 2013. [1] Like Cyclistic, Divvy is based in Chicago and uses a flexible pricing system with single-ride passes, full-day passes, and annual memberships. [2] According to Divvy’s website, each documented trip is anonymized, containing their start and end dates and times, start and end stations, and rider types. Additionally, the data is processed to remove trips taken by staff for servicing and inspection, as well as trips below 60 seconds in length, the latter of which is noted to potentially be due to false starts or attempts to re-dock bikes. [3] The following website provides its data for download: https://divvy-tripdata.s3.amazonaws.com/index.html

At the time of writing, data ranges from 2013 to December 2025, or the previous month. All trip data from 2013 is contained in one file, whereas trips from January 2014 to March 2020 are arranged by quarter. From April 2020 onward, data is separated by month. For this project, I will be using the most recent 12 months of data at the time of starting this project, ranging from December 2024 to November 2025.

For data validation and cleaning, I will be using SQL through SQLite due to its speed and each of use. For visualization, I will be programming in R in order to handle the large amounts of data and reproduce my steps. Microsoft Excel and Tableau were also considered for these tasks, but in my experience, they don’t seem to handle such large amounts of data spread across multiple files as easily as SQL and R.

Data validation

Prior to creating visualizations, I opted to observe and validate the provided datasets for understand what information is available and scan for inconsistent, incorrect, or missing information. This step is critical to ensure that any observed rider behavior reflects real usage patterns rather than data errors. I used SQLite to check my data and verify whether cleaning was required. I opted to use SQL due to its speed and ease of use compared to R, which I felt was important for working with such large datasets. Another key advantage to SQL compared to other tools is being able to combine all datasets into one (using the WITH keyword, in the case of SQLite), which would make the process simpler.

To begin, I checked each datasets’s columns to see what was contained and confirm that each used the same column names and data types. The following columns appear in each dataset:

With all columns confirmed to appear across all datasets, I used the following code to combine them into one table named all_trips. The following code shows all_trips in action with a simple example.

WITH all_trips AS (
    SELECT * FROM "tripdata_2024_12"
    UNION ALL
    SELECT * FROM "tripdata_2025_01"
    UNION ALL
    SELECT * FROM "tripdata_2025_02"
    UNION ALL
    SELECT * FROM "tripdata_2025_03"
    UNION ALL
    SELECT * FROM "tripdata_2025_04"
    UNION ALL
    SELECT * FROM "tripdata_2025_05"
    UNION ALL
    SELECT * FROM "tripdata_2025_06"
    UNION ALL
    SELECT * FROM "tripdata_2025_07"
    UNION ALL
    SELECT * FROM "tripdata_2025_08"
    UNION ALL
    SELECT * FROM "tripdata_2025_09"
    UNION ALL
    SELECT * FROM "tripdata_2025_10"
    UNION ALL
    SELECT * FROM "tripdata_2025_11"
)
SELECT *
FROM all_trips
LIMIT 10
Displaying records 1 - 10
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_casual
6C960DEB4F78854E electric_bike 2024-12-31 01:38:35.018 2024-12-31 01:48:45.775 Halsted St & Roscoe St TA1309000025 Clark St & Winnemac Ave TA1309000035 41.94363 -87.64908 41.97335 -87.66786 member
C0913EEB2834E7A2 classic_bike 2024-12-21 18:41:26.478 2024-12-21 18:47:33.871 Clark St & Wellington Ave TA1307000136 Halsted St & Roscoe St TA1309000025 41.93650 -87.64754 41.94363 -87.64908 member
848A37DD4723078A classic_bike 2024-12-21 11:41:01.664 2024-12-21 11:52:45.094 Sheridan Rd & Montrose Ave TA1307000107 Broadway & Barry Ave 13137 41.96167 -87.65464 41.93758 -87.64410 member
3FA09C762ECB48BD electric_bike 2024-12-26 13:07:27.526 2024-12-26 13:10:54.130 Aberdeen St & Jackson Blvd 13157 Green St & Randolph St* chargingstx3 41.87773 -87.65479 41.88360 -87.64863 member
E60317ADD1A87488 electric_bike 2024-12-13 15:17:55.063 2024-12-13 15:27:32.583 Paulina St & Flournoy St KA1504000104 Fairfield Ave & Roosevelt Rd KA1504000102 41.87306 -87.66913 41.86662 -87.69452 member
A83DBA54802B4618 electric_bike 2024-12-08 03:24:46.718 2024-12-08 03:33:21.733 Ravenswood Ave & Lawrence Ave TA1309000066 Kedzie Ave & Leland Ave KA1504000126 41.96845 -87.67423 41.96669 -87.70805 member
D3D6DAF0AB53F245 electric_bike 2024-12-08 18:56:35.855 2024-12-08 19:04:09.295 Paulina St & Flournoy St KA1504000104 Fairfield Ave & Roosevelt Rd KA1504000102 41.87306 -87.66913 41.86662 -87.69452 member
3309E4E902565E04 electric_bike 2024-12-09 16:26:13.301 2024-12-09 16:33:29.555 Paulina St & Flournoy St KA1504000104 Fairfield Ave & Roosevelt Rd KA1504000102 41.87306 -87.66913 41.86662 -87.69452 member
805D6F3B51AD91AF electric_bike 2024-12-19 16:30:17.751 2024-12-19 16:44:23.702 Paulina St & Flournoy St KA1504000104 Michigan Ave & 18th St 13150 41.87306 -87.66913 41.85793 -87.62434 member
8560580463C7E866 electric_bike 2024-12-21 06:38:40.043 2024-12-21 06:41:35.290 Rockwell St & Archer Ave 21379 Leavitt St & Archer Ave KA1503000068 41.82272 -87.68982 41.82879 -87.68060 member

With this temporary table, I made several checks in SQL to for “invalid” data, which would have to be observed and corrected before using it for analysis. The following checks were made:

  1. Check for null values in ride_id, rideable_type, started_at, ended_at, and member_casual.
  2. Check for misnamed, incorrect, or missing fields in rideable_type and member_casual.
  3. Check for duplicates in ride_id.
  4. Check for “near” duplicates, in which two rides have different IDs but the same details otherwise.
  5. Check for instances where the start time occurs prior to the end time, which can be found by subtracting start_time from end_time and receiving a negative number.
  6. Check for instances of latitude values ranging outside the range of -90 to 90 and longitude values ranging outside -180 to 180.
  7. In each table, check for start times occurring before their respective month or end times after their respective month.

Of these searches, two returned results. The first was with start times occurring prior to end times, with 29 ride times returning negative. All 29 of these rides took place between 1 AM and 2 AM on November 2, 2025, the end of daylight savings in 2025, which indicates that these datasets do not account for daylight savings time. To confirm this, I checked for rides within the hour of 2 AM on March 9, 2025, when daylight savings would have started, and found no results. The following code shows some of the negative ride times found in the data, except only using the data for November 2025 for brevity.

SELECT ride_id, started_at, ended_at, (((strftime('%s', ended_at) - strftime('%s', started_at))) / 60) AS ride_time
FROM "tripdata_2025_11"
WHERE ride_time < 0
LIMIT 5
5 records
ride_id started_at ended_at ride_time
5D010AFEA6850513 2025-11-02 01:52:37.475 2025-11-02 01:13:24.728 -39
D1E5316AD88ECD45 2025-11-02 01:50:39.702 2025-11-02 01:14:04.164 -36
3AF2F8908C9386F8 2025-11-02 01:57:57.512 2025-11-02 01:33:52.225 -24
19386939ECD81B33 2025-11-02 01:55:34.399 2025-11-02 01:21:31.873 -34
083534D28DA37F72 2025-11-02 01:17:57.001 2025-11-02 01:05:27.752 -12

The second was with start dates occurring prior to the months of their respective tables. As this did not occur for end dates, I believe this means that the end dates are used to determine which records fall under which months. For instance, a ride that ends in February is located in the dataset for February, even if it starts in January. As I plan to use the start month extensively in my R code, this detail will be important to keep in mind.

I also tested Divvy’s claim that rides shorter than 60 seconds are filtered out of the data. Surprisingly, 146,319 such records were found across the twelve months of data tested. The following code shows some of these ride times found in the data, except only using the data for December 2024 for brevity.

SELECT ride_id, started_at, ended_at, ((strftime('%s', ended_at) - strftime('%s', started_at))) AS ride_time
FROM "tripdata_2024_12"
WHERE ride_time < 60
LIMIT 5
5 records
ride_id started_at ended_at ride_time
BA67BF241B42DDBB 2024-12-30 10:50:58.682 2024-12-30 10:51:11.515 13
CACDB5A45C5B97B7 2024-12-11 07:42:26.076 2024-12-11 07:42:30.976 4
157D5B77488C4A24 2024-12-08 20:49:18.584 2024-12-08 20:49:38.808 20
3FB5A2AE01F84DBE 2024-12-06 21:19:00.072 2024-12-06 21:19:04.975 4
1EE618D8B7B34257 2024-12-22 01:51:02.496 2024-12-22 01:51:10.149 8

Due to these excessively short rides remaining in the data, I decided to filter them out manually. As noted by Divvy, rides under 60 seconds are unlikely to represent meaningful behavior and are commonly associated with false starts, test unlocks, or docking. Additionally, I also wanted to apply this thinking to rides that are too long, which may represent rides ending incorrectly or bikes remaining docked for hours without being used. This is further substantiated when knowing that casual riders are only eligible to ride for up to one day, assuming they are paying for a full-day pass. With this in mind, I decided that a range between 1 minute and 24 hours would be a fair balance to not exclude too many legitimate rides.

Finally, when analyzing the maximum and minimum latitude and longitude values across all datasets, I found that the latitude ranged from 41.49 to 42.21, with the longitude ranging from -88.10 to -87.42. According to the map of Chicago, Illinois provided by the Chicago Data Portal, the city’s approximate boundaries range from a latitude of 41.64 to 42.02 and a longitude of -87.94 to -87.52. [4] From this, I can infer that all rides, even those that do not start or end at a valid Divvy station, remain inside or reasonably close to the Chicago metropolitan area. Because the case study focuses on behavioral differences rather than geographic distribution, spatial mapping was deemed outside the scope of this analysis.

With the data checked, I used SQL to find how many trips for each member type were included in the data while taking into account the necessary filters, namely that rides should last between 1 minute and 24 hours. The following code shows examples of rides meeting this threshold.

WITH all_trips AS (
    SELECT * FROM "tripdata_2024_12"
    UNION ALL
    SELECT * FROM "tripdata_2025_01"
    UNION ALL
    SELECT * FROM "tripdata_2025_02"
    UNION ALL
    SELECT * FROM "tripdata_2025_03"
    UNION ALL
    SELECT * FROM "tripdata_2025_04"
    UNION ALL
    SELECT * FROM "tripdata_2025_05"
    UNION ALL
    SELECT * FROM "tripdata_2025_06"
    UNION ALL
    SELECT * FROM "tripdata_2025_07"
    UNION ALL
    SELECT * FROM "tripdata_2025_08"
    UNION ALL
    SELECT * FROM "tripdata_2025_09"
    UNION ALL
    SELECT * FROM "tripdata_2025_10"
    UNION ALL
    SELECT * FROM "tripdata_2025_11"
)
SELECT member_casual as rider_type, COUNT(*) as trips
FROM all_trips
WHERE (strftime('%s', ended_at) - strftime('%s', started_at)) > 60 AND (strftime('%s', ended_at) - strftime('%s', started_at)) < 86400
GROUP BY rider_type
LIMIT 5
2 records
rider_type trips
casual 1925850
member 3512151

With 3,512,151 annual member trips and 1,925,850 casual rider trips, the main thing that became clear was that annual members rode almost twice as often than casual riders. This would be taken into account when comparing ride totals between the two.

R Code

With data validation complete, I moved onto using R for visualization. All R code used for the project will be shown, though certain changes are made in this report for the purposes of demonstration and brevity.

  • All code relating to ggsave, which is used to save plots to your computer, is removed.
  • Several for loops are used to run plotting functions multiple times to print multiple plots. A check has been added to these blocks to only print one plot each as a demonstration. Plots not shown in this report are available in the “plots” folder on the GitHub repository for this project.
  • Most comments are removed. As this report will explain each code block in depth when necessary, comments may thus become redundant.

Libraries

The following packages are required for certain functionality throughout the program:

  • tidyverse: Core data manipulation and visualization tools, including dplyr for data cleaning and aggregation and ggplot2 for all charts.
  • readr: For reading the monthly trip data CSV files.
  • lubridate: For date-time parsing and feature creation such as weekdays, months, and hours from timestamp columns.
  • scales: For formatting comma-separated axes and percentage labels in plots.
  • shadowtext: For adding outlined percentage labels to pie charts for readability.
  • extrafont: For use of the custom Verdana font in plots for readability.

The packages can now be loaded using:

library(tidyverse)
library(readr)
library(lubridate)
library(scales)
library(shadowtext)
library(extrafont)

Data loading

As previously stated, generated plots and the ensuing analysis will use trip datasets from Divvy, particularly those ranging from December 2024 to November 2025, covering twelve months prior to when this project was started. As each dataset is contained in a separate file, all will have to be loaded together.

First, the twelve files representing the twelve months of data were read using readr. This step potentially could have been automated, but I decided it was unnecessary, as this analysis is only intended to be carried out with these specific twelve files. Ensure the working directory is set to the folder containing these files before running.

sheet_2024_12 <- read.csv("202412-divvy-tripdata.csv")
sheet_2025_01 <- read.csv("202501-divvy-tripdata.csv")
sheet_2025_02 <- read.csv("202502-divvy-tripdata.csv")
sheet_2025_03 <- read.csv("202503-divvy-tripdata.csv")
sheet_2025_04 <- read.csv("202504-divvy-tripdata.csv")
sheet_2025_05 <- read.csv("202505-divvy-tripdata.csv")
sheet_2025_06 <- read.csv("202506-divvy-tripdata.csv")
sheet_2025_07 <- read.csv("202507-divvy-tripdata.csv")
sheet_2025_08 <- read.csv("202508-divvy-tripdata.csv")
sheet_2025_09 <- read.csv("202509-divvy-tripdata.csv")
sheet_2025_10 <- read.csv("202510-divvy-tripdata.csv")
sheet_2025_11 <- read.csv("202511-divvy-tripdata.csv")

After loading in the data, it is inputted into a list named trip_list, which will be used extensively throughout the project. Each table is given a name corresponding to its respective month in a YYYY-MM format for brevity.

trip_list <- list(
  '2024-12' = sheet_2024_12,
  '2025-01' = sheet_2025_01,
  '2025-02' = sheet_2025_02,
  '2025-03' = sheet_2025_03,
  '2025-04' = sheet_2025_04,
  '2025-05' = sheet_2025_05,
  '2025-06' = sheet_2025_06,
  '2025-07' = sheet_2025_07,
  '2025-08' = sheet_2025_08,
  '2025-09' = sheet_2025_09,
  '2025-10' = sheet_2025_10,
  '2025-11' = sheet_2025_11
)

As mentioned previously, the extrafont package is used to load the Verdana font, which is imported below.

font_import(pattern = "verdana", prompt = FALSE)
loadfonts(device = "win")

Data transformation

After being loaded, the data also needs to be processed in order to be used for visualization, and several factors would be taken into account based on the start and end times. One of the main factors would be that rides lasting less than 1 minute or longer than 24 hours would be excluded. However, for the sake of comparison and transparency, I still wanted to include this data, albeit plotted separately from the filtered data. Thus, rather than removing data outside the specified range, a new column named clean is created to distinguish data acceptable for the filtered plots from data that should only be included in the separate raw plots.

Secondly, daylight savings times are accounted for, namely for rides taking place around the 2 AM block on March 9 and November 2, 2025. For rides on March 9 that occur at what would be 2 AM, 60 minutes are subtracted from their ride lengths. For rides on November 2 that start before 1 AM and end after 2 AM, or for rides with negative ride lengths, 60 minutes are instead added. I am not able to account for rides that only start or end in the 1 AM block on that day, as the data alone makes it impossible to tell whether 1 AM refers to before or after the hour falling back.

Finally, I wanted to account for the fact that the end time is used to indicate which month a ride counts for rather than the start time, as I feel that the time a rider decides to start a trip is more relevant to this analysis than when they decide to end it. In addition to using the start time as the basis for several time-based fields, I will also be excluding rides that start prior to December 1, 2024 for consistency with the rest of the data.

Before processing the data, I set up several dates to use for it, namely the boundaries for the range of data being used (December 1, 2024 and November 30, 2025) and the start and dates and times for daylight savings.

analysis_start <- as.POSIXct("2024-12-01 00:00:00", tz = "UTC")
analysis_end <- as.POSIXct("2025-11-30 23:59:59", tz = "UTC")

daylight_spring_start <- as.POSIXct("2025-03-09 01:59:59", tz = "UTC")
daylight_spring_end <- as.POSIXct("2025-03-09 03:00:00", tz = "UTC")
daylight_fall_start <- as.POSIXct("2025-11-02 00:59:59", tz = "UTC")
daylight_fall_end <- as.POSIXct("2025-11-02 02:00:00", tz = "UTC")

In addition to the clean field, five additional fields are created. These include:

  • ride_length_min, the ride length in minutes based on the difference between the start and end times.
  • day_of_week, the day of the week at which each trip started.
  • month, the month at which each trip started.
  • start_hour, the hour of the day at which each trip started.
  • round_trip, a boolean value that determines whether the trip is a round trip based on whether the start and end stations are the same, excluding trips with no stations listed.

It should be noted that, although the month of each trip can be parsed using the names of each table in trip_list, this cannot work here due to the aforementioned fact that trips are categorized into each dataset by the months of their end dates rather than their start dates. Thus, the month column is instead derived from the started_at column.

The member_casual column is also changed so that values labeled “member”, which refers to annual members, are renamed to the more clear “Annual”. For consistency, “casual” is capitalized to “Casual”. Similarly, the rideable_type column is changed to capitalize and remove underscores from the two vehicle names, replacing them with the more readable “Classic Bike” and “Electric Bike”.

Note that this block of code takes a long time to execute, possibly due to the sheer volume of data being processed.

for (i in seq_along(trip_list)) {
  
  trip_list[[i]]$started_at <- ymd_hms(trip_list[[i]]$started_at, tz = "UTC")
  trip_list[[i]]$ended_at <- ymd_hms(trip_list[[i]]$ended_at, tz = "UTC")
  
  trip_list[[i]] <- trip_list[[i]] %>%
    filter(between(started_at, analysis_start, analysis_end))
  
  trip_list[[i]]$ride_length_min <-
    as.numeric(difftime(
      trip_list[[i]]$ended_at,
      trip_list[[i]]$started_at,
      units = "mins"
    ))
  
  trip_list[[i]]$ride_length_min <- 
    ifelse(trip_list[[i]]$started_at <= daylight_spring_start & trip_list[[i]]$ended_at >= daylight_spring_end,
           trip_list[[i]]$ride_length_min - 60,
           trip_list[[i]]$ride_length_min)
  
  trip_list[[i]]$ride_length_min <- 
    ifelse(trip_list[[i]]$started_at <= daylight_fall_start & trip_list[[i]]$ended_at >= daylight_fall_end,
           trip_list[[i]]$ride_length_min + 60,
           trip_list[[i]]$ride_length_min)
  
  trip_list[[i]]$ride_length_min <- 
    ifelse(trip_list[[i]]$ride_length_min < 0,
           trip_list[[i]]$ride_length_min + 60,
           trip_list[[i]]$ride_length_min)
  
  trip_list[[i]]$day_of_week <-
    wday(trip_list[[i]]$started_at, label = TRUE, week_start = 1)
  
  trip_list[[i]]$month <-
    as.Date(format(trip_list[[i]]$started_at, "%Y-%m-01"))
  
  trip_list[[i]]$start_hour <-
    hour(trip_list[[i]]$started_at)
  
  trip_list[[i]]$round_trip <-
    !is.na(trip_list[[i]]$start_station_id) &
    !is.na(trip_list[[i]]$end_station_id) &
    trip_list[[i]]$start_station_id == trip_list[[i]]$end_station_id
  
  trip_list[[i]]$rideable_type <-
    recode(trip_list[[i]]$rideable_type,
           classic_bike = "Classic Bike",
           electric_bike = "Electric Bike")
  
  trip_list[[i]]$member_casual <-
    as.character(recode(trip_list[[i]]$member_casual,
           member = "Annual",
           casual = "Casual"))
  
  trip_list[[i]]$clean <-
    ifelse(between(trip_list[[i]]$ride_length_min, 1, 1440), TRUE, FALSE)
}

Functions

Functions are all grouped together after the code for data transformation. However, for the sake of helping to explain upcoming visualization code, all plotting functions will be included in their respective sections where they are called, rather than putting them all together in this section.

Before any other functions are called, hour_labels is created for future plots based on hours of the day. This function maps numbers from 0 to 23 to times of day according to a 12-hour clock, such as mapping 3 to “3 AM” or 15 to “3 PM”. This will allow their respective y-axes to display clean, understandable times of day rather than the raw numbers.

hour_labels <- function(h) {
  paste0(
    ((h + 11) %% 12) + 1,
    ifelse(h < 12, " AM", " PM")
  )
}

Plot 1: Total number of trips per month

This function will be used to plot a line graph displaying the total bike trips for each month, separated into two lines depending on rider type. Two plots will be created, including one consisting only of filtered data and one combining it with raw data. The data argument is used to pass in the dataframe to be plotted. The type argument will be used to assign the raw data in the combined plot to be a dotted line or, for the plot containing only filtered data, to not include a redundant legend for the data version.

plot_trip_count <- function(data, type) {
  ggplot(data, aes(x = month, y = trips, color = member_casual, linetype = {{type}})) +
    geom_line(linewidth = 2) +
    geom_point() +
    scale_x_date(
      date_breaks = "1 month",
      date_labels = "%b %Y"
    ) +
    scale_y_continuous(
      labels = comma,
      limits = c(0, NA)
    ) +
    labs(
      title = "Total Bike Trips by Month",
      caption = "Filtered rides range from 1 minute to 24 hours",
      x = "Month",
      y = "Number of Trips",
      color = "Rider Type",
      linetype = "Data Version"
    ) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
      plot.caption = element_text(hjust = 1, vjust = 0),
      plot.caption.position = "plot",
      panel.grid.major = element_line(colour = "darkgray"),
      panel.grid.minor.y = element_line(colour = "gray85", linewidth = 0.3),
      panel.grid.minor.x = element_blank(),
      panel.background = element_rect(fill = "white"),
      axis.text.x = element_text(angle = 30, hjust = 1),
      text = element_text(family = "Verdana")
    )
}

First, the filtered dataframe is created using map_dfr. This table includes the total number of rides in all of trip_list for each month, separated by annual and casual riders.

Note that an additional pipe is used to sum the total trips once more, which is required due to the fact that pulling from multiple tables at once causes duplicate dates in the resulting dataframe, such as two different rows showing annual member trips in December 2024. The extra pipe cleans this up and properly results in only one row per date and member type pair.

trips_by_month <- map_dfr(trip_list, function(df) {
  df %>%
    filter(clean) %>%
    count(member_casual, month, name = "trips")
}) %>%
  group_by(member_casual, month) %>%
  summarise(
    trips = sum(trips),
    .groups = "drop"
  ) %>%
  mutate(version = "Filtered")

Next, the raw dataframe is created using a similar function. Note that filtered dataframes will always be created using filter(clean), whereas raw dataframes will lack this line.

trips_by_month_raw <- map_dfr(trip_list, function(df) {
  df %>%
    count(member_casual, month, name = "trips")
}) %>%
  group_by(member_casual, month) %>%
  summarise(
    trips = sum(trips),
    .groups = "drop"
  ) %>%
  mutate(version = "Raw")

For the combined plot, the raw and filtered data are binded together into one dataframe.

trips_by_month_compare <- bind_rows(trips_by_month_raw, trips_by_month)

Finally, the filtered and combined plots are created and printed. For the former, NULL is passed into the type argument to prevent an unnecessary legend from being created, whereas for the latter, the version column is passed in.

p <- plot_trip_count(trips_by_month, NULL)
print(p)

p <- plot_trip_count(trips_by_month_compare, version)
print(p)

Plot 2: Average ride time per month

This function will be used to plot a line graph displaying the average length of time for bike trips for each month, separated into two lines depending on rider type. The resulting plot will be very similar to those created previously, and the function is very similar as a result.

plot_monthly_average <- function(data, type) {
  ggplot(data, aes(x = month, y = minutes, color = member_casual, linetype = {{type}})) +
    geom_line(linewidth = 2) +
    geom_point() +
    scale_x_date(
      date_breaks = "1 month",
      date_labels = "%b %Y"
    ) +
    scale_y_continuous(
      labels = comma,
      limits = c(0, NA)
    ) +
    labs(
      title = "Average Ride Times by Month",
      caption = "Filtered rides range from 1 minute to 24 hours",
      x = "Month",
      y = "Average Ride Time (minutes)",
      color = "Rider Type",
      linetype = "Data Version"
    ) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
      plot.caption = element_text(hjust = 1, vjust = 0),
      plot.caption.position = "plot",
      panel.grid.major = element_line(colour = "darkgray"),
      panel.grid.minor.y = element_line(colour = "gray85", linewidth = 0.3),
      panel.grid.minor.x = element_blank(),
      panel.background = element_rect(fill = "white"),
      axis.text.x = element_text(angle = 30, hjust = 1),
      text = element_text(family = "Verdana")
    )
}

The creation of a filtered, raw, and combined dataframe is also almost the same as previously, with the main difference being that the mean of ride_length_min is taken rather than the count of all trips. Here, time_by_month contains filtered data, time_by_month_raw contains raw data, and time_by_month_compare contains both.

time_by_month <- map_dfr(trip_list, function(df) {
  df %>%
    filter(clean) %>%
    group_by(member_casual, month) %>%
    summarise(
      total_minutes = sum(ride_length_min, na.rm = TRUE),
      trips = sum(!is.na(ride_length_min)),
      .groups = "drop"
    )
}) %>%
  group_by(member_casual, month) %>%
  reframe(
    minutes = sum(total_minutes) / sum(trips),
    .groups = "drop"
  ) %>%
  mutate(version = "Filtered")

time_by_month_raw <- map_dfr(trip_list, function(df) {
  df %>%
    group_by(member_casual, month) %>%
    summarise(
      total_minutes = sum(ride_length_min, na.rm = TRUE),
      trips = sum(!is.na(ride_length_min)),
      .groups = "drop"
    )
}) %>%
  group_by(member_casual, month) %>%
  reframe(
    minutes = sum(total_minutes) / sum(trips),
    .groups = "drop"
  ) %>%
  mutate(version = "Raw")

time_by_month_compare <- bind_rows(time_by_month_raw, time_by_month)

Finally, the filtered and combined plots are created and printed, with the same arguments being passed into type.

p <- plot_monthly_average(time_by_month, NULL)
print(p)

p <- plot_monthly_average(time_by_month_compare, version)
print(p)

Plot 3: Total rides per week day

This function will be used to plot a bar graph for each month displaying the total amount of bike trips for each weekday, along with an additional bar graph combining all months, separated into two bars per day depending on rider type. Unlike with the previous plots, these bar graphs will be separated into filtered and raw data rather than filtered and combined data, due to the bar graph format not working cleanly with two sets of data like this.

month_date is used to pass in the month each graph pertains to, which is to be used to determine both the label and the structure of the graph itself. For the latter, ifelse statements are used to expand the boundaries when “All” is passed in, as this will pertain to the much larger graph of all months combined. Additionally, version_label will provide the appropriate caption depending on whether the raw or filtered data is displayed.

Of particular note in this graph is the use of geom_rect to create a shaded rectangle over the weekends of Saturday and Sunday, which are located on the right side of each graph. This is meant to highlight those particular days, as behavior seems to differ for both types of riders on these days.

plot_weekday_month <- function(data, month_date, version_label) {
  
  limit = ifelse(as.character(month_date) == "All", 600000, 90000)
  spaces = ifelse(as.character(month_date) == "All", 200000, 20000)
  
  ggplot(data, aes(x = day_of_week, y = trips, fill = member_casual)) +
    geom_rect(
      data = data.frame(
        xmin = 5.5,
        xmax = 7.5,
        ymin = -Inf,
        ymax = Inf
      ),
      aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
      fill = "gray",
      alpha = 0.5,
      inherit.aes = FALSE
    ) +
    
    geom_col(position = "dodge", color = "black", linewidth = 0.2) +
    scale_y_continuous(
      labels = comma,
      expand = expansion(mult = c(0, 0.05)),
      limits = c(0, limit),
      breaks = seq(0, limit, by = spaces)
    ) +
    labs(
      title = paste(
        "Total Rides by Weekday —",
        ifelse(as.character(month_date) == "All", "All", format(month_date, "%B %Y"))
      ),
      caption = version_label,
      x = "Day of Week",
      y = "Rides",
      fill = "Rider Type"
    ) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
      plot.caption = element_text(hjust = 1, vjust = 0),
      plot.caption.position = "plot",
      panel.grid.major = element_line(colour = "grey40"),
      panel.grid.minor.y = element_line(colour = "gray85", linewidth = 0.3),
      panel.grid.minor.x = element_blank(),
      panel.background = element_blank(),
      plot.background = element_rect(fill = "white", color = NA),
      axis.line = element_line(color = "black", linewidth = 0.3),
      axis.ticks = element_line(color = "black", linewidth = 0.3),
      text = element_text(family = "Verdana")
    )
}

A slightly different approach is taken for the dataframes due to the differences in how the plots are constructed. Here, the raw and filtered data are mapped together as one dataframe, with each listing the total number of trips for each weekday.

weekday_summary <- map_dfr(trip_list, function(df) {
  
  bind_rows(
    df %>%
      count(day_of_week, month, member_casual, name = "trips") %>%
      mutate(version = "Raw"),

    df %>%
      filter(clean) %>%
      count(day_of_week, month, member_casual, name = "trips") %>%
      mutate(version = "Filtered")
  )
})

Starting with the raw data, a for loop is created to go through each month and create individual plots for total rides per weekday. Note that “Includes all data” is passed into version_label to indicate that this is the unfiltered data being plotted.

i <- 0

for (m in sort(unique(weekday_summary$month))) {
  m_date <- as.Date(m, origin = "1970-01-01")
  
  plot_data <- weekday_summary %>%
    filter(month == m_date, version == "Raw")
  
  p <- plot_weekday_month(plot_data, m_date, "Includes all data")
  
  if (i == 5) {
    print(p)
  }
  i <- i + 1
}

Next, a new dataframe called weekday_summary_raw is created to compile all months worth of trips in the raw data into one set.

weekday_summary_raw <- weekday_summary %>%
  filter(version == "Raw") %>%
  group_by(day_of_week, member_casual) %>%
  summarise(trips = sum(trips), .groups = "drop")
p <- plot_weekday_month(weekday_summary_raw, "All", "Includes all data")
print(p)

A similar for loop is then used for the filtered data to plot each of its months of data. Once again, note that “Data ranges from 1 minute to 24 hours” is passed into version_label.

i <- 0

for (m in sort(unique(weekday_summary$month))) {
  m_date <- as.Date(m, origin = "1970-01-01")
  
  plot_data <- weekday_summary %>%
    filter(month == m_date, version == "Filtered")
  
  p <- plot_weekday_month(plot_data, m_date, "Data ranges from 1 minute to 24 hours")
  
  if (i == 5) {
    print(p)
  }
  i <- i + 1
}

As before, another new dataframe called weekday_summary_filter compiles all months worth of trips in the filtered data.

weekday_summary_filter <- weekday_summary %>%
  filter(version == "Filtered") %>%
  group_by(day_of_week, member_casual) %>%
  summarise(trips = sum(trips), .groups = "drop")
p <- plot_weekday_month(weekday_summary_filter, "All", "Data ranges from 1 minute to 24 hours")
print(p)

Plot 4: Round trips vs one-way trips

This function will be used to plot a pie chart comparing the percentages of round trips to one-way trips both annual and casual riders take across the full dataset. To differentiate from the red and blue colorations used for annual and casual riders in previous graphs, the complementary colors yellow and purple are instead used for round and one-way trips, respectively.

Of particular note is the fact that, because the data contains significantly more annual rides than casual rides, setting the scales parameter in facet_wrap to “fixed” will display a gap in the casual riders’ chart to account for missing rides in comparison to the full circle used for annual members’ chart. If “free” were used instead, then both charts would show a full circle. Thus, I decided to include both while passing the desired setting into the void argument, as I felt that showing the difference in rider counts helps to better illustrate the comparison.

This is also the only plot where the shadowtext library is used, particularly through geom_shadowtext. This creates a black outline around the white text to be used for the percentages on each chart for readability.

plot_round_trips <- function(data, void, version_label) {
  ggplot(data, aes(x = 1, y = trips, fill = trip_type)) +
    geom_col(width = 1, color = "black", linewidth = 1) +
    coord_polar(theta = "y") +
    facet_wrap(~ member_casual, scales = {{void}}) +
    scale_y_continuous(labels = comma) +
    labs(
      title = "Round Trips vs One-Way Trips",
      caption = version_label,
      fill = "Trip Type"
    ) +
    geom_shadowtext(
      aes(label = percent(percent, accuracy = 1)),
      position = position_stack(vjust = 0.5),
      color = "white",
      bg.color = "black",
      bg.r = 0.1,
      size = 5,
      fontface = "bold"
    ) +
    scale_fill_manual(
      values = c(
        "Round trip" = "yellow",
        "One-way trip" = "purple"
      )
    ) +
    theme(
      axis.title = element_blank(),
      axis.text  = element_blank(),
      axis.ticks = element_blank(),
      strip.text = element_text(size = 12, face = "bold"),
      plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
      plot.caption = element_text(hjust = 1, vjust = 0),
      plot.caption.position = "plot",
      panel.background = element_rect(fill = "white"),
      text = element_text(family = "Verdana")
    )
}

This time, both the raw and filtered data will have an additional filter to remove any data where the start or end station is not listed, as both are required to tell what kind of trip it is. While it would be possible to make assumptions based on the start and end latitude and longitude values in the data, I feel that not all rides considered round trips will start and end on exactly the same coordinates, and using an arbitrary metric to determine how far from their starting position a rider must end on to count as a one-way trip may lead to inaccuracies.

round_trip_summary <- map_dfr(trip_list, function(df) {
  
  bind_rows(
    df %>%
      filter(
        !is.na(start_station_id),
        !is.na(end_station_id),
      ) %>%
      count(member_casual, round_trip, name = "trips") %>%
      mutate(version = "Raw"),
    
    df %>%
      filter(
        !is.na(start_station_id),
        !is.na(end_station_id),
        clean
      ) %>%
      count(member_casual, round_trip, name = "trips") %>%
      mutate(version = "Filtered")
  )
}) %>%
  group_by(member_casual, round_trip, version) %>%
  summarise(
    trips = sum(trips),
    .groups = "drop"
  ) %>%
  mutate(
    member_casual = ifelse(member_casual == "Annual", "Annual riders", "Casual riders"),
    trip_type = ifelse(round_trip, "Round trip", "One-way trip")
  )

For both the raw and filtered dataframes, a new column named percent is added. This contains the amount of round or one-way trips for each member type divided by their total number of trips. On the pie chart, these will be displayed as percent values.

# Raw data
round_trip_summary_raw <- round_trip_summary %>%
  filter(version == "Raw") %>%
  group_by(member_casual) %>%
  mutate(percent = trips / sum(trips))

# Filtered data
round_trip_summary_filtered <- round_trip_summary %>%
  filter(version == "Filtered") %>%
  group_by(member_casual) %>%
  mutate(percent = trips / sum(trips))

Next, the raw charts are plotted. As mentioned, one will use “fixed” for its scales and the other will use “free”. For version_label, the captions have to be edited to account for the fact that rides with unknown stations are not included.

# Because there are many more annual riders than casual, "free" plot has big gap
# Each plot will be printed with and without said gap for transparency
p <- plot_round_trips(round_trip_summary_raw, "fixed", "Data does not include rides with unknown stations")
print(p)

p <- plot_round_trips(round_trip_summary_raw, "free", "Data does not include rides with unknown stations")
print(p)

Finally, the filtered charts are made in a similar fashion.

p <- plot_round_trips(round_trip_summary_filtered, "fixed", "Data does not include rides with unknown stations and ranges from 1 minute to 24 hours")
print(p)

p <- plot_round_trips(round_trip_summary_filtered, "free", "Data does not include rides with unknown stations and ranges from 1 minute to 24 hours")
print(p)

Plot 5: Median ride length by hour per weekday

This function will be used to plot a heat map comparing the median ride times at all times of day for each weekday. It was considered to do this with average ride times instead, but as this lead to comparably subtle differences in ride length across the chart, medians were preferred to gather more clear insights.

As this is the first plot to use times of day, it is also the first to take advantage of hour_labels for its x-axis, which will display clear times of day. A shaded area over the weekends similar to with Plot 3 was considered as well, but I decided not to include it because the effect does not come off as pronounced on a heat map as it does on a bar graph due to the comparative lack of empty space.

plot_median_ride <- function(data, version_label) {
  ggplot(data, aes(start_hour, day_of_week, fill = minutes)) +
    geom_tile() +
    facet_wrap(~ member_casual) +
    scale_fill_viridis_c(name = "Median Minutes") +
    scale_x_continuous(
      breaks = seq(0, 23, by = 3),
      minor_breaks = 0:23,
      labels = hour_labels
    ) +
    labs(
      title = "Median Ride Length by Time and Day",
      caption = version_label,
      x = "Hour of Day",
      y = "Day of Week"
    ) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
      plot.caption = element_text(hjust = 1, vjust = 0),
      plot.caption.position = "plot",
      panel.background = element_rect(fill = "white"),
      axis.text.x = element_text(angle = 30, hjust = 1),
      text = element_text(family = "Verdana")
    )
}

For the dataframes, I decided against sorting by month and gathering a separate plot for each one, as I felt one heat map would be busy and informative enough as is. The day of the week and start hour are both required for grouping here, and member names are once again made into a more readable form for headers.

time_by_hour <- map_dfr(trip_list, function(df) {
  bind_rows(
    df %>%
      group_by(member_casual, day_of_week, start_hour) %>%
      summarise(
        minutes = median(ride_length_min, na.rm = TRUE),
        .groups = "drop"
      ) %>%
      mutate(
        member_casual = ifelse(member_casual == "Annual", "Annual riders", "Casual riders"),
        version = "Raw"
      ),
    df %>%
      filter(clean) %>%
      group_by(member_casual, day_of_week, start_hour) %>%
      summarise(
        minutes = median(ride_length_min, na.rm = TRUE),
        .groups = "drop"
      ) %>%
      mutate(
        member_casual = ifelse(member_casual == "Annual", "Annual riders", "Casual riders"),
        version = "Filtered"
      )
  )
})

Finally, the raw and filtered plots are printed after separating them into dataframes filtered by their version columns.

# Raw version
time_by_hour_raw <- time_by_hour %>%
  filter(version == "Raw")
p <- plot_median_ride(time_by_hour_raw, "Includes all data")
print(p)

# Filtered version
time_by_hour_filtered <- time_by_hour %>%
  filter(version == "Filtered")
p <- plot_median_ride(time_by_hour_filtered, "Data ranges from 1 minute to 24 hours")
print(p)

Plot 6: Number of trips by hour per weekday

This function will be used to plot a line chart for each weekday showing the number of trips taken for each hour of the day. Like with the other line charts, this will have both a filtered version and a combined version with both filtered and raw data, so the type parameter is used to the same effect as before. weekday_label is also used for the headers, and hour_labels is used for the x-axis again.

plot_hourly_day <- function(data, weekday_label, type) {
  
  ggplot(data, aes(x = start_hour, y = trips, color = member_casual, linetype = {{type}})) +
    geom_line(linewidth = 2) +
    geom_point(size = 2) +
    scale_x_continuous(
      breaks = seq(0, 23, by = 3),
      minor_breaks = 0:23,
      labels = hour_labels
    ) +
    scale_y_continuous(
      expand = expansion(mult = c(0, 0.05)),
      limits = c(0, 80000)
    ) +
    labs(
      title = paste("Trips by Hour —", weekday_label),
      caption = "Filtered rides range from 1 minute to 24 hours",
      x = "Time of Day",
      y = "Number of Trips",
      color = "Rider Type",
      linetype = "Data Version"
    ) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
      plot.subtitle = element_text(hjust = 0.5),
      plot.caption = element_text(hjust = 1, vjust = 0),
      panel.grid.major = element_line(colour = "darkgray"),
      panel.grid.minor = element_line(colour = "gray85", linewidth = 0.3),
      panel.background = element_rect(fill = "white"),
      text = element_text(family = "Verdana")
    )
}

For the filtered dataframe, map_dfr is used to group all trips by day of week and start hour, and that data is summed again through a pipe to get clean sum values.

trips_by_hour <- map_dfr(trip_list, function(df) {
  df %>%
    filter(clean) %>%
    count(member_casual, day_of_week, start_hour, name = "trips") %>%
    mutate(
      version = "Filtered"
    )
}) %>%
  group_by(member_casual, day_of_week, start_hour, version) %>%
  summarise(
    trips = sum(trips),
    .groups = "drop"
  )

The raw dataframe is processed similarly, and with both created, they are combined into one.

trips_by_hour_raw <- map_dfr(trip_list, function(df) {
  df %>%
    count(member_casual, day_of_week, start_hour, name = "trips") %>%
    mutate(
      version = "Raw"
    )
}) %>%
  group_by(member_casual, day_of_week, start_hour, version) %>%
  summarise(
    trips = sum(trips),
    .groups = "drop"
  )

trips_by_hour_compare <- bind_rows(trips_by_hour_raw, trips_by_hour)

For the filtered dataframe, a for loop is run to produce a chart for each separate weekday. Note that NULL is being passed into type to prevent an unnecessary legend from being created.

i <- 0

for (d in levels(trips_by_hour$day_of_week)) {
  
  plot_data <- trips_by_hour %>%
    filter(day_of_week == d)
  
  p <- plot_hourly_day(plot_data, d, NULL)
  
  if (i == 5) {
    print(p)
  }
  i <- i + 1
}

A similar for loop is created for the combined data, except with version passed into type to have dotted lnes for the raw data.

i <- 0

for (d in levels(trips_by_hour_compare$day_of_week)) {
  
  plot_data <- trips_by_hour_compare %>%
    filter(day_of_week == d)
  
  p <- plot_hourly_day(plot_data, d, version)
  
  if (i == 5) {
    print(p)
  }
  i <- i + 1
}

Plot 7: Total rides per vehicle type

The final piece of data I decided to focus on for the plots is vehicle type, with each trip having using a Classic Bike or Electric Bike. This function plots a bar chart for each month of the year, along with a year-round total, showing the number of rides per vehicle type for annual and casual riders. The month_date and version_label parameters function the same as they do in Plot 3.

plot_vehicle_count <- function(data, month_date, version_label) {
  
  limit = ifelse(as.character(month_date) == "All", 2500000, 300000)
  spaces = ifelse(as.character(month_date) == "All", 500000, 100000)
  
  ggplot(data, aes(x = rideable_type, y = trips, fill = member_casual)) +
    geom_col(position = "dodge", color = "black", linewidth = 0.2) +
    scale_y_continuous(
      labels = comma,
      expand = expansion(mult = c(0, 0.05)),
      limits = c(0, limit),
      breaks = seq(0, limit, by = spaces)
    ) +
    labs(
      title = paste(
        "Total Rides by Vehicle —",
        ifelse(as.character(month_date) == "All", "All", format(month_date, "%B %Y"))
      ),
      caption = version_label,
      x = "Vehicle",
      y = "Rides",
      fill = "Rider Type"
    ) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
      plot.caption = element_text(hjust = 1, vjust = 0),
      plot.caption.position = "plot",
      panel.grid.major = element_line(colour = "darkgray"),
      panel.grid.minor = element_line(colour = "gray85", linewidth = 0.3),
      panel.background = element_blank(),
      plot.background = element_rect(fill = "white", color = NA),
      axis.line = element_line(color = "black", linewidth = 0.3),
      axis.ticks = element_line(color = "black", linewidth = 0.3),
      text = element_text(family = "Verdana")
    )
}

Next, a dataframe is made using map_dfr adding up all trips for each type of vehicle.

trips_by_ride <- map_dfr(trip_list, function(df) {
  df %>%
    count(rideable_type, month, member_casual, clean, name = "trips")
})

For the raw and filtered dataframes, the trip counts are summed once again for cleaner data.

trips_by_ride_raw <- trips_by_ride %>%
  group_by(rideable_type, month, member_casual) %>%
  summarise(trips = sum(trips), .groups = "drop")

trips_by_ride_filtered <- trips_by_ride %>%
  filter(clean) %>%
  group_by(rideable_type, month, member_casual) %>%
  summarise(trips = sum(trips), .groups = "drop")

For the raw data, a for loop is used to generate a plot for each month, followed by an additional plot for the total.

i <- 0

for (m in sort(unique(trips_by_ride_raw$month))) {
  m_date <- as.Date(m, origin = "1970-01-01")
  
  plot_data <- trips_by_ride_raw %>%
    filter(month == m_date)
  
  p <- plot_vehicle_count(plot_data, m_date, "Includes all data")
  
  if (i == 5) {
    print(p)
  }
  i <- i + 1
}

trips_by_ride_raw <- trips_by_ride_raw %>%
  group_by(rideable_type, member_casual) %>%
  summarise(trips = sum(trips), .groups = "drop")
p <- plot_vehicle_count(trips_by_ride_raw, "All", "Includes all data")
print(p)

Similar steps are performed for the filtered data.

i <- 0

for (m in sort(unique(trips_by_ride_filtered$month))) {
  m_date <- as.Date(m, origin = "1970-01-01")
  
  plot_data <- trips_by_ride_filtered %>%
    filter(month == m_date)
  
  p <- plot_vehicle_count(plot_data, m_date, "Data ranges from 1 minute to 24 hours")
  
  if (i == 5) {
    print(p)
  }
  i <- i + 1
}

trips_by_ride_filtered <- trips_by_ride_filtered %>%
  group_by(rideable_type, member_casual) %>%
  summarise(trips = sum(trips), .groups = "drop")
p <- plot_vehicle_count(trips_by_ride_filtered, "All", "Data ranges from 1 minute to 24 hours")
print(p)

Plot 8: Total rides per vehicle type (weekdays)

The final function is very similar to the one used previously, except instead of plotting per month, it will be plotting per weekday. I’d attempted to combine this with Plot 7 but ran into complications, and I felt that it would also be easier making a separate function to give unique parameters for the size of each plot. Also note that limit and spaces do not use ifelse statements, as a total plot will not be produced from this function.

plot_vehicle_count_week <- function(data, weekday, version_label) {
  
  limit = 400000
  spaces = 100000
  
  ggplot(data, aes(x = rideable_type, y = trips, fill = member_casual)) +
    geom_col(position = "dodge", color = "black", linewidth = 0.2) +
    scale_y_continuous(
      labels = comma,
      expand = expansion(mult = c(0, 0.05)),
      limits = c(0, limit),
      breaks = seq(0, limit, by = spaces)
    ) +
    labs(
      title = paste(
        "Total Rides by Vehicle —", weekday
      ),
      caption = version_label,
      x = "Vehicle",
      y = "Rides",
      fill = "Rider Type"
    ) +
    theme(
      plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
      plot.caption = element_text(hjust = 1, vjust = 0),
      plot.caption.position = "plot",
      panel.grid.major = element_line(colour = "darkgray"),
      panel.grid.minor = element_line(colour = "gray85", linewidth = 0.3),
      panel.background = element_blank(),
      plot.background = element_rect(fill = "white", color = NA),
      axis.line = element_line(color = "black", linewidth = 0.3),
      axis.ticks = element_line(color = "black", linewidth = 0.3),
      text = element_text(family = "Verdana")
    )
}

Setting up the dataframes is almost the same as it was before, except data is sorted by the day of the week rather than the month.

trips_by_ride_week <- map_dfr(trip_list, function(df) {
  df %>%
    count(rideable_type, day_of_week, member_casual, clean, name = "trips")
})

trips_by_ride_week_raw <- trips_by_ride_week %>%
  group_by(rideable_type, day_of_week, member_casual) %>%
  summarise(trips = sum(trips), .groups = "drop")

trips_by_ride_week_filtered <- trips_by_ride_week %>%
  filter(clean) %>%
  group_by(rideable_type, day_of_week, member_casual) %>%
  summarise(trips = sum(trips), .groups = "drop")

Finally, a for loop is used to create plots of the raw data for each weekday.

i <- 0

for (d in sort(unique(trips_by_ride_week_raw$day_of_week))) {
  plot_data <- trips_by_ride_week_raw %>%
    filter(day_of_week == d)
  
  p <- plot_vehicle_count_week(plot_data, d, "Includes all data")
  
  if (i == 5) {
    print(p)
  }
  i <- i + 1
}

A similar loop is used for the filtered data.

# Plot for each weekday (filtered)
i <- 0

for (d in sort(unique(trips_by_ride_week_filtered$day_of_week))) {
  plot_data <- trips_by_ride_week_filtered %>%
    filter(day_of_week == d)
  
  p <- plot_vehicle_count_week(plot_data, d, "Data ranges from 1 minute to 24 hours")
  
  if (i == 5) {
    print(p)
  }
  i <- i + 1
}

Station data

In addition to not using latitude and longitude data when conducting rider analysis and creating visualizations, I also opted not to compare riders by specific stations, as all available stations are located in Chicago and the dataset does not include additional contextual information about stations, such as neighborhood demographics or transit connections, that would support deeper station-level analysis. However, I do feel that determining the most popular stations can provide useful insights, such as for determining which stations to focus on first for recommended solutions to the business problem. Therefore, SQL was used to identify the most active stations across several behavioral categories.

The following code was run to create a new temporary table called clean_trips containing the filters and some of the data columns previously employed with R, as well as a filter to only include trips from casual riders.

WITH all_trips AS (
    SELECT * FROM "tripdata_2024_12"
    UNION ALL
    SELECT * FROM "tripdata_2025_01"
    UNION ALL
    SELECT * FROM "tripdata_2025_02"
    UNION ALL
    SELECT * FROM "tripdata_2025_03"
    UNION ALL
    SELECT * FROM "tripdata_2025_04"
    UNION ALL
    SELECT * FROM "tripdata_2025_05"
    UNION ALL
    SELECT * FROM "tripdata_2025_06"
    UNION ALL
    SELECT * FROM "tripdata_2025_07"
    UNION ALL
    SELECT * FROM "tripdata_2025_08"
    UNION ALL
    SELECT * FROM "tripdata_2025_09"
    UNION ALL
    SELECT * FROM "tripdata_2025_10"
    UNION ALL
    SELECT * FROM "tripdata_2025_11"
),

clean_trips AS (
    SELECT *,
    ((strftime('%s', ended_at) - strftime('%s', started_at)) / 60.0) AS ride_length_min,
    CASE
      WHEN start_station_id IS NOT NULL
       AND end_station_id IS NOT NULL
       AND start_station_id = end_station_id
      THEN 1
      ELSE 0
    END AS round_trip
    FROM all_trips
    WHERE
        member_casual = 'casual'
        AND ((strftime('%s', ended_at) - strftime('%s', started_at)) / 60.0) BETWEEN 1 AND 1440
        AND start_station_name IS NOT NULL
        AND TRIM(start_station_name) <> ''
        AND end_station_name IS NOT NULL
        AND TRIM(end_station_name) <> ''

)
SELECT *
FROM clean_trips

A minimum threshold of 500 rides per station was applied to ensure results reflected consistent usage rather than occasional or anomalous activity. Over a year, this corresponds to roughly 1–2 rides per day, indicating steady demand. Start stations were used rather than end stations for this analysis because they better represent where riders initiate trips, which is more relevant for marketing and rider conversion strategies.

First, the following code was run following the temporary table code to find the five stations with the most total rides. This would provide the most clear indication of which stations are the most active with casual riders overall.

SELECT
    start_station_name AS station,
    COUNT(*) AS casual_trips
FROM clean_trips
GROUP BY start_station_name
HAVING COUNT(*) >= 500
ORDER BY casual_trips DESC
LIMIT 5
5 records
station casual_trips
DuSable Lake Shore Dr & Monroe St 28723
Navy Pier 24532
Streeter Dr & Grand Ave 22127
Michigan Ave & Oak St 20056
DuSable Lake Shore Dr & North Blvd 17310

Next, the following code was run to find the five stations with the highest average ride times across all rides. As casual riders have been observed to ride for longer periods of time, which may indicate more leisure-oriented riding behavior, these stations would make sense to market to that kind of behavior.

SELECT
    start_station_name AS station,
    ROUND(AVG(ride_length_min), 1) AS avg_ride_length
FROM clean_trips
GROUP BY start_station_name
HAVING COUNT(*) >= 500
ORDER BY avg_ride_length DESC
LIMIT 5
5 records
station avg_ride_length
Fort Dearborn Dr & 31st St 46.7
Lake Park Ave & 35th St 44.7
63rd St Beach 42.8
Lakefront Trail & Wilson Ave 41.3
Griffin Museum of Science and Industry 41.1

Finally, the following code was run to find the five stations with the most round trips relative to one-way-trips, calculated as a percentage. A high proportion of round trips may indicate recreational usage, such as sightseeing or exercise rides, where riders return to their starting location.

SELECT
    start_station_name AS station,
    ROUND(100.0 * SUM(round_trip) / COUNT(*), 1) AS round_percent
FROM clean_trips
GROUP BY start_station_name
HAVING COUNT(*) >= 500
ORDER BY round_percent DESC
LIMIT 5
5 records
station round_percent
Lake Park Ave & 35th St 39.3
Wells St & 19th St 31.1
63rd St Beach 31.1
California Ave & Montrose Ave 25.2
University Library (NU) 24.1

Analysis

Raw versus filtered data

Ultimately, 90 plots were generated through R including 45 plots of purely filtered data, 36 plots of purely raw data, and 9 plots directly comparing both. Before observing the trends and patterns present in these plots, the first thing I inspected was whether any substantial difference in behavior could be observed between these different types of plots. What I ultimately found was that, while the increased number of trips present in the raw data can lead to noticeable changes in the plots, these changes serve only to reinforce existing trends and differences between rider types rather than imply substantially different behaviors. An example of this can be seen in the combined plot for average ride time per month, shown below.

2-average_timed_compare.png
2-average_timed_compare.png

From this plot, it can be seen that casual riders in the raw data have consistently higher average ride times than they do in the filtered data. For instance, January 2025 shows an average ride time of about 12.5 minutes in the filtered set, whereas the raw set shows about 15 minutes. Thus, although the raw data in this case can be used to show a larger difference in behavior between rider types, both the raw and filtered data draw the same conclusions regarding relative rider behavior. For this reason, the remainder of the analysis relies on filtered data.

Rides Per Month

The first plot produced was a line plot mapping the number of rides each rider type takes per month, shown below. I feel that showing how often rides are taken is the simplest and most straightforward way to demonstrate the differences in riders’ behavior.

1-total_trips_filtered.png
1-total_trips_filtered.png

This plot shows similarly shaped series for annual and casual riders, except with annual riders generally having about 100,000 more rides per month. Both annual and casual riders are shown to ride the least during winter months from December to February, with January marking the lowest point for both riders. This is followed by a rise during warmer months from March to August, with August being the peak month for both riders. Finally, the autumn months of October and November mark a sharp decline.

A key difference can be seen in September, which marks the end of summer and the beginning of autumn. While casual riders sharply decline during this month, annual riders remain only slightly lower than their peak all the way through October. Additionally, causal riders seem to have a slightly sharper rise during June, when summer would start. Both of these insights suggest that casual riders are more weather and season-dependent than annual riders, particularly preferring to ride during the warmer summer months, whereas annual riders remain more consistent through autumn.

Average Ride Times

The second plot produced was another line plot, this time comparing the average ride times in minutes by month between annual and casual riders, shown below. Measuring ride times helps to show more specific ride behavior between each rider type, particularly the reasons why they would be taking these rides.

2-average_timed_filtered.png
2-average_timed_filtered.png

A much more noticeable difference can be seen between the two rider types. Whereas annual riders range between about 10 and 13 minutes per month, casual riders range between about 12 and 24 minutes. This shows that casual riders not only take longer rides in general, with their peak month having an average almost twice as long as with annual riders’ peak month, but their ride times also fluctuate much more sharply. Both rider types are at their lowest in the winter months of January and February, but casual riders have a sharp increase in March that stays consistently high until it declines starting in the autumn month of September.

These insights reinforce the point made with rides per month, in that casual riders are more seasonal than annual riders. Additionally, longer rides suggest that casual riders may be using their bikes for more recreational purposes, such as for fun, exercise, or sightseeing, which would likely occur more often during warmer months. On the other hand, the more consistent ride times of annual riders suggest that they are more likely to ride for practical reasons, such as for commuting to work, which is more routine and would likely take approximately the same amount of time for each ride.

Rides Per Weekday

The third set of plots shows the number of rides each rider type takes per weekday. Specific weekdays can provide further insight into why each rider type chooses to ride their bikes. The total across all months is shown below.

3-total_rides_by_weekday_filtered_total.png
3-total_rides_by_weekday_filtered_total.png

Across a full year, annual riders ride most often on weekdays, particularly Tuesday and Thursday, before declining starting on the weekend. On the contrary, casual riders ride most often on weekends, particularly Saturday, and decline more sharply during weekdays. These totals vary between months, but most follow a similar pattern where annual riders prefer weekdays and casual riders prefer weekends. The most dramatic examples occur in August and September, shown below.

August marks the highest single weekday of casual rider usage, with Saturday netting almost 80,000 combined rides. On the other hand, September marks the highest weekday of annual rider usage, with Tuesday exceeding 80,000 rides. As noted, the difference is not always this stark, such as in September when casual riders have roughly the same ride counts from Sunday to Monday. However, the plots as a whole further demonstrate a casual rider preference for recreation, as Saturday is the start of the weekend and an opportune time for such activity. Likewise, it further implies an annual rider preference for practicality, as the typical work week from Monday to Friday would indicate that bikes are most often used to commute to work.

Round-Trips Percentage

The fourth plot produced was a pie chart comparing the percentages of rides for both annual and casual riders that are round-trips compared to one-way trips, shown below. This is another way of directly inspecting ride behavior, as one-way trips are more likely to indicate purposes with a set destination in mind, such as commuting. Note that trips with missing start or end stations in the data are excluded, as it cannot be confirmed which type of ride they would be classified as.

4-round_trips_filtered_free.png
4-round_trips_filtered_free.png

This chart shows that both annual and casual riders primarily make one-way trips over round-trips. Casual riders take round-trips slightly more often than annual riders, which is shown to be 15% of casual trips compared to 10% of annual trips. To show this difference more accurately, the chart below scales the casual riders’ plot to the number of rides taken by annual riders.

4-round_trips_filtered_fixed.png
4-round_trips_filtered_fixed.png

Here, the large empty space in the casual riders’ plot represents their lower number of rides overall compared to annual riders, which is shown to be roughly half. This further demonstrates that while annual and casual riders take roughly similar amounts of round-trips throughout the year, casual riders take substantially fewer one-way trips, such as those for commuting. However, one-way trips still dominate for both groups overall.

Median Ride Times

The fifth plot produced was a heat map comparing the median ride times across all weekdays and times of day, shown below. Time of day is another factor that can be used to explain rider behavior, and while average ride times have already been tested, median ride times may help to counteract outliers in the data.

5-median-length-filtered.png
5-median-length-filtered.png

As a whole, this plot reiterates much of what has been established by previous plots, in that annual riders are more consistent overall, whereas casual riders fluctuate more at certain points. In particular, annual riders seem slightly more active around 8 AM and 5 PM on weekdays, the times at which a typical work day would start and end, respectively. On the other hand, casual riders are especially active between 10 AM and 6 PM on weekends, with their most active point being at 1 PM on Saturdays.

Rides Per Hour

The sixth set of plots shows the number of rides each rider type takes per hour on each weekday. This can be used to expand upon the insights of the previous heat map by keying in on more specific instances of high activity. The individual plots for the five weekdays are very similar to each other, as are the two weekends, and the two plots shown below demonstrate an example of each, with Tuesday and Saturday.

For the weekday plot of Tuesday, it can be seen that both annual riders and casual riders have peaks in ride counts at 8 AM and 5 AM, the times of day that work may start and end, respectively. However, annual riders have much sharer rises during these times and declines afterward. On Saturday, both annual and casual rider behaviors are similarly consistent from 11 AM to about 5 PM, when someone may partake in leisurely weekend activities. These trends further support the presumption of annual riders preferring to commute versus casual riders preferring leisure, though it also indicates that these behaviors are not mutually exclusive to each group, as both have similarly shaped series. For instance, because both riders peak at similar times on Tuesday, it can be assumed that a sizeable number of casual riders do use Cyclistic for commuting purposes, albeit at a smaller scale compared to annual riders.

Rides Per Vehicle

The seventh and eighth sets of plots show the number of rides each rider type takes with each vehicle type per month and weekday, respectively. As both classic and electric bikes are offered by the bike sharing company, the specific type of bikes being used could provide further details regarding rider behavior; for instance, electric bikes are often preferred for covering long distances, whereas classic bikes offer a more consistent workout. The plot shown below shows the total across the full year.

7-trips_per_vehicle_filtered_all.png
7-trips_per_vehicle_filtered_all.png

Based on this plot, both annual and casual riders ride electric bikes approximately twice as often as they ride classic bikes. This trend is consistent across all months and weekdays with negligible differences. Despite the distinct ways the two rider types are observed to ride their bikes, no behavioral differences between them can be observed based on their preferred vehicles. This shows that electric bikes are broadly preferred across all riders, though classic bikes still remain a popular option.

Key Findings

Analysis of the produced plots reveals clear and consistent behavioral differences between casual and annual riders, particularly in ride frequency, timing, and trip structure. These patterns form the basis for the key findings and business implications discussed below.

1. Seasonal variance

Casual riders show greater variability based on season and weather, riding for longer and more often during warmer summer months. Annual riders are more consistent throughout the year by comparison. This suggests that casual riders may not see value in committing to a year-long plan, as they are much less likely to ride during colder months.

2. Recreation versus routine

Casual riders primarily use Cyclistic for leisure-oriented trips, riding more on weekends. Annual riders instead prefer practicality, riding more often on weekdays. This suggests that annual memberships are better-aligned with the more routine needs of annual members, such as commuting to work, than the more recreational needs of casual riders, such as exercise.

3. Ride consistency

Annual riders display more consistent ride behavior than casual riders, such as through their more frequent one-way trips and high preference to specific hours of the weekday. This suggests that casual riders who already exhibit similar time-based riding patterns may be strong candidates for conversion to annual memberships.

4. Electric bike preferences

Electric bikes are preferred by both annual and casual riders, indicating broad appeal across usage styles. This suggests that e-bikes can be leveraged as a shared incentive for membership conversion.

Recommendations

The analysis reveals that casual and annual riders use Cyclistic in highly distinct ways. This creates an opportunity for Cyclistic to tailor its marketing and pricing strategies to better convert casual riders into members. The following recommendations focus on increasing perceived value, reducing commitment barriers, and aligning membership benefits with casual rider behavior.

1. Running seasonal promotions

As casual riders show heightened activity during warm months and weekends, targeting these high-activity periods would likely maximize marketing return on investment. Offering ride credits, free trials, or promotional codes for an annual membership during casual riders’ peak activity could lead to higher conversion, as they would be targeted when demand for Cyclistic’s sevices already exists. Because of the limited-time nature of these promotions, they would also be relatively simple to test the effectiveness of without great risk.

However, several risks do come with this action, such as reducing revenue from riders paying full price and worsening bike availability while these promotions are active. Additionally, this measure would only help to bolster high-usage periods rather than low-usage periods, such as winter months. The success of this strategy could be measured through promotion redemption rates, membership sign-ups during campaign windows, and retention of converted riders.

2. Recreational marketing

If casual riders feel that an annual membership does not align with their preferred ways of riding, then changing how the service is marketed to them may provide more effective messaging for them. This does not require replacing existing commuter-focused messaging, but rather segmenting campaigns to better appeal to different rider motivations. Focusing on recreational behavior such as weekend outings, sightseeing, or exercise-based trips while emphasizing membership perks such as unlimited rides would more closely match with common casual trends, and it would also help to expand Cyclistic’s brand positioning beyond the market it currently controls.

It is worth noting that due to the high number of current members that do not match the inferred casual behavior, a shift in marketing may alienate existing customers who have chosen Cyclistic for practical needs such as commuting to work. Additionally, the aforementioned seasonal heights of casual rider activity could indicate that this marketing shift may not be effective during months where casual riders are not otherwise active. Effectiveness could be evaluated through engagement metrics on targeted campaigns and conversion rates among recreational riders.

3. Low-committal passes

The variability of casual rider behavior suggests that they may be hesitant to invest in a full year-long plan. Creating intermediary passes between the existing full-day and full-year offers, such as a “summer pass” lasting from June to August, could reduce the friction in their purchasing decisions. The shorter time spans of these passes would also make them easier to price-test by design. Riders that buy into these passes may also feel more inclined to becoming annual members, especially if they offer the option to convert to an annual plan.

On the contrary, this may also run the risk of cannibalizing annual memberships and lowering per-rider revenue if too many casual riders do not feel inclined to upgrade. Careful pricing would also be required in order to maintain the perceived value of a full membership. Pilot programs could track upgrade rates from short-term passes to full memberships.

4. Electric bike perks

It has been observed that electric bikes are by far the most popular vehicle of choice for riders of all types. Therefore, tying electric bike perks to annual memberships, such as reduced electric bike fees, could prove an attractive, tangible incentive for riders. Increasing the perceived value of a membership with an existing point of interest could provide what many value-conscious riders may look for. To control operational costs, these perks could be structured with reasonable limits, such as through capped discounted rides per month.

However, because high electric bike usage has also been observed in existing members, providing perks on such bikes would also run the risk of eating into their own revenue. Higher incentive to ride electric bikes would also lead to higher demand for them, which subsequently would require higher operational costs, more frequent maintenance, and increased supply. Success could be measured through changes in membership sign-ups and electric bike utilization rates among members.

Limitations and Next Steps

Although this analysis reveals meaningful behavioral differences between casual riders and annual members, several limitations should be acknowledged both for judging the strength of the current analysis and as considerations for future analyses.

This study relies solely on anonymized trip data and as such does not include demographic information such as age, income, or residency status. As a result, conclusions about who casual riders are and why they make purchasing decisions remain limited. Future analysis incorporating customer surveys or demographic datasets could provide deeper insight into rider motivations. Additionally, pricing data was not available, preventing evaluation of price sensitivity or how cost influences conversion. Access to historical pricing changes would allow for stronger conclusions about purchasing behavior.

The lack of available user data also forces the analysis to judge activity in terms of the rides themselves rather than the people riding them. It is impossible to know based on the provided trip data how many users it represents or how often these users are returning for additional trips. This reflects a larger problem: throughout the analysis, ride behavior is assumed to directly reflect rider intent, but this interpretation cannot be fully confirmed without direct customer feedback. Surveys or app usage data could validate these assumptions.

Finally, while the trip data includes station names and geographic coordinates, this analysis did not incorporate distance calculations or route-level spatial analysis. As a result, it cannot determine how far riders typically travel, which routes are most common, or whether different rider types favor particular geographic areas. This limits the ability to distinguish between short convenience trips and longer recreational or commuting routes; for instance, casual riders’ longer durations may reflect leisurely riding rather than greater distances, but this cannot be confirmed without spatial analysis. Tracking location throughout a ride, rather than merely the start and end coordinates, could help to better understand how rider behavior varies geographically.

Conclusions

This analysis set out to support Cyclistic’s goal of converting casual riders into annual members by identifying behavioral differences between rider types. The data shows that casual riders are more seasonal, more weekend-oriented, and more likely to take longer recreational trips. In contrast, current annual members, display more consistent usage patterns and stronger weekday activity, reflecting a more routine or practical use of Cyclistic’s service.

These differences indicate that casual riders may not yet perceive enough long-term value in a full year-long membership. By aligning marketing strategies with casual riders’ recreational preferences and seasonal activity, Cyclistic can better position memberships as appealing and relevant to their needs.

While further research is needed to understand rider motivations more deeply, the findings from this analysis provide a data-driven foundation for targeted marketing strategies and future experimentation. With thoughtful testing and refinement, Cyclistic can use these insights to increase membership conversion and strengthen long-term customer loyalty.

References

  1. Wikipedia Contributors. (2025, December 18). Divvy. Wikipedia; Wikimedia Foundation. https://en.wikipedia.org/wiki/Divvy
  2. Home | Divvy Bikes. (n.d.). Divvybikes.com. https://divvybikes.com/
  3. Divvy Data. (n.d.). Divvybikes.com. https://divvybikes.com/system-data
  4. Boundaries - City - Map | City of Chicago | Data Portal. (2025, July 24). Cityofchicago.org. https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-City-Map/ewy2-6yfk