Introduction

This R Markdown file documents the exploration, summary, and visualization of Divvy’s bikeshare data. For more details about the data and the data cleaning process using PostgreSQL, please see my other portfolio here: https://eemchua.github.io/divvy_data_cleaning/

Since the purpose of this R Markdown file is to showcase my Exploratory Data Analysis (EDA) approach, the coding is provided for demonstration purposes (I apologize for the length of this report). I will explore nine topics, progressing from univariate to bivariate and then multivariate analysis. Here is the list of topics:

  1. Distribution of Bike Type (rideable_type)
  2. Distribution of Rider Type (member_casual)
  3. Distribution of Start Stations (start_station_name)
  4. Distribution of End Stations (end_station_name)
  5. Analysis of the Top 10 Start and End Stations
  6. Analysis of Rider Type (member_casual) vs. Rideable Type (rideable_type)
  7. Analysis of Rider Type (member_casual) by Day of the Week
  8. Analysis of Rider Type (member_casual) by Hour of the Day
  9. Analysis of Distance vs. Duration by Rider Type

Before starting the EDA, all necessary libraries were installed. The cleaned dataset was also loaded into RStudio through PostgreSQL.


Overview of the Imported Dataset

To ensure the accuracy of the cleaned dataset imported from PostgreSQL, a detailed report has been generated:

library(skimr)
skim(trip_data)
Data summary
Name trip_data
Number of rows 4241247
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 16 16 0 4241126 0
rideable_type 0 1 11 13 0 3 0
start_station_name 0 1 10 64 0 1654 0
start_station_id 0 1 3 14 0 1629 0
end_station_name 0 1 10 64 0 1668 0
end_station_id 0 1 3 36 0 1636 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_lat 0 1 41.90 0.04 41.65 41.88 41.89 41.93 42.06 ▁▁▇▇▁
start_lng 0 1 -87.64 0.03 -87.84 -87.66 -87.64 -87.63 -87.53 ▁▁▃▇▁
end_lat 0 1 41.90 0.05 0.00 41.88 41.90 41.93 42.06 ▁▁▁▁▇
end_lng 0 1 -87.64 0.05 -87.84 -87.66 -87.64 -87.63 0.00 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2023-08-01 00:00:07 2024-07-31 23:54:26 2024-02-12 15:22:57 3873410
ended_at 0 1 2023-08-01 00:01:03 2024-07-31 23:59:56 2024-02-12 15:36:39 3880164

Observation:
* The cleaned dataset has been successfully loaded to RStudio.


EDA 1 - Distribution of Bike Type (rideable_type)

Summary:

table(trip_data$rideable_type)
## 
##  classic_bike   docked_bike electric_bike 
##       2854488         15473       1371286

Plot:

library(ggplot2)
library(forcats)
library(scales)
ggplot(trip_data, aes(x = fct_infreq(rideable_type))) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  ggtitle("Distribution of Rideable Types") +
  xlab("Rideable Type") +
  ylab("Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("")))

Observation:
* Classic bikes were the most commonly used bike type, followed by electric bikes and docked bikes.
* While classic bikes dominate overall usage, electric bikes are also popular, likely due to their efficiency and ease of use over longer distances.


EDA 2 - Distribution of Rider Type (member_casual)

Summary:

table(trip_data$member_casual)
## 
##  casual  member 
## 1490330 2750917

Plot:

ggplot(trip_data, aes(x = member_casual)) +
  geom_bar(fill = "darkgreen") +
  theme_minimal() +
  ggtitle("Distribution of Member vs Casual Riders") +
  xlab("Rider Type") +
  ylab("Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("")))

Observation:
* Members represent a larger portion of total riders compared to casual users. Member riders account for nearly 65% of all trips, while casual riders make up the remaining 35%.
* This suggests that the majority of trips are taken by subscribed members, likely indicating a strong base of regular users who frequently utilize the service. Although casual riders are significant, they represent a smaller portion of the overall user base.


EDA 3 - Distribution of Start Stations (start_station_name)

For the start stations, we will explore the top 10 stations and the number of rides recorded.

Summary:

library(dplyr)
library(DT)

top_start_stations <- trip_data %>%
  count(start_station_name, sort = TRUE) %>%
  slice_max(n, n = 10)

datatable(top_start_stations, colnames = c("Start Station Name", "Count"), 
          options = list(pageLength = 10), 
          caption = 'Top 10 Start Stations')

Plot:

library(scales)

ggplot(top_start_stations, aes(x = reorder(start_station_name, n), y = n)) +
  geom_bar(stat = "identity", fill = "orange") +
  coord_flip() +
  theme_minimal() +
  ggtitle("Top 10 Most Frequent Start Stations") +
  xlab("Start Station Name") +
  ylab("Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))

Observation:
* As the plot shows, the 2nd to 10th stations have ride counts between 27K and 40K, displaying a stable upward trend. The 1st station at Streeter Dr & Grand Ave stands out with the highest count, just over 60K.
* Possible explanations for the significantly higher ride count at the Streeter Dr & Grand Ave station include: (1) its proximity to Navy Pier, a major tourist and event destination, (2) its location next to the popular bike trail along Lake Michigan, and (3) its central position between Chicago’s north and south side.


EDA 4 - Distribution of End Stations (end_station_name)

For the end stations, we will also explore the top 10 stations and the number of rides recorded.

Summary:

top_end_stations <- trip_data %>%
  count(end_station_name, sort = TRUE) %>%
  slice_max(n, n = 10)  

datatable(top_end_stations, colnames = c("End Station Name", "Count"), 
          options = list(pageLength = 10), 
          caption = 'Top 10 End Stations')

Plot:

ggplot(top_end_stations, aes(x = reorder(end_station_name, n), y = n)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  theme_minimal() +
  ggtitle("Top 10 Most Frequent End Stations") +
  xlab("End Station Name") +
  ylab("Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))

Observation:
* As the summary and plot show, the distribution pattern of End Stations is almost identical to that of the Start Stations. In fact, both share the same top 10 stations. We can further examine these 10 stations and perform a bivariate analysis (EDA 5).


EDA 5 - Analysis of the Top 10 Stations

Summary:

library(tidyr) 
library(knitr)

combined_stations <- top_start_stations %>%
  inner_join(top_end_stations, by = c("start_station_name" = "end_station_name")) %>%
  rename(station_name = start_station_name, start_count = n.x, end_count = n.y)

kable(combined_stations, caption = "Combined Table of Top 10 Start and End Stations")
Combined Table of Top 10 Start and End Stations
station_name start_count end_count
Streeter Dr & Grand Ave 60207 61788
DuSable Lake Shore Dr & Monroe St 39446 37645
Michigan Ave & Oak St 35540 35991
DuSable Lake Shore Dr & North Blvd 35012 38115
Kingsbury St & Kinzie St 33149 32515
Clark St & Elm St 31865 31499
Clinton St & Washington Blvd 31256 32006
Millennium Park 28639 29633
Wells St & Concord Ln 27692 28294
Clinton St & Madison St 27549 28390

Observation:
* The fact that the top 10 stations are the same for both Start and End Stations reinforces that these stations serve as major hubs for bikeshare activity. Potential topics to discuss with stakeholders include the possibility of increasing bike capacity and/or expanding infrastructure at these key hubs.

Plot:

combined_stations_long <- combined_stations %>%
  pivot_longer(cols = c(start_count, end_count), 
               names_to = "type", 
               values_to = "count")

ggplot(combined_stations_long, aes(x = reorder(station_name, count), y = count, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Comparison of Start and End Counts for Top 10 Stations", 
       x = "Station Name", 
       y = "Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +  # Format y-axis in K (thousands)
  scale_fill_manual(values = c("start_count" = "orange", "end_count" = "blue")) +
  theme(legend.title = element_blank())

Observation:
* The start and end counts are identical for all 10 stations, as shown in the plot. This pattern suggests that riders are likely taking round-trip rides, making it worthwhile to explore targeted marketing strategies for round-trip rides.


EDA 6 - Analysis Rider Type (member_casual) vs. Rideable Type (rideable_type)

Summary:

rider_vs_rideable_table <- trip_data %>%
  group_by(member_casual, rideable_type) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(rideable_type = reorder(rideable_type, -count))

print(rider_vs_rideable_table)
## # A tibble: 5 × 3
##   member_casual rideable_type   count
##   <chr>         <fct>           <int>
## 1 casual        classic_bike   975607
## 2 casual        docked_bike     15473
## 3 casual        electric_bike  499250
## 4 member        classic_bike  1878881
## 5 member        electric_bike  872036

Plot:

ggplot(rider_vs_rideable_table, aes(x = member_casual, y = count, fill = rideable_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) + 
  labs(title = "Rider Type vs. Rideable Type", x = "Rider Type", y = "Count") +
  theme_minimal() +
  theme(legend.title = element_blank())

Observation:
* As shown, the results align with the observations made earlier in EDA 1 and EDA 2. For both casual and member riders, the current prediction is that the distribution of trips will be approximately two-thirds on classic bikes and one-third on electric bikes.


EDA 7 - Analysis of Rider Type (member_casual) by Day of the Week

We will explore two scenarios: (i) Classic Bike Usage by Day and (ii) Electric Bike Usage by Day.

Adding a day-of-the-week column to the dataset:

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
trip_data$started_at <- as.POSIXct(trip_data$started_at, format = "%Y-%m-%d %H:%M:%S")
trip_data$ended_at <- as.POSIXct(trip_data$ended_at, format = "%Y-%m-%d %H:%M:%S")
trip_data <- trip_data %>%
  mutate(day_of_week = wday(started_at, label = TRUE, week_start = 1))

(i) Classic Bike Usage by Day

Plot:

classic_bike_usage_by_rider_type <- trip_data %>%
  filter(rideable_type == "classic_bike") %>%
  group_by(member_casual, day_of_week) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(classic_bike_usage_by_rider_type, aes(x = day_of_week, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +  
  labs(title = "Member vs. Casual Riders on Classic Bikes by Day of the Week",
       x = "Day of the Week", y = "Number of Trips",
       fill = "Rider Type") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "darkorange", "member" = "blue")) +
  theme_minimal()

(ii) Electric Bike Usage by Day

Plot:

electric_bike_usage_by_rider_type <- trip_data %>%
  filter(rideable_type == "electric_bike") %>%
  group_by(member_casual, day_of_week) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(electric_bike_usage_by_rider_type, aes(x = day_of_week, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +  
  labs(title = "Member vs. Casual Riders on Electric Bikes by Day of the Week",
       x = "Day of the Week", y = "Number of Trips",
       fill = "Rider Type") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) + 
  scale_fill_manual(values = c("casual" = "orange", "member" = "lightblue")) +
  theme_minimal()

Observation:
* Comparing the plots, the usage of classic and electric bikes is identical across the days of the week. This suggests that bike type is not a significant factor in these scenarios.
* Members dominate bike usage throughout the week, with significantly higher trip counts compared to casual riders.
* Casual riders show consistent bike usage, with peaks toward the weekends, indicating they may use bikes more for weekend leisure trips.


EDA 8 - Analysis of Rider Type (member_casual) by Hour of the Day

We will explore four scenarios: (i) Classic Bike Usage by Hour on Weekdays, (ii) Electric Bike Usage by Hour on Weekdays, (iii) Classic Bike Usage by Hour on Weekends, and (iv) Electric Bike Usage by Hour on Weekends.

Converting and adding the hour column in AM/PM format to the dataset:

trip_data <- trip_data %>%
  mutate(hour_of_day = hour(started_at))

convert_to_12hr <- function(x) {
  ifelse(x == 0, "12 AM",
         ifelse(x < 12, paste(x, "AM"),
                ifelse(x == 12, "12 PM", paste(x - 12, "PM"))))
}

(i) Classic Bike Usage by Hour on Weekdays

Plot:

classic_bike_weekday <- trip_data %>%
  filter(rideable_type == "classic_bike", day_of_week %in% c("Mon", "Tue", "Wed", "Thu", "Fri")) %>%
  group_by(member_casual, hour_of_day) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(classic_bike_weekday, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") + 
  labs(title = "Member vs. Casual Riders on Classic Bikes by Hour (Weekdays)",
       x = "Hour of the Day", y = "Number of Trips",
       fill = "Rider Type") +
  scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "brown3", "member" = "darkgreen")) +
  theme_minimal()

(ii) Electric Bike Usage by Hour on Weekdays

Plot:

electric_bike_weekday <- trip_data %>%
  filter(rideable_type == "electric_bike", day_of_week %in% c("Mon", "Tue", "Wed", "Thu", "Fri")) %>%
  group_by(member_casual, hour_of_day) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(electric_bike_weekday, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Member vs. Casual Riders on Electric Bikes by Hour (Weekdays)",
       x = "Hour of the Day", y = "Number of Trips",
       fill = "Rider Type") +
  scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "brown1", "member" = "green")) +
  theme_minimal()

Observation:
* Comparing plots (i) and (ii), the usage of classic and electric bikes by hour on weekdays is identical, suggesting that bike type is not a crucial factor in these scenarios.
* Members dominate the usage of both bike types, particularly during peak commuting hours around 8AM and 5-6PM.
* Casual riders use both bikes more consistently throughout the day, with the peak usage occurring between 4-6PM.

(iii) Classic Bike Usage by Hour on Weekends

Plot:

classic_bike_weekend <- trip_data %>%
  filter(rideable_type == "classic_bike", day_of_week %in% c("Sat", "Sun")) %>%
  group_by(member_casual, hour_of_day) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(classic_bike_weekend, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Member vs. Casual Riders on Classic Bikes by Hour (Weekends)",
       x = "Hour of the Day", y = "Number of Trips",
       fill = "Rider Type") +
  scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +  
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "darkgoldenrod3", "member" = "deepskyblue3")) +
  theme_minimal()

(iv) Electric Bike Usage by Hour on Weekends

Plot:

electric_bike_weekend <- trip_data %>%
  filter(rideable_type == "electric_bike", day_of_week %in% c("Sat", "Sun")) %>%
  group_by(member_casual, hour_of_day) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(electric_bike_weekend, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Member vs. Casual Riders on Electric Bikes by Hour (Weekends)",
       x = "Hour of the Day", y = "Number of Trips",
       fill = "Rider Type") +
  scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "goldenrod1", "member" = "skyblue1")) +
  theme_minimal()

Observation:
* Comparing plots (iii) and (iv), the usage of classic and electric bikes by hour on weekends is also identical, suggesting that bike type is not a crucial factor in these scenarios.
* Both bike types show a strong weekend pattern, where both casual and member riders are highly active during the midday and afternoon hours. There is a steady increase in trips starting around 8AM, with a peak around noon to 2PM, followed by a gradual decline after 4PM.
* This pattern suggests that weekend usage is largely driven by leisure trips, such as visits to parks, tourist attractions, or other recreational activities.


EDA 9 - Analysis of Distance vs. Duration by Rider Type

In this section, we explore the relationship between distance and duration for both casual riders and member riders.

Distance and duration have been added to the table for analysis, and some outliers have been removed using the code shown below:

# Adding duration in minutes to the dataset
trip_data <- trip_data %>%
  mutate(duration_minutes = as.numeric(difftime(ended_at, started_at, units = "mins")))

# Adding distance in miles to the dataset using Haversine formula
library(geosphere)
trip_data <- trip_data %>%
  mutate(distance_miles = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat)) / 1609.34)  # Convert meters to miles

# Remove outliers where distance is greater than 6000 miles
trip_data <- trip_data %>%
  filter(distance_miles <= 6000)

# Remove outliers where trip duration is greater than 2000 minutes
trip_data <- trip_data %>%
  filter(duration_minutes <= 2000)

Plots:

ggplot(trip_data, aes(x = distance_miles, y = duration_minutes)) +
  geom_hex() +
  labs(title = "Distance vs. Duration by Rider Type (Hexagonal Binning)", 
       x = "Distance (Miles)", y = "Duration (Minutes)") +
  facet_wrap(~member_casual) +  # Facet by rider type (member vs. casual)
  theme_minimal() +
  scale_fill_gradient(low = "lightblue", high = "darkred")  # Adjust the color gradient

Observation:
* The plots show a high concentration of trips (represented by darker red in the lower-left corners) with short distances (< 5 miles) and short durations (< 100 minutes). The density of trips becomes sparser as distances exceed 10 miles. These observations suggest that the bikeshare system is predominantly used for shorter, local trips rather than long-distance travel.
* Comparing the plots, casual riders tend to have longer durations for shorter distances compared to members, reflecting that casual riders are more likely to use the bikes for leisure or exploration rather than commuting.
* Most of the rides completed by members are short in both distance and duration, indicating that they primarily use the service for functional trips, such as commuting or regular, planned travel.
* Long-distance and long-duration trips are rare for both rider types.


Business Recommendations:

Before and during the EDA, it is important to continuously collaborate with stakeholders to understand their business model and identify strategies that will help achieve their revenue goals. Since I am not working directly with stakeholders and do not have insight into Divvy Bikes’ key revenue sources, I will suggest several potential strategies to optimize operations and increase user engagement based solely on the EDA findings outlined.

  1. Targeted Commuter Offers and Incentives: Member riders dominate during commuting hours on weekdays. Offering loyalty programs or commuter-specific packages could help retain and grow the member base.
  2. Enhance Service at Popular Stations: Ensure ample bike availability at the popular stations, particularly during peak hours on weekdays and weekends. Consider implementing real-time bike availability displays at these stations or in the app to help riders locate available bikes quickly.
  3. Implement Dynamic Bike Rebalancing: Given the clear peaks in demand by time of day and day of the week, invest in dynamic bike rebalancing solutions. This ensures that bikes are distributed efficiently across high-demand stations, especially on weekends when casual riders tend to crowd popular areas.
  4. Optimize User Experience for Casual Riders: Casual riders tend to take longer-duration trips. Enhancing the riding experience for casual users by providing more tourist-friendly routes or partnerships with local attractions, could encourage longer rides and more frequent usage.

To conclude, it is crucial to continuously monitor real-time usage trends and adjust bikeshare availability, pricing, and promotions accordingly.