This R Markdown file documents the exploration, summary, and visualization of Divvy’s bikeshare data. For more details about the data and the data cleaning process using PostgreSQL, please see my other portfolio here: https://eemchua.github.io/divvy_data_cleaning/
Since the purpose of this R Markdown file is to showcase my Exploratory Data Analysis (EDA) approach, the coding is provided for demonstration purposes (I apologize for the length of this report). I will explore nine topics, progressing from univariate to bivariate and then multivariate analysis. Here is the list of topics:
Before starting the EDA, all necessary libraries were installed. The cleaned dataset was also loaded into RStudio through PostgreSQL.
To ensure the accuracy of the cleaned dataset imported from PostgreSQL, a detailed report has been generated:
library(skimr)
skim(trip_data)
| Name | trip_data |
| Number of rows | 4241247 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 4 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 4241126 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 0 | 1 | 10 | 64 | 0 | 1654 | 0 |
| start_station_id | 0 | 1 | 3 | 14 | 0 | 1629 | 0 |
| end_station_name | 0 | 1 | 10 | 64 | 0 | 1668 | 0 |
| end_station_id | 0 | 1 | 3 | 36 | 0 | 1636 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.65 | 41.88 | 41.89 | 41.93 | 42.06 | ▁▁▇▇▁ |
| start_lng | 0 | 1 | -87.64 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -87.53 | ▁▁▃▇▁ |
| end_lat | 0 | 1 | 41.90 | 0.05 | 0.00 | 41.88 | 41.90 | 41.93 | 42.06 | ▁▁▁▁▇ |
| end_lng | 0 | 1 | -87.64 | 0.05 | -87.84 | -87.66 | -87.64 | -87.63 | 0.00 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2023-08-01 00:00:07 | 2024-07-31 23:54:26 | 2024-02-12 15:22:57 | 3873410 |
| ended_at | 0 | 1 | 2023-08-01 00:01:03 | 2024-07-31 23:59:56 | 2024-02-12 15:36:39 | 3880164 |
Observation:
* The cleaned dataset has been successfully loaded to RStudio.
Summary:
table(trip_data$rideable_type)
##
## classic_bike docked_bike electric_bike
## 2854488 15473 1371286
Plot:
library(ggplot2)
library(forcats)
library(scales)
ggplot(trip_data, aes(x = fct_infreq(rideable_type))) +
geom_bar(fill = "steelblue") +
theme_minimal() +
ggtitle("Distribution of Rideable Types") +
xlab("Rideable Type") +
ylab("Count") +
scale_y_continuous(labels = label_number(scale_cut = cut_si("")))
Observation:
* Classic bikes were the most commonly used bike type, followed by
electric bikes and docked bikes.
* While classic bikes dominate overall usage, electric bikes are also
popular, likely due to their efficiency and ease of use over longer
distances.
Summary:
table(trip_data$member_casual)
##
## casual member
## 1490330 2750917
Plot:
ggplot(trip_data, aes(x = member_casual)) +
geom_bar(fill = "darkgreen") +
theme_minimal() +
ggtitle("Distribution of Member vs Casual Riders") +
xlab("Rider Type") +
ylab("Count") +
scale_y_continuous(labels = label_number(scale_cut = cut_si("")))
Observation:
* Members represent a larger portion of total riders compared to casual
users. Member riders account for nearly 65% of all trips, while casual
riders make up the remaining 35%.
* This suggests that the majority of trips are taken by subscribed
members, likely indicating a strong base of regular users who frequently
utilize the service. Although casual riders are significant, they
represent a smaller portion of the overall user base.
For the start stations, we will explore the top 10 stations and the number of rides recorded.
Summary:
library(dplyr)
library(DT)
top_start_stations <- trip_data %>%
count(start_station_name, sort = TRUE) %>%
slice_max(n, n = 10)
datatable(top_start_stations, colnames = c("Start Station Name", "Count"),
options = list(pageLength = 10),
caption = 'Top 10 Start Stations')
Plot:
library(scales)
ggplot(top_start_stations, aes(x = reorder(start_station_name, n), y = n)) +
geom_bar(stat = "identity", fill = "orange") +
coord_flip() +
theme_minimal() +
ggtitle("Top 10 Most Frequent Start Stations") +
xlab("Start Station Name") +
ylab("Count") +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))
Observation:
* As the plot shows, the 2nd to 10th stations have ride counts between
27K and 40K, displaying a stable upward trend. The 1st station at
Streeter Dr & Grand Ave stands out with the highest count, just over
60K.
* Possible explanations for the significantly higher ride count at the
Streeter Dr & Grand Ave station include: (1) its proximity to Navy
Pier, a major tourist and event destination, (2) its location next to
the popular bike trail along Lake Michigan, and (3) its central position
between Chicago’s north and south side.
For the end stations, we will also explore the top 10 stations and the number of rides recorded.
Summary:
top_end_stations <- trip_data %>%
count(end_station_name, sort = TRUE) %>%
slice_max(n, n = 10)
datatable(top_end_stations, colnames = c("End Station Name", "Count"),
options = list(pageLength = 10),
caption = 'Top 10 End Stations')
Plot:
ggplot(top_end_stations, aes(x = reorder(end_station_name, n), y = n)) +
geom_bar(stat = "identity", fill = "blue") +
coord_flip() +
theme_minimal() +
ggtitle("Top 10 Most Frequent End Stations") +
xlab("End Station Name") +
ylab("Count") +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))
Observation:
* As the summary and plot show, the distribution pattern of End Stations
is almost identical to that of the Start Stations. In fact, both share
the same top 10 stations. We can further examine these 10 stations and
perform a bivariate analysis (EDA 5).
Summary:
library(tidyr)
library(knitr)
combined_stations <- top_start_stations %>%
inner_join(top_end_stations, by = c("start_station_name" = "end_station_name")) %>%
rename(station_name = start_station_name, start_count = n.x, end_count = n.y)
kable(combined_stations, caption = "Combined Table of Top 10 Start and End Stations")
| station_name | start_count | end_count |
|---|---|---|
| Streeter Dr & Grand Ave | 60207 | 61788 |
| DuSable Lake Shore Dr & Monroe St | 39446 | 37645 |
| Michigan Ave & Oak St | 35540 | 35991 |
| DuSable Lake Shore Dr & North Blvd | 35012 | 38115 |
| Kingsbury St & Kinzie St | 33149 | 32515 |
| Clark St & Elm St | 31865 | 31499 |
| Clinton St & Washington Blvd | 31256 | 32006 |
| Millennium Park | 28639 | 29633 |
| Wells St & Concord Ln | 27692 | 28294 |
| Clinton St & Madison St | 27549 | 28390 |
Observation:
* The fact that the top 10 stations are the same for both Start and End
Stations reinforces that these stations serve as major hubs for
bikeshare activity. Potential topics to discuss with stakeholders
include the possibility of increasing bike capacity and/or expanding
infrastructure at these key hubs.
Plot:
combined_stations_long <- combined_stations %>%
pivot_longer(cols = c(start_count, end_count),
names_to = "type",
values_to = "count")
ggplot(combined_stations_long, aes(x = reorder(station_name, count), y = count, fill = type)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
theme_minimal() +
labs(title = "Comparison of Start and End Counts for Top 10 Stations",
x = "Station Name",
y = "Count") +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) + # Format y-axis in K (thousands)
scale_fill_manual(values = c("start_count" = "orange", "end_count" = "blue")) +
theme(legend.title = element_blank())
Observation:
* The start and end counts are identical for all 10 stations, as shown
in the plot. This pattern suggests that riders are likely taking
round-trip rides, making it worthwhile to explore targeted marketing
strategies for round-trip rides.
Summary:
rider_vs_rideable_table <- trip_data %>%
group_by(member_casual, rideable_type) %>%
summarise(count = n(), .groups = "drop") %>%
mutate(rideable_type = reorder(rideable_type, -count))
print(rider_vs_rideable_table)
## # A tibble: 5 × 3
## member_casual rideable_type count
## <chr> <fct> <int>
## 1 casual classic_bike 975607
## 2 casual docked_bike 15473
## 3 casual electric_bike 499250
## 4 member classic_bike 1878881
## 5 member electric_bike 872036
Plot:
ggplot(rider_vs_rideable_table, aes(x = member_casual, y = count, fill = rideable_type)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
labs(title = "Rider Type vs. Rideable Type", x = "Rider Type", y = "Count") +
theme_minimal() +
theme(legend.title = element_blank())
Observation:
* As shown, the results align with the observations made earlier in EDA
1 and EDA 2. For both casual and member riders, the current prediction
is that the distribution of trips will be approximately two-thirds on
classic bikes and one-third on electric bikes.
We will explore two scenarios: (i) Classic Bike Usage by Day and (ii) Electric Bike Usage by Day.
Adding a day-of-the-week column to the dataset:
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
trip_data$started_at <- as.POSIXct(trip_data$started_at, format = "%Y-%m-%d %H:%M:%S")
trip_data$ended_at <- as.POSIXct(trip_data$ended_at, format = "%Y-%m-%d %H:%M:%S")
trip_data <- trip_data %>%
mutate(day_of_week = wday(started_at, label = TRUE, week_start = 1))
Plot:
classic_bike_usage_by_rider_type <- trip_data %>%
filter(rideable_type == "classic_bike") %>%
group_by(member_casual, day_of_week) %>%
summarise(trip_count = n(), .groups = "drop")
ggplot(classic_bike_usage_by_rider_type, aes(x = day_of_week, y = trip_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Member vs. Casual Riders on Classic Bikes by Day of the Week",
x = "Day of the Week", y = "Number of Trips",
fill = "Rider Type") +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
scale_fill_manual(values = c("casual" = "darkorange", "member" = "blue")) +
theme_minimal()
Plot:
electric_bike_usage_by_rider_type <- trip_data %>%
filter(rideable_type == "electric_bike") %>%
group_by(member_casual, day_of_week) %>%
summarise(trip_count = n(), .groups = "drop")
ggplot(electric_bike_usage_by_rider_type, aes(x = day_of_week, y = trip_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Member vs. Casual Riders on Electric Bikes by Day of the Week",
x = "Day of the Week", y = "Number of Trips",
fill = "Rider Type") +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
scale_fill_manual(values = c("casual" = "orange", "member" = "lightblue")) +
theme_minimal()
Observation:
* Comparing the plots, the usage of classic and electric bikes is
identical across the days of the week. This suggests that bike type is
not a significant factor in these scenarios.
* Members dominate bike usage throughout the week, with significantly
higher trip counts compared to casual riders.
* Casual riders show consistent bike usage, with peaks toward the
weekends, indicating they may use bikes more for weekend leisure
trips.
We will explore four scenarios: (i) Classic Bike Usage by Hour on Weekdays, (ii) Electric Bike Usage by Hour on Weekdays, (iii) Classic Bike Usage by Hour on Weekends, and (iv) Electric Bike Usage by Hour on Weekends.
Converting and adding the hour column in AM/PM format to the dataset:
trip_data <- trip_data %>%
mutate(hour_of_day = hour(started_at))
convert_to_12hr <- function(x) {
ifelse(x == 0, "12 AM",
ifelse(x < 12, paste(x, "AM"),
ifelse(x == 12, "12 PM", paste(x - 12, "PM"))))
}
Plot:
classic_bike_weekday <- trip_data %>%
filter(rideable_type == "classic_bike", day_of_week %in% c("Mon", "Tue", "Wed", "Thu", "Fri")) %>%
group_by(member_casual, hour_of_day) %>%
summarise(trip_count = n(), .groups = "drop")
ggplot(classic_bike_weekday, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Member vs. Casual Riders on Classic Bikes by Hour (Weekdays)",
x = "Hour of the Day", y = "Number of Trips",
fill = "Rider Type") +
scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
scale_fill_manual(values = c("casual" = "brown3", "member" = "darkgreen")) +
theme_minimal()
Plot:
electric_bike_weekday <- trip_data %>%
filter(rideable_type == "electric_bike", day_of_week %in% c("Mon", "Tue", "Wed", "Thu", "Fri")) %>%
group_by(member_casual, hour_of_day) %>%
summarise(trip_count = n(), .groups = "drop")
ggplot(electric_bike_weekday, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Member vs. Casual Riders on Electric Bikes by Hour (Weekdays)",
x = "Hour of the Day", y = "Number of Trips",
fill = "Rider Type") +
scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
scale_fill_manual(values = c("casual" = "brown1", "member" = "green")) +
theme_minimal()
Observation:
* Comparing plots (i) and (ii), the usage of classic and electric bikes
by hour on weekdays is identical, suggesting that bike type is not a
crucial factor in these scenarios.
* Members dominate the usage of both bike types, particularly during
peak commuting hours around 8AM and 5-6PM.
* Casual riders use both bikes more consistently throughout the day,
with the peak usage occurring between 4-6PM.
Plot:
classic_bike_weekend <- trip_data %>%
filter(rideable_type == "classic_bike", day_of_week %in% c("Sat", "Sun")) %>%
group_by(member_casual, hour_of_day) %>%
summarise(trip_count = n(), .groups = "drop")
ggplot(classic_bike_weekend, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Member vs. Casual Riders on Classic Bikes by Hour (Weekends)",
x = "Hour of the Day", y = "Number of Trips",
fill = "Rider Type") +
scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
scale_fill_manual(values = c("casual" = "darkgoldenrod3", "member" = "deepskyblue3")) +
theme_minimal()
Plot:
electric_bike_weekend <- trip_data %>%
filter(rideable_type == "electric_bike", day_of_week %in% c("Sat", "Sun")) %>%
group_by(member_casual, hour_of_day) %>%
summarise(trip_count = n(), .groups = "drop")
ggplot(electric_bike_weekend, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Member vs. Casual Riders on Electric Bikes by Hour (Weekends)",
x = "Hour of the Day", y = "Number of Trips",
fill = "Rider Type") +
scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
scale_fill_manual(values = c("casual" = "goldenrod1", "member" = "skyblue1")) +
theme_minimal()
Observation:
* Comparing plots (iii) and (iv), the usage of classic and electric
bikes by hour on weekends is also identical, suggesting that bike type
is not a crucial factor in these scenarios.
* Both bike types show a strong weekend pattern, where both casual and
member riders are highly active during the midday and afternoon hours.
There is a steady increase in trips starting around 8AM, with a peak
around noon to 2PM, followed by a gradual decline after 4PM.
* This pattern suggests that weekend usage is largely driven by leisure
trips, such as visits to parks, tourist attractions, or other
recreational activities.
In this section, we explore the relationship between distance and duration for both casual riders and member riders.
Distance and duration have been added to the table for analysis, and some outliers have been removed using the code shown below:
# Adding duration in minutes to the dataset
trip_data <- trip_data %>%
mutate(duration_minutes = as.numeric(difftime(ended_at, started_at, units = "mins")))
# Adding distance in miles to the dataset using Haversine formula
library(geosphere)
trip_data <- trip_data %>%
mutate(distance_miles = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat)) / 1609.34) # Convert meters to miles
# Remove outliers where distance is greater than 6000 miles
trip_data <- trip_data %>%
filter(distance_miles <= 6000)
# Remove outliers where trip duration is greater than 2000 minutes
trip_data <- trip_data %>%
filter(duration_minutes <= 2000)
Plots:
ggplot(trip_data, aes(x = distance_miles, y = duration_minutes)) +
geom_hex() +
labs(title = "Distance vs. Duration by Rider Type (Hexagonal Binning)",
x = "Distance (Miles)", y = "Duration (Minutes)") +
facet_wrap(~member_casual) + # Facet by rider type (member vs. casual)
theme_minimal() +
scale_fill_gradient(low = "lightblue", high = "darkred") # Adjust the color gradient
Observation:
* The plots show a high concentration of trips (represented by darker
red in the lower-left corners) with short distances (< 5 miles) and
short durations (< 100 minutes). The density of trips becomes sparser
as distances exceed 10 miles. These observations suggest that the
bikeshare system is predominantly used for shorter, local trips rather
than long-distance travel.
* Comparing the plots, casual riders tend to have longer durations for
shorter distances compared to members, reflecting that casual riders are
more likely to use the bikes for leisure or exploration rather than
commuting.
* Most of the rides completed by members are short in both distance and
duration, indicating that they primarily use the service for functional
trips, such as commuting or regular, planned travel.
* Long-distance and long-duration trips are rare for both rider
types.
Before and during the EDA, it is important to continuously collaborate with stakeholders to understand their business model and identify strategies that will help achieve their revenue goals. Since I am not working directly with stakeholders and do not have insight into Divvy Bikes’ key revenue sources, I will suggest several potential strategies to optimize operations and increase user engagement based solely on the EDA findings outlined.
To conclude, it is crucial to continuously monitor real-time usage trends and adjust bikeshare availability, pricing, and promotions accordingly.