Exploratory Data Analysis of Divvy Bikes in R

Introduction

This R Markdown file documents the exploration, summary, and visualization of Divvy’s bikeshare data. For more details about the data and the data cleaning process using PostgreSQL, please see my other portfolio here: https://eemchua.github.io/divvy_data_cleaning/

Since the purpose of this R Markdown file is to showcase my Exploratory Data Analysis (EDA) approach, the coding is provided for demonstration purposes (I apologize for the length of this report). I will explore nine topics, progressing from univariate to bivariate and then multivariate analysis. Here is the list of topics:

Distribution of Bike Type (rideable_type)
Distribution of Rider Type (member_casual)
Distribution of Start Stations (start_station_name)
Distribution of End Stations (end_station_name)
Analysis of the Top 10 Start and End Stations
Analysis of Rider Type (member_casual) vs. Rideable Type (rideable_type)
Analysis of Rider Type (member_casual) by Day of the Week
Analysis of Rider Type (member_casual) by Hour of the Day
Analysis of Distance vs. Duration by Rider Type

Before starting the EDA, all necessary libraries were installed. The cleaned dataset was also loaded into RStudio through PostgreSQL.

Overview of the Imported Dataset

To ensure the accuracy of the cleaned dataset imported from PostgreSQL, a detailed report has been generated:

library(skimr)
skim(trip_data)

Data summary
Name	trip_data
Number of rows	4241247
Number of columns	13
_______________________
Column type frequency:
character	7
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
ride_id	1	16	16	4241126
rideable_type	1	11	13	3
start_station_name	1	10	64	1654
start_station_id	1	3	14	1629
end_station_name	1	10	64	1668
end_station_id	1	3	36	1636
member_casual	1	6	6	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
start_lat	1	41.90	0.04	41.65	41.88	41.89	41.93	42.06	▁▁▇▇▁
start_lng	1	-87.64	0.03	-87.84	-87.66	-87.64	-87.63	-87.53	▁▁▃▇▁
end_lat	1	41.90	0.05	0.00	41.88	41.90	41.93	42.06	▁▁▁▁▇
end_lng	1	-87.64	0.05	-87.84	-87.66	-87.64	-87.63	0.00	▇▁▁▁▁

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2023-08-01 00:00:07	2024-07-31 23:54:26	2024-02-12 15:22:57	3873410
ended_at	0	1	2023-08-01 00:01:03	2024-07-31 23:59:56	2024-02-12 15:36:39	3880164

Observation:
* The cleaned dataset has been successfully loaded to RStudio.

EDA 1 - Distribution of Bike Type (rideable_type)

Summary:

table(trip_data$rideable_type)

## 
##  classic_bike   docked_bike electric_bike 
##       2854488         15473       1371286

Plot:

library(ggplot2)
library(forcats)
library(scales)
ggplot(trip_data, aes(x = fct_infreq(rideable_type))) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  ggtitle("Distribution of Rideable Types") +
  xlab("Rideable Type") +
  ylab("Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("")))

Observation:
* Classic bikes were the most commonly used bike type, followed by electric bikes and docked bikes.
* While classic bikes dominate overall usage, electric bikes are also popular, likely due to their efficiency and ease of use over longer distances.

EDA 2 - Distribution of Rider Type (member_casual)

Summary:

table(trip_data$member_casual)

## 
##  casual  member 
## 1490330 2750917

Plot:

ggplot(trip_data, aes(x = member_casual)) +
  geom_bar(fill = "darkgreen") +
  theme_minimal() +
  ggtitle("Distribution of Member vs Casual Riders") +
  xlab("Rider Type") +
  ylab("Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_si("")))

Observation:
* Members represent a larger portion of total riders compared to casual users. Member riders account for nearly 65% of all trips, while casual riders make up the remaining 35%.
* This suggests that the majority of trips are taken by subscribed members, likely indicating a strong base of regular users who frequently utilize the service. Although casual riders are significant, they represent a smaller portion of the overall user base.

EDA 3 - Distribution of Start Stations (start_station_name)

For the start stations, we will explore the top 10 stations and the number of rides recorded.

Summary:

library(dplyr)
library(DT)

top_start_stations <- trip_data %>%
  count(start_station_name, sort = TRUE) %>%
  slice_max(n, n = 10)

datatable(top_start_stations, colnames = c("Start Station Name", "Count"), 
          options = list(pageLength = 10), 
          caption = 'Top 10 Start Stations')

Plot:

library(scales)

ggplot(top_start_stations, aes(x = reorder(start_station_name, n), y = n)) +
  geom_bar(stat = "identity", fill = "orange") +
  coord_flip() +
  theme_minimal() +
  ggtitle("Top 10 Most Frequent Start Stations") +
  xlab("Start Station Name") +
  ylab("Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))

Observation:
* As the plot shows, the 2nd to 10th stations have ride counts between 27K and 40K, displaying a stable upward trend. The 1st station at Streeter Dr & Grand Ave stands out with the highest count, just over 60K.
* Possible explanations for the significantly higher ride count at the Streeter Dr & Grand Ave station include: (1) its proximity to Navy Pier, a major tourist and event destination, (2) its location next to the popular bike trail along Lake Michigan, and (3) its central position between Chicago’s north and south side.

EDA 4 - Distribution of End Stations (end_station_name)

For the end stations, we will also explore the top 10 stations and the number of rides recorded.

Summary:

top_end_stations <- trip_data %>%
  count(end_station_name, sort = TRUE) %>%
  slice_max(n, n = 10)  

datatable(top_end_stations, colnames = c("End Station Name", "Count"), 
          options = list(pageLength = 10), 
          caption = 'Top 10 End Stations')

Plot:

ggplot(top_end_stations, aes(x = reorder(end_station_name, n), y = n)) +
  geom_bar(stat = "identity", fill = "blue") +
  coord_flip() +
  theme_minimal() +
  ggtitle("Top 10 Most Frequent End Stations") +
  xlab("End Station Name") +
  ylab("Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))

Observation:
* As the summary and plot show, the distribution pattern of End Stations is almost identical to that of the Start Stations. In fact, both share the same top 10 stations. We can further examine these 10 stations and perform a bivariate analysis (EDA 5).

EDA 5 - Analysis of the Top 10 Stations

Summary:

library(tidyr) 
library(knitr)

combined_stations <- top_start_stations %>%
  inner_join(top_end_stations, by = c("start_station_name" = "end_station_name")) %>%
  rename(station_name = start_station_name, start_count = n.x, end_count = n.y)

kable(combined_stations, caption = "Combined Table of Top 10 Start and End Stations")

Combined Table of Top 10 Start and End Stations
station_name	start_count	end_count
Streeter Dr & Grand Ave	60207	61788
DuSable Lake Shore Dr & Monroe St	39446	37645
Michigan Ave & Oak St	35540	35991
DuSable Lake Shore Dr & North Blvd	35012	38115
Kingsbury St & Kinzie St	33149	32515
Clark St & Elm St	31865	31499
Clinton St & Washington Blvd	31256	32006
Millennium Park	28639	29633
Wells St & Concord Ln	27692	28294
Clinton St & Madison St	27549	28390

Observation:
* The fact that the top 10 stations are the same for both Start and End Stations reinforces that these stations serve as major hubs for bikeshare activity. Potential topics to discuss with stakeholders include the possibility of increasing bike capacity and/or expanding infrastructure at these key hubs.

Plot:

combined_stations_long <- combined_stations %>%
  pivot_longer(cols = c(start_count, end_count), 
               names_to = "type", 
               values_to = "count")

ggplot(combined_stations_long, aes(x = reorder(station_name, count), y = count, fill = type)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Comparison of Start and End Counts for Top 10 Stations", 
       x = "Station Name", 
       y = "Count") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +  # Format y-axis in K (thousands)
  scale_fill_manual(values = c("start_count" = "orange", "end_count" = "blue")) +
  theme(legend.title = element_blank())

Observation:
* The start and end counts are identical for all 10 stations, as shown in the plot. This pattern suggests that riders are likely taking round-trip rides, making it worthwhile to explore targeted marketing strategies for round-trip rides.

EDA 6 - Analysis Rider Type (member_casual) vs. Rideable Type (rideable_type)

Summary:

rider_vs_rideable_table <- trip_data %>%
  group_by(member_casual, rideable_type) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(rideable_type = reorder(rideable_type, -count))

print(rider_vs_rideable_table)

## # A tibble: 5 × 3
##   member_casual rideable_type   count
##   <chr>         <fct>           <int>
## 1 casual        classic_bike   975607
## 2 casual        docked_bike     15473
## 3 casual        electric_bike  499250
## 4 member        classic_bike  1878881
## 5 member        electric_bike  872036

Plot:

ggplot(rider_vs_rideable_table, aes(x = member_casual, y = count, fill = rideable_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) + 
  labs(title = "Rider Type vs. Rideable Type", x = "Rider Type", y = "Count") +
  theme_minimal() +
  theme(legend.title = element_blank())

Observation:
* As shown, the results align with the observations made earlier in EDA 1 and EDA 2. For both casual and member riders, the current prediction is that the distribution of trips will be approximately two-thirds on classic bikes and one-third on electric bikes.

EDA 7 - Analysis of Rider Type (member_casual) by Day of the Week

We will explore two scenarios: (i) Classic Bike Usage by Day and (ii) Electric Bike Usage by Day.

Adding a day-of-the-week column to the dataset:

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

trip_data$started_at <- as.POSIXct(trip_data$started_at, format = "%Y-%m-%d %H:%M:%S")
trip_data$ended_at <- as.POSIXct(trip_data$ended_at, format = "%Y-%m-%d %H:%M:%S")
trip_data <- trip_data %>%
  mutate(day_of_week = wday(started_at, label = TRUE, week_start = 1))

(i) Classic Bike Usage by Day

Plot:

classic_bike_usage_by_rider_type <- trip_data %>%
  filter(rideable_type == "classic_bike") %>%
  group_by(member_casual, day_of_week) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(classic_bike_usage_by_rider_type, aes(x = day_of_week, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +  
  labs(title = "Member vs. Casual Riders on Classic Bikes by Day of the Week",
       x = "Day of the Week", y = "Number of Trips",
       fill = "Rider Type") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "darkorange", "member" = "blue")) +
  theme_minimal()

(ii) Electric Bike Usage by Day

Plot:

electric_bike_usage_by_rider_type <- trip_data %>%
  filter(rideable_type == "electric_bike") %>%
  group_by(member_casual, day_of_week) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(electric_bike_usage_by_rider_type, aes(x = day_of_week, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +  
  labs(title = "Member vs. Casual Riders on Electric Bikes by Day of the Week",
       x = "Day of the Week", y = "Number of Trips",
       fill = "Rider Type") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) + 
  scale_fill_manual(values = c("casual" = "orange", "member" = "lightblue")) +
  theme_minimal()

Observation:
* Comparing the plots, the usage of classic and electric bikes is identical across the days of the week. This suggests that bike type is not a significant factor in these scenarios.
* Members dominate bike usage throughout the week, with significantly higher trip counts compared to casual riders.
* Casual riders show consistent bike usage, with peaks toward the weekends, indicating they may use bikes more for weekend leisure trips.

EDA 8 - Analysis of Rider Type (member_casual) by Hour of the Day

We will explore four scenarios: (i) Classic Bike Usage by Hour on Weekdays, (ii) Electric Bike Usage by Hour on Weekdays, (iii) Classic Bike Usage by Hour on Weekends, and (iv) Electric Bike Usage by Hour on Weekends.

Converting and adding the hour column in AM/PM format to the dataset:

trip_data <- trip_data %>%
  mutate(hour_of_day = hour(started_at))

convert_to_12hr <- function(x) {
  ifelse(x == 0, "12 AM",
         ifelse(x < 12, paste(x, "AM"),
                ifelse(x == 12, "12 PM", paste(x - 12, "PM"))))
}

(i) Classic Bike Usage by Hour on Weekdays

Plot:

classic_bike_weekday <- trip_data %>%
  filter(rideable_type == "classic_bike", day_of_week %in% c("Mon", "Tue", "Wed", "Thu", "Fri")) %>%
  group_by(member_casual, hour_of_day) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(classic_bike_weekday, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") + 
  labs(title = "Member vs. Casual Riders on Classic Bikes by Hour (Weekdays)",
       x = "Hour of the Day", y = "Number of Trips",
       fill = "Rider Type") +
  scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "brown3", "member" = "darkgreen")) +
  theme_minimal()

(ii) Electric Bike Usage by Hour on Weekdays

Plot:

electric_bike_weekday <- trip_data %>%
  filter(rideable_type == "electric_bike", day_of_week %in% c("Mon", "Tue", "Wed", "Thu", "Fri")) %>%
  group_by(member_casual, hour_of_day) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(electric_bike_weekday, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Member vs. Casual Riders on Electric Bikes by Hour (Weekdays)",
       x = "Hour of the Day", y = "Number of Trips",
       fill = "Rider Type") +
  scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "brown1", "member" = "green")) +
  theme_minimal()

Observation:
* Comparing plots (i) and (ii), the usage of classic and electric bikes by hour on weekdays is identical, suggesting that bike type is not a crucial factor in these scenarios.
* Members dominate the usage of both bike types, particularly during peak commuting hours around 8AM and 5-6PM.
* Casual riders use both bikes more consistently throughout the day, with the peak usage occurring between 4-6PM.

(iii) Classic Bike Usage by Hour on Weekends

Plot:

classic_bike_weekend <- trip_data %>%
  filter(rideable_type == "classic_bike", day_of_week %in% c("Sat", "Sun")) %>%
  group_by(member_casual, hour_of_day) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(classic_bike_weekend, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Member vs. Casual Riders on Classic Bikes by Hour (Weekends)",
       x = "Hour of the Day", y = "Number of Trips",
       fill = "Rider Type") +
  scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +  
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "darkgoldenrod3", "member" = "deepskyblue3")) +
  theme_minimal()

(iv) Electric Bike Usage by Hour on Weekends

Plot:

electric_bike_weekend <- trip_data %>%
  filter(rideable_type == "electric_bike", day_of_week %in% c("Sat", "Sun")) %>%
  group_by(member_casual, hour_of_day) %>%
  summarise(trip_count = n(), .groups = "drop")

ggplot(electric_bike_weekend, aes(x = hour_of_day, y = trip_count, fill = member_casual)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Member vs. Casual Riders on Electric Bikes by Hour (Weekends)",
       x = "Hour of the Day", y = "Number of Trips",
       fill = "Rider Type") +
  scale_x_continuous(breaks = seq(0, 23, by = 4), labels = convert_to_12hr) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = " K")) +
  scale_fill_manual(values = c("casual" = "goldenrod1", "member" = "skyblue1")) +
  theme_minimal()

Observation:
* Comparing plots (iii) and (iv), the usage of classic and electric bikes by hour on weekends is also identical, suggesting that bike type is not a crucial factor in these scenarios.
* Both bike types show a strong weekend pattern, where both casual and member riders are highly active during the midday and afternoon hours. There is a steady increase in trips starting around 8AM, with a peak around noon to 2PM, followed by a gradual decline after 4PM.
* This pattern suggests that weekend usage is largely driven by leisure trips, such as visits to parks, tourist attractions, or other recreational activities.

EDA 9 - Analysis of Distance vs. Duration by Rider Type

In this section, we explore the relationship between distance and duration for both casual riders and member riders.

Distance and duration have been added to the table for analysis, and some outliers have been removed using the code shown below:

# Adding duration in minutes to the dataset
trip_data <- trip_data %>%
  mutate(duration_minutes = as.numeric(difftime(ended_at, started_at, units = "mins")))

# Adding distance in miles to the dataset using Haversine formula
library(geosphere)
trip_data <- trip_data %>%
  mutate(distance_miles = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat)) / 1609.34)  # Convert meters to miles

# Remove outliers where distance is greater than 6000 miles
trip_data <- trip_data %>%
  filter(distance_miles <= 6000)

# Remove outliers where trip duration is greater than 2000 minutes
trip_data <- trip_data %>%
  filter(duration_minutes <= 2000)

Plots:

ggplot(trip_data, aes(x = distance_miles, y = duration_minutes)) +
  geom_hex() +
  labs(title = "Distance vs. Duration by Rider Type (Hexagonal Binning)", 
       x = "Distance (Miles)", y = "Duration (Minutes)") +
  facet_wrap(~member_casual) +  # Facet by rider type (member vs. casual)
  theme_minimal() +
  scale_fill_gradient(low = "lightblue", high = "darkred")  # Adjust the color gradient

Observation:
* The plots show a high concentration of trips (represented by darker red in the lower-left corners) with short distances (< 5 miles) and short durations (< 100 minutes). The density of trips becomes sparser as distances exceed 10 miles. These observations suggest that the bikeshare system is predominantly used for shorter, local trips rather than long-distance travel.
* Comparing the plots, casual riders tend to have longer durations for shorter distances compared to members, reflecting that casual riders are more likely to use the bikes for leisure or exploration rather than commuting.
* Most of the rides completed by members are short in both distance and duration, indicating that they primarily use the service for functional trips, such as commuting or regular, planned travel.
* Long-distance and long-duration trips are rare for both rider types.

Exploratory Data Analysis of Divvy Bikes in R

Ee Ming Chua

2024-09-10

Introduction

Overview of the Imported Dataset

EDA 1 - Distribution of Bike Type (rideable_type)

EDA 2 - Distribution of Rider Type (member_casual)

EDA 3 - Distribution of Start Stations (start_station_name)

EDA 4 - Distribution of End Stations (end_station_name)

EDA 5 - Analysis of the Top 10 Stations

EDA 6 - Analysis Rider Type (member_casual) vs. Rideable Type (rideable_type)

EDA 7 - Analysis of Rider Type (member_casual) by Day of the Week

(i) Classic Bike Usage by Day

(ii) Electric Bike Usage by Day

EDA 8 - Analysis of Rider Type (member_casual) by Hour of the Day

(i) Classic Bike Usage by Hour on Weekdays

(ii) Electric Bike Usage by Hour on Weekdays

(iii) Classic Bike Usage by Hour on Weekends

(iv) Electric Bike Usage by Hour on Weekends

EDA 9 - Analysis of Distance vs. Duration by Rider Type

Business Recommendations: