In this project, I explore the March 2021 dataset of Citi Bike rides in New York City. The dataset was collected by Lyft, the parent company of Citi Bike, and was obtained from the Citi Bike website (source: Citi Bike).
Citi Bike is a bike-sharing service that allows users to rent bicycles for short trips, providing a convenient and eco-friendly mode of transportation in New York City. The dataset contains information about bike rides, including start and stop times, trip duration, start and end station locations, user types, and gender.
Dataset Description
The dataset includes the following columns:
starttime: Start time of the ride
stoptime: Stop time of the ride
start station name: Name of the start station
start station latitude: Latitude of the start station
start station longitude: Longitude of the start station
end station name: Name of the end station
end station latitude: Latitude of the end station
end station longitude: Longitude of the end station
tripduration: Duration of the trip in seconds
usertype: User type (Customer or Subscriber)
gender: Gender of the user (0 for unknown, 1 for male, 2 for female)
Questions to Explore
What are the top 10 end stations by trip count?
What is the distribution of trip duration?
How does trip duration vary by user type?
What are the top 10 end stations by average speed?
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 16843 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): start station name, end station name, usertype
dbl (10): tripduration, start station id, start station latitude, start sta...
dttm (2): starttime, stoptime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- data %>%mutate(start_date =format(as.Date(starttime), "%d"),start_time =format(as.POSIXct(starttime), format ="%H:%M:%S"))# Extract day from stoptime columndata <- data %>%mutate(stop_date =format(as.Date(stoptime), "%d"),stop_time =format(as.POSIXct(stoptime), format ="%H:%M:%S"))# Select only the columns you needdata <- data %>%select(-starttime, -stoptime)colnames(data)[which(names(data) =="start_date")] <-"start_day"colnames(data)[which(names(data) =="stop_date")] <-"stop_day"
# Create a new column 'gender_category' based on the 'gender' column# Assign categories: "Unknown" for gender = 0, "Male" for gender = 1, and "Female" for gender = 2# Remove the 'gender' columndata <- data %>%mutate(gender_category =case_when( gender ==0~"Unknown", gender ==1~"Male", gender ==2~"Female")) %>%select(-gender)# Convert trip duration from seconds to minutesdata <- data %>%mutate(tripduration = tripduration /60) # Filter out rides with a duration greater than 15 minutesridesover15 <- data %>%filter(tripduration >15)
# Arrange rides in descending order of trip duration and select the top 50 longest ridestop_50_longest <- ridesover15 %>%arrange(desc(tripduration)) %>%head(50)
# Define function to calculate distance between two points using Haversine formulacalculate_distance <-function(lat1, lon1, lat2, lon2) { R <-3959# Earth radius in miles d_lat <- (lat2 - lat1) * pi /180 d_lon <- (lon2 - lon1) * pi /180 a <-sin(d_lat /2)^2+cos(lat1 * pi /180) *cos(lat2 * pi /180) *sin(d_lon /2)^2 c <-2*atan2(sqrt(a), sqrt(1- a)) distance <- R * creturn(distance) # From google search}# Calculate distance and average speed for the top 50 longest ridestop_50_longest <- top_50_longest %>%mutate(distance_mi =calculate_distance(`start station latitude`, `start station longitude`, `end station latitude`, `end station longitude`),average_speed_mph = distance_mi / tripduration *60)
fit <-lm(tripduration ~ distance_mi + average_speed_mph, data = top_50_longest)summary(fit)
Call:
lm(formula = tripduration ~ distance_mi + average_speed_mph,
data = top_50_longest)
Residuals:
Min 1Q Median 3Q Max
-4916.1 -1615.5 -643.7 0.3 26472.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1020.5 835.4 1.222 0.22792
distance_mi 3958.4 1561.2 2.535 0.01462 *
average_speed_mph -19340.8 7155.7 -2.703 0.00954 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4377 on 47 degrees of freedom
Multiple R-squared: 0.1351, Adjusted R-squared: 0.09825
F-statistic: 3.67 on 2 and 47 DF, p-value: 0.03305
top_50_longest$predicted <-predict(fit) # Save the predicted valuestop_50_longest$residuals <-residuals(fit) # Save the residual values# Select relevant columns and display the first few rowshead(select(top_50_longest, tripduration, predicted, residuals))
Equation for the model: tripduration = 1020.5 + 3958.4distance_mi - 19340.8average_speed_mph
P-values:
Intercept: 0.22792
Distance (distance_mi): 0.01462
Average Speed (average_speed_mph): 0.00954
Adjusted R2 value: 0.09825
Analysis:
Interpretation of coefficients:
The coefficient for ‘distance_mi’ is positive (3958.4), indicating that as the distance increases, the trip duration also increases.
The coefficient for ‘average_speed_mph’ is negative (-19340.8), suggesting that as the average speed increases, the trip duration decreases.
P-values:
Both ‘distance_mi’ and ‘average_speed_mph’ have p-values less than 0.05, indicating that they are statistically significant in predicting trip duration.
Adjusted R2 value:
The adjusted R2 value of 0.09825 suggests that approximately 9.83% of the variability in trip duration can be explained by the model.
Conclusion: The model suggests that both distance and average speed significantly affect trip duration. However, the model’s predictive power is limited, as indicated by the relatively low adjusted R2 value.
# Create scatter plot of observed vs predicted trip durationggplot(top_50_longest, aes(x = tripduration, y = predicted)) +geom_point(aes(color ="Observed")) +geom_point(aes(y = predicted, color ="Predicted"), shape =1) +geom_smooth(method ="lm", se =FALSE, color ="black") +# Add a line of best fitlabs(x ="Observed Trip Duration (hours)", y ="Predicted Trip Duration (hours)", color ="Data") +ggtitle("Observed vs Predicted Trip Duration")
`geom_smooth()` using formula = 'y ~ x'
# Create a bar plot of average arriving speed by end stationplot_ly(data = top_50_longest, x =~`end station name`, y =~average_speed_mph, type ='bar', marker =list(color = colors)) %>%layout(title ="Average Ariivng Speed by End Station",xaxis =list(title ="End Station"),yaxis =list(title ="Average Speed (mph)"))
# Count the number of trips for each end stationend_station_counts <- top_50_longest %>%count(`end station name`) %>%arrange(desc(n)) %>%head(10)# Create a bar plot for the top 10 end stations by trip countplot_ly(end_station_counts, x =~`end station name`, y =~n, type ="bar", color =~`end station name`) %>%layout(title ="Top 10 End Stations by Trip Count",xaxis =list(title ="End Station"),yaxis =list(title ="Trip Count"))
Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
library(ggalluvial)ggplot(top_50_longest,aes(axis1 =`start station name`, axis2 =`end station name`, y = tripduration)) +geom_alluvium(aes(fill = average_speed_mph)) +geom_stratum() +geom_text(stat ="stratum", aes(label =after_stat(stratum))) +scale_fill_gradient(low ="blue", high ="red") +theme_void()
Warning in to_lodes_form(data = data, axes = axis_ind, discern =
params$discern): Some strata appear at multiple axes.
Warning in to_lodes_form(data = data, axes = axis_ind, discern =
params$discern): Some strata appear at multiple axes.
Warning in to_lodes_form(data = data, axes = axis_ind, discern =
params$discern): Some strata appear at multiple axes.
cols <-brewer.pal(5,"Set1")
avg_speed <- top_50_longest %>%group_by(`end station name`) %>%summarize(avg_speed =mean(average_speed_mph)) %>%top_n(10, wt = avg_speed) # Select top 10 end stations by average speedpalette <-brewer.pal(n =10, name ="Set3")plot_ly(data = avg_speed,labels =~`end station name`,values =~avg_speed,type ='pie',text =~paste("End Station:", `end station name`, "<br>", "Average Speed: ", round(avg_speed, 2), " mph"),hoverinfo ='text+percent', # Show text and percentage on hovermarker =list(colors = palette)) %>%layout(title ="Top 10 Visited End Stations by Average Speed" )
leaflet() |>addTiles() |>addPolylines(lng =c(top_50_longest$`start station longitude`, top_50_longest$`end station longitude`),lat =c(top_50_longest$`start station latitude`, top_50_longest$`end station latitude`),color ="green" ) |>addCircles(data = top_50_longest,lng = top_50_longest$`start station longitude`,lat = top_50_longest$`start station latitude`,color ="blue",radius =5,popup = popup_info ) |>addCircles(data = top_50_longest,lng = top_50_longest$`end station longitude`,lat = top_50_longest$`end station latitude`,color ="red",radius =5,popup = popup_info )
Background Research:
Bike-sharing systems are getting really popular as a sustainable mode of urban transportation. New York City’s Citibike system, launched in 2013, is one of the biggest, with over 300 stations and 5000 bikes. But managing these systems is a challenge, especially keeping the bikes balanced across the city.
According to the paper, they’ve come up with some cool ways to solve this problem, like figuring out the best places to put bikes and planning routes to move them around overnight. They’ve also looked at what other researchers have done in this area, like how to decide where to put bike stations.
Visualization and Analysis:
The visualization in the paper probably shows how they optimized the bike-sharing system. It might include before and after pictures of where the bikes are located, showing how they improved the balance of bikes across the city.
Interesting things to look for in the visualization:
How effective different solutions are during busy times and overnight.
How the optimization changes where the bikes are located.
How well their solution works compared to other methods.
One thing they could have added is a comparison between different solutions to see which one works best. They also could have compared the actual bike distribution to the optimized one to show how much they improved things.Reference:
O’Mahony, Eoin, and David Shmoys. “Data Analysis and Optimization for (Citi)Bike Sharing.” Proceedings of the … AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015, https://doi.org/10.1609/aaai.v29i1.9245.