Final Project

Author

Thejitha Rajapakshe

Citi Bikes Analysis

https://ny1.com/nyc/all-boroughs/transit/2024/01/03/citi-bike-price-increases-to-start-taking-effect-thursday

Introduction

In this project, I explore the March 2021 dataset of Citi Bike rides in New York City. The dataset was collected by Lyft, the parent company of Citi Bike, and was obtained from the Citi Bike website (source: Citi Bike).

Citi Bike is a bike-sharing service that allows users to rent bicycles for short trips, providing a convenient and eco-friendly mode of transportation in New York City. The dataset contains information about bike rides, including start and stop times, trip duration, start and end station locations, user types, and gender.

Dataset Description

The dataset includes the following columns:

starttime: Start time of the ride
stoptime: Stop time of the ride
start station name: Name of the start station
start station latitude: Latitude of the start station
start station longitude: Longitude of the start station
end station name: Name of the end station
end station latitude: Latitude of the end station
end station longitude: Longitude of the end station
tripduration: Duration of the trip in seconds
usertype: User type (Customer or Subscriber)
gender: Gender of the user (0 for unknown, 1 for male, 2 for female)

Questions to Explore

What are the top 10 end stations by trip count?
What is the distribution of trip duration?
How does trip duration vary by user type?
What are the top 10 end stations by average speed?

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(ggplot2)
library(ggfortify)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(treemap)
library(dplyr)

setwd("/Users/thejitharajapakshe/Desktop/DATA 110")
data <- read_csv("March2021.csv")

Rows: 16843 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): start station name, end station name, usertype
dbl  (10): tripduration, start station id, start station latitude, start sta...
dttm  (2): starttime, stoptime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data <- data %>%
  mutate(start_date = format(as.Date(starttime), "%d"),
         start_time = format(as.POSIXct(starttime), format = "%H:%M:%S"))

# Extract day from stoptime column
data <- data %>%
  mutate(stop_date = format(as.Date(stoptime), "%d"),
         stop_time = format(as.POSIXct(stoptime), format = "%H:%M:%S"))

# Select only the columns you need
data <- data %>%
  select(-starttime, -stoptime)

colnames(data)[which(names(data) == "start_date")] <- "start_day"
colnames(data)[which(names(data) == "stop_date")] <- "stop_day"

# Create a new column 'gender_category' based on the 'gender' column
# Assign categories: "Unknown" for gender = 0, "Male" for gender = 1, and "Female" for gender = 2
# Remove the 'gender' column
data <- data %>%
  mutate(gender_category = case_when(
    gender == 0 ~ "Unknown",
    gender == 1 ~ "Male",
    gender == 2 ~ "Female")) %>%
  select(-gender)

# Convert trip duration from seconds to minutes
data <- data %>%
  mutate(tripduration = tripduration / 60) 

# Filter out rides with a duration greater than 15 minutes
ridesover15 <- data %>%
  filter(tripduration > 15)

# Arrange rides in descending order of trip duration and select the top 50 longest rides
top_50_longest <- ridesover15 %>%
  arrange(desc(tripduration)) %>%
  head(50)

# Define function to calculate distance between two points using Haversine formula
calculate_distance <- function(lat1, lon1, lat2, lon2) {
  R <- 3959 # Earth radius in miles
  d_lat <- (lat2 - lat1) * pi / 180
  d_lon <- (lon2 - lon1) * pi / 180
  a <- sin(d_lat / 2)^2 + cos(lat1 * pi / 180) * cos(lat2 * pi / 180) * sin(d_lon / 2)^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  distance <- R * c
  return(distance) # From google search
}

# Calculate distance and average speed for the top 50 longest rides
top_50_longest <- top_50_longest %>%
  mutate(distance_mi = calculate_distance(`start station latitude`, `start station longitude`, `end station latitude`, `end station longitude`),
         average_speed_mph = distance_mi / tripduration * 60)

fit <- lm(tripduration ~ distance_mi + average_speed_mph, data = top_50_longest)

summary(fit)


Call:
lm(formula = tripduration ~ distance_mi + average_speed_mph, 
    data = top_50_longest)

Residuals:
    Min      1Q  Median      3Q     Max 
-4916.1 -1615.5  -643.7     0.3 26472.5 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)         1020.5      835.4   1.222  0.22792   
distance_mi         3958.4     1561.2   2.535  0.01462 * 
average_speed_mph -19340.8     7155.7  -2.703  0.00954 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4377 on 47 degrees of freedom
Multiple R-squared:  0.1351,    Adjusted R-squared:  0.09825 
F-statistic:  3.67 on 2 and 47 DF,  p-value: 0.03305

top_50_longest$predicted <- predict(fit)   # Save the predicted values
top_50_longest$residuals <- residuals(fit) # Save the residual values

# Select relevant columns and display the first few rows
head(select(top_50_longest, tripduration, predicted, residuals))

# A tibble: 6 × 3
  tripduration predicted residuals
         <dbl>     <dbl>     <dbl>
1       30926.     4454.    26472.
2       10427.     4180.     6247.
3        6101.     3631.     2470.
4        5681.     1021.     4660.
5        5328.     7137.    -1809.
6        4363.     5746.    -1383.

Equation for the model: tripduration = 1020.5 + 3958.4distance_mi - 19340.8average_speed_mph

P-values:

Intercept: 0.22792
Distance (distance_mi): 0.01462
Average Speed (average_speed_mph): 0.00954

Adjusted R2 value: 0.09825

Analysis:

Interpretation of coefficients:
- The coefficient for ‘distance_mi’ is positive (3958.4), indicating that as the distance increases, the trip duration also increases.
- The coefficient for ‘average_speed_mph’ is negative (-19340.8), suggesting that as the average speed increases, the trip duration decreases.
P-values:
- Both ‘distance_mi’ and ‘average_speed_mph’ have p-values less than 0.05, indicating that they are statistically significant in predicting trip duration.
Adjusted R2 value:
- The adjusted R2 value of 0.09825 suggests that approximately 9.83% of the variability in trip duration can be explained by the model.

Conclusion: The model suggests that both distance and average speed significantly affect trip duration. However, the model’s predictive power is limited, as indicated by the relatively low adjusted R2 value.

# Create scatter plot of observed vs predicted trip duration
ggplot(top_50_longest, aes(x = tripduration, y = predicted)) +
  geom_point(aes(color = "Observed")) +
  geom_point(aes(y = predicted, color = "Predicted"), shape = 1) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +  # Add a line of best fit
  labs(x = "Observed Trip Duration (hours)", y = "Predicted Trip Duration (hours)", color = "Data") +
  ggtitle("Observed vs Predicted Trip Duration")

`geom_smooth()` using formula = 'y ~ x'

# Create a bar plot of average arriving speed by end station
plot_ly(data = top_50_longest, x = ~`end station name`, y = ~average_speed_mph, type = 'bar', 
        marker = list(color = colors)) %>%
  layout(title = "Average Ariivng Speed by End Station",
         xaxis = list(title = "End Station"),
         yaxis = list(title = "Average Speed (mph)"))

# Count the number of trips for each end station
end_station_counts <- top_50_longest %>%
  count(`end station name`) %>%
  arrange(desc(n)) %>%
  head(10)
# Create a bar plot for the top 10 end stations by trip count
plot_ly(end_station_counts, x = ~`end station name`, y = ~n, type = "bar", color = ~`end station name`) %>%
  layout(title = "Top 10 End Stations by Trip Count",
         xaxis = list(title = "End Station"),
         yaxis = list(title = "Trip Count"))

Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors

Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors

library(ggalluvial)
ggplot(top_50_longest,
       aes(axis1 = `start station name`, axis2 = `end station name`, y = tripduration)) +
  geom_alluvium(aes(fill = average_speed_mph)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_fill_gradient(low = "blue", high = "red") +
  theme_void()

Warning in to_lodes_form(data = data, axes = axis_ind, discern =
params$discern): Some strata appear at multiple axes.

Warning in to_lodes_form(data = data, axes = axis_ind, discern =
params$discern): Some strata appear at multiple axes.

Warning in to_lodes_form(data = data, axes = axis_ind, discern =
params$discern): Some strata appear at multiple axes.

cols <- brewer.pal(5,"Set1")

avg_speed <- top_50_longest %>%
  group_by(`end station name`) %>%
  summarize(avg_speed = mean(average_speed_mph)) %>%
  top_n(10, wt = avg_speed)  # Select top 10 end stations by average speed

palette <- brewer.pal(n = 10, name = "Set3")
 plot_ly(
  data = avg_speed,
  labels = ~`end station name`,
  values = ~avg_speed,
  type = 'pie',
  text = ~paste("End Station:", `end station name`, "<br>", "Average Speed: ", round(avg_speed, 2), " mph"),
  hoverinfo = 'text+percent',  # Show text and percentage on hover
  marker = list(colors = palette)
) %>%
  layout(
    title = "Top 10 Visited End Stations by Average Speed"
  )

popup_info <- paste0(
  "<b>Start Station:</b>", top_50_longest$`start station name`, "<br>",
  "<b>End Station:</b>", top_50_longest$`end station name`, "<br>",
  "<b>Trip Duration (min):</b>", top_50_longest$tripduration, "<br>",
  "<b>Distance (miles):</b>", top_50_longest$distance_mi, "<br>",
  "<b>Average Speed (mph):</b>", top_50_longest$average_speed_mph
)

library(leaflet)

leaflet() |>
  addTiles() |>
  addPolylines(
    lng = c(top_50_longest$`start station longitude`, top_50_longest$`end station longitude`),
    lat = c(top_50_longest$`start station latitude`, top_50_longest$`end station latitude`),
    color = "green"
  ) |>
  addCircles(
    data = top_50_longest,
    lng = top_50_longest$`start station longitude`,
    lat = top_50_longest$`start station latitude`,
    color = "blue",
    radius = 5,
    popup = popup_info
  ) |>
  addCircles(
    data = top_50_longest,
    lng = top_50_longest$`end station longitude`,
    lat = top_50_longest$`end station latitude`,
    color = "red",
    radius = 5,
    popup = popup_info
  )

Background Research:

Bike-sharing systems are getting really popular as a sustainable mode of urban transportation. New York City’s Citibike system, launched in 2013, is one of the biggest, with over 300 stations and 5000 bikes. But managing these systems is a challenge, especially keeping the bikes balanced across the city.

According to the paper, they’ve come up with some cool ways to solve this problem, like figuring out the best places to put bikes and planning routes to move them around overnight. They’ve also looked at what other researchers have done in this area, like how to decide where to put bike stations.

Visualization and Analysis:

The visualization in the paper probably shows how they optimized the bike-sharing system. It might include before and after pictures of where the bikes are located, showing how they improved the balance of bikes across the city.

Interesting things to look for in the visualization:

How effective different solutions are during busy times and overnight.
How the optimization changes where the bikes are located.
How well their solution works compared to other methods.

One thing they could have added is a comparison between different solutions to see which one works best. They also could have compared the actual bike distribution to the optimized one to show how much they improved things.Reference:

O’Mahony, Eoin, and David Shmoys. “Data Analysis and Optimization for (Citi)Bike Sharing.” Proceedings of the … AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015, https://doi.org/10.1609/aaai.v29i1.9245.