If you’ve ever been to NYC or Jersey City, you have likely seen long rows of bright blue bikes lined up along some city streets. Maybe you’ve even taken a ride, or used one on your way to work. These are a part of the Citibike bike-sharing system, a network of over 12,000 communal bikes with over 800 stations all of the boroughs and Jersey City. Between January 1, 2017 and January 30, 2018, there were 16 million check-ins at Citibike stations.
A Citibike Station in NYC.
Using data collected about the stations and types of people who use them, we wanted to find out: Who is using Citibike the most? Does a user’s gender or customer/subscriber status affect the stations they use? What about the times they use them? We’ve worked with MySQL , R and GitHub1 to help analyze the data and answer these questions.
#load libraries
library(mdsr)
library(RMySQL)
library(leaflet)
library(tidyverse)
library(lubridate)
library(scales)
#connecting to server
db <- dbConnect_scidb(dbname = "citibike")# Number of visits made by subscribers and customers
db %>%
dbGetQuery("SELECT count(*)
FROM citibike.trips
WHERE user_type = 'Subscriber';") # 14,579,325 --> 89.18%
db %>%
dbGetQuery("SELECT count(*)
FROM citibike.trips
WHERE user_type = 'Customer';") # 1,769,423 --> 10.82%
# Dates of Citibike data
# Earliest date
db %>%
dbGetQuery("SELECT earliest, latest
FROM citibike.station_months
ORDER BY earliest
LIMIT 10;") # 2017-01-01
# Latest date
db %>%
dbGetQuery("SELECT earliest, latest
FROM citibike.station_months
ORDER BY latest DESC
LIMIT 10;") # 2018-01-30#finding gender distribution of citibike subscribers
#14670034 is total number of subs (male + female)
#note: gender is only recorded for subs.
sub_gender <- db %>%
dbGetQuery("
SELECT gender, sum(1) AS N, sum(1)/14670034 AS percent
FROM citibike.trips
WHERE gender != 0
GROUP BY gender;")
#0.7468 for men,
#0.2532 for womenInitially, we found that the overwhelming majority, 89.18%, of visits over all stations were made by subscribers and the remaining 10.82% were made by customers (non-subscribers). This could be from typical pricing plan behavior. It’s inevitably cheaper to subscribe to Citibike if you use it more frequently. Additionally, once people buy a plan they may be incentivized to use it more often to get their money’s worth. Of subscriber visits, only 25% of them were made by women. What could be causing this large gender disparity?
#getting start times for everyone across all stations
trips <- dbGetQuery(db, "SELECT start_time
FROM citibike.trips;")
#calculates the number of visits per every hour of the day for the whole year across all stations
everyone_hours <- trips %>%
select(start_time) %>%
mutate(new_start_time = as_datetime(start_time)) %>%
mutate(hour_time = hour(new_start_time)) %>%
group_by(hour_time) %>%
summarize(N = n())#this function calculates the number of visits per every hour of the day for the whole year across all stations FOR A DEMOGRAPHIC
hours_fun <- function(hour_arg, demographic, demo_arg) {
dbGetQuery(db, paste("
SELECT start_time, user_type, gender
FROM citibike.trips
WHERE", demographic, "=", demo_arg, ";")) %>%
mutate(new_start_time = as_datetime(start_time)) %>%
mutate(hour_time = hour(new_start_time)) %>%
group_by(hour_time) %>%
summarize(N = n())
}
#outputs a table with visits per hour across all stations for the year FOR SUBSCRIBERS
subscriber_hours <- hours_fun(demographic = "user_type", demo_arg = "'Subscriber'")
#outputs a table with visits per hour across all stations for the year FOR CUSTOMERS
customer_hours <- hours_fun(demographic = "user_type", demo_arg = "'Customer'")
#outputs a table with visits per hour across all stations for the year FOR MALES
male_hours <- hours_fun(demographic = "gender", demo_arg = "1")
#outputs a table with visits per hour across all stations for the year FOR FEMALES
female_hours <- hours_fun(demographic = "gender", demo_arg = "2")Using Citibike data, we created graphics for the number of visits to all stations per every hour of the day for the whole year. We broke this visiting data down by gender and customer/subscriber status.
hours_list <- seq(0, 23, 1)
female_hours$hour_time = factor(female_hours$hour_time, levels = hours_list)
male_hours$hour_time = factor(male_hours$hour_time, levels = hours_list)
customer_hours$hour_time = factor(customer_hours$hour_time, levels = hours_list)
subscriber_hours$hour_time = factor(subscriber_hours$hour_time, levels = hours_list)
#this function takes in a demographic and uses the results of the function above to create a bar chart of counts for each hour.
plot_hours <- function(demographic, title, color, yEnd) {
#making a "google trends" -like bar graph, with hours on x and count on y
hours_plot <- ggplot(data = demographic, aes(x = hour_time, y = N)) +
coord_cartesian(ylim = c(0, yEnd), expand = c(0, 0)) +
geom_col(fill = color) +
ggtitle(title) +
xlab(label = "Time Distribution (24-hour days)") +
ylab(label = "Number of Visits") +
scale_y_continuous(labels = comma) +
scale_x_discrete(labels = c("12 AM", "1 AM", "2 AM", "3 AM", "4 AM", "5 AM", "6 AM", "7 AM", "8 AM", "9 AM", "10 AM", "11 AM", "12 PM",
"1 PM", "2 PM", "3 PM", "4 PM", "5 PM", "6 PM", "7 PM", "8 PM", "9 PM", "10 PM", "11 PM")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
hours_plot
}
plot_hours(demographic = female_hours, title = "Count of Female Visits by Hour", color = "red", yEnd = 375000)plot_hours(demographic = male_hours, title = "Count of Male Visits by Hour", color = "blue", yEnd = 1000000)plot_hours(demographic = customer_hours, title = "Count of Customer Visits by Hour", color = "gold", yEnd = 187500)plot_hours(demographic = subscriber_hours, title = "Count of Subscriber Visits by Hour", color = "purple", yEnd = 1500000)The male, female, and subscriber hour distributions look very similar, due to the fact that all people with a recorded gender are subscribers. This also indicates that male and female subscribers use bikes at the same time, probably for the same purpose of commuting. Subscribers overall had distributions around normal rush hour times–peaks at 8:00 AM when people usually go to work and 5:00-6:00 PM when people go home. Customers had a very different distribution, with a bell curve-esque peak around 3:00 PM. A reason for this may be that subscribers use Citibike for their commute, while customers use them for recreational purposes, and do not have the strict constraints on their time that workers do.
This hypothesis that subscribers are usually commuters could be the reason why so few subscribers are female. Commuting women have to wear professional clothing may be too restricting for bikes. This includes slim pencil skirts, high heels, carefully done hairstyles, and purses. None of these are conducive to riding bicycles. Men’s professional wear is much freer–backpacks, pants, flat-heeled shoes–and are also not expected to care for their personal appearances to the same extent, so it might be easier for them to cycle to work.
A male commuter using Citibike (with typical business casual menswear).
#finding top 10 most visited stations for everyone
everyone_visits <- db %>%
dbGetQuery("
SELECT name, lat, lon, num_starts + num_stops AS visits
FROM citibike.station_summary ss
ORDER BY visits desc
LIMIT 10;")#station_summary table from database -- only selecting columns we need
station_summary <- db %>%
dbGetQuery("SELECT station_id, name, lat, lon
FROM station_summary;")
#this function takes in a demographic type and outputs one table that is the 10 most visited (start and end) stations for the demographic
top_10_visits <- function(demographic, demo_type) {
#number of demographic visits to each start station (N_start)
start_tally <- db %>%
dbGetQuery(paste("
SELECT trips.start_station_id, sum(1) AS N_start
FROM citibike.trips
WHERE", demographic, "=",demo_type, "
GROUP BY trips.start_station_id
ORDER BY N_start desc;"))
#number of demographic visits to each end station (N_end)
end_tally <- db %>%
dbGetQuery(paste("
SELECT trips.end_station_id, sum(1) AS N_end
FROM citibike.trips
WHERE ", demographic, "=",demo_type, "
GROUP BY trips.end_station_id
ORDER BY N_end desc;"))
#join start_tally and end_tally in order to calculate the total number of visits
visits <- full_join(start_tally, end_tally, by = c("start_station_id" = "end_station_id"))
#replacing NA values with 0
visits$N_start[is.na(visits$N_start)] <- 0
visits$N_end[is.na(visits$N_end)] <- 0
#calculating total visits for each station
visits <- visits %>%
mutate(total_visits = N_start + N_end) %>%
arrange(desc(total_visits)) %>%
head(10) %>%
#getting lat, lon, and station name from station summary table
left_join(station_summary, by = c("start_station_id" = "station_id")) %>%
select(lat, lon, name, start_station_id, total_visits)
return(visits)
}#spatial data on Leaflet -- plotting top 10 most visited stations for everyone, men, women, subs, and customers
#querying by demographic using top_10_visits function to find top 10 stations for each demo
female_visits <- top_10_visits(demographic = "gender", demo_type = "2")
sub_visits <- top_10_visits(demographic = "user_type", demo_type = "'Subscriber'")
male_visits <- top_10_visits(demographic = "gender", demo_type = "1")
cust_visits <- top_10_visits(demographic = "user_type", demo_type = "'Customer'")
#setting up icon appearances
female_visits_icons <- awesomeIcons(library = 'fa', icon = 'bicycle', markerColor = "red", iconColor = "ivory")
male_visits_icons <- awesomeIcons(library = 'fa', icon = 'bicycle', markerColor = "blue", iconColor = "ivory")
sub_visits_icons <- awesomeIcons(library = 'fa', icon = 'bicycle', markerColor = "orange", iconColor = "black")
cust_visits_icons <- awesomeIcons(library = 'fa', icon = 'bicycle', markerColor = "darkpurple", iconColor = "ivory")
everyone_visits_icons <- awesomeIcons(library = 'fa', icon = 'bicycle', markerColor = "green", iconColor = "ivory")
#plotting using leaflet!
leaflet() %>%
addTiles() %>%
addAwesomeMarkers(data = female_visits, popup = ~name, icon = female_visits_icons, group = "Top 10 Stations for Women") %>%
addAwesomeMarkers(data = male_visits, popup = ~name, icon = male_visits_icons, group = "Top 10 Stations for Men") %>%
addAwesomeMarkers(data = sub_visits, popup = ~name, icon = sub_visits_icons, group = "Top 10 Stations for Subscribers") %>%
addAwesomeMarkers(data = cust_visits, popup = ~name, icon = cust_visits_icons, group = "Top 10 Stations for Customers") %>%
addAwesomeMarkers(data = everyone_visits, popup = ~name, icon = everyone_visits_icons, group = "Top 10 Stations for Everyone") %>%
addLegend("bottomright", colors = c("red", "blue", "orange", "purple", "green"), labels = c("Women", "Men", "Subscribers", "Customers", "Everyone")) %>%
addLayersControl(overlayGroups = c("Top 10 Stations for Women", "Top 10 Stations for Men", "Top 10 Stations for Subscribers", "Top 10 Stations for Customers", "Top 10 Stations for Everyone"), options = layersControlOptions(collapsed = FALSE))Additionally, we plotted the 10 most visited stations overall and for each demographic on an interactive map. All of the most visited stations are in Manhattan, the epicenter of NYC. Customers are generally concentrated in the Central Park area, with 6 out of their top 10 stations located there. It is likely that they are using Citibike to have a scenic ride around the park. The other stations were either in recreational areas like parks or sports centers, or near transportation hubs like the Brooklyn Bridge, which also doubles as a tourist spot.
Tourists enjoy a riverside ride in NYC on their Citibikes.
The top subscriber stations can be divided into three groups. Five stations are near large transportation hubs–Port Authority, Grand Central Station, and Penn Station. This fits in with our earlier hypothesis about hours distributions. If subscribers are generally commuting to work, then it makes sense that half of the top 10 stations are near large train stations. So, Citibike may represent just one leg of subscriber commutes. This may be because using both Citibike and the subway offers more flexibility than using just one transportation option. There are only so many train stations and Citibike stations, but combining the two gives more options to get from point A to point B in Manhattan.
Three other top 10 subscriber stations are in the Flatiron district, which is considered the “Silicon Alley” of New York as it is a large technology sector. A large number of subscribers could work in the tech industry, which may have implications for advertising campaigns if Citibike wanted to expand its user base. One station is near the Chelsea Piers, a recreational area with workout options like swimming pools. The last is near Rockefeller Park, several schools, and a library. This is also the only point of overlap between customers and subscribers, likely because of the draw of Rockefeller Park for tourists or a day visit by residents.
The top 10 stations for men and women were in different parts of Manhattan with some overlap. For women, many stations seem to be concentrated in retail areas. Men use stations in the Garment District and near the train stations Penn Station and Port Authority. Why are women going to transport stations less than men? Both demographics have Grand Central Station in their top 10 lists, but that is the only major transport hub for women.
One hypothesis is that men use Citibike for commuting, which is indicated by how they more often frequent transport stations. On the other hand, women use Citibike for recreational reasons, suggested by how their most frequented stations are located in districts densely populated by restaurants and shopping establishments. An alternate hypothesis is that women using Citibike have jobs in retail industries. We can hypothesize this based on how they are using Citibike during regular work commute hours. Additionally, because retail stores do not normally require formal business attire for employees, women in retail jobs may find Citibike a more feasible transportation option than women with other city jobs.
We’ve found that depending on the category of user, the stations that Citibike users frequent the most often and the times that they’re on the bikes do change. While we can only guess at the reasons for why they take the routes that they do, we can still construct a picture of what kind of person that a user is, however incomplete. One possible concern for future data work is the ethics of data gathering. By just using spatial data, we’ve found a lot of information about the things that people tend to do. Citibike, and other companies like it, probably has much more data on each customer. In the future, it will be necessary to have discussions on what data can be collected and released to not violate the privacy of consumer.
Here is the link to the used GitHub respository: https://github.com/emilyliu47/mp4.git↩