For Project 2, Dataset 2, I chose to analyze the “MTA Daily Ridership” dataset provided by John. As John suggested in the Week 5 discussion forum, we should compare MTA daily ridership during Covid-19 with ridership prior to Covid-19.
library(dplyr)
#library(tidyverse)
library(tidyr)
library(DT)
raw_data <- read.csv('https://raw.githubusercontent.com/suswong/DATA-607-Project-2/main/MTA_Daily_Ridership_Data__Beginning_2020.csv')
raw_data$Date <- as.Date(raw_data$Date, format = "%m/%d/%Y")
datatable(raw_data)
newtable <- raw_data
colnames(newtable) <- c("Date", "Subway", "Subway Percentage", "Bus", "Bus Percentage", "Lirr", "Lirr Percentage","Metro", "Metro Percentage","Access", "Access Percentage","Bridge", "Bridge Percentage","SIRailway", "SIRailway Percentage")
Tidydata <- newtable[c('Date', 'Subway', 'Bus', 'Lirr','Metro',"Access", 'Bridge', 'SIRailway')]
long_data <- Tidydata %>%
  pivot_longer(cols = c('Subway', 'Bus', 'Lirr', 'Metro', 'Access', 'Bridge', 'SIRailway'),
               names_to = 'Transportation', values_to = 'Total_Ridership')
datatable(long_data)
At first I used ‘geom_line’, but the graph was very hard to read because most of the lines overlapped, so I switched to smoothed trend lines with ‘geom_smooth’. I used the following link to help me draw the trends.
Between early 2020 and mid-2020, around the start of the pandemic, there was a period when more people took the bus than the subway. Otherwise, more people took the subway than any other mode of transportation. The number of people who took the MTA decreased between early 2020 and mid-2020, but it increased over time afterward.
It is also very interesting that between early 2020 and mid-2020 there was a decrease in all modes of transportation except bridges and tunnels. The number of people who took the Staten Island Railway and Metro-North stayed almost constant throughout the period.
library(ggplot2)
ggplot(long_data, aes(x = Date, y = Total_Ridership, colour = Transportation)) +
  geom_smooth() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
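To illustrate why a smoothed trend can be easier to read than raw daily lines, here is a minimal sketch on simulated data. The data frame `sim`, the mode names, and all numbers are made up for illustration and are not taken from the MTA file.

```r
library(ggplot2)

# Simulated data for illustration only; these are not MTA figures.
set.seed(1)
days <- seq(as.Date("2020-03-01"), by = "day", length.out = 120)
sim <- data.frame(
  Date = rep(days, times = 2),
  Transportation = rep(c("Mode A", "Mode B"), each = length(days)),
  Total_Ridership = c(
    1000 + 5 * seq_along(days) + rnorm(length(days), sd = 300),
    1200 + 3 * seq_along(days) + rnorm(length(days), sd = 300)
  )
)

base_plot <- ggplot(sim, aes(x = Date, y = Total_Ridership, colour = Transportation))

# Raw daily values: the noisy lines cross each other frequently.
p_line <- base_plot + geom_line()

# One smoothed trend per mode (loess is the default method for small data).
p_smooth <- base_plot + geom_smooth()
```

Printing `p_line` and `p_smooth` side by side should show the same two series, with the smoothed version making the overall direction of each mode easier to compare.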
Shoshana suggested analyzing and comparing ridership before and after the pandemic. Below, I calculate the total estimated ridership prior to the Covid-19 pandemic for each mode of transportation using the provided percentage. I assume the provided percentage was calculated by dividing each day's total estimated ridership by the total estimated ridership prior to the Covid-19 pandemic.
Create a new data table that contains only the columns with “Subway” in their names.
subway <- raw_data %>%
  select(contains("Subway"))
subway <- cbind(raw_data$Date,subway)
colnames(subway) <- c("Date", "Total.Estimated.Ridership", "PrePandemic.Percentage")
The dataset does not provide the total estimated ridership prior to the Covid-19 pandemic, so I calculate it as the total estimated ridership divided by the percentage of a comparable pre-pandemic day.
subway$Total.Estimated.Ridership.PrePandemic <- (subway$Total.Estimated.Ridership)/(subway$PrePandemic.Percentage)
datatable(subway)
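As a worked example of the back-calculation, here is a small sketch with toy numbers. The function name `estimate_prepandemic` and the values are hypothetical, and the sketch assumes the percentage column is stored as a fraction (0.60 meaning 60%); if the file stored whole-number percents, they would need to be divided by 100 first.

```r
# Hypothetical helper for illustration; not part of the original analysis.
estimate_prepandemic <- function(ridership, pct) {
  # pct is assumed to be a fraction of the comparable pre-pandemic day
  # (0.60 = 60%). Under that assumption: pre-pandemic = ridership / pct.
  ridership / pct
}

# Toy numbers, not taken from the MTA file.
ridership_today <- c(1200000, 300000)
pct_of_prepandemic <- c(0.60, 0.15)

baseline <- estimate_prepandemic(ridership_today, pct_of_prepandemic)
```

Both toy days imply a baseline of about 2,000,000 riders. Note that a percentage of 0 would make the division return `Inf` in R, so such rows may need to be checked before plotting.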
long_data_subway <- subway %>%
  pivot_longer(cols = c('Total.Estimated.Ridership', 'Total.Estimated.Ridership.PrePandemic'))
colnames(long_data_subway) <- c("Date", "PrePandemic.Percentage", "Ridership", "Total.Estimated.Ridership")
ggplot(long_data_subway, aes(x = Date, y = Total.Estimated.Ridership, colour = Ridership)) +
  geom_line() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
Create a new data table that contains only the columns with “Bus” in their names.
bus <- raw_data %>%
  select(contains("Bus"))
bus <- cbind(raw_data$Date,bus)
colnames(bus) <- c("Date", "Total.Estimated.Ridership", "PrePandemic.Percentage")
The dataset does not provide the total estimated ridership prior to the Covid-19 pandemic, so I calculate it as the total estimated ridership divided by the percentage of a comparable pre-pandemic day.
bus$Total.Estimated.Ridership.PrePandemic <- (bus$Total.Estimated.Ridership)/(bus$PrePandemic.Percentage)
datatable(bus)
long_data_bus <- bus %>%
  pivot_longer(cols = c('Total.Estimated.Ridership', 'Total.Estimated.Ridership.PrePandemic'))
colnames(long_data_bus) <- c("Date", "PrePandemic.Percentage", "Ridership", "Total.Estimated.Ridership")
ggplot(long_data_bus, aes(x = Date, y = Total.Estimated.Ridership, colour = Ridership)) +
  geom_line() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
Create a new data table that contains only the columns with “Metro” in their names.
Metro <- raw_data %>%
  select(contains("Metro"))
Metro <- cbind(raw_data$Date,Metro)
colnames(Metro) <- c("Date", "Total.Estimated.Ridership", "PrePandemic.Percentage")
The dataset does not provide the total estimated ridership prior to the Covid-19 pandemic, so I calculate it as the total estimated ridership divided by the percentage of a comparable pre-pandemic day.
Metro$Total.Estimated.Ridership.PrePandemic <- (Metro$Total.Estimated.Ridership)/(Metro$PrePandemic.Percentage)
datatable(Metro)
long_data_Metro <- Metro %>%
  pivot_longer(cols = c('Total.Estimated.Ridership', 'Total.Estimated.Ridership.PrePandemic'))
colnames(long_data_Metro) <- c("Date", "PrePandemic.Percentage", "Ridership", "Total.Estimated.Ridership")
ggplot(long_data_Metro, aes(x = Date, y = Total.Estimated.Ridership, colour = Ridership)) +
  geom_line() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
Create a new data table that contains only the columns with “Access” in their names.
Access <- raw_data %>%
  select(contains("Access"))
Access <- cbind(raw_data$Date,Access)
colnames(Access) <- c("Date", "Total.Estimated.Ridership", "PrePandemic.Percentage")
The dataset does not provide the total estimated ridership prior to the Covid-19 pandemic, so I calculate it as the total estimated ridership divided by the percentage of a comparable pre-pandemic day.
Access$Total.Estimated.Ridership.PrePandemic <- (Access$Total.Estimated.Ridership)/(Access$PrePandemic.Percentage)
datatable(Access)
long_data_Access <- Access %>%
  pivot_longer(cols = c('Total.Estimated.Ridership', 'Total.Estimated.Ridership.PrePandemic'))
colnames(long_data_Access) <- c("Date", "PrePandemic.Percentage", "Ridership", "Total.Estimated.Ridership")
ggplot(long_data_Access, aes(x = Date, y = Total.Estimated.Ridership, colour = Ridership)) +
  geom_line() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
Create a new data table that contains only the columns with “Bridge” in their names.
Bridge <- raw_data %>%
  select(contains("Bridge"))
Bridge <- cbind(raw_data$Date,Bridge)
colnames(Bridge) <- c("Date", "Total.Estimated.Ridership", "PrePandemic.Percentage")
The dataset does not provide the total estimated ridership prior to the Covid-19 pandemic, so I calculate it as the total estimated ridership divided by the percentage of a comparable pre-pandemic day.
Bridge$Total.Estimated.Ridership.PrePandemic <- (Bridge$Total.Estimated.Ridership)/(Bridge$PrePandemic.Percentage)
datatable(Bridge)
long_data_Bridge <- Bridge %>%
  pivot_longer(cols = c('Total.Estimated.Ridership', 'Total.Estimated.Ridership.PrePandemic'))
colnames(long_data_Bridge) <- c("Date", "PrePandemic.Percentage", "Ridership", "Total.Estimated.Ridership")
By mid-2021, bridge and tunnel ridership was almost the same as it was prior to the pandemic.
ggplot(long_data_Bridge, aes(x = Date, y = Total.Estimated.Ridership, colour = Ridership)) +
  geom_line() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
Create a new data table that contains only the Staten Island Railway (“SIRailway”) columns, which are columns 14 and 15 of the raw data.
SIRailway <- raw_data[,14:15]
SIRailway <- cbind(raw_data$Date,SIRailway)
colnames(SIRailway) <- c("Date", "Total.Estimated.Ridership", "PrePandemic.Percentage")
The dataset does not provide the total estimated ridership prior to the Covid-19 pandemic, so I calculate it as the total estimated ridership divided by the percentage of a comparable pre-pandemic day.
SIRailway$Total.Estimated.Ridership.PrePandemic <- (SIRailway$Total.Estimated.Ridership)/(SIRailway$PrePandemic.Percentage)
datatable(SIRailway)
long_data_SIRailway <- SIRailway %>%
  pivot_longer(cols = c('Total.Estimated.Ridership', 'Total.Estimated.Ridership.PrePandemic'))
colnames(long_data_SIRailway) <- c("Date", "PrePandemic.Percentage", "Ridership", "Total.Estimated.Ridership")
ggplot(long_data_SIRailway, aes(x = Date, y = Total.Estimated.Ridership, colour = Ridership)) +
  geom_line() +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
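The per-mode steps above follow the same pattern each time, so they could be wrapped in a single function. The sketch below is one possible way to do that. The name `prepandemic_long`, the output column names `Series` and `Riders`, and the toy data frame are hypothetical and not part of the original analysis, and it assumes each name pattern matches exactly one count column followed by one percentage column stored as a fraction.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: builds the long table for one mode of transportation.
prepandemic_long <- function(data, pattern) {
  mode_cols <- data %>% select(contains(pattern))
  # Assumption: the pattern matches one count column and one percentage column.
  stopifnot(ncol(mode_cols) == 2)

  out <- data.frame(
    Date = data$Date,
    Total.Estimated.Ridership = mode_cols[[1]],
    PrePandemic.Percentage = mode_cols[[2]]
  )
  # Back-calculate the pre-pandemic estimate (percentage assumed to be a fraction).
  out$Total.Estimated.Ridership.PrePandemic <-
    out$Total.Estimated.Ridership / out$PrePandemic.Percentage

  # One row per date and series, ready for ggplot.
  out %>%
    pivot_longer(
      cols = c("Total.Estimated.Ridership", "Total.Estimated.Ridership.PrePandemic"),
      names_to = "Series",
      values_to = "Riders"
    )
}

# Toy data for illustration only; not taken from the MTA file.
toy <- data.frame(
  Date = as.Date(c("2020-03-01", "2020-04-01")),
  Subway.Riders = c(2000000, 400000),
  Subway.Pct = c(0.8, 0.1)
)
subway_long <- prepandemic_long(toy, "Subway")
```

With the real data, a call such as `prepandemic_long(raw_data, "Bus")` would stand in for the bus steps, provided the pattern matches exactly two columns. The Staten Island Railway columns were selected by position in this report, so they would need a pattern that matches their actual names.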
Between early 2020 and mid-2020, around the start of the pandemic, there was a period when more people took the bus than the subway. Otherwise, more riders took the subway than any other mode of transportation. The number of people who took the MTA decreased between early 2020 and mid-2020, but it increased over time afterward.
It is also very interesting that between early 2020 and mid-2020 there was a decrease in usage of all modes of transportation except bridges and tunnels.
By 2023, ridership for Access-A-Ride and for the bridges and tunnels was almost the same as it was prior to the pandemic.