Below I import the csv file containing MTA ridership data from Github, where I saved the file.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
fileURL <- 'https://raw.githubusercontent.com/stoybis/DATA607Repo/main/MTA_Daily_Ridership_Data__Beginning_2020_20240302.csv'
mtaData <-read.csv(url(fileURL))
head(mtaData)
## Date Subways..Total.Estimated.Ridership
## 1 03/01/2020 2212965
## 2 03/02/2020 5329915
## 3 03/03/2020 5481103
## 4 03/04/2020 5498809
## 5 03/05/2020 5496453
## 6 03/06/2020 5189447
## Subways....of.Comparable.Pre.Pandemic.Day Buses..Total.Estimated.Ridership
## 1 0.97 984908
## 2 0.96 2209066
## 3 0.98 2228608
## 4 0.99 2177165
## 5 0.99 2244515
## 6 0.93 2066743
## Buses....of.Comparable.Pre.Pandemic.Day LIRR..Total.Estimated.Ridership
## 1 0.99 NA
## 2 0.99 321569
## 3 0.99 319727
## 4 0.97 311662
## 5 1.00 307597
## 6 0.92 289171
## LIRR....of.Comparable.Pre.Pandemic.Day Metro.North..Total.Estimated.Ridership
## 1 NA 55826
## 2 1.03 180702
## 3 1.02 190648
## 4 0.99 192689
## 5 0.98 194387
## 6 0.92 205056
## Metro.North....of.Comparable.Pre.Pandemic.Day
## 1 0.59
## 2 0.66
## 3 0.69
## 4 0.70
## 5 0.70
## 6 0.74
## Access.A.Ride..Total.Scheduled.Trips
## 1 19922
## 2 30338
## 3 32767
## 4 34297
## 5 33209
## 6 30970
## Access.A.Ride....of.Comparable.Pre.Pandemic.Day
## 1 1.13
## 2 1.02
## 3 1.10
## 4 1.15
## 5 1.12
## 6 1.04
## Bridges.and.Tunnels..Total.Traffic
## 1 786961
## 2 874620
## 3 882175
## 4 905558
## 5 929298
## 6 945408
## Bridges.and.Tunnels....of.Comparable.Pre.Pandemic.Day
## 1 0.98
## 2 0.95
## 3 0.96
## 4 0.98
## 5 1.01
## 6 1.03
## Staten.Island.Railway..Total.Estimated.Ridership
## 1 1636
## 2 17140
## 3 17453
## 4 17136
## 5 17203
## 6 15285
## Staten.Island.Railway....of.Comparable.Pre.Pandemic.Day
## 1 0.52
## 2 1.07
## 3 1.09
## 4 1.07
## 5 1.08
## 6 0.96
The data is not in a tidy format because each date has multiple observations. For example, the total subway ridership for March 1st, 2020 and the total bus ridership for March 1st, 2020 are two different observations, but they are currently contained in the same row.
Below I tidy the data by pivoting it to a longer format. Additionally, I set the measurement column to be a factor, and I convert the Data column (which is currently type character) to dates using the Lubridate package
mtaDataTidy <- pivot_longer(mtaData, cols = !c('Date'),
names_to = 'measurement',
values_to = 'value')
mtaDataTidy$measurement <- as.factor(mtaDataTidy$measurement)
mtaDataTidy$Date <- mdy(mtaDataTidy$Date)
head(mtaDataTidy, n = 10)
## # A tibble: 10 × 3
## Date measurement value
## <date> <fct> <dbl>
## 1 2020-03-01 Subways..Total.Estimated.Ridership 2212965
## 2 2020-03-01 Subways....of.Comparable.Pre.Pandemic.Day 0.97
## 3 2020-03-01 Buses..Total.Estimated.Ridership 984908
## 4 2020-03-01 Buses....of.Comparable.Pre.Pandemic.Day 0.99
## 5 2020-03-01 LIRR..Total.Estimated.Ridership NA
## 6 2020-03-01 LIRR....of.Comparable.Pre.Pandemic.Day NA
## 7 2020-03-01 Metro.North..Total.Estimated.Ridership 55826
## 8 2020-03-01 Metro.North....of.Comparable.Pre.Pandemic.Day 0.59
## 9 2020-03-01 Access.A.Ride..Total.Scheduled.Trips 19922
## 10 2020-03-01 Access.A.Ride....of.Comparable.Pre.Pandemic.Day 1.13
One of the deliverables is to summarize the 2020, 2021, 2022 average ridership for Subway, Buses, LIRR and Metro-North and find which transportation has the highest and lowest ridership in 2020.
First I filter the tidy data to grab only the measurements for ridership totals and then I add a year column, which extracts the year from the date column
riderShipTotalsTidy <- filter(mtaDataTidy, grepl('Ridership', measurement))
riderShipTotalsTidy$Year <- year(riderShipTotalsTidy$Date)
head(riderShipTotalsTidy)
## # A tibble: 6 × 4
## Date measurement value Year
## <date> <fct> <dbl> <dbl>
## 1 2020-03-01 Subways..Total.Estimated.Ridership 2212965 2020
## 2 2020-03-01 Buses..Total.Estimated.Ridership 984908 2020
## 3 2020-03-01 LIRR..Total.Estimated.Ridership NA 2020
## 4 2020-03-01 Metro.North..Total.Estimated.Ridership 55826 2020
## 5 2020-03-01 Staten.Island.Railway..Total.Estimated.Ridership 1636 2020
## 6 2020-03-02 Subways..Total.Estimated.Ridership 5329915 2020
Next, I summarize the data by grouping first by year then by transportation type
summaryTableRidership <- riderShipTotalsTidy |>
group_by(Year, measurement) |>
summarize(avg = mean(value, na.rm = TRUE))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
head(summaryTableRidership)
## # A tibble: 6 × 3
## # Groups: Year [2]
## Year measurement avg
## <dbl> <fct> <dbl>
## 1 2020 Buses..Total.Estimated.Ridership 481659.
## 2 2020 LIRR..Total.Estimated.Ridership 58131.
## 3 2020 Metro.North..Total.Estimated.Ridership 37545.
## 4 2020 Staten.Island.Railway..Total.Estimated.Ridership 2383.
## 5 2020 Subways..Total.Estimated.Ridership 1209467.
## 6 2021 Buses..Total.Estimated.Ridership 1045583.
Below I visualize the data
ggplot(summaryTableRidership, aes(x = Year, y = avg, fill = measurement)) +
geom_bar(position = 'dodge', stat ='identity') +
ggtitle('Average yearly ridership by transportation type')
Below I create a table that has the same data as above but in an easily readable format. I do this by pivoting the table to a wider format
summaryTableRidershipWider <- pivot_wider(summaryTableRidership, names_from = measurement, values_from = avg)
summaryTableRidershipWider[2:6] <- apply(summaryTableRidershipWider[2:6], c(1,2), comma)
summaryTableRidershipWider
## # A tibble: 5 × 6
## # Groups: Year [5]
## Year Buses..Total.Estimated.R…¹ LIRR..Total.Estimate…² Metro.North..Total.E…³
## <dbl> <chr> <chr> <chr>
## 1 2020 481,659 58,131 37,545
## 2 2021 1,045,583 96,630 72,498
## 3 2022 1,161,498 142,462 126,138
## 4 2023 1,165,862 177,760 159,019
## 5 2024 1,049,513 178,554 157,306
## # ℹ abbreviated names: ¹​Buses..Total.Estimated.Ridership,
## # ²​LIRR..Total.Estimated.Ridership, ³​Metro.North..Total.Estimated.Ridership
## # ℹ 2 more variables: Staten.Island.Railway..Total.Estimated.Ridership <chr>,
## # Subways..Total.Estimated.Ridership <chr>
Based on the above graph and table, the Staten Island Railway had the lowest average ridership in 2020 and the subway had the highest, however these numbers were lower than in 2021 through 2024 because of the COVID pandemic.
The second deliverable is to compare the Subway and Buses ridership and determine did more people take the Subway or Bus in 2020.
Below I subset the original tidy data frame for data only from 2020 and then for data only on subway and buses ridership
subAndBusSubset2020 <- mtaDataTidy |> filter(year(Date) == 2020)
subAndBusSubset2020 <- subAndBusSubset2020 |>
filter(measurement %in% c('Subways..Total.Estimated.Ridership','Buses..Total.Estimated.Ridership'))
head(subAndBusSubset2020)
## # A tibble: 6 × 3
## Date measurement value
## <date> <fct> <dbl>
## 1 2020-03-01 Subways..Total.Estimated.Ridership 2212965
## 2 2020-03-01 Buses..Total.Estimated.Ridership 984908
## 3 2020-03-02 Subways..Total.Estimated.Ridership 5329915
## 4 2020-03-02 Buses..Total.Estimated.Ridership 2209066
## 5 2020-03-03 Subways..Total.Estimated.Ridership 5481103
## 6 2020-03-03 Buses..Total.Estimated.Ridership 2228608
Below I visualize the data
ggplot(subAndBusSubset2020, aes (x = Date, y = value, color = measurement)) +
geom_line()
The subway had higher ridership in 2020 than buses, however both saw a meaningful decline due to the onset of COVID.