Loading MTA Ridership Data

Below I import the csv file containing MTA ridership data from Github, where I saved the file.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
fileURL <- 'https://raw.githubusercontent.com/stoybis/DATA607Repo/main/MTA_Daily_Ridership_Data__Beginning_2020_20240302.csv'

mtaData <-read.csv(url(fileURL))

head(mtaData)
##         Date Subways..Total.Estimated.Ridership
## 1 03/01/2020                            2212965
## 2 03/02/2020                            5329915
## 3 03/03/2020                            5481103
## 4 03/04/2020                            5498809
## 5 03/05/2020                            5496453
## 6 03/06/2020                            5189447
##   Subways....of.Comparable.Pre.Pandemic.Day Buses..Total.Estimated.Ridership
## 1                                      0.97                           984908
## 2                                      0.96                          2209066
## 3                                      0.98                          2228608
## 4                                      0.99                          2177165
## 5                                      0.99                          2244515
## 6                                      0.93                          2066743
##   Buses....of.Comparable.Pre.Pandemic.Day LIRR..Total.Estimated.Ridership
## 1                                    0.99                              NA
## 2                                    0.99                          321569
## 3                                    0.99                          319727
## 4                                    0.97                          311662
## 5                                    1.00                          307597
## 6                                    0.92                          289171
##   LIRR....of.Comparable.Pre.Pandemic.Day Metro.North..Total.Estimated.Ridership
## 1                                     NA                                  55826
## 2                                   1.03                                 180702
## 3                                   1.02                                 190648
## 4                                   0.99                                 192689
## 5                                   0.98                                 194387
## 6                                   0.92                                 205056
##   Metro.North....of.Comparable.Pre.Pandemic.Day
## 1                                          0.59
## 2                                          0.66
## 3                                          0.69
## 4                                          0.70
## 5                                          0.70
## 6                                          0.74
##   Access.A.Ride..Total.Scheduled.Trips
## 1                                19922
## 2                                30338
## 3                                32767
## 4                                34297
## 5                                33209
## 6                                30970
##   Access.A.Ride....of.Comparable.Pre.Pandemic.Day
## 1                                            1.13
## 2                                            1.02
## 3                                            1.10
## 4                                            1.15
## 5                                            1.12
## 6                                            1.04
##   Bridges.and.Tunnels..Total.Traffic
## 1                             786961
## 2                             874620
## 3                             882175
## 4                             905558
## 5                             929298
## 6                             945408
##   Bridges.and.Tunnels....of.Comparable.Pre.Pandemic.Day
## 1                                                  0.98
## 2                                                  0.95
## 3                                                  0.96
## 4                                                  0.98
## 5                                                  1.01
## 6                                                  1.03
##   Staten.Island.Railway..Total.Estimated.Ridership
## 1                                             1636
## 2                                            17140
## 3                                            17453
## 4                                            17136
## 5                                            17203
## 6                                            15285
##   Staten.Island.Railway....of.Comparable.Pre.Pandemic.Day
## 1                                                    0.52
## 2                                                    1.07
## 3                                                    1.09
## 4                                                    1.07
## 5                                                    1.08
## 6                                                    0.96

Tidying the data

The data is not in a tidy format because each date has multiple observations. For example, the total subway ridership for March 1st, 2020 and the total bus ridership for March 1st, 2020 are two different observations, but they are currently contained in the same row.

Below I tidy the data by pivoting it to a longer format. Additionally, I set the measurement column to be a factor, and I convert the Data column (which is currently type character) to dates using the Lubridate package

mtaDataTidy <- pivot_longer(mtaData, cols = !c('Date'), 
                            names_to = 'measurement',
                            values_to = 'value')


mtaDataTidy$measurement <- as.factor(mtaDataTidy$measurement)

mtaDataTidy$Date <- mdy(mtaDataTidy$Date)

head(mtaDataTidy, n = 10)
## # A tibble: 10 × 3
##    Date       measurement                                          value
##    <date>     <fct>                                                <dbl>
##  1 2020-03-01 Subways..Total.Estimated.Ridership              2212965   
##  2 2020-03-01 Subways....of.Comparable.Pre.Pandemic.Day             0.97
##  3 2020-03-01 Buses..Total.Estimated.Ridership                 984908   
##  4 2020-03-01 Buses....of.Comparable.Pre.Pandemic.Day               0.99
##  5 2020-03-01 LIRR..Total.Estimated.Ridership                      NA   
##  6 2020-03-01 LIRR....of.Comparable.Pre.Pandemic.Day               NA   
##  7 2020-03-01 Metro.North..Total.Estimated.Ridership            55826   
##  8 2020-03-01 Metro.North....of.Comparable.Pre.Pandemic.Day         0.59
##  9 2020-03-01 Access.A.Ride..Total.Scheduled.Trips              19922   
## 10 2020-03-01 Access.A.Ride....of.Comparable.Pre.Pandemic.Day       1.13

Deliverable 1

One of the deliverables is to summarize the 2020, 2021, 2022 average ridership for Subway, Buses, LIRR and Metro-North and find which transportation has the highest and lowest ridership in 2020.

First I filter the tidy data to grab only the measurements for ridership totals and then I add a year column, which extracts the year from the date column

riderShipTotalsTidy <- filter(mtaDataTidy, grepl('Ridership', measurement))

riderShipTotalsTidy$Year <- year(riderShipTotalsTidy$Date)

head(riderShipTotalsTidy)
## # A tibble: 6 × 4
##   Date       measurement                                        value  Year
##   <date>     <fct>                                              <dbl> <dbl>
## 1 2020-03-01 Subways..Total.Estimated.Ridership               2212965  2020
## 2 2020-03-01 Buses..Total.Estimated.Ridership                  984908  2020
## 3 2020-03-01 LIRR..Total.Estimated.Ridership                       NA  2020
## 4 2020-03-01 Metro.North..Total.Estimated.Ridership             55826  2020
## 5 2020-03-01 Staten.Island.Railway..Total.Estimated.Ridership    1636  2020
## 6 2020-03-02 Subways..Total.Estimated.Ridership               5329915  2020

Next, I summarize the data by grouping first by year then by transportation type

summaryTableRidership <- riderShipTotalsTidy |> 
  group_by(Year, measurement) |>
  summarize(avg = mean(value, na.rm = TRUE))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
head(summaryTableRidership)
## # A tibble: 6 × 3
## # Groups:   Year [2]
##    Year measurement                                           avg
##   <dbl> <fct>                                               <dbl>
## 1  2020 Buses..Total.Estimated.Ridership                  481659.
## 2  2020 LIRR..Total.Estimated.Ridership                    58131.
## 3  2020 Metro.North..Total.Estimated.Ridership             37545.
## 4  2020 Staten.Island.Railway..Total.Estimated.Ridership    2383.
## 5  2020 Subways..Total.Estimated.Ridership               1209467.
## 6  2021 Buses..Total.Estimated.Ridership                 1045583.

Below I visualize the data

ggplot(summaryTableRidership, aes(x = Year, y = avg, fill = measurement)) +
  geom_bar(position = 'dodge', stat ='identity') + 
  ggtitle('Average yearly ridership by transportation type')

Below I create a table that has the same data as above but in an easily readable format. I do this by pivoting the table to a wider format

summaryTableRidershipWider <- pivot_wider(summaryTableRidership, names_from = measurement, values_from = avg)

summaryTableRidershipWider[2:6] <- apply(summaryTableRidershipWider[2:6], c(1,2), comma)

summaryTableRidershipWider
## # A tibble: 5 × 6
## # Groups:   Year [5]
##    Year Buses..Total.Estimated.R…¹ LIRR..Total.Estimate…² Metro.North..Total.E…³
##   <dbl> <chr>                      <chr>                  <chr>                 
## 1  2020 481,659                    58,131                 37,545                
## 2  2021 1,045,583                  96,630                 72,498                
## 3  2022 1,161,498                  142,462                126,138               
## 4  2023 1,165,862                  177,760                159,019               
## 5  2024 1,049,513                  178,554                157,306               
## # ℹ abbreviated names: ¹​Buses..Total.Estimated.Ridership,
## #   ²​LIRR..Total.Estimated.Ridership, ³​Metro.North..Total.Estimated.Ridership
## # ℹ 2 more variables: Staten.Island.Railway..Total.Estimated.Ridership <chr>,
## #   Subways..Total.Estimated.Ridership <chr>

Based on the above graph and table, the Staten Island Railway had the lowest average ridership in 2020 and the subway had the highest, however these numbers were lower than in 2021 through 2024 because of the COVID pandemic.

Deliverable 2

The second deliverable is to compare the Subway and Buses ridership and determine did more people take the Subway or Bus in 2020.

Below I subset the original tidy data frame for data only from 2020 and then for data only on subway and buses ridership

subAndBusSubset2020 <- mtaDataTidy |> filter(year(Date) == 2020)
subAndBusSubset2020 <- subAndBusSubset2020 |> 
  filter(measurement %in% c('Subways..Total.Estimated.Ridership','Buses..Total.Estimated.Ridership'))


head(subAndBusSubset2020)
## # A tibble: 6 × 3
##   Date       measurement                          value
##   <date>     <fct>                                <dbl>
## 1 2020-03-01 Subways..Total.Estimated.Ridership 2212965
## 2 2020-03-01 Buses..Total.Estimated.Ridership    984908
## 3 2020-03-02 Subways..Total.Estimated.Ridership 5329915
## 4 2020-03-02 Buses..Total.Estimated.Ridership   2209066
## 5 2020-03-03 Subways..Total.Estimated.Ridership 5481103
## 6 2020-03-03 Buses..Total.Estimated.Ridership   2228608

Below I visualize the data

ggplot(subAndBusSubset2020, aes (x = Date, y = value, color = measurement)) +
  geom_line()

The subway had higher ridership in 2020 than buses, however both saw a meaningful decline due to the onset of COVID.