WMATA Ridership

Introduction

For this project, I am using a data set from Washington Metropolitan Area Transit Authority (WMATA) to analyze metro ridership within Washington, DC. As somebody who relies on the metro on a weekly basis to commute to work (and a frequent victim of the red line’s delays), I want to look The data set source is: https://www.wmata.com/about/open-data-hub/ridership-data/metrorail-ridership-summary.html. Variables of the data set include Station Name, Date, Month, Year, Day of the Week, Time Period, Holiday Average Daily Tap Entries, Average Daily Nontap Entries and Total Entries.

library(dslabs)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

metro_ridership <- read_csv("Entries by Tap and noTap_Full Data_data.csv") #import the metro ridership data

Rows: 814390 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Station Name, Date, Day of Week, Holiday, Month, Service Type, Time...
dbl (6): Year, Avg Daily NonTapped Entries, Avg Daily Tapped Entries, Entrie...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

view(metro_ridership) #view the data to make sure it is tidy

The data is currently

colSums(is.na(metro_ridership)) #how many NA values are in each column

               Station Name                        Date 
                          0                           0 
                Day of Week                     Holiday 
                          0                           0 
                      Month                Service Type 
                          0                           0 
                Time Period                        Year 
                          0                           0 
          Avg Daily Entries Avg Daily NonTapped Entries 
                          0                           0 
   Avg Daily Tapped Entries                     Entries 
                          0                           0 
          NonTapped Entries                 Tap Entries 
                          0                           0

summary(metro_ridership)

    Station Name           Date           Day of Week          Holiday      
 Length   :814390   Length   :814390   Length   :814390   Length   :814390  
 N.unique :    98   N.unique :   355   N.unique :     7   N.unique :     1  
 N.blank  :     0   N.blank  :     0   N.blank  :     0   N.blank  :     0  
 Min.nchar:     6   Min.nchar:    20   Min.nchar:     3   Min.nchar:     2  
 Max.nchar:    41   Max.nchar:    22   Max.nchar:     3   Max.nchar:     2  
                                                                            
       Month           Service Type       Time Period          Year     
 Length   :814390   Length   :814390   Length   :814390   Min.   :2025  
 N.unique :    12   N.unique :     3   N.unique :     5   1st Qu.:2025  
 N.blank  :     0   N.blank  :     0   N.blank  :     0   Median :2025  
 Min.nchar:     3   Min.nchar:     6   Min.nchar:    17   Mean   :2025  
 Max.nchar:     9   Max.nchar:     8   Max.nchar:    23   3rd Qu.:2026  
                                                          Max.   :2026  
 Avg Daily Entries  Avg Daily NonTapped Entries Avg Daily Tapped Entries
 Length   :814390   Min.   :0.000e+00           Min.   : 0.000          
 N.unique :     1   1st Qu.:0.000e+00           1st Qu.: 0.000          
 N.blank  :     0   Median :0.000e+00           Median : 0.000          
 Min.nchar:     4   Mean   :4.249e-04           Mean   : 0.413          
 Max.nchar:     4   3rd Qu.:0.000e+00           3rd Qu.: 1.000          
                    Max.   :1.100e+01           Max.   :29.000          
    Entries        NonTapped Entries   Tap Entries     
 Min.   :    0.0   Min.   :   0.000   Min.   :    0.0  
 1st Qu.:   25.0   1st Qu.:   1.000   1st Qu.:   24.0  
 Median :   85.0   Median :   4.000   Median :   81.0  
 Mean   :  181.9   Mean   :   8.354   Mean   :  173.6  
 3rd Qu.:  209.0   3rd Qu.:   9.000   3rd Qu.:  199.0  
 Max.   :10492.0   Max.   :4031.000   Max.   :10361.0

by_station <- metro_ridership %>% 
  group_by(`Station Name`) %>%  #group by station
  summarise(count = n(), #counts total ridership at each station
            avg_tap = mean(`Tap Entries`),
            avg_nontap = mean(`NonTapped Entries`))
by_station

# A tibble: 98 × 4
   `Station Name`     count avg_tap avg_nontap
   <chr>              <int>   <dbl>      <dbl>
 1 Addison Road        8389    54.7       4.42
 2 Anacostia           8322   180.       13.7 
 3 Archives            8363   206.        8.56
 4 Arlington Cemetery  7226    43.0       1.12
 5 Ashburn             8357    71.3       2.86
 6 Ballston-MU         8255   252.        8.69
 7 Benning Road        8153    74         3.83
 8 Bethesda            8636   211.        9.94
 9 Braddock Road       8358   103.        3.75
10 Branch Ave          7570   124.        8.05
# ℹ 88 more rows

#Let's look at just Union Station ridership and order by date
library(dplyr)
union_station <- metro_ridership |>
  filter(`Station Name` == "Union Station")
union_station

# A tibble: 8,523 × 14
   `Station Name` Date  `Day of Week` Holiday Month `Service Type` `Time Period`
   <chr>          <chr> <chr>         <chr>   <chr> <chr>          <chr>        
 1 Union Station  6/7/… Sat           No      June  Saturday       AM Peak (Ope…
 2 Union Station  8/16… Sat           No      Augu… Saturday       AM Peak (Ope…
 3 Union Station  6/28… Sat           No      June  Saturday       AM Peak (Ope…
 4 Union Station  3/7/… Sat           No      March Saturday       AM Peak (Ope…
 5 Union Station  2/7/… Sat           No      Febr… Saturday       AM Peak (Ope…
 6 Union Station  11/2… Sat           No      Nove… Saturday       AM Peak (Ope…
 7 Union Station  3/14… Sat           No      March Saturday       AM Peak (Ope…
 8 Union Station  7/5/… Sat           No      July  Saturday       AM Peak (Ope…
 9 Union Station  8/9/… Sat           No      Augu… Saturday       AM Peak (Ope…
10 Union Station  6/28… Sat           No      June  Saturday       AM Peak (Ope…
# ℹ 8,513 more rows
# ℹ 7 more variables: Year <dbl>, `Avg Daily Entries` <chr>,
#   `Avg Daily NonTapped Entries` <dbl>, `Avg Daily Tapped Entries` <dbl>,
#   Entries <dbl>, `NonTapped Entries` <dbl>, `Tap Entries` <dbl>

union_station <- union_station %>% 
  mutate(Date = str_remove(Date, "12:00:00 AM")) #clean up the Date column to remove

#convert Date column from chr to date
union_station$Date <- as.Date(union_station$Date, format = "%m/%d/%Y")
view(union_station)

#arrange the data chronologically
union_station_sorted <- union_station %>% 
  arrange(Date)
union_station_sorted

# A tibble: 8,523 × 14
   `Station Name` Date       `Day of Week` Holiday Month `Service Type`
   <chr>          <date>     <chr>         <chr>   <chr> <chr>         
 1 Union Station  2025-06-01 Sun           No      June  Sunday        
 2 Union Station  2025-06-01 Sun           No      June  Sunday        
 3 Union Station  2025-06-01 Sun           No      June  Sunday        
 4 Union Station  2025-06-01 Sun           No      June  Sunday        
 5 Union Station  2025-06-01 Sun           No      June  Sunday        
 6 Union Station  2025-06-01 Sun           No      June  Sunday        
 7 Union Station  2025-06-01 Sun           No      June  Sunday        
 8 Union Station  2025-06-01 Sun           No      June  Sunday        
 9 Union Station  2025-06-01 Sun           No      June  Sunday        
10 Union Station  2025-06-01 Sun           No      June  Sunday        
# ℹ 8,513 more rows
# ℹ 8 more variables: `Time Period` <chr>, Year <dbl>,
#   `Avg Daily Entries` <chr>, `Avg Daily NonTapped Entries` <dbl>,
#   `Avg Daily Tapped Entries` <dbl>, Entries <dbl>, `NonTapped Entries` <dbl>,
#   `Tap Entries` <dbl>

#combine all occurrences in the same day together
combined_union_station <- union_station_sorted %>% 
  distinct(Date, .keep_all = TRUE)
combined_union_station

# A tibble: 355 × 14
   `Station Name` Date       `Day of Week` Holiday Month `Service Type`
   <chr>          <date>     <chr>         <chr>   <chr> <chr>         
 1 Union Station  2025-06-01 Sun           No      June  Sunday        
 2 Union Station  2025-06-02 Mon           No      June  Weekday       
 3 Union Station  2025-06-03 Tue           No      June  Weekday       
 4 Union Station  2025-06-04 Wed           No      June  Weekday       
 5 Union Station  2025-06-05 Thu           No      June  Weekday       
 6 Union Station  2025-06-06 Fri           No      June  Weekday       
 7 Union Station  2025-06-07 Sat           No      June  Saturday      
 8 Union Station  2025-06-08 Sun           No      June  Sunday        
 9 Union Station  2025-06-09 Mon           No      June  Weekday       
10 Union Station  2025-06-10 Tue           No      June  Weekday       
# ℹ 345 more rows
# ℹ 8 more variables: `Time Period` <chr>, Year <dbl>,
#   `Avg Daily Entries` <chr>, `Avg Daily NonTapped Entries` <dbl>,
#   `Avg Daily Tapped Entries` <dbl>, Entries <dbl>, `NonTapped Entries` <dbl>,
#   `Tap Entries` <dbl>

#get rid of two columns
subset_union_station <- combined_union_station %>% select(-c(Holiday, `Service Type`))
subset_union_station

# A tibble: 355 × 12
   `Station Name` Date       `Day of Week` Month `Time Period`          Year
   <chr>          <date>     <chr>         <chr> <chr>                 <dbl>
 1 Union Station  2025-06-01 Sun           June  Midday (9:30am-3pm)    2025
 2 Union Station  2025-06-02 Mon           June  AM Peak (Open-9:30am)  2025
 3 Union Station  2025-06-03 Tue           June  AM Peak (Open-9:30am)  2025
 4 Union Station  2025-06-04 Wed           June  AM Peak (Open-9:30am)  2025
 5 Union Station  2025-06-05 Thu           June  AM Peak (Open-9:30am)  2025
 6 Union Station  2025-06-06 Fri           June  Evening (7pm-12am)     2025
 7 Union Station  2025-06-07 Sat           June  AM Peak (Open-9:30am)  2025
 8 Union Station  2025-06-08 Sun           June  Midday (9:30am-3pm)    2025
 9 Union Station  2025-06-09 Mon           June  AM Peak (Open-9:30am)  2025
10 Union Station  2025-06-10 Tue           June  PM Peak (3pm-7pm)      2025
# ℹ 345 more rows
# ℹ 6 more variables: `Avg Daily Entries` <chr>,
#   `Avg Daily NonTapped Entries` <dbl>, `Avg Daily Tapped Entries` <dbl>,
#   Entries <dbl>, `NonTapped Entries` <dbl>, `Tap Entries` <dbl>

library(ggplot2)
p1 <- subset_union_station |>
  ggplot(aes(x = Month,fill = Month, y = Entries)) +
  geom_bar(stat ="identity", alpha = .5, color="pink")+
  scale_fill_discrete(name ="Month",
                      labels = c("June", "July", "August", "September", "October", "November", "December", "January", "February", "March", "April", "May"))+
labs(x = "Monthly Entries from June 2025 - May 2026",
     y = "Frequency of Entries at Union Station from June 2025 to May 2026",
     caption = "Washington Metropolitan Area Transit Authority (WMATA)")

p1

#This graph is okay but I don't like how the x-axis is smooshed together or how it is labeled alphabetically

subset_union_station$Month <- factor(
  subset_union_station$Month,
  levels = c("June", "July", "August", "September", "October",
             "November", "December", "January", "February",
             "March", "April", "May"))

library(ggplot2)
p1 <- subset_union_station |>
  ggplot(aes(x = Month, fill = Month, y = Entries)) +
  geom_bar(stat ="identity", alpha = .5, color="white")+
  scale_fill_discrete(name ="Month",
                      labels = c("June", "July", "August", "September", "October", "November", "December", "January", "February", "March", "April", "May"))+
  labs(x = "Monthly Entries from June 2025 - May 2026",
       y = "Frequency of Entries at Union Station from June 2025 to May 2026",
       caption = "Washington Metropolitan Area Transit Authority (WMATA)") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 90, hjust = 1),
    panel.grid.major.x = element_blank())
p1

#need to change the colors and theme per project instructions
p2 <- subset_union_station |>
  ggplot(aes(x = Month, fill = Month, y = Entries)) +
  geom_bar(stat ="identity", alpha = .5, color="white")+
  scale_fill_manual(name ="Month",
                      values =c("June"="red" ,"July"="orange", "August"="yellow", "September"="green", "October"="blue", "November"="purple", "December"="pink", "January"="turquoise", "February"="navy", "March"="magenta", "April"="maroon", "May"="tan"))+
  labs(x = "Monthly Entries from June 2025 - May 2026",
       y = "Frequency of Entries at Union Station from June 2025 to May 2026",
       caption = "Washington Metropolitan Area Transit Authority (WMATA)") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 90, hjust = 1),
    panel.grid.major.x = element_blank())
p2

#Now going to look at several yellow line stations
library(dplyr)
yellow_line <- metro_ridership |>
  filter(`Station Name` %in% c( "Greenbelt", "College Park-U of Md" , "Hyattsville Crossing", "West Hyattsville", "Fort Totten", "Georgia Ave-Petworth", "Columbia Heights", "U Street", "Shaw-Howard U", "Mt Vernon Sq", "Gallery Place", "Archives", "L'Enfant Plaza"))

#clean up this data
yellow_line_clean <- yellow_line %>% 
  mutate(Date = str_remove(Date, "12:00:00 AM")) #clean up the Date column to remove
#convert Date column from chr to date
yellow_line_clean$Date <- as.Date(yellow_line$Date, format = "%m/%d/%Y")
view(yellow_line_clean)

#arrange the data chronologically
yellow_line_clean_sorted <- yellow_line_clean %>% 
arrange(Date)
yellow_line_clean_sorted

# A tibble: 109,451 × 14
   `Station Name`       Date       `Day of Week` Holiday Month `Service Type`
   <chr>                <date>     <chr>         <chr>   <chr> <chr>         
 1 Columbia Heights     2025-06-01 Sun           No      June  Sunday        
 2 Fort Totten          2025-06-01 Sun           No      June  Sunday        
 3 Shaw-Howard U        2025-06-01 Sun           No      June  Sunday        
 4 Mt Vernon Sq         2025-06-01 Sun           No      June  Sunday        
 5 Greenbelt            2025-06-01 Sun           No      June  Sunday        
 6 College Park-U of Md 2025-06-01 Sun           No      June  Sunday        
 7 Gallery Place        2025-06-01 Sun           No      June  Sunday        
 8 U Street             2025-06-01 Sun           No      June  Sunday        
 9 Hyattsville Crossing 2025-06-01 Sun           No      June  Sunday        
10 Columbia Heights     2025-06-01 Sun           No      June  Sunday        
# ℹ 109,441 more rows
# ℹ 8 more variables: `Time Period` <chr>, Year <dbl>,
#   `Avg Daily Entries` <chr>, `Avg Daily NonTapped Entries` <dbl>,
#   `Avg Daily Tapped Entries` <dbl>, Entries <dbl>, `NonTapped Entries` <dbl>,
#   `Tap Entries` <dbl>

#treemap
library(treemap)
station <- c("Greenbelt", "College Park-U of Md" , "Hyattsville Crossing", "West Hyattsville", "Fort Totten", "Georgia Ave-Petworth", "Columbia Heights", "U Street", "Shaw-Howard U", "Mt Vernon Sq", "Gallery Place", "Archives", "L'Enfant Plaza")
value <- ("Entries")
data <- data.frame(station, value)

treemap(yellow_line_clean_sorted,
        index = "Station Name",
        vSize = "Entries",
        vColor = "Entries",
        type = "value",
        title = "Yellow Line Station Entries from June 2025 to June 2026",
        palette = "RdYlBu") #Found on here: https://r-graph-gallery.com/38-rcolorbrewers-palettes.html

To clean up the data used for both visualizations, I first viewed the data to have a better understanding of what I was looking at. I used the functions view(), head() and summary() throughout my project. Next, I wanted filter the data into something that would be better for me as a to be able manage. I used filter() extract all instances of the station, Union Station. From there I wanted to order the data chronologically but there were two issues: my “Date” column was a column and it had a time stamp on it. To remove the time stamp, I used the mutate() and str_remove() functions to drop the time stamp. I then converted the “Date” from a character string to a date using as.date().

For my first visualization, I used geom_bar to create a bar graph. I decided to look at the the entries per month for 1 year in Union Station. I tweaked the colors of the bars using scale_fill_manual and re-positioned the labels on the y-axis using a theme to angle the text 90 degrees. I also reordered the months since R automatically computes it alphabetically. To do this, I used the factor() function.

Overall, the Union Station metro station appears to be consistently utilized all year round. There is no month below 15,000 entries. This aligns with Union Station’s status as a commuter hub and solidifies itself as a cornerstone among DC public transportation I was interested to see that October 2025 had the most entries in Union Station, despite there not being any immediate events or holidays that I could immediately think of to cause that spike. Considering Union Station’s connections to commuter trains (MARC, Amtrak, etc), I do think looking at entries filtered by weekends and weekdays could be valuable.

My second visualization is a tree map of the entries at yellow line stations. I followed most of the same steps as my first visualization with regard to cleaning and organizing the data. To make the tree map more interesting, I used the palette “RdYlBu” which I found from the site we looked at in class.

I was interested in looking at one specific line so I chose the yellow line. Both the color and the size of the square indicate the number of entries for the stations. Unsurprisingly, Gallery Place has the highest number of entries with approximately 5 million over the course of one year. Gallery Place is a transfer point to the red line. The station with the second highest number of entries, L’Enfant Plaza also is a transfer point to green, orange, blue, and silver lines.