For this project, I am using a data set from Washington Metropolitan Area Transit Authority (WMATA) to analyze metro ridership within Washington, DC. As somebody who relies on the metro on a weekly basis to commute to work (and a frequent victim of the red line’s delays), I want to look The data set source is: https://www.wmata.com/about/open-data-hub/ridership-data/metrorail-ridership-summary.html. Variables of the data set include Station Name, Date, Month, Year, Day of the Week, Time Period, Holiday Average Daily Tap Entries, Average Daily Nontap Entries and Total Entries.
library(dslabs)library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
metro_ridership <-read_csv("Entries by Tap and noTap_Full Data_data.csv") #import the metro ridership data
Rows: 814390 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Station Name, Date, Day of Week, Holiday, Month, Service Type, Time...
dbl (6): Year, Avg Daily NonTapped Entries, Avg Daily Tapped Entries, Entrie...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
view(metro_ridership) #view the data to make sure it is tidy
The data is currently
colSums(is.na(metro_ridership)) #how many NA values are in each column
Station Name Date
0 0
Day of Week Holiday
0 0
Month Service Type
0 0
Time Period Year
0 0
Avg Daily Entries Avg Daily NonTapped Entries
0 0
Avg Daily Tapped Entries Entries
0 0
NonTapped Entries Tap Entries
0 0
summary(metro_ridership)
Station Name Date Day of Week Holiday
Length :814390 Length :814390 Length :814390 Length :814390
N.unique : 98 N.unique : 355 N.unique : 7 N.unique : 1
N.blank : 0 N.blank : 0 N.blank : 0 N.blank : 0
Min.nchar: 6 Min.nchar: 20 Min.nchar: 3 Min.nchar: 2
Max.nchar: 41 Max.nchar: 22 Max.nchar: 3 Max.nchar: 2
Month Service Type Time Period Year
Length :814390 Length :814390 Length :814390 Min. :2025
N.unique : 12 N.unique : 3 N.unique : 5 1st Qu.:2025
N.blank : 0 N.blank : 0 N.blank : 0 Median :2025
Min.nchar: 3 Min.nchar: 6 Min.nchar: 17 Mean :2025
Max.nchar: 9 Max.nchar: 8 Max.nchar: 23 3rd Qu.:2026
Max. :2026
Avg Daily Entries Avg Daily NonTapped Entries Avg Daily Tapped Entries
Length :814390 Min. :0.000e+00 Min. : 0.000
N.unique : 1 1st Qu.:0.000e+00 1st Qu.: 0.000
N.blank : 0 Median :0.000e+00 Median : 0.000
Min.nchar: 4 Mean :4.249e-04 Mean : 0.413
Max.nchar: 4 3rd Qu.:0.000e+00 3rd Qu.: 1.000
Max. :1.100e+01 Max. :29.000
Entries NonTapped Entries Tap Entries
Min. : 0.0 Min. : 0.000 Min. : 0.0
1st Qu.: 25.0 1st Qu.: 1.000 1st Qu.: 24.0
Median : 85.0 Median : 4.000 Median : 81.0
Mean : 181.9 Mean : 8.354 Mean : 173.6
3rd Qu.: 209.0 3rd Qu.: 9.000 3rd Qu.: 199.0
Max. :10492.0 Max. :4031.000 Max. :10361.0
by_station <- metro_ridership %>%group_by(`Station Name`) %>%#group by stationsummarise(count =n(), #counts total ridership at each stationavg_tap =mean(`Tap Entries`),avg_nontap =mean(`NonTapped Entries`))by_station
#Let's look at just Union Station ridership and order by datelibrary(dplyr)union_station <- metro_ridership |>filter(`Station Name`=="Union Station")union_station
# A tibble: 8,523 × 14
`Station Name` Date `Day of Week` Holiday Month `Service Type` `Time Period`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Union Station 6/7/… Sat No June Saturday AM Peak (Ope…
2 Union Station 8/16… Sat No Augu… Saturday AM Peak (Ope…
3 Union Station 6/28… Sat No June Saturday AM Peak (Ope…
4 Union Station 3/7/… Sat No March Saturday AM Peak (Ope…
5 Union Station 2/7/… Sat No Febr… Saturday AM Peak (Ope…
6 Union Station 11/2… Sat No Nove… Saturday AM Peak (Ope…
7 Union Station 3/14… Sat No March Saturday AM Peak (Ope…
8 Union Station 7/5/… Sat No July Saturday AM Peak (Ope…
9 Union Station 8/9/… Sat No Augu… Saturday AM Peak (Ope…
10 Union Station 6/28… Sat No June Saturday AM Peak (Ope…
# ℹ 8,513 more rows
# ℹ 7 more variables: Year <dbl>, `Avg Daily Entries` <chr>,
# `Avg Daily NonTapped Entries` <dbl>, `Avg Daily Tapped Entries` <dbl>,
# Entries <dbl>, `NonTapped Entries` <dbl>, `Tap Entries` <dbl>
union_station <- union_station %>%mutate(Date =str_remove(Date, "12:00:00 AM")) #clean up the Date column to remove
#convert Date column from chr to dateunion_station$Date <-as.Date(union_station$Date, format ="%m/%d/%Y")view(union_station)
#arrange the data chronologicallyunion_station_sorted <- union_station %>%arrange(Date)union_station_sorted
# A tibble: 8,523 × 14
`Station Name` Date `Day of Week` Holiday Month `Service Type`
<chr> <date> <chr> <chr> <chr> <chr>
1 Union Station 2025-06-01 Sun No June Sunday
2 Union Station 2025-06-01 Sun No June Sunday
3 Union Station 2025-06-01 Sun No June Sunday
4 Union Station 2025-06-01 Sun No June Sunday
5 Union Station 2025-06-01 Sun No June Sunday
6 Union Station 2025-06-01 Sun No June Sunday
7 Union Station 2025-06-01 Sun No June Sunday
8 Union Station 2025-06-01 Sun No June Sunday
9 Union Station 2025-06-01 Sun No June Sunday
10 Union Station 2025-06-01 Sun No June Sunday
# ℹ 8,513 more rows
# ℹ 8 more variables: `Time Period` <chr>, Year <dbl>,
# `Avg Daily Entries` <chr>, `Avg Daily NonTapped Entries` <dbl>,
# `Avg Daily Tapped Entries` <dbl>, Entries <dbl>, `NonTapped Entries` <dbl>,
# `Tap Entries` <dbl>
#combine all occurrences in the same day togethercombined_union_station <- union_station_sorted %>%distinct(Date, .keep_all =TRUE)combined_union_station
# A tibble: 355 × 14
`Station Name` Date `Day of Week` Holiday Month `Service Type`
<chr> <date> <chr> <chr> <chr> <chr>
1 Union Station 2025-06-01 Sun No June Sunday
2 Union Station 2025-06-02 Mon No June Weekday
3 Union Station 2025-06-03 Tue No June Weekday
4 Union Station 2025-06-04 Wed No June Weekday
5 Union Station 2025-06-05 Thu No June Weekday
6 Union Station 2025-06-06 Fri No June Weekday
7 Union Station 2025-06-07 Sat No June Saturday
8 Union Station 2025-06-08 Sun No June Sunday
9 Union Station 2025-06-09 Mon No June Weekday
10 Union Station 2025-06-10 Tue No June Weekday
# ℹ 345 more rows
# ℹ 8 more variables: `Time Period` <chr>, Year <dbl>,
# `Avg Daily Entries` <chr>, `Avg Daily NonTapped Entries` <dbl>,
# `Avg Daily Tapped Entries` <dbl>, Entries <dbl>, `NonTapped Entries` <dbl>,
# `Tap Entries` <dbl>
#get rid of two columnssubset_union_station <- combined_union_station %>%select(-c(Holiday, `Service Type`))subset_union_station
# A tibble: 355 × 12
`Station Name` Date `Day of Week` Month `Time Period` Year
<chr> <date> <chr> <chr> <chr> <dbl>
1 Union Station 2025-06-01 Sun June Midday (9:30am-3pm) 2025
2 Union Station 2025-06-02 Mon June AM Peak (Open-9:30am) 2025
3 Union Station 2025-06-03 Tue June AM Peak (Open-9:30am) 2025
4 Union Station 2025-06-04 Wed June AM Peak (Open-9:30am) 2025
5 Union Station 2025-06-05 Thu June AM Peak (Open-9:30am) 2025
6 Union Station 2025-06-06 Fri June Evening (7pm-12am) 2025
7 Union Station 2025-06-07 Sat June AM Peak (Open-9:30am) 2025
8 Union Station 2025-06-08 Sun June Midday (9:30am-3pm) 2025
9 Union Station 2025-06-09 Mon June AM Peak (Open-9:30am) 2025
10 Union Station 2025-06-10 Tue June PM Peak (3pm-7pm) 2025
# ℹ 345 more rows
# ℹ 6 more variables: `Avg Daily Entries` <chr>,
# `Avg Daily NonTapped Entries` <dbl>, `Avg Daily Tapped Entries` <dbl>,
# Entries <dbl>, `NonTapped Entries` <dbl>, `Tap Entries` <dbl>
library(ggplot2)p1 <- subset_union_station |>ggplot(aes(x = Month,fill = Month, y = Entries)) +geom_bar(stat ="identity", alpha = .5, color="pink")+scale_fill_discrete(name ="Month",labels =c("June", "July", "August", "September", "October", "November", "December", "January", "February", "March", "April", "May"))+labs(x ="Monthly Entries from June 2025 - May 2026",y ="Frequency of Entries at Union Station from June 2025 to May 2026",caption ="Washington Metropolitan Area Transit Authority (WMATA)")p1
#This graph is okay but I don't like how the x-axis is smooshed together or how it is labeled alphabetically
library(ggplot2)p1 <- subset_union_station |>ggplot(aes(x = Month, fill = Month, y = Entries)) +geom_bar(stat ="identity", alpha = .5, color="white")+scale_fill_discrete(name ="Month",labels =c("June", "July", "August", "September", "October", "November", "December", "January", "February", "March", "April", "May"))+labs(x ="Monthly Entries from June 2025 - May 2026",y ="Frequency of Entries at Union Station from June 2025 to May 2026",caption ="Washington Metropolitan Area Transit Authority (WMATA)") +theme_minimal(base_size =14) +theme(plot.title =element_text(face ="bold"),axis.text.x =element_text(angle =90, hjust =1),panel.grid.major.x =element_blank())p1
#need to change the colors and theme per project instructionsp2 <- subset_union_station |>ggplot(aes(x = Month, fill = Month, y = Entries)) +geom_bar(stat ="identity", alpha = .5, color="white")+scale_fill_manual(name ="Month",values =c("June"="red" ,"July"="orange", "August"="yellow", "September"="green", "October"="blue", "November"="purple", "December"="pink", "January"="turquoise", "February"="navy", "March"="magenta", "April"="maroon", "May"="tan"))+labs(x ="Monthly Entries from June 2025 - May 2026",y ="Frequency of Entries at Union Station from June 2025 to May 2026",caption ="Washington Metropolitan Area Transit Authority (WMATA)") +theme_minimal(base_size =14) +theme(plot.title =element_text(face ="bold"),axis.text.x =element_text(angle =90, hjust =1),panel.grid.major.x =element_blank())p2
#Now going to look at several yellow line stationslibrary(dplyr)yellow_line <- metro_ridership |>filter(`Station Name`%in%c( "Greenbelt", "College Park-U of Md" , "Hyattsville Crossing", "West Hyattsville", "Fort Totten", "Georgia Ave-Petworth", "Columbia Heights", "U Street", "Shaw-Howard U", "Mt Vernon Sq", "Gallery Place", "Archives", "L'Enfant Plaza"))
#clean up this datayellow_line_clean <- yellow_line %>%mutate(Date =str_remove(Date, "12:00:00 AM")) #clean up the Date column to remove#convert Date column from chr to dateyellow_line_clean$Date <-as.Date(yellow_line$Date, format ="%m/%d/%Y")view(yellow_line_clean)
#arrange the data chronologicallyyellow_line_clean_sorted <- yellow_line_clean %>%arrange(Date)yellow_line_clean_sorted
# A tibble: 109,451 × 14
`Station Name` Date `Day of Week` Holiday Month `Service Type`
<chr> <date> <chr> <chr> <chr> <chr>
1 Columbia Heights 2025-06-01 Sun No June Sunday
2 Fort Totten 2025-06-01 Sun No June Sunday
3 Shaw-Howard U 2025-06-01 Sun No June Sunday
4 Mt Vernon Sq 2025-06-01 Sun No June Sunday
5 Greenbelt 2025-06-01 Sun No June Sunday
6 College Park-U of Md 2025-06-01 Sun No June Sunday
7 Gallery Place 2025-06-01 Sun No June Sunday
8 U Street 2025-06-01 Sun No June Sunday
9 Hyattsville Crossing 2025-06-01 Sun No June Sunday
10 Columbia Heights 2025-06-01 Sun No June Sunday
# ℹ 109,441 more rows
# ℹ 8 more variables: `Time Period` <chr>, Year <dbl>,
# `Avg Daily Entries` <chr>, `Avg Daily NonTapped Entries` <dbl>,
# `Avg Daily Tapped Entries` <dbl>, Entries <dbl>, `NonTapped Entries` <dbl>,
# `Tap Entries` <dbl>
#treemaplibrary(treemap)station <-c("Greenbelt", "College Park-U of Md" , "Hyattsville Crossing", "West Hyattsville", "Fort Totten", "Georgia Ave-Petworth", "Columbia Heights", "U Street", "Shaw-Howard U", "Mt Vernon Sq", "Gallery Place", "Archives", "L'Enfant Plaza")value <- ("Entries")data <-data.frame(station, value)treemap(yellow_line_clean_sorted,index ="Station Name",vSize ="Entries",vColor ="Entries",type ="value",title ="Yellow Line Station Entries from June 2025 to June 2026",palette ="RdYlBu") #Found on here: https://r-graph-gallery.com/38-rcolorbrewers-palettes.html
To clean up the data used for both visualizations, I first viewed the data to have a better understanding of what I was looking at. I used the functions view(), head() and summary() throughout my project. Next, I wanted filter the data into something that would be better for me as a to be able manage. I used filter() extract all instances of the station, Union Station. From there I wanted to order the data chronologically but there were two issues: my “Date” column was a column and it had a time stamp on it. To remove the time stamp, I used the mutate() and str_remove() functions to drop the time stamp. I then converted the “Date” from a character string to a date using as.date().
For my first visualization, I used geom_bar to create a bar graph. I decided to look at the the entries per month for 1 year in Union Station. I tweaked the colors of the bars using scale_fill_manual and re-positioned the labels on the y-axis using a theme to angle the text 90 degrees. I also reordered the months since R automatically computes it alphabetically. To do this, I used the factor() function.
Overall, the Union Station metro station appears to be consistently utilized all year round. There is no month below 15,000 entries. This aligns with Union Station’s status as a commuter hub and solidifies itself as a cornerstone among DC public transportation I was interested to see that October 2025 had the most entries in Union Station, despite there not being any immediate events or holidays that I could immediately think of to cause that spike. Considering Union Station’s connections to commuter trains (MARC, Amtrak, etc), I do think looking at entries filtered by weekends and weekdays could be valuable.
My second visualization is a tree map of the entries at yellow line stations. I followed most of the same steps as my first visualization with regard to cleaning and organizing the data. To make the tree map more interesting, I used the palette “RdYlBu” which I found from the site we looked at in class.
I was interested in looking at one specific line so I chose the yellow line. Both the color and the size of the square indicate the number of entries for the stations. Unsurprisingly, Gallery Place has the highest number of entries with approximately 5 million over the course of one year. Gallery Place is a transfer point to the red line. The station with the second highest number of entries, L’Enfant Plaza also is a transfer point to green, orange, blue, and silver lines.