The purpose of this assignment is to tidy and transform data. The dataset of interest describes subway ridership from 2013 to 2018 for Brooklyn, the Bronx, Manhattan and Queens and our task is to:
The original data set was downloaded from the NYC MTA’s website available here. Citation at bottom of page.
If you were to scroll down in this set, you would observe the total ridership for each borough from 2013 to 2018, this was the data we concerned ourselves with.
A wide format .csv was created per assignment description and uploaded to github.
With the .csv available on github, we shift to reading from this file prior to tidying and transforming the data therein.
We read the .csv (in its raw form) from github and store corresponding data in a variable named “data” (I know … very creative).
#Get URL, read .csv (in raw form) from github, and put into tabular form
url <- getURL("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Project-2/subway_riders.csv")
data <- read.csv(text = url)
data <- tbl_df(data)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 5 x 8
## Borough X2013 X2014 X2015 X2016 X2017 X2018 X
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl>
## 1 "Brooklyn" "372,392,~ "382,724,~ "387,548~ "384,540~ "379,372~ "370,231~ NA
## 2 "Bronx" "149,111,~ "152,510,~ "152,317~ "153,079~ "149,220~ "139,238~ NA
## 3 "Manhatta~ "940,342,~ "965,303,~ "970,332~ "966,394~ "955,572~ "936,516~ NA
## 4 "Queens" "245,692,~ "250,683,~ "252,328~ "252,780~ "243,113~ "234,085~ NA
## 5 "" "" "" "" "" "" "" NA
Once we’ve read from .csv and stored our data, we shift to tidying and transforming the data therein.
##Remove the empty 8th column and the 2nd - 7th columns
data_chosen <- select(data, c(Borough, X2013, X2018))
#Remove the "X" from the year columns
names(data_chosen)[names(data_chosen) == "X2013"] <- "2013"
names(data_chosen)[names(data_chosen) == "X2018"] <- "2018"
##Make observations from variables using gather()
##This is also when we convert from a 'wide' to 'long' format
data_long <- gather(data_chosen, "Year", "Ridership", 2:3, factor_key=TRUE)
#Delete empty row values
data_long <- subset(data_long, data_long$Borough != "",
select=c(Borough, Year, Ridership))
##Sort 'Borough' column alphabetically
data_long <- data_long[order(data_long$Borough),]
data_long
## # A tibble: 8 x 3
## Borough Year Ridership
## <chr> <fct> <chr>
## 1 Bronx 2013 149,111,695
## 2 Bronx 2018 139,238,932
## 3 Brooklyn 2013 372,392,182
## 4 Brooklyn 2018 370,231,302
## 5 Manhattan 2013 940,342,922
## 6 Manhattan 2018 936,516,980
## 7 Queens 2013 245,692,630
## 8 Queens 2018 234,085,767
qplot(x=Borough, y=Ridership, data=data_long, col=Year, main=" Ridership v Borough 2013 to 2018", xlab="Borough", ylab="Ridership")
Badabing-badaboom. We got ourselves a long dataset.
What started as a less clear 5x8 ‘wide’ table (“data”) is now a tidy 5x3 ‘long’ table (“data_chosen”) with a corresponding plot that shows that every borough had a drop in ridership from 2013 to 2018.
What we couldn’t observe in the plot above is which borough had the largest v smallest drop in ridership. This is what we were hoping to analyze and thus, we’ll lean on transformation and mathematics to draw a clearer picture. More specifically, we can use the following equation:
((Ridership (2018) - Ridership (2013)) / (Ridership (2013)) x 100)
Doing so should paint a clearer picture of which borough had the largest v smallest drop in ridership.
#Remove ","s from Ridership column
data_long$Ridership <- gsub(",","",data_long$Ridership)
#Convert Ridership column to numeric
data_long$Ridership <- as.numeric(as.character(data_long$Ridership))
##Compute the percent change in ridership for the Bronx
bx_r <- filter(data_long, Borough == "Bronx")
bx_chg <- as.numeric(round(((bx_r[2,3] - bx_r[1,3]) / bx_r[1,3]) * 100, 2))
##Compute the percent change in ridership for Brooklyn
bk_r <- filter(data_long, Borough == "Brooklyn")
bk_chg <- as.numeric(round(((bk_r[2,3] - bk_r[1,3]) / bk_r[1,3]) * 100, 2))
##Compute the percent change in ridership for Manhattan
m_r <- filter(data_long, Borough == "Manhattan")
m_chg <- as.numeric(round(((m_r[2,3] - m_r[1,3]) / m_r[1,3]) * 100, 2))
##Compute the percent change in ridership for Queens
q_r <- filter(data_long, Borough == "Queens")
q_chg <- as.numeric(round(((q_r[2,3] - q_r[1,3]) / q_r[1,3]) * 100, 2))
#Now add a column to the table for these calculated values
data_long$Chg <- c(0, bx_chg, 0, bk_chg, 0, m_chg, 0, q_chg)
data_long
## # A tibble: 8 x 4
## Borough Year Ridership Chg
## <chr> <fct> <dbl> <dbl>
## 1 Bronx 2013 149111695 0
## 2 Bronx 2018 139238932 -6.62
## 3 Brooklyn 2013 372392182 0
## 4 Brooklyn 2018 370231302 -0.580
## 5 Manhattan 2013 940342922 0
## 6 Manhattan 2018 936516980 -0.41
## 7 Queens 2013 245692630 0
## 8 Queens 2018 234085767 -4.72
Now that we’ve performed our essential transformations and calculations, and brought our data into the desired form, the last step is to provide a plot that paints the picture clear as day.
Based on what we’re looking for, a barplot (with displayed values) will do the trick:
#Plot the change in ridership for each borough
rider_chg <- filter(data_long, Year == "2018")
barplot(rider_chg$Chg, beside=TRUE, main = "Change in ridership from 2013 to 2019 v NYC Borough", xlab = "Borough", ylab = "Chg in Ridership (%)", ylim = c(-7, 0), names.arg = c("Bronx", "Brooklyn", "Manhattan", "Queens"), col=c("#F75252", "#F75252", "#F75252", "#F75252", "#F75252"))
text(0.7, -6.2, as.character(bx_chg), col = "white")
text(2, -0.25, as.character(bk_chg), col = "white")
text(3.1, -0.25, as.character(m_chg), col = "white")
text(4.3, -4.3, as.character(q_chg), col = "white")
From this graphic, we observe that all boroughs saw a drop in ridership, 2 boroughs (Brooklyn and Manhattan) saw a slight drop in ridership , 2 boroughs (Queens and the Bronx) saw a significant drop in ridership, the Bronx saw the greatest drop in ridership, and Manhattan saw the smallest drop in ridership.
Regarding subway ridership the picture is clear as day: New Yorkers road the subway less in 2018 than they did in 2013 (especially in the Bronx).
Fewer New Yorkers are riding the subway and I would have thought the opposite. It raises a lot of questions:
-are the Bronx and Queens less public transport friendly? -are trains less reliable in the Bronx and Queens? -did more people move to Manhattan and Brooklyn during these years than to the Bronx and Queens? -are more people riding their bikes instead of taking the train? -were less people riding the train because the economy was good, they had spending cash and thus they taxid, Ubered, or bought a car?
and the list could go on …
Coming to this conclusion was paradigm-shifting. I would have assumed the opposite.
With more people moving to the city from 2013 - 2018 and more talk of “living green”, I would have thought more people would use public transport, but that’s not the case. The data tells a different story and raises some pretty interesting questions.
Data citation The Metropolitan Transportation Authority. (2019). Annual Subway Ridership [Data file]. Retrieved from http://web.mta.info/nyct/facts/ridership/ridership_sub_annual.htm