DATA 607 Project 2

Background

The purpose of this assignment is to tidy and transform data. The dataset of interest describes subway ridership from 2013 to 2018 for Brooklyn, the Bronx, Manhattan and Queens and our task is to:

Create a .csv file in a “wide” structure.
Read this .csv and use tidyr and dplyr to tidy and transform.
Analyze which borough had the largest v smallest drop in subway ridership from 2013 to 2018.
Provide the .rmd file, rpubs link, and descriptions of steps taken.

(1) Create a .csv

The original data set was downloaded from the NYC MTA’s website available here. Citation at bottom of page.

If you were to scroll down in this set, you would observe the total ridership for each borough from 2013 to 2018, this was the data we concerned ourselves with.

A wide format .csv was created per assignment description and uploaded to github.

With the .csv available on github, we shift to reading from this file prior to tidying and transforming the data therein.

(2) Read and tidy

We read the .csv (in its raw form) from github and store corresponding data in a variable named “data” (I know … very creative).

#Get URL, read .csv (in raw form) from github, and put into tabular form
url <- getURL("https://raw.githubusercontent.com/Magnus-PS/CUNY-SPS-DATA-607/Project-2/subway_riders.csv")
data <- read.csv(text = url)
data <- tbl_df(data)

## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

#Show what we're working with:
data

## # A tibble: 5 x 8
##   Borough    X2013      X2014      X2015     X2016     X2017     X2018     X    
##   <chr>      <chr>      <chr>      <chr>     <chr>     <chr>     <chr>     <lgl>
## 1 "Brooklyn" "372,392,~ "382,724,~ "387,548~ "384,540~ "379,372~ "370,231~ NA   
## 2 "Bronx"    "149,111,~ "152,510,~ "152,317~ "153,079~ "149,220~ "139,238~ NA   
## 3 "Manhatta~ "940,342,~ "965,303,~ "970,332~ "966,394~ "955,572~ "936,516~ NA   
## 4 "Queens"   "245,692,~ "250,683,~ "252,328~ "252,780~ "243,113~ "234,085~ NA   
## 5 ""         ""         ""         ""        ""        ""        ""        NA

Once we’ve read from .csv and stored our data, we shift to tidying and transforming the data therein.

##Remove the empty 8th column and the 2nd - 7th columns
data_chosen <- select(data, c(Borough, X2013, X2018))

#Remove the "X" from the year columns
names(data_chosen)[names(data_chosen) == "X2013"] <- "2013"
names(data_chosen)[names(data_chosen) == "X2018"] <- "2018"

##Make observations from variables using gather()
##This is also when we convert from a 'wide' to 'long' format
data_long <- gather(data_chosen, "Year", "Ridership", 2:3, factor_key=TRUE)

#Delete empty row values
data_long <- subset(data_long, data_long$Borough != "",
select=c(Borough, Year, Ridership))

##Sort 'Borough' column alphabetically
data_long <- data_long[order(data_long$Borough),]

data_long

## # A tibble: 8 x 3
##   Borough   Year  Ridership  
##   <chr>     <fct> <chr>      
## 1 Bronx     2013  149,111,695
## 2 Bronx     2018  139,238,932
## 3 Brooklyn  2013  372,392,182
## 4 Brooklyn  2018  370,231,302
## 5 Manhattan 2013  940,342,922
## 6 Manhattan 2018  936,516,980
## 7 Queens    2013  245,692,630
## 8 Queens    2018  234,085,767

qplot(x=Borough, y=Ridership, data=data_long, col=Year, main=" Ridership v Borough 2013 to 2018", xlab="Borough", ylab="Ridership")

Badabing-badaboom. We got ourselves a long dataset.

What started as a less clear 5x8 ‘wide’ table (“data”) is now a tidy 5x3 ‘long’ table (“data_chosen”) with a corresponding plot that shows that every borough had a drop in ridership from 2013 to 2018.

(3) Transform and analyze

What we couldn’t observe in the plot above is which borough had the largest v smallest drop in ridership. This is what we were hoping to analyze and thus, we’ll lean on transformation and mathematics to draw a clearer picture. More specifically, we can use the following equation:

((Ridership (2018) - Ridership (2013)) / (Ridership (2013)) x 100)

Doing so should paint a clearer picture of which borough had the largest v smallest drop in ridership.

#Remove ","s from Ridership column
data_long$Ridership <- gsub(",","",data_long$Ridership)

#Convert Ridership column to numeric
data_long$Ridership <- as.numeric(as.character(data_long$Ridership))

##Compute the percent change in ridership for the Bronx
bx_r <- filter(data_long, Borough == "Bronx")
bx_chg <- as.numeric(round(((bx_r[2,3] - bx_r[1,3]) / bx_r[1,3]) * 100, 2))

##Compute the percent change in ridership for Brooklyn
bk_r <- filter(data_long, Borough == "Brooklyn")
bk_chg <- as.numeric(round(((bk_r[2,3] - bk_r[1,3]) / bk_r[1,3]) * 100, 2))

##Compute the percent change in ridership for Manhattan
m_r <- filter(data_long, Borough == "Manhattan")
m_chg <- as.numeric(round(((m_r[2,3] - m_r[1,3]) / m_r[1,3]) * 100, 2))

##Compute the percent change in ridership for Queens
q_r <- filter(data_long, Borough == "Queens")
q_chg <- as.numeric(round(((q_r[2,3] - q_r[1,3]) / q_r[1,3]) * 100, 2))

#Now add a column to the table for these calculated values
data_long$Chg <- c(0, bx_chg, 0, bk_chg, 0, m_chg, 0, q_chg)
data_long

## # A tibble: 8 x 4
##   Borough   Year  Ridership    Chg
##   <chr>     <fct>     <dbl>  <dbl>
## 1 Bronx     2013  149111695  0    
## 2 Bronx     2018  139238932 -6.62 
## 3 Brooklyn  2013  372392182  0    
## 4 Brooklyn  2018  370231302 -0.580
## 5 Manhattan 2013  940342922  0    
## 6 Manhattan 2018  936516980 -0.41 
## 7 Queens    2013  245692630  0    
## 8 Queens    2018  234085767 -4.72

Now that we’ve performed our essential transformations and calculations, and brought our data into the desired form, the last step is to provide a plot that paints the picture clear as day.

Based on what we’re looking for, a barplot (with displayed values) will do the trick:

#Plot the change in ridership for each borough
rider_chg <- filter(data_long, Year == "2018")
barplot(rider_chg$Chg, beside=TRUE, main = "Change in ridership from 2013 to 2019 v NYC Borough", xlab = "Borough", ylab = "Chg in Ridership (%)", ylim = c(-7, 0), names.arg = c("Bronx", "Brooklyn", "Manhattan", "Queens"), col=c("#F75252", "#F75252", "#F75252", "#F75252", "#F75252"))

text(0.7, -6.2, as.character(bx_chg), col = "white")
text(2, -0.25, as.character(bk_chg), col = "white")
text(3.1, -0.25, as.character(m_chg), col = "white")
text(4.3, -4.3, as.character(q_chg), col = "white")

From this graphic, we observe that all boroughs saw a drop in ridership, 2 boroughs (Brooklyn and Manhattan) saw a slight drop in ridership , 2 boroughs (Queens and the Bronx) saw a significant drop in ridership, the Bronx saw the greatest drop in ridership, and Manhattan saw the smallest drop in ridership.

Regarding subway ridership the picture is clear as day: New Yorkers road the subway less in 2018 than they did in 2013 (especially in the Bronx).

(4) Conclude

Fewer New Yorkers are riding the subway and I would have thought the opposite. It raises a lot of questions:

-are the Bronx and Queens less public transport friendly? -are trains less reliable in the Bronx and Queens? -did more people move to Manhattan and Brooklyn during these years than to the Bronx and Queens? -are more people riding their bikes instead of taking the train? -were less people riding the train because the economy was good, they had spending cash and thus they taxid, Ubered, or bought a car?

and the list could go on …

Coming to this conclusion was paradigm-shifting. I would have assumed the opposite.

With more people moving to the city from 2013 - 2018 and more talk of “living green”, I would have thought more people would use public transport, but that’s not the case. The data tells a different story and raises some pretty interesting questions.

Data citation The Metropolitan Transportation Authority. (2019). Annual Subway Ridership [Data file]. Retrieved from http://web.mta.info/nyct/facts/ridership/ridership_sub_annual.htm

DATA 607 Project 2 - Dataset 3

Magnus Skonberg

2020-10-02

Background

(1) Create a .csv

(2) Read and tidy

(3) Transform and analyze

(4) Conclude