Ask a New Yorker how their commute has been going, and they will often complain that the subway seems to be getting worse. Media reports back up this impression. For example, this Gothamist article states that monthly delays more than doubled from 2012 to 2017 (http://gothamist.com/2017/02/13/subway_delays_mta_cuomo.php).
At the same time, we all need to get around. Even with fare increases, a $2.75 subway ride is a lot cheaper than taking a cab. The MTA reports that average weekday subway ridership increased steadily every year from 2012 to 2016 (http://web.mta.info/nyct/facts/ridership/#atGlance_s). Average weekend subway ridership either increased or remained relatively constant.
This report will explore the details of changes in ridership and the frequency of performance issues over time.
It will also explore the tone of media reports about the MTA in recent months.
One major question to be answered is how changes in ridership relate to quality of service. In other words, do New Yorkers exhibit some elasticity of demand in response to delays, or is demand mostly inelastic?
Another goal is to quantify the tone of media reports about the MTA and compare it to the performance data, to see whether the tone matches the numbers.
Unfortunately, the Google News API limits keyword searches to the past three months. So we cannot examine how the tone of media reports about the MTA may have changed over time alongside changes in performance.
One could use web scraping to extract media reports across a broader time span, but that is beyond the scope of this report. However, this represents an interesting possibility for future research.
Data sources for this report will include:
- Weekly MTA fare (ridership) data by station
- MTA subway performance data (on-time performance and wait assessment by line)
- Recent news articles about the MTA, retrieved via the Google News API
Since the weekly ridership data is by individual station rather than subway line, this report will also source complementary data from the MTA to match up subway lines to each station.
By matching up by subway line, one can correlate delays and ridership data in a more specific way than just looking at general trends across the whole system.
As for the news data, the hypothesis here is that media reports about the MTA in recent times will be mostly negative.
Libraries used for this analysis include:
library(stringr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(jsonlite)
library(lubridate)
library(RCurl)
library(XML)
library(indicoio)
The CSV file used in this report is available here:
The CSV is a wide dataset, where each row represents one week at a particular station. The columns give the counts of the different fare types, including full fare (a regular ride), weekly or monthly fares (someone swiping a weekly or monthly Metrocard), senior or disabled reduced-price fares, etc.
Here, we will start from a version of this dataset that is already somewhat tidied. One can find the details of the tidying here:
https://rpubs.com/hmgeiger/368699
Tidying included converting from wide to long format, so that each row would represent one week at a particular station for one particular fare type.
We also ran some simple cleaning, like collapsing cases where the same station was listed under two different names (“PROSPECT AVE-4TH AVE” and “PROSPECT AVENUE-4TH AVENUE”).
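For reference, the core wide-to-long step looked roughly like the sketch below. This is illustrative only and is not run here; mta_wide stands for the raw wide data frame, and the fare-type columns are hypothetical stand-ins for the many fare columns in the real CSV.
#Sketch: collapse all fare-type columns into a Fare.Type/Num.fares pair,
#keeping the identifier columns as-is.
mta_long <- mta_wide %>% gather(key="Fare.Type",value="Num.fares",-From.Date,-Remote.Station.ID,-Station)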
The tidied data set is available as an R object (called “mta”) in Github here:
https://github.com/heathergeiger/Data607_project2/blob/master/mta_by_station_tidied.Rdata
After downloading the R object (done outside of R), we are just going to do a bit of additional tidying.
For this analysis, we are mainly interested in fares as a proxy for ridership, and do not care so much about the types of fares.
So we are going to simply sum up the Num.fares column for each combination of date and station.
This should reduce the number of rows substantially, making the data a bit easier to work with.
load("mta_by_station_tidied.Rdata")
total_weekly_ridership <- mta %>% group_by(From.Date,Remote.Station.ID,Station) %>% summarize(Total.ridership = sum(Num.fares))
total_weekly_ridership <- data.frame(total_weekly_ridership,stringsAsFactors=FALSE)
Now, let’s match up the station names to what subway lines run there.
First, let’s download an Excel file with the station names and train lines per station.
download.file("http://web.mta.info/developers/resources/nyct/turnstile/Remote-Booth-Station.xls",destfile="Remote-Booth-Station.xls")
Outside of R, open Excel and save the file as a CSV called “Remote-Booth-Station.csv”. Then, we can use the “remote station ID” in both data sets to integrate them.
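Alternatively, one could skip the manual Excel step by reading the .xls directly in R. Here is a minimal sketch, assuming the readxl package is installed and the sheet layout matches the CSV export (note that read_excel preserves spaces in column names, such as “Line Name”, so we run make.names to match what read.csv would produce):
library(readxl)
stations <- as.data.frame(read_excel("Remote-Booth-Station.xls"))
#Convert column names like "Line Name" to "Line.Name".
colnames(stations) <- make.names(colnames(stations))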
stations <- read.csv("Remote-Booth-Station.csv",header=TRUE,stringsAsFactors=FALSE)
total_weekly_ridership_remote_ids <- unique(total_weekly_ridership$Remote.Station.ID)
#A few of the remote station IDs are actually two or three IDs collapsed with a slash ("/") in between. Take the first occurrence of the station ID in these cases.
#Note no unique() here: the result must stay aligned one-to-one with the "from" vector passed to mapvalues below.
total_weekly_ridership_remote_ids <- unlist(lapply(strsplit(total_weekly_ridership_remote_ids,"/"),"[[",1))
total_weekly_ridership$Remote.Station.ID <- plyr::mapvalues(total_weekly_ridership$Remote.Station.ID,
from=unique(total_weekly_ridership$Remote.Station.ID),
to=total_weekly_ridership_remote_ids)
#The only stations not in "stations" are very new ones, like the 2nd Avenue subway stations. Add these to the data frame, after filtering down to just the columns of interest.
stations <- stations %>% select(c("Remote","Station","Line.Name"))
stations <- rbind(stations,data.frame(Remote = c("R072","R570","R571","R572"),
Station = c("34TH STREET - HUDSON YARDS","72ND STREET - 2 AVENUE","86TH STREET - 2 AVENUE","96TH STREET - 2 AVENUE"),
Line.Name = c("7","Q","Q","Q"),
stringsAsFactors=FALSE))
#Merge the two data frames by remote station ID.
colnames(stations)[1] <- "Remote.Station.ID"
total_weekly_ridership <- merge(total_weekly_ridership,stations[,c("Remote.Station.ID","Line.Name")],by="Remote.Station.ID",all=TRUE)
Let’s further pare down the data, summarizing weekly ridership per station as a mean per year instead of including information for every week.
mean_weekly_ridership_per_year <- total_weekly_ridership %>% group_by(year(From.Date),Remote.Station.ID,Station,Line.Name) %>% summarize(Mean.weekly.ridership = mean(Total.ridership,na.rm=TRUE))
mean_weekly_ridership_per_year <- data.frame(mean_weekly_ridership_per_year,stringsAsFactors=FALSE)
colnames(mean_weekly_ridership_per_year)[1] <- "Year"
mean_weekly_ridership_per_year <- data.frame(mean_weekly_ridership_per_year,Num.lines = str_length(mean_weekly_ridership_per_year$Line.Name),stringsAsFactors=FALSE)
You will notice that the ridership data is by station, and does not distinguish lines within the same station.
In most stations, one can swipe in at a turnstile closer to one train, and then walk through the station to take a different train instead. So these cannot be disentangled just by turnstile swipes.
In some cases (like if similar lines like the A/C/E are the only ones sharing a station), it seems reasonable to allow a bit of double-counting and count the station’s riders for each line.
However, we probably don’t want to assume that all riders at Times Square are taking the R train, for example.
Looking at the lines per station, three trains per station seems like a reasonable cutoff where we can count riders at the station separately for each line.
Stations with 4+ train lines will instead be designated as a separate category called “Hub”.
#Make a longer version of mean_weekly_ridership_per_year, repeating each row once for each train line that serves the station, for stations with 3 or fewer lines.
#Store the individual line in column "Line".
mean_weekly_ridership_per_year_long <- mean_weekly_ridership_per_year
mean_weekly_ridership_per_year_long <- mean_weekly_ridership_per_year_long[mean_weekly_ridership_per_year_long$Num.lines <= 3,]
mean_weekly_ridership_per_year_long <- data.frame(Year = rep(mean_weekly_ridership_per_year_long$Year,times=mean_weekly_ridership_per_year_long$Num.lines),
Remote.Station.ID = rep(mean_weekly_ridership_per_year_long$Remote.Station.ID,times=mean_weekly_ridership_per_year_long$Num.lines),
Station = rep(mean_weekly_ridership_per_year_long$Station,times=mean_weekly_ridership_per_year_long$Num.lines),
Line.Name = rep(mean_weekly_ridership_per_year_long$Line.Name,times=mean_weekly_ridership_per_year_long$Num.lines),
Mean.weekly.ridership = rep(mean_weekly_ridership_per_year_long$Mean.weekly.ridership,times=mean_weekly_ridership_per_year_long$Num.lines),
Num.lines = rep(mean_weekly_ridership_per_year_long$Num.lines,times=mean_weekly_ridership_per_year_long$Num.lines),
Line = unlist(strsplit(mean_weekly_ridership_per_year_long$Line.Name,"")),
stringsAsFactors=FALSE)
mean_weekly_ridership_per_year_long <- rbind(mean_weekly_ridership_per_year_long,
data.frame(mean_weekly_ridership_per_year[mean_weekly_ridership_per_year$Num.lines > 3,],
Line = "Hub",
stringsAsFactors=FALSE))
Performance data may be downloaded here:
http://web.mta.info/persdashboard/perxml/MTA_Performance_Datall.zip
After unzipping, the data will be at MTA_Performance_Datall/Performance_NYCT.csv inside the folder where you downloaded the zip.
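The one-time download and extraction can also be done in base R (a sketch, assuming paths relative to the working directory):
#Run once: download and extract the performance data.
download.file("http://web.mta.info/persdashboard/perxml/MTA_Performance_Datall.zip",destfile="MTA_Performance_Datall.zip")
unzip("MTA_Performance_Datall.zip")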
This data is actually already in long format. There is one row per month per metric for the metrics we are interested in.
We are interested in metrics “Subway Wait Assessment” and “OTP (Terminal)”.
Definition for “Subway Wait Assessment” (source http://web.mta.info/developers/Performance_Indicators_by%20Agency.xls):
The percent of actual intervals between trains that are no more than the scheduled interval plus 2 minutes during peak hours (6 AM - 9 AM and 4 PM - 7 PM) and plus 4 minutes during off-peak hours (9 AM - 4 PM and 7 PM - midnight). Wait assessment is measured weekdays between 6:00 AM and midnight when service is relatively frequent. The data is based on a sample methodology. The monthly sample size is not statistically significant and, therefore, the agency only reports a 12-month rolling average in the YTD column.
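To make that concrete: for a train scheduled 5 minutes after the previous one, the actual gap counts as on time if it is at most 7 minutes during peak hours, or at most 9 minutes off-peak.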
Definition of “OTP (Terminal)”:
The percent of trains making all the scheduled station stops arriving at the destination terminal on-time, early or no more than five minutes late. Trains re-routed, abandoned or operating under temporary schedule changes are all counted as late. This is a new indicator that New York City Transit began reporting in August 2008.
For the integrated analysis, we will focus on individual train lines, and we will also remove shuttle lines that only go a few stops (like the 42nd Street Shuttle).
Once we do this, we may match up the ridership data with the performance data by train line. We have performance data matched up one-to-one for nearly all train lines in our ridership data set. The only special case is performance for the J/Z line, which is collapsed into one set of metrics.
We will also use the mean of the monthly performance per year for both of our metrics of interest, so that the performance data is on the same time scale as the ridership data.
#Read in performance data.
performance <- read.csv("MTA_Performance_Datall/Performance_NYCT.csv",header=TRUE,stringsAsFactors=FALSE)
#Remove hubs and shuttles from mean_weekly_ridership_per_year_long, then use the line names to get the indicator names to use from performance data.
mean_weekly_ridership_per_year_minus_hubs_and_shuttles <- mean_weekly_ridership_per_year_long[mean_weekly_ridership_per_year_long$Line != "S" & mean_weekly_ridership_per_year_long$Line != "Hub",]
mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_names <- unique(mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Line)
mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_OTP_indicator_names <- ifelse(mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_names == "J" | mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_names == "Z",
"OTP (Terminal) - J Z Line",
paste0("OTP (Terminal) - ",mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_names," Line"))
mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_wait_indicator_names <- ifelse(mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_names == "J" | mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_names == "Z",
"Subway Wait Assessment - J Z Line",
paste0("Subway Wait Assessment - ",mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_names," Line"))
#Get mean performance data per line per year.
OTP_per_line_per_year <- performance[performance$INDICATOR_NAME %in% mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_OTP_indicator_names,] %>%
select(c("INDICATOR_NAME","PERIOD_YEAR","MONTHLY_ACTUAL")) %>%
group_by(INDICATOR_NAME,PERIOD_YEAR) %>%
summarize(Mean.monthly.actual.performance = mean(MONTHLY_ACTUAL,na.rm=TRUE))
OTP_per_line_per_year <- data.frame(OTP_per_line_per_year,stringsAsFactors=FALSE)
wait_per_line_per_year <- performance[performance$INDICATOR_NAME %in% mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_wait_indicator_names,] %>%
select(c("INDICATOR_NAME","PERIOD_YEAR","MONTHLY_ACTUAL")) %>%
group_by(INDICATOR_NAME,PERIOD_YEAR) %>%
summarize(Mean.monthly.actual.performance = mean(MONTHLY_ACTUAL,na.rm=TRUE))
wait_per_line_per_year <- data.frame(wait_per_line_per_year,stringsAsFactors=FALSE)
#Add line names to performance data.
OTP_indicators_per_line <- data.frame(Line = mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_names,
INDICATOR_NAME = mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_OTP_indicator_names,
stringsAsFactors=FALSE)
wait_indicators_per_line <- data.frame(Line = mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_names,
INDICATOR_NAME = mean_weekly_ridership_per_year_minus_hubs_and_shuttles_line_wait_indicator_names,
stringsAsFactors=FALSE)
OTP_per_line_per_year <- merge(OTP_per_line_per_year,OTP_indicators_per_line,by="INDICATOR_NAME",all=TRUE)
wait_per_line_per_year <- merge(wait_per_line_per_year,wait_indicators_per_line,by="INDICATOR_NAME",all=TRUE)
#Combine OTP_per_line_per_year and wait_per_line_per_year into one data frame with performance data as two separate columns "OTP" and "Wait".
colnames(OTP_per_line_per_year)[3] <- "OTP"
colnames(wait_per_line_per_year)[3] <- "Wait"
OTP_and_wait_per_line_per_year <- merge(OTP_per_line_per_year[,2:4],wait_per_line_per_year[,2:4],by=c("PERIOD_YEAR","Line"),all=TRUE)
colnames(OTP_and_wait_per_line_per_year)[1] <- "Year"
#Let's remove 2009 from OTP_and_wait_per_line_per_year, as we do not have ridership data for this year.
#Then, combine with mean_weekly_ridership_per_year_minus_hubs_and_shuttles.
OTP_and_wait_per_line_per_year <- OTP_and_wait_per_line_per_year[OTP_and_wait_per_line_per_year$Year != 2009,]
mean_weekly_ridership_per_year_minus_hubs_and_shuttles <- merge(mean_weekly_ridership_per_year_minus_hubs_and_shuttles,OTP_and_wait_per_line_per_year,by=c("Year","Line"),all=TRUE)
#We are now ready to proceed with integrated analysis using mean_weekly_ridership_per_year_minus_hubs_and_shuttles.
First, read in my API key, which I keep in a different folder from this script on my computer.
If running this script yourself, instead assign api_key to your own API key as a string.
After this, we are ready to query the API.
I initially tried to query the API without supplying a list of sites to the “domains” parameter. But even with keywords beyond “MTA” (like adding “transit” or “subway”), this returned a lot of results that were not of interest.
So we limit to a set of websites known for their New York City news coverage.
However, we choose a broad range of sources to minimize bias caused by focusing too much on one source or type of source.
First, we include classic national news sources based in New York (New York Times and Wall Street Journal). We also include local newspapers (e.g. NY Post, NY Daily News, AM NY), as well as the local affiliates of national news stations (NBC, ABC). Finally, we include online-only publications that focus on New York (e.g. Curbed, Gothamist).
For the date range, we limit the search space to span from February 4, 2018 to May 4, 2018. We will run this query only once, then save the object so that we can access the results in the future without needing to query the API again.
api_key <- readLines("/Users/hmgeiger/Documents/google_news_api_key.txt")
my_date_start <- "2018-02-04"
my_date_end <- "2018-05-04"
domains_to_search <- c("gothamist.com","nypost.com","nydailynews.com","curbed.com","timeout.com","wnyc.org","observer.com","wsj.com","amny.com","nytimes.com","nbcnewyork.com","abc7ny.com","villagevoice.com")
query_string <- paste0("https://newsapi.org/v2/everything?q=MTA&from=",my_date_start,"&to=",my_date_end,"&sortBy=relevancy&pageSize=100&apiKey=",api_key)
query_string <- paste0(query_string,"&domains=",paste0(domains_to_search,collapse=","))
initialQuery <- fromJSON(query_string)
num_results <- initialQuery$totalResults
#Each page returns up to 100 results, so round up to get the number of pages.
maxPages <- ceiling(num_results/100)
search_results_all_pages <- vector("list",length=maxPages)
for(i in 1:maxPages)
{
this_page <- fromJSON(paste0(query_string,"&page=",i))
search_results_all_pages[[i]] <- this_page$articles
Sys.sleep(5)
}
search_results_all_pages <- rbind_pages(search_results_all_pages)
save(search_results_all_pages,file="google_news_api_results_mta_02.04.18_to_05.04.18.Rdata")
Looking through the API results, it was apparent that even further filtering of sources would be necessary.
For example, the API returned a number of results from The Wall Street Journal, but I realized after including this source that the articles I wanted were mostly behind a paywall.
Some sources were also removed if the article text was not readily available by web scraping, or if there were very few articles found from that source.
After filtering by source, I manually went to an example article from each source and determined the HTML tags and class needed to extract article text.
Then for each URL, I downloaded the article HTML, then used the HTML tags and class for the appropriate source to extract article text.
The scraping of article URLs, like the API call, should be run only once and results saved to minimize the load on the source web pages.
mta_news_article_info <- data.frame(source = search_results_all_pages$source$name,search_results_all_pages[,setdiff(colnames(search_results_all_pages),"source")],stringsAsFactors=FALSE)
frequency_per_source <- data.frame(table(mta_news_article_info$source))
frequency_per_source$Var1 <- as.vector(frequency_per_source$Var1)
#Let's choose sources that have at least 5 articles, and also exclude "The Wall Street Journal" and "Wnyc.org".
sources_to_choose <- setdiff(frequency_per_source$Var1[frequency_per_source$Freq >= 5],c("Wnyc.org","The Wall Street Journal"))
#Get HTML tag and class per source.
HTML_tags_per_source <- data.frame(source = sources_to_choose,
tag = c("div","div","div","script","div","div","div"),
name=c("class","class","class","data-schema","class","class","class"),
subtype = c("body-text","c-entry-content","entry-body","NewsArticle","entry-content entry-content-read-more","entry-content","post"),
stringsAsFactors=FALSE)
HTML_tags_per_source <- data.frame(HTML_tags_per_source,
xpath.text = paste0("//",HTML_tags_per_source$tag,"[@",HTML_tags_per_source$name," = '",HTML_tags_per_source$subtype,"']"),
stringsAsFactors=FALSE)
#Filter mta_news_article_info by source, and also remove the San Francisco version of Curbed.
mta_news_article_info <- mta_news_article_info[mta_news_article_info$source %in% sources_to_choose,]
mta_news_article_info <- mta_news_article_info[grep('sf.curbed',mta_news_article_info$url,invert=TRUE),]
#Merge with Xpath text per source.
mta_news_article_info <- merge(mta_news_article_info,HTML_tags_per_source[,c("source","xpath.text")],by="source",all=TRUE)
curl <- getCurlHandle()
curlSetOpt(cookiejar="cookies.txt", followlocation = TRUE, curl=curl)
plain_text_all_articles <- vector("list",length=nrow(mta_news_article_info))
for(i in 1:nrow(mta_news_article_info))
{
myurl <- mta_news_article_info$url[i]
my_article <- getURL(myurl,curl = curl, verbose = FALSE)
article_html_parsed <- htmlTreeParse(my_article,asText=TRUE, useInternalNodes = TRUE, encoding = 'UTF-8')
plain_text <- xpathSApply(xmlRoot(article_html_parsed),mta_news_article_info$xpath.text[i],xmlValue)
plain_text_all_articles[[i]] <- plain_text
}
save(plain_text_all_articles,file="mta_plain_text_all_articles_using_initialized_list.Rdata")
One of the articles we tried does not give any text using this method, and displays a “page not found” error in a browser.
We should remove this article from the API results data frame as well as from our vector of scraped text.
We also run some additional manual cleaning using string processing for certain articles that still contain extraneous text, or that contain special characters that we’d like to convert to normal characters.
#Count how many text chunks were extracted per article; a successful scrape yields exactly one.
text_found_per_article <- lengths(plain_text_all_articles)
mta_news_article_info <- mta_news_article_info[text_found_per_article == 1,]
plain_text_all_articles <- plain_text_all_articles[which(text_found_per_article == 1)]
plain_text_all_articles <- unlist(plain_text_all_articles)
#Function to convert special HTML-specific characters to normal characters.
#Source: https://stackoverflow.com/questions/5060076/convert-html-character-entity-encoding-in-r
html2txt <- function(str) {
xpathApply(htmlParse(str, asText=TRUE),
"//body//text()",
xmlValue)[[1]]
}
for(i in 1:length(plain_text_all_articles))
{
plain_text_all_articles[i] <- html2txt(plain_text_all_articles[i])
}
#Clean up that one Gothamist article that had a lot of extraneous text related to a photo gallery, by starting it at the actual starting text seen in a web browser.
gothamist_article_with_leading_photo_refs <- plain_text_all_articles[which(mta_news_article_info$source == "Gothamist.com")[1]]
gothamist_article_with_leading_photo_refs_corrected <- unlist(strsplit(unlist(gothamist_article_with_leading_photo_refs),"On any normal weekday"))
gothamist_article_with_leading_photo_refs_corrected <- paste0("On any normal weekday",gothamist_article_with_leading_photo_refs_corrected[2])
plain_text_all_articles[which(mta_news_article_info$source == "Gothamist.com")[1]] <- gothamist_article_with_leading_photo_refs_corrected
#Clean up all Daily News article text to get just the text specified in "articleBody" field.
for(i in grep('daily',mta_news_article_info$source,ignore.case=TRUE))
{
dailynews_article <- plain_text_all_articles[i]
dailynews_article_corrected <- unlist(strsplit(dailynews_article,"articleBody"))
dailynews_article_corrected <- dailynews_article_corrected[2]
dailynews_article_corrected <- strsplit(dailynews_article_corrected,'\\\"\\:')[[1]][2]
plain_text_all_articles[i] <- dailynews_article_corrected
}
We will use Indico sentiment analysis to determine whether articles about the MTA are more often positive or negative.
Indico is also an API. So again I will read in my key from my computer, but you should replace the API key variable with your own key.
We save the results after running, so the API only needs to be queried once.
my_indicoio_API_Key <- readLines("/Users/hmgeiger/Documents/indico_api_key.txt")
sentiments_per_article <- rep(NA,times=length(plain_text_all_articles))
for(i in 1:length(plain_text_all_articles))
{
sentiments_per_article[i] <- unlist(sentiment_hq(plain_text_all_articles[i],api_key = my_indicoio_API_Key))
}
save(sentiments_per_article,file="sentiments_per_article.Rdata")
As a complement to the sentiment scores, we will also analyze each piece of text for the emotion expressed by the author.
my_indicoio_API_Key <- readLines("/Users/hmgeiger/Documents/indico_api_key.txt")
emotions_per_article <- vector("list",length=length(plain_text_all_articles))
for(i in 1:length(plain_text_all_articles))
{
emotions_per_article[[i]] <- unlist(emotion(plain_text_all_articles[i],api_key = my_indicoio_API_Key))
}
save(emotions_per_article,file="emotions_per_article.Rdata")
emotions_per_article <- data.frame(Article.num = rep(1:length(plain_text_all_articles),each=5),
Emotion = rep(c("anger","joy","sadness","fear","surprise"),times=length(plain_text_all_articles)),
Probability = as.vector(unlist(emotions_per_article)),
Sentiment.score.article = rep(sentiments_per_article,each=5),
stringsAsFactors=FALSE)
Let’s start with a simple boxplot of ridership by line. We’ll just do 2017 for now.
Jitter the outliers a bit so we can identify them more clearly.
mean_weekly_ridership_per_year_minus_hubs_and_shuttles <- mean_weekly_ridership_per_year_minus_hubs_and_shuttles[!(duplicated(paste0(mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Line,"/",mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Station,"/",mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Year))),]
#Looks like Penn Station got listed as two separate stations for the 1/2/3 vs. A/C/E. Let's remove these.
#Also remove Jay Street from R train, which should be a "hub" station.
mean_weekly_ridership_per_year_minus_hubs_and_shuttles <- mean_weekly_ridership_per_year_minus_hubs_and_shuttles[mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Station != "34TH STREET & 7TH AVENUE" & mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Station != "34TH STREET & 8TH AVENUE" & mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Station != "JAY ST-METROTECH",]
#Get just ridership for 2017. Then, set up data frame to label outliers.
ridership_2017 <- mean_weekly_ridership_per_year_minus_hubs_and_shuttles[mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Year == 2017,]
ridership_2017 <- ridership_2017[which(is.na(ridership_2017$Line) == FALSE),]
#Code from here: https://stackoverflow.com/questions/33524669/labeling-outliers-of-boxplots-in-r
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
rownames(ridership_2017) <- paste0(as.character(as.vector(ridership_2017$Line)),"/",as.character(as.vector(ridership_2017$Line.Name)),"/",round(ridership_2017$Mean.weekly.ridership/1000,digits=1),"k")
ridership_2017_to_plot <- ridership_2017 %>%
tibble::rownames_to_column(var="outlier") %>%
group_by(Line) %>%
mutate(is_outlier = ifelse(is_outlier(Mean.weekly.ridership),Mean.weekly.ridership,as.numeric(NA)))
ridership_2017_to_plot$outlier[which(is.na(ridership_2017_to_plot$is_outlier))] <- as.numeric(NA)
#Label just the combination of lines per point, so that each label can be small.
ridership_2017_to_plot$outlier[is.na(ridership_2017_to_plot$outlier) == FALSE] <- unlist(lapply(strsplit(ridership_2017_to_plot$outlier[is.na(ridership_2017_to_plot$outlier) == FALSE],"/"),"[[",2))
#Manually jitter the coordinates used to plot outliers in cases where they are very close.
ridership_2017_to_plot <- data.frame(ridership_2017_to_plot,Mean.weekly.ridership.jitter = ridership_2017_to_plot$Mean.weekly.ridership,stringsAsFactors=FALSE)
ridership_2017_to_plot$Mean.weekly.ridership.jitter[ridership_2017_to_plot$outlier == "25" & round(ridership_2017_to_plot$is_outlier) == 111640] <- 118600
ridership_2017_to_plot$Mean.weekly.ridership.jitter[ridership_2017_to_plot$outlier == "123" & round(ridership_2017_to_plot$is_outlier) == 228029] <- 237000
ridership_2017_to_plot$Mean.weekly.ridership.jitter[ridership_2017_to_plot$outlier == "BD4" & is.na(ridership_2017_to_plot$is_outlier) == FALSE] <- 144000
ridership_2017_to_plot$Mean.weekly.ridership.jitter[which(ridership_2017_to_plot$outlier == "JMZ" & is.na(ridership_2017_to_plot$is_outlier) == FALSE)[2]] <- 82000
ridership_2017_to_plot$Mean.weekly.ridership.jitter[which(ridership_2017_to_plot$outlier == "6DF" & is.na(ridership_2017_to_plot$is_outlier) == FALSE)] <- 103000
ridership_2017_to_plot$Mean.weekly.ridership.jitter[which(ridership_2017_to_plot$outlier == "L" & is.na(ridership_2017_to_plot$is_outlier) == FALSE & round(ridership_2017_to_plot$is_outlier) < 125000)] <- 132000
ggplot(ridership_2017_to_plot,aes(x = Line, y = Mean.weekly.ridership.jitter,fill=Line,colour=Line)) +
geom_boxplot() +
ylab("Mean weekly ridership per station") +
scale_fill_manual(values=c(rep(c("red","forestgreen"),each=3),"violet",rep(c("blue","orange"),times=3),"green","brown","grey","orange",rep("gold",times=3),"brown")) +
scale_colour_manual(values=c(rep(c("red","forestgreen"),each=3),"violet",rep(c("blue","orange"),times=3),"green","brown","grey","orange",rep("gold",times=3),"brown")) +
ggtitle("2017 ridership per station by line") +
geom_text(aes(label = outlier),na.rm = TRUE, size=2.5,hjust=0.5,vjust=-1) +
coord_flip()
What are the outlier stations for each line in 2017?
outlier_stations <- ridership_2017_to_plot[is.na(ridership_2017_to_plot$is_outlier) == FALSE,c("Station","Line","Line.Name","Mean.weekly.ridership")]
outlier_stations[order(outlier_stations$Line,outlier_stations$Mean.weekly.ridership),]
## Station Line Line.Name Mean.weekly.ridership
## 5 96TH STREET-BROADWAY 1 123 209487.02
## 24 72ND STREET & BROADWAY 1 123 228028.77
## 73 FLATBUSH AVE & NOSTRAND AV 2 25 89878.44
## 40 CHAMBERS STREET 2 123 108602.25
## 51 149TH STREET & 3RD AVENUE 2 25 111639.81
## 47 96TH STREET-BROADWAY 2 123 209487.02
## 43 WALL STREET 2 23 226964.44
## 55 72ND STREET & BROADWAY 2 123 228028.77
## 102 96TH STREET-BROADWAY 3 123 209487.02
## 80 WALL STREET 3 23 226964.44
## 90 72ND STREET & BROADWAY 3 123 228028.77
## 114 86TH STREET-LEXINGTON AVE 4 456 249429.38
## 122 FLATBUSH AVE & NOSTRAND AV 5 25 89878.44
## 153 BOWLING GREEN & BATTERY PL 5 45 111471.08
## 130 149TH STREET & 3RD AVENUE 5 25 111639.81
## 150 125TH STREET-LEXINGTON AVE 5 456 145383.02
## 155 86TH STREET-LEXINGTON AVE 5 456 249429.38
## 177 86TH STREET-LEXINGTON AVE 6 456 249429.38
## 201 MAIN STREET 7 7 266539.38
## 216 168TH STREET & BROADWAY A AC1 140322.77
## 277 GRAND ST-CHRISTIE ST B BD 134217.87
## 249 161ST STREET-YANKEE STADIUM B BD4 138909.73
## 301 23RD STREET & 8TH AVENUE C CE 129928.13
## 307 168TH STREET & BROADWAY C AC1 140322.77
## 326 SEVENTH AVENUE & WEST 53RD D BDE 93514.65
## 334 BLEECKER ST-LAFAYETTE ST D 6DF 93543.81
## 331 GRAND ST-CHRISTIE ST D BD 134217.87
## 310 161ST STREET-YANKEE STADIUM D BD4 138909.73
## 351 LEXINGTON AVENUE E EM6 225725.58
## 385 23RD STREET-6TH AVENUE F FM 155362.46
## 412 MYRTLE AVENUE-BROADWAY J JMZ 72369.69
## 407 MARCY AVENUE-BROADWAY J JMZ 73164.25
## 418 SUTPHIN BOULEVARD J EJZ 118565.27
## 426 JAMAICA CENTER-PARSONS BLVD J EJZ 156666.00
## 434 MYRTLE AVE-GATES AVENUE L LM 121117.33
## 438 1ST AVENUE-14TH STREET L L 122186.13
## 433 BEDFORD AVE-NORTH 7TH ST L L 175625.50
## 462 LEXINGTON AVENUE M EM6 225725.58
## 479 57TH STREET & 7TH AVENUE N NQR 205361.37
## 515 57TH STREET & 7TH AVENUE Q NQR 205361.37
## 563 57TH STREET & 7TH AVENUE R NQR 205361.37
## 570 SUTPHIN BOULEVARD Z EJZ 118565.27
## 576 JAMAICA CENTER-PARSONS BLVD Z EJZ 156666.00
Looking at the boxplot, it appears that the A, J, and Z lines have more infrequently used stations than other lines, while the E and 6 have more frequently used stations.
Let’s also look at the distribution of how ridership at a given station has changed over time.
We were going to use a nice even span (2012 to 2017), but Hurricane Sandy in 2012 complicates things.
Let’s use 2011 to 2017.
We also remove two stations (Aqueduct Race Track and Cortlandt Street) known to be under construction or closed for much of 2011, as these would show inflated percent change values.
ridership_2017 <- mean_weekly_ridership_per_year_minus_hubs_and_shuttles[mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Year == 2017,]
ridership_2017 <- ridership_2017[which(is.na(ridership_2017$Line) == FALSE),]
ridership_2011 <- mean_weekly_ridership_per_year_minus_hubs_and_shuttles[mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Year == 2011,]
ridership_2011 <- ridership_2011[which(is.na(ridership_2011$Line) == FALSE),]
ridership_2011_and_2017 <- rbind(ridership_2011,ridership_2017)
ridership_2011_and_2017 <- ridership_2011_and_2017 %>%
select(c("Year","Mean.weekly.ridership","Line","Station","Line.Name")) %>%
spread(Year,Mean.weekly.ridership) %>%
mutate(Percent.change = (`2017` - `2011`)*100/`2011`)
ridership_2011_and_2017 <- ridership_2011_and_2017[ridership_2011_and_2017$Station != "AQUEDUCT RACE TRACK" & ridership_2011_and_2017$Station != "CORTLANDT STREET",]
ggplot(ridership_2011_and_2017,
aes(x = Line, y = Percent.change,fill=Line,colour=Line)) +
geom_boxplot() +
ylab("Percent change in mean weekly ridership 2011 to 2017") +
scale_fill_manual(values=c(rep(c("red","forestgreen"),each=3),"violet",rep(c("blue","orange"),times=3),"green","brown","grey","orange",rep("gold",times=3),"brown")) +
scale_colour_manual(values=c(rep(c("red","forestgreen"),each=3),"violet",rep(c("blue","orange"),times=3),"green","brown","grey","orange",rep("gold",times=3),"brown")) +
coord_flip()
Let’s also summarize the overall distribution of percent changes across stations.
summary(ridership_2011_and_2017$Percent.change)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -54.751 -3.921 3.355 5.532 13.284 96.719 8
Overall, more stations than not experienced increases in ridership from 2011 to 2017.
Relatively few stations experienced dramatic decreases (more than 4-5%).
The G and L appear especially notable for the level of increase in ridership they experienced across stations (not just for a few outlier stations) during this time frame.
First, let’s look at the correlation between our two performance metrics (% on time at terminals vs. % on time at stations) across years.
#Remove Z line, as stats are duplicated for this and J.
line_vs_OTP_and_wait <- mean_weekly_ridership_per_year_minus_hubs_and_shuttles %>% select(c("Year","Line","OTP","Wait")) %>% filter(Line != "Z")
line_vs_OTP_and_wait <- data.frame(line_vs_OTP_and_wait,stringsAsFactors=FALSE)
line_vs_OTP_and_wait <- line_vs_OTP_and_wait[!duplicated(line_vs_OTP_and_wait),]
#We were going to color by year, but then a separate trend line gets drawn per year. So just plot all years together for now.
#mycol <- c("grey","black","#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
ggplot(line_vs_OTP_and_wait,
aes(OTP,Wait)) +
geom_point(size=3) +
xlab("% on time at terminal") +
ylab("% on time at stations") +
geom_smooth(method = "lm")
While they are quite correlated, the correlation is imperfect enough that I think it makes sense to consider them separately.
Next, plot each metric in 2011 vs. 2017.
We draw the line x=y as well as a linear fit, so we can compare the trend against what we would see if the metrics had remained constant over time.
line_vs_OTP_and_wait %>%
select(c("Year","OTP","Line")) %>%
spread(Year,OTP) %>%
select(c("2011","2017")) %>%
ggplot(.,
aes(`2011`,`2017`)) +
geom_point(size=3) +
xlab("% on time at terminal in 2011") +
ylab("% on time at terminal in 2017") +
geom_smooth(method = "lm") +
geom_abline(linetype=2)
line_vs_OTP_and_wait %>%
select(c("Year","Wait","Line")) %>%
spread(Year,Wait) %>%
select(c("2011","2017")) %>%
ggplot(.,
aes(`2011`,`2017`)) +
geom_point(size=3) +
xlab("% on time at stations in 2011") +
ylab("% on time at stations in 2017") +
geom_smooth(method = "lm") +
geom_abline(linetype=2)
We find that both performance metrics are lower across the board in 2017 as compared to 2011.
Finally, plot the full trend across time per line.
ggplot(line_vs_OTP_and_wait,
aes(x = Year, y = OTP,colour=Line)) +
geom_line() + geom_point() +
scale_colour_manual(values=c(rep(c("red","forestgreen"),each=3),"violet",rep(c("blue","orange"),times=3),"green","brown","grey","orange",rep("gold",times=3),"brown")) +
ylab("% on time at terminal") +
facet_wrap(~Line)
ggplot(line_vs_OTP_and_wait,
aes(x = Year, y = Wait, group = Line,colour=Line)) +
geom_line() + geom_point() +
scale_colour_manual(values=c(rep(c("red","forestgreen"),each=3),"violet",rep(c("blue","orange"),times=3),"green","brown","grey","orange",rep("gold",times=3),"brown")) +
ylab("% on time at stations") +
facet_wrap(~Line)
We find that service generally started to decline around 2013/2014.
In this section, I will explore how each line’s changes in performance over time correlate with its changes in ridership over time.
Based on the performance trend lines, let’s compare 2013 to 2017.
line_vs_OTP_and_wait_2013_vs_2017 <- line_vs_OTP_and_wait %>%
filter(Year == 2013 | Year == 2017)
line_vs_OTP_2013_vs_2017 <- line_vs_OTP_and_wait_2013_vs_2017 %>%
select(c("Year","OTP","Line")) %>%
spread(Year,OTP) %>%
mutate(Percent.change.terminal.ontime = (`2017` - `2013`)*100/`2013`)
line_vs_wait_2013_vs_2017 <- line_vs_OTP_and_wait_2013_vs_2017 %>%
select(c("Year","Wait","Line")) %>%
spread(Year,Wait) %>%
mutate(Percent.change.stations.ontime = (`2017` - `2013`)*100/`2013`)
ridership_2017 <- mean_weekly_ridership_per_year_minus_hubs_and_shuttles[mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Year == 2017,]
ridership_2017 <- ridership_2017[which(is.na(ridership_2017$Line) == FALSE),]
ridership_2013 <- mean_weekly_ridership_per_year_minus_hubs_and_shuttles[mean_weekly_ridership_per_year_minus_hubs_and_shuttles$Year == 2013,]
ridership_2013 <- ridership_2013[which(is.na(ridership_2013$Line) == FALSE),]
ridership_2013_and_2017 <- rbind(ridership_2013,ridership_2017)
ridership_2013_and_2017 <- ridership_2013_and_2017 %>%
select(c("Year","Mean.weekly.ridership","Line","Station","Line.Name")) %>%
spread(Year,Mean.weekly.ridership) %>%
mutate(Percent.change.ridership = (`2017` - `2013`)*100/`2013`)
ridership_2013_and_2017 <- merge(ridership_2013_and_2017,line_vs_OTP_2013_vs_2017[,c("Line","Percent.change.terminal.ontime")],by="Line",all=TRUE)
ridership_2013_and_2017 <- merge(ridership_2013_and_2017,line_vs_wait_2013_vs_2017[,c("Line","Percent.change.stations.ontime")],by="Line",all=TRUE)
Can we model change in ridership as a function of change in either performance metric?
summary(lm(Percent.change.ridership ~ Percent.change.terminal.ontime,data=ridership_2013_and_2017))
##
## Call:
## lm(formula = Percent.change.ridership ~ Percent.change.terminal.ontime,
## data = ridership_2013_and_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.999 -7.877 -1.547 6.052 278.026
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6688 2.5109 3.851 0.000131 ***
## Percent.change.terminal.ontime 0.2831 0.0793 3.570 0.000388 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.42 on 559 degrees of freedom
## (23 observations deleted due to missingness)
## Multiple R-squared: 0.02229, Adjusted R-squared: 0.02054
## F-statistic: 12.74 on 1 and 559 DF, p-value: 0.0003881
summary(lm(Percent.change.ridership ~ Percent.change.stations.ontime,data=ridership_2013_and_2017))
##
## Call:
## lm(formula = Percent.change.ridership ~ Percent.change.stations.ontime,
## data = ridership_2013_and_2017)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.613 -7.827 -1.655 6.072 279.049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5073 2.7280 3.852 0.000131 ***
## Percent.change.stations.ontime 1.1049 0.3088 3.578 0.000376 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.42 on 559 degrees of freedom
## (23 observations deleted due to missingness)
## Multiple R-squared: 0.02239, Adjusted R-squared: 0.02064
## F-statistic: 12.8 on 1 and 559 DF, p-value: 0.000376
We find that the model coefficient is statistically significant using either metric. However, the model explains very little of the variance (note the very low R-squared).
Let’s visualize.
The plot excludes Cortlandt Street, which had over 200% change in ridership during this time frame.
ggplot(ridership_2013_and_2017,
aes(Percent.change.terminal.ontime,Percent.change.ridership,colour=Line)) +
xlab("% change in proportion on time at terminal") +
ylab("% change in ridership") +
scale_colour_manual(values=c(rep(c("red","forestgreen"),each=3),"violet",rep(c("blue","orange"),times=3),"green","brown","grey","orange",rep("gold",times=3),"brown")) +
ggtitle("Percent changes from 2013 to 2017") +
geom_smooth(aes(group = 1),method="lm") +
geom_jitter(alpha=1/4) +
coord_cartesian(ylim=c(-75,75))
ggplot(ridership_2013_and_2017,
aes(Percent.change.stations.ontime,Percent.change.ridership,colour=Line)) +
xlab("% change in proportion on time at stations") +
ylab("% change in ridership") +
scale_colour_manual(values=c(rep(c("red","forestgreen"),each=3),"violet",rep(c("blue","orange"),times=3),"green","brown","grey","orange",rep("gold",times=3),"brown")) +
ggtitle("Percent changes from 2013 to 2017") +
geom_smooth(aes(group = 1),method="lm") +
geom_jitter(alpha=1/4) +
coord_cartesian(ylim=c(-75,75))
We confirm visually that the model, while significant, clearly still leaves a lot of the variance unexplained.
Let’s make a simple histogram of the sentiment scores.
hist(sentiments_per_article,labels=TRUE,
xlab="Sentiment score (probability of positive sentiment)",
ylab="Number of articles",
main="News articles about MTA 02/04/18 to 05/04/18")
print("Proportion of articles with sentiment score > 0.8")
## [1] "Proportion of articles with sentiment score > 0.8"
print(length(which(sentiments_per_article > 0.8))/length(sentiments_per_article))
## [1] 0.3610451
print("Proportion of articles with sentiment score > 0.5")
## [1] "Proportion of articles with sentiment score > 0.5"
print(length(which(sentiments_per_article > 0.5))/length(sentiments_per_article))
## [1] 0.6437055
Sentiments of articles seem generally pretty positive, with around 36% of articles having a high sentiment score and 64% rated as at least more likely than not to be positive.
Does this mean that media coverage of the MTA is actually relatively positive?
Or could this simply be an issue of the Indico algorithm not working so well on this data set?
To fully evaluate the performance of the Indico algorithm here, we would need to manually review all the articles.
This is impractical for a data set of this size with only one reviewer.
However, we can sample the data and manually review at least a few articles.
Let’s sample three articles per sentiment level (very likely negative, probably negative, probably positive, and very likely positive).
Here we define “very likely” as a greater than 90% chance (sentiment score below 0.10 or above 0.90) and “probably” as a 75-90% chance (score between 0.10 and 0.25, or between 0.75 and 0.90).
Print title, description, and URL for each article.
set.seed(1392)
samples_sentiment_levels <- c(sample(which(sentiments_per_article < 0.10),3),
sample(which(sentiments_per_article >= 0.10 & sentiments_per_article < 0.25),3),
sample(which(sentiments_per_article > 0.75 & sentiments_per_article <= 0.90),3),
sample(which(sentiments_per_article > 0.90),3))
data.frame(mta_news_article_info[samples_sentiment_levels,c("title","description","url")],
Sentiment.score = sentiments_per_article[samples_sentiment_levels],
stringsAsFactors=FALSE)
## title
## 215 Mar. 4: armed teachers, MTA funding and happiness
## 221 STASI: 'Sex and the City' star Nixon isn't ready to run New York
## 143 Let everyone ride the subway
## 128 City says funding Gov's subway fix plan will hurt capital budget
## 229 Man in MAGA hat cuffed for allegedly pushing immigrant on tracks
## 82 NYC clerics urge fairness for Uber in any congestion pricing plan
## 178 Car Free Day returning to two major Manhattan avenues in April
## 378 Ride-sharing apps are driving NYC cabbies into financial ruin
## 308 De Blasio agrees to pony up $418M for Cuomo’s subway rescue plan
## 324 Manhattanites are flocking to this Queens street
## 348 Say goodbye to bar carts on LIRR platforms
## 347 MTA chief asks subway riders to report rogue pets after pitbull attack
## description
## 215 Lansdowne, Va.: As a former NYC schools chancellor, in my experience teachers are caregivers first, disciplinarians second.
## 221 Let's be clear here: Just because she's a famous actress doesn't mean she should be governor.
## 143 Monday morning in Manhattan state Supreme Court Justice Shlomo Hagler’s courtroom, lawyers for the Metropolitan Transportation Authority
## 128 Picking up the hefty tab for the subways’ capital plan would blow a hole in the city’s capital budget, Mayor de Blasio’s office said.
## 229 Cops arrested the unhinged guy in a MAGA hat accused of shoving a Mexican immigrant onto the tracks at Union Square, police sources said.
## 82 A group of 14 New York City clergy members are urging legislative leaders not to adopt a congestion pricing plan.
## 178 The city will banish cars from stretches of two major Manhattan avenues to mark a Car Free Day in New York this April.
## 378 NYC cabbies are being driven to the edge of financial ruin and despair as ride-hail apps like Uber and Lyft continue to take their customers. Last week, in the fourth driver suicide since November, Nicanor Ochisor, 65, hanged himself inside his Queens home, d…
## 308 Mayor Bill de Blasio agreed on Saturday to Gov. Andrew Cuomo’s demand that city put up $418 million more for the subway, according to a City Hall spokesman. The Cuomo administration’s $168.3 billion state budget plan announced Friday night included…
## 324 Convenience is king, as the old adage goes — and Long Island City is the king of convenience. At least, that’s long been the neighborhood’s big selling point. Just across the East River from Manhattan, LIC is about as close to Midtown as you can get without a…
## 348 It’s last call for alcohol on the Long Island Rail Road. The eight bar carts that ply suburban commuters with rush-hour booze are going the way of the Harvey Wallbanger, The Post has learned. Come March 27, the five carts on the platforms at Penn Station, alo…
## 347 MTA Chairman Joe Lhota on Wednesday asked riders to snitch on all non-service dogs seen on the subways not properly enclosed in a carrier. “I encourage New Yorkers, if they see anything like that, to report it,” said Lhota. The chairman made the statement at …
## url
## 215 http://www.nydailynews.com/opinion/mar-4-armed-teachers-mta-funding-happiness-article-1.3851967
## 221 http://www.nydailynews.com/new-york/stasi-sex-city-star-nixon-isn-ready-run-new-york-article-1.3886451
## 143 http://www.nydailynews.com/opinion/ride-subway-article-1.3852057
## 128 http://www.nydailynews.com/new-york/city-funding-gov-subway-fix-plan-hurt-capital-budget-article-1.3862081
## 229 http://www.nydailynews.com/new-york/nyc-crime/man-maga-hat-cuffed-allegedly-pushing-immigrant-tracks-article-1.3956571
## 82 http://www.nydailynews.com/news/politics/nyc-clerics-urge-fairness-uber-congestion-pricing-plan-article-1.3799800
## 178 http://www.nydailynews.com/new-york/manhattan/car-free-day-returning-major-manhattan-avenues-april-article-1.3818102
## 378 https://nypost.com/2018/03/24/ride-sharing-apps-are-driving-nyc-cabbies-into-financial-ruin/
## 308 https://nypost.com/2018/03/31/de-blasio-agrees-to-pony-up-418m-for-cuomos-subway-rescue-plan/
## 324 https://nypost.com/2018/04/11/manhattanites-are-flocking-to-this-queens-street/
## 348 https://nypost.com/2018/03/17/say-bye-bye-to-bar-carts-on-lirr-platforms/
## 347 https://nypost.com/2018/04/25/mta-chief-asks-subway-riders-to-report-rogue-pets-after-pitbull-attack/
## Sentiment.score
## 215 0.05347117
## 221 0.04206641
## 143 0.04704979
## 128 0.19018018
## 229 0.24185193
## 82 0.13517369
## 178 0.79187888
## 378 0.87156713
## 308 0.82868457
## 324 0.98967540
## 348 0.90395868
## 347 0.90753043
The first thing we notice is that the Google News API may have picked up some articles that aren’t really related to the MTA (for example, an article about yellow taxis vs. Uber/Lyft, which is transportation-related but not MTA-related).
We also find that a few of the positive articles have enough connection to the MTA that it makes sense that they were tagged, but they are not directly related to MTA service. For example, a car-free day relates to the subway because people took the subway more that day, and an article about people moving to Long Island City mentions the neighborhood’s good public transportation connections.
Some of the articles’ sentiment scores actually make sense if you read them. For example, the article about bar carts leaving the LIRR strikes a fondly nostalgic tone toward the carts, even as it is negative about their removal.
Let’s choose a few more positive articles from each of the two most frequent sources to manually review.
#Re-run the earlier sampling with the same seed, so we can exclude the articles already reviewed above.
set.seed(1392)
samples_sentiment_levels <- c(sample(which(sentiments_per_article < 0.10),3),
sample(which(sentiments_per_article >= 0.10 & sentiments_per_article < 0.25),3),
sample(which(sentiments_per_article > 0.75 & sentiments_per_article <= 0.90),3),
sample(which(sentiments_per_article > 0.90),3))
sample_positive_3_from_each_source <- sample(setdiff(which(sentiments_per_article > 0.90 & mta_news_article_info$source == "Nypost.com"),samples_sentiment_levels),3)
sample_positive_3_from_each_source <- c(sample_positive_3_from_each_source,sample(setdiff(which(sentiments_per_article > 0.90 & mta_news_article_info$source == "Nydailynews.com"),samples_sentiment_levels),3))
data.frame(mta_news_article_info[sample_positive_3_from_each_source,c("title","description","url")],Sentiment.score = sentiments_per_article[sample_positive_3_from_each_source])
## title
## 368 City ferry system sees huge swell of passengers in first year
## 283 Powerful nor’easter will make your commute ‘dangerous’
## 380 State legislators near $170B budget deal that includes taxi surcharge
## 204 De Blasio, pols urging Albany to add speed cameras near schools
## 95 Winners have been selected in MTA Genius Challenge
## 171 MTA operations head Phil Eng to replace embattled LIRR president
## description
## 368 Ferry nice! The city’s ferry system has proven to be such a hit that it now projects carrying up to 9 million passengers yearly by 2023 – nearly twice as many as first estimated. During a press conference at a ferry stop in Brooklyn’s Bay Ridge, Mayor de Blas…
## 283 Winter is making the most of its final two weeks — with the second powerful nor’easter in five days set to dump up to 10 inches of snow across the city on Wednesday. The storm, packing 45-mph winds, began overnight and will continue until at least 8 p.m., mak…
## 380 The state Legislature was nearing a deal Friday night on a $170 billion budget that for the first time imposes a surcharge on ride-sharing and yellow taxi rides in a large swath of Manhattan. The fee was scheduled to kick in on Jan. 1, 2019 for rides below 96…
## 204 Advocates and a slew of city elected officials pleaded with Albany to authorize the city to install more speed cameras around schools.
## 95 The names of the geniuses who can rescue the subway from dysfunction will be announced Friday, MTA Chairman Joe Lhota said.
## 171 The MTA’s operations chief, Phil Eng, will take over the troubled Long Island Rail Road service, sources told the Daily News.
## url
## 368 https://nypost.com/2018/05/03/city-ferry-system-sees-huge-swell-of-passengers-in-first-year/
## 283 https://nypost.com/2018/03/06/powerful-noreaster-will-make-your-commute-dangerous/
## 380 https://nypost.com/2018/03/30/state-legislators-near-170b-budget-deal-that-includes-taxi-surcharge/
## 204 http://www.nydailynews.com/news/politics/de-blasio-pols-urging-albany-add-speed-cameras-schools-article-1.3890480
## 95 http://www.nydailynews.com/new-york/winners-selected-mta-genius-challenge-article-1.3863549
## 171 http://www.nydailynews.com/new-york/mta-operations-head-phil-eng-replace-embattled-lirr-president-article-1.3930239
## Sentiment.score
## 368 0.9610190
## 283 0.9235188
## 380 0.9376095
## 204 0.9732591
## 95 0.9327753
## 171 0.9894193
Looking in detail at a few more articles, it appears that the Indico algorithm simply does not perform so well on this data set.
The article from the NY Post titled “Powerful nor’easter will make your commute ‘dangerous’” seems like it should certainly be negative, or at the very least neutral.
Another article titled “State legislators near $170B budget deal that includes taxi surcharge” has some positive aspects, but really does not seem overwhelmingly positive enough to warrant a 0.94 sentiment score.
Let’s see if we can use the emotion data to figure out what’s going on here.
For each of the five emotions tested (anger, joy, sadness, fear, and surprise), plot their probabilities vs. sentiment score, colored by source.
Let’s plot just the two main sources we have (NY Post and NY Daily News) for now, and exclude less frequent sources.
emotions_per_article <- data.frame(emotions_per_article,
Source = rep(mta_news_article_info$source,each=5),
stringsAsFactors=FALSE)
emotions_per_article_post_vs_daily_news <- emotions_per_article[emotions_per_article$Source == "Nypost.com" | emotions_per_article$Source == "Nydailynews.com",]
ggplot(emotions_per_article_post_vs_daily_news,
aes(Probability,Sentiment.score.article,colour=Source)) +
facet_wrap(~Emotion,scales="free_x") +
geom_point() +
xlab("Probability of emotion") +
ylab("Sentiment score of article")
The first thing we notice is that the probabilities for all five emotions are mostly relatively low. Anger and sadness have more points for probability > 0.5 than the other three emotions, but still most points are at < 0.5.
We also notice that the probabilities of each emotion do not seem very correlated with the sentiment scores.
Let’s check the actual correlation scores.
for(emotion in unique(emotions_per_article_post_vs_daily_news$Emotion))
{
print(emotion)
print(cor(emotions_per_article_post_vs_daily_news[emotions_per_article_post_vs_daily_news$Emotion == emotion,"Probability"],
emotions_per_article_post_vs_daily_news[emotions_per_article_post_vs_daily_news$Emotion == emotion,"Sentiment.score.article"]))
}
## [1] "anger"
## [1] -0.05546636
## [1] "joy"
## [1] 0.07008235
## [1] "sadness"
## [1] -0.05640409
## [1] "fear"
## [1] 0.1340254
## [1] "surprise"
## [1] -0.009517532
We confirm that the correlations are indeed quite weak.
Strangely enough, the strongest correlation is also counterintuitive: a higher probability of fear is slightly positively correlated with a more positive sentiment score.
Overall, I think these results make some sense.
Even though news sources of course will often put their own “spin” on things, the low probability scores for the major emotions indicate a relatively neutral emotional tone for a lot of news articles.
Considering the Indico algorithm was trained on movie reviews (which are probably more likely to exhibit the author’s emotions on the subject), it is not all that surprising that it might not perform as well on less emotional news articles.
We confirmed that ridership has generally increased, while performance has generally decreased over time.
Our finding that changes in ridership were correlated with changes in performance was initially promising, but the strength of the correlation is not enough to be practically significant. It is more likely that other factors are the main drivers behind where people work and live, rather than whether the nearest subway line has had more delays over time.
Unfortunately we were unable to get good data on the tone of media reports about the MTA using the tool we tested (Indico sentiment analysis). Our finding that one major NY newspaper tended to have more positive sentiment scores than the other seemed promising at first. But in light of what seems to be questionable performance of Indico on news articles, it is unclear we can conclude much from this finding. A future direction might be to test other algorithms for sentiment analysis and see if they perform better. If one could use a more accurate algorithm to increase confidence in the results, it would be quite interesting to see if different local publications have a different tone toward local issues like this one.
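As one illustration of what such a cross-check could look like, here is a minimal sketch using a lexicon-based method from the syuzhet package. This is an assumption on my part rather than part of the original analysis, and syuzhet’s AFINN scores are on a different scale than Indico’s 0-1 probabilities, so only the direction of agreement would be meaningful.
library(syuzhet)
#Lexicon-based sentiment per article, as a rough second opinion.
syuzhet_scores <- get_sentiment(plain_text_all_articles,method="afinn")
#Check the agreement with the Indico probabilities.
cor(syuzhet_scores,sentiments_per_article)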
As mentioned earlier, it would be quite interesting to track changes in tone of media reports about the MTA over time. This would be a good idea for a future direction of this work.
Let’s make a Leaflet map of all subway entrances on the L train.
Coordinates per subway entrance available as a CSV file here:
http://web.mta.info/developers/data/nyct/subway/StationEntrances.csv
library(leaflet)
station_coords <- read.csv("StationEntrances.csv",header=TRUE,check.names=FALSE,stringsAsFactors=FALSE)
L_train_station_coords <- station_coords %>% filter(Route_1 == "L" | Route_2 == "L" | Route_3 == "L" | Route_4 == "L")
L_train_station_coords %>% leaflet %>% addTiles() %>%
addAwesomeMarkers(popup=L_train_station_coords$Station_Name,
icon=makeAwesomeIcon(text="L",squareMarker=TRUE,iconColor="grey"))
Note: for some reason, the markers defaulted to red when I tried to set markerColor to grey.
So I left markerColor unset, since the default light blue is preferable to a red that is very similar to the color of the 1/2/3 line.
Settings to change the color and shape of the icon worked fine, however. So I made the icon a grey “L”.
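One untested guess about the red markers: the underlying awesome-markers plugin may expect the spelling “gray” rather than “grey”, falling back to a default color when given an unrecognized name. A sketch of that variant, not verified here:
#Untested: markerColor may require the "gray" spelling.
L_train_station_coords %>% leaflet %>% addTiles() %>%
addAwesomeMarkers(popup=L_train_station_coords$Station_Name,
icon=makeAwesomeIcon(text="L",squareMarker=TRUE,iconColor="grey",markerColor="gray"))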