In the past few years, the MTA fare has increased substantially compared to years past. Many would say that the fare is rising to no avail, as the service remains the same or gets worse. Many riders have been quite vocal about their dissatisfaction with the MTA and use Twitter as an outlet for their complaints. In the following analysis, I investigate tweets with the hashtag #MTA and run a sentiment analysis on the overall tone people take when speaking of the MTA on Twitter. Though there are several limitations on the conclusions I can draw from this preliminary analysis, I had a lot of fun doing it and it resonates with me personally.
In the above chunk, I loaded the necessary libraries. I also set up my connection to Twitter, which allowed me to scrape #MTA tweets.
#Requesting up to 10,000 English-language #MTA tweets posted since April 1, 2017
MTAtweets <- twListToDF(searchTwitter("#MTA", n = 10000, lang = "en", since = "2017-04-01"))
#Saving in a CSV
write.csv(MTAtweets, "/Users/sophia.halkitis/Desktop/R/Datasets/mtatweets.csv")
Then I pulled the most recent English-language tweets containing #MTA, going back to April 1, 2017 and capped at 10,000 tweets, and saved them to a CSV file.
MTAtweets <- read.csv("/Users/sophia.halkitis/Desktop/R/Datasets/mtatweets.csv")%>%
filter(screenName!="infinitetransit")
#Removing emojis (and any other non-ASCII characters)
MTAtweets$text <- sapply(MTAtweets$text, function(row) iconv(row, "latin1", "ASCII", sub = ""))
nrow(MTAtweets)
[1] 2423
#Putting one token per cell
MTAtext <- MTAtweets %>%
unnest_tokens(word, text)
#Removing stop words
data(stop_words)
#Adding more words as stop words - identified these stop words when I ran the count of the most common words
morestopwords <- bind_rows(
data_frame(word = c("https","rt","t.co","mta","the","on","to", "na9jl98nil", "a0iswbh8xc", "2", "casatino"),
lexicon = c("custom")),
stop_words)
MTAtext <-MTAtext %>%
anti_join(morestopwords)
Joining, by = "word"
In the above chunk, I used the skills we learned in class and the Text Mining with R textbook to clean up the tweets. First I removed the emojis, then I reshaped the data so that each cell holds one word rather than one whole tweet, and finally I removed the stop words, including a few custom ones that I identified when I ran the count of the most common words.
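To make the three cleaning steps concrete, here is a minimal sketch on two made-up tweets. It uses only base R as a rough stand-in for `unnest_tokens()` and `anti_join()`, so the tokenization rule and the short stop-word list are assumptions for illustration, not what tidytext does internally.

```r
# Hypothetical example tweets; not taken from the scraped dataset
tweets <- data.frame(
  id = 1:2,
  text = c("RT Delays on the #MTA again \U0001F621",
           "Love the new MTA fare hike"),
  stringsAsFactors = FALSE
)

# Step 1: drop characters that have no ASCII equivalent (this removes emojis)
tweets$text <- iconv(tweets$text, from = "UTF-8", to = "ASCII", sub = "")

# Step 2: one token per row. Lowercase the text, then split on any run of
# characters that is not a letter, digit, or apostrophe (an assumed rule)
tokens_list <- strsplit(tolower(tweets$text), "[^a-z0-9']+")
tokens <- data.frame(
  id = rep(tweets$id, lengths(tokens_list)),
  word = unlist(tokens_list),
  stringsAsFactors = FALSE
)
tokens <- tokens[tokens$word != "", ]

# Step 3: remove stop words (a tiny illustrative list, not the real lexicon)
stop_list <- c("rt", "on", "the", "mta")
tokens <- tokens[!tokens$word %in% stop_list, ]

tokens$word
```

Each remaining row keeps the `id` of the tweet it came from, which is what makes it possible to count words per tweet later on.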
MTAtext%>%
count(word, sort = TRUE) %>%
filter(n > 100)%>%
mutate(word = reorder(word,n))%>%
ggplot(aes(word,n))+
geom_col(fill="orange")+
coord_flip()+
theme_minimal()
#Sentiment analysis with Bing Lexicon
MTAbing <- get_sentiments("bing")
MTAtext2 <- merge(MTAtext, MTAbing,
by.x = "word", by.y = "word",
all.x = FALSE, all.y = FALSE)
nrow(MTAtext2)
[1] 1725
In this chunk, I used the “bing” lexicon to classify each word as either positive or negative. I merged the bing lexicon with my dataset of MTA tweets as an inner join, so every word included in the analysis has a corresponding sentiment attributed to it, and any word without one is dropped. My final analysis consisted of 1,725 sentiment-bearing words from the initial sample of 2,423 tweets.
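The effect of the inner join can be seen on a toy example. The sketch below uses a made-up three-word lexicon in place of the real bing lexicon and a handful of made-up tokens; only the `merge()` call mirrors the one in the analysis.

```r
# Hypothetical tokens, one per row (not the real tweet data)
tokens <- data.frame(
  word = c("delays", "train", "love", "delays", "platform"),
  stringsAsFactors = FALSE
)

# Hypothetical miniature lexicon standing in for get_sentiments("bing")
lexicon <- data.frame(
  word = c("delays", "love", "happy"),
  sentiment = c("negative", "positive", "positive"),
  stringsAsFactors = FALSE
)

# Inner join: keep only rows whose word appears in both tables
scored <- merge(tokens, lexicon,
                by.x = "word", by.y = "word",
                all.x = FALSE, all.y = FALSE)

# Words that were in the tokens but had no sentiment, and so were dropped
dropped <- setdiff(tokens$word, lexicon$word)

nrow(scored)
dropped
```

Note that the result has one row per matched word, so its row count is a count of words, not of tweets.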
ggplot(MTAtext2) +
geom_bar(aes(x = sentiment), fill = "Orange") +
theme_minimal()
I then made a bar chart to compare the counts of negative and positive words that the lexicon identified. The chart shows that the number of negative words used in tweets with the MTA hashtag is more than double the number of positive words.
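A ratio read off a bar chart can also be checked numerically. The sketch below shows the arithmetic on a made-up vector of sentiment labels; with the real data, the same two lines could be applied to the `sentiment` column of `MTAtext2`.

```r
# Hypothetical sentiment labels, one per matched word (not the real results)
sentiment <- c("negative", "negative", "positive", "negative",
               "negative", "positive", "negative")

# Count each label, then divide negative by positive
counts <- table(sentiment)
ratio <- counts[["negative"]] / counts[["positive"]]

ratio
```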
MTAtext2%>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(max.words = 100)
Then I made a word cloud of the positive and negative words from the tweets. As is apparent, the most commonly used word, “delays,” is a negative word.
#Finds only tweets containing "happy" in the "text" column and prints just that column
MTAtweets%>%
filter(str_detect(text, "happy"))%>%
subset(select=c("text"))%>%
print()
An unfortunate limitation of this kind of text analysis is that it can only identify a word’s innate positive or negative sentiment, not the context in which the word is used, which may be sarcastic or satirical. Because of this, I wanted to investigate further and provide context for some of the commonly used positive words, so I filtered for the full tweets with the word “happy” in them. In the above chunk, we can see that the word “happy” is often used sarcastically under the hashtag #happymonday, like when people experience MTA delays and speak sarcastically of them.
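One thing to keep in mind when filtering this way is that a plain pattern such as "happy" is a case-sensitive substring match, so it also catches “unhappy” and misses “Happy”. The sketch below illustrates the difference on three made-up tweets, using base R's `grepl()` rather than `str_detect()`; the whole-word pattern is one possible alternative, not the approach used in this analysis.

```r
# Hypothetical tweets; not taken from the scraped dataset
tweets <- c(
  "Happy Monday from a stalled F train",
  "So unhappy with the L train",
  "happy to report the 7 is on time"
)

# Case-sensitive substring match, comparable to matching on "happy" directly
substring_match <- grepl("happy", tweets, fixed = TRUE)

# Case-insensitive whole-word match using word boundaries
whole_word <- grepl("\\bhappy\\b", tweets, ignore.case = TRUE, perl = TRUE)

data.frame(tweets, substring_match, whole_word)
```

The whole-word version is stricter in both directions: it would also skip hashtags such as #happymonday, which are exactly the sarcastic uses of interest here, so the looser substring match is a reasonable choice for this kind of manual inspection.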
MTAtweets%>%
filter(str_detect(text, "love"))%>%
subset(select=c("text"))%>%
print()
The word “love” is used in a similar way: many people use it to imply that they are fed up with how the MTA fare continues to increase with no corresponding improvement in timeliness.
MTAtweets%>%
filter(str_detect(text, "enjoy"))%>%
subset(select=c("text"))%>%
print()
Similarly, the word “enjoy” is often used sarcastically when people are complaining about the delays (see line 4 of the output).
As I predicted, many of the positive words were used sarcastically or ironically, so ultimately negativity is even more prevalent than we can observe from the word cloud.
As we can see, people are pretty unhappy with the MTA!