Introduction:
Initially, we planned to base our lab on a blog post that used tweets from a trending Twitter hashtag, #7FavPackages, to see which R packages were most popular. We thought this would be a great post to work from because it showed that Twitter’s search API can be used to quickly access and analyze a large sample of data for answering questions. The blog post can be found here: http://varianceexplained.org/r/seven-fav-packages/.
Unfortunately, as we got further into the lab we realized that we couldn’t reproduce the results of that blog post because the author left out numerous details, such as how he created the bar graphs showing which packages were mentioned and how popular they were. Furthermore, because Twitter changed its search API, we couldn’t get anywhere near as large a sample of tweets as the author did (7 vs. the author’s 700).
Because of this, we decided to find a different blog post that used the same idea of analyzing Twitter data to answer questions while also allowing us a slightly larger sample size. That blog post can be found here: https://www.r-bloggers.com/r-text-mining-on-twitter-prayformh370-malaysia-airlines/.
Analysis:
This reproduced analysis creates a word cloud of the words that appear most often in tweets using the hashtag #PrayForMH370. Because Twitter changed its search API, we only had access to tweets posted in the last 7 days, which severely limited what we could reproduce compared with the original analysis.
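# Packages used below; searchTwitter() requires prior authentication via twitteR's setup_twitter_oauth().
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
# Request up to 1000 recent tweets that use the hashtag.
mh370 <- searchTwitter("#PrayForMH370", n = 1000)
# Pull the text out of each tweet object.
mh370_text <- sapply(mh370, function(x) x$getText())
# Build a corpus and a term-document matrix, normalizing the text along the way.
mh370_corpus <- Corpus(VectorSource(mh370_text))
tdm <- TermDocumentMatrix(
  mh370_corpus,
  control = list(
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    tolower = TRUE
  )
)
# Sum each term's count across all tweets and sort from most to least frequent.
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
# Draw the word cloud with the most frequent words placed first.
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))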
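Because the Twitter search only returns recent tweets, the pipeline above is hard to rerun later. The sketch below shows roughly what the term-counting step does, using only base R and a few made-up sample texts, so it can be run without any Twitter access. The names `sample_texts`, `clean_text`, and `count_words` are our own illustrative choices, and the cleaning only approximates what the tm package does (we assume tm's default of dropping terms shorter than three characters).

```r
# Hypothetical sample texts standing in for tweets (not real data).
sample_texts <- c(
  "Pray for MH370, hoping for good news",
  "Hoping and praying for the families #PrayForMH370",
  "No news yet on MH370"
)

# Lower-case the text, then strip punctuation and digits, roughly mirroring
# the tolower / removePunctuation / removeNumbers options used above.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:digit:]]", "", x)
  x
}

# Split on whitespace and count every remaining token across all texts.
count_words <- function(texts) {
  tokens <- unlist(strsplit(clean_text(texts), "[[:space:]]+"))
  # Keep tokens of three or more characters (assumed to match tm's default).
  tokens <- tokens[nchar(tokens) >= 3]
  sort(table(tokens), decreasing = TRUE)
}

toy_freqs <- count_words(sample_texts)
toy_dm <- data.frame(word = names(toy_freqs), freq = as.integer(toy_freqs))
print(head(toy_dm))
```

The resulting `toy_dm` has the same two columns (`word`, `freq`) as `dm`, so it could be passed to `wordcloud()` in the same way.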
Follow-up:
For this portion of the lab we decided to extend the analysis by comparing the reproduced word cloud against one built from tweets about another major airplane crash that happened relatively recently (Germanwings Flight 9525). Our goal was to see whether the tweets after each incident were similar even though the two had different causes (one plane disappeared mid-flight, the other was intentionally crashed into a mountainside by a suicidal pilot). Our expectation was that one flight would evoke tweets of sadness while the other would evoke anger, since one was an accident and the other was intentional.
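# Request up to 500 recent tweets that mention the second flight.
flight9525 <- searchTwitter("flight 9525", n = 500)
# Pull the text out of each tweet object.
flight9525_text <- sapply(flight9525, function(x) x$getText())
# Build the corpus and term-document matrix with the same cleaning options as before.
flight9525_corpus <- Corpus(VectorSource(flight9525_text))
tdm <- TermDocumentMatrix(
  flight9525_corpus,
  control = list(
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    tolower = TRUE
  )
)
# Sum each term's count across all tweets and sort from most to least frequent.
# Note: tdm, m, word_freqs, and dm overwrite the objects from the first analysis.
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
dm <- data.frame(word = names(word_freqs), freq = word_freqs)
# Draw the word cloud for the second set of tweets.
wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))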
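Two word clouds can only be compared by eye. One possible way to make the comparison more concrete is to put the two sets of word counts side by side as shares of each sample. The sketch below is only an illustration in base R: the counts in `freqs_a` and `freqs_b` are made-up numbers, not results from our searches, and `compare_freqs` is a hypothetical helper. It does not measure sadness or anger; it only surfaces the words whose relative frequency differs most between the two sets.

```r
# Hypothetical word counts standing in for the two word_freqs vectors
# (made-up numbers, not results from our searches).
freqs_a <- c(pray = 12, hope = 9, family = 5, missing = 4)
freqs_b <- c(crash = 10, family = 6, pilot = 5, pray = 3)

# Convert raw counts to shares so that samples of different sizes
# (n = 1000 vs. n = 500 requested) can be compared on the same scale.
compare_freqs <- function(a, b) {
  words <- union(names(a), names(b))
  share <- function(f) {
    out <- f[words] / sum(f)
    out[is.na(out)] <- 0  # word does not occur in this set
    names(out) <- words
    out
  }
  result <- data.frame(
    word = words,
    share_a = as.numeric(share(a)),
    share_b = as.numeric(share(b)),
    stringsAsFactors = FALSE
  )
  # Positive diff: relatively more common in set a; negative: in set b.
  result$diff <- result$share_a - result$share_b
  result[order(-abs(result$diff)), ]
}

comparison <- compare_freqs(freqs_a, freqs_b)
print(comparison)
```

With real data, the two named vectors would need to be saved under separate names before the second analysis runs, since the code above reuses `word_freqs` for both searches.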
Discussion:
In the end, our findings were inconclusive. Because Twitter changed its search API, we could only retrieve a small number of tweets, which meant our sample of words was extremely small. As a result, our word clouds were significantly less detailed and accurate than we had intended. When we started this project we thought it would be pretty straightforward and relatively easy. We were wrong. This lab turned out to be much more involved.
Going forward, there’s a lot that we would do differently. For example, we would try to work with a more accessible and consistent set of data: something we could access without a series of Twitter authentication steps, and something that didn’t depend on a third party the way our search results depended on Twitter’s search API. Overall, we think this lab was a great learning experience that taught us what we should and shouldn’t do in future labs.