I chose to do my final project on Time’s Person of the Year award. The winner had not yet been announced when I began and was revealed during the course of my work, which made this an interesting, ever-changing project. I chose to use data from a few different sources, including Google News, Google Trends, and Twitter.
To begin working with the Twitter data, I pulled in the tweets, formatted them as a data frame, and then tokenized the tweet text with a regular expression, which also removes the hashtags from the text.
library(twitteR)
library(dplyr)
library(stringr)
library(tidytext)

## Pull the most recent tweets tagged #TIMEPOY
num_tweets <- 1000
tweets <- searchTwitter('#TIMEPOY', n = num_tweets)
POY_df <- twListToDF(tweets)
head(POY_df)

## Tokenize the tweet text, keeping @mentions and in-word apostrophes intact
reg <- "([^A-Za-z\\d@']|'(?![A-Za-z\\d#@]))"
POY_words <- POY_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
As I did in my Assignment 4 report, I used a word cloud to get an initial visualization of the words in my tweets. It is a quick way to get a sense of the most commonly used words. As I didn’t want obvious words such as “time,” “person,” or “rt” to dominate, I excluded them (and a few others) from my word cloud.
##Create WordCloud for the Text in #TIMEPOY Tweets
library(wordcloud)
POY_words %>%
  filter(!word %in% c("rt", "person", "timepoy", "@time", "https", "time")) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  with(wordcloud(word, n, max.words = 50, scale = c(5, .5), min.freq = 5,
                 random.order = FALSE, rot.per = .15, colors = brewer.pal(8, "Set1")))
Next, I wanted to see the most common words in a table rather than a word cloud. I again excluded the common basic words that were not of interest.
library(knitr)
kable(POY_words %>%
        filter(!word %in% c("rt", "person", "timepoy", "@time", "https", "time")) %>%
        count(word) %>%
        mutate(frequency = n / sum(n)) %>%
        arrange(desc(n)) %>%
        top_n(15))
| word | n | frequency |
|---|---|---|
| trump | 407 | 0.0869101 |
| donald | 389 | 0.0830664 |
| time’s | 246 | 0.0525304 |
| @realdonaldtrump | 119 | 0.0254111 |
| cover | 106 | 0.0226351 |
| president | 102 | 0.0217809 |
| meet | 94 | 0.0200726 |
| chosen | 84 | 0.0179372 |
| read | 68 | 0.0145206 |
| history | 67 | 0.0143071 |
| america | 66 | 0.0140935 |
| helped | 66 | 0.0140935 |
| interview | 66 | 0.0140935 |
| voters | 65 | 0.0138800 |
| house | 64 | 0.0136665 |
| white | 64 | 0.0136665 |
I used the sentiment analysis we learned in previous units to illustrate the most common sentiments among the words in tweets containing #TimePOY. Generally the feelings were positive, but there are limitations to relying on individual words for sentiment analysis, which I illustrate after the table below.
##Join Words from #TIMEPOY Tweets to Sentiments
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)
head(nrc)
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
POY_sentiments <- POY_words %>% inner_join(nrc, by = "word")
POY_sentiments %>%
  group_by(sentiment) %>%
  summarize(n = n()) %>%
  mutate(frequency = n / sum(n)) %>%
  arrange(desc(n))
## # A tibble: 10 × 3
## sentiment n frequency
## <chr> <int> <dbl>
## 1 trust 523 0.250839329
## 2 positive 486 0.233093525
## 3 surprise 437 0.209592326
## 4 anticipation 321 0.153956835
## 5 joy 131 0.062829736
## 6 negative 78 0.037410072
## 7 sadness 37 0.017745803
## 8 anger 31 0.014868106
## 9 fear 26 0.012470024
## 10 disgust 15 0.007194245
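One limitation worth illustrating: because the text is split into single words before joining to the lexicon, negation is lost. A minimal sketch (the exact tags depend on the lexicon version):
## "good" still matches positive entries even though the phrase is negated;
## the negating "not" typically has no lexicon entry, so the join drops it
data.frame(text = "not good", stringsAsFactors = FALSE) %>%
  unnest_tokens(word, text) %>%
  inner_join(nrc, by = "word")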
I wanted to represent the data visually, making it easy to consume.
##Summarize Data for Bar Chart
summary <- POY_sentiments %>%
  group_by(sentiment) %>%
  summarize(n = n()) %>%
  mutate(frequency = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(8)
##Plot the Sentiment Frequencies as a Bar Chart
library(ggplot2)
ggplot(summary, aes(x = sentiment, y = frequency, fill = n)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Sentiment", y = "Frequency of Sentiment",
       title = "Overall Sentiments of Tweets Using #TimePOY")
Another aspect of the tweets I wanted to use was location (more on my interest below). I began by pulling location data for users with public profiles who were among the top 100 users by number of tweets.
##Group Tweets by Users
POY_users <- POY_df %>%
  group_by(screenName) %>%
  summarize(n = n()) %>%
  mutate(percentage = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(100)
top_users <- POY_users %>%
  filter(n > 1)
head(top_users)
## # A tibble: 6 × 3
## screenName n percentage
## <chr> <int> <dbl>
## 1 Real_Infinity95 8 0.008
## 2 AziziOthmanMY 7 0.007
## 3 Mrollando 7 0.007
## 4 powerusergr 7 0.007
## 5 reesebenyaacov 7 0.007
## 6 TIME 7 0.007
## Look up a user's profile location via the Twitter API
getLocation <- function(x) {
  y <- getUser(x)
  location <- y$location
  return(location)
}
top_users$screenName
## [1] "Real_Infinity95" "AziziOthmanMY" "Mrollando"
## [4] "powerusergr" "reesebenyaacov" "TIME"
## [7] "Willam_grey" "AlfidioValera" "kapilt41"
## [10] "realjoet" "sembronio" "AccessAiNews"
## [13] "aiuramasakaz" "AleixAlmirall" "Veep_MikePence"
## [16] "AakashGauttam" "AnsisEgle" "baikap"
## [19] "BhaktHercules" "BubashLance" "centroempleotoc"
## [22] "ColorMeRed" "desh_bhkt" "digitaljotter"
## [25] "elviador" "Firozkh15116079" "Flaumer"
## [28] "gedwa75" "HazemFKandil" "iodyssee"
## [31] "jogbosky" "JorgeRi37481209" "KACHARAGADLA"
## [34] "KofoAregbesola" "mgznrdr" "mnsdall"
## [37] "PartyAtHarambes" "rhiles2760" "RosaMariaV777"
## [40] "sandeepdixit10" "TheOfficialNews" "tphallett"
## [43] "ypstomer"
user_location <- sapply(top_users$screenName, getLocation)
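One caveat: getUser() throws an error for suspended or protected accounts, which would abort the whole sapply() call above. A minimal defensive sketch (the safeGetLocation wrapper is my own addition, not part of twitteR):
## Return NA instead of failing when a profile can't be retrieved
safeGetLocation <- function(x) {
  tryCatch(getLocation(x), error = function(e) NA_character_)
}
user_location <- sapply(top_users$screenName, safeGetLocation)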
I took the locations of the users gathered above and linked them to latitude and longitude using geocode. Initially I was interested in whether areas a potential winner was from were more common than areas without anyone on the “short list.” However, as I continued running the script after the winner was announced, I instead looked for any trend in locations, and in particular whether most users were in the United States (as the winner was Donald Trump).
library(leaflet)
library(ggmap)
##Pull the Latitude and Longitude for the Locations
Countries <- geocode(user_location, output = "latlon", source = "google")
##Remove NAs and Show Results
Countries <- na.omit(Countries)
kable(Countries)
| | .id | lon | lat |
|---|---|---|---|
| 1 | Real_Infinity95 | -99.9018131 | 31.968599 |
| 2 | AziziOthmanMY | 101.9757660 | 4.210484 |
| 3 | Mrollando | -77.0368707 | 38.907192 |
| 4 | powerusergr | 21.8243120 | 39.074208 |
| 5 | reesebenyaacov | 34.8516120 | 31.046051 |
| 9 | kapilt41 | 72.8776559 | 19.075984 |
| 11 | sembronio | 9.1859243 | 45.465422 |
| 12 | AccessAiNews | -0.1277583 | 51.507351 |
| 13 | aiuramasakaz | 139.2199432 | 35.374736 |
| 14 | AleixAlmirall | 2.1734035 | 41.385064 |
| 17 | AnsisEgle | 24.1051864 | 56.949649 |
| 18 | baikap | 106.9057439 | 47.886399 |
| 19 | BhaktHercules | 4.6644779 | 50.867894 |
| 21 | centroempleotoc | -74.0300122 | 5.026003 |
| 25 | elviador | -74.0059413 | 40.712784 |
| 28 | gedwa75 | 37.5649507 | 54.163768 |
| 30 | iodyssee | 30.1350140 | -1.963042 |
| 31 | jogbosky | 7.3985740 | 9.076479 |
| 33 | KACHARAGADLA | 78.4866710 | 17.385044 |
| 35 | mgznrdr | 138.2529240 | 36.204824 |
| 37 | PartyAtHarambes | -81.5365094 | 41.393110 |
| 38 | rhiles2760 | -77.3971839 | 34.552666 |
| 39 | RosaMariaV777 | -66.5901490 | 18.220833 |
| 40 | sandeepdixit10 | 77.8498292 | 28.406963 |
| 41 | TheOfficialNews | -86.7816016 | 36.162664 |
| 43 | ypstomer | 78.9628800 | 20.593684 |
map <- leaflet() %>% setView(lng = 0, lat = 0, zoom = 1)
map %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(lng = Countries$lon, lat = Countries$lat)
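To see at a glance who is where, each marker can also carry a popup label. A small variation, assuming the .id column from the geocoding step (visible in the table above) holds the screen names:
## Show the screen name when a marker is clicked
map %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(lng = Countries$lon, lat = Countries$lat, popup = Countries$.id)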
I then used data from Google Trends to compare search interest with the data derived from tweets using #TimePOY. I used the search term “Times Person of the Year” and connected to this data with the gtrendsR library.
As the announcement approached and then was made, I wondered whether searches increased in step. To find out, I plotted the data over the month leading up to this week’s announcement. As shown below, searches rose sharply as the announcement approached and spiked when it was made.
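A minimal sketch of how the trend object below can be pulled, assuming a gtrendsR version that accepts a Google Trends time specification such as "today 1-m":
library(gtrendsR)
## Interest over the past 30 days
trend <- gtrends("Times Person of the Year", time = "today 1-m")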
plot(trend)
I also thought it would be interesting to see whether this spike of interest was an annual occurrence, in line with each year’s announcement. Due to this year’s controversial selection, I also wanted to see if there was an increase in 2016 compared to previous years. Using data from 2004 onward (the earliest available via gtrendsR), it is clear 2016 was a big year for Time’s Person of the Year award.
long_trend <- gtrends(c("Times Person of the Year"))
plot(long_trend)
Another interesting view of Time’s Person of the Year is how it is covered. As a prominent award with a controversial history (e.g., naming Adolf Hitler Person of the Year), I thought it would be interesting to see how the announcement was covered in the news.
library(rvest)
library(stringr)
library(reshape2)
## Scrape headline text and outlet names from the Google News results page
POY_url <- "https://news.google.com/news/story?ncl=dhb6eCtlVdAy2bMnmdMdK30rtTRsM&q=time+person+of+the+year&lr=English&hl=en&sa=X&ved=0ahUKEwjH5cCSluXQAhUGlpAKHSPYDtEQqgIIJjAA"
POY_history <- read_html(POY_url)
article_title <- POY_history %>% html_nodes(".titletext") %>% html_text()
source <- POY_history %>% html_nodes(".source-pref") %>% html_text()
head(article_title)
## [1] "Trump Criticizes Time's 'Person Of The Year' As 'Politically Correct'"
## [2] "Person of the Year"
## [3] "TIME's Person of the Year: Everything You Need to Know"
## [4] "Donald Trump says Time Person of the Year title should be man of the year"
## [5] "Trump has one problem with his Time 'Person of the Year' cover"
## [6] "It's Been 10 Years Since You Were Named TIME's Person of the Year"
head(source)
## [1] "Huffington Post" "TIME" "TIME"
## [4] "The Independent" "Business Insider" "TIME"
I felt a sentiment analysis would be a good way to scan the headlines about the announcement, and it would also be interesting to compare the headlines’ sentiment with that of the tweets.
library(tidytext)
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
df <- data.frame(article_title, stringsAsFactors = FALSE)
word_df <- df %>% unnest_tokens(word, article_title, token = "regex", pattern = reg)
## Reuse the nrc word-to-sentiment table built earlier
article_sentiments <- word_df %>% inner_join(nrc, by = "word")
news_summary <- article_sentiments %>%
  group_by(sentiment) %>%
  summarize(n = n()) %>%
  mutate(frequency = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(8)
I then plotted this data in the same way I plotted the sentiment analysis of the #TimePOY tweets.
ggplot(news_summary, aes(x = sentiment, y = frequency, fill = n)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Sentiment", y = "Frequency of Sentiment",
       title = "Overall Sentiments of Articles About Time's Person of the Year")
Overall, I found some interesting information about Time’s Person of the Year. I was surprised by how similar the sentiments of the tweets and the news articles were, as well as by how cyclical interest in the award is (as illustrated by the Google Trends data). I’ve really enjoyed digging around in the data that is available to everyone!