I chose Time’s Person of the Year award as the subject of my final project. The announcement was still upcoming when I began and was made while I was working, so the data kept changing throughout the project. I drew on several sources: Twitter, Google Trends, and Google News.

Twitter Data and Tweets Using #TimePOY

To begin working with the Twitter data, I pulled in the tweets, converted them to a data frame, and then cleaned and tokenized each tweet’s text with regular expressions. The tokenizing pattern also strips the # symbol from hashtags.

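library(twitteR)   # searchTwitter(), twListToDF()
library(dplyr)     # %>%, filter(), mutate()
library(stringr)   # str_detect(), str_replace_all()
library(tidytext)  # unnest_tokens(), stop_words
library(knitr)     # kable()

## Pull tweets using the hashtag (assumes Twitter OAuth is already set up)
num_tweets <- 1000
tweets <- searchTwitter('#TIMEPOY', n = num_tweets)
POY_df <- twListToDF(tweets)
head(POY_df)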

reg <- "([^A-Za-z\\d@']|'(?![A-Za-z\\d#@]))"
POY_words <- POY_df %>% 
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
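
To make the tokenizing pattern concrete, here is a minimal sketch in base R that applies the same URL cleanup and the same reg pattern to one made-up tweet. It only approximates what unnest_tokens does: the sample text and the use of strsplit are assumptions for illustration, not part of the pipeline above.

```r
# Illustrative only: a made-up tweet, not one from the collected data
sample_tweet <- "RT @TIME: Donald Trump is TIME's Person of the Year #TIMEPOY https://t.co/abc123"

# Same patterns as in the pipeline above
url_pattern <- "https://t.co/[A-Za-z\\d]+|&amp;"
reg <- "([^A-Za-z\\d@']|'(?![A-Za-z\\d#@]))"

# Remove links and HTML-escaped ampersands, then lowercase
cleaned <- tolower(gsub(url_pattern, "", sample_tweet, perl = TRUE))

# Split wherever the pattern matches, then drop the empty strings left
# between consecutive separators
tokens <- strsplit(cleaned, reg, perl = TRUE)[[1]]
tokens <- tokens[tokens != ""]
print(tokens)
```

In this sketch the mention keeps its @ sign while the hashtag loses its #, because @ is listed among the characters allowed inside a token and # is not.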

As in my Assignment 4 report, I used a word cloud for a first look at the words in the tweets, since it gives a quick sense of which words appear most often. I excluded obvious words such as time, person, and rt (and a few others) from the cloud.

##Create WordCloud for the Text in #TIMEPOY Tweets
library(wordcloud)
## Words to leave out of the cloud (tokens are already lowercase)
excluded_words <- c("rt", "person", "timepoy", "@time", "https", "time")
POY_words %>%
  filter(!word %in% excluded_words) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  with(wordcloud(word, n, max.words = 50, scale = c(5, .5), min.freq = 5,
                 random.order = FALSE, rot.per = .15,
                 colors = brewer.pal(8, "Set1")))

Next, I wanted to see the most common words in a table rather than a word cloud. I again excluded the common words that were not of interest.

kable(POY_words %>%
        filter(!word %in% c("rt", "person", "timepoy", "@time", "https", "time")) %>%
        count(word) %>%
        mutate(frequency = n / sum(n)) %>%
        arrange(desc(n)) %>%
        top_n(15, n))
word                n   frequency
trump             407   0.0869101
donald            389   0.0830664
time’s            246   0.0525304
@realdonaldtrump  119   0.0254111
cover             106   0.0226351
president         102   0.0217809
meet               94   0.0200726
chosen             84   0.0179372
read               68   0.0145206
history            67   0.0143071
america            66   0.0140935
helped             66   0.0140935
interview          66   0.0140935
voters             65   0.0138800
house              64   0.0136665
white              64   0.0136665
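
Note that frequency is each word's share of all words left after filtering, not its share of the rows shown. As a quick check under that reading, dividing a count by its frequency should recover the same total for every row. The sketch below does this for the first two rows of the table (the variable names are mine).

```r
# Counts and frequencies copied from the first two rows of the table above
n <- c(trump = 407, donald = 389)
frequency <- c(trump = 0.0869101, donald = 0.0830664)

# frequency = n / sum(n), so n / frequency should give the total word count
implied_total <- round(n / frequency)
print(implied_total)
```

Both rows imply a total of about 4,683 retained words, which is consistent with the frequencies having been computed before the table was cut to its top rows.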

Sentiment Analysis of Tweets

I used the sentiment analysis techniques from previous units to find the most common sentiments among the words in tweets containing #TimePOY. The sentiments were generally positive, but matching individual words against a lexicon has limitations: it ignores context such as negation or sarcasm.

##Join Words from #TIMEPOY Tweets to Sentiments
## Note: newer tidytext releases may require get_sentiments("nrc") instead of this filter
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)
head(nrc)
## # A tibble: 6 × 2
##        word sentiment
##       <chr>     <chr>
## 1    abacus     trust
## 2   abandon      fear
## 3   abandon  negative
## 4   abandon   sadness
## 5 abandoned     anger
## 6 abandoned      fear
POY_sentiments <- POY_words %>% inner_join(nrc, by = "word")

POY_sentiments %>% group_by(sentiment) %>% summarize(n = n()) %>% mutate(frequency = n/ sum(n) ) %>% arrange(desc(n))
## # A tibble: 10 × 3
##       sentiment     n   frequency
##           <chr> <int>       <dbl>
## 1         trust   523 0.250839329
## 2      positive   486 0.233093525
## 3      surprise   437 0.209592326
## 4  anticipation   321 0.153956835
## 5           joy   131 0.062829736
## 6      negative    78 0.037410072
## 7       sadness    37 0.017745803
## 8         anger    31 0.014868106
## 9          fear    26 0.012470024
## 10      disgust    15 0.007194245
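
One thing to keep in mind when reading these counts: the NRC lexicon can tag one word with several sentiments, so the join can return more rows than there are words. The sketch below shows this with base R's merge and only the six lexicon rows printed above. The three input words are made up for illustration.

```r
# The six NRC rows shown in the head(nrc) output above
nrc_sample <- data.frame(
  word = c("abacus", "abandon", "abandon", "abandon", "abandoned", "abandoned"),
  sentiment = c("trust", "fear", "negative", "sadness", "anger", "fear"),
  stringsAsFactors = FALSE
)

# Hypothetical tokens: two appear in the sample lexicon, one does not
words <- data.frame(word = c("abandon", "abacus", "cover"),
                    stringsAsFactors = FALSE)

# merge() with the default all = FALSE behaves like an inner join on "word"
joined <- merge(words, nrc_sample, by = "word")
print(joined)
print(table(joined$sentiment))
```

Here three input words produce four joined rows: one word contributes three sentiments, one contributes a single sentiment, and the unmatched word drops out entirely.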

To make these results easier to digest, I plotted the eight most frequent sentiments as a bar chart.

##Summarize Data for Bar Chart (keep the 8 most frequent sentiments)
library(ggplot2)
tweet_summary <- POY_sentiments %>%
  count(sentiment) %>%
  mutate(frequency = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(8, n)

##Plot the Summarized Data
ggplot(tweet_summary, aes(x = sentiment, y = frequency, fill = n)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Sentiment", y = "Frequency of Sentiment",
       title = "Overall Sentiments of Tweets Using #TimePOY")
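
Because sentiment is a character column, ggplot2 places the bars in alphabetical order rather than by frequency. If descending order is preferred, one option is to turn the column into a factor whose levels follow the counts. A minimal sketch, using the eight largest counts from the output above (the data frame name is hypothetical):

```r
# Top eight sentiments and counts, copied from the summary output above
top_sentiments <- data.frame(
  sentiment = c("trust", "positive", "surprise", "anticipation",
                "joy", "negative", "sadness", "anger"),
  n = c(523, 486, 437, 321, 131, 78, 37, 31),
  stringsAsFactors = FALSE
)

# Order factor levels by descending count; ggplot2 draws bars in level order
ordered_levels <- top_sentiments$sentiment[order(-top_sentiments$n)]
top_sentiments$sentiment <- factor(top_sentiments$sentiment,
                                   levels = ordered_levels)

print(levels(top_sentiments$sentiment))
```

Passing a data frame prepared this way to the same ggplot call would draw the bars from most to least frequent.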

Linking the Tweets to Locations Around the World

Another aspect of the tweets I wanted to use was location (more on why below). I began by ranking users by number of tweets, keeping the top 100, and pulling the profile location for those with public profiles who tweeted more than once.

#Group Tweets by Users
library(twitteR)
POY_users <- POY_df %>%
  count(screenName) %>%
  mutate(percentage = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(100, n)

## Keep only users who tweeted more than once
top_users <- POY_users %>%
  filter(n > 1)
head(top_users)
## # A tibble: 6 × 3
##        screenName     n percentage
##             <chr> <int>      <dbl>
## 1 Real_Infinity95     8      0.008
## 2   AziziOthmanMY     7      0.007
## 3       Mrollando     7      0.007
## 4     powerusergr     7      0.007
## 5  reesebenyaacov     7      0.007
## 6            TIME     7      0.007
getLocation <- function(x) {
    y <- getUser(x)
    location <- y$location
    return(location)
}
top_users$screenName
##  [1] "Real_Infinity95" "AziziOthmanMY"   "Mrollando"      
##  [4] "powerusergr"     "reesebenyaacov"  "TIME"           
##  [7] "Willam_grey"     "AlfidioValera"   "kapilt41"       
## [10] "realjoet"        "sembronio"       "AccessAiNews"   
## [13] "aiuramasakaz"    "AleixAlmirall"   "Veep_MikePence" 
## [16] "AakashGauttam"   "AnsisEgle"       "baikap"         
## [19] "BhaktHercules"   "BubashLance"     "centroempleotoc"
## [22] "ColorMeRed"      "desh_bhkt"       "digitaljotter"  
## [25] "elviador"        "Firozkh15116079" "Flaumer"        
## [28] "gedwa75"         "HazemFKandil"    "iodyssee"       
## [31] "jogbosky"        "JorgeRi37481209" "KACHARAGADLA"   
## [34] "KofoAregbesola"  "mgznrdr"         "mnsdall"        
## [37] "PartyAtHarambes" "rhiles2760"      "RosaMariaV777"  
## [40] "sandeepdixit10"  "TheOfficialNews" "tphallett"      
## [43] "ypstomer"
user_location <- sapply(top_users$screenName, function(x) getLocation(x))
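
A loop like this stops partway if a single lookup raises an error, since sapply does not catch errors. One way to guard against that is to wrap the call in tryCatch and return NA for failures. The sketch below uses a stand-in lookup function, fake_get_user, which is purely hypothetical and only mimics the shape of the object used above (a list with a location field). With real data, getUser would be passed in its place.

```r
# Hypothetical stand-in for the real lookup: returns a list with a
# location field, or raises an error for an unknown account
fake_get_user <- function(screen_name) {
  known <- list(user_a = "Springfield", user_b = "")
  if (!screen_name %in% names(known)) stop("user not found: ", screen_name)
  list(location = known[[screen_name]])
}

# Return the profile location, or NA if the lookup fails
safe_location <- function(screen_name, lookup) {
  tryCatch(lookup(screen_name)$location,
           error = function(e) NA_character_)
}

screen_names <- c("user_a", "user_b", "missing_user")
locations <- vapply(screen_names, safe_location, character(1),
                    lookup = fake_get_user)
print(locations)
```

Failed lookups come back as NA and blank profile locations as empty strings, so both can be filtered out before geocoding.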

I linked the user locations gathered above to latitude and longitude using geocode. Initially, I wanted to see whether areas that a potential winner was from were more common than areas without anyone on the shortlist. However, as I kept running the script after the winner was announced, I instead looked for any trend in locations, and in particular whether most users were in the United States (as the winner was Donald Trump).

library(leaflet)
library(ggmap)

##Pull the Latitude and Longitude for the Locations
Countries <- geocode(user_location, output="latlon", source = "google")
##Remove NAs and Show Results
Countries <- na.omit(Countries)
kable(Countries)
    .id                       lon         lat
 1  Real_Infinity95   -99.9018131   31.968599
 2  AziziOthmanMY     101.9757660    4.210484
 3  Mrollando         -77.0368707   38.907192
 4  powerusergr        21.8243120   39.074208
 5  reesebenyaacov     34.8516120   31.046051
 9  kapilt41           72.8776559   19.075984
11  sembronio           9.1859243   45.465422
12  AccessAiNews       -0.1277583   51.507351
13  aiuramasakaz      139.2199432   35.374736
14  AleixAlmirall       2.1734035   41.385064
17  AnsisEgle          24.1051864   56.949649
18  baikap            106.9057439   47.886399
19  BhaktHercules       4.6644779   50.867894
21  centroempleotoc   -74.0300122    5.026003
25  elviador          -74.0059413   40.712784
28  gedwa75            37.5649507   54.163768
30  iodyssee           30.1350140   -1.963042
31  jogbosky            7.3985740    9.076479
33  KACHARAGADLA       78.4866710   17.385044
35  mgznrdr           138.2529240   36.204824
37  PartyAtHarambes   -81.5365094   41.393110
38  rhiles2760        -77.3971839   34.552666
39  RosaMariaV777     -66.5901490   18.220833
40  sandeepdixit10     77.8498292   28.406963
41  TheOfficialNews   -86.7816016   36.162664
43  ypstomer           78.9628800   20.593684
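## Center the map on (0, 0); setView() expects numeric coordinates
map <- leaflet() %>% setView(lng = 0, lat = 0, zoom = 1)
## The piped map is already the first argument, so it is not passed again
map %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(lng = Countries$lon, lat = Countries$lat)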

Google Trend Data on Time’s Person of the Year Selection

I then used data from Google Trends to compare search interest with the data derived from tweets using #TimePOY. I used the search term “Times Person of the Year” and retrieved the data with the gtrendsR library.

As the announcement approached and then was made, I wondered whether searches increased. To check, I plotted the data for the month leading up to this week’s announcement. As shown below, searches rose sharply as the announcement approached and spiked when it was made.

plot(trend)

I also wanted to see whether this spike of interest was an annual occurrence, in line with each year’s announcement. Given this year’s controversial selection, I also wanted to see whether interest in 2016 was higher than in previous years. Using data from 2004 onward (the longest range available via gtrendsR), it is clear that 2016 was a big year for Time’s Person of the Year award.

long_trend <- gtrends(c("Times Person of the Year"))

plot(long_trend)

Linking to Recent News Articles about Time’s POY

Another interesting view of Time’s Person of the Year is how it is covered. Because it is a prominent award with a controversial history (e.g., naming Adolf Hitler Person of the Year), I wanted to see how it was covered in the news.

library(rvest)
library(stringr)
library(reshape2)

POY_url <- "https://news.google.com/news/story?ncl=dhb6eCtlVdAy2bMnmdMdK30rtTRsM&q=time+person+of+the+year&lr=English&hl=en&sa=X&ved=0ahUKEwjH5cCSluXQAhUGlpAKHSPYDtEQqgIIJjAA"
POY_history <- read_html(POY_url)

article_title <- POY_history %>% html_nodes(".titletext") %>% html_text()
source <- POY_history %>% html_nodes(".source-pref") %>% html_text()

head(article_title)
## [1] "Trump Criticizes Time's 'Person Of The Year' As 'Politically Correct'"    
## [2] "Person of the Year"                                                       
## [3] "TIME's Person of the Year: Everything You Need to Know"                   
## [4] "Donald Trump says Time Person of the Year title should be man of the year"
## [5] "Trump has one problem with his Time 'Person of the Year' cover"           
## [6] "It's Been 10 Years Since You Were Named TIME's Person of the Year"
head(source)
## [1] "Huffington Post"  "TIME"             "TIME"            
## [4] "The Independent"  "Business Insider" "TIME"

Sentiment Analysis of News Coverage

I felt a sentiment analysis would be a good way to scan the headlines about the announcement, and that it would be interesting to compare the result with the sentiment of the tweets.

library(tidytext)
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
df <- data.frame(article_title, stringsAsFactors = FALSE)  # keep titles as character for unnest_tokens()
word_df <- df %>% unnest_tokens(word, article_title, token = "regex", pattern = reg)

nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)
head(nrc)
## # A tibble: 6 × 2
##        word sentiment
##       <chr>     <chr>
## 1    abacus     trust
## 2   abandon      fear
## 3   abandon  negative
## 4   abandon   sadness
## 5 abandoned     anger
## 6 abandoned      fear
article_sentiments <- word_df %>% inner_join(nrc, by = "word")

news_summary <- article_sentiments %>%
  count(sentiment) %>%
  mutate(frequency = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(8, n)

I then plotted this data in the same way as the sentiment analysis of the tweets using #TimePOY.

ggplot(news_summary, aes(x = sentiment, y = frequency, fill = n)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Sentiment", y = "Frequency of Sentiment",
       title = "Overall Sentiments of Articles About Time's Person of the Year")

Conclusion

Overall, I found some interesting information about Time’s Person of the Year. I was surprised by how similar the sentiments in the tweets and the news articles were, as well as by how cyclical interest in the award was (as illustrated by the Google Trends data). I’ve really enjoyed digging around in data that is available to everyone!