I chose to do my final project on Time’s Person of the Year award. The winner had not yet been announced when I began and was revealed during the course of my work, which made this an interesting, ever-changing project. I chose to use data from a few different sources, including Google News, Google Trends, and Twitter.
To begin working with the Twitter data, I pulled in the tweets, formatted them as a data frame, and then tokenized the tweet text with a regular expression, which also removes the hashtags from the text.
library(twitteR)
library(dplyr)
library(stringr)
library(tidytext)

## Pull the most recent tweets tagged #TIMEPOY
num_tweets <- 1000
tweets <- searchTwitter('#TIMEPOY', n = num_tweets)
POY_df <- twListToDF(tweets)
head(POY_df)

## Tokenize the tweet text, keeping @mentions and in-word apostrophes intact
reg <- "([^A-Za-z\\d@']|'(?![A-Za-z\\d#@]))"
POY_words <- POY_df %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&amp;", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
As I did in my Assignment 4 report, I used a word cloud to get an initial visualization of the words in my tweets. It is a quick way to get a sense of the most commonly used words. As I didn’t want obvious words such as “time,” “person,” or “rt” to dominate, I excluded them (and a few others) from my word cloud.
##Create WordCloud for the Text in #TIMEPOY Tweets
library(wordcloud)
POY_words %>%
  filter(!word %in% c("rt", "person", "timepoy", "@time", "https", "time")) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  with(wordcloud(word, n, max.words = 50, scale = c(5, .5), min.freq = 5,
                 random.order = FALSE, rot.per = .15, colors = brewer.pal(8, "Set1")))
Next, I wanted to see the most common words in a table rather than a word cloud. I again excluded the common basic words that were not of interest.
library(knitr)
kable(POY_words %>%
        filter(!word %in% c("rt", "person", "timepoy", "@time", "https", "time")) %>%
        count(word) %>%
        mutate(frequency = n / sum(n)) %>%
        arrange(desc(n)) %>%
        top_n(15))
| word | n | frequency |
|---|---|---|
| trump | 407 | 0.0869101 |
| donald | 389 | 0.0830664 |
| time’s | 246 | 0.0525304 |
| @realdonaldtrump | 119 | 0.0254111 |
| cover | 106 | 0.0226351 |
| president | 102 | 0.0217809 |
| meet | 94 | 0.0200726 |
| chosen | 84 | 0.0179372 |
| read | 68 | 0.0145206 |
| history | 67 | 0.0143071 |
| america | 66 | 0.0140935 |
| helped | 66 | 0.0140935 |
| interview | 66 | 0.0140935 |
| voters | 65 | 0.0138800 |
| house | 64 | 0.0136665 |
| white | 64 | 0.0136665 |
I used the sentiment analysis we learned in previous units to illustrate the most common sentiments among the words in tweets containing #TimePOY. Generally the feelings were positive, but there are limitations to relying on individual words for sentiment analysis, which I illustrate after the table below.
##Join Words from #TIMEPOY Tweets to Sentiments
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  select(word, sentiment)
head(nrc)
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
POY_sentiments <- POY_words %>% inner_join(nrc, by = "word")
POY_sentiments %>%
  group_by(sentiment) %>%
  summarize(n = n()) %>%
  mutate(frequency = n / sum(n)) %>%
  arrange(desc(n))
## # A tibble: 10 × 3
## sentiment n frequency
## <chr> <int> <dbl>
## 1 trust 523 0.250839329
## 2 positive 486 0.233093525
## 3 surprise 437 0.209592326
## 4 anticipation 321 0.153956835
## 5 joy 131 0.062829736
## 6 negative 78 0.037410072
## 7 sadness 37 0.017745803
## 8 anger 31 0.014868106
## 9 fear 26 0.012470024
## 10 disgust 15 0.007194245
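One limitation worth illustrating: because the text is split into single words before joining to the lexicon, negation is lost. A minimal sketch (the exact tags depend on the lexicon version):
## "good" still matches positive entries even though the phrase is negated;
## the negating "not" typically has no lexicon entry, so the join drops it
data.frame(text = "not good", stringsAsFactors = FALSE) %>%
  unnest_tokens(word, text) %>%
  inner_join(nrc, by = "word")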
I wanted to represent the data visually, making it easy to consume.
##Summarize Data for Bar Chart
summary <- POY_sentiments %>%
  group_by(sentiment) %>%
  summarize(n = n()) %>%
  mutate(frequency = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(8)
##Plot the Sentiment Frequencies as a Bar Chart
library(ggplot2)
ggplot(summary, aes(x = sentiment, y = frequency, fill = n)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Sentiment", y = "Frequency of Sentiment",
       title = "Overall Sentiments of Tweets Using #TimePOY")
Another aspect of the tweets I wanted to use was location (more on my interest below). I began by pulling location data for users with public profiles who were among the top 100 users by number of tweets.
##Group Tweets by Users
POY_users <- POY_df %>%
  group_by(screenName) %>%
  summarize(n = n()) %>%
  mutate(percentage = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(100)
top_users <- POY_users %>%
  filter(n > 1)
head(top_users)
## # A tibble: 6 × 3
## screenName n percentage
## <chr> <int> <dbl>
## 1 Real_Infinity95 8 0.008
## 2 AziziOthmanMY 7 0.007
## 3 Mrollando 7 0.007
## 4 powerusergr 7 0.007
## 5 reesebenyaacov 7 0.007
## 6 TIME 7 0.007
## Look up a user's profile location via the Twitter API
getLocation <- function(x) {
  y <- getUser(x)
  location <- y$location
  return(location)
}
top_users$screenName
## [1] "Real_Infinity95" "AziziOthmanMY" "Mrollando"
## [4] "powerusergr" "reesebenyaacov" "TIME"
## [7] "Willam_grey" "AlfidioValera" "kapilt41"
## [10] "realjoet" "sembronio" "AccessAiNews"
## [13] "aiuramasakaz" "AleixAlmirall" "Veep_MikePence"
## [16] "AakashGauttam" "AnsisEgle" "baikap"
## [19] "BhaktHercules" "BubashLance" "centroempleotoc"
## [22] "ColorMeRed" "desh_bhkt" "digitaljotter"
## [25] "elviador" "Firozkh15116079" "Flaumer"
## [28] "gedwa75" "HazemFKandil" "iodyssee"
## [31] "jogbosky" "JorgeRi37481209" "KACHARAGADLA"
## [34] "KofoAregbesola" "mgznrdr" "mnsdall"
## [37] "PartyAtHarambes" "rhiles2760" "RosaMariaV777"
## [40] "sandeepdixit10" "TheOfficialNews" "tphallett"
## [43] "ypstomer"
user_location <- sapply(top_users$screenName, getLocation)
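One caveat: getUser() throws an error for suspended or protected accounts, which would abort the whole sapply() call above. A minimal defensive sketch (the safeGetLocation wrapper is my own addition, not part of twitteR):
## Return NA instead of failing when a profile can't be retrieved
safeGetLocation <- function(x) {
  tryCatch(getLocation(x), error = function(e) NA_character_)
}
user_location <- sapply(top_users$screenName, safeGetLocation)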
I took the locations of the users gathered above and linked them to latitude and longitude using geocode. Initially I was interested in whether areas a potential winner was from were more common than areas without anyone on the “short list.” However, as I continued running the script after the winner was announced, I instead looked for any trend in locations, and in particular whether most users were in the United States (as the winner was Donald Trump).
library(leaflet)
library(ggmap)
##Pull the Latitude and Longitude for the Locations
Countries <- geocode(user_location, output = "latlon", source = "google")
##Remove NAs and Show Results
Countries <- na.omit(Countries)
kable(Countries)
| | .id | lon | lat |
|---|---|---|---|
| 1 | Real_Infinity95 | -99.9018131 | 31.968599 |
| 2 | AziziOthmanMY | 101.9757660 | 4.210484 |
| 3 | Mrollando | -77.0368707 | 38.907192 |
| 4 | powerusergr | 21.8243120 | 39.074208 |
| 5 | reesebenyaacov | 34.8516120 | 31.046051 |
| 9 | kapilt41 | 72.8776559 | 19.075984 |
| 11 | sembronio | 9.1859243 | 45.465422 |
| 12 | AccessAiNews | -0.1277583 | 51.507351 |
| 13 | aiuramasakaz | 139.2199432 | 35.374736 |
| 14 | AleixAlmirall | 2.1734035 | 41.385064 |
| 17 | AnsisEgle | 24.1051864 | 56.949649 |
| 18 | baikap | 106.9057439 | 47.886399 |
| 19 | BhaktHercules | 4.6644779 | 50.867894 |
| 21 | centroempleotoc | -74.0300122 | 5.026003 |
| 25 | elviador | -74.0059413 | 40.712784 |
| 28 | gedwa75 | 37.5649507 | 54.163768 |
| 30 | iodyssee | 30.1350140 | -1.963042 |
| 31 | jogbosky | 7.3985740 | 9.076479 |
| 33 | KACHARAGADLA | 78.4866710 | 17.385044 |
| 35 | mgznrdr | 138.2529240 | 36.204824 |
| 37 | PartyAtHarambes | -81.5365094 | 41.393110 |
| 38 | rhiles2760 | -77.3971839 | 34.552666 |
| 39 | RosaMariaV777 | -66.5901490 | 18.220833 |
| 40 | sandeepdixit10 | 77.8498292 | 28.406963 |
| 41 | TheOfficialNews | -86.7816016 | 36.162664 |
| 43 | ypstomer | 78.9628800 | 20.593684 |
map <- leaflet() %>% setView(lng = 0, lat = 0, zoom = 1)
map %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(lng = Countries$lon, lat = Countries$lat)
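To see at a glance who is where, each marker can also carry a popup label. A small variation, assuming the .id column from the geocoding step (visible in the table above) holds the screen names:
## Show the screen name when a marker is clicked
map %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(lng = Countries$lon, lat = Countries$lat, popup = Countries$.id)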
I then used data from Google Trends to compare search interest with the data derived from tweets using #TimePOY. I used the search term “Times Person of the Year” and connected to this data with the gtrendsR library.
As the announcement approached and then was made, I wondered whether searches increased in step. To find out, I plotted the data over the month leading up to this week’s announcement. As shown below, searches rose sharply as the announcement approached and spiked when it was made.
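A minimal sketch of how the trend object below can be pulled, assuming a gtrendsR version that accepts a Google Trends time specification such as "today 1-m":
library(gtrendsR)
## Interest over the past 30 days
trend <- gtrends("Times Person of the Year", time = "today 1-m")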
plot(trend)
I also thought it would be interesting to see whether this spike of interest was an annual occurrence, in line with each year’s announcement. Due to this year’s controversial selection, I also wanted to see if there was an increase in 2016 compared to previous years. Using data from 2004 onward (the earliest available via gtrendsR), it is clear 2016 was a big year for Time’s Person of the Year award.
long_trend <- gtrends(c("Times Person of the Year"))
plot(long_trend)
Another interesting view of Time’s Person of the Year is how it is covered. As a prominent award with a controversial history (e.g., naming Adolf Hitler Person of the Year), I thought it would be interesting to see how the announcement was covered in the news.
library(rvest)
library(stringr)
library(reshape2)
## Scrape headline text and outlet names from the Google News results page
POY_url <- "https://news.google.com/news/story?ncl=dhb6eCtlVdAy2bMnmdMdK30rtTRsM&q=time+person+of+the+year&lr=English&hl=en&sa=X&ved=0ahUKEwjH5cCSluXQAhUGlpAKHSPYDtEQqgIIJjAA"
POY_history <- read_html(POY_url)
article_title <- POY_history %>% html_nodes(".titletext") %>% html_text()
source <- POY_history %>% html_nodes(".source-pref") %>% html_text()
head(article_title)
## [1] "Trump Criticizes Time's 'Person Of The Year' As 'Politically Correct'"
## [2] "Person of the Year"
## [3] "TIME's Person of the Year: Everything You Need to Know"
## [4] "Donald Trump says Time Person of the Year title should be man of the year"
## [5] "Trump has one problem with his Time 'Person of the Year' cover"
## [6] "It's Been 10 Years Since You Were Named TIME's Person of the Year"
head(source)
## [1] "Huffington Post" "TIME" "TIME"
## [4] "The Independent" "Business Insider" "TIME"
I felt a sentiment analysis would be a good way to scan the headlines about the announcement, and it would also be interesting to compare the headlines’ sentiment with that of the tweets.
library(tidytext)
reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
df <- data.frame(article_title, stringsAsFactors = FALSE)
word_df <- df %>% unnest_tokens(word, article_title, token = "regex", pattern = reg)
## Reuse the nrc word-to-sentiment table built earlier
article_sentiments <- word_df %>% inner_join(nrc, by = "word")
news_summary <- article_sentiments %>%
  group_by(sentiment) %>%
  summarize(n = n()) %>%
  mutate(frequency = n / sum(n)) %>%
  arrange(desc(n)) %>%
  top_n(8)
I then plotted this data in the same way I plotted the sentiment analysis of the #TimePOY tweets.
ggplot(news_summary, aes(x = sentiment, y = frequency, fill = n)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Sentiment", y = "Frequency of Sentiment",
       title = "Overall Sentiments of Articles About Time's Person of the Year")
Overall, I found some interesting information about Time’s Person of the Year. I was surprised by how similar the sentiments of the tweets and the news articles were, as well as by how cyclical interest in the award is (as illustrated by the Google Trends data). I’ve really enjoyed digging around in the data that is available to everyone!