library(topicmodels)
library(tm)
library(wordcloud)
library(tidytext)
library(ggplot2)
library(dplyr)
library(wesanderson)
library(gridExtra)

Abstract

Donald Trump has amassed nearly 80 million followers on Twitter and is commonly known for his occasionally erratic behavior on the platform. We attempt to extract meaningful insight from the 45th president’s tweets by fitting a two-topic Latent Dirichlet Allocation model over four time periods. The model was able to distinguish polarized topics and to associate non-neutral terms with each of them. However, our Bayesian analysis is relatively shallow, and further data preprocessing could yield more insightful extractions from the tweets.

Introduction

Latent Dirichlet Allocation (LDA) is a method for topic modeling and natural language processing that is rooted in Bayesian statistics. The concept was first introduced to machine learning in 2003 (Blei et al. 2003) and continues to be widely used today. We discuss the theory behind the model and implement an example in R using various packages. Our practical example uses a dataset from Kaggle containing Donald Trump’s tweets. The goal is to extract different topics, as well as the words associated with those topics, from the corpus of tweets. Our main deliverable is a chart of topics and the words within those topics, similar to the visualization from the original paper.

Methods

LDA has various Bayesian applications, many of which have been discussed in class. The Dirichlet distribution is a conjugate prior for the multinomial distribution, which is used to model the topics and words. (The conjugacy between the beta and binomial distributions is analogous to that between the Dirichlet and multinomial, since the Dirichlet is the multivariate generalization of the beta distribution.) This hierarchical model is illustrated in the plate notation below. Alpha and beta are the hyperparameters of the per-document topic distribution and the per-topic word distribution, respectively [1].
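To make the conjugacy concrete: if the vector of topic counts in a document is multinomial given the topic proportions, then a Dirichlet prior on those proportions yields a posterior that is again Dirichlet, with the observed counts simply added to the prior parameter:

\[\theta \sim Dirichlet(\alpha), \quad n \mid \theta \sim Multinomial(N, \theta) \quad \Longrightarrow \quad \theta \mid n \sim Dirichlet(\alpha + n)\]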

Plate Notation

Word

\[w_{ij} \sim Multinomial(\phi_{z_{ij}})\]

Topic for each word

\[z_{ij} \sim Multinomial(\theta_i)\]

Per-Topic Word Distribution

\[\phi_k \sim Dirichlet(\beta)\]

Per-Document Topic Distribution

\[\theta_i \sim Dirichlet(\alpha)\]
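To make the generative story above concrete, the following toy simulation draws a single short “document” from this model in base R. It is a sketch only: the vocabulary, hyperparameter values, tweet length, and the helper rdirichlet() are invented for illustration and are not part of our analysis pipeline.

# Toy simulation of the LDA generative process above (illustrative only:
# vocabulary, hyperparameter values, and tweet length are made up)
set.seed(1)

rdirichlet = function(n, alpha) {
  # Dirichlet draws via normalized independent Gamma draws
  g = matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}

K     = 2                                        # number of topics
vocab = c("border", "jobs", "media", "poll", "wall", "economy")
alpha = rep(0.5, K)                              # per-document topic hyperparameter
beta  = rep(0.1, length(vocab))                  # per-topic word hyperparameter

phi   = rdirichlet(K, beta)    # phi_k: word distribution for each topic
theta = rdirichlet(1, alpha)   # theta_i: topic proportions for one document

# Generate one eight-word "tweet": a topic z for each word, then a word w
z = sample(1:K, size = 8, replace = TRUE, prob = theta)
w = sapply(z, function(k) sample(vocab, size = 1, prob = phi[k, ]))
w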

To fit the model, the LDA() function that we use from the topicmodels package offers two estimation methods: the default variational EM (VEM) and collapsed Gibbs sampling, an MCMC approach [2].
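For reference, a Gibbs-sampled fit could be requested roughly as follows. The variable dtm stands in for a document-term matrix like the ones built in the Modeling section, and the control values are placeholders rather than settings we actually used; our models below keep the VEM default.

# Hypothetical comparison of the two estimation options in topicmodels;
# `dtm` is a placeholder for a DocumentTermMatrix built as in the Modeling section
lda_vem   = LDA(dtm, k = 2, method = "VEM")
lda_gibbs = LDA(dtm, k = 2, method = "Gibbs",
                control = list(seed = 123, burnin = 1000, iter = 2000, thin = 100))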

Data

The dataset we are using is the set of all of Donald Trump’s tweets from the creation of his account through January 20th, 2020, about 40,000 tweets in total. Each tweet is a separate observation, marked by a timestamp and the number of retweets, likes, and mentions. We treat each tweet as a “document” and each word in the tweet as a “word.” We hope to use LDA to reveal common topics in Donald Trump’s online lexicon. Before performing LDA on the tweets, however, we had to pull out the tweet column from the dataset and create a corpus object using the VCorpus() function.

Our analysis is split into four time periods, since we want to understand how Trump’s vocabulary changes over time. We split our corpus into the following time frames:

  • Pre-Election Cycle (2009-2014)

  • Election Cycle (2015-2016)

  • Incumbent (2017-2019)

  • Coronavirus (2020)

Load Data

tweets = read.csv("/Users/gabrieltaylor/Downloads/realdonaldtrump.csv")
head(tweets$content)
## [1] "Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!"              
## [2] "Donald Trump will be appearing on The View tomorrow morning to discuss Celebrity Apprentice and his new book Think Like A Champion!"
## [3] "Donald Trump reads Top Ten Financial Tips on Late Show with David Letterman: http://tinyurl.com/ooafwn - Very funny!"               
## [4] "New Blog Post: Celebrity Apprentice Finale and Lessons Learned Along the Way: http://tinyurl.com/qlux5e"                            
## [5] "\"My persona will never be that of a wallflower - I’d rather build walls than cling to them\" --Donald J. Trump"                    
## [6] "Miss USA Tara Conner will not be fired - \"I've always been a believer in second chances.\" says Donald Trump"

Each row consists of a single tweet.

Keep Meaningful Tweets Only

tweets = tweets[which(nchar(tweets$content) > 30), ]

We decided to keep only tweets longer than 30 characters in an effort to limit the number of ‘noisy’ tweets. In other words, we do not want to extract words from short tweets that may not contain a useful message.

Quick EDA

A notable observation is that from 2013-2015 he tweeted more total characters per year than in any other years, yet had the lowest character count per tweet in those same years. Since those years preceded his election run, a possible explanation for this pattern is that he had a significant amount of engagement with his followers. For example, most replies to followers consist of only a few characters, such as “Thank you Sean Hannity!” or “I don’t play that much golf!”. By constantly replying to followers with low-character-count tweets, he drives his average character count per tweet down while still adding to his total character count for the year.
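The EDA figures behind this observation are not reproduced here, but a per-year summary along these lines can be computed directly. This is a sketch: it assumes the date column begins with a four-digit year, as in the raw Kaggle export.

# Characters tweeted per year: totals and per-tweet averages
# (assumes tweets$date starts with "YYYY-", as in the Kaggle file)
chars_by_year = tweets %>%
  mutate(year = substr(date, 1, 4),
         n_char = nchar(content)) %>%
  group_by(year) %>%
  summarise(total_chars = sum(n_char),
            chars_per_tweet = mean(n_char),
            n_tweets = n())

chars_by_year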

Split Data Into Relevant Time Periods

# Dates are stored as "YYYY-MM-DD ..." strings, so comparing them against a
# year (coerced to character) splits the tweets on the year boundary.
pre_incumbent = tweets %>%
  filter(date < 2015)

election_cycle = tweets %>%
  filter(date > 2014, date < 2017)

incumbent = tweets %>%
  filter(date > 2016, date < 2020)

coronavirus = tweets %>%
  filter(date > 2019)

Stop Words

stop = c("veri", "realdonaldtrump", "whi", "can", "get", "mani", "onli", "will", "trump", "just", "even", "donald", "peopl", "militari", "crazi", "complet", "approv", "parti", "realli", "presid", "countri", "radic", "noth", "berni", "feder", "senat", "hous", "becaus", "ani", "alway", "confer", "twitter", "status", "com", "www", "http", "https", "pic", "bit", "thank", "great", "(cont)", "make", "many", "want", "going", "keep", "now", "new", "many", "via", "mini", "people", "thanks", "one", "like", "see", "run", "nothing", "must", "years", "president", "don", "never", "think", "show", "need", "united", "time", "much", "back", "today", "need", "day", "way", "done", "donaldtrump", "donaldjtrump", "course", "watch", "tonight", "democrat", "said", "american", "made", "know", "really", "mike", "ever", "year", "2015", "2016", "also", "working", "states", "country", "makeamericagreatagain", "maga", "work", "total", "far", "look", "join", "trump2016", "give", "night", "right", "last", "state", "north", "first", "needs", "job")

We compiled a list of stop words that we felt would not give helpful insight into Trump’s tweeting behavior. Most stop words are ones with neutral connotations, such as “going” or “will”. In addition to neutral words, we removed stems truncated by the cleaning step, such as “complet” or “approv”. We also decided to remove “make”, “great”, and “again”: because of the slogan “Make America Great Again” these are among the most frequent words, but we felt they were hindering the model’s ability to discern between two topics. Lastly, we removed variants of the same word; for example, we kept only “democrat” out of “democrats” and “democrat”.

Modeling

After creating the corpus, the following pre-processing steps were applied using functions from the tm package:

  • Change all alphabetic text to lowercase

  • Stem the document (reduce each word to its root, e.g. walking becomes walk)

  • Remove numbers

  • Remove stop words

  • Remove punctuation

  • Remove white space

  • Remove tweets with low character count (under 15) to fix blank space errors in LDA

For this project, we kept the corpus to single words (unigrams) and did not explore phrases or n-grams.

For each time period, we then ran the LDA() function with 2 topics and extracted the top 10 words that represent each topic. Those words were then displayed in a bar graph, sorted by beta value.

Create Corpuses

text_corpus_pi = VCorpus(VectorSource(pre_incumbent$content))
text_corpus_ec = VCorpus(VectorSource(election_cycle$content))
text_corpus_i = VCorpus(VectorSource(incumbent$content))
text_corpus_c = VCorpus(VectorSource(coronavirus$content))

Clean Corpuses

corpuses = list(text_corpus_pi, 
                text_corpus_ec, 
                text_corpus_i, 
                text_corpus_c)

# Transformations applied to every corpus, in order
cleaners = list(removePunctuation,
                removeNumbers,
                stemDocument,
                stripWhitespace)
cleaned_corpuses = list()

for(i in 1:length(corpuses)){
  # Lowercase first, then drop our custom stop words and the default English ones
  cleaned_corpus = tm_map(corpuses[[i]], content_transformer(tolower))
  cleaned_corpus = tm_map(cleaned_corpus, removeWords, stop)
  cleaned_corpus = tm_map(cleaned_corpus, removeWords, stopwords())
  
  # Apply the remaining transformations in sequence, each building on the last
  for(j in 1:length(cleaners)){
    cleaned_corpus = tm_map(cleaned_corpus, cleaners[[j]])
  }
  cleaned_corpuses[[i]] = cleaned_corpus
}

Construct Document Term Matrices

text_dtms = list()

for(i in 1:length(cleaned_corpuses))
  text_dtms[[i]] = DocumentTermMatrix(cleaned_corpuses[[i]])
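As noted in the pre-processing list above, documents that end up empty after cleaning cause LDA() to fail. One way to guard against this, sketched below and not part of our original pipeline, is to drop all-zero rows from each document-term matrix using slam, which is installed as a dependency of tm.

# Drop documents with no remaining terms so LDA() does not fail on empty rows
for(i in 1:length(text_dtms)){
  text_dtms[[i]] = text_dtms[[i]][slam::row_sums(text_dtms[[i]]) > 0, ]
}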

Build LDA Models

text_ldas = list()

for(i in 1:length(text_dtms)){
  text_ldas[[i]] = LDA(text_dtms[[i]], k = 2, method = "VEM", control = NULL)
}

Write Text Topics

text_topics = list()

for(i in 1:length(text_ldas)){
  text_topics[[i]] = tidy(text_ldas[[i]], matrix = "beta")
}
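The bar graphs described in the Modeling section can be produced from these tidied beta matrices roughly as follows, shown here for the first time period; reorder_within() and scale_y_reordered() from tidytext keep each facet sorted by beta. This is one possible plotting recipe, not necessarily the exact code behind our figures.

# Top 10 terms per topic for one time period, shown as bars sorted by beta
top_terms = text_topics[[1]] %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  mutate(term = reorder_within(term, beta, topic))

ggplot(top_terms, aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  scale_y_reordered() +
  labs(x = "Beta", y = NULL)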

Results

Due to the unsupervised nature of LDA, there is not an obvious performance metric for evaluating the topic model. The results therefore have a more open-ended interpretation [3].

Word Clouds

The following graphs are wordclouds that display the most commonly used words that appear in Donald Trump’s tweets for each time period.
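The word cloud figures themselves are not embedded in this text. One way to generate a comparable cloud from a cleaned document-term matrix is sketched below; the frequency cutoff and word limit are arbitrary choices rather than the settings used for the original figures.

# Word cloud for the pre-election period, built from overall term frequencies
term_freqs = slam::col_sums(text_dtms[[1]])
wordcloud(words = names(term_freqs), freq = term_freqs,
          min.freq = 25, max.words = 75, random.order = FALSE)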

Two Topic Model Results

The following graphs illustrate the words that are most likely to appear in Donald Trump’s tweets for each topic within each respective time period.

Discussion

Our motivation for choosing a two-topic model was to investigate polarizing terms, such as “republican” and “democrat”. For example, if “republican” and “democrat” showed up in opposite topics, then the algorithm was able to separate more extreme viewpoints. We were also interested in observing other terms that would be associated with the respective polarized terms; for example, we might observe “bad” associated with “democrat” while “good” is associated with “republican”.

Before Trump’s election cycle, in the “Pre-Incumbency” time period, the model was able to discern tweets related to Trump’s opponents: “obama”, “barackobama”, “mittromney”, and “china” are all associated with the same topic. Counterintuitively, however, we also observe words with positive connotations, such as “love” and “good”, in that same topic.

During the election cycle, Hillary Clinton is the most prominent term, as she appears in both topics. Her prevalence in Trump’s tweets matches historical events, since she was his main opposition during the election season. In one topic “hillary” appears with the nickname Trump gave her, “crooked”, and in the other topic she shows up with her formal name, “clinton”. Again, we counterintuitively observe that “crooked hillary” is associated with “good” and “true”. We also observe that the model grouped “foxnews” and “cnn” with “debate” and “poll”, which is coherent because both news outlets cover the debates and polls surrounding the election.

Since taking office, Trump has had a very hostile and combative relationship with the press. He is famous for popularizing the term “Fake News” in an effort to discredit any outlet that takes a negative view of his decisions or behavior. Due to the recent impeachment trials, Trump’s reputation is constantly being attacked in left-leaning news outlets. Our model was able to magnify this relationship since the onset of the coronavirus. We observe “fake”, “news”, “democrats”, and “impeachment” as the four terms most likely to be included in a tweet for one topic. This observation is significant because we can infer that Trump is leveraging his “fake news” tactic to discredit claims that Democrats make regarding the impeachment. In addition to the “fake news” attack directed at Democrats and the impeachment, “coronavirus” and “hoax” fall in the same topic. While Trump is not quoted explicitly stating that the coronavirus is a hoax [4], he has said that the coronavirus is the Democrats’ ‘new hoax’, which again falls under “fake news”. Another notable result is that the other topic contains “republican” alongside “good” and “foxnews”, which is exactly what we hoped to find: the model was able to separate two polarized topics associated with non-neutral words.

In general, our analysis is a very basic application of LDA intended to demonstrate its practical connection to Bayesian statistics. With more extensive text-mining preprocessing, we could see more polarizing words separated by topic.

References

tm R package

Rpubs Code Source


  1. LDA Wikipedia

  2. topicmodels R package

  3. Original Paper

  4. Did President Trump Refer to the Coronavirus as a ‘Hoax’?