Final Project: Midterm Election Results - Sentiment Analysis Using Twitter’s API

B. Sosnovski, E. Azrilyan and R. Mercier

11/23/2018

Project Description:

Use Twitter API for sentiment analysis on the 2018 midterm election results for the 11th Congressional District of NY, which includes all of Staten Island and parts of Southern Brooklyn.

Members: B. Sosnovski, E. Azrilyan, and R. Mercier.

Motivation: Sentiment analysis plays an essential role during elections. Political strategists can use the public’s opinions, emotions, and attitudes to shape their strategies and convert them into votes in 2020.

Can we gauge the public’s sentiment related to the results of the midterm election?

Data: To conduct our analysis, we will harvest data using one of Twitter’s APIs. Data will be restricted to a specific date range for NY’s 11th Congressional District election race.

Work Flow:

  1. Acquire data.
  2. Fetch, clean, transform and tokenize the data.
  3. Perform feature selection to keep only the meaningful tweets for the analysis.
  4. Analysis: classify results as positive or negative.

Our collection of texts from tweets can be divided into natural groups so we can understand them separately. Topic modeling is a method for the unsupervised classification of such documents, finding natural groups of items.

We will fit a probabilistic topic model using Latent Dirichlet Allocation (LDA). LDA can be used to group Twitter posts or comments into topics whose terms can then be examined for positive and negative sentiment.

We also use the Term Frequency-Inverse Document Frequency (TF-IDF) statistic in our analysis, which identifies words that are important in a text (i.e., frequent within it) but not common across all texts.

Tools:

  • Twitter Premium Search API - 30-day endpoint (Sandbox), which provides tweets from the previous 30 days.
  • R Packages

Caution:

This document contains some explicit language. Due to time constraints, we could not work on removing the explicit language in the tweets. We apologize in advance if this offends someone.

Twitter Premium API

First, we needed to obtain Twitter Premium API access. The following steps were taken to set up a Twitter account and be able to use the Twitter API.

  1. Created a Twitter account.
  2. Logged in with the Twitter credentials on https://dev.twitter.com/apps and applied for a developer account.
  3. After receiving approval from Twitter, applied to create a new app, filled out the form, and agreed to the terms.
  4. Created the Keys and Access Tokens.

Twitter Dashboard for Developers

Load Libraries

library(httr)
library(base64enc)
library(jsonlite)
library(stringr)
suppressMessages(library(tidyverse))
library(tidytext)
library(knitr)
library(XML)
suppressMessages(library(RCurl))
library(methods)
suppressMessages(library(tm))
suppressMessages(library(wordcloud))
library(topicmodels)

API Credentials

Read the key, key secret, access token, and access token secret from a text file to keep this information confidential.

api <- read.table("Twitter_API_Key.txt", header = TRUE, stringsAsFactors = FALSE)
names(api)
dim(api)
App_Name <- api$app_name
Consumer_Key <- api$key
Consumer_Secret <- api$secret_Key
Access_Token <- api$access_token
Access_Secret <- api$access_token_secret
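For reference, the credentials file is assumed to be a whitespace-separated table with a header row matching the column names used above. The layout below is a hypothetical example with placeholder values, not real credentials:

app_name      key           secret_Key    access_token          access_token_secret
MyElectionApp xxxxxxxxxxxx  xxxxxxxxxxxx  000000-xxxxxxxxxxxx   xxxxxxxxxxxx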

API Authentication

We faced a challenge in this part of the project because much of the documentation available on accessing the Twitter APIs using R is about accessing Twitter’s Basic Search API, not the Premium Search API. The Premium API was launched in Nov 2017 and is relatively new to the community. The basic Twitter API only gives access to the previous 7 days of tweets. To conduct our analysis, we needed access to tweets posted earlier than the last 7 days.

The following chunk of code was retrieved from https://twittercommunity.com/t/how-to-use-premium-api-for-the-first-time-beginner/105346/10. This was the only mention we could find about accessing the Premium API.

# base64 encoding
kands <- paste(Consumer_Key, Consumer_Secret, sep=":")
base64kands <- base64encode(charToRaw(kands))
base64kandsb <- paste("Basic", base64kands, sep=" ")

# request bearer token
resToken <- POST(url = "https://api.twitter.com/oauth2/token",
                 add_headers("Authorization" = base64kandsb, "Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8"),
                 body = "grant_type=client_credentials")

# get bearer token
bearer <- content(resToken)
bearerToken <- bearer[["access_token"]]
bearerTokenb <- paste("Bearer", bearerToken, sep=" ")
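Before requesting data, it can help to confirm that the token request actually succeeded (a minimal optional check, not part of the original workflow):

# Stop early if the bearer token request did not return HTTP 200
if (status_code(resToken) != 200) {
        stop("Bearer token request failed with status ", status_code(resToken))
}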

Data Acquisition

Since the Twitter Premium API - Sandbox (free version) limits access to the tweets posted for the last 30 days, it is vital to save the search results into CSV files. This way, we can access the results afterward, even when the data is no longer available via the API.

Converting the data received from the API to a data frame may produce columns whose entries are lists. In that case, the R functions “write.csv” and “write_csv” return an error. The function below identifies which columns of the data frame contain lists so they can be dropped. The information removed in the process is not essential for our project.

# Function to identify which columns are lists
list_col <- function(df){
        n <- length(df)
        vec <- vector('numeric')
        for (i in 1:n){
                cl <- df[,i]
                if(class(cl)=="list"){
                        vec <- c(vec,i)
                }
        }
        return(vec)
}
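As a quick illustration (a toy example, not the project data), the function returns the index of any list-column:

# Toy data frame where column 2 is a list-column
toy <- data.frame(a = 1:2)
toy$b <- list("x", c("y", "z"))
list_col(toy)   # returns 2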

Requests for data will likely generate more data than can be returned in a single response (our limit is up to 100 tweets per response). When results are paginated, the API provides a “next” token in the response body indicating that more pages are available; this token is then included in the subsequent request. The following function automates the API requests, with or without pagination.

The code below requests tweets posted from 2018/11/05 to 2018/11/19 that include any of the following search terms:

  • “#maxrose”

  • “#dandonovan”

  • @RepDanDonovan

  • @MaxRose4NY

# the query includes terms "#maxrose", "#dandonovan", "@RepDanDonovan", "@MaxRose4NY" 
# date range: from 2018/11/05 to 2018/11/19

sbody = "{\"query\": \"#maxrose OR #dandonovan OR @RepDanDonovan OR @MaxRose4NY\",\"maxResults\": 100, \"fromDate\":\"201811050000\", \"toDate\":\"201811190000"
ebody = "\"}"

request <- function(start_body,end_body){
         full_body <- str_c(start_body, end_body, sep = "")
         nxt <-""
         pageNo <- 1
         
         while(!is.null(nxt)){
                resTweets <- POST(url = "https://api.twitter.com/1.1/tweets/search/30day/dev.json",
                  add_headers("authorization" = bearerTokenb, "content-Type" = "application/json"),
                  body = full_body)
                
                #checking if the type of response is JSON
                # if (http_type(resTweets) != "application/json") {
                #         stop("API did not return json", call. = FALSE)}
                
                #checking if the request was successful
                # if (http_error(resTweets)) {
                #         stop(sprintf("Twitter API request failed! Status = %s. Check what went wrong.\n", 
                #                      status_code(resTweets)),
                #              call. = FALSE)}else{
                #                      message("Retrieving page ",pageNo)}
                
                # Parse the data
                tweets_df <- fromJSON(content(resTweets, "text"),flatten = TRUE) %>% data.frame()
        
                # Saving data only from the tweets' texts in separate files
                text_df <- tweets_df$results.text
                file1 <- str_c("text",pageNo,".csv")
                write.csv(text_df, file1, row.names=FALSE)
        
                # Remove the list-columns (if any)
                vec <- list_col(tweets_df)
                if (length(vec) > 0) tweets_df <- tweets_df[,-vec]

                # Saving the whole data
                file2 <- str_c("tweet",pageNo,".csv")
                write.csv(tweets_df, file2, row.names=FALSE)
        
                # Read the "next" token received in the response
                # (the column is absent on the last page, so nxt becomes NULL and the loop ends)
                nxt <- tweets_df$next.[1]
        
                if(!is.null(nxt)){
                        # insert the next token in the body of the request
                        full_body <- str_c(start_body, "\", \"next\":\"", nxt, end_body, sep = "")
                        pageNo <- pageNo+1}
                
                # To avoid exceeding the API's limit per minute
                Sys.sleep(3)
        } #end of while loop
         
}#end of function

request(start_body = sbody,end_body = ebody)

The screenshot below shows the Twitter API in action; the following message appears for every page retrieved:

The total number of files obtained from the Twitter API is 177.

For our project, we will use files 67 through 177, which correspond to tweets posted from Nov 5 through part of Nov 10.

Tweets Preprocessing

The CSV files containing the Twitter data were uploaded to GitHub.

The following function creates a vector of all the URLs to be accessed to retrieve the data.

start_url <- "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet"
end_url <- ".csv"

# the selected files to be used in this project
vec <- seq(67,177)

# function
pages <- function(vec){
        n <- length(vec)
        urls <- vector('character')
        for (i in 1:n){
                temp <- str_c(start_url,vec[i],end_url, collapse = "")
                urls <- c(urls, temp)
        }
        return(urls)
}

urls <-pages(vec)
head(urls)
## [1] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet67.csv"
## [2] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet68.csv"
## [3] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet69.csv"
## [4] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet70.csv"
## [5] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet71.csv"
## [6] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet72.csv"
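As a side note, str_c() is vectorized over its arguments, so the same vector of URLs could be built without a loop (an equivalent alternative, shown only for comparison):

# One-line equivalent: str_c() recycles start_url and end_url across vec
urls_alt <- str_c(start_url, vec, end_url)
identical(urls_alt, urls)   # expected to be TRUE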

We use the URLs to open connections to the files, read them into data frames, select the columns of interest, and bind them together.

n <-length(urls)
Stream <-data.frame()

for (i in 1:n){
        csvfile <- url(urls[i])
        df <- read.csv(csvfile,header = TRUE, fileEncoding = "ASCII", stringsAsFactors = FALSE)
        df <- df %>% select(results.created_at,results.text,results.user.name,results.user.location)
        Stream <- rbind(Stream,df)
}

str(Stream)
## 'data.frame':    11072 obs. of  4 variables:
##  $ results.created_at   : chr  "Sat Nov 10 01:25:50 +0000 2018" "Sat Nov 10 01:25:31 +0000 2018" "Sat Nov 10 01:24:59 +0000 2018" "Sat Nov 10 01:24:31 +0000 2018" ...
##  $ results.text         : chr  "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ ...
##  $ results.user.name    : chr  "Cate   Resist👏🏾Every👏🏿Damned👏🏼Day👏🏽" "Pennell Somsen" "Barbara Ward #FBR 🌊" "Randy #RESIST" ...
##  $ results.user.location: chr  NA "Mérida, Yucatán & Harlem, New York" "New Hampshire, USA" NA ...

For our analysis, we are interested in the tweets of people other than the candidates themselves, so we exclude tweets posted under the candidates’ usernames.

Stream <- Stream %>% filter(!results.user.name %in% c("Max Rose","Dan Donovan"))

We also convert the dates to a date-time format to make analysis easier.

# Change the format of the dates
Stream$results.created_at <- as.POSIXct(Stream$results.created_at, format = "%a %b %d %H:%M:%S +0000 %Y")
str(Stream)
## 'data.frame':    11037 obs. of  4 variables:
##  $ results.created_at   : POSIXct, format: "2018-11-10 01:25:50" "2018-11-10 01:25:31" ...
##  $ results.text         : chr  "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ ...
##  $ results.user.name    : chr  "Cate   Resist👏🏾Every👏🏿Damned👏🏼Day👏🏽" "Pennell Somsen" "Barbara Ward #FBR 🌊" "Randy #RESIST" ...
##  $ results.user.location: chr  NA "Mérida, Yucatán & Harlem, New York" "New Hampshire, USA" NA ...

Tweets Cleaning

The following steps clean the text of the tweets:

  • Remove retweets (keep only original posts)
  • Remove non-ASCII symbols (e.g., emojis) and Twitter user handles (@user)
  • Remove punctuation, digits, and special characters
  • Remove extra white space and stop words
  • Remove hashtag symbols, URLs, Twitter shorthand, etc.
  • Convert the corpus to lowercase

# Filtering off the retweets from the data (keep only original posts)
Stream <- Stream %>% 
  filter(!str_detect(results.text, "^RT"))

Mycorpus <- Corpus(VectorSource(Stream$results.text))

#Various cleansing functions:
# Non-ASCII symbols (emojis, etc.)
remove_ASCIIs <- function(x) gsub("[^\x01-\x7F]", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_ASCIIs)))

# User handles (@user)
remove_ATs <- function(x) gsub("@\\w+", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_ATs)))

# Punctuation
remove_Puncts <- function(x) gsub("[[:punct:]]", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_Puncts)))

#All Digits
remove_Digits <- function(x) gsub("[[:digit:]]", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_Digits)))

# URLs and leftover whitespace, in 3 steps: strip links, collapse runs of spaces/tabs, trim leading/trailing whitespace
# (Note: collapsing whitespace runs to "" rather than " " concatenates adjacent words,
#  which explains merged terms such as "congressmana" seen later in the corpus.)
remove_HTTPSs <- function(x) gsub("http\\w+", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_HTTPSs)))
remove_HTTPSs2 <- function(x) gsub("[ \t]{2,}", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_HTTPSs2)))
remove_HTTPSs3 <- function(x) gsub("^\\s+|\\s+$", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_HTTPSs3)))

# Whitespace (runs of spaces/tabs; same pattern as step 2 above)
remove_WhiteSpace <- function(x) gsub("[ \t]{2,}", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_WhiteSpace)))

#Lower Case
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(tolower)))

# "im" (left over from "I'm" after punctuation removal)
# Note: this pattern also strips "im" inside longer words; "\\bim\\b" would be a safer pattern
remove_IMs <- function(x) gsub("im", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_IMs)))

# Stop words
Mycorpus <- suppressWarnings(tm_map(Mycorpus, removeWords, stopwords("english")))

# View the corpus
inspect(Mycorpus[1:10])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 10
## 
##  [1]                                                                                 
##  [2]                                                                                 
##  [3] shootings like  one  thousand oaks   often followed   pattern  inaction  january
##  [4]                                                                                 
##  [5]  already know  reason   protest george soros wants  bring social                
##  [6] celebratings victory nyyour passion  grassroots change  refreshing welcome      
##  [7] great  see  soon   congressmanbrave  nasty weather   greeting  new constituents 
##  [8]   honored thatis  district congressmana true gentleman   laugh  awe like  fool  
##  [9] yep   costs money  park   lot spent  ton  money visiti                          
## [10]  happy  can callmy new congressman

Let’s look at a word cloud of the terms in the corpus.

#setting the same seed each time ensures a consistent look across clouds
set.seed(7)
suppressWarnings(wordcloud(Mycorpus, random.order=F, scale=c(3, 0.5), min.freq = 5, col=rainbow(50)))

Because we need to tokenize the text for the analysis, we replace the tweet texts in the original data frame with the cleaned text from the corpus.

# Original data frame
head(Stream$results.text, n=10)
##  [1] "@crhousel @MaxRose4NY 😀"                                                                                                                    
##  [2] "@Prometheus_2018 @MaxRose4NY :) 👋🏽"                                                                                                        
##  [3] ".@MaxRose4NY, shootings like the one in Thousand Oaks are too often followed by a pattern of inaction. In January,… https://t.co/2IWYBjWaGZ" 
##  [4] "@crhousel @MaxRose4NY ✊🏽"                                                                                                                  
##  [5] "@rldaug @BernardKerik @RepDanDonovan we already know the reason for the protest. George Soros wants to bring social… https://t.co/40WPd3PlDq"
##  [6] "🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!! Welcome to the… https://t.co/0eeEdAhnwt"  
##  [7] "Great to see my soon to be congressman @MaxRose4NY brave the nasty weather and out greeting his new constituents at… https://t.co/dLggg3T55B"
##  [8] "I am honored that @MaxRose4NY is my district congressman,  a true gentleman. I would laugh in awe like a fool if I… https://t.co/zYR0AWfuI8" 
##  [9] "@phildemeo @NYCMayor @NYGovCuomo @MaxRose4NY Yep AND it costs money to park in the lot. Spent a ton of money visiti… https://t.co/Gh0IZ1jf7I"
## [10] "@KatieVasquezTV @MaxRose4NY So happy I can call @MaxRose4NY my new congressman"
# Clean corpus
df <- data.frame(text = get("content", Mycorpus))
head(df, n=10)
##                                                                                text
## 1                                                                                  
## 2                                                                                  
## 3  shootings like  one  thousand oaks   often followed   pattern  inaction  january
## 4                                                                                  
## 5                   already know  reason   protest george soros wants  bring social
## 6      celebratings victory nyyour passion  grassroots change  refreshing welcome  
## 7  great  see  soon   congressmanbrave  nasty weather   greeting  new constituents 
## 8    honored thatis  district congressmana true gentleman   laugh  awe like  fool  
## 9                            yep   costs money  park   lot spent  ton  money visiti
## 10                                                happy  can callmy new congressman
Stream$results.text <- as.character(df$text)

# Remove the rows that contain empty strings in the text column after the cleanup
Stream <- Stream %>% filter(results.text !="")

# Add row numbers and move to the front of the data frame
Stream <- Stream %>% mutate(id = row_number()) %>% select(id, everything())

head(Stream, n=10)
##    id  results.created_at
## 1   1 2018-11-10 01:19:14
## 2   2 2018-11-10 01:10:24
## 3   3 2018-11-10 01:08:59
## 4   4 2018-11-10 01:06:50
## 5   5 2018-11-10 00:57:02
## 6   6 2018-11-09 23:57:43
## 7   7 2018-11-09 23:51:45
## 8   8 2018-11-09 23:51:36
## 9   9 2018-11-09 23:48:42
## 10 10 2018-11-09 23:45:15
##                                                                        results.text
## 1  shootings like  one  thousand oaks   often followed   pattern  inaction  january
## 2                   already know  reason   protest george soros wants  bring social
## 3      celebratings victory nyyour passion  grassroots change  refreshing welcome  
## 4  great  see  soon   congressmanbrave  nasty weather   greeting  new constituents 
## 5    honored thatis  district congressmana true gentleman   laugh  awe like  fool  
## 6                            yep   costs money  park   lot spent  ton  money visiti
## 7                                                 happy  can callmy new congressman
## 8                wasnt   diverse electorate  s brooklynstaten island  pushedover  e
## 9                                                                         excellent
## 10                                                                thats lovely  see
##        results.user.name          results.user.location
## 1         Kristen Caruso                           <NA>
## 2  ❌ 🗳🇺🇸 2020🗽🗳 ⭐⭐⭐                           <NA>
## 3    ⚡️StarfireResists⚡️               Geeks Resist, HQ
## 4       Timothy O'Reilly                   Brooklyn, NY
## 5             Dina Cameo                  New York City
## 6                   Alex                           <NA>
## 7                     TG The Divided States of America 
## 8          D. Changstein                           <NA>
## 9     Covfefe Deplorable                           <NA>
## 10                Eileen                           <NA>
# Tokenize the new clean text from the data frame
Streamnew <- Stream %>%  
  unnest_tokens(word, results.text)

Looking back at the word cloud, it seems that the stop words from the tm package didn’t filter out all undesired words from the tweets, so we continue cleaning using the stop words from the tidytext package.

# Remove stop words and other words
data(stop_words)
Streamnew <- Streamnew %>% anti_join(stop_words)
## Joining, by = "word"

Sentiment Analysis

Here we get the counts of the most frequent words found in our data.

Streamnew %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Count",
       title = "Count of unique words found in tweets")
## Selecting by n

We will use the “bing” sentiment data, which classifies words as positive or negative. We are joining the list of words extracted from the tweets with this sentiment data.
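For context, the “bing” lexicon provided by tidytext is simply a two-column table of English words labeled positive or negative; a quick peek at it (optional, not part of the original analysis):

# Peek at the lexicon and count how many words carry each label
get_sentiments("bing") %>% head()
get_sentiments("bing") %>% count(sentiment)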

# join sentiment classification to the tweet words
bing_word_counts <- Streamnew %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"

The code below creates a plot of positive and negative words.

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "Midterm Election Sentiment.",
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

According to the sentiment lexicon “bing,” most words tend to be positive for the election results.

Let’s look at how the sentiment changed over time by examining the data in specific time periods.

The code below adds a Timing column to classify tweets as “Early,” “Med,” or “Late”:

  • Early: November 5th and 6th
  • Med: November 7th
  • Late: November 8th and after

Streamnew$Timing <- ifelse(Streamnew$results.created_at <= '2018-11-07', 'Early',
                  ifelse(Streamnew$results.created_at >= '2018-11-07' & Streamnew$results.created_at <= '2018-11-08', 'Med',
                         ifelse(Streamnew$results.created_at >= '2018-11-08', 'Late', 'other')
                  ))
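The same classification could also be written with dplyr::case_when(), which evaluates the conditions in order and avoids the nested ifelse() calls (an equivalent sketch with the same cut-offs, shown for comparison):

# Equivalent timing classification using case_when()
Streamnew <- Streamnew %>%
  mutate(Timing = case_when(
    results.created_at <= as.POSIXct("2018-11-07") ~ "Early",
    results.created_at <= as.POSIXct("2018-11-08") ~ "Med",
    TRUE ~ "Late"
  ))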

The code below joins our data frame with sentiment data and plots positive and negative words in the “Early, Med, and Late” timing categories.

# join sentiment classification to the tweet words
Elec_sentiment_2018 <- Streamnew %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, Timing, sort = TRUE) %>%
  group_by(sentiment) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  group_by(Timing, sentiment) %>%
  top_n(n = 5, wt = n) %>%
  arrange(Timing, sentiment, n)
## Joining, by = "word"
Elec_sentiment_2018 %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Timing, scales = "free_y", ncol = 2) +
  labs(title = "Sentiment during the 2018 Midterm Election - NY 11th Cong. District.",
       y = "Number of Times Word Appeared in Tweets",
       x = NULL) +
  coord_flip()

Topic Modeling with LDA

dtm <- DocumentTermMatrix(Mycorpus)
dtm
## <<DocumentTermMatrix (documents: 4111, terms: 6281)>>
## Non-/sparse entries: 22854/25798337
## Sparsity           : 100%
## Maximal term length: 39
## Weighting          : term frequency (tf)

The document-term matrix contains rows without entries (documents left with no words after cleaning), which causes errors in the LDA function.

To deal with this issue, we compute the sum of words by row and subset the dtm matrix by rows with sum >0.

rowTotals <- apply(dtm , 1, sum)
dtm.new   <- dtm[rowTotals> 0, ]    
dtm.new
## <<DocumentTermMatrix (documents: 3831, terms: 6281)>>
## Non-/sparse entries: 22854/24039657
## Sparsity           : 100%
## Maximal term length: 39
## Weighting          : term frequency (tf)
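As a quick check (not in the original analysis), the number of documents dropped because they became empty after cleaning can be counted directly:

# Documents with no remaining terms (dropped before fitting the LDA model)
sum(rowTotals == 0)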
# Set a seed so that the output of the model is predictable
# Model finds 4 topics
lda <- LDA(dtm.new, k = 4, control = list(seed = 1234))
lda
## A LDA_VEM topic model with 4 topics.
term <- terms(lda, 10) # first 10 terms of every topic
term
##       Topic 1           Topic 2           Topic 3           Topic 4   
##  [1,] "vote"            "congratulations" "staten"          "island"  
##  [2,] "island"          "just"            "congrats"        "staten"  
##  [3,] "congratulations" "island"          "max"             "vote"    
##  [4,] "happy"           "will"            "island"          "maxrose" 
##  [5,] "new"             "congress"        "just"            "brooklyn"
##  [6,] "voted"           "amp"             "proud"           "max"     
##  [7,] "proud"           "won"             "like"            "new"     
##  [8,] "good"            "voted"           "congratulations" "rose"    
##  [9,] "win"             "now"             "blue"            "voting"  
## [10,] "district"        "great"           "brooklyn"        "proud"
topics <- tidy(lda, matrix = "beta")
topics
## # A tibble: 25,124 × 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 followed 0.000152   
##  2     2 followed 0.0000559  
##  3     3 followed 0.000216   
##  4     4 followed 0.0000935  
##  5     1 inaction 0.0000886  
##  6     2 inaction 0.0000285  
##  7     3 inaction 0.000131   
##  8     4 inaction 0.0000968  
##  9     1 january  0.000299   
## 10     2 january  0.000000454
## # … with 25,114 more rows

The table above shows the per-topic probability (beta) that each term is generated from that topic. For example, the word “followed” has probabilities 0.000152, 0.0000559, 0.000216, and 0.0000935 of being generated from topics 1, 2, 3, and 4, respectively.
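Since beta is a per-topic probability distribution over the vocabulary, the probabilities within each topic should sum to 1; a quick sanity check (not part of the original analysis):

# Each topic's term probabilities should sum to 1
topics %>%
  group_by(topic) %>%
  summarize(total_beta = sum(beta))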

The following is a visualization of the top 10 terms in each topic.

top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

The four topics appear to be similar. That is not surprising given that there is a limited number of topics one would expect to see discussed in comments relating to the election results.

Some of the top frequent positive words from the sentiment analysis above, such as “congratulations” and “proud,” appear in most topics. But none of the top negative words appear in the topics’ lists.

Modeling with TF-IDF

The Term Frequency-Inverse Document Frequency (TF-IDF) measures how important a word is to a document in a corpus of documents, as in our case, to one tweet in a collection of tweets.

TF-IDF, in general, finds:

  • Words that are very common in a specific document (tweet) are probably important to the topic of that document.

  • Words that are very common in all documents probably aren’t important to the topics of any of them.

So a term will receive a high weight if it’s common in a specific document and uncommon across all documents.
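Concretely, for a term t in document d, tf(t, d) is the term’s share of the words in d, idf(t) = log(number of documents / number of documents containing t), and tf-idf is their product, which is what tidytext’s bind_tf_idf() computes below. A toy illustration of the formula (made-up counts, not the project data):

# Two toy "documents" with word counts; "island" appears in both, so its idf (and tf-idf) is 0
toy <- tibble::tibble(doc  = c(1, 1, 2, 2),
                      word = c("island", "victory", "island", "rain"),
                      n    = c(2, 1, 1, 3))
toy %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                  # term frequency within each document
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(2 / n_distinct(doc))) %>%   # 2 documents in total
  ungroup() %>%
  mutate(tf_idf = tf * idf)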

# get the count of each word in each tweet
mywords <- Stream %>%  
        unnest_tokens(word, results.text) %>% 
        anti_join(stop_words) %>% 
        count(id,word, sort = TRUE) %>% 
        ungroup()
## Joining, by = "word"
# get the number of words per tweet
total_words <- mywords %>% 
      group_by(id) %>% 
      summarize(total = sum(n))
        
# combine the two data frames we just made
mywords <- left_join(mywords, total_words)
## Joining, by = "id"
head(mywords, 20)
##      id     word  n total
## 1  1602       ha 24    27
## 2  3615     vote  9    15
## 3  3584      amp  7     7
## 4   980     dont  4     9
## 5  2384      max  4    17
## 6  2520      omg  4     4
## 7  3716       ny  4     6
## 8    42     vote  3     7
## 9   749   carpet  3    10
## 10 1520     york  3     6
## 11 1934      omg  3     3
## 12 2123      max  3     6
## 13 2123     rose  3     6
## 14 2421 district  3     6
## 15 2559      omg  3     4
## 16 2725      max  3     4
## 17 2769      omg  3     3
## 18 3538       ha  3     3
## 19 3607     vote  3     8
## 20    6    money  2     9
# get the tf_idf & order the words by the degree of relevance
tf_idf1 <- mywords %>%
      bind_tf_idf(word, id, n) %>%
      select(-total) %>%
      arrange(desc(tf_idf)) %>%
      mutate(word = factor(word, levels = rev(unique(word))))

tf_idf2 <- mywords %>%
      bind_tf_idf(word, id, n) %>%
      select(-total) %>%
      arrange(tf_idf) %>%
      mutate(word = factor(word, levels = rev(unique(word))))

head(tf_idf1,20)
##     id                    word n tf      idf   tf_idf
## 1   57                graffiti 1  1 8.235361 8.235361
## 2  106                  referr 1  1 8.235361 8.235361
## 3  147             dayayewhere 1  1 8.235361 8.235361
## 4  209  awesomecongratulations 1  1 8.235361 8.235361
## 5  210               yuprocked 1  1 8.235361 8.235361
## 6  246                  didwow 1  1 8.235361 8.235361
## 7  275                  ohhhhh 1  1 8.235361 8.235361
## 8  280            fantasticall 1  1 8.235361 8.235361
## 9  282                 goodjob 1  1 8.235361 8.235361
## 10 289 congratulationslastword 1  1 8.235361 8.235361
## 11 312               absoutely 1  1 8.235361 8.235361
## 12 314                    fake 1  1 8.235361 8.235361
## 13 387                    refe 1  1 8.235361 8.235361
## 14 462                    pkij 1  1 8.235361 8.235361
## 15 481                  intend 1  1 8.235361 8.235361
## 16 631                     pos 1  1 8.235361 8.235361
## 17 635                     uhh 1  1 8.235361 8.235361
## 18 652                recogniz 1  1 8.235361 8.235361
## 19 672                centtttt 1  1 8.235361 8.235361
## 20 814                     twe 1  1 8.235361 8.235361
head(tf_idf2,20)
##      id   word n         tf      idf    tf_idf
## 1  2384 staten 1 0.05882353 2.226547 0.1309734
## 2   502 staten 1 0.07142857 2.226547 0.1590391
## 3   620 staten 1 0.07142857 2.226547 0.1590391
## 4  1418 staten 1 0.07142857 2.226547 0.1590391
## 5   502 island 1 0.07142857 2.321858 0.1658470
## 6   620 island 1 0.07142857 2.321858 0.1658470
## 7  3615     ny 1 0.06666667 2.558607 0.1705738
## 8  1420 staten 1 0.07692308 2.226547 0.1712729
## 9  1954 staten 1 0.07692308 2.226547 0.1712729
## 10 3193 staten 1 0.07692308 2.226547 0.1712729
## 11 3536 staten 1 0.07692308 2.226547 0.1712729
## 12 1954 island 1 0.07692308 2.321858 0.1786044
## 13 3193 island 1 0.07692308 2.321858 0.1786044
## 14 3536 island 1 0.07692308 2.321858 0.1786044
## 15 1390     ny 1 0.07142857 2.558607 0.1827576
## 16  308 staten 1 0.08333333 2.226547 0.1855456
## 17  553 staten 1 0.08333333 2.226547 0.1855456
## 18 1113 staten 1 0.08333333 2.226547 0.1855456
## 19 1199 staten 1 0.08333333 2.226547 0.1855456
## 20 1907 staten 1 0.08333333 2.226547 0.1855456

A TF-IDF close to zero indicates that the word is ubiquitous across the tweets; thus the terms “staten,” “island,” and “ny” do little to distinguish individual tweets. On the other hand, the words with high TF-IDF are important to their individual tweets. As one can see, many of them have high TF-IDF due to typos and words concatenated together during cleaning. Nevertheless, most of them can be read as favorable terms.

Conclusions

The project aims to determine if people’s feelings and opinions about the election result for the 11th Congressional District of NY are generally positive or negative.

According to the sentiment lexicon “bing,” most of the words overall tend to be positive for the election results.

The LDA analysis also seems to confirm this, since the topics found by the model feature frequent positive words such as “congratulations” and “proud,” while the top negative words do not appear among the topics modeled from the tweets.

Finally, the TF-IDF analysis shows that the important words for the corpus are generally positive, even though some are formed by concatenations.

Overall, the sentiment expressed in the tweets is favorable.

Reference