Final Project: Midterm Election Results - Sentiment Analysis Using Twitter’s API

B. Sosnovski, E. Azrilyan and R. Mercier

11/23/2018

Project Description:

Using the Twitter API to perform sentiment analysis on the 2018 midterm election results for the 11th Congressional District, which includes all of Staten Island and parts of Southern Brooklyn.

Members: B. Sosnovski, E. Azrilyan and R. Mercier.

Motivation: Sentiment analysis plays an important role during elections. Political strategists can use the public’s opinions, emotions, and attitudes to convert those into votes in 2020.

Can we gauge the public’s sentiment related to the results of the midterm election?

Data: To conduct our analysis, we will harvest data using one of Twitter's APIs. The data will be restricted to a specific date range and to the election race for NY's 11th Congressional District.

Work Flow:

  1. Acquire data.
  2. Fetch, clean, transform and tokenize the data.
  3. Perform feature selection to keep only the tweets that are meaningful for the analysis.
  4. Analysis: classify results as positive or negative.

Our collection of tweet texts can be divided into natural groups so that we can understand them separately. Topic modeling is a method for the unsupervised classification of such documents, one that finds these natural groups of items.

We will fit a probabilistic topic model using Latent Dirichlet Allocation (LDA). LDA can be used to extract the general tendencies in Twitter posts or comments as topics, which can then be read in terms of positive or negative sentiment.

We also use the statistic Term Frequency-Inverse Document Frequency (TF-IDF) in our analysis, which attempts to find the words that are important (i.e., common) in a text, but not too common.

Caution:

This document contains some explicit language. We were unable to remove the explicit language from the tweets due to time constraints. We apologize in advance if this offends anyone.

Tools:

  • Twitter Premium Search API - 30-day endpoint (Sandbox), which provides tweets from the previous 30 days.
  • R Packages

Twitter Premium API

First, we needed to obtain Twitter Premium API access. The following steps were taken to set up a Twitter account and gain access to the Twitter API.

  1. Created a Twitter account.
  2. Logged in with the Twitter credentials on https://dev.twitter.com/apps and applied for a developer account.
  3. After receiving approval from Twitter, submitted an application to create a new app, filled out the form and agreed to the terms.
  4. Created the Keys and Access Tokens.

Twitter Dashboard for Developers

Load Libraries

library(httr)
library(base64enc)
library(jsonlite)
library(stringr)
suppressMessages(library(tidyverse))
## Warning: package 'dplyr' was built under R version 3.5.1
library(tidytext)
library(knitr)
library(XML)
suppressMessages(library(RCurl))
library(methods)
suppressMessages(library(tm))
suppressMessages(library(wordcloud))
library(topicmodels)

API Credentials

Read the key, key secret, access token and access token secret from a text file to keep this information confidential.

api <- read.table("Twitter_API_Key.txt", header = TRUE, stringsAsFactors = FALSE)
names(api)
dim(api)
App_Name <- api$app_name
Consumer_Key <- api$key
Consumer_Secret <- api$secret_Key
Access_Token <- api$access_token
Access_Secret <- api$access_token_secret

API Authentication

We faced a challenge in this part of the project because much of the documentation available on how to access the Twitter APIs using R covers Twitter's Basic Search API rather than the Premium Search API. The Premium API was launched in Nov 2017 and is relatively new to the community. The Basic Twitter API only gives access to the previous 7 days of tweets, and to conduct our analysis we needed access to tweets posted earlier than that.

The following chunk of code was retrieved from https://twittercommunity.com/t/how-to-use-premium-api-for-the-first-time-beginner/105346/10. This was the only mention that we could find about accessing the Premium API.

# base64 encoding
kands <- paste(Consumer_Key, Consumer_Secret, sep=":")
base64kands <- base64encode(charToRaw(kands))
base64kandsb <- paste("Basic", base64kands, sep=" ")
# request bearer token
resToken <- POST(url = "https://api.twitter.com/oauth2/token",
                 add_headers("Authorization" = base64kandsb, "Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8"),
                 body = "grant_type=client_credentials")
# get bearer token
bearer <- content(resToken)
bearerToken <- bearer[["access_token"]]
bearerTokenb <- paste("Bearer", bearerToken, sep=" ")
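
Before moving on, it helps to confirm that the token request actually succeeded. A minimal sketch (not part of the original workflow) using httr's status_code:

# Stop early if the bearer token request failed
if (status_code(resToken) != 200) {
        stop("Bearer token request failed with HTTP status ", status_code(resToken))
}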

Data Acquisition

Since the Twitter Premium API - Sandbox (free version) limits access to tweets posted in the last 30 days only, it is important to save the results of the search into CSV files. This way, we can still access the results afterwards, even once the data is no longer available via the API.

When converting the data received from the API to a data frame, it may include columns whose observations are lists. In this case, the R functions write.csv and write_csv return an error. The function below identifies which columns of the data frame contain lists so that they can be removed; the information removed in the process is not important for our project.

# Function to identify which columns are lists
list_col <- function(df){
        n <- length(df)
        vec <- vector('numeric')
        for (i in 1:n){
                cl <- df[,i]
                if(class(cl)=="list"){
                        vec <- c(vec,i)
                }
        }
        return(vec)
}
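
An equivalent, more compact way to get the same column indices (a sketch using base R's vapply; interchangeable with the loop above):

# Same result as list_col(df): indices of the columns that are lists
list_col2 <- function(df) which(vapply(df, is.list, logical(1)))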

Requests for data will likely generate more data than can be returned in a single response (our limit is up to 100 tweets per response). When a response is paginated, the API provides a “Next” token in the body of the response that indicates whether any further pages are available. These “Next” page tokens can then be used to make further requests. The following function automates the API requests with or without pagination.

The code below requests tweets which include the following search terms from 2018/11/05 to 2018/11/19:

  • “#maxrose”

  • “#dandonovan”

  • @RepDanDonovan

  • @MaxRose4NY

# The query includes the terms "#maxrose", "#dandonovan", "@RepDanDonovan", "@MaxRose4NY"
# Date range: from 2018/11/05 to 2018/11/19
# The body is split into a start and an end so that a "next" page token can be
# inserted between them when paginating
sbody = "{\"query\": \"#maxrose OR #dandonovan OR @RepDanDonovan OR @MaxRose4NY\",\"maxResults\": 100, \"fromDate\":\"201811050000\", \"toDate\":\"201811190000"
ebody = "\"}"
request <- function(start_body,end_body){
         full_body <- str_c(start_body, end_body, sep = "")
         nxt <-""
         pageNo <- 1
         
         while(!is.null(nxt)){
                resTweets <- POST(url = "https://api.twitter.com/1.1/tweets/search/30day/dev.json",
                  add_headers("authorization" = bearerTokenb, "content-Type" = "application/json"),
                  body = full_body)
                
                #checking if the type of response is json
                # if (http_type(resTweets) != "application/json") {
                #         stop("API did not return json", call. = FALSE)}
                
                #checking if the request was successful
                # if (http_error(resTweets)) {
                #         stop(sprintf("Twitter API request failed! Status = %s. Check what went wrong.\n", 
                #                      status_code(resTweets)),
                #              call. = FALSE)}else{
                #                      message("Retrieving page ",pageNo)}
                
                # Parse the data
                tweets_df <- fromJSON(content(resTweets, "text"),flatten = TRUE) %>% data.frame()
        
                # Saving data only from the texts of the tweets in separate files
                text_df <- tweets_df$results.text
                file1 <- str_c("text",pageNo,".csv")
                write.csv(text_df, file1, row.names=FALSE)
        
                # Remove the list-columns, if any (write.csv cannot handle them)
                vec <- list_col(tweets_df)
                if (length(vec) > 0) tweets_df <- tweets_df[,-vec]
                # Saving the whole data
                file2 <- str_c("tweet",pageNo,".csv")
                write.csv(tweets_df, file2, row.names=FALSE)
        
                # Read the "next"" token received in the response
                nxt <- tweets_df$next.[[1]]
        
                if(!is.null(nxt)){
                        # insert the next token in the body of the request
                        full_body <- str_c(start_body, "\", \"next\":\"", nxt, end_body, sep = "")
                        pageNo <- pageNo+1}
                
                # Pause so we don't exceed the API's per-minute request limit
                Sys.sleep(3)
        } #end of while loop
         
}#end of function
request(start_body = sbody,end_body = ebody)

The screenshot below shows the Twitter API in action; the following message appears for every page retrieved:

The total number of files obtained from the Twitter API is 177.

For our project, we use the files numbered 67 through 177, which correspond to tweets from Nov 5 through part of Nov 10.

Tweets Preprocessing

The CSV files containing the Twitter data were uploaded to Github.

The following function creates a vector with all the URLs to be accessed to retrieve the data.

start_url <- "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet"
end_url <- ".csv"
# the selected files to be used in this project
vec <- seq(67,177)
# function
pages <- function(vec){
        n <- length(vec)
        urls <- vector('character')
        for (i in 1:n){
                temp <- str_c(start_url,vec[i],end_url, collapse = "")
                urls <- c(urls, temp)
        }
        return(urls)
}
urls <-pages(vec)
head(urls)
## [1] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet67.csv"
## [2] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet68.csv"
## [3] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet69.csv"
## [4] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet70.csv"
## [5] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet71.csv"
## [6] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet72.csv"
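
Because str_c is vectorized, the same vector of URLs can also be built without an explicit loop; an equivalent one-line sketch:

# Vectorized equivalent of pages(vec)
urls2 <- str_c(start_url, vec, end_url)
identical(urls2, urls)  # should be TRUE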

The URLs created above are used to open connections to the files, read them into data frames, select the columns of interest, and bind the data frames together.

n <-length(urls)
Stream <-data.frame()
for (i in 1:n){
        csvfile <- url(urls[i])
        df <- read.csv(csvfile,header = TRUE, fileEncoding = "ASCII", stringsAsFactors = FALSE)
        df <- df %>% select(results.created_at,results.text,results.user.name,results.user.location)
        Stream <- rbind(Stream,df)
}
str(Stream)
## 'data.frame':    11072 obs. of  4 variables:
##  $ results.created_at   : chr  "Sat Nov 10 01:25:50 +0000 2018" "Sat Nov 10 01:25:31 +0000 2018" "Sat Nov 10 01:24:59 +0000 2018" "Sat Nov 10 01:24:31 +0000 2018" ...
##  $ results.text         : chr  "RT @crhousel: \U0001f389\U0001f389Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Chang"| __truncated__ "RT @crhousel: \U0001f389\U0001f389Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Chang"| __truncated__ "RT @crhousel: \U0001f389\U0001f389Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Chang"| __truncated__ "RT @crhousel: \U0001f389\U0001f389Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Chang"| __truncated__ ...
##  $ results.user.name    : chr  "Cate   Resist\U0001f44f\U0001f3feEvery\U0001f44f\U0001f3ffDamned\U0001f44f\U0001f3fcDay\U0001f44f\U0001f3fd" "Pennell Somsen" "Barbara Ward #FBR \U0001f30a" "Randy #RESIST" ...
##  $ results.user.location: chr  NA "Mérida, Yucatán & Harlem, New York" "New Hampshire, USA" NA ...
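
Since purrr is loaded with the tidyverse, the same read-and-bind step could also be written without growing a data frame inside a loop. An equivalent sketch (Stream2 is just an illustrative name; same columns and result, assuming all files read cleanly):

# Read each CSV file and row-bind the selected columns
Stream2 <- purrr::map_dfr(urls, function(u) {
        read.csv(url(u), header = TRUE, fileEncoding = "ASCII", stringsAsFactors = FALSE) %>%
                select(results.created_at, results.text, results.user.name, results.user.location)
})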

For our analysis, we are interested in the tweets sent by people other than the candidates themselves, so we exclude the tweets from candidates’ usernames.

Stream <- Stream %>% filter(!results.user.name %in% c("Max Rose","Dan Donovan"))

We also convert the tweet creation dates to a date-time format to make the analysis easier.

# Change format of dates
Stream$results.created_at <- as.POSIXct(Stream$results.created_at, format = "%a %b %d %H:%M:%S +0000 %Y")
str(Stream)
## 'data.frame':    11037 obs. of  4 variables:
##  $ results.created_at   : POSIXct, format: "2018-11-10 01:25:50" "2018-11-10 01:25:31" ...
##  $ results.text         : chr  "RT @crhousel: \U0001f389\U0001f389Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Chang"| __truncated__ "RT @crhousel: \U0001f389\U0001f389Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Chang"| __truncated__ "RT @crhousel: \U0001f389\U0001f389Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Chang"| __truncated__ "RT @crhousel: \U0001f389\U0001f389Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Chang"| __truncated__ ...
##  $ results.user.name    : chr  "Cate   Resist\U0001f44f\U0001f3feEvery\U0001f44f\U0001f3ffDamned\U0001f44f\U0001f3fcDay\U0001f44f\U0001f3fd" "Pennell Somsen" "Barbara Ward #FBR \U0001f30a" "Randy #RESIST" ...
##  $ results.user.location: chr  NA "Mérida, Yucatán & Harlem, New York" "New Hampshire, USA" NA ...

Tweets Cleaning

The next steps are to cleanse the texts in the tweets:

  • Remove non-ASCII symbols and Twitter user handles (@user)
  • Remove punctuation, numbers, digits and special characters
  • Remove white space and stop words
  • Remove hashtags, URLs, Twitter shorthand, etc.
  • Convert the corpus to lower case
# Filtering off the retweets from the data (keep only original posts)
Stream <- Stream %>% 
  filter(!str_detect(results.text, "^RT"))
Mycorpus <- Corpus(VectorSource(Stream$results.text))
#Various cleansing functions:
#Non-ASCII symbols
remove_ASCIIs <- function(x) gsub("[^\x01-\x7F]", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_ASCIIs)))
#@'s 
remove_ATs <- function(x) gsub("@\\w+", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_ATs)))
#All Punctuations
remove_Puncts <- function(x) gsub("[[:punct:]]", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_Puncts)))
#All Digits
remove_Digits <- function(x) gsub("[[:digit:]]", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_Digits)))
#Links and extra whitespace, in 3 steps: strip URLs, remove runs of spaces/tabs, trim the ends
remove_HTTPSs <- function(x) gsub("http\\w+", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_HTTPSs)))
remove_HTTPSs2 <- function(x) gsub("[ \t]{2,}", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_HTTPSs2)))
remove_HTTPSs3 <- function(x) gsub("^\\s+|\\s+$", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_HTTPSs3)))
#Whitespaces
remove_WhiteSpace <- function(x) gsub("[ \t]{2,}", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_WhiteSpace)))
#Lower Case
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(tolower)))
#Remove the leftover fragment "im" (from "I'm" after punctuation removal)
remove_IMs <- function(x) gsub("im", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_IMs)))
#Stop words
Mycorpus <- suppressWarnings(tm_map(Mycorpus, removeWords,stopwords("English")))
# View the corpus
inspect(Mycorpus[1:10])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 10
## 
##  [1]                                                                                 
##  [2]                                                                                 
##  [3] shootings like  one  thousand oaks   often followed   pattern  inaction  january
##  [4]                                                                                 
##  [5]  already know  reason   protest george soros wants  bring social                
##  [6] celebratings victory nyyour passion  grassroots change  refreshing welcome      
##  [7] great  see  soon   congressmanbrave  nasty weather   greeting  new constituents 
##  [8]   honored thatis  district congressmana true gentleman   laugh  awe like  fool  
##  [9] yep   costs money  park   lot spent  ton  money visiti                          
## [10]  happy  can callmy new congressman

We take a look at a word cloud of the terms in the corpus.

#setting the same seed each time ensures consistent look across clouds
set.seed(7)
suppressWarnings(wordcloud(Mycorpus, random.order=F, scale=c(3, 0.5), min.freq = 5, col=rainbow(50)))

Because we need to tokenize the text for the analysis, we replace the tweet text in the original data frame with the cleaned text from the corpus.

# Original data frame
head(Stream$results.text, n=10)
##  [1] "@crhousel @MaxRose4NY \U0001f600"                                                                                                                          
##  [2] "@Prometheus_2018 @MaxRose4NY :) \U0001f44b\U0001f3fd"                                                                                                      
##  [3] ".@MaxRose4NY, shootings like the one in Thousand Oaks are too often followed by a pattern of inaction. In January,… https://t.co/2IWYBjWaGZ"               
##  [4] "@crhousel @MaxRose4NY \u270a\U0001f3fd"                                                                                                                    
##  [5] "@rldaug @BernardKerik @RepDanDonovan we already know the reason for the protest. George Soros wants to bring social… https://t.co/40WPd3PlDq"              
##  [6] "\U0001f389\U0001f389Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!! Welcome to the… https://t.co/0eeEdAhnwt"
##  [7] "Great to see my soon to be congressman @MaxRose4NY brave the nasty weather and out greeting his new constituents at… https://t.co/dLggg3T55B"              
##  [8] "I am honored that @MaxRose4NY is my district congressman,  a true gentleman. I would laugh in awe like a fool if I… https://t.co/zYR0AWfuI8"               
##  [9] "@phildemeo @NYCMayor @NYGovCuomo @MaxRose4NY Yep AND it costs money to park in the lot. Spent a ton of money visiti… https://t.co/Gh0IZ1jf7I"              
## [10] "@KatieVasquezTV @MaxRose4NY So happy I can call @MaxRose4NY my new congressman"
# Clean corpus
df <- data.frame(text = get("content", Mycorpus))
head(df, n=10)
##                                                                                text
## 1                                                                                  
## 2                                                                                  
## 3  shootings like  one  thousand oaks   often followed   pattern  inaction  january
## 4                                                                                  
## 5                   already know  reason   protest george soros wants  bring social
## 6      celebratings victory nyyour passion  grassroots change  refreshing welcome  
## 7  great  see  soon   congressmanbrave  nasty weather   greeting  new constituents 
## 8    honored thatis  district congressmana true gentleman   laugh  awe like  fool  
## 9                            yep   costs money  park   lot spent  ton  money visiti
## 10                                                happy  can callmy new congressman
Stream$results.text <- as.character(df$text)
# Remove the rows that contain empty strings in the text column after the cleanup
Stream <- Stream %>% filter(results.text !="")
# Add row numbers and move to the front of the data frame
Stream <- Stream %>% mutate(id = row_number()) %>% select(id, everything())
head(Stream, n=10)
##    id  results.created_at
## 1   1 2018-11-10 01:19:14
## 2   2 2018-11-10 01:10:24
## 3   3 2018-11-10 01:08:59
## 4   4 2018-11-10 01:06:50
## 5   5 2018-11-10 00:57:02
## 6   6 2018-11-09 23:57:43
## 7   7 2018-11-09 23:51:45
## 8   8 2018-11-09 23:51:36
## 9   9 2018-11-09 23:48:42
## 10 10 2018-11-09 23:45:15
##                                                                        results.text
## 1  shootings like  one  thousand oaks   often followed   pattern  inaction  january
## 2                   already know  reason   protest george soros wants  bring social
## 3      celebratings victory nyyour passion  grassroots change  refreshing welcome  
## 4  great  see  soon   congressmanbrave  nasty weather   greeting  new constituents 
## 5    honored thatis  district congressmana true gentleman   laugh  awe like  fool  
## 6                            yep   costs money  park   lot spent  ton  money visiti
## 7                                                 happy  can callmy new congressman
## 8                wasnt   diverse electorate  s brooklynstaten island  pushedover  e
## 9                                                                         excellent
## 10                                                                thats lovely  see
##                                                                    results.user.name
## 1                                                                     Kristen Caruso
## 2  \u274c \U0001f5f3\U0001f1fa\U0001f1f8 2020\U0001f5fd\U0001f5f3 \u2b50\u2b50\u2b50
## 3                                                        \u26a1️StarfireResists\u26a1️
## 4                                                                   Timothy O'Reilly
## 5                                                                         Dina Cameo
## 6                                                                               Alex
## 7                                                                                 TG
## 8                                                                      D. Changstein
## 9                                                                 Covfefe Deplorable
## 10                                                                            Eileen
##             results.user.location
## 1                            <NA>
## 2                            <NA>
## 3                Geeks Resist, HQ
## 4                    Brooklyn, NY
## 5                   New York City
## 6                            <NA>
## 7  The Divided States of America 
## 8                            <NA>
## 9                            <NA>
## 10                           <NA>
# Tokenize the new clean text from the data frame
Streamnew <- Stream %>%  
  unnest_tokens(word, results.text)

Looking back at the word cloud, it seems that the stop words from the tm package didn't filter out all of the undesired words from the tweets, so we continue the cleaning using the stop-word list from the tidytext package.

# Remove stop words and other words
data(stop_words)
Streamnew <- Streamnew %>% anti_join(stop_words)
## Joining, by = "word"

Sentiment Analysis

Here we get the counts of the most frequent words found in our data.

Streamnew %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "Count",
       y = "Unique words",
       title = "Count of unique words found in tweets")
## Selecting by n

We are going to use the “bing” sentiment lexicon, which classifies words as positive or negative, and join the list of words extracted from the tweets with this lexicon.

# join sentiment classification to the tweet words
bing_word_counts <- Streamnew %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"

The code below creates a plot of positive and negative words.

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "Midterm Election Sentiment.",
       y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

According to the sentiment lexicon “bing”, it seems that most of the words tend to be positive for the results of the election.

Let’s take a look at how our results were affected by the passage of time by examining the data at specific time periods.

The code below adds a column that classifies tweets by timing:

  • Early: November 5th and 6th
  • Med: November 7th
  • Late: November 8th and later

Streamnew$Timing <- ifelse(Streamnew$results.created_at <= '2018-11-07', 'Early',
                  ifelse(Streamnew$results.created_at >= '2018-11-07' & Streamnew$results.created_at <= '2018-11-08', 'Med',
                         ifelse(Streamnew$results.created_at >= '2018-11-08', 'Late', 'other')
                  ))
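
A dplyr alternative to the nested ifelse calls, which may be easier to read (an equivalent sketch using case_when with the same cut-offs):

# Same classification with case_when (conditions are evaluated top to bottom)
Streamnew <- Streamnew %>%
  mutate(Timing = case_when(
    results.created_at <= as.POSIXct("2018-11-07") ~ "Early",
    results.created_at <= as.POSIXct("2018-11-08") ~ "Med",
    TRUE ~ "Late"
  ))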

The code below joins our data frame with sentiment data and plots positive and negative words in “Early, Med, and Late” timing categories.

# join sentiment classification to the tweet words
Elec_sentiment_2018 <- Streamnew %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, Timing, sort = TRUE) %>%
  group_by(sentiment) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  group_by(Timing, sentiment) %>%
  top_n(n = 5, wt = n) %>%
  arrange(Timing, sentiment, n)
## Joining, by = "word"
Elec_sentiment_2018 %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Timing, scales = "free_y", ncol = 2) +
  labs(title = "Sentiment during the 2018 Midterm Election - NY 11th Cong. District.",
       y = "Number of Times Word Appeared in Tweets",
       x = NULL) +
  coord_flip()

Topic Modeling with LDA

dtm <- DocumentTermMatrix(Mycorpus)
dtm
## <<DocumentTermMatrix (documents: 4111, terms: 6281)>>
## Non-/sparse entries: 22854/25798337
## Sparsity           : 100%
## Maximal term length: 39
## Weighting          : term frequency (tf)

The sparse document-term matrix contains rows without any entries (documents with no remaining words), and this causes an error in the LDA function.

To deal with this issue, we compute the word count for each row and keep only the rows of the dtm matrix whose sum is greater than 0.

rowTotals <- apply(dtm , 1, sum)
dtm.new   <- dtm[rowTotals> 0, ]    
dtm.new
## <<DocumentTermMatrix (documents: 3831, terms: 6281)>>
## Non-/sparse entries: 22854/24039657
## Sparsity           : 100%
## Maximal term length: 39
## Weighting          : term frequency (tf)
# Set a seed so that the output of the model is predictable
# Model finds 4 topics
lda <- LDA(dtm.new, k = 4, control = list(seed = 1234))
lda
## A LDA_VEM topic model with 4 topics.
term <- terms(lda, 10) # first 10 terms of every topic
term
##       Topic 1           Topic 2           Topic 3           Topic 4   
##  [1,] "vote"            "congratulations" "staten"          "island"  
##  [2,] "island"          "just"            "congrats"        "staten"  
##  [3,] "congratulations" "island"          "max"             "vote"    
##  [4,] "happy"           "will"            "island"          "maxrose" 
##  [5,] "new"             "congress"        "just"            "brooklyn"
##  [6,] "voted"           "amp"             "proud"           "max"     
##  [7,] "proud"           "won"             "like"            "new"     
##  [8,] "good"            "voted"           "congratulations" "rose"    
##  [9,] "win"             "now"             "blue"            "voting"  
## [10,] "district"        "great"           "brooklyn"        "proud"
topics <- tidy(lda, matrix = "beta")
topics
## # A tibble: 25,124 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 followed 0.000152   
##  2     2 followed 0.0000559  
##  3     3 followed 0.000216   
##  4     4 followed 0.0000935  
##  5     1 inaction 0.0000886  
##  6     2 inaction 0.0000285  
##  7     3 inaction 0.000131   
##  8     4 inaction 0.0000968  
##  9     1 january  0.000299   
## 10     2 january  0.000000454
## # ... with 25,114 more rows

In the table above, we see that for each term the LDA model computes the probability (beta) of that term being generated from each topic. For example, the term “followed” has probabilities 0.000152, 0.0000559, 0.000216 and 0.0000935 of being generated from topics 1, 2, 3 and 4, respectively.
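
Because beta is a per-topic probability distribution over terms, the betas within each topic should sum to (approximately) 1; a quick check on the tidy output above:

# Each topic's term probabilities sum to roughly 1
topics %>%
  group_by(topic) %>%
  summarize(total_beta = sum(beta))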

The following is a visualization of the results for the top 10 terms.

top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

The four topics appear to be similar. That is not surprising given that there is a limited number of topics one would expect to see discussed in comments relating to the election results.

Note that some of the top frequent positive words from the sentiment analysis above, such as “congratulations” and “proud”, appear in most of the topics. But none of the top negative words appear in the topics’ lists.
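
A note on the number of topics: k = 4 was our own judgment call. If one wanted to compare candidate values of k, the topicmodels package provides a perplexity function; a rough sketch (in-sample perplexity generally keeps decreasing as k grows, so treat it as a guide rather than a rule):

# Refit the model for a few values of k and compare perplexity (lower = better fit)
ks <- c(2, 4, 6, 8)
sapply(ks, function(k) perplexity(LDA(dtm.new, k = k, control = list(seed = 1234))))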

Modeling with TF-IDF

The Term Frequency-Inverse Document Frequency (TF-IDF) statistic measures how important a word is to a document in a corpus of documents; in our case, to one tweet in a collection of tweets.

TF-IDF in general finds:

  • Words that are very common in a specific document (tweet) are probably important to the topic of that document.

  • Words that are very common in all documents probably aren’t important to the topics of any of them.

So a term will receive a high weight if it’s common in a specific document and also uncommon across all documents.
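
Concretely, for a term t and tweet (document) d, the statistic is computed as follows (this matches tidytext's bind_tf_idf, which we use below and which uses the natural logarithm):

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \ln\frac{N}{n_t}$$

where tf(t, d) is the proportion of words in d that are t, N is the total number of tweets, and n_t is the number of tweets that contain t.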

# get the count of each word in each individual tweet
mywords <- Stream %>%  
        unnest_tokens(word, results.text) %>% 
        anti_join(stop_words) %>% 
        count(id,word, sort = TRUE) %>% 
        ungroup()
## Joining, by = "word"
# get the number of words per tweet
total_words <- mywords %>% 
      group_by(id) %>% 
      summarize(total = sum(n))
        
# combine the two dataframes we just made
mywords <- left_join(mywords, total_words)
## Joining, by = "id"
mywords
## # A tibble: 18,675 x 4
##       id word       n total
##    <int> <chr>  <int> <int>
##  1  1602 ha        24    27
##  2  3615 vote       9    15
##  3  3584 amp        7     7
##  4   980 dont       4     9
##  5  2384 max        4    17
##  6  2520 omg        4     4
##  7  3716 ny         4     6
##  8    42 vote       3     7
##  9   749 carpet     3    10
## 10  1520 york       3     6
## # ... with 18,665 more rows
# get the tf_idf & order the words by degree of relevence
tf_idf1 <- mywords %>%
      bind_tf_idf(word, id, n) %>%
      select(-total) %>%
      arrange(desc(tf_idf)) %>%
      mutate(word = factor(word, levels = rev(unique(word))))
tf_idf2 <- mywords %>%
      bind_tf_idf(word, id, n) %>%
      select(-total) %>%
      arrange(tf_idf) %>%
      mutate(word = factor(word, levels = rev(unique(word))))
print(tf_idf1, n=10)
## # A tibble: 18,675 x 6
##       id word                        n    tf   idf tf_idf
##    <int> <fct>                   <int> <dbl> <dbl>  <dbl>
##  1    57 graffiti                    1     1  8.24   8.24
##  2   106 referr                      1     1  8.24   8.24
##  3   147 dayayewhere                 1     1  8.24   8.24
##  4   209 awesomecongratulations      1     1  8.24   8.24
##  5   210 yuprocked                   1     1  8.24   8.24
##  6   246 didwow                      1     1  8.24   8.24
##  7   275 ohhhhh                      1     1  8.24   8.24
##  8   280 fantasticall                1     1  8.24   8.24
##  9   282 goodjob                     1     1  8.24   8.24
## 10   289 congratulationslastword     1     1  8.24   8.24
## # ... with 1.866e+04 more rows
print(tf_idf2, n=10)
## # A tibble: 18,675 x 6
##       id word       n     tf   idf tf_idf
##    <int> <fct>  <int>  <dbl> <dbl>  <dbl>
##  1  2384 staten     1 0.0588  2.23  0.131
##  2   502 staten     1 0.0714  2.23  0.159
##  3   620 staten     1 0.0714  2.23  0.159
##  4  1418 staten     1 0.0714  2.23  0.159
##  5   502 island     1 0.0714  2.32  0.166
##  6   620 island     1 0.0714  2.32  0.166
##  7  3615 ny         1 0.0667  2.56  0.171
##  8  1420 staten     1 0.0769  2.23  0.171
##  9  1954 staten     1 0.0769  2.23  0.171
## 10  3193 staten     1 0.0769  2.23  0.171
## # ... with 1.866e+04 more rows

If the TF-IDF is close to zero, the word is extremely common across the corpus. Thus, words such as “staten”, “island” and “ny” do little to distinguish individual tweets. On the other hand, the words with high TF-IDF are important; as one can see, many of them have high TF-IDF because of typos and words concatenated together during cleaning. Nevertheless, note that the majority of them can be read as positive.

Conclusions

The project’s goal is to find out if people’s feelings and/or opinions about the election result for the 11th Congressional District of NY are in general positive or negative.

According to the sentiment lexicon “bing”, it seems that most of the words overall tend to be positive for the results of the election.

The LDA analysis also seems to confirm this, since the topics produced by the model include frequent positive words such as “congratulations” and “proud”, while the top negative words do not appear among the topics modeled from the tweets.

Finally, the TF-IDF analysis shows that the words that are important to the corpus are in general positive, even though some of them are concatenations of words introduced during cleaning.

Overall, the sentiment expressed in the tweets is positive.

Reference