Project Description:
Use the Twitter API to perform sentiment analysis on the results of the 2018 midterm election for NY's 11th Congressional District, which includes all of Staten Island and parts of southern Brooklyn.
Members: B. Sosnovski, E. Azrilyan, and R. Mercier.
Motivation: Sentiment analysis plays an essential role during elections. Political strategists can use the public's opinions, emotions, and attitudes to help convert them into votes in 2020.
Can we gauge the public’s sentiment related to the results of the midterm election?
Data: To conduct our analysis, we will harvest data using one of Twitter’s APIs. Data will be restricted to a specific date range for NY’s 11th Congressional District election race.
Work Flow:
- Acquire data.
- Fetch, clean, transform and tokenize the data.
- Perform feature selection to keep only the meaningful tweets for the analysis.
- Analysis: classify results as positive or negative.
Our collection of tweet texts can be divided into natural groups that we can examine separately. Topic modeling is a method for the unsupervised classification of such documents that finds these natural groupings.
We will fit a probabilistic topic model using Latent Dirichlet Allocation (LDA). LDA can be used to extract general tendencies from Twitter posts or comments as specific topics, which can then be assessed as leaning positive or negative.
We also use the Term Frequency-Inverse Document Frequency (TF-IDF) statistic in our analysis, which attempts to find the words that are important in a text, i.e., common within that text but not too common across all texts.
Tools:
- Twitter Premium Search API - 30-day endpoint (Sandbox), which provides tweets from the previous 30 days.
- R Packages
Caution:
This document contains some explicit language. Due to time constraints, we could not work on removing the explicit language in the tweets. We apologize in advance if this offends someone.
Load Libraries
library(httr)
library(base64enc)
library(jsonlite)
library(stringr)
suppressMessages(library(tidyverse))
library(tidytext)
library(knitr)
library(XML)
suppressMessages(library(RCurl))
library(methods)
suppressMessages(library(tm))
suppressMessages(library(wordcloud))
library(topicmodels)
API Credentials
Read the API key, key secret, access token, and access token secret from a text file to keep this information confidential.
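For reference, Twitter_API_Key.txt is expected to be a whitespace-delimited table with a header row. The layout sketched below is an assumption inferred from the column names used in the code that follows; the values are placeholders only.
# Illustrative layout of Twitter_API_Key.txt (placeholder values, not real credentials):
# app_name key secret_Key access_token access_token_secret
# MyApp XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX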
api <- read.table("Twitter_API_Key.txt", header = TRUE, stringsAsFactors = FALSE)
names(api)
dim(api)
App_Name <- api$app_name
Consumer_Key <- api$key
Consumer_Secret <- api$secret_Key
Access_Token <- api$access_token
Access_Secret <- api$access_token_secret
API Authentication
We faced a challenge in this part of the project because much of the documentation available on accessing the Twitter APIs from R covers the Basic Search API, not the Premium Search API. The Premium API was launched in November 2017 and is relatively new to the community. The Basic Search API only gives access to the previous 7 days of tweets, and to conduct our analysis, we needed tweets posted earlier than that.
The following chunk of code was retrieved from https://twittercommunity.com/t/how-to-use-premium-api-for-the-first-time-beginner/105346/10. This was the only mention we could find about accessing the Premium API.
# base64 encoding
kands <- paste(Consumer_Key, Consumer_Secret, sep=":")
base64kands <- base64encode(charToRaw(kands))
base64kandsb <- paste("Basic", base64kands, sep=" ")
# request bearer token
resToken <- POST(url = "https://api.twitter.com/oauth2/token",
add_headers("Authorization" = base64kandsb, "Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8"),
body = "grant_type=client_credentials")
# get bearer token
bearer <- content(resToken)
bearerToken <- bearer[["access_token"]]
bearerTokenb <- paste("Bearer", bearerToken, sep=" ")
Data Acquisition
Since the Twitter Premium API - Sandbox (free version) limits access to tweets posted in the last 30 days, it is vital to save the search results into CSV files. This way, we can still access the results afterward, even when the data is no longer available via the API.
Converting the data received from the API to a data frame may produce columns whose observations are lists. In that case, the R functions "write.csv" and "write_csv" return an error. The loop-based function below identifies which columns of the data frame contain lists so that they can be dropped (a compact alternative is sketched right after this paragraph). The information removed in the process is not essential for our project.
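As an aside, the same cleanup can be done in one step. The following is a minimal base-R sketch, not the approach used in this project, and it assumes tweets_df is the data frame built from the API response:
# Drop all list-columns in one step (equivalent in effect to using list_col below)
tweets_df <- tweets_df[, !sapply(tweets_df, is.list), drop = FALSE]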
# Function to identify which columns are lists
list_col <- function(df){
n <- length(df)
vec <- vector('numeric')
for (i in 1:n){
cl <- df[,i]
if(class(cl)=="list"){
vec <- c(vec,i)
}
}
return(vec)
}
Requests for data will likely generate more results than can be returned in a single response (our limit is 100 tweets per response). When a response is paginated, its body includes a "next" token indicating that more pages are available. These "next" tokens can then be used to make further requests. The following function automates the API requests, with or without pagination.
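To make the pagination concrete, the JSON body sent with the first request and with follow-up requests looks roughly as follows; this is a sketch based on sbody, ebody, and the next-token insertion inside the request function below.
# First page:
# {"query": "...", "maxResults": 100, "fromDate": "201811050000", "toDate": "201811190000"}
# Follow-up pages (the "next" value comes from the previous response):
# {"query": "...", "maxResults": 100, "fromDate": "201811050000", "toDate": "201811190000", "next": "<token>"}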
The code below requests tweets that include any of the following search terms, from 2018/11/05 to 2018/11/19:
“#maxrose”
“#dandonovan”
“@RepDanDonovan”
“@MaxRose4NY”
# the query includes terms "#maxrose", "#dandonovan", "@RepDanDonovan", "@MaxRose4NY"
# data range: from 2018/11/05 to 2018/11/19
sbody = "{\"query\": \"#maxrose OR #dandonovan OR @RepDanDonovan OR @MaxRose4NY\",\"maxResults\": 100, \"fromDate\":\"201811050000\", \"toDate\":\"201811190000"
ebody = "\"}"
request <- function(start_body,end_body){
full_body <- str_c(start_body, end_body, sep = "")
nxt <-""
pageNo <- 1
while(!is.null(nxt)){
resTweets <- POST(url = "https://api.twitter.com/1.1/tweets/search/30day/dev.json",
add_headers("authorization" = bearerTokenb, "content-Type" = "application/json"),
body = full_body)
#checking if the type of response is JSON
# if (http_type(resTweets) != "application/json") {
# stop("API did not return json", call. = FALSE)}
#checking if the request was successful
# if (http_error(resTweets)) {
# stop(sprintf("Twitter API request failed! Status = %s. Check what went wrong.\n",
# status_code(resTweets)),
# call. = FALSE)}else{
# message("Retrieving page ",pageNo)}
# Parse the data
tweets_df <- fromJSON(content(resTweets, "text"),flatten = TRUE) %>% data.frame()
# Saving data only from the tweets' texts in separate files
text_df <- tweets_df$results.text
file1 <- str_c("text",pageNo,".csv")
write.csv(text_df, file1, row.names=FALSE)
# Remove the list-columns
vec<-list_col(tweets_df)
tweets_df <- tweets_df[,-vec]
# Saving the whole data
file2 <- str_c("tweet",pageNo,".csv")
write.csv(tweets_df, file2, row.names=FALSE)
# Read the "next" token received in the response
nxt <- tweets_df$next.[[1]]
if(!is.null(nxt)){
# insert the next token in the body of the request
full_body <- str_c(start_body, "\", \"next\":\"", nxt, end_body, sep = "")
pageNo <- pageNo+1}
# To avoid exceeding the API's limit per minute
Sys.sleep(3)
} #end of while loop
}#end of function
request(start_body = sbody, end_body = ebody)
The screenshot below shows the Twitter API in action; the following message appears for every page.
The total number of files obtained from the Twitter API is 177.
For our project, we will use file numbers 67 to 177. These correspond to tweets from Nov 5 through part of Nov 10.
Tweets Preprocessing
The CSV files containing the Twitter data were uploaded to GitHub.
The following function creates a vector with all links to be accessed to retrieve data.
start_url <- "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet"
end_url <- ".csv"
# the selected files to be used in this project
vec <- seq(67,177)
# function
pages <- function(vec){
n <- length(vec)
urls <- vector('character')
for (i in 1:n){
temp <- str_c(start_url,vec[i],end_url, collapse = "")
urls <- c(urls, temp)
}
return(urls)
}
urls <-pages(vec)
head(urls)
## [1] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet67.csv"
## [2] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet68.csv"
## [3] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet69.csv"
## [4] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet70.csv"
## [5] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet71.csv"
## [6] "https://raw.githubusercontent.com/bsosnovski/FinalProject/master/tweet72.csv"
Using the URLs created, we open connections to the files, read them into data frames, select the columns of interest, and bind them together.
n <-length(urls)
Stream <-data.frame()
for (i in 1:n){
csvfile <- url(urls[i])
df <- read.csv(csvfile,header = TRUE, fileEncoding = "ASCII", stringsAsFactors = FALSE)
df <- df %>% select(results.created_at,results.text,results.user.name,results.user.location)
Stream <- rbind(Stream,df)
}
str(Stream)
## 'data.frame': 11072 obs. of 4 variables:
## $ results.created_at : chr "Sat Nov 10 01:25:50 +0000 2018" "Sat Nov 10 01:25:31 +0000 2018" "Sat Nov 10 01:24:59 +0000 2018" "Sat Nov 10 01:24:31 +0000 2018" ...
## $ results.text : chr "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ ...
## $ results.user.name : chr "Cate Resist👏🏾Every👏🏿Damned👏🏼Day👏🏽" "Pennell Somsen" "Barbara Ward #FBR 🌊" "Randy #RESIST" ...
## $ results.user.location: chr NA "Mérida, Yucatán & Harlem, New York" "New Hampshire, USA" NA ...
For our analysis, we are interested in tweets from people other than the candidates themselves, so we exclude tweets posted under the candidates' usernames.
Stream <- Stream %>% filter(!results.user.name %in% c("Max Rose","Dan Donovan"))
We also change the date format to make the analysis easier.
# Change the format of the dates
Stream$results.created_at <- as.POSIXct(Stream$results.created_at, format = "%a %b %d %H:%M:%S +0000 %Y")
str(Stream)
## 'data.frame': 11037 obs. of 4 variables:
## $ results.created_at : POSIXct, format: "2018-11-10 01:25:50" "2018-11-10 01:25:31" ...
## $ results.text : chr "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ "RT @crhousel: 🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!"| __truncated__ ...
## $ results.user.name : chr "Cate Resist👏🏾Every👏🏿Damned👏🏼Day👏🏽" "Pennell Somsen" "Barbara Ward #FBR 🌊" "Randy #RESIST" ...
## $ results.user.location: chr NA "Mérida, Yucatán & Harlem, New York" "New Hampshire, USA" NA ...
Tweets Cleaning
The following steps clean the text of the tweets:
- Remove non-ASCII symbols (emoji, etc.) and Twitter user handles (@user)
- Remove punctuation, digits, and special characters
- Remove extra white space and stop words
- Remove hashtags, tags, URLs, Twitter short words, etc.
- Convert the corpus to lowercase
# Filter out the retweets from the data (keep only original posts)
Stream <- Stream %>%
filter(!str_detect(results.text, "^RT"))
Mycorpus <- Corpus(VectorSource(Stream$results.text))
#Various cleansing functions:
#Non-ASCII symbols (emoji, etc.)
remove_ASCIIs <- function(x) gsub("[^\x01-\x7F]", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_ASCIIs)))
#@'s
remove_ATs <- function(x) gsub("@\\w+", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_ATs)))
#All Punctuations
remove_Puncts <- function(x) gsub("[[:punct:]]", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_Puncts)))
#All Digits
remove_Digits <- function(x) gsub("[[:digit:]]", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_Digits)))
#URLs and leftover whitespace (3 steps): strip links, remove runs of spaces/tabs, trim leading/trailing whitespace
remove_HTTPSs <- function(x) gsub("http\\w+", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_HTTPSs)))
remove_HTTPSs2 <- function(x) gsub("[ \t]{2,}", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_HTTPSs2)))
remove_HTTPSs3 <- function(x) gsub("^\\s+|\\s+$", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_HTTPSs3)))
#Whitespaces
remove_WhiteSpace <- function(x) gsub("[ \t]{2,}", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_WhiteSpace)))
#Lower Case
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(tolower)))
#Remove "im" fragments (left over from "I'm" after punctuation removal); note this also strips "im" inside longer words
remove_IMs <- function(x) gsub("im", "", x)
Mycorpus <- suppressWarnings(tm_map(Mycorpus, content_transformer(remove_IMs)))
#Stop words
Mycorpus <- suppressWarnings(tm_map(Mycorpus, removeWords, stopwords("english")))
# View the corpus
inspect(Mycorpus[1:10])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 10
##
## [1]
## [2]
## [3] shootings like one thousand oaks often followed pattern inaction january
## [4]
## [5] already know reason protest george soros wants bring social
## [6] celebratings victory nyyour passion grassroots change refreshing welcome
## [7] great see soon congressmanbrave nasty weather greeting new constituents
## [8] honored thatis district congressmana true gentleman laugh awe like fool
## [9] yep costs money park lot spent ton money visiti
## [10] happy can callmy new congressman
We look at a word cloud of the terms in the corpus.
#setting the same seed each time ensures a consistent look across clouds
set.seed(7)
suppressWarnings(wordcloud(Mycorpus, random.order=F, scale=c(3, 0.5), min.freq = 5, col=rainbow(50)))
Because we need to tokenize the text for the analysis, we replace the tweets' texts in the original data frame with the clean text from the corpus.
# Original data frame
head(Stream$results.text, n=10)
## [1] "@crhousel @MaxRose4NY 😀"
## [2] "@Prometheus_2018 @MaxRose4NY :) 👋🏽"
## [3] ".@MaxRose4NY, shootings like the one in Thousand Oaks are too often followed by a pattern of inaction. In January,… https://t.co/2IWYBjWaGZ"
## [4] "@crhousel @MaxRose4NY ✊🏽"
## [5] "@rldaug @BernardKerik @RepDanDonovan we already know the reason for the protest. George Soros wants to bring social… https://t.co/40WPd3PlDq"
## [6] "🎉🎉Celebrating @MaxRose4NY ‘s Victory #NY11 !! Your Passion for Grassroots Change is Refreshing!! Welcome to the… https://t.co/0eeEdAhnwt"
## [7] "Great to see my soon to be congressman @MaxRose4NY brave the nasty weather and out greeting his new constituents at… https://t.co/dLggg3T55B"
## [8] "I am honored that @MaxRose4NY is my district congressman, a true gentleman. I would laugh in awe like a fool if I… https://t.co/zYR0AWfuI8"
## [9] "@phildemeo @NYCMayor @NYGovCuomo @MaxRose4NY Yep AND it costs money to park in the lot. Spent a ton of money visiti… https://t.co/Gh0IZ1jf7I"
## [10] "@KatieVasquezTV @MaxRose4NY So happy I can call @MaxRose4NY my new congressman"
# Clean corpus
df <- data.frame(text = get("content", Mycorpus))
head(df, n=10)
## text
## 1
## 2
## 3 shootings like one thousand oaks often followed pattern inaction january
## 4
## 5 already know reason protest george soros wants bring social
## 6 celebratings victory nyyour passion grassroots change refreshing welcome
## 7 great see soon congressmanbrave nasty weather greeting new constituents
## 8 honored thatis district congressmana true gentleman laugh awe like fool
## 9 yep costs money park lot spent ton money visiti
## 10 happy can callmy new congressman
Stream$results.text <- as.character(df$text)
# Remove the rows that contain empty strings in the text column after the cleanup
Stream <- Stream %>% filter(results.text !="")
# Add row numbers and move to the front of the data frame
Stream <- Stream %>% mutate(id = row_number()) %>% select(id, everything())
head(Stream, n=10)
## id results.created_at
## 1 1 2018-11-10 01:19:14
## 2 2 2018-11-10 01:10:24
## 3 3 2018-11-10 01:08:59
## 4 4 2018-11-10 01:06:50
## 5 5 2018-11-10 00:57:02
## 6 6 2018-11-09 23:57:43
## 7 7 2018-11-09 23:51:45
## 8 8 2018-11-09 23:51:36
## 9 9 2018-11-09 23:48:42
## 10 10 2018-11-09 23:45:15
## results.text
## 1 shootings like one thousand oaks often followed pattern inaction january
## 2 already know reason protest george soros wants bring social
## 3 celebratings victory nyyour passion grassroots change refreshing welcome
## 4 great see soon congressmanbrave nasty weather greeting new constituents
## 5 honored thatis district congressmana true gentleman laugh awe like fool
## 6 yep costs money park lot spent ton money visiti
## 7 happy can callmy new congressman
## 8 wasnt diverse electorate s brooklynstaten island pushedover e
## 9 excellent
## 10 thats lovely see
## results.user.name results.user.location
## 1 Kristen Caruso <NA>
## 2 ❌ 🗳🇺🇸 2020🗽🗳 ⭐⭐⭐ <NA>
## 3 ⚡️StarfireResists⚡️ Geeks Resist, HQ
## 4 Timothy O'Reilly Brooklyn, NY
## 5 Dina Cameo New York City
## 6 Alex <NA>
## 7 TG The Divided States of America
## 8 D. Changstein <NA>
## 9 Covfefe Deplorable <NA>
## 10 Eileen <NA>
# Tokenize the new clean text from the data frame
Streamnew <- Stream %>%
unnest_tokens(word, results.text)
Looking back at the word cloud, it seems that the stop words from the tm package did not filter out all undesired words from the tweets, so we continue the cleaning using the stop words from the tidytext package.
# Remove stop words and other words
data(stop_words)
Streamnew <- Streamnew %>% anti_join(stop_words)
## Joining, by = "word"
Sentiment Analysis
Here we get the counts of the most frequent words found in our data.
Streamnew %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Count",
y = "Unique words",
title = "Count of unique words found in tweets")## Selecting by n
We will use the "bing" sentiment lexicon, which classifies words as positive or negative, and join the list of words extracted from the tweets with this lexicon.
# join sentiment classification to the tweet words
bing_word_counts <- Streamnew %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
The code below creates a plot of positive and negative words.
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(title = "Midterm Election Sentiment.",
y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
According to the "bing" sentiment lexicon, most of the words about the election results tend to be positive.
Let's look at how the results were affected by the passage of time; we will examine our data in specific time windows.
The code below adds a column that classifies tweets as "Early," "Med," or "Late":
- Early: November 5th and 6th
- Med: November 7th
- Late: November 8th and later
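As an aside, the same bucketing could be written with dplyr::case_when. The sketch below is equivalent in intent to the nested ifelse() used in the project code that follows and is not part of the original pipeline; case_when evaluates its conditions in order.
# Alternative sketch using case_when (not part of the original pipeline)
Streamnew <- Streamnew %>%
mutate(Timing = case_when(
results.created_at <= as.POSIXct("2018-11-07") ~ "Early",
results.created_at <= as.POSIXct("2018-11-08") ~ "Med",
TRUE ~ "Late"))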
Streamnew$Timing <- ifelse(Streamnew$results.created_at <= '2018-11-07', 'Early',
ifelse(Streamnew$results.created_at >= '2018-11-07' & Streamnew$results.created_at <= '2018-11-08', 'Med',
ifelse(Streamnew$results.created_at >= '2018-11-08', 'Late', 'other')
))
The code below joins our data frame with the sentiment data and plots positive and negative words in the "Early," "Med," and "Late" timing categories.
# join sentiment classification to the tweet words
Elec_sentiment_2018 <- Streamnew %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, Timing, sort = TRUE) %>%
group_by(sentiment) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
group_by(Timing, sentiment) %>%
top_n(n = 5, wt = n) %>%
arrange(Timing, sentiment, n)
## Joining, by = "word"
Elec_sentiment_2018 %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Timing, scales = "free_y", ncol = 2) +
labs(title = "Sentiment during the 2018 Midterm Election - NY 11th Cong. District.",
y = "Number of Times Word Appeared in Tweets",
x = NULL) +
coord_flip()
Topic Modeling with LDA
dtm <- DocumentTermMatrix(Mycorpus)
dtm
## <<DocumentTermMatrix (documents: 4111, terms: 6281)>>
## Non-/sparse entries: 22854/25798337
## Sparsity : 100%
## Maximal term length: 39
## Weighting : term frequency (tf)
The document-term matrix contains rows with no entries, i.e., documents whose words were all removed during cleaning, and these cause errors in the LDA function.
To deal with this issue, we compute the word count of each row and subset the dtm matrix to the rows whose sum is greater than 0.
rowTotals <- apply(dtm , 1, sum)
dtm.new <- dtm[rowTotals> 0, ]
dtm.new
## <<DocumentTermMatrix (documents: 3831, terms: 6281)>>
## Non-/sparse entries: 22854/24039657
## Sparsity : 100%
## Maximal term length: 39
## Weighting : term frequency (tf)
# Set a seed so that the output of the model is predictable
# Model finds 4 topics
lda <- LDA(dtm.new, k = 4, control = list(seed = 1234))
lda
## A LDA_VEM topic model with 4 topics.
term <- terms(lda, 10) # first 10 terms of every topic
term
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "vote" "congratulations" "staten" "island"
## [2,] "island" "just" "congrats" "staten"
## [3,] "congratulations" "island" "max" "vote"
## [4,] "happy" "will" "island" "maxrose"
## [5,] "new" "congress" "just" "brooklyn"
## [6,] "voted" "amp" "proud" "max"
## [7,] "proud" "won" "like" "new"
## [8,] "good" "voted" "congratulations" "rose"
## [9,] "win" "now" "blue" "voting"
## [10,] "district" "great" "brooklyn" "proud"
topics <- tidy(lda, matrix = "beta")
topics
## # A tibble: 25,124 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 followed 0.000152
## 2 2 followed 0.0000559
## 3 3 followed 0.000216
## 4 4 followed 0.0000935
## 5 1 inaction 0.0000886
## 6 2 inaction 0.0000285
## 7 3 inaction 0.000131
## 8 4 inaction 0.0000968
## 9 1 january 0.000299
## 10 2 january 0.000000454
## # … with 25,114 more rows
The table above shows the per-topic, per-term probabilities (beta) computed by the LDA model: for each combination, beta is the probability of that term being generated from that topic. For example, the word "followed" has probabilities 0.000152, 0.0000559, 0.000216, and 0.0000935 of being generated from topics 1, 2, 3, and 4, respectively.
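As a quick sanity check (a sketch, not part of the original analysis), the beta values form a probability distribution over the vocabulary within each topic, so they should sum to approximately 1 per topic:
# Verify that each topic's term probabilities sum to (approximately) 1
topics %>%
group_by(topic) %>%
summarize(total_beta = sum(beta))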
The following is a visualization of the results for the top 10 terms.
top_terms <- topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
The four topics appear to be similar. That is not surprising, given the limited number of topics one would expect to see discussed in comments relating to the election results.
Some of the top frequent positive words from the sentiment analysis above, such as “congratulations” and “proud,” appear in most topics. But none of the top negative words appear in the topics’ lists.
Modeling with TF-IDF
The Term Frequency-Inverse Document Frequency (TF-IDF) measures how important a word is to a document within a corpus of documents; in our case, to one tweet within the collection of tweets.
TF-IDF rests on two observations:
- Words that are very common in a specific document (tweet) are probably important to the topic of that document.
- Words that are very common across all documents probably aren't important to the topic of any one of them.
So a term receives a high weight if it is common in a specific document but uncommon across all documents.
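Written out (using the convention implemented in tidytext, which takes the natural logarithm): tf-idf(t, d) = tf(t, d) * idf(t), where tf(t, d) is the count of term t in tweet d divided by the total number of words in d, and idf(t) = ln(N / n_t), with N the number of tweets and n_t the number of tweets containing t.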
# get the count of each word in each tweet
mywords <- Stream %>%
unnest_tokens(word, results.text) %>%
anti_join(stop_words) %>%
count(id,word, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
# get the number of words per tweet
total_words <- mywords %>%
group_by(id) %>%
summarize(total = sum(n))
# combine the two data frames we just made
mywords <- left_join(mywords, total_words)
## Joining, by = "id"
head(mywords, 20)
## id word n total
## 1 1602 ha 24 27
## 2 3615 vote 9 15
## 3 3584 amp 7 7
## 4 980 dont 4 9
## 5 2384 max 4 17
## 6 2520 omg 4 4
## 7 3716 ny 4 6
## 8 42 vote 3 7
## 9 749 carpet 3 10
## 10 1520 york 3 6
## 11 1934 omg 3 3
## 12 2123 max 3 6
## 13 2123 rose 3 6
## 14 2421 district 3 6
## 15 2559 omg 3 4
## 16 2725 max 3 4
## 17 2769 omg 3 3
## 18 3538 ha 3 3
## 19 3607 vote 3 8
## 20 6 money 2 9
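Before calling bind_tf_idf, here is a minimal sketch of the computation it performs for a single term, using the "vote" row for tweet id 3615 shown above; the 9 and 15 come from that row, and tidytext uses the natural logarithm for the IDF.
# TF-IDF by hand for one (term, document) pair, mirroring what bind_tf_idf computes
n_docs <- length(unique(mywords$id)) # total number of tweets (documents)
docs_with_vote <- sum(mywords$word == "vote") # tweets containing "vote" (mywords has one row per id-word pair)
tf <- 9 / 15 # count of "vote" in tweet 3615 divided by the total words in that tweet
idf <- log(n_docs / docs_with_vote) # natural log
tf * idf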
# get the tf_idf & order the words by the degree of relevance
tf_idf1 <- mywords %>%
bind_tf_idf(word, id, n) %>%
select(-total) %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word))))
tf_idf2 <- mywords %>%
bind_tf_idf(word, id, n) %>%
select(-total) %>%
arrange(tf_idf) %>%
mutate(word = factor(word, levels = rev(unique(word))))
head(tf_idf1,20)
## id word n tf idf tf_idf
## 1 57 graffiti 1 1 8.235361 8.235361
## 2 106 referr 1 1 8.235361 8.235361
## 3 147 dayayewhere 1 1 8.235361 8.235361
## 4 209 awesomecongratulations 1 1 8.235361 8.235361
## 5 210 yuprocked 1 1 8.235361 8.235361
## 6 246 didwow 1 1 8.235361 8.235361
## 7 275 ohhhhh 1 1 8.235361 8.235361
## 8 280 fantasticall 1 1 8.235361 8.235361
## 9 282 goodjob 1 1 8.235361 8.235361
## 10 289 congratulationslastword 1 1 8.235361 8.235361
## 11 312 absoutely 1 1 8.235361 8.235361
## 12 314 fake 1 1 8.235361 8.235361
## 13 387 refe 1 1 8.235361 8.235361
## 14 462 pkij 1 1 8.235361 8.235361
## 15 481 intend 1 1 8.235361 8.235361
## 16 631 pos 1 1 8.235361 8.235361
## 17 635 uhh 1 1 8.235361 8.235361
## 18 652 recogniz 1 1 8.235361 8.235361
## 19 672 centtttt 1 1 8.235361 8.235361
## 20 814 twe 1 1 8.235361 8.235361
head(tf_idf2,20)
## id word n tf idf tf_idf
## 1 2384 staten 1 0.05882353 2.226547 0.1309734
## 2 502 staten 1 0.07142857 2.226547 0.1590391
## 3 620 staten 1 0.07142857 2.226547 0.1590391
## 4 1418 staten 1 0.07142857 2.226547 0.1590391
## 5 502 island 1 0.07142857 2.321858 0.1658470
## 6 620 island 1 0.07142857 2.321858 0.1658470
## 7 3615 ny 1 0.06666667 2.558607 0.1705738
## 8 1420 staten 1 0.07692308 2.226547 0.1712729
## 9 1954 staten 1 0.07692308 2.226547 0.1712729
## 10 3193 staten 1 0.07692308 2.226547 0.1712729
## 11 3536 staten 1 0.07692308 2.226547 0.1712729
## 12 1954 island 1 0.07692308 2.321858 0.1786044
## 13 3193 island 1 0.07692308 2.321858 0.1786044
## 14 3536 island 1 0.07692308 2.321858 0.1786044
## 15 1390 ny 1 0.07142857 2.558607 0.1827576
## 16 308 staten 1 0.08333333 2.226547 0.1855456
## 17 553 staten 1 0.08333333 2.226547 0.1855456
## 18 1113 staten 1 0.08333333 2.226547 0.1855456
## 19 1199 staten 1 0.08333333 2.226547 0.1855456
## 20 1907 staten 1 0.08333333 2.226547 0.1855456
A TF-IDF close to zero indicates that the word is ubiquitous; thus, the terms "staten," "island," and "ny" do little to distinguish individual tweets. On the other hand, the words with high TF-IDF are distinctive, although many of them owe their high scores to typos and to words concatenated during cleaning. Nevertheless, most of them can be read as favorable terms.
Conclusions
The project aims to determine if people’s feelings and opinions about the election result for the 11th Congressional District of NY are generally positive or negative.
According to the "bing" sentiment lexicon, most of the words overall tend to be positive about the election results.
The LDA analysis also seems to confirm this, since the topics produced by the model feature frequent positive words such as "congratulations" and "proud," while the top negative words do not appear among the modeled topics.
Finally, the TF-IDF analysis shows that the distinctive words in the corpus are generally positive, even though some of them are artifacts of word concatenation during cleaning.
Overall, the sentiment expressed in the tweets is favorable.
References
Jeff Gentry. “twitteR - Twitter client for R.” March 18, 2014. R package version 1.1.9. https://www.rdocumentation.org/packages/twitteR/versions/1.1.9
hupseb. “How to use premium API for the first time (beginner)?” Post #10, May 13, 2018. Twitter Developers Forums. https://twittercommunity.com/t/how-to-use-premium-api-for-the-first-time-beginner/105346/10
Julia Silge and David Robinson. “Text Mining with R. A Tidy Approach.” Sep 23, 2018. https://www.tidytextmining.com/index.html
Hadley Wickham. “Best practices for API packages.” Aug 20, 2017. R package httr Vignette. https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html
Leah Wasser, Carson Farmer. “Lesson 6. Sentiment Analysis of Colorado Flood Tweets in R” https://www.earthdatascience.org/courses/earth-analytics/get-data-using-apis/sentiment-analysis-of-twitter-data-r/
Rachael Tatman. “NLP in R: Topic Modelling.” https://www.kaggle.com/rtatman/nlp-in-r-topic-modelling