Twitter API Analysis
John Watters
4/8/2021
Unsupervised machine learning is commonly associated with any algorithm that can infer or learn patterns from “untagged” data. Sentiment analysis is one such tool: it analyzes text for positive, negative, or neutral words or phrases based on trained models or lexicons. Using these tools, a machine can automatically detect sentiment (or attitude, further broken down into emotions) without human input.
This method can be extremely valuable for extracting actionable insights for individuals or organizations, since it looks beyond review scores, sales figures, and the like; it can support upselling, identify triggers, reduce churn, and more generally track satisfaction or opinion.
Today, I will walk through a quick primer on how to do a Twitter API pull on topics of interest, visualize the top terms, and then conduct a quick sentiment and topic analysis on those tweets.
Loading the required packages and setting working directory:
remove(list = ls())
suppressWarnings(suppressMessages(library(twitteR))) #twitter API access
suppressWarnings(suppressMessages(library(stringr))) #string work functions
suppressWarnings(suppressMessages(library(dplyr))) #data frame tool
suppressWarnings(suppressMessages(library(ngram))) #tokenizing tool
suppressWarnings(suppressMessages(library(tidytext))) #text mining tool
suppressWarnings(suppressMessages(library(tinytex))) #latex helper
suppressWarnings(suppressMessages(library(tm))) #text mining framework
suppressWarnings(suppressMessages(library(wordcloud))) #visualization tool
suppressWarnings(suppressMessages(library(RColorBrewer))) #colors
suppressWarnings(suppressMessages(library(ldatuning))) #topic fitting tool
suppressWarnings(suppressMessages(library(quanteda))) #text analysis tool
suppressWarnings(suppressMessages(library(tidyverse))) #data rep package
suppressWarnings(suppressMessages(library(ggplot2))) #visualizations
setwd("E:/R")
Authenticating the API; the code and keys are masked. This is done with the ‘setup_twitter_oauth’ command, which takes your consumer key & secret and access token & secret. These are provided when you create a Twitter developer account. Note: if your output is a shared markdown document, make sure ‘echo’ is set to FALSE for this chunk, otherwise anyone can see your keys.
## [1] "Using direct authentication"
Pulling tweets with the ‘searchTwitter’ function on two different topics: 3,000 tweets containing “Gaetz”, for sentiment analysis on the ongoing allegations and investigation into the congressman, and 3,000 containing “Tax”, to analyze sentiment on Biden’s corporate tax hike proposal. The date range for the latter is important, as the proposal only started to gain traction on social media a few days ago. If the range is left blank, the API will only draw from about a week’s worth of tweets for normal users, so if you want to analyze a social issue that is months old, you will not be able to do it with these same functions.
set.seed(12345)
gaetz <- searchTwitter('gaetz', n=3000, lang = 'en', since = "2021-04-01", until = "2021-04-08")
taxes <- searchTwitter('tax', n=3000, lang = 'en', since = "2021-04-05")
## [1] "Rate limited .... blocking for a minute and retrying up to 119 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 118 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 117 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 116 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 115 times ..."
## [1] "Rate limited .... blocking for a minute and retrying up to 114 times ..."
Next I will convert the lists obtained into data frames. When working with tweets, these data frames are split into 16 variables, including text, id, favorite count, retweet count, etc., shown below with the ls() function. I will also include a write.csv call for each data frame for later use. This is recommended if your project will extend beyond the searchable window for the topic you’re looking to analyze, or if you don’t want to keep adjusting your stop word or topic lists:
gaetzDF = twListToDF(gaetz)
taxesDF = twListToDF(taxes)
#To show variables:
ls(gaetzDF)
## [1] "created" "favoriteCount" "favorited" "id"
## [5] "isRetweet" "latitude" "longitude" "replyToSID"
## [9] "replyToSN" "replyToUID" "retweetCount" "retweeted"
## [13] "screenName" "statusSource" "text" "truncated"
#Writing csv's:
#write.csv(gaetzDF, file = "gaetzDF.csv")
#write.csv(taxesDF, file = "taxesDF.csv")
#Unnesting tokens:
gaetzcsv = unnest_tokens(gaetzDF, input = text, output = word, format = "text",
drop=TRUE, to_lower=TRUE)
gaetzWC = gaetzcsv$word
head(gaetzWC)
## [1] "allinwithchris" "scott_maxwell" "chrislhayes" "ishapiro"
## [5] "derricknaacp" "oliviatroye"
wordcount(gaetzWC)
## [1] 62034
taxescsv = unnest_tokens(taxesDF, input = text, output = word, format = "text",
drop=TRUE, to_lower=TRUE)
taxesWC = taxescsv$word
head(taxesWC)
## [1] "rt" "davidanicholas" "you" "know"
## [5] "you" "have"
wordcount(taxesWC)
## [1] 62047
The next essential step, especially when working with social media text, is to clean the text data before doing anything else with it. This holds for everything downstream: visualizations, sentiment analysis, etc. I want to remove all special characters, numbers, and punctuation from the text data we pulled in the chunk above. There are multiple ways to do this; the simplest with tweets is ‘gsub’, which performs string substitution given a pattern and a replacement.
The following are examples of ‘gsub’ operations on both sets of tokens (the code is left commented out here, since each operation would otherwise print a very large vector):
#For the Gaetz tweets:
#gaetzWC <- gsub("https\\S*", "", gaetzWC)
#gaetzWC <- gsub("@\\S*", "", gaetzWC)
#gaetzWC <- gsub("amp", "", gaetzWC)
#gaetzWC <- gsub("[\r\n]", "", gaetzWC)
#gaetzWC <- gsub("[[:punct:]]", "", gaetzWC)
#gaetzWC <- gsub("[[:digit:]]", "", gaetzWC)
#For the tax tweets:
#taxesWC <- gsub("https\\S*", "", taxesWC)
#taxesWC <- gsub("@\\S*", "", taxesWC)
#taxesWC <- gsub("amp", "", taxesWC)
#taxesWC <- gsub("[\r\n]", "", taxesWC)
#taxesWC <- gsub("[[:punct:]]", "", taxesWC)
#taxesWC <- gsub("[[:digit:]]", "", taxesWC)
An alternative route is to create a corpus (an electronic, unstructured set of texts for analysis). This can be built directly from the original Twitter pull, and the cleaning process then takes fewer commands. You could continue with the text cleaned so far, but I will build the term-document matrices for the word clouds below from the corpus objects:
#Gaetz tweets:
gaetz.text <- sapply(gaetz, function(x) x$getText())
gaetz.text <- iconv(gaetz.text, 'UTF-8', 'ASCII')
gaetz.corpus <- Corpus(VectorSource(gaetz.text))
#Tax tweets:
taxes.text <- sapply(taxes, function(x) x$getText())
taxes.text <- iconv(taxes.text, 'UTF-8', 'ASCII')
taxes.corpus <- Corpus(VectorSource(taxes.text))
With a corpus, the cleaning process is much quicker. You can remove special characters, white space, URLs, etc., and you can supply your own additional stop words as part of the TermDocumentMatrix command. Your discretion is important here, depending on the kind of topic you are searching for:
#Starting with the Gaetz text and piping to remove numbers,
#punctuation, and white space in one command, then following
#with a transformation to lower case and removing stop words:
gaetz.corpus <- gaetz.corpus %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
gaetz.corpus <- tm_map(gaetz.corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(gaetz.corpus, content_transformer(tolower)):
## transformation drops documents
gaetz.corpus <- tm_map(gaetz.corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(gaetz.corpus, removeWords, stopwords("english")):
## transformation drops documents
#Creating a function to remove URL's:
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
gaetz.corpus <- tm_map(gaetz.corpus, content_transformer(removeURL))
## Warning in tm_map.SimpleCorpus(gaetz.corpus, content_transformer(removeURL)):
## transformation drops documents
#Creating a term document matrix:
#(Note: When searching for a topic like Matt Gaetz, you'll
#want to remove 'Matt' and 'Gaetz' to prevent redundancy, as
#well as any other uncommon or irrelevant terms):
gaetzdtm <- TermDocumentMatrix(gaetz.corpus,
control = list(
removePunctuation = T,
stopwords = c('matt','gaetz', 'gossgoss',
'goss',
'httpstcorzorzrpupp',
stopwords('english')),
removeNumbers=T,
tolower=T))
gaetzmatrix <- as.matrix(gaetzdtm)
gaetzwords <- sort(rowSums(gaetzmatrix),decreasing=TRUE)
gaetzdf <- data.frame(word=names(gaetzwords),freq=gaetzwords)
#Checking to see the top 5 words by frequency:
head(gaetzdf, 5)
#Now for the tax text:
taxes.corpus <- taxes.corpus %>%
tm_map(removeNumbers) %>%
tm_map(removePunctuation) %>%
tm_map(stripWhitespace)
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
taxes.corpus <- tm_map(taxes.corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(taxes.corpus, content_transformer(tolower)):
## transformation drops documents
taxes.corpus <- tm_map(taxes.corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(taxes.corpus, removeWords, stopwords("english")):
## transformation drops documents
taxes.corpus <- tm_map(taxes.corpus, content_transformer(removeURL))
## Warning in tm_map.SimpleCorpus(taxes.corpus, content_transformer(removeURL)):
## transformation drops documents
taxesdtm <- TermDocumentMatrix(taxes.corpus,
control = list(
removePunctuation = T,
stopwords = c('tax','taxes','pdxeleven','joebiden',
'gregkellyusa','httpstcofhlmxwzjvi',
stopwords('english')),
removeNumbers=T,
tolower=T))
taxesmatrix <- as.matrix(taxesdtm)
taxwords <- sort(rowSums(taxesmatrix),decreasing=TRUE)
taxesdf <- data.frame(word = names(taxwords),freq=taxwords)
#Checking the top 5 words by frequency:
head(taxesdf, 5)
One of the most common ways of visualizing text data is the word cloud. The word cloud is beneficial in that it shows at a glance what your audience or target demographic is thinking or feeling about a topic or event of choice (Matt Gaetz and taxes, in our case). If used over a period of time, it can show measurable change, and word clouds typically increase audience engagement with very little effort.
#Word cloud for Gaetz:
#(Note that minimum frequency should be adjusted relative to your df)
#Also setting seed for reproducibility:
set.seed(1234)
wordcloud(words = gaetzdf$word, freq = gaetzdf$freq, min.freq = 4, scale=c(3.0,0.25),
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
#Word cloud for taxes:
wordcloud(words = taxesdf$word, freq = taxesdf$freq, min.freq = 1, scale=c(3.0,0.25),
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Sentiment analysis, at its core, is the process of determining whether a piece of writing is positive, negative, or neutral. The systems that conduct the analysis combine machine learning and natural language processing to assign weighted scores to the various parts of a text, and they can greatly assist with gaining and delivering useful, actionable insights. While it does have drawbacks, sentiment analysis plays an important role in today’s organizations.
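As a minimal illustration of how lexicon-based scoring works (on a made-up handful of words rather than our pulls), joining tokens against a sentiment lexicon such as Bing labels each matched word as positive or negative, and those labels can then be counted:
#A toy sketch of lexicon-based scoring (hypothetical words, left commented out):
#tibble(word = c("great", "terrible", "scandal", "happy")) %>%
#  inner_join(get_sentiments("bing"), by = "word") %>%
#  count(sentiment)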
While the word clouds above may offer a bit of insight into our two pulls with the frequency sizing, let’s compare initial thoughts to some deeper sentiment analysis:
#Since we have already converted our tweet list to a data frame prior to
#our write.csv command, we will use that object again here to start:
gaetzun = gaetzDF %>% select(text)
taxesun = taxesDF %>% select(text)
#Unnesting tokens again (one word per row):
gaetzun = gaetzun %>% unnest_tokens(word, text)
dim(gaetzun)
## [1] 62034 1
taxesun = taxesun %>% unnest_tokens(word, text)
dim(taxesun)
## [1] 62047 1
#we'll apply some stop words to those just like we did for the clouds:
custom_stop_words <- tibble(word = c("matt","gaetz","gossgoss","goss","httpstcorzorzrpupp","rt",
"the","a","of","is","for","to","kylegriffin1","and","https","t.co","in"))
custom_stop_words2 <- tibble(word = c("tax","pdxeleven","rt", "i", "on", "be",
"the","a","of","is","for","to", "1","kylegriffin1","and","https","t.co","in"))
gaetzdf2 = gaetzun %>% anti_join(get_stopwords())
## Joining, by = "word"
dim(gaetzdf2)
## [1] 40701 1
gaetzdf2 = gaetzun %>% anti_join(custom_stop_words)
## Joining, by = "word"
dim(gaetzdf2)
## [1] 44146 1
taxesdf2 = taxesun %>% anti_join(get_stopwords())
## Joining, by = "word"
taxesdf2 = taxesun %>% anti_join(custom_stop_words2)
## Joining, by = "word"
dim(taxesdf2)
## [1] 46086 1
In the following sentiment visual analysis, I will be using the NRC Word-Emotion Association Lexicon. This is a list of English words with their respective associations with the following eight emotions: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust, and two sentiment categories of negative and positive.
This lexicon, developed by Peter Turney and Saif M. Mohammad, has contributed to a massive amount of analysis by individuals and organizations since its introduction in 2010.
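Before applying it, it can be helpful to peek at the lexicon itself; tidytext exposes it through get_sentiments("nrc") (the textdata package will prompt to download the lexicon on first use). A quick, commented-out look might be:
#Inspecting the NRC lexicon (commented out; requires the textdata download):
#nrc <- get_sentiments("nrc")
#head(nrc)
#count(nrc, sentiment)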
#Visualizing once again our top words (15):
gaetzdf2 %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
ggplot(aes(fct_reorder(word,n), n, fill = as.factor(n)))+
geom_col() +
coord_flip() + ggtitle("Top 15 Words in Tweets about Matt Gaetz")+
theme(legend.position = "none")
## Selecting by n
#We must add line numbers to aid in analysis segmentation:
linenumber1 = 1:nrow(gaetzdf2)
gaetzdf2$linenumber = linenumber1
##Splitting words into 11 contiguous sections:
section = rep(1:11, each = ceiling(nrow(gaetzdf2)/11))
section = as.data.frame(section)
section2 = slice(section, 1:nrow(gaetzdf2))
gaetzdf2 = gaetzdf2 %>% arrange(linenumber) %>%
cbind.data.frame(section2)
#visualize sentiments by words:
gaetzdf2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
ggplot(aes(x = linenumber, y = n, fill = as.factor(sentiment))) +
geom_col()+ggtitle("Sentiments in Tweets about Gaetz by Words")+
theme(legend.position = "none")
#Now visualizing by sections:
gaetzdf2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col()+ ggtitle("Sentiments in Tweets about Gaetz by Sections")+
theme(legend.position = "none")
#Negative and positive sentiment frequency:
gaetzdf2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sent=positive-negative) %>%
ggplot(aes(x = linenumber, y = sent, fill = as.factor(sent))) +
geom_col()+ggtitle("Negative and Positive Sentiments in Tweets about Gaetz")+
theme(legend.position = "none")
#Emotion and sentiment changes over time about Gaetz:
gaetzdf2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col(position = "stack")+ ggtitle("Emotion Change in Tweets about Gaetz")
#Top 10 words that contribute to sentiment for Gaetz:
gaetzdf2 %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(aes(fill=sentiment)) +
facet_wrap(~sentiment, scale="free_y") +
coord_flip()+ggtitle("Top 10 Words Contributing to Sentiment for Gaetz")+
theme(legend.position = "none")
## Joining, by = "word"
## Selecting by n
#Visualizing top words about taxes:
taxesdf2 %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
ggplot(aes(fct_reorder(word,n), n, fill = as.factor(n)))+
geom_col() +
coord_flip() + ggtitle("Top 15 Words in Tweets about Taxes")+
theme(legend.position = "none")
## Selecting by n
#The top 15 words are not extremely helpful here, but let's continue.
#Tax line numbers:
linenumber2 = 1:nrow(taxesdf2)
taxesdf2$linenumber = linenumber2
##Splitting words into 11 contiguous sections:
section = rep(1:11, each = ceiling(nrow(taxesdf2)/11))
section = as.data.frame(section)
section2 = slice(section, 1:nrow(taxesdf2))
taxesdf2 = taxesdf2 %>% arrange(linenumber) %>%
cbind.data.frame(section2)
#visualize sentiments by words:
taxesdf2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
ggplot(aes(x = linenumber, y = n, fill = as.factor(sentiment))) +
geom_col()+ggtitle("Sentiments in Tweets about Taxes by Words")+
theme(legend.position = "none")
#Now visualizing by sections:
taxesdf2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col()+ ggtitle("Sentiments in Tweets about Taxes by Sections")+
theme(legend.position = "none")
#Negative and positive sentiment frequency:
taxesdf2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sent=positive-negative) %>%
ggplot(aes(x = linenumber, y = sent, fill = as.factor(sent))) +
geom_col()+ggtitle("Negative and Positive Sentiments in Tweets about Taxes")+
theme(legend.position = "none")
#Emotion and sentiment changes over time about taxes:
taxesdf2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col(position = "stack")+ ggtitle("Emotion Change in Tweets about Taxes")
#Top 10 words that contribute to sentiment about taxes:
taxesdf2 %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(aes(fill=sentiment)) +
facet_wrap(~sentiment, scale="free_y") +
coord_flip()+ggtitle("Top 10 Words Contributing to Sentiment about Taxes")+
theme(legend.position = "none")
## Joining, by = "word"
## Selecting by n
“Topic analysis” is a method that lets us sort large sets of text by identifying the most common and important themes or topics. The visualizations we will produce relate directly to the number of topics and the terms within each topic. For this portion we will cast our tidy word counts into document-term matrices with tidytext and rely on the ldatuning and topicmodels packages:
word_counts = gaetzdf2 %>% count(linenumber, word)
names(word_counts)[1] = "id"
#Casting a data frame to a term matrix:
dtm = word_counts %>% cast_dtm(id,word, n)
dfm = word_counts %>% cast_dfm(id,word, n)
#Finding topic numbers:
topics_found = ldatuning::FindTopicsNumber(
dtm,
topics = seq(from = 2, to = 7, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009",
"Arun2010", "Deveaud2014"))
#Visualizing
FindTopicsNumber_plot(topics_found)
#Looking at the metrics, the optimal number of topics
#is difficult to pin down and can vary with the timing of the pull.
#Most commonly, the metric curves cross or bend somewhere between
#4 and 5 topics. We'll opt for 4 here:
gaetz_lda = topicmodels::LDA(dtm, k = 4, method = "Gibbs")
TopicTerms <- topicmodels::terms(gaetz_lda, 5)
TNames <- apply(TopicTerms, 2, paste, collapse=" ")
( topicNames = as.data.frame(TNames) )
#The topic names are mostly sensible and interesting. Moving on.
#Obtaining contribution levels of words within topics:
gaetz_topics_b = tidy(gaetz_lda, matrix = "beta")
#Displaying top 10 visually:
gaetz_top_terms = gaetz_topics_b %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
gaetz_top_terms
#Visualizing these topics:
gaetz_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") + ggtitle("Top Terms in Tweets about Gaetz") +
coord_flip()
#As we can see, some of the topics seem more meaningful than others.
word_counts = taxesdf2 %>% count(linenumber, word)
names(word_counts)[1] = "id"
#Casting a data frame to a term matrix:
dtm2 = word_counts %>% cast_dtm(id,word, n)
dfm2 = word_counts %>% cast_dfm(id,word, n)
#Finding topic numbers:
topics_found2 = ldatuning::FindTopicsNumber(
dtm2,
topics = seq(from = 2, to = 7, by = 1),
metrics = c("Griffiths2004", "CaoJuan2009",
"Arun2010", "Deveaud2014"))
#Visualizing
FindTopicsNumber_plot(topics_found2)
#Looking at the metrics, our optimal number of topics
#for taxes looks like 3 or 4 depending on the time of the pull,
#so we'll go with 4 to give a side-by-side with the 'Gaetz' pull.
taxes_lda = topicmodels::LDA(dtm2, k = 4, method = "Gibbs")
TopicTerms <- topicmodels::terms(taxes_lda, 5)
TNames <- apply(TopicTerms, 2, paste, collapse=" ")
( topicNames = as.data.frame(TNames) )
#The topic names are a little less meaningful, but let's continue.
#Obtaining contribution levels of words within topics:
taxes_topics_b = tidy(taxes_lda, matrix = "beta")
#Displaying top 10 visually:
taxes_top_terms = taxes_topics_b %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
taxes_top_terms
#Visualizing these topics:
taxes_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") + ggtitle("Top Terms in Tweets about Taxes") +
coord_flip()
#Similarly, some of these topics were more relevant or meaningful than others.
Sentiment analysis is a wonderful tool, but extracting broad data from social media platforms may not always be helpful, especially if your stop words and filters aren’t set up appropriately. While this was a fun undertaking, I think the next step toward producing value would be tying the sentiment trends to actions or reactions by corporations, entities, or the general public.
Many lessons were learned in this analysis, most notably the need to write the data frames to CSV files if you have any intention of capturing a specific period of time with the twitteR package (or want to maintain a more rigid stop word collection for that data). I would not recommend this unless you are specifically looking to store data for analyzing historical reactions, however, as social media pulls are intended to be dynamic in nature. Analyzing topics over time can also reveal trends in sentiment you won’t see in a single snapshot, which may lead to more valuable insight into public opinion.
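As a small sketch of that workflow (assuming the commented write.csv calls earlier were actually run), the saved snapshots could be reloaded later for a historical comparison:
#Reloading the earlier snapshots (file names match the write.csv calls above):
#gaetzDF <- read.csv("gaetzDF.csv", stringsAsFactors = FALSE)
#taxesDF <- read.csv("taxesDF.csv", stringsAsFactors = FALSE)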
I hope some part of this analysis was helpful.
By: John Watters ( https://www.linkedin.com/in/john-watters-04a75754/ ) For questions, reach out at jwatt1@unh.newhaven.edu
Special Thanks to Dr. Armando E. Rodriguez ( https://www.newhaven.edu/faculty-staff-profiles/armando-rodriguez.php ) and the rest of the University of New Haven Department of Economics and Business Analytics.
Additional Resources: https://rpubs.com/marschmi/RMarkdown https://cran.r-project.org/web/packages/twitteR/twitteR.pdf