Hello and welcome!
Text mining, also known as text analytics, is an artificial intelligence (AI) technology that uses natural language processing (NLP) techniques to transform unstructured text into normalized, structured data from which business analysts can extract business intelligence.
In this project, I will pull text data from Twitter using its Application Programming Interface (API), then process the data and extract meaningful information. I will pull the most recent 2,000 tweets each that mention president "Biden" and president "Trump". Next, word clouds will be drawn to get a sense of the most frequent terms in the tweets. Finally, I will conduct sentiment analysis to assess the emotions behind the tweets and run topic analysis to see how many topics are discussed in these tweets and what they are.
If you have any questions about this project, you can reach me at yduan@newhaven.edu.
Please utilize the navigator (outline) on the left to jump to any of the analyses.
If you don't have the following libraries, please use the install.packages() function to install them.
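For example, any missing packages can be installed in one call (topicmodels is also needed later for the LDA topic models):
# install.packages(c("dplyr", "twitteR", "tm", "wordcloud", "RColorBrewer",
#                    "ldatuning", "quanteda", "tidytext", "tidyverse",
#                    "ggplot2", "topicmodels"))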
library(dplyr) ## accessing to select () and pipes
library(twitteR) ## working with twitter
library(tm) ## text manipulation for text mining
library(wordcloud) ## creating word cloud
library(RColorBrewer) ## color palette for word cloud
library(ldatuning) ## finding number of topics
library(quanteda) ## topic analysis
library(tidytext) ## text manipulation for text mining
library(tidyverse) ## text manipulation for text mining
library(ggplot2) ## data visualization

In order to pull data from Twitter, you will need to use its API to connect R with Twitter. There are many YouTube tutorials on this, e.g. https://www.youtube.com/watch?v=vlvtqp44xoQ. The process is very simple.
Basically, you need to get four "passwords" (keys and tokens) that are unique to you so that R can use them to log you in to Twitter. I hid my passwords in this demonstration; you should have four of your own after you successfully create/generate them following the YouTube tutorial.
Below is some dummy code to demonstrate what your code is supposed to look like:
# consumer_key <- 'acbefg'
# consumer_secret <- 'adfasd'
# access_token <- 'adsfasdga'
# access_secret <- 'fgsdfgsdfg'

Once you have the Twitter "tokens" (passwords), you can use the twitteR package's setup_twitter_oauth() function to establish a connection from R to Twitter.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
I will be pulling 2,000 tweets each using #Trump and #Biden. Then I will store them in variables so that I can use them later for other analyses. I only need to do this once.
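Since the search only needs to run once, one option (my addition, not part of the original workflow; the file names are just placeholders) is to cache the results of the chunk below with saveRDS() and reload them in later sessions instead of querying the API again:
# saveRDS(trump.tw, "trump_tweets.rds")    # run once, right after searchTwitter()
# saveRDS(biden.tw, "biden_tweets.rds")
# trump.tw <- readRDS("trump_tweets.rds")  # in later sessions, reload instead of re-querying
# biden.tw <- readRDS("biden_tweets.rds")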
## this is searching for tweets -- 2000 of them, in English
trump.tw <- searchTwitter('trump', n= 2000, lang = 'en')
biden.tw <- searchTwitter('biden', n = 2000, lang = 'en')

Now I will get the text out of each tweet and save it to an object. Then I use the iconv() function to convert the text from UTF-8 to ASCII so that we don't see special characters, weird symbols, or accent marks; converting to ASCII is also how the emoticons in the texts get removed.
For additional information/reading about UTF-8, see here: https://en.wikipedia.org/wiki/UTF-8
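As a small illustration of how iconv() behaves (a toy string, not tweet data): without the sub argument, a string containing unconvertible characters becomes NA, while sub = "" strips just those characters and keeps the rest.
x <- "I voted \U0001F1FA\U0001F1F8"      # toy text containing an emoji flag
iconv(x, "UTF-8", "ASCII")               # NA -- the element cannot be fully converted
iconv(x, "UTF-8", "ASCII", sub = "")     # "I voted " -- emoji dropped, text kept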
trump.text <- sapply(trump.tw, function(x) x$getText())
trump.text <- iconv(trump.text, 'UTF-8', 'ASCII')
trump.corpus <- Corpus(VectorSource(trump.text))

Once I have the corpus to work with, I need to clean the data so that all the words are in a workable format: without special characters, punctuation, extra white space, or the many stop words like "a", "the", "an", etc.
removePunctuation(), stripWhitespace(), and a custom removeURL() function are the cleaning steps I am using here.
## remove punctuation
trump.corpus <- tm_map(trump.corpus, removePunctuation)
# eliminate extra white spaces
trump.corpus <- tm_map(trump.corpus, stripWhitespace)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
trump.corpus <- tm_map(trump.corpus, content_transformer(removeURL))
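To see why the punctuation and URL steps work together (a toy string, not from the data): removePunctuation() first collapses a tweet URL into a single alphanumeric token, and removeURL() then deletes anything that starts with "http".
x <- "great news https://t.co/abc123"
x <- removePunctuation(x)   # "great news httpstcoabc123"
removeURL(x)                # "great news " -- the URL token is gone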
## creating the term-document matrix from the corpus;
## the control argument takes a list of the cleaning actions you want to perform
term.doc.matrix <- TermDocumentMatrix(trump.corpus,
control = list(
removePunctuation = T,
stopwords = c('trump','president','trumps','https','just',
'httpstcokgckgcfx','etc',
'donald','white','house','now',"biden",
stopwords('english')),
removeNumbers=T,
tolower=T))
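Before converting it to a plain matrix, you can peek at a slice of the (sparse) term-document matrix with tm's inspect() function; this check is optional and not part of the original code.
# inspect(term.doc.matrix[1:5, 1:5])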
## removed some customized stop words and numbers, and converted all words to lower case
## this object is not a plain matrix yet, so I need to turn it into one
term.doc.matrix <- as.matrix(term.doc.matrix)
## getting the word counts
word.freq <- sort(rowSums(term.doc.matrix), decreasing = T)
dm <- data.frame(word=names(word.freq), freq = word.freq)
head(dm, 5)
##            word freq
## gaetz     gaetz   76
## america america   72
## matt       matt   69
## wrong     wrong   64
## coming   coming   64
## creating a word cloud with the wordcloud library,
## using a Brewer color palette
wordcloud(dm$word, dm$freq, random.order = F, min.freq = 1,
          max.words = 500, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Now I will repeat the same process for the tweets about Biden. Since I am mostly performing the same procedures, I won't add much explanation.
biden.text <- sapply(biden.tw, function(x) x$getText())
biden.text <- iconv(biden.text, 'UTF-8', 'ASCII')
biden.corpus <- Corpus(VectorSource(biden.text))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
biden.corpus <- tm_map(biden.corpus, content_transformer(removeURL))
## the customized stop words for Biden are a bit different from Trump's
term.doc.matrix2 <- TermDocumentMatrix(biden.corpus,
control = list(removePunctuation = T,
stopwords = c('biden','bidens','president','trump',
'https','white','house','httpstcozrqolm','joe',
stopwords('english')),
removeNumbers=T,
tolower=T))
term.doc.matrix2 <- as.matrix(term.doc.matrix2)
word.freq2 <- sort(rowSums(term.doc.matrix2), decreasing = T)
dm2 <- data.frame(word=names(word.freq2), freq = word.freq2)
head(dm2, 5)
##                          word freq
## fallontonight   fallontonight  194
## amp                       amp  102
## taylor                 taylor   99
## taylorsversion taylorsversion   97
## tcongpsyd           tcongpsyd   97
## rot.per controls the proportion of rotated words;
## using a different color palette this time
wordcloud(dm2$word, dm2$freq, random.order = F, min.freq = 1, max.words = 500, rot.per = 0.35,
          colors = brewer.pal(8, "Spectral"))

Sentiment analysis is also known as opinion mining. It is the process of determining the emotional tone behind a series of words, and it is used to gain an understanding of the attitudes, opinions, and emotions expressed in text.
Below, I will take the tweets mentioning Biden and Trump and put them into data frames. After that I will clean the data again, showing the number of words left after each step, and then visualize the sentiments by words and by sections. I am performing the same steps for both presidents, so I will only add comments for one of them.
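To show how word-level sentiment tagging works before applying it to the tweets, here is a tiny toy example (my own words, not tweet data; get_sentiments("nrc") may prompt a one-time lexicon download via the textdata package):
## each word is matched against the NRC lexicon and can carry several emotions
tibble(word = c("love", "angry", "win", "terrible")) %>%
  inner_join(get_sentiments("nrc"), by = "word")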
## twListToDF() is from the twitteR package
trump.df <- twListToDF(trump.tw)
biden.df <- twListToDF(biden.tw)

dim() shows the number of rows and columns in a data frame.
###
trump.df = trump.df %>% select(text)
biden.df = biden.df %>% select(text)
## breaking the text down into single words, one per row
## the unnest_tokens() function is from the tidytext package
trump.df = trump.df %>% unnest_tokens(word, text)
dim(trump.df) ## [1] 40351 1
biden.df = biden.df %>% unnest_tokens(word, text)
dim(biden.df) ## [1] 39231 1
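As a quick illustration of what unnest_tokens() does (a toy sentence, not part of the analysis), each word becomes its own row, lower-cased and with punctuation stripped:
tibble(text = "Text mining is FUN!") %>%
  unnest_tokens(word, text)
## returns four rows: "text", "mining", "is", "fun"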
# create sets of customized stop words to remove from the data frames
custom_stop_words <- tibble(word = c("trump","rt","https",
"t.co","predident","biden","one","donald"))
custom_stop_words2 <- tibble(word = c("biden","joe","white",
"house","administration",
"rt", "https", "t.co","biden's","trump"))
trump.df2 = trump.df %>% anti_join(get_stopwords())
dim(trump.df2) ## [1] 26962 1
trump.df2 = trump.df2 %>% anti_join(custom_stop_words)
dim(trump.df2) ## [1] 22223 1
### Biden
biden.df2 = biden.df %>% anti_join(get_stopwords())
dim(biden.df2) ## [1] 27093 1
biden.df2 = biden.df2 %>% anti_join(custom_stop_words2)
dim(biden.df2) ## [1] 21709 1
# fct_reorder() is from forcats (loaded with the tidyverse)
# display the top 20 words in the tweets
trump.df2 %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
ggplot(aes(fct_reorder(word,n), n, fill = as.factor(n)))+
geom_col() +
coord_flip() + ggtitle("Top 20 Words in Tweets about Trump")+
theme(legend.position = "none")

biden.df2 %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
ggplot(aes(fct_reorder(word,n), n, fill = as.factor(n)))+
geom_col() +
coord_flip() + ggtitle("Top 20 Words in Tweets about Biden")+
theme(legend.position = "none")

## adding line numbers to the data frame
linenumber1 = 1:nrow(trump.df2)
trump.df2$linenumber = linenumber1
## breaking all the words into 11 sections
section = rep(c(1,2,3,4,5,6,7,8,9,10,11), each = 10,
times = round(nrow(trump.df2)/100))
section = as.data.frame(section)
section2 = slice(section, 1:nrow(trump.df2))
trump.df2 = trump.df2 %>% arrange(linenumber) %>%
cbind.data.frame(section2)
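To make the sectioning logic concrete (illustration only): each run of 10 consecutive words gets the same section label, cycling through 1-11.
head(rep(1:11, each = 10, times = 2), 25)
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3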
## visualize sentiments by words
trump.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
ggplot(aes(x = linenumber, y = n, fill = as.factor(sentiment))) +
geom_col()+ggtitle("Sentiments in Tweets about Trump by Words")+
theme(legend.position = "none")

## visualize sentiments by sections
trump.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col()+ ggtitle("Sentiments in Tweets about Trump by Sections")+
theme(legend.position = "none")

## visualize sentiment by negative and positive
trump.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sent=positive-negative) %>%
ggplot(aes(x = linenumber, y = sent, fill = as.factor(sent))) +
geom_col()+ggtitle("Negative and Positive Sentiments in Tweets about Trump")+
theme(legend.position = "none")

# how does the emotion/sentiment mix change across the tweets
trump.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col(position = "stack") + ggtitle("Emotion % Change in Tweets about Trump")

## top 10 words that contribute to sentiment
trump.df2 %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(aes(fill=sentiment)) +
facet_wrap(~sentiment, scale="free_y") +
coord_flip()+ggtitle("Top 10 Words Contribute to Sentiment -Trump")+
theme(legend.position = "none")

## adding line numbers to the data frame
linenumber2 = 1:nrow(biden.df2)
biden.df2$linenumber = linenumber2
## breaking all the words into 11 sections
section = rep(c(1,2,3,4,5,6,7,8,9,10,11), each = 10,
times = round(nrow(biden.df2)/100))
section = as.data.frame(section)
section2 = slice(section, 1:nrow(biden.df2))
biden.df2 = biden.df2 %>% arrange(linenumber) %>%
cbind.data.frame(section2)
## visualize sentiments by words
biden.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
ggplot(aes(x = linenumber, y = n, fill = as.factor(sentiment))) +
geom_col()+ggtitle("Sentiments in Tweets about Biden by Words")+
theme(legend.position = "none")

## visualize sentiments by sections
biden.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col()+ggtitle("Sentiments in Tweets about Biden by Sections")+
theme(legend.position = "none")

## visualize sentiment by negative and positive
biden.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sent=positive-negative) %>%
ggplot(aes(x = linenumber, y = sent, fill = as.factor(sent))) +
geom_col()+ggtitle("Negative and Positive Sentiments in Tweets about Biden")+
theme(legend.position = "none")

# how does the emotion/sentiment mix change across the tweets
biden.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col(position = "stack") + ggtitle("Emotion % Change in Tweets about Biden")

biden.df2 %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(aes(fill=sentiment)) +
facet_wrap(~sentiment, scale="free_y") +
coord_flip()+ggtitle("Top 10 Words Contribute to Sentiment - Biden")+
theme(legend.position = "none")
## Joining, by = "word"
## Selecting by n
Topic analysis is a natural language processing (NLP) technique that automatically extracts meaning from texts by identifying recurrent themes or topics.
Below, I will first count the words in each of the Trump and Biden data frames created above, then use the ldatuning package's FindTopicsNumber() function to find how many topics there are. Finally, I will visualize the words that go into each topic.
library(quanteda)
library(tidytext)
quanteda is a package used specifically for quantitative text analysis
word_counts = trump.df2 %>% count(linenumber, word)
names(word_counts)[1] = "id"
## cast_dtm is a function in tidytext
dtm = word_counts %>% cast_dtm(id,word, n)
dfm = word_counts %>% cast_dfm(id,word, n)
# use FindTopicsNumber() from the ldatuning package
topics_found = ldatuning::FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 7, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009",
              "Arun2010", "Deveaud2014")
)
FindTopicsNumber_plot(topics_found) # it looks like the optimal number of topics is 4, so we will use k = 4
# here I use the LDA engine from the topicmodels package
trump_lda = topicmodels::LDA(dtm, k = 4, method = "Gibbs")
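Gibbs sampling is stochastic, so the exact topics can differ from run to run; if you want reproducible topics, one option (not used in the original run) is to pass a seed through the control argument:
# trump_lda = topicmodels::LDA(dtm, k = 4, method = "Gibbs",
#                              control = list(seed = 1234))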
#quick peek ######################
TopicTerms <- topicmodels::terms(trump_lda, 5)
TNames <- apply(TopicTerms, 2, paste, collapse=" ")
( topicNames = as.data.frame(TNames) )
##                                        TNames
## Topic 1 things now trump’s hate right
## Topic 2 president gaetz america former greene
## Topic 3 gop just matt biggest white
## Topic 4 amp trump's people taylor three
# the betas are the probabilities of appearance of each word - for each topic.
#Next step: obtain the "beta value" which is the "contribution level"
#of each word to each topic.
trump_topics_beta = tidy(trump_lda, matrix = "beta")
#and display the top 10 visually
#first organize the dataframe
trump_top_terms = trump_topics_beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
trump_top_terms
## # A tibble: 41 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 things 0.0177
## 2 1 now 0.0154
## 3 1 trump’s 0.0154
## 4 1 hate 0.0120
## 5 1 right 0.0117
## 6 1 remember 0.0117
## 7 1 coming 0.0102
## 8 1 knows 0.0101
## 9 1 majorie 0.00958
## 10 1 great 0.00941
## # ... with 31 more rows
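As a quick sanity check (my addition, not part of the original analysis), the beta values within each topic are probabilities over the vocabulary, so they should sum to 1 for every topic:
trump_topics_beta %>%
  group_by(topic) %>%
  summarise(total = sum(beta))
## each of the 4 topics should show total = 1 (up to rounding)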
trump_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") + ggtitle("Top Terms in Tweets about Trump") +
coord_flip()

word_counts = biden.df2 %>% count(linenumber, word)
names(word_counts)[1] = "id"
dtm2 = word_counts %>% cast_dtm(id,word, n)
dfm2 = word_counts %>% cast_dfm(id,word, n)
# use FindTopicsNumber() from the ldatuning package
topics_found2 = ldatuning::FindTopicsNumber(
  dtm2,
  topics = seq(from = 2, to = 7, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009",
              "Arun2010", "Deveaud2014")
)
FindTopicsNumber_plot(topics_found2) # it looks like the optimal number of topics is 4, so we will use k = 4
# here I use the LDA engine from the topicmodels package
biden_lda = topicmodels::LDA(dtm2, k = 4, method = "Gibbs")
#quick peek ######################
TopicTerms <- topicmodels::terms(biden_lda, 5)
TNames <- apply(TopicTerms, 2, paste, collapse=" ")
( topicNames = as.data.frame(TNames) )
##                                              TNames
## Topic 1 hunter budget border just taylorsversion
## Topic 2 taylor fallontonight guns harris like
## Topic 3 amp president supreme take fallontonight
## Topic 4 court now biden’s hunter news
# the betas are the probabilities of appearance of each word - for each topic.
#Next step: obtain the "beta value" which is the "contribution level"
#of each word to each topic.
biden_topics_beta = tidy(biden_lda, matrix = "beta")
#and display the top 10 visually
#first organize the dataframe
biden_top_terms = biden_topics_beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
biden_top_terms
## # A tibble: 40 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 hunter 0.0348
## 2 1 budget 0.0151
## 3 1 border 0.0144
## 4 1 just 0.0131
## 5 1 taylorsversion 0.0127
## 6 1 one 0.0102
## 7 1 ngp808syd0 0.00888
## 8 1 can 0.00804
## 9 1 crisis 0.00771
## 10 1 get 0.00654
## # ... with 30 more rows
biden_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +ggtitle("Top Terms in Tweets about Biden") +
coord_flip()

A deep thank you to all my professors who taught me how to use R and R Markdown!