Hello and welcome!
Text mining, also known as text analytics, is an artificial intelligence (AI) technology that uses natural language processing (NLP) techniques to transform unstructured text into normalized, structured data from which business analysts can extract business intelligence.
In this project, I will pull text data from Twitter using its Application Programming Interface (API), then process the data and extract meaningful information. I will pull the most recent 2,000 tweets each that mention president "Biden" and president "Trump". Next, word clouds will be drawn to get a sense of the most frequent terms in the tweets. Finally, I will conduct sentiment analysis to assess the emotions behind the tweets and run topic analysis to see how many topics are discussed in these tweets and what they are.
If you have any questions about this project, you can reach me at yduan@newhaven.edu.
Please utilize the navigator (outline) on the left to jump to any of the analyses.
If you don't have the following libraries, please use the install.packages() function to install them.
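For example, any missing packages can be installed in one call (topicmodels is also needed later for the LDA topic models):
# install.packages(c("dplyr", "twitteR", "tm", "wordcloud", "RColorBrewer",
#                    "ldatuning", "quanteda", "tidytext", "tidyverse",
#                    "ggplot2", "topicmodels"))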
library(dplyr) ## accessing to select () and pipes
library(twitteR) ## working with twitter
library(tm) ## text manipulation for text mining
library(wordcloud) ## creating word cloud
library(RColorBrewer) ## color palette for word cloud
library(ldatuning) ## finding number of topics
library(quanteda) ## topic analysis
library(tidytext) ## text manipulation for text mining
library(tidyverse) ## text manipulation for text mining
library(ggplot2) ## data visualization

In order to pull data from Twitter, you will need to use its API to connect R with Twitter. There are many YouTube tutorials on this, e.g. https://www.youtube.com/watch?v=vlvtqp44xoQ. The process is very simple.
Basically, you need to get four "passwords" (keys and tokens) that are unique to you so that R can use them to log you in to Twitter. I hid my passwords in this demonstration; you should have four of your own after you successfully create/generate them following the YouTube tutorial.
Below is some dummy code to demonstrate what your code is supposed to look like:
# consumer_key <- 'acbefg'
# consumer_secret <- 'adfasd'
# access_token <- 'adsfasdga'
# access_secret <- 'fgsdfgsdfg'

Once you have the Twitter "tokens" (passwords), you can use the twitteR package's setup_twitter_oauth() function to establish a connection from R to Twitter.
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
I will be pulling 2,000 tweets each using #Trump and #Biden. Then I will store them in variables so that I can use them later for other analyses. I only need to do this once.
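Since the search only needs to run once, one option (my addition, not part of the original workflow; the file names are just placeholders) is to cache the results of the chunk below with saveRDS() and reload them in later sessions instead of querying the API again:
# saveRDS(trump.tw, "trump_tweets.rds")    # run once, right after searchTwitter()
# saveRDS(biden.tw, "biden_tweets.rds")
# trump.tw <- readRDS("trump_tweets.rds")  # in later sessions, reload instead of re-querying
# biden.tw <- readRDS("biden_tweets.rds")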
## this is searching for tweets -- 2000 of them, in English
trump.tw <- searchTwitter('trump', n= 2000, lang = 'en')
biden.tw <- searchTwitter('biden', n = 2000, lang = 'en')

Now I will get the text out of each tweet and save it to an object. Then I use the iconv() function to convert the text from UTF-8 to ASCII so that we don't see special characters, weird symbols, or accent marks; converting to ASCII is also how the emoticons in the texts get removed.
For additional information/reading about UTF-8, see here: https://en.wikipedia.org/wiki/UTF-8
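As a small illustration of how iconv() behaves (a toy string, not tweet data): without the sub argument, a string containing unconvertible characters becomes NA, while sub = "" strips just those characters and keeps the rest.
x <- "I voted \U0001F1FA\U0001F1F8"      # toy text containing an emoji flag
iconv(x, "UTF-8", "ASCII")               # NA -- the element cannot be fully converted
iconv(x, "UTF-8", "ASCII", sub = "")     # "I voted " -- emoji dropped, text kept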
trump.text <- sapply(trump.tw, function(x) x$getText())
trump.text <- iconv(trump.text, 'UTF-8', 'ASCII')
trump.corpus <- Corpus(VectorSource(trump.text))

Once I have the corpus to work with, I need to clean the data so that all the words are in a workable format: without special characters, punctuation, extra white space, or the many stop words like "a", "the", "an", etc.
removePunctuation(), stripWhitespace(), and a custom removeURL() function are the cleaning steps I am using here.
## remove punctuation
trump.corpus <- tm_map(trump.corpus, removePunctuation)
# eliminate extra white spaces
trump.corpus <- tm_map(trump.corpus, stripWhitespace)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
trump.corpus <- tm_map(trump.corpus, content_transformer(removeURL))
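To see why the punctuation and URL steps work together (a toy string, not from the data): removePunctuation() first collapses a tweet URL into a single alphanumeric token, and removeURL() then deletes anything that starts with "http".
x <- "great news https://t.co/abc123"
x <- removePunctuation(x)   # "great news httpstcoabc123"
removeURL(x)                # "great news " -- the URL token is gone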
## creating the term-document matrix from the corpus;
## the control argument takes a list of the cleaning actions you want to perform
term.doc.matrix <- TermDocumentMatrix(trump.corpus,
control = list(
removePunctuation = T,
stopwords = c('trump','president','trumps','https','just',
'httpstcokgckgcfx','etc',
'donald','white','house','now',"biden",
stopwords('english')),
removeNumbers=T,
tolower=T))
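Before converting it to a plain matrix, you can peek at a slice of the (sparse) term-document matrix with tm's inspect() function; this check is optional and not part of the original code.
# inspect(term.doc.matrix[1:5, 1:5])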
## removed some customized stop words and numbers, and converted all words to lower case
## this object is not a plain matrix yet, so I need to turn it into one
term.doc.matrix <- as.matrix(term.doc.matrix)
## getting the word counts
word.freq <- sort(rowSums(term.doc.matrix), decreasing = T)
dm <- data.frame(word=names(word.freq), freq = word.freq)
head(dm, 5)
##            word freq
## gaetz     gaetz   76
## america america   72
## matt       matt   69
## wrong     wrong   64
## coming   coming   64
## creating a word cloud with the wordcloud library,
## using a Brewer color palette
wordcloud(dm$word, dm$freq, random.order = F, min.freq = 1,
          max.words = 500, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Now I will repeat the same process for the tweets about Biden. Since I am mostly performing the same procedures, I won't add much explanation.
biden.text <- sapply(biden.tw, function(x) x$getText())
biden.text <- iconv(biden.text, 'UTF-8', 'ASCII')
biden.corpus <- Corpus(VectorSource(biden.text))
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
biden.corpus <- tm_map(biden.corpus, content_transformer(removeURL))
## the customized stop words for Biden are a bit different from Trump's
term.doc.matrix2 <- TermDocumentMatrix(biden.corpus,
control = list(removePunctuation = T,
stopwords = c('biden','bidens','president','trump',
'https','white','house','httpstcozrqolm','joe',
stopwords('english')),
removeNumbers=T,
tolower=T))
term.doc.matrix2 <- as.matrix(term.doc.matrix2)
word.freq2 <- sort(rowSums(term.doc.matrix2), decreasing = T)
dm2 <- data.frame(word=names(word.freq2), freq = word.freq2)
head(dm2, 5)
##                          word freq
## fallontonight   fallontonight  194
## amp                       amp  102
## taylor                 taylor   99
## taylorsversion taylorsversion   97
## tcongpsyd           tcongpsyd   97
## rot.per controls the proportion of rotated words;
## using a different color palette this time
wordcloud(dm2$word, dm2$freq, random.order = F, min.freq = 1, max.words = 500, rot.per = 0.35,
          colors = brewer.pal(8, "Spectral"))

Sentiment analysis is also known as opinion mining. It is the process of determining the emotional tone behind a series of words, and it is used to gain an understanding of the attitudes, opinions, and emotions expressed in text.
Below, I will take the tweets mentioning Biden and Trump and put them into data frames. After that I will clean the data again, showing the number of words left after each step, and then visualize the sentiments by words and by sections. I am performing the same steps for both presidents, so I will only add comments for one of them.
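To show how word-level sentiment tagging works before applying it to the tweets, here is a tiny toy example (my own words, not tweet data; get_sentiments("nrc") may prompt a one-time lexicon download via the textdata package):
## each word is matched against the NRC lexicon and can carry several emotions
tibble(word = c("love", "angry", "win", "terrible")) %>%
  inner_join(get_sentiments("nrc"), by = "word")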
## twListToDF() is from the twitteR package
trump.df <- twListToDF(trump.tw)
biden.df <- twListToDF(biden.tw)

dim() shows the number of rows and columns in a data frame.
###
trump.df = trump.df %>% select(text)
biden.df = biden.df %>% select(text)
## breaking the text down into single words, one per row
## the unnest_tokens() function is from the tidytext package
trump.df = trump.df %>% unnest_tokens(word, text)
dim(trump.df) ## [1] 40351 1
biden.df = biden.df %>% unnest_tokens(word, text)
dim(biden.df) ## [1] 39231 1
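As a quick illustration of what unnest_tokens() does (a toy sentence, not part of the analysis), each word becomes its own row, lower-cased and with punctuation stripped:
tibble(text = "Text mining is FUN!") %>%
  unnest_tokens(word, text)
## returns four rows: "text", "mining", "is", "fun"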
# create sets of customized stop words to remove from the data frames
custom_stop_words <- tibble(word = c("trump","rt","https",
"t.co","predident","biden","one","donald"))
custom_stop_words2 <- tibble(word = c("biden","joe","white",
"house","administration",
"rt", "https", "t.co","biden's","trump"))
trump.df2 = trump.df %>% anti_join(get_stopwords())
dim(trump.df2) ## [1] 26962 1
trump.df2 = trump.df2 %>% anti_join(custom_stop_words)
dim(trump.df2) ## [1] 22223 1
### Biden
biden.df2 = biden.df %>% anti_join(get_stopwords())
dim(biden.df2) ## [1] 27093 1
biden.df2 = biden.df2 %>% anti_join(custom_stop_words2)
dim(biden.df2) ## [1] 21709 1
# fct_reorder() is from forcats (loaded with the tidyverse)
# display the top 20 words in the tweets
trump.df2 %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
ggplot(aes(fct_reorder(word,n), n, fill = as.factor(n)))+
geom_col() +
coord_flip() + ggtitle("Top 20 Words in Tweets about Trump")+
theme(legend.position = "none")

biden.df2 %>%
count(word, sort = TRUE) %>%
top_n(20) %>%
ggplot(aes(fct_reorder(word,n), n, fill = as.factor(n)))+
geom_col() +
coord_flip() + ggtitle("Top 20 Words in Tweets about Biden")+
theme(legend.position = "none")

## adding line numbers to the data frame
linenumber1 = 1:nrow(trump.df2)
trump.df2$linenumber = linenumber1
## breaking all the words into 11 sections
section = rep(c(1,2,3,4,5,6,7,8,9,10,11), each = 10,
times = round(nrow(trump.df2)/100))
section = as.data.frame(section)
section2 = slice(section, 1:nrow(trump.df2))
trump.df2 = trump.df2 %>% arrange(linenumber) %>%
cbind.data.frame(section2)
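To make the sectioning logic concrete (illustration only): each run of 10 consecutive words gets the same section label, cycling through 1-11.
head(rep(1:11, each = 10, times = 2), 25)
##  [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3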
## visualize sentiments by words
trump.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
ggplot(aes(x = linenumber, y = n, fill = as.factor(sentiment))) +
geom_col()+ggtitle("Sentiments in Tweets about Trump by Words")+
theme(legend.position = "none")

## visualize sentiments by sections
trump.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col()+ ggtitle("Sentiments in Tweets about Trump by Sections")+
theme(legend.position = "none")

## visualize sentiment by negative and positive
trump.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sent=positive-negative) %>%
ggplot(aes(x = linenumber, y = sent, fill = as.factor(sent))) +
geom_col()+ggtitle("Negative and Positive Sentiments in Tweets about Trump")+
theme(legend.position = "none")

# how does the emotion/sentiment mix change across the tweets
trump.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col(position = "stack") + ggtitle("Emotion % Change in Tweets about Trump")

## top 10 words that contribute to sentiment
trump.df2 %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(aes(fill=sentiment)) +
facet_wrap(~sentiment, scale="free_y") +
coord_flip()+ggtitle("Top 10 Words Contribute to Sentiment -Trump")+
theme(legend.position = "none")

## adding line numbers to the data frame
linenumber2 = 1:nrow(biden.df2)
biden.df2$linenumber = linenumber2
## breaking all the words into 11 sections
section = rep(c(1,2,3,4,5,6,7,8,9,10,11), each = 10,
times = round(nrow(biden.df2)/100))
section = as.data.frame(section)
section2 = slice(section, 1:nrow(biden.df2))
biden.df2 = biden.df2 %>% arrange(linenumber) %>%
cbind.data.frame(section2)
## visualize sentiments by words
biden.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
ggplot(aes(x = linenumber, y = n, fill = as.factor(sentiment))) +
geom_col()+ggtitle("Sentiments in Tweets about Biden by Words")+
theme(legend.position = "none")

## visualize sentiments by sections
biden.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col()+ggtitle("Sentiments in Tweets about Biden by Sections")+
theme(legend.position = "none")

## visualize sentiment by negative and positive
biden.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(linenumber, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sent=positive-negative) %>%
ggplot(aes(x = linenumber, y = sent, fill = as.factor(sent))) +
geom_col()+ggtitle("Negative and Positive Sentiments in Tweets about Biden")+
theme(legend.position = "none")

# how does the emotion/sentiment mix change across the tweets
biden.df2 %>% inner_join(get_sentiments("nrc"), by = "word") %>%
count(section, sentiment) %>%
ggplot(aes(x = section, y = n, fill = as.factor(sentiment))) +
geom_col(position = "stack") + ggtitle("Emotion % Change in Tweets about Biden")

biden.df2 %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col(aes(fill=sentiment)) +
facet_wrap(~sentiment, scale="free_y") +
coord_flip()+ggtitle("Top 10 Words Contribute to Sentiment - Biden")+
theme(legend.position = "none")
## Joining, by = "word"
## Selecting by n
Topic analysis is a natural language processing (NLP) technique that automatically extracts meaning from texts by identifying recurrent themes or topics.
Below, I will first count the words in each of the Trump and Biden data frames created above, then use the ldatuning package's FindTopicsNumber() function to find how many topics there are. Finally, I will visualize the words that go into each topic.
library(quanteda)
library(tidytext)
quanteda is a package used specifically for quantitative text analysis
word_counts = trump.df2 %>% count(linenumber, word)
names(word_counts)[1] = "id"
## cast_dtm is a function in tidytext
dtm = word_counts %>% cast_dtm(id,word, n)
dfm = word_counts %>% cast_dfm(id,word, n)
# use FindTopicsNumber() from the ldatuning package
topics_found = ldatuning::FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 7, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009",
              "Arun2010", "Deveaud2014")
)
FindTopicsNumber_plot(topics_found) # it looks like the optimal number of topics is 4, so we will use k = 4
# here I use the LDA engine from the topicmodels package
trump_lda = topicmodels::LDA(dtm, k = 4, method = "Gibbs")
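Gibbs sampling is stochastic, so the exact topics can differ from run to run; if you want reproducible topics, one option (not used in the original run) is to pass a seed through the control argument:
# trump_lda = topicmodels::LDA(dtm, k = 4, method = "Gibbs",
#                              control = list(seed = 1234))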
#quick peek ######################
TopicTerms <- topicmodels::terms(trump_lda, 5)
TNames <- apply(TopicTerms, 2, paste, collapse=" ")
( topicNames = as.data.frame(TNames) )
##                                        TNames
## Topic 1 things now trump’s hate right
## Topic 2 president gaetz america former greene
## Topic 3 gop just matt biggest white
## Topic 4 amp trump's people taylor three
# the betas are the probabilities of appearance of each word - for each topic.
#Next step: obtain the "beta value" which is the "contribution level"
#of each word to each topic.
trump_topics_beta = tidy(trump_lda, matrix = "beta")
#and display the top 10 visually
#first organize the dataframe
trump_top_terms = trump_topics_beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
trump_top_terms
## # A tibble: 41 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 things 0.0177
## 2 1 now 0.0154
## 3 1 trump’s 0.0154
## 4 1 hate 0.0120
## 5 1 right 0.0117
## 6 1 remember 0.0117
## 7 1 coming 0.0102
## 8 1 knows 0.0101
## 9 1 majorie 0.00958
## 10 1 great 0.00941
## # ... with 31 more rows
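As a quick sanity check (my addition, not part of the original analysis), the beta values within each topic are probabilities over the vocabulary, so they should sum to 1 for every topic:
trump_topics_beta %>%
  group_by(topic) %>%
  summarise(total = sum(beta))
## each of the 4 topics should show total = 1 (up to rounding)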
trump_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") + ggtitle("Top Terms in Tweets about Trump") +
coord_flip()

word_counts = biden.df2 %>% count(linenumber, word)
names(word_counts)[1] = "id"
dtm2 = word_counts %>% cast_dtm(id,word, n)
dfm2 = word_counts %>% cast_dfm(id,word, n)
# use FindTopicsNumber() from the ldatuning package
topics_found2 = ldatuning::FindTopicsNumber(
  dtm2,
  topics = seq(from = 2, to = 7, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009",
              "Arun2010", "Deveaud2014")
)
FindTopicsNumber_plot(topics_found2) # it looks like the optimal number of topics is 4, so we will use k = 4
# here I use the LDA engine from the topicmodels package
biden_lda = topicmodels::LDA(dtm2, k = 4, method = "Gibbs")
#quick peek ######################
TopicTerms <- topicmodels::terms(biden_lda, 5)
TNames <- apply(TopicTerms, 2, paste, collapse=" ")
( topicNames = as.data.frame(TNames) )
##                                              TNames
## Topic 1 hunter budget border just taylorsversion
## Topic 2 taylor fallontonight guns harris like
## Topic 3 amp president supreme take fallontonight
## Topic 4 court now biden’s hunter news
# the betas are the probabilities of appearance of each word - for each topic.
#Next step: obtain the "beta value" which is the "contribution level"
#of each word to each topic.
biden_topics_beta = tidy(biden_lda, matrix = "beta")
#and display the top 10 visually
#first organize the dataframe
biden_top_terms = biden_topics_beta %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
biden_top_terms
## # A tibble: 40 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 hunter 0.0348
## 2 1 budget 0.0151
## 3 1 border 0.0144
## 4 1 just 0.0131
## 5 1 taylorsversion 0.0127
## 6 1 one 0.0102
## 7 1 ngp808syd0 0.00888
## 8 1 can 0.00804
## 9 1 crisis 0.00771
## 10 1 get 0.00654
## # ... with 30 more rows
biden_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +ggtitle("Top Terms in Tweets about Biden") +
coord_flip()

A deep thank you to all my professors who taught me how to use R and R Markdown!