Text mining is a qualitative analysis method that allows us to extract keywords or insights from text data. Text data is unstructured and must be cleaned and manipulated before any analysis can be done. Once the text is cleaned, that is, stripped of uninformative content such as punctuation and common terms like “of”, “the”, and “is” (known as stop words), we can summarize and visualize the characteristics of the remaining text. These characteristic words can then be communicated using frequency tables, plots, or word clouds. We might also be interested in which words appear together or are correlated. In addition, we can analyze the text to determine whether the subject matter is positive, negative, neutral, or expresses some other emotion.
# Load libraries
library(tm) # text mining package
library(wordcloud) # word cloud generator
library(SnowballC) # text stemming
library(ggplot2) # graphs
library(tidyverse) # data manipulation
library(tidytext) # word lexicon dictionaries for sentiments
library(reshape2) # data transformation
library(textdata) # provides access to lexicon dictionaries
library(knitr) # used to make kable tables

The text data used for this analysis is the “I Have a Dream” speech delivered by Martin Luther King Jr. on August 28, 1963. The text of the speech was copied and pasted into a text editor and converted to plain text format before being imported into R. Data source: https://www.americanrhetoric.com/speeches/mlkihaveadream.htm. Note: results from any data analysis will vary depending on the source of the data and the methods used to analyze it.
# read data file
text_df <- readLines("~/Rpubs/Speech/MLKdream.txt")

A corpus is simply a collection of documents. The documents can be text from speeches, books, news articles, product reviews, etc. A corpus is created from the speech text below.
# Create a word corpus
uncleaned_corpus<- Corpus(VectorSource(text_df))
uncleaned_corpus
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 42
# Get all the documents in the corpus
# inspect(uncleaned_corpus)

As this output shows, a corpus was created with 42 documents, where each document is a paragraph from the speech. If you want to see all the documents contained in a corpus, run the inspect() function, which takes the corpus as its argument.
Let’s take a look at the fourth document of the uncleaned corpus; the first seven lines of its wrapped text are shown below. As expected, there are commas, periods, quotation marks, and capital letters in the text. These are some of the uninformative elements that will need to be removed before any analysis can be performed.
writeLines(head(strwrap(uncleaned_corpus[[4]]), 7))
## In a sense we've come to our nation's capital to cash a check. When the
## architects of our republic wrote the magnificent words of the
## Constitution and the Declaration of Independence, they were signing a
## promissory note to which every American was to fall heir. This note was
## a promise that all men, yes, black men as well as white men, would be
## guaranteed the "unalienable Rights" of "Life, Liberty and the pursuit
## of Happiness." It is obvious today that America has defaulted on this
In this step, the text is converted to lowercase; numbers, stop words, and punctuation are removed; and unnecessary white space is stripped. The order of the cleaning steps is important. For example, if punctuation is removed before the SMART stop words, then contractions like “you’ll” or “we’ve” would not be removed.
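As a quick illustration of why the order matters, here is a minimal sketch on a made-up sentence (it is not part of the speech analysis): removing punctuation first strips the apostrophe, so the SMART stop word list no longer matches the contraction.

# Toy example (hypothetical sentence): cleaning order matters
toy <- Corpus(VectorSource("we've come to cash a check"))
# Punctuation first: "we've" becomes "weve", which the SMART list does not match
punct_first <- tm_map(toy, removePunctuation)
punct_first <- tm_map(punct_first, removeWords, stopwords("SMART"))
writeLines(strwrap(punct_first[[1]])) # leaves "weve cash check"
# SMART stop words first: the contraction is removed as intended
stop_first <- tm_map(toy, removeWords, stopwords("SMART"))
stop_first <- tm_map(stop_first, removePunctuation)
writeLines(strwrap(stop_first[[1]])) # leaves "cash check"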
# Clean text file and pre-process for word cloud
# Convert to lowercase
clean_corpus <- tm_map(uncleaned_corpus, content_transformer(tolower))
# Remove numbers
clean_corpus <- tm_map(clean_corpus, removeNumbers)
# Remove conjunctions etc.: "and", "the", "of"
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords("english"))
# Remove words like "you'll", "will", "anyways", etc.
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords("SMART"))
# Remove commas, periods, etc.
clean_corpus <- tm_map(clean_corpus, removePunctuation)
# Strip unnecessary whitespace
clean_corpus <- tm_map(clean_corpus, stripWhitespace)
# Customize your own list of words for removal
clean_corpus <- tm_map(clean_corpus, removeWords, c("tis"))
# inspect(clean_corpus)

Now that the data has been cleaned, let’s take a look at the same document. We see that the document no longer contains the unnecessary text.
writeLines(head(strwrap(clean_corpus[[4]])))
## sense nation capital cash check architects republic wrote magnificent
## words constitution declaration independence signing promissory note
## american fall heir note promise men black men white men guaranteed
## unalienable rights life liberty pursuit happiness obvious today america
## defaulted promissory note citizens color concerned honoring sacred
## obligation america negro people bad check check back marked
The word corpus is now converted to a term-document matrix, in which the rows correspond to terms and the columns correspond to documents. The frequency table quantifies the terms: it shows each word and the number of times it occurs in the data.
# Create data frame with words and frequency of occurrence
tdm = TermDocumentMatrix(clean_corpus)
tdm2 = as.matrix(tdm)
words = sort(rowSums(tdm2), decreasing = TRUE)
df = data.frame(word = names(words), freq = words)
dim(df)
## [1] 412   2
# Word frequency table
head(df, 10)
##                word freq
## freedom     freedom   20
## negro         negro   15
## ring           ring   12
## nation       nation   11
## day             day   11
## back           back    9
## dream         dream    9
## justice     justice    8
## satisfied satisfied    8
## today         today    7
We can visualize the frequency of words using different methods.
A word cloud is a visual representation of word frequency and a useful tool for identifying the focus of written material. The word cloud for the “I Have a Dream” speech is shown below. The more often a term appears in the text, the larger the word appears in the image. The cloud shows that “freedom” and “negro” are the two most frequent words.
The word frequency plot is simply a visual representation of the frequency table: in a bar plot, the length of each bar represents the frequency of the corresponding word.
When MLK spoke about “freedom”, “dream”, and the “negro”, what other terms did he use? A word correlation plot shows which terms tend to occur in the same documents (here, paragraphs) as a given word.
Sentiment analysis allows us to evaluate the opinion or emotion expressed in text. The tidytext package (together with textdata) provides access to three general-purpose sentiment lexicons: NRC, Bing, and AFINN.
All three lexicons are based on unigrams (single words). The NRC lexicon categorizes words into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The Bing lexicon classifies words as either positive or negative. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
The distribution of negative and positive words in each of the three lexicons is shown below (a sketch of how these counts could be tabulated follows the output). All three lexicons contain more negative than positive words, but the ratio of negative to positive words is higher in the AFINN and Bing lexicons than in the NRC lexicon. This, together with systematic differences in which words each lexicon matches, means the results will vary depending on which lexicon is used.
NRC lexicon:
# A tibble: 2 × 2
sentiment n
<chr> <int>
1 negative 3318
2 positive 2308
Bing lexicon:
# A tibble: 2 × 2
sentiment n
<chr> <int>
1 negative 4781
2 positive 2005
AFINN lexicon:
     word               value
 Length:2477        Min.   :-5.0000
 Class :character   1st Qu.:-2.0000
 Mode  :character   Median :-2.0000
                    Mean   :-0.5894
                    3rd Qu.: 2.0000
                    Max.   : 5.0000
# A tibble: 2 × 2
sentiment total
<chr> <int>
1 negative 1598
2 positive 878
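The code that produced these lexicon summaries is not shown above. One way they could be tabulated (a sketch using the get_sentiments() function from tidytext, with the NRC and AFINN lexicons downloaded through textdata on first use) is:

# NRC: count the positive and negative entries (the lexicon also has eight emotion categories)
get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)
# Bing: entries are already labeled positive or negative
get_sentiments("bing") %>%
  count(sentiment)
# AFINN: inspect the score range, then label each word by the sign of its score
summary(get_sentiments("afinn"))
get_sentiments("afinn") %>%
  mutate(sentiment = if_else(value < 0, "negative", "positive")) %>%
  count(sentiment, name = "total")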
Cross-matching the words from the speech with the NRC lexicon returned a total of 55 negative and 69 positive words. The top 20 words that contribute to these sentiments are shown in the plot below.
The calculated result indicates that the net sentiment of the speech (the positive count minus the negative count) is positive.
| negative | positive | net.sentiment |
|---|---|---|
| 55 | 69 | 14 |
The comparison word cloud depicts all the words that contribute to positive and negative sentiments according to the NRC lexicon.
Cross-matching the words from the speech with the Bing lexicon returned a total of 54 negative and 45 positive words. The top 20 words that contribute to these sentiments are shown in the plot below.
The calculated result indicates that the net sentiment of the speech is negative.
| negative | positive | net.sentiment |
|---|---|---|
| 54 | 45 | -9 |
The comparison word cloud depicts all the words that contribute to positive and negative sentiments according to the Bing lexicon.
Cross-matching the words from the speech with the AFINN lexicon returned a total of 22 negative words and 34 positive words. The top 20 words that contribute to these sentiments are shown in the plot below.
The calculated result indicates that the net sentiment of the speech is positive. Note: remember that the AFINN lexicon scores each word, so the values in the table below are sums of those scores rather than word counts, and the net sentiment is the positive total minus the absolute value of the negative total.
| negative | positive | net.sentiment |
|---|---|---|
| -50 | 61 | 11 |
The comparison word cloud depicts all the words that contribute to positive and negative sentiments according to the AFINN lexicon.
# Create word cloud
set.seed(1000)
wordcloud(clean_corpus
, scale=c(5,0.5) # Set min and max scale
, max.words=200 # Set top n words
, random.order=FALSE # Words in decreasing freq
, rot.per=0.20 # % of vertical words
, use.r.layout=FALSE # Use C++ collision detection
, colors=brewer.pal(8, "Set2")) # other palette options: Accent, Dark2, Set1

# Plot of most frequently used words
barplot(df[1:20,]$freq, las=2, names.arg = df[1:20,]$word,
col="lightblue", main="Top 20 Most Frequent Words",
ylab="Word frequencies")# Plot of terms correlated with the word Freedom
freedom <-data.frame(findAssocs(tdm, "freedom", 0.35))
my_title <-expression(paste("Words Correlated with ", bold("Freedom")))
freedom %>% rownames_to_column() %>%
ggplot(aes(x=reorder(rowname, freedom), y=freedom)) +
geom_point(shape=20, size=3) +
coord_flip() + ylab("Correlation") + xlab("Word") +
ggtitle(my_title) + theme(plot.title = element_text(hjust = 0.5))

# Plot of terms correlated with the word Dream
dream <-data.frame(findAssocs(tdm, "dream", 0.35))
my_title <-expression(paste("Words Correlated with ", bold("Dream")))
dream %>% rownames_to_column() %>%
ggplot(aes(x=reorder(rowname, dream), y=dream)) +
geom_point(shape=20, size=3) +
coord_flip() + ylab("Correlation") + xlab("Word") +
ggtitle(my_title) + theme(plot.title = element_text(hjust = 0.5))

# Plot of terms correlated with the word Negro
negro <-data.frame(findAssocs(tdm, "negro", 0.30))
my_title <-expression(paste("Words Correlated with ", bold("Negro")))
negro %>% rownames_to_column() %>%
slice(1:40) %>% # only show 40 correlations
ggplot(aes(x=reorder(rowname, negro), y=negro)) + geom_point(shape=20,size=3) +
coord_flip() + ylab("Correlation") + xlab("Word") +
ggtitle(my_title) + theme(plot.title = element_text(hjust = 0.5))

# NRC Lexicon terms
# Get the negative and positive sentiments word list
nrc_sent <-get_sentiments("nrc") %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(word, sentiment, sort=T) %>%
ungroup()
# Inner join words with NRC lexicon
# There are 55 negative terms and 69 positive terms
nrc_df <- df %>% inner_join(nrc_sent)
# Plot of negative and positive sentiments
nrc_df %>%
group_by(sentiment) %>%
#slice_max(order_by = freq, n=10) %>%
do(head(., n=10)) %>% # top 10 rows per sentiment (20 words total)
ungroup() %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq, fill=sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment (NRC lexicon)", x=NULL) +
coord_flip()
# NRC:
nrc_df %>% group_by(sentiment) %>%
summarize(total=sum(n)) %>%
spread(sentiment, total) %>%
mutate(net.sentiment = positive - negative) %>%
kable(align = 'l')

# Generate a comparison word cloud
set.seed(123)
nrc_df %>%
acast(word ~ sentiment, value.var = "freq", fill=0) %>%
comparison.cloud(colors = brewer.pal(8,"Set1")
,scale =c(5,.5), rot.per=0.1, title.size=2, max.words=100)

# Bing Lexicon terms
bing_sent <- df %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort=T) %>%
ungroup()
# Inner join words with Bing lexicon
bing_df <- df %>% inner_join(bing_sent)
# Plot positive and negative sentiments
bing_df %>%
group_by(sentiment) %>%
do(head(., n=10)) %>% # top 10 rows per sentiment (20 words total)
ungroup() %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq, fill=sentiment)) +
geom_col(show.legend = F) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment (Bing lexicon)", x=NULL) +
coord_flip()
# Bing: There are 54 negative terms and 45 positive terms
bing_df %>% group_by(sentiment) %>%
summarize(total=sum(n)) %>%
spread(sentiment, total) %>%
mutate(net.sentiment = positive - negative) %>%
kable(align = 'l')

# Generate a comparison word cloud
set.seed(123)
bing_df %>%
acast(word ~ sentiment, value.var = "freq", fill=0) %>%
comparison.cloud(colors = brewer.pal(8,"Set1")
,scale =c(5,.5), rot.per=0.1, title.size=2, max.words=100)

# AFINN lexicon terms
afinn_df <- df %>% inner_join(get_sentiments("afinn")) %>%
mutate(sentiment = case_when(value < 0 ~ 'negative',
value > 0 ~ 'positive'))
# Plot positive and negative sentiments
afinn_df %>%
group_by(sentiment) %>%
do(head(., n=20)) %>% # top 20 rows per sentiment
ungroup() %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq, fill=sentiment)) +
geom_col(show.legend = F) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment (AFINN lexicon)", x=NULL) +
coord_flip()
# Generate a comparison word cloud
set.seed(123)
afinn_df %>%
acast(word ~ sentiment, value.var = "freq", fill=0) %>%
comparison.cloud(colors = brewer.pal(7,"Set1")
,scale =c(5,.5), rot.per=0.10, title.size=2, max.words=100)

# Compute the AFINN net sentiment
afinn_net <- afinn_df %>%
group_by(sentiment) %>%
summarize(total=sum(value)) %>%
spread(sentiment, total) %>%
mutate(net.sentiment = positive - abs(negative)) %>%
kable(align = 'l')
afinn_net

Reference: Text Mining with R by Julia Silge and David Robinson (O'Reilly), 2017.