S3729C Data Analytics Seminar

05 Text Analytics

Class on 28 August 2021

Introduction to Text Mining and Sentiment Analysis in R

This last section in the S3729C Data Analytics Seminar will introduce you to some basics of text mining and sentiment analysis in R. For this, we will be working with an extract of qualitative feedback data from participants of various Community2Campus events in the past year.

We will focus on the following tasks in this section:

- Loading the data and packages
- Cleaning up the text data
- Building the term-document matrix and finding the most frequent words
- Generating a word cloud
- Word association
- Sentiment scores
- Emotion classification

Loading the Data and Packages

This is the code for installing pacman, which is used to load all packages for this section. You have used it in all preceding sections (01 to 04).

install.packages("pacman",repos = "http://cran.us.r-project.org")
## package 'pacman' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\aaron_chen_angus\AppData\Local\Temp\RtmpsbuF9N\downloaded_packages

Aside from the packages loaded previously (pacman, psych, rio, magrittr, ggplot2, tidyverse), you will also be using the following packages in this section: tm, SnowballC, wordcloud, RColorBrewer and syuzhet.

pacman::p_load(pacman, psych, rio, magrittr, ggplot2, tidyverse, tm, SnowballC, wordcloud, RColorBrewer, syuzhet)

I have placed the source file for the text mining and sentiment analysis in this section on GitHub, at the link below. This time, the file is a .txt file instead of the .csv files used in the preceding sections.

https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/RawDataForTextAnalytics.txt

To load the file for analysis, we will use the readLines() function in base R, and then convert the resulting character vector into a corpus using the tm package's VectorSource() and Corpus() functions.

# Read the text file from GitHub
text <- readLines("https://raw.githubusercontent.com/aaron-chen-angus/community2campus/main/RawDataForTextAnalytics.txt")
# Load the data as a corpus
TextDoc <- Corpus(VectorSource(text))
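
Before cleaning, it can be useful to run a quick check that the file was read in correctly. The lines below are an optional sketch; the exact output depends on the data, so it is not shown here.

# Optional sanity check on the loaded data
length(text)          # number of responses (lines) read from the file
head(text, 2)         # preview the first two raw responses
inspect(TextDoc[1:2]) # preview the first two documents in the corpus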

Cleaning up Text Data

Cleaning the text data starts with transformations such as removing special characters from the text. This is done using the tm_map() function to replace special characters like /, @ and | with a space. Later steps convert the text to lower case and remove unnecessary whitespace.

Then remove the stopwords. These are the most commonly occurring words in a language and carry very little value for gaining useful information, so they should be removed before further analysis. Examples of stopwords in English are “the”, “is”, “at” and “on”. There is no single universal list of stopwords used by all NLP tools. The stopwords() function used with tm_map() and removeWords supports several languages, such as English, French, German, Italian and Spanish; note that the language names are case sensitive. I will also demonstrate how to add your own list of stopwords, which is useful in this Community2Campus example for removing non-default stopwords like “team” and “company”. Next, remove numbers and punctuation.

The last step is text stemming, the process of reducing each word to its root form so that related words share a common origin. For example, stemming reduces the words “fishing”, “fished” and “fisher” to the stem “fish”. Please note that stemming uses the SnowballC package.
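
As a quick aside, you can try the stemmer directly on a few words to see what it does. The line below is an optional sketch using the wordStem() function from SnowballC; the exact stems depend on the Snowball (Porter) algorithm, so the output is not shown here.

# Optional: see what the stemmer does to a few example words
wordStem(c("fishing", "fished", "fisher"), language = "porter")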

#Replacing "/", "@" and "|" with space
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
TextDoc <- tm_map(TextDoc, toSpace, "/")
## Warning in tm_map.SimpleCorpus(TextDoc, toSpace, "/"): transformation drops
## documents
TextDoc <- tm_map(TextDoc, toSpace, "@")
## Warning in tm_map.SimpleCorpus(TextDoc, toSpace, "@"): transformation drops
## documents
TextDoc <- tm_map(TextDoc, toSpace, "\\|")
## Warning in tm_map.SimpleCorpus(TextDoc, toSpace, "\\|"): transformation drops
## documents
# Convert the text to lower case
TextDoc <- tm_map(TextDoc, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(TextDoc, content_transformer(tolower)):
## transformation drops documents
# Remove numbers
TextDoc <- tm_map(TextDoc, removeNumbers)
## Warning in tm_map.SimpleCorpus(TextDoc, removeNumbers): transformation drops
## documents
# Remove common English stopwords
TextDoc <- tm_map(TextDoc, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(TextDoc, removeWords, stopwords("english")):
## transformation drops documents
# Remove your own stopwords
# Specify your custom stopwords as a character vector
TextDoc <- tm_map(TextDoc, removeWords, c("s", "company", "team")) 
## Warning in tm_map.SimpleCorpus(TextDoc, removeWords, c("s", "company", "team")):
## transformation drops documents
# Remove punctuation
TextDoc <- tm_map(TextDoc, removePunctuation)
## Warning in tm_map.SimpleCorpus(TextDoc, removePunctuation): transformation drops
## documents
# Eliminate extra white spaces
TextDoc <- tm_map(TextDoc, stripWhitespace)
## Warning in tm_map.SimpleCorpus(TextDoc, stripWhitespace): transformation drops
## documents
# Text stemming - which reduces words to their root form
TextDoc <- tm_map(TextDoc, stemDocument)
## Warning in tm_map.SimpleCorpus(TextDoc, stemDocument): transformation drops
## documents

Building the Term-Document Matrix

After cleaning the text data, the next step is to count the occurrence of each word to identify popular or trending topics. Using the TermDocumentMatrix() function from the tm package, you can build a term-document matrix: a table containing the frequency of each word.

The following code provides, at a glance, the top 5 most frequent words in the text.

# Build a term-document matrix
TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)
# Sort by decreasing value of frequency
dtm_v <- sort(rowSums(dtm_m),decreasing=TRUE)
dtm_d <- data.frame(word = names(dtm_v),freq=dtm_v)
# Display the top 5 most frequent words
head(dtm_d, 5)
##          word freq
## good     good  125
## work     work  119
## health health   92
## feel     feel   89
## improv improv   69

We can also visualise this by plotting the top 5 most frequent words using a bar chart.

# Plot the most frequent words
barplot(dtm_d[1:5,]$freq, las = 2, names.arg = dtm_d[1:5,]$word,
        col ="lightgreen", main ="Top 5 most frequent words",
        ylab = "Word frequencies")

We could interpret the following from the bar chart above: “good” is the most frequent word, which suggests that the overall tone of the feedback is positive, while the prominence of “work”, “health”, “feel” and “improv” (the stem of “improve”/“improvement”) suggests that the responses centre on work, health and areas for improvement.

Generate the Word Cloud

A word cloud is one of the most popular ways to visualize and analyze qualitative data. It is an image composed of keywords found within a body of text, where the size of each word indicates its frequency in that body of text.

Use the word frequency data frame (table) created previously to generate the word cloud.

Below is a brief description of the arguments used in the wordcloud() function:

- words : the words to be plotted
- freq : the corresponding word frequencies
- min.freq : words with a frequency below this value will not be plotted
- max.words : the maximum number of words to be plotted
- random.order : if FALSE, words are plotted in decreasing frequency, with the most frequent words in the centre
- rot.per : the proportion of words plotted with 90-degree rotation
- colors : the colour palette, here taken from RColorBrewer’s “Dark2” palette

You can see the resulting word cloud below.

#generate word cloud
set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 5,
          max.words=100, random.order=FALSE, rot.per=0.40, 
          colors=brewer.pal(8, "Dark2"))

The word cloud shows additional words that occur frequently and could be of interest for further analysis. Words like “need”, “support” and “issu” (the root of “issue(s)”) could provide more context around the most frequently occurring words and help to gain a better understanding of the main themes.

Word Association

Correlation is a statistical technique that can demonstrate whether, and how strongly, pairs of variables are related. This technique can be used effectively to analyze which words occur most often in association with the most frequently occurring words in the survey responses, which helps to see the context around these words.

# Find associations 
findAssocs(TextDoc_dtm, terms = c("good","work","health"), corlimit = 0.25) 
## $good
##  integr synergi 
##    0.28    0.28 
## 
## $work
## togeth 
##    0.4 
## 
## $health
##    declin    happen      noth      real sentiment    suppli      wors 
##      0.29      0.29      0.29      0.29      0.29      0.29      0.29

This script shows which words are most strongly associated with the top three terms (corlimit = 0.25 is the lower threshold I have set; you can set it lower to see more words, or higher to see fewer). The output indicates that “integr” (the root of “integrity”) and “synergi” (the root of “synergy”, “synergies”, etc.) each have a correlation of 0.28 with the word “good”. You can interpret this as the context around the most frequently occurring word (“good”) being positive. Similarly, “togeth”, the root of “together”, is highly correlated with the word “work” (0.4). This indicates that many responses mention teams that “work together”, which can be interpreted in a positive context.
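
If you want to experiment with the threshold, you can re-run findAssocs() for a single term with a different corlimit. The line below is a small sketch; with this data, raising the limit above 0.28 would return no associations for “good”.

# Re-run the association for "good" with a stricter correlation threshold
findAssocs(TextDoc_dtm, terms = "good", corlimit = 0.35)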

The script can also be modified to find terms associated with words that occur at least 50 times, instead of hard-coding the terms in your script.

# Find associations for words that occur at least 50 times
findAssocs(TextDoc_dtm, terms = findFreqTerms(TextDoc_dtm, lowfreq = 50), corlimit = 0.25)
## $work
## togeth 
##    0.4 
## 
## $good
##  integr synergi 
##    0.28    0.28 
## 
## $health
##    declin    happen      noth      real sentiment    suppli      wors 
##      0.29      0.29      0.29      0.29      0.29      0.29      0.29 
## 
## $overal
##  bad 
## 0.26 
## 
## $great
##   journey satisfact     march      goal     pursu    toward      hard 
##      0.52      0.52      0.36      0.35      0.28      0.26      0.26 
## 
## $feel
##   across    board    harsh   system somewhat 
##     0.33     0.32     0.32     0.32     0.29 
## 
## $improv
##    room perfect   propl    thik attitud 
##    0.41    0.35    0.35    0.35    0.32
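
If you only want the list of frequent terms themselves, without the associations, findFreqTerms() can also be called on its own. This is an optional sketch; the output depends on the data and is not shown here.

# List all terms that appear at least 50 times in the term-document matrix
findFreqTerms(TextDoc_dtm, lowfreq = 50)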

Sentiment Scores

Sentiments can be classified as positive, neutral or negative. They can also be represented on a numeric scale, to better express the degree of positive or negative strength of the sentiment contained in a body of text.

This example uses the syuzhet package for generating sentiment scores, which has four sentiment dictionaries and offers a method for accessing the sentiment extraction tool developed in the NLP group at Stanford.

The get_sentiment function accepts two arguments: a character vector (of sentences or words) and a method. The selected method determines which of the four available sentiment extraction methods will be used. The four methods are syuzhet (the default), bing, afinn and nrc. Each method uses a different scale and hence returns slightly different results. Please note that the outcome of the nrc method is more than just a numeric score; it requires additional interpretation and is not covered here (the related NRC emotion lexicon is used for emotion classification later in this section).

The description of the get_sentiment function has been sourced from the syuzhet package vignette on CRAN: https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?

# regular sentiment score using get_sentiment() function and method of your choice
# please note that different methods may have different scales
syuzhet_vector <- get_sentiment(text, method="syuzhet")
# see the first few elements of the vector
head(syuzhet_vector)
## [1] 2.60 4.65 2.55 1.05 1.00 0.25
# see summary statistics of the vector
summary(syuzhet_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.450   0.900   1.600   1.883   2.650   9.000

An inspection of the syuzhet vector shows that the first element has the value 2.60, meaning that the sentiment scores of all meaningful words in the first response (line) of the text file sum to 2.60. The syuzhet method scores individual words on a decimal scale from -1 (most negative) to +1 (most positive), and the score for a response is the sum of these word-level values. Note that the summary statistics of the syuzhet vector show a median value of 1.6, which is above zero and can be interpreted to mean that the overall sentiment across all the responses is positive.

Next, we will run the same analysis for the next two methods and inspect their respective vectors.

# bing
bing_vector <- get_sentiment(text, method="bing")
head(bing_vector)
## [1]  3  1  4 -1  1  1
summary(bing_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -3.000   1.000   2.000   2.007   3.000   9.000
# afinn
afinn_vector <- get_sentiment(text, method="afinn")
head(afinn_vector)
## [1] 4 8 6 5 6 2
summary(afinn_vector)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.00    2.00    4.00    4.42    7.00   18.00

Please note the scale of the sentiment scores generated by each method:

- syuzhet : decimal values, with each word scored between -1.0 and +1.0
- bing : integer values, with each word scored as either -1 (negative) or +1 (positive)
- afinn : integer values, with each word scored between -5 and +5

The summary statistics of the bing and afinn vectors also show that the median sentiment score is above 0, which can be interpreted to mean that the overall sentiment across all the responses is positive.
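
As an additional, optional check on this interpretation, you could also look at the share of responses that each method scores as positive. The sketch below uses the three vectors created above; the exact proportions depend on the data and are not shown here.

# Proportion of responses with a positive score under each method
mean(syuzhet_vector > 0)
mean(bing_vector > 0)
mean(afinn_vector > 0)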

Because these different methods use different scales, it’s better to convert their output to a common scale before comparing them.

This basic scale conversion can be done easily using R’s built-in sign() function, which converts all positive numbers to 1, all negative numbers to -1, and leaves zeros as 0.

# compare the first few elements of each vector using the sign function
rbind(
  sign(head(syuzhet_vector)),
  sign(head(bing_vector)),
  sign(head(afinn_vector))
)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    1    1    1    1
## [2,]    1    1    1   -1    1    1
## [3,]    1    1    1    1    1    1

Note that the first element of each row (vector) is 1, indicating that all three methods have calculated a positive sentiment score for the first response (line) in the text.
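
The same idea can be extended beyond the first few responses. The optional sketch below computes the proportion of all responses for which the three methods agree on the sign of the sentiment score; the exact value will depend on the data.

# Proportion of responses where syuzhet, bing and afinn agree on the sign
mean(sign(syuzhet_vector) == sign(bing_vector) &
     sign(bing_vector) == sign(afinn_vector))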

Emotion Classification

Emotion classification is built on the NRC Word-Emotion Association Lexicon (aka EmoLex). The definition of the “NRC Emotion Lexicon”, sourced from http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm, is: “The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were manually done by crowdsourcing.”

To understand this, explore the get_nrc_sentiment() function, which returns a data frame with each row representing a sentence (line) from the original file.

The data frame has ten columns (one column for each of the eight emotions, one column for positive sentiment valence and one for negative sentiment valence).

The data in the columns (anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive) can be accessed individually or in sets; a short example of this is shown after the output below.

The definition of get_nrc_sentiment has been sourced from the syuzhet package vignette on CRAN: https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html?

# run nrc sentiment analysis to return a data frame with, for each response (row),
# the number of words associated with each of the following emotions:
# anger, anticipation, disgust, fear, joy, sadness, surprise, trust
# It also counts the number of positive and negative sentiment words found in each row
d <- get_nrc_sentiment(text)
# head(d,10) - to see top 10 lines of the get_nrc_sentiment dataframe
head (d,10)
##    anger anticipation disgust fear joy sadness surprise trust negative positive
## 1      0            1       0    0   1       0        0     2        1        2
## 2      0            3       0    1   0       0        0     1        1        5
## 3      0            1       0    0   1       0        0     1        0        2
## 4      0            3       0    0   2       1        1     3        2        4
## 5      0            2       0    0   2       0        1     4        1        3
## 6      0            0       0    0   0       0        0     0        0        1
## 7      0            2       0    0   2       0        0     4        0        6
## 8      0            4       0    0   4       0        1     4        0        5
## 9      0            3       0    0   3       0        1     3        0        5
## 10     1            1       0    1   0       0        1     1        1        3

The output shows that the first line of text has: one occurrence of a word associated with anticipation, one with joy and two with trust, together with one word contributing to negative sentiment and two contributing to positive sentiment.
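
As mentioned earlier, the columns of this data frame can also be accessed individually or in sets. The lines below are an optional sketch; their output is not shown here.

# Access a single emotion column, or a set of columns, from the data frame
d$trust                        # trust counts for every response
d[, c("positive", "negative")] # positive and negative counts side by side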

The next step is to create two charts to help visually analyze the emotions in this survey text. First, we perform some data transformation and clean-up steps before plotting.

The first plot shows the total number of instances of words in the text, associated with each of the eight emotions.

#transpose
td<-data.frame(t(d))
# rowSums sums each emotion (row of the transposed data frame) across the responses (columns)
td_new <- data.frame(rowSums(td[2:253]))
#Transformation and cleaning
names(td_new)[1] <- "count"
td_new <- cbind("sentiment" = rownames(td_new), td_new)
rownames(td_new) <- NULL
td_new2<-td_new[1:8,]
#Plot One - count of words associated with each sentiment
quickplot(sentiment, data=td_new2, weight=count, geom="bar", fill=sentiment, ylab="count")+ggtitle("Survey sentiments")

This bar chart demonstrates that words associated with the positive emotion of “trust” occurred about five hundred times in the text, whereas words associated with the negative emotion of “disgust” occurred fewer than 25 times.

A deeper understanding of the overall emotions in the survey responses can be gained by comparing these numbers as a percentage of the total number of meaningful words.

#Plot Two - count of words associated with each sentiment, expressed as a percentage
barplot(
  sort(colSums(prop.table(d[, 1:8]))), 
  horiz = TRUE, 
  cex.names = 0.7, 
  las = 1, 
  main = "Emotions in Text", xlab="Percentage"
)

This bar plot allows for a quick and easy comparison of the proportion of words associated with each emotion in the text. The emotion “trust” has the longest bar and shows that words associated with this positive emotion constitute just over 35% of all the meaningful words in this text.

On the other hand, the emotion of “disgust” has the shortest bar and shows that words associated with this negative emotion constitute less than 2% of all the meaningful words in this text. Overall, words associated with the positive emotions of “trust” and “joy” account for almost 60% of the meaningful words in the text, which can be interpreted as a good sign of team health.

Congratulations!

You have completed section 05 of the S3729C Data Analytics Seminar.