Introduction

This project uses R for text mining and sentiment analysis and demonstrates some common text analytics and visualization techniques.

The main steps involve:

  • loading each text file and transforming it into an R corpus

  • cleaning the data and performing analysis:

    • Word Frequency
    • Word Cloud
    • Word Association
    • Sentiment Scores
    • Emotion Classification

Packages

The following packages are used:

  • tm for text mining operations:

    -removing numbers, special characters, punctuation

    -removing stop words

  • wordcloud for generating the word cloud plot

  • RColorBrewer for color palettes used in various plots

  • syuzhet for sentiment scores and emotion classification

  • dplyr, tidyr and stringr for reshaping the word frequencies before the correlation test

Reading file data

The input file has multiple lines of text. Data is not tabular, so the readLines function is used.

The function returns a vector containing as many elements as the number of lines in the file.

Each line is an article on the theme stated in the file name.

The file.choose() call passed as the argument opens a dialog for choosing the file interactively.

text  <- readLines(file.choose())  # business articles
text2 <- readLines(file.choose())  # politics articles
text3 <- readLines(file.choose())  # science articles
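The behaviour of readLines described above can be checked on a small file. This is a minimal sketch: the temporary file and its three sample lines are assumptions made for illustration, not part of the project data.

```r
# Write a small three-line file to a temporary location (hypothetical sample text)
tmp <- tempfile(fileext = ".txt")
writeLines(c("First article text.",
             "Second article text.",
             "Third article text."), tmp)

# readLines() returns a character vector with one element per line
articles <- readLines(tmp)
length(articles)   # number of lines in the file
articles[2]        # the second line, i.e. the second article
```

If the input file is UTF-8 encoded, passing encoding = "UTF-8" to readLines() may help avoid mis-encoded characters later in the analysis.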

Reading file data

TextDoc
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 93
TextDoc1
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 242
TextDoc2
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 170

Cleaning file data

The next step is to load the text as a corpus so that text mining operations can be applied to it.

The tm_map() function is used to remove ‘unnecessary’ characters, numbers, punctuation etc. and convert the text to lower case.

library(tm)

# One document per line (article) of the input file
TextDoc <- VCorpus(VectorSource(text))

# Replace every match of a pattern with a space;
# gsub() is used because sub() would replace only the first match
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

TextDoc <- tm_map(TextDoc, toSpace, "/")
TextDoc <- tm_map(TextDoc, toSpace, "@")
TextDoc <- tm_map(TextDoc, toSpace, "\\|")
TextDoc <- tm_map(TextDoc, toSpace, "\"")
TextDoc <- tm_map(TextDoc, content_transformer(tolower))
TextDoc <- tm_map(TextDoc, removeNumbers)
TextDoc <- tm_map(TextDoc, removePunctuation)
TextDoc <- tm_map(TextDoc, stripWhitespace)

# TextDoc1 (from text2) and TextDoc2 (from text3) are prepared the same way
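To see what these cleaning steps do to a single line, the same sequence can be approximated in base R. This is only a sketch: the sample string is made up, and the regular expressions are assumed stand-ins for the tm transformations rather than the package's own code.

```r
# Hypothetical sample line; not taken from the corpus
x <- "Sales rose 12% in 2021 / e-mail: info@example.com | \"Record\" year!"

x <- gsub("/", " ", x)              # replace every slash with a space
x <- gsub("@", " ", x)              # replace every at-sign with a space
x <- gsub("\\|", " ", x)            # the pipe must be escaped in a regular expression
x <- tolower(x)                     # convert to lower case
x <- gsub("[[:digit:]]+", "", x)    # drop numbers
x <- gsub("[[:punct:]]+", "", x)    # drop punctuation
x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
x
```

Replacing "/" and "@" with a space before punctuation is removed keeps the words on either side apart; punctuation removed without that step is simply deleted, which is why "example.com" becomes "examplecom".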

Stop words

Stop words are the most commonly occurring words in a language and carry very little useful information.

They should be removed before performing further analysis.

Examples of stop words in English are “the, is, at, on”.

In addition to the built-in English list, a personal list (stopwords_personal, whose definition is not shown) is removed:

TextDoc <- tm_map(TextDoc, removeWords, stopwords("english"))
TextDoc <- tm_map(TextDoc, removeWords, stopwords_personal)
TextDoc1 <- tm_map(TextDoc1, removeWords, stopwords("english"))
TextDoc1 <- tm_map(TextDoc1, removeWords, stopwords_personal) 
TextDoc2 <- tm_map(TextDoc2, removeWords, stopwords("english"))
TextDoc2 <- tm_map(TextDoc2, removeWords, stopwords_personal) 
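Since the definition of stopwords_personal is not shown, the following base-R sketch uses a hypothetical three-word list (the words are borrowed from the printed stop-word output) to illustrate whole-word removal. The regular expression is an assumed approximation of what removeWords does.

```r
# Hypothetical personal stop-word list; the real stopwords_personal is not shown
stopwords_personal <- c("biden", "cnn", "gmt")

x <- "cnn reports that biden met investors at noon gmt"

# Build one regular expression that matches any listed word as a whole word
pattern <- paste0("\\b(", paste(stopwords_personal, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, "", x, perl = TRUE)

# Tidy the spaces left behind by the removed words
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))
cleaned
```

The word boundaries (\\b) matter: without them, removing "co" would also cut into words such as "company".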

Stop words

Examples of stop-words including personal list:

##  [1] "I"        "a"        "about"    "above"    "after"    "again"   
##  [7] "against"  "ain"      "all"      "also"     "am"       "an"      
## [13] "and"      "andrew"   "angele"   "any"      "apr"      "are"     
## [19] "aren"     "as"       "ask"      "at"       "aug"      "baiden"  
## [25] "be"       "because"  "been"     "before"   "being"    "below"   
## [31] "between"  "biden"    "both"     "britney"  "bryant"   "but"     
## [37] "by"       "can"      "canc"     "carry"    "che"      "chloe"   
## [43] "cnn"      "co"       "colin"    "com"      "come"     "could"   
## [49] "couldn"   "cruz"     "cry"      "cuomo"    "d"        "davidson"
## [55] "dec"      "dennis"   "did"      "didn"     "die"      "do"      
## [61] "does"     "doesn"    "doing"    "don"      "down"     "during"  
## [67] "each"     "eddy"     "everyone" "fboiqs"   "feb"      "few"     
## [73] "five"     "for"      "four"     "from"     "further"  "get"     
## [79] "give"     "gmt"      "gov"      "had"      "hadn"     "has"     
## [85] "hasn"     "have"     "haven"    "having"   "he"       "her"

Term document matrix

The next step is to count the occurrences of each word using the TermDocumentMatrix() function.

A term-document matrix is a table with one row per term and one column per document; each cell holds the frequency of that term in that document.

TextDoc_dtm <- TermDocumentMatrix(TextDoc)
dtm_m <- as.matrix(TextDoc_dtm)
# Sort by decreasing frequency
dtm_v <- sort(rowSums(dtm_m), decreasing = TRUE)
dtm_d <- data.frame(word = names(dtm_v), freq = dtm_v)

TextDoc_dtm1 <- TermDocumentMatrix(TextDoc1)
dtm_m1 <- as.matrix(TextDoc_dtm1)
# Sort by decreasing frequency
dtm_v1 <- sort(rowSums(dtm_m1), decreasing = TRUE)
dtm_d1 <- data.frame(word = names(dtm_v1), freq = dtm_v1)

TextDoc_dtm2 <- TermDocumentMatrix(TextDoc2)
dtm_m2 <- as.matrix(TextDoc_dtm2)
# Sort by decreasing frequency
dtm_v2 <- sort(rowSums(dtm_m2), decreasing = TRUE)
dtm_d2 <- data.frame(word = names(dtm_v2), freq = dtm_v2)
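The structure of a term-document matrix and the row sums taken from it can be illustrated in base R. This is a sketch on three made-up documents, not the tm implementation; the names docs, tokens and tdm are introduced only for this example.

```r
# Three hypothetical cleaned documents
docs <- c("market prices rise", "market falls", "prices rise again")

# Split each document into words and collect the distinct terms
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are counts
tdm <- vapply(tokens,
              function(tk) as.integer(table(factor(tk, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms
colnames(tdm) <- paste0("doc", seq_along(docs))
tdm

# Total frequency per term, most frequent first
freq <- sort(rowSums(tdm), decreasing = TRUE)
data.frame(word = names(freq), freq = freq)
```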

Term document matrix

The 15 most frequent words in corpus 1 (business)

head(dtm_d, 15)
##                  word freq
## company       company  167
## percent       percent  162
## people         people  157
## last             last  151
## pandemic     pandemic  136
## business     business  118
## home             home  100
## market         market  100
## workers       workers   95
## bankruptcy bankruptcy   88
## filed           filed   87
## investors   investors   87
## prices         prices   85
## many             many   84
## million       million   84

Term document matrix

The 15 most frequent words in corpus 2 (politics)

head(dtm_d1, 15)
##                          word freq
## house                   house  564
## president           president  487
## senate                 senate  469
## state                   state  432
## administration administration  388
## bill                     bill  378
## people                 people  367
## republican         republican  320
## republicans       republicans  317
## washington         washington  315
## white                   white  313
## political           political  300
## federal               federal  296
## states                 states  294
## last                     last  287

Term document matrix

The 15 most frequent words in corpus 3 (science)

head(dtm_d2, 15)
##                    word freq
## people           people  373
## university   university  219
## time               time  209
## researchers researchers  191
## science         science  191
## scientists   scientists  160
## first             first  155
## data               data  151
## solar             solar  150
## space             space  141
## вђ“                 вђ“  135
## вђ”                 вђ”  135
## black             black  134
## many               many  129
## might             might  124

Note: the entries вђ“ and вђ” are mis-encoded en dash and em dash characters that survived cleaning; they are encoding artifacts of the input file rather than real words.

Most frequent words - business

The most frequent words indicate the theme of the corpus very clearly: most of them are connected to business.

Most frequent words - politics

Most frequent words - science

The most frequent words indicate the theme of the corpus very clearly: most of them are connected to science.

Word cloud

A word cloud helps to visualize and analyze qualitative data. It’s an image composed of keywords and the size of each word indicates its frequency. The word cloud shows additional words that occur frequently and could be of interest for further analysis.

Below is a brief description of the arguments used in the word cloud function:

  • words – the words to be plotted
  • freq – the frequencies of those words
  • min.freq – words with a frequency below this threshold are not plotted
  • max.words – the maximum number of words to display on the plot
  • random.order – FALSE here, so the words are plotted in order of decreasing frequency
  • rot.per – the proportion of words displayed rotated by 90 degrees (0.5 here, so about half)
  • colors – the colors assigned to words, going from lowest to highest frequency
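The combined effect of min.freq and max.words can be sketched in base R. The frequencies below are a small made-up sample and the filtering is a rough approximation of what the two arguments imply; the package's own handling of ties may differ.

```r
# Hypothetical word frequencies, sorted in decreasing order like dtm_d
freq_df <- data.frame(word = c("company", "percent", "people", "home", "rare"),
                      freq = c(167, 162, 157, 100, 1),
                      stringsAsFactors = FALSE)

min_freq  <- 2   # plays the role of min.freq
max_words <- 3   # plays the role of max.words

# Keep words at or above the threshold, then at most max_words of the most frequent
eligible <- freq_df[freq_df$freq >= min_freq, ]
eligible <- head(eligible[order(eligible$freq, decreasing = TRUE), ], max_words)
eligible$word
```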

Word cloud - business

set.seed(1234)
wordcloud(words = dtm_d$word, freq = dtm_d$freq, min.freq = 2,
          max.words=300, random.order=FALSE, rot.per=0.5, 
          colors=brewer.pal(12, "Set3"))

Word cloud - politics

set.seed(1234)
wordcloud(words = dtm_d1$word, freq = dtm_d1$freq, min.freq = 2,
          max.words=300, random.order=FALSE, rot.per=0.5, 
          colors=brewer.pal(12, "Set3"))

Word cloud - science

set.seed(1234)
wordcloud(words = dtm_d2$word, freq = dtm_d2$freq, min.freq = 2,
          max.words=300, random.order=FALSE, rot.per=0.5, 
          colors=brewer.pal(12, "Set3"))

Association

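Studying how strongly terms are correlated helps to see the context around a word. The findAssocs() function reports the terms whose occurrence across documents correlates with a given term at or above the corlimit threshold. Only the call for 'percent' is shown; the $president and $people results in the output come from calls that are not shown.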

findAssocs(TextDoc_dtm, terms = c('percent'), corlimit = 0.52)
## $percent
##  aggregate yearonyear     borrow     writes 
##       0.60       0.55       0.54       0.54
## $president
##       former        white institutions    political 
##         0.49         0.40         0.38         0.38
## $people
##   research    doctors   versions accomplish    medical  histories 
##       0.78       0.75       0.72       0.71       0.71       0.70
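The idea behind these association scores can be sketched in base R as a correlation between rows of a term-document matrix. The small matrix below is made up, and the code illustrates the principle rather than the tm implementation.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents
tdm <- rbind(percent   = c(3, 0, 2, 0, 1),
             aggregate = c(2, 0, 1, 0, 1),
             home      = c(0, 2, 0, 3, 0))

# Correlate the counts of "percent" with the counts of every other term
target <- "percent"
others <- tdm[rownames(tdm) != target, , drop = FALSE]
assoc  <- apply(others, 1, function(term_counts) cor(term_counts, tdm[target, ]))

# Keep only associations at or above the chosen limit, strongest first
corlimit <- 0.52
strong <- sort(assoc[assoc >= corlimit], decreasing = TRUE)
round(strong, 2)
```

Here "aggregate" tends to appear in the same documents as "percent" and passes the limit, while "home" appears in the other documents, so its correlation is negative and it is dropped.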

Sentiment Scores

Sentiment can be classified as positive, neutral or negative and expressed on a numeric scale, where the magnitude conveys the strength of the positive or negative sentiment. The get_sentiment() function is called here with two arguments: a character vector and a method, which is one of:

  • syuzhet (this is the default)
  • bing
  • afinn
  • nrc (will be used for emotion classification)

syuzhet_vector <- get_sentiment(text, method="syuzhet")
bing_vector <- get_sentiment(text, method="bing")
afinn_vector <- get_sentiment(text, method="afinn")
syuzhet_vector1 <- get_sentiment(text2, method="syuzhet")
bing_vector1 <- get_sentiment(text2, method="bing")
afinn_vector1 <- get_sentiment(text2, method="afinn")
syuzhet_vector2 <- get_sentiment(text3, method="syuzhet")
bing_vector2 <- get_sentiment(text3, method="bing")
afinn_vector2 <- get_sentiment(text3, method="afinn")

Sentiment Scores

How to read the output:

  • the first element of each vector is the sum of the sentiment scores of all meaningful words in the first line (article) of the file

  • for each corpus, the output shows the first six elements and a summary of the whole vector, in the order syuzhet, bing, afinn

Note: the scale for the score of a single word differs by method:

  • syuzhet – decimal scale ranging from -1 (most negative) to +1 (most positive)

  • bing – binary scale with -1 (negative) and +1 (positive) sentiment

  • afinn – integer scale ranging from -5 to +5

Because the score of an article is the sum over its words, article scores can fall outside these per-word ranges.

A positive median can be interpreted as the overall sentiment across the articles being positive.

Each method uses a different scale, so the results differ.
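How a score per article arises from scores per word can be sketched with a toy lexicon in base R. The four-word lexicon and its values are invented for illustration and do not reproduce any of the real lexicons used by get_sentiment.

```r
# Hypothetical mini-lexicon; real lexicons hold thousands of scored words
lexicon <- c(good = 3, great = 4, bad = -3, crisis = -2)

article <- "great quarter despite the crisis but bad outlook and good demand"
words   <- strsplit(article, " ", fixed = TRUE)[[1]]

# Look up each word; words missing from the lexicon contribute nothing
word_scores <- lexicon[words]
word_scores <- word_scores[!is.na(word_scores)]

# The article score is the sum of its word scores
article_score <- sum(word_scores)
article_score
```

A long article with many scored words can therefore reach a total far outside the range of any single word, which explains values such as 23.8 or -59 in the summaries.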

Sentiment Scores

Business

## [1] 11.80 -2.05  0.60 15.20 18.35 10.35
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -14.150   0.600   5.150   5.784  10.400  23.800
## [1]  5 -3 -1 12  0 -1
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -19.0000  -3.0000   0.0000   0.1505   4.0000  16.0000
## [1] 28 -3 -4 28 13  4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -59.000  -3.000   2.000   2.914  11.000  29.000

Sentiment Scores

Politics

## [1] -2.95  6.05 23.45  2.55  5.30 -4.75
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -13.4500  -0.3875   2.9250   3.6682   6.8750  23.6500
## [1] -5  1 14 -4 -2 -5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -38.000  -5.000  -1.000  -1.558   2.000  15.000
## [1] -28   2  17 -34  -6 -21
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -66.000 -12.750  -1.000  -3.302   8.000  53.000

Sentiment Scores

Science

## [1] -1.55 12.25 10.15 -1.45 15.45 -2.40
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -15.350   0.000   1.250   4.317   6.987  56.700
## [1] -15   7   2  -2  11  -8
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -25.0000  -3.0000   0.0000  -0.6118   2.0000  23.0000
## [1] -33  15   7  -8  28  -4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -54.000  -2.000   0.000   2.776   8.000  69.000

Sentiment Scores

Different methods use different scales.

It’s better to convert their output to a common scale.

This basic scale conversion can be done using the sign() function, which maps:

  • all positive numbers to 1
  • all negative numbers to -1
  • all zeros to 0
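The conversion can be tried on the first six business scores printed in the earlier output. The variable names below are introduced only for this sketch, and the agreement check at the end is an added illustration rather than part of the original analysis.

```r
# First six business scores as printed in the earlier output
syuzhet_head <- c(11.80, -2.05, 0.60, 15.20, 18.35, 10.35)
bing_head    <- c(5, -3, -1, 12, 0, -1)
afinn_head   <- c(28, -3, -4, 28, 13, 4)

# Reduce every score to -1, 0 or 1 so the methods can be compared row by row
signs <- rbind(syuzhet = sign(syuzhet_head),
               bing    = sign(bing_head),
               afinn   = sign(afinn_head))
signs

# Articles on which all three methods agree
agree <- apply(signs, 2, function(article) length(unique(article)) == 1)
which(agree)
```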

Sentiment Scores - business

rbind(sign(head(syuzhet_vector)),
  sign(head(bing_vector)),
  sign(head(afinn_vector)))
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1   -1    1    1    1    1
## [2,]    1   -1   -1    1    0   -1
## [3,]    1   -1   -1    1    1    1

Sentiment Scores - politics

rbind(sign(head(syuzhet_vector1)),
  sign(head(bing_vector1)),
  sign(head(afinn_vector1)))
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   -1    1    1    1    1   -1
## [2,]   -1    1    1   -1   -1   -1
## [3,]   -1    1    1   -1   -1   -1

Sentiment Scores - science

rbind(sign(head(syuzhet_vector2)),
  sign(head(bing_vector2)),
  sign(head(afinn_vector2)))
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   -1    1    1   -1    1   -1
## [2,]   -1    1    1   -1    1   -1
## [3,]   -1    1    1   -1    1   -1

Emotion Classification

Emotion classification is built on the NRC Word-Emotion Association Lexicon.

It is a list of English words and their associations with eight basic emotions:

  • anger, fear, anticipation, trust, surprise, sadness, joy and disgust

and two sentiments:

  • negative and positive

The get_nrc_sentiment() function returns a data frame in which each row represents an article from the corpus.

The data frame has ten columns:

  • eight columns, one for each emotion
  • one column for positive sentiment valence
  • one column for negative sentiment valence

Emotion Classification - business

d <- get_nrc_sentiment(text)
head(d, 10)
##    anger anticipation disgust fear joy sadness surprise trust negative positive
## 1      4           17       1    4   7       6        5    24       12       40
## 2      1            3       0    2   2       1        0     2        3        2
## 3      1            2       0    1   1       1        1     2        1        2
## 4      6           16       3    8  11       6        9    20       15       44
## 5     18           25       9   23  19      25       15    42       44       71
## 6     12           25       7   18  16      13       12    39       31       55
## 7      0            3       0    1   2       1        2     4        2       11
## 8     11            7       4    8   5      10        5    14       18       17
## 9      3           10       2    9   5       3        4    11       16       16
## 10     0            2       0    0   1       0        0     7        0        9

Emotion Classification - business

The output shows that the first article has:

  • The most occurrences of words associated with trust (24) and anticipation (17)
  • Fewer occurrences of words associated with joy (7)
  • The fewest occurrences of words associated with sadness (6), surprise (5), fear (4), anger (4) and disgust (1)

Overall positive vs negative sentiment:

  • Total of 12 occurrences of words associated with negative emotions
  • Total of 40 occurrences of words associated with positive emotions

Emotion Classification - politics

d1 <- get_nrc_sentiment(text2)
head(d1, 10)
##    anger anticipation disgust fear joy sadness surprise trust negative positive
## 1     11           12       8   15   4      10        3    18       20       23
## 2      4            6       3    4   4       8        3    22       20       30
## 3      6           14       2    9   7       3        4    29       14       41
## 4     13            9       5   13   5      11        3    28       24       34
## 5      7           11       2   11   8       7        9    17       18       31
## 6      5            8       4    9   8      11        5    11       18       20
## 7      6            6       0    8   5       5        4    19        9       24
## 8     10           13       3   15   7       8        6    29       19       36
## 9      3            4       1    1   5       2        1     8        9       15
## 10     5            6       3   10   5      11        4    16       18       25

Emotion Classification - politics

The output shows that the first article has:

  • The most occurrences of words associated with trust (18), fear (15) and anticipation (12)
  • Fewer occurrences of words associated with anger (11), sadness (10) and disgust (8)
  • The fewest occurrences of words associated with joy (4) and surprise (3)

Overall positive vs negative sentiment:

  • Total of 20 occurrences of words associated with negative emotions
  • Total of 23 occurrences of words associated with positive emotions

Emotion Classification - science

d2 <- get_nrc_sentiment(text3)
head(d2, 10)
##    anger anticipation disgust fear joy sadness surprise trust negative positive
## 1      8           15       7   18   9      19        8    23       28       38
## 2      9            9       4   11   9       5        7    22       14       41
## 3      2            7       2    5   3       2        3    10        5       19
## 4      3            7       1    6   4       4        5     9        8       21
## 5      2           11       1    3  12       4        4    19       10       34
## 6     11            8       3   18   3       5        5    18       21       30
## 7     15           34       9   14  26      13       10    43       29       64
## 8      9            9       3    8   4       7        4    12       22       21
## 9      9            9       2   12   6       7        2    22       15       39
## 10     7           13       3   10   8       6        4    34       11       48

Emotion Classification - science

The output shows that the first article has:

  • The most occurrences of words associated with trust (23)
  • Fewer occurrences of words associated with sadness (19), fear (18) and anticipation (15)
  • The fewest occurrences of words associated with joy (9), anger (8), surprise (8) and disgust (7)

Overall positive vs negative sentiment:

  • Total of 28 occurrences of words associated with negative emotions
  • Total of 38 occurrences of words associated with positive emotions

Emotion Classification - business

This bar chart shows that words associated with the positive emotion of trust occurred up to 1500 times in the corpus, whereas words associated with the negative emotion of disgust occurred fewer than 250 times.

Words associated with anticipation were also frequent, occurring up to 1000 times.
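The plotting code is not shown, but totals of this kind are column sums of the emotion data frame. The sketch below assumes that approach and uses only the first three rows of the business output, so its totals are far smaller than those in the chart.

```r
# First three rows of the business output (emotion columns only)
d_small <- data.frame(anger = c(4, 1, 1), anticipation = c(17, 3, 2),
                      disgust = c(1, 0, 0), fear = c(4, 2, 1),
                      joy = c(7, 2, 1), sadness = c(6, 1, 1),
                      surprise = c(5, 0, 1), trust = c(24, 2, 2))

# Total count per emotion across the articles, smallest first
emotion_totals <- sort(colSums(d_small))
emotion_totals
```

Passing a vector like emotion_totals to barplot() would draw one bar per emotion.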

Emotion Classification - politics

This bar chart shows that words associated with the positive emotion of trust occurred up to 5000 times in the corpus, whereas words associated with the negative emotion of disgust occurred fewer than 1000 times.

Words associated with anticipation, fear and anger were also frequent, occurring more than 2000 times.

Emotion Classification - science

This bar chart shows that words associated with the positive emotion of trust occurred up to 2000 times in the corpus, and words associated with the negative emotion of fear occurred around 1250 times.

Words associated with anticipation, anger and sadness were also frequent, occurring up to 3000 times.

Emotion Classification - business

Bar Plot showing the count of words associated with each sentiment expressed as a percentage.

Emotion Classification - business

A deeper understanding of the overall emotions can be gained by expressing these counts as a percentage of the total number of meaningful words.

The bar plot compares the proportion of words associated with each emotion in the corpus:

  • trust has the longest bar: words associated with this positive emotion account for up to 30% of all the meaningful words
  • disgust has the shortest bar: words associated with this negative emotion account for less than 5% of all the meaningful words
  • words associated with the negative emotions of sadness and fear together account for almost 25% of the meaningful words
  • anticipation and surprise can be identified as positive or negative only in context
  • the positive emotion of joy accounts for slightly more than 10%, so the overall sentiment of the corpus can be interpreted as positive
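Percentages like these can be obtained by dividing each emotion's total by the sum over all emotions. The sketch below assumes that calculation; the totals are made-up sample values, not the corpus totals, and the denominator is the number of emotion-word occurrences.

```r
# Hypothetical total counts per emotion (not the real corpus totals)
emotion_totals <- c(anger = 6, anticipation = 22, disgust = 1, fear = 7,
                    joy = 10, sadness = 8, surprise = 6, trust = 28)

# Share of each emotion among all emotion-word occurrences, smallest first
emotion_share <- sort(emotion_totals / sum(emotion_totals))
round(100 * emotion_share, 1)

# Horizontal bar plot of the shares, longest bar at the top
barplot(emotion_share, horiz = TRUE, las = 1, cex.names = 0.7,
        xlab = "Proportion of emotion words")
```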

Emotion Classification - politics

Bar Plot showing the count of words associated with each sentiment expressed as a percentage.

Emotion Classification - politics

A deeper understanding of the overall emotions can be gained by expressing these counts as a percentage of the total number of meaningful words.

The bar plot compares the proportion of words associated with each emotion in the corpus:

  • trust has the longest bar: words associated with this positive emotion account for up to 30% of all the meaningful words
  • disgust has the shortest bar: words associated with this negative emotion account for 5% of all the meaningful words
  • words associated with the negative emotions of anger and fear together account for almost 25% of the meaningful words
  • the positive emotion of joy accounts for slightly more than 10%, and the negative emotion of sadness for about 10%
  • anticipation and surprise can be identified as positive or negative only in context
  • the overall sentiment of the corpus can be interpreted as neutral

Emotion Classification - science

Bar Plot showing the count of words associated with each sentiment expressed as a percentage.

Emotion Classification - science

A deeper understanding of the overall emotions can be gained by expressing these counts as a percentage of the total number of meaningful words.

The bar plot compares the proportion of words associated with each emotion in the corpus:

  • trust has the longest bar: words associated with this positive emotion account for up to 30% of all the meaningful words
  • disgust has the shortest bar: words associated with this negative emotion account for 6% of all the meaningful words
  • words associated with the negative emotions of fear and sadness together account for almost 27% of the meaningful words
  • the positive emotion of joy accounts for slightly more than 10%, and the negative emotion of anger for about 10%
  • anticipation and surprise can be identified as positive or negative only in context
  • the overall sentiment of the corpus can be interpreted as negative

Correlation test

A correlation test can be used to quantify how similar or different the sets of word frequencies are.

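library(dplyr)
library(tidyr)
library(stringr)

frequency <- bind_rows(mutate(dtm_d, c_theme = "business"),
                       mutate(dtm_d1, c_theme = "politics"),
                       mutate(dtm_d2, c_theme = "science")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  # count() tallies rows per word here; count(c_theme, word, wt = freq)
  # would sum the frequencies computed earlier instead
  count(c_theme, word) %>%
  group_by(c_theme) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  # One column per theme, then back to long form with science kept as its own column
  spread(c_theme, proportion) %>% 
  gather(c_theme, proportion, `business`:`politics`)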

Correlation test

cor.test(data = frequency[frequency$c_theme == "business",],
         ~ proportion + `science`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and science
## t = 39.056, df = 3816, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5113390 0.5566774
## sample estimates:
##       cor 
## 0.5343925

Correlation test

cor.test(data = frequency[frequency$c_theme == "politics",],
         ~ proportion + `science`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and science
## t = 47.867, df = 4894, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5453146 0.5834830
## sample estimates:
##       cor 
## 0.5647007
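
The formula interface used in the two tests above can be tried on a small table. This is a minimal sketch: the six words and their proportions are made up for illustration, so the resulting correlation says nothing about the real corpora.

```r
# Hypothetical word proportions in two corpora; the values are made up
props <- data.frame(
  word     = c("people", "last", "many", "market", "solar", "time"),
  business = c(0.020, 0.019, 0.011, 0.013, 0.001, 0.006),
  science  = c(0.025, 0.008, 0.009, 0.002, 0.010, 0.014)
)

# Pearson correlation test using the formula interface
result <- cor.test(~ business + science, data = props)
result$estimate   # correlation coefficient
result$conf.int   # 95 percent confidence interval
```

Rows where either proportion is missing are dropped by cor.test, so words that occur in only one corpus do not enter the comparison.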