suppressWarnings(suppressPackageStartupMessages(library(tidyverse)))
suppressWarnings(suppressPackageStartupMessages(library(tidytext)))
suppressWarnings(suppressPackageStartupMessages(library(sentimentr)))
suppressWarnings(suppressPackageStartupMessages(library(syuzhet)))
suppressWarnings(suppressPackageStartupMessages(library(here)))
suppressWarnings(suppressPackageStartupMessages(library(DT)))
suppressWarnings(suppressPackageStartupMessages(library(kableExtra)))
suppressWarnings(suppressPackageStartupMessages(library(tm)))
suppressWarnings(suppressPackageStartupMessages(library(topicmodels)))

# 0) -----------------------------------------

Overview

Using free-text survey comments as the data, we want to do the following:

  1. Sentence-level and respondent-level sentiment analysis of comments

  2. Vector encoding of each comment using term frequency

  3. Topic modeling using LDA on the vectorized comments

# 1) -----------------------------------------

Data

df1.text <- read_csv(here::here("data", 
                                "2019-12-11_test-text-data.csv"), 
                     col_types = cols(person_id = "i", comment = "c"))

str(df1.text)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 28 obs. of  2 variables:
##  $ person_id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ comment  : chr  "Innovation is important. Without it, we will die." "Best workplace ever" "Even in their prime, the tigers had their detractors. Twenty-five years ago this month, Paul Krugman, an econom"| __truncated__ "There is always a sense of laziness. If people or other managers can push work to you, they will. I learned to "| __truncated__ ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   person_id = col_integer(),
##   ..   comment = col_character()
##   .. )
# df1.text
# df1.text$comment


# 2) -----------------------------------------

Tokenize into words/ngrams

# break into words
df2.tokens_word <- 
  df1.text %>% 
  unnest_tokens(word, 
                comment, 
                token = "words")

df2.tokens_word %>% 
  datatable(extensions = 'Buttons',
            options = list(dom = 'Bfrtip', 
                           buttons = c('excel', "csv")))
# break into ngrams (n is not specified here, so tokenize_ngrams() defaults to trigrams, n = 3)
df3.tokens_ngram <- 
  df1.text %>% 
  unnest_tokens(word,
                comment, 
                token = "ngrams")

df3.tokens_ngram %>% 
  datatable(extensions = 'Buttons',
            options = list(dom = 'Bfrtip', 
                           buttons = c('excel', "csv")))
# break into sentences
df4.tokens_sent <- 
  df1.text %>% 
  mutate(comment = gsub('"', '""', comment)) %>%  # double any embedded quotes before sentence-splitting
  unnest_tokens(sentence,
                comment,
                token = "sentences") %>% 
  mutate(sentence = gsub('"', '', sentence))  # then strip the quotes from the resulting sentences

df4.tokens_sent %>% 
  datatable(extensions = 'Buttons',
            options = list(dom = 'Bfrtip', 
                           buttons = c('excel', "csv")))
# 3) -----------------------------------------

Sentiment analysis

Reference: https://github.com/trinker/sentimentr

sentimentr attempts to take into account valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions) while maintaining speed. Simply put, sentimentr is an augmented dictionary lookup.

sentimentr is designed to quickly calculate text polarity sentiment at the sentence level and, optionally, to aggregate by rows or grouping variable(s).

The preferred workflow is to split the text into sentences with get_sentences before any sentiment analysis is done.

Overview of methodology

For each sentence do:

  1. Use dictionary lookup to tag polarized words (+1 or -1).

  2. Pull out the cluster of words around each polarized word. Default is 4 words before and 2 words after.

  3. Tag words in the cluster as neutral, negator, amplifier (intensifier), or de-amplifier (downtoner). Collectively, these are known as valence shifters.

  4. Use the valence shifters as weighting factors and multiply them with the polarized words to get a total for each cluster.

  5. Finally, these weighted context clusters are summed and divided by the square root of the sentence's word count, yielding an unbounded polarity score for each sentence (a quick illustration follows this list).
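
A minimal illustration of the valence shifters in action (a sketch with made-up sentences; exact scores depend on sentimentr's bundled polarity dictionary): the negated version should flip the sign of the plain sentence's score, and the amplified version should strengthen it.

# toy sentences: a plain, a negated, and an amplified version of the same statement
toy_sentences <- get_sentences(c("I am happy with this team.",
                                 "I am not happy with this team.",
                                 "I am very happy with this team."))
sentiment(toy_sentences)  # one row per sentence: element_id, sentence_id, word_count, sentiment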

Sentence-level sentiment:

sentiment(df4.tokens_sent$sentence) %>% 
  bind_cols(df4.tokens_sent %>% 
              select(person_id, 
                     sentence)) %>% 

  datatable(extensions = 'Buttons',
            options = list(dom = 'Bfrtip', 
                           buttons = c('excel', "csv")))

Average sentiment by respondent:

sentiment(df4.tokens_sent$sentence) %>% 
  bind_cols(df4.tokens_sent %>% 
              select(person_id, 
                     sentence)) %>% 
  group_by(person_id) %>% 
  summarise(avg_sentiment = mean(sentiment), 
            min_sentiment = min(sentiment)) %>% 
  
  left_join(df1.text, by = "person_id") %>%  # attach the original comment by person_id (safer than binding by row position)
  
  datatable(extensions = 'Buttons',
            options = list(dom = 'Bfrtip', 
                           buttons = c('excel', "csv")))
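
Note that sentimentr also provides sentiment_by() for exactly this kind of grouped aggregation. A sketch of the roughly equivalent call (sentiment_by() uses a down-weighted averaging function by default, so its averages can differ slightly from the simple mean above):

# aggregate sentence-level sentiment by respondent with sentimentr's own helper
with(df1.text,
     sentiment_by(get_sentences(comment),
                  list(person_id)))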
# 4) -----------------------------------------

Creating a doc-term matrix

Reference: See here for details on vectorizing by TF-IDF.

The function cast_dtm() calls tm::as.DocumentTermMatrix(), which returns an object of class DocumentTermMatrix (or TermDocumentMatrix) containing a sparse document-term (or term-document) matrix. The weighting attribute records the weighting applied to the matrix.

Note: LDA expects weighting by raw term frequency, not TF-IDF, so we’ll use term frequency here.

m1.doc_term_matrix <- 
  df2.tokens_word %>%
  anti_join(stop_words) %>% 
  count(person_id, 
        word) %>% 
  
  cast_dtm(document = person_id,  # which col distinguishes documents?
           term = word, 
           value = n, 
           weighting = tm::weightTf)  # to get TF-IDF, use "tm::weightTfIdf" 
## Joining, by = "word"
# ?as.DocumentTermMatrix  # from tm package 
# ?inspect  # from tm package 
# m1.doc_term_matrix
# str(m1.doc_term_matrix)

m1.doc_term_matrix$nrow  # 28 docs 
## [1] 28
m1.doc_term_matrix$ncol  # 149 terms (tokens)
## [1] 149
inspect(m1.doc_term_matrix[1:5, 1:10])  # docs are in rows, terms in cols 
## <<DocumentTermMatrix (documents: 5, terms: 10)>>
## Non-/sparse entries: 10/40
## Sparsity           : 80%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs affairs ago american argued article asia's based die innovation
##    1       0   0        0      0       0      0     0   1          1
##    2       0   0        0      0       0      0     0   0          0
##    3       1   1        1      1       1      2     1   0          0
##    4       0   0        0      0       0      0     0   0          0
##    5       0   0        0      0       0      0     0   0          0
##     Terms
## Docs workplace
##    1         0
##    2         1
##    3         0
##    4         0
##    5         0
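
As mentioned above, the weighting applied to the matrix is recorded in its weighting attribute; a quick check on the object built above:

attr(m1.doc_term_matrix, "weighting")  # should report term frequency (tf), matching the inspect() output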
# 5) -----------------------------------------

LDA

Reference: link

num_topics <- 2

m2.lda <- 
  LDA(m1.doc_term_matrix, 
      k = num_topics, 
      control = list(seed = 11))

m2.lda
## A LDA_VEM topic model with 2 topics.
str(m2.lda)
## Formal class 'LDA_VEM' [package "topicmodels"] with 14 slots
##   ..@ alpha          : num 0.0524
##   ..@ call           : language LDA(x = m1.doc_term_matrix, k = num_topics, control = list(seed = 11))
##   ..@ Dim            : int [1:2] 28 149
##   ..@ control        :Formal class 'LDA_VEMcontrol' [package "topicmodels"] with 13 slots
##   .. .. ..@ estimate.alpha: logi TRUE
##   .. .. ..@ alpha         : num 25
##   .. .. ..@ seed          : int 11
##   .. .. ..@ verbose       : int 0
##   .. .. ..@ prefix        : chr "C:\\Users\\nahmad3\\AppData\\Local\\Temp\\Rtmp29c10D\\file4b7c638cbae"
##   .. .. ..@ save          : int 0
##   .. .. ..@ nstart        : int 1
##   .. .. ..@ best          : logi TRUE
##   .. .. ..@ keep          : int 0
##   .. .. ..@ estimate.beta : logi TRUE
##   .. .. ..@ var           :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
##   .. .. .. .. ..@ iter.max: int 500
##   .. .. .. .. ..@ tol     : num 1e-06
##   .. .. ..@ em            :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
##   .. .. .. .. ..@ iter.max: int 1000
##   .. .. .. .. ..@ tol     : num 1e-04
##   .. .. ..@ initialize    : chr "random"
##   ..@ k              : int 2
##   ..@ terms          : chr [1:149] "die" "innovation" "workplace" "affairs" ...
##   ..@ documents      : chr [1:28] "1" "2" "3" "4" ...
##   ..@ beta           : num [1:2, 1:149] -4.53 -196 -4.53 -195.39 -182.24 ...
##   ..@ gamma          : num [1:28, 1:2] 0.97512 0.0474 0.00102 0.00575 0.01687 ...
##   ..@ wordassignments:List of 5
##   .. ..$ i   : int [1:169] 1 1 2 3 3 3 3 3 3 3 ...
##   .. ..$ j   : int [1:169] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ v   : num [1:169] 1 1 2 2 2 2 2 2 2 2 ...
##   .. ..$ nrow: int 28
##   .. ..$ ncol: int 149
##   .. ..- attr(*, "class")= chr "simple_triplet_matrix"
##   ..@ loglikelihood  : num [1:28] -9.81 -5.11 -220.04 -40.6 -14.02 ...
##   ..@ iter           : int 31
##   ..@ logLiks        : num(0) 
##   ..@ n              : int 176

Extract word probs by topic (betas)

Let’s look at the distribution of the per-topic word probabilities (betas), and then pull out the terms that are most common within each topic.

df5.topic_betas <- 
  tidy(m2.lda, 
       matrix = "beta") 

df5.topic_betas %>% 
  ggplot(aes(x = beta)) + 
  geom_density() + 
  facet_wrap(~topic) + 
  labs(title = "Dist of word probs by topic")

For a particular topic, high beta terms are those highly associated with the topic.

For each term, we can compare betas across topics to see which topic it belongs to most clearly.

df5.topic_betas %>% 
  group_by(topic) %>% 
  top_n(3, beta) %>% 
  ungroup() %>% 
  arrange(topic, 
          -beta) %>% 
  datatable(extensions = 'Buttons',
            options = list(dom = 'Bfrtip', 
                           buttons = c('excel', "csv")))

Let’s find the terms that had the greatest difference in β between topic 1 and topic 2. This can be estimated with the log ratio of the two: log2(β_topic2 / β_topic1).

To constrain this to a set of especially relevant words, we can filter for relatively common words, such as those with a beta greater than 1/100 (.01) in at least one topic.
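
As a quick sense of scale for the log ratio (plain arithmetic, not tied to this data set):

log2(0.020 / 0.002)  # ~3.32: a term ten times more probable under topic 2 scores about +3.3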

df6.beta_spread <- 
  df5.topic_betas %>% 
  mutate(topic = paste0("topic", 
                        topic)) %>% 
  spread(key = topic, 
         value = beta) %>% 
  
  filter(topic1 > .01 | topic2 > .01) %>% 
  mutate(log_ratio = log2(topic2/topic1), 
         sign = sign(log_ratio) %>% as.factor()) %>% 
  arrange(desc(log_ratio))

# tbl
df6.beta_spread %>% 
  datatable(extensions = 'Buttons',
            options = list(dom = 'Bfrtip', 
                           buttons = c('excel', "csv")))
# plot the 15 largest and 15 smallest log ratios (terms most tilted toward topic 2 and topic 1, respectively)
df6.beta_spread %>% 
  head(15) %>% 
  bind_rows(df6.beta_spread %>% tail(15)) %>% 
  
  mutate(term = as.factor(term)) %>% 
  
  ggplot(aes(x = reorder(term, 
                         log_ratio),  
             y = log_ratio, 
             fill = sign)) +
  geom_col() + 
  coord_flip() + 
  labs(title = "Top 15 terms most uniquely associated with \nTopic 1 versus Topic 2")