Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
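For reference, here is a sketch of the setup chunk listing the packages used later in this report (the original setup may have loaded the tidyverse as a whole instead of the individual packages):
#packages used throughout the report
library(gutenbergr)   #download the books
library(dplyr)        #data manipulation
library(stringr)      #string matching for chapter breaks
library(tidyr)        #unite() for the document ids
library(tm)           #corpus and document-term matrix
library(slam)         #row_sums()/col_sums() on sparse matrices
library(topicmodels)  #LDA and CTM models
library(tidytext)     #tidy() the model output
library(ggplot2)      #plots of the beta and gamma values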
You will want to use some books from Project Gutenberg to perform either a Topic or LSA model. The code to pick the books has been provided for you, so all you would need to do is change out the titles.
I tried some other titles but had difficulty getting usable results, so I am sticking with the titles from the original code.
##pick some titles from project gutenberg
titles = c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
           "Pride and Prejudice", "Great Expectations")
##read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>%
  mutate(document = row_number())
##break the books into chapters
create_chapters = books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
##collapse each chapter into a single row of text
by_chapter = create_chapters %>%
  group_by(document) %>%
  summarise(text = paste(text, collapse = ' '))
Include these steps with a brief explanation of what you are doing. For example, you could write “I am doing LSA, this part creates the corpus and the space.”, and then include an r-chunk underneath it that runs the analysis. You should end up with a report that explains the space - you can assume the person reading it knows the terminology, but explain why/what you are doing at each step. Therefore, you don’t have to explain what alpha is but just simply that you are calculating it and what the interpretation of your results might be.
I will be running several topic models (LDA Fit, LDA Fixed, LDA Gibbs, and CTM) and comparing them based on their alpha and entropy values. Then I will pick one model, most likely the LDA Fit model, and explore its beta values for terms and its gamma values for documents using visual means.
The by_chapter data.frame can be used to create a corpus with VectorSource by using the text column.
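The chunk that builds the corpus is sketched below; it simply wraps the text column in a VectorSource (the name import_corpus matches the object used in the next chunk):
#create the corpus from the chapter text
import_corpus = Corpus(VectorSource(by_chapter$text))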
import_mat =
  DocumentTermMatrix(import_corpus,
                     control = list(stemming = FALSE, #do not reduce words to their roots
                                    stopwords = TRUE, #remove stop words
                                    minWordLength = 4, #cut out small words
                                    removeNumbers = TRUE, #take out the numbers
                                    removePunctuation = TRUE)) #take out the punctuation
Note that I am setting stemming to FALSE here and specifying minWordLength as 4 so that, without stemming, very short words are still dropped.
Next, we weight the matrix to control for its sparsity. Not every word appears in every document, and some words are very frequent. We want to control for both ends of the spectrum, that is, the words that appear in almost no documents as well as the very frequent ones.
#weight the space
import_weight = tapply(import_mat$v / row_sums(import_mat)[import_mat$i],
                       import_mat$j,
                       mean) *
  log2(nDocs(import_mat) / col_sums(import_mat > 0))
#ignore the very frequent and near-zero weighted terms, then drop any empty documents
import_mat = import_mat[ , import_weight >= 0.03]
import_mat = import_mat[row_sums(import_mat) > 0, ]
Note that a threshold of 0.1 was used in class for the answers to the essay question. However, since this corpus is built from several different books, such a high threshold initially left only 2 documents and removed everything else. Setting the threshold at 0.03 still removes quite a few of the 194 documents, but it results in varied terms for each of the topics.
Let’s run the following models: LDA Fit, LDA Fixed, LDA Gibbs, and CTM.
For running the models, we need to pick a value for k, the number of expected topics. Let’s pick k = 3.
#set the number of topics
k = 3
#set a seed for reproducibility
SEED = 12345
#variational EM with alpha estimated
LDA_fit = LDA(import_mat, k = k,
              control = list(seed = SEED))
#variational EM with alpha held fixed
LDA_fixed = LDA(import_mat, k = k,
                control = list(estimate.alpha = FALSE, seed = SEED))
#Gibbs sampling
LDA_gibbs = LDA(import_mat, k = k, method = "Gibbs",
                control = list(seed = SEED, burnin = 1000,
                               thin = 100, iter = 1000))
#correlated topic model
CTM_fit = CTM(import_mat, k = k,
              control = list(seed = SEED,
                             var = list(tol = 10^-4),
                             em = list(tol = 10^-3)))
Alpha is a measure of how topics are spread across documents: low alpha values indicate that only a few topics predominate in each document, while high values indicate that more topics are spread within each document.
## [1] 0.03544873
## [1] 16.66667
## [1] 16.66667
The LDA Fit model has a very low estimated alpha, indicating that a single topic is predominant in each document and there isn’t much spread. The LDA Fixed and LDA Gibbs models do not estimate alpha; their value is the package default of 50/k = 16.67, which assumes a much wider spread across topics.
Entropy is a measure of randomness. Low entropy values indicate low randomness, that is, fewer topics or more coherence within a document. High entropy values indicate high randomness, meaning the topics are spread all over the place.
sapply(list(LDA_fit, LDA_fixed, LDA_gibbs, CTM_fit),
       function (x)
         mean(apply(posterior(x)$topics, 1, function(z) - sum(z * log(z)))))
## [1] 0.1355341 1.0509520 1.0417784 0.2350925
The LDA Fit model and the CTM have low entropy values, indicating low randomness, that is, more coherence within each document. The LDA Fixed and LDA Gibbs models have much higher entropy values, indicating that the topics are spread all over the place, with very little coherence within documents.
Let us now explore the LDA Fit model, which has the lowest entropy, in more detail.
The topics matrix below shows the rank of each topic within each document: the topic covered most in a document is ranked first, and the topics covered less are ranked lower.
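The ranking below was presumably produced with the topics() helper from topicmodels; a minimal sketch:
#rank all three topics for every document, most dominant topic first
topics(LDA_fit, 3)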
## 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25 26 28 29
## [1,] 3 3 3 3 3 3 3 3 3 3 3 2 2 3 2 2 2 2 3 3 3 3 3 2 3 3
## [2,] 2 2 2 2 2 2 2 2 2 2 1 1 1 2 3 1 3 1 2 2 2 2 1 1 2 2
## [3,] 1 1 1 1 1 1 1 1 1 1 2 3 3 1 1 3 1 3 1 1 1 1 2 3 1 1
## 30 31 32 33 34 35 37 38 39 40 41 42 43 45 47 48 49 50 51 52 53 54 55
## [1,] 2 2 2 3 3 2 1 2 2 2 2 2 2 1 2 3 3 3 2 3 3 3 3
## [2,] 1 1 3 2 2 1 2 3 3 1 1 1 3 3 1 2 2 2 1 1 2 1 2
## [3,] 3 3 1 1 1 3 3 1 1 3 3 3 1 2 3 1 1 1 3 2 1 2 1
## 56 57 58 59 69 104 114 124 125 130 132 134 135 136 137 138 139 141
## [1,] 3 3 3 3 1 2 1 1 1 1 2 1 2 1 1 2 3 3
## [2,] 2 2 2 1 3 1 3 3 3 3 1 3 1 2 2 1 1 1
## [3,] 1 1 1 2 2 3 2 2 2 2 3 2 3 3 3 3 2 2
## 142 145 151 154 158 159 166 167 176 182 183 184 186 189 193
## [1,] 1 2 2 1 2 2 2 2 3 1 1 1 1 1 1
## [2,] 2 1 1 2 1 1 1 1 1 2 2 2 3 2 3
## [3,] 3 3 3 3 3 3 3 3 2 3 3 3 2 3 2
We can also tally, for each book in the corpus, how dominant each of the topics is within that book; a sketch of that tally is shown below.
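A minimal sketch of that tally, assuming the corpus preserved the row order of by_chapter so the numeric document names can be mapped back to book titles (book_topics is a name introduced here for illustration):
#map each document back to its book and count the dominant topic per book
book_topics = tibble(doc = as.integer(names(topics(LDA_fit))),
                     topic = topics(LDA_fit)) %>%
  mutate(title = str_remove(by_chapter$document[doc], "_\\d+$")) %>%
  count(title, topic) %>%
  arrange(title, desc(n))
book_topics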
Let’s look at the most frequent terms for each of the topics that were estimated for the LDA Fit Model.
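The term table below was presumably generated with terms() from topicmodels; a minimal sketch:
#show the top ten terms for each topic
terms(LDA_fit, 10)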
## Topic 1 Topic 2 Topic 3
## [1,] "sergeant" "wemmick" "joe"
## [2,] "cuttlefish" "drummle" "ulla"
## [3,] "wandered" "vanikoro" "galley"
## [4,] "adventures" "perouse" "waldengarver"
## [5,] "arthur" "pearl" "rul"
## [6,] "experiences" "cellar" "twoandthirty"
## [7,] "roehampton" "jaggerth" "venus"
## [8,] "hopped" "turnkey" "sargasso"
## [9,] "hilton" "coiler" "danish"
## [10,] "ecclesiastes" "sausage" "experiences"
The beta values represent the weight of each word with respect to each topic. Let us plot the most frequent terms for each topic and visualize their beta values.
#use tidytext to get the beta matrix from the fit
LDA_fit_topics = tidy(LDA_fit, matrix = "beta")
#pull the top ten terms for each topic
top_terms = LDA_fit_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
#clean up ggplot2 defaults
cleanup = theme(panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                panel.background = element_blank(),
                axis.line.x = element_line(color = "black"),
                axis.line.y = element_line(color = "black"),
                legend.key = element_rect(fill = "white"),
                text = element_text(size = 10))
#make the plot
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  cleanup +
  coord_flip()
We can see that “sergeant” and “cuttlefish” are the dominant terms in topic 1, “wemmick” and “drummle” are the dominant terms in topic 2, and “joe” is the dominant term in topic 3.
The gamma values represent the probability of each topic within each doc. Let’s take the gamma matrix from the LDA Fit Model and visualize the gamma values.
#get the gamma matrix from the fit and plot it
LDA_gamma = tidy(LDA_fit, matrix = "gamma")
LDA_gamma %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_point() +
  cleanup
We see a lot of points with a gamma of 1 and a lot of points with a gamma of 0, which is consistent with the low entropy score for this model: a topic is either highly probable for a document or highly improbable.