Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

library(gutenbergr)
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
#install.packages("tm")
#install.packages("topicmodels")
#install.packages("tidytext")
library(tm)
## Warning: package 'tm' was built under R version 3.6.1
## Loading required package: NLP
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.6.1
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1     v readr   1.3.1
## v tibble  2.1.2     v purrr   0.3.2
## v ggplot2 3.1.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.6.1
library(slam)

The Data

You will want to use some books from Project Gutenberg to perform either a Topic or LSA model. The code to pick the books has been provided for you, so all you would need to do is change out the titles. I tried changing some of the titles, but the results were not in line with expectations, so I decided to stick with the existing titles.

##pick some titles from project gutenberg
titles = c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
            "Pride and Prejudice", "Great Expectations")

##read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>% 
  mutate(document = row_number())
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
create_chapters = books %>% 
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter", ignore_case = TRUE)))) %>% 
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) 

by_chapter = create_chapters %>% 
  group_by(document) %>% 
  summarise(text=paste(text,collapse=' '))

#by_chapter

The by_chapter data.frame can be used to create a corpus with VectorSource by using the text column.

Your Analysis

  1. Pick either LSA or Topic Models (pick just one version of the topics math). First, I will fit several topic models (LDA fit, LDA fixed, LDA Gibbs, CTM) and compare them using their alpha and/or entropy values. After that, we can pick a specific model and examine its beta values for the terms and its gamma values for the documents, with visuals for each.

  2. Create a corpus from the data above.

import_corpus = Corpus(VectorSource(by_chapter$text))
  3. Build the semantic vector space. Here, we set stemming to FALSE and, since we are not stemming, set the minimum word length to 4.
import_mat =
  DocumentTermMatrix(import_corpus,
                     control = list(stemming = FALSE, 
                                    stopwords = TRUE, 
                                    minimumWordLength = 4, 
                                    eliminateNumbers = TRUE, 
                                    eliminatePunctuation = TRUE)) 

#weight the space because some words are much more frequent than others: mean normalized term frequency times log inverse document frequency (tf-idf style)
import_weight = tapply(import_mat$v/row_sums(import_mat)[import_mat$i],
                       import_mat$j,
                       mean) *
  log2(nDocs(import_mat)/col_sums(import_mat > 0))

#drop low-weight terms (e.g., words common to most documents) and any documents left with no terms
import_mat = import_mat[ , import_weight >= 0.03]
import_mat = import_mat[row_sums(import_mat) > 0, ]

Modeling the data

I plan to run the following models: LDA fit, LDA fixed, LDA Gibbs, and CTM.

LDA Fit Model: uses the VEM (variational expectation maximization) algorithm and estimates alpha.

LDA Fixed Model: the same VEM fit, but with alpha held fixed rather than estimated.

LDA Gibbs Model: uses Gibbs sampling, a Bayesian approach, instead of the VEM algorithm.

CTM: the Correlated Topic Model, which allows topics to be correlated and is also fit with the VEM algorithm.

Let's start with k = 4, which is the expected number of topics.

#set the number of topics
k = 4

#set a random number for seed
SEED = 12345

LDA_Fit_Model = LDA(import_mat, k = k,
              control = list(seed = SEED))

LDA_Fixed_Model = LDA(import_mat, k = k,
                control = list(estimate.alpha = FALSE, seed = SEED))

LDA_Gibbs_Model = LDA(import_mat, k = k, method = "Gibbs",
                control = list(seed = SEED, burnin = 1000,
                               thin = 100, iter = 1000))

CTM_Fit_Model = CTM(import_mat, k = k,
              control = list(seed = SEED,
                             var = list(tol = 10^-4),
                             em = list(tol = 10^-3)))
  4. Explore the vector space:

Alpha Values

Alpha indicates how many topics are expected per document. Thus a low alpha value indicates fewer topics per document, whereas a high alpha value indicates more topics per document. As seen in the output, the LDA Fit Model estimates an alpha of 0.013, indicating that a single topic is predominant within each document and the spread across topics is small. The alpha values for the LDA Fixed Model and the LDA Gibbs Model are fixed at 12.5 (high). Let's look at the entropy values next.
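
To see what these alpha values imply, here is a small illustrative sketch (separate from the fitted models) that draws topic proportions from a symmetric Dirichlet by normalizing rgamma() draws; the values 0.1 and 12.5 are simply chosen to contrast a low and a high alpha.

#illustration only: a Dirichlet(alpha) draw can be simulated by normalizing Gamma(alpha) draws
rdirichlet_draw = function(alpha, k) {
  g = rgamma(k, shape = alpha)
  g / sum(g)
}

set.seed(SEED)
round(rdirichlet_draw(0.1, 4), 3)   #low alpha: most of the mass lands on one topic
round(rdirichlet_draw(12.5, 4), 3)  #high alpha: mass spread fairly evenly across topics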

Entropy is a measure of randomness. A low entropy indicates less randomness (more stability/coherence within a document), while a high entropy means the opposite. The entropy values for the Fit Model and CTM are 0.098 and 0.44 respectively, with the Fit Model having the lowest entropy. The Fixed and Gibbs models have much higher values (~1.36, roughly 14 times that of the Fit Model), indicating that within each document the topics are spread out and less coherent.
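
As a quick sanity check on the entropy formula used in the code below (the mean of -sum(gamma * log(gamma)) across documents): a chapter dominated by a single topic gives a value near 0, while an even spread over k = 4 topics gives the maximum of log(4), about 1.39, which is roughly where the Fixed and Gibbs models sit.

#entropy of a single document's topic distribution
doc_entropy = function(p) -sum(p * log(p))

doc_entropy(c(0.97, 0.01, 0.01, 0.01))  #concentrated on one topic: about 0.17
doc_entropy(rep(0.25, 4))               #uniform over 4 topics: log(4), about 1.39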

Since the LDA Fit Model has the lowest entropy, let's look at it more closely with a topics matrix. The topics matrix ranks the topics for each document: the most probable topic is ranked first, and so on.

Let's take a look at the books in the corpus and explore the topics within each of them.

Topic Matrix

Great Expectations: Topic 1 is predominant in the first and last parts of the book, while Topic 4 is predominant in the middle. Topic 3 seems to be the least dominant in this book.

Pride and Prejudice: Topics 1 and 2 appear intermittently in the first and last parts of the book, while Topics 3 and 4 are the least dominant.

The War of the Worlds: Topics 1, 2, and 3 appear fairly spread out, but the initial parts are dominated by Topic 2, with an even spread of 1, 2, and 3 towards the tail end of the book. Topic 4 rarely appears.

Twenty Thousand Leagues under the Sea: dominated by Topic 4 towards the beginning, then a mix of Topics 3 and 4. Topics 1 and 2 are barely to be seen.
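
The chapter-by-chapter patterns described above can be double-checked with a short sketch like the one below; it assumes the document-term matrix kept the row order of by_chapter, so the numeric document names returned by topics() index into by_chapter$document (doc_topics is just a helper name used here).

#map each remaining chapter back to its book and tally the dominant topic
doc_topics = tibble(doc_id = as.integer(names(topics(LDA_Fit_Model))),
                    topic  = topics(LDA_Fit_Model)) %>%
  mutate(document = by_chapter$document[doc_id]) %>%
  separate(document, into = c("title", "chapter"), sep = "_", convert = TRUE)

#count how often each topic is dominant within each book
doc_topics %>% count(title, topic)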

Most Frequent Topics

Topic 1: centered on words such as "wandered", "experiences", and "adventures".

Topic 2: has words such as "sargasso", "wandered", "ecclesiastes", and "20,000", which are from the book "Twenty Thousand Leagues under the Sea".

Topic 3: "sergeant" appears a few times along with "wandered", "experiences", and "adventures", which seem like words from the book "Great Expectations".

Topic 4: "girls" and "woe", along with the common words "wandered", "experiences", and "adventures", are the most frequently occurring terms.

Beta Values

The beta values indicate the weight of each word within each topic: the higher the beta value, the more strongly that word characterizes the topic. "skiffins" and "rul" carry the most weight for Topic 1, "ulla" and "sargasso" for Topic 2, "compeyson" and "sergeant" for Topic 3, and "wandered", "cachalots", and "vanikoro" for Topic 4.

Gamma Values

The gamma values represent the probability of each topic occurring within a document. Visualising the gamma values from the LDA Fit Model, the graph shows that the probabilities are essentially either 1 or 0, which is in line with the LDA Fit Model's low entropy value.

#Alpha values
LDA_Fit_Model@alpha
## [1] 0.01345296
LDA_Fixed_Model@alpha
## [1] 12.5
LDA_Gibbs_Model@alpha
## [1] 12.5
#Entropy Values
sapply(list(LDA_Fit_Model, LDA_Fixed_Model, LDA_Gibbs_Model, CTM_Fit_Model),
       function (n)
         mean(apply(posterior(n)$topics, 1, function(m) - sum(m * log(m)))))
## [1] 0.09839209 1.36243173 1.36793591 0.43948793
#Topic matrix
topics(LDA_Fit_Model, k)
##      7 31 37 38 40 45 51 52 60 69 71 84 114 125 130 134 135 136 137 139
## [1,] 1  1  3  3  1  3  3  3  4  4  1  2   4   4   4   4   4   4   4   2
## [2,] 2  2  1  1  2  1  1  1  1  1  2  1   1   1   2   1   2   1   1   1
## [3,] 3  3  2  2  3  2  2  2  2  2  3  3   2   2   1   2   1   2   2   3
## [4,] 4  4  4  4  4  4  4  4  3  3  4  4   3   3   3   3   3   3   3   4
##      142 145 147 154 158 171 172 176 177 182 184 186 189 193
## [1,]   4   4   4   4   4   1   4   2   4   4   1   4   2   4
## [2,]   1   1   1   2   1   2   1   1   1   2   2   1   4   2
## [3,]   2   2   2   1   2   3   2   3   2   1   3   2   1   1
## [4,]   3   3   3   3   3   4   3   4   3   3   4   3   3   3
#Most frequent topics
terms(LDA_Fit_Model,10)
##       Topic 1        Topic 2        Topic 3          Topic 4      
##  [1,] "skiffins"     "ulla,"        "compeyson"      "wandered"   
##  [2,] "rul"          "\"ulla,"      "sergeant,"      "vanikoro,"  
##  [3,] "cuttlefish"   "sargasso"     "sergeant"       "cachalots"  
##  [4,] "tom,"         "strangers.\"" "two-and-thirty" "woe!"       
##  [5,] "poulps"       "wandered"     "wandered"       "adventures" 
##  [6,] "\"conseil,\"" "adventures"   "adventures"     "roehampton" 
##  [7,] "youngest,"    "20,000"       "experiences"    "parabolic"  
##  [8,] "wandered"     "answer----"   "girls;"         "gibraltar"  
##  [9,] "adventures"   "believed?"    "youngest,"      "girls;"     
## [10,] "experiences"  "ecclesiastes" "addition,"      "experiences"
#Beta values
#use tidyverse to clean up the fit
LDA_fit_topics = tidy(LDA_Fit_Model, matrix = "beta")

#create top terms
top_terms = LDA_fit_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

#clean up ggplot2 defaults
cleanup = theme(panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                panel.background = element_blank(),
                axis.line.x = element_line(color = "black"),
                axis.line.y = element_line(color = "black"),
                legend.key = element_rect(fill = "white"),
                text = element_text(size = 10))

#make the plot
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  cleanup +
  coord_flip()

#Gamma values
LDA_gamma = tidy(LDA_Fit_Model, matrix = "gamma")

LDA_gamma %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_point() +
  cleanup