Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
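For reference, here is a sketch of the setup chunk listing the packages used later in this report (the original setup may have loaded the tidyverse as a whole instead of the individual packages):
#packages used throughout the report
library(gutenbergr)   #download the books
library(dplyr)        #data manipulation
library(stringr)      #string matching for chapter breaks
library(tidyr)        #unite() for the document ids
library(tm)           #corpus and document-term matrix
library(slam)         #row_sums()/col_sums() on sparse matrices
library(topicmodels)  #LDA and CTM models
library(tidytext)     #tidy() the model output
library(ggplot2)      #plots of the beta and gamma values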
You will want to use some books from Project Gutenberg to perform either a Topic or LSA model. The code to pick the books has been provided for you, so all you would need to do is change out the titles.
I tried some other titles but had difficulty getting usable results, so I am sticking with the titles from the original code.
##pick some titles from project gutenberg
titles = c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
           "Pride and Prejudice", "Great Expectations")
##read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>%
  mutate(document = row_number())
##break the books into chapters
create_chapters = books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter", ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
##collapse each chapter into a single row of text
by_chapter = create_chapters %>%
  group_by(document) %>%
  summarise(text = paste(text, collapse = ' '))
Include these steps with a brief explanation of what you are doing. For example, you could write “I am doing LSA, this part creates the corpus and the space.”, and then include an r-chunk underneath it that runs the analysis. You should end up with a report that explains the space - you can assume the person reading it knows the terminology, but explain why/what you are doing at each step. Therefore, you don’t have to explain what alpha is but just simply that you are calculating it and what the interpretation of your results might be.
I will be running several topic models (LDA Fit, LDA Fixed, LDA Gibbs, and CTM) and comparing them based on their alpha and entropy values. Then I will pick one model, most likely the LDA Fit model, and explore its beta values for terms and its gamma values for documents using visual means.
The by_chapter data.frame can be used to create a corpus with VectorSource by using the text column.
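The chunk that builds the corpus is sketched below; it simply wraps the text column in a VectorSource (the name import_corpus matches the object used in the next chunk):
#create the corpus from the chapter text
import_corpus = Corpus(VectorSource(by_chapter$text))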
import_mat =
  DocumentTermMatrix(import_corpus,
                     control = list(stemming = FALSE, #do not reduce words to their roots
                                    stopwords = TRUE, #remove stop words
                                    minWordLength = 4, #cut out small words
                                    removeNumbers = TRUE, #take out the numbers
                                    removePunctuation = TRUE)) #take out the punctuation
Note that I am setting stemming to FALSE here and specifying minWordLength as 4 so that, without stemming, very short words are still dropped.
Next, we weight the matrix to control for its sparsity. Not every word appears in every document, and some words are very frequent. We want to control for both ends of the spectrum, that is, the words that appear in almost no documents as well as the very frequent ones.
#weight the space
import_weight = tapply(import_mat$v / row_sums(import_mat)[import_mat$i],
                       import_mat$j,
                       mean) *
  log2(nDocs(import_mat) / col_sums(import_mat > 0))
#ignore the very frequent and near-zero weighted terms, then drop any empty documents
import_mat = import_mat[ , import_weight >= 0.03]
import_mat = import_mat[row_sums(import_mat) > 0, ]
Note that a threshold of 0.1 was used in class for the answers to the essay question. However, since this corpus is built from several different books, such a high threshold initially left only 2 documents and removed everything else. Setting the threshold at 0.03 still removes quite a few of the 194 documents, but it results in varied terms for each of the topics.
Let’s run the following models: LDA Fit, LDA Fixed, LDA Gibbs, and CTM.
For running the models, we need to pick a value for k, the number of expected topics. Let’s pick k = 3.
#set the number of topics
k = 3
#set a seed for reproducibility
SEED = 12345
#variational EM with alpha estimated
LDA_fit = LDA(import_mat, k = k,
              control = list(seed = SEED))
#variational EM with alpha held fixed
LDA_fixed = LDA(import_mat, k = k,
                control = list(estimate.alpha = FALSE, seed = SEED))
#Gibbs sampling
LDA_gibbs = LDA(import_mat, k = k, method = "Gibbs",
                control = list(seed = SEED, burnin = 1000,
                               thin = 100, iter = 1000))
#correlated topic model
CTM_fit = CTM(import_mat, k = k,
              control = list(seed = SEED,
                             var = list(tol = 10^-4),
                             em = list(tol = 10^-3)))
Alpha is a measure of how topics are spread across documents: low alpha values indicate that only a few topics predominate in each document, while high values indicate that more topics are spread within each document.
## [1] 0.03544873
## [1] 16.66667
## [1] 16.66667
The LDA Fit model has a very low estimated alpha, indicating that a single topic is predominant in each document and there isn’t much spread. The LDA Fixed and LDA Gibbs models do not estimate alpha; their value is the package default of 50/k = 16.67, which assumes a much wider spread across topics.
Entropy is a measure of randomness. Low entropy values indicate low randomness, that is, fewer topics or more coherence within a document. High entropy values indicate high randomness, meaning the topics are spread all over the place.
sapply(list(LDA_fit, LDA_fixed, LDA_gibbs, CTM_fit),
       function (x)
         mean(apply(posterior(x)$topics, 1, function(z) - sum(z * log(z)))))
## [1] 0.1355341 1.0509520 1.0417784 0.2350925
The LDA Fit model and the CTM have low entropy values, indicating low randomness, that is, more coherence within each document. The LDA Fixed and LDA Gibbs models have much higher entropy values, indicating that the topics are spread all over the place, with very little coherence within documents.
Let us now explore the LDA Fit model, which has the lowest entropy, in more detail.
The topics matrix below shows the rank of each topic within each document: the topic covered most in a document is ranked first, and the topics covered less are ranked lower.
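The ranking below was presumably produced with the topics() helper from topicmodels; a minimal sketch:
#rank all three topics for every document, most dominant topic first
topics(LDA_fit, 3)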
## 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25 26 28 29
## [1,] 3 3 3 3 3 3 3 3 3 3 3 2 2 3 2 2 2 2 3 3 3 3 3 2 3 3
## [2,] 2 2 2 2 2 2 2 2 2 2 1 1 1 2 3 1 3 1 2 2 2 2 1 1 2 2
## [3,] 1 1 1 1 1 1 1 1 1 1 2 3 3 1 1 3 1 3 1 1 1 1 2 3 1 1
## 30 31 32 33 34 35 37 38 39 40 41 42 43 45 47 48 49 50 51 52 53 54 55
## [1,] 2 2 2 3 3 2 1 2 2 2 2 2 2 1 2 3 3 3 2 3 3 3 3
## [2,] 1 1 3 2 2 1 2 3 3 1 1 1 3 3 1 2 2 2 1 1 2 1 2
## [3,] 3 3 1 1 1 3 3 1 1 3 3 3 1 2 3 1 1 1 3 2 1 2 1
## 56 57 58 59 69 104 114 124 125 130 132 134 135 136 137 138 139 141
## [1,] 3 3 3 3 1 2 1 1 1 1 2 1 2 1 1 2 3 3
## [2,] 2 2 2 1 3 1 3 3 3 3 1 3 1 2 2 1 1 1
## [3,] 1 1 1 2 2 3 2 2 2 2 3 2 3 3 3 3 2 2
## 142 145 151 154 158 159 166 167 176 182 183 184 186 189 193
## [1,] 1 2 2 1 2 2 2 2 3 1 1 1 1 1 1
## [2,] 2 1 1 2 1 1 1 1 1 2 2 2 3 2 3
## [3,] 3 3 3 3 3 3 3 3 2 3 3 3 2 3 2
We can also tally, for each book in the corpus, how dominant each of the topics is within that book; a sketch of that tally is shown below.
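A minimal sketch of that tally, assuming the corpus preserved the row order of by_chapter so the numeric document names can be mapped back to book titles (book_topics is a name introduced here for illustration):
#map each document back to its book and count the dominant topic per book
book_topics = tibble(doc = as.integer(names(topics(LDA_fit))),
                     topic = topics(LDA_fit)) %>%
  mutate(title = str_remove(by_chapter$document[doc], "_\\d+$")) %>%
  count(title, topic) %>%
  arrange(title, desc(n))
book_topics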
Let’s look at the most frequent terms for each of the topics that were estimated for the LDA Fit Model.
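The term table below was presumably generated with terms() from topicmodels; a minimal sketch:
#show the top ten terms for each topic
terms(LDA_fit, 10)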
## Topic 1 Topic 2 Topic 3
## [1,] "sergeant" "wemmick" "joe"
## [2,] "cuttlefish" "drummle" "ulla"
## [3,] "wandered" "vanikoro" "galley"
## [4,] "adventures" "perouse" "waldengarver"
## [5,] "arthur" "pearl" "rul"
## [6,] "experiences" "cellar" "twoandthirty"
## [7,] "roehampton" "jaggerth" "venus"
## [8,] "hopped" "turnkey" "sargasso"
## [9,] "hilton" "coiler" "danish"
## [10,] "ecclesiastes" "sausage" "experiences"
The beta values represent the weight of each word with respect to each topic. Let us plot the most frequent terms for each topic and visualize their beta values.
#use tidytext to get the beta matrix from the fit
LDA_fit_topics = tidy(LDA_fit, matrix = "beta")
#pull the top ten terms for each topic
top_terms = LDA_fit_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
#clean up ggplot2 defaults
cleanup = theme(panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                panel.background = element_blank(),
                axis.line.x = element_line(color = "black"),
                axis.line.y = element_line(color = "black"),
                legend.key = element_rect(fill = "white"),
                text = element_text(size = 10))
#make the plot
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  cleanup +
  coord_flip()
We can see that “sergeant” and “cuttlefish” are the dominant terms in topic 1, “wemmick” and “drummle” are the dominant terms in topic 2, and “joe” is the dominant term in topic 3.
The gamma values represent the probability of each topic within each doc. Let’s take the gamma matrix from the LDA Fit Model and visualize the gamma values.
#get the gamma matrix from the fit and plot it
LDA_gamma = tidy(LDA_fit, matrix = "gamma")
LDA_gamma %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_point() +
  cleanup
We see a lot of points with a gamma of 1 and a lot of points with a gamma of 0, which is consistent with the low entropy score for this model: a topic is either highly probable for a document or highly improbable.