For this assignment, you will rework your previous LSA model into a topics model. Note that the first few sections are the same - so use the same data you did before!

Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

##r chunk

library(gutenbergr)
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(tm)
## Loading required package: NLP
library(topicmodels)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ forcats 0.4.0
## ✓ readr   1.3.1
## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x ggplot2::annotate() masks NLP::annotate()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
library(tidytext)
library(slam)
library(reticulate)

py_config()
## python:         C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python.exe
## libpython:      C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python37.dll
## pythonhome:     C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate
## version:        3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/Lib/site-packages/numpy
## numpy_version:  1.17.3
## 
## python versions found: 
##  C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python.exe
##  C:/Users/rohan/AppData/Local/Continuum/anaconda3/python.exe
##  C:/Users/rohan/AppData/Local/Programs/Python/Python38-32/python.exe
use_python("C://Users//rohan//AppData//Local//Continuum//anaconda3//envs//r-reticulate//python")

py_install("nltk")

py_install("pyLDAvis")

py_install("gensim")


py_install("matplotlib")

py_install("rpy2")

Load the Python libraries or functions that you will use for this section.

##python chunk

import pyLDAvis
## C:\Users\rohan\AppData\Local\Continuum\anaconda3\envs\r-reticulate\lib\site-packages\past\builtins\misc.py:45: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
##   from imp import reload
import pyLDAvis.gensim

import matplotlib.pyplot as plt

import gensim
## C:\Users\rohan\AppData\Local\Continuum\anaconda3\envs\r-reticulate\lib\site-packages\gensim\corpora\dictionary.py:11: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
##   from collections import Mapping, defaultdict
## C:\Users\rohan\AppData\Local\Continuum\anaconda3\envs\r-reticulate\lib\site-packages\scipy\sparse\sparsetools.py:21: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
## scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
##   _deprecated()
## C:\Users\rohan\AppData\Local\Continuum\anaconda3\envs\r-reticulate\lib\site-packages\gensim\models\doc2vec.py:73: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
##   from collections import namedtuple, defaultdict, Iterable
import gensim.corpora as corpora

import nltk
## C:\Users\rohan\AppData\Local\Continuum\anaconda3\envs\r-reticulate\lib\site-packages\nltk\decorators.py:68: DeprecationWarning: `formatargspec` is deprecated since Python 3.5. Use `signature` and the `Signature` object directly
##   regargs, varargs, varkwargs, defaults, formatvalue=lambda value: ""
## C:\Users\rohan\AppData\Local\Continuum\anaconda3\envs\r-reticulate\lib\site-packages\nltk\lm\counter.py:15: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
##   from collections import Sequence, defaultdict
from nltk.corpus import stopwords

from nltk.stem.porter import PorterStemmer 

ps = PorterStemmer()

The Data

You will want to use some books from Project Gutenberg to build the topic model (the same data used for the earlier Latent Semantic Analysis). The code to pick the books has been provided for you, so all you need to do is change out the titles. Be sure to pick different books - these are just provided as an example. Check out the book titles at https://www.gutenberg.org/.

##r chunk
##pick some titles from project gutenberg
titles = c("The Adventures of Sherlock Holmes","The Odyssey",
           "The Innocents Abroad","Pride and Prejudice")

##read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>% 
  mutate(document = row_number())
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
create_chapters = books %>% 
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>% 
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) 

by_chapter = create_chapters %>% 
  group_by(document) %>% 
  summarise(text=paste(text,collapse=' '))

#by_chapter

The by_chapter data.frame can be used to create a corpus with VectorSource by using the text column.

Create the Topics Model

Create the corpus for the model in R.

##r chunk

import_corpus = Corpus(VectorSource(by_chapter$text))

Clean up the text and create the Document Term Matrix.

##r chunk 

import_mat =
  DocumentTermMatrix(import_corpus,
                     control = list(stemming = FALSE, #leave words unstemmed
                                    stopwords = TRUE, #remove stop words
                                    minWordLength = 4, #cut out small words
                                    removeNumbers = TRUE, #take out the numbers
                                    removePunctuation = TRUE)) #take out the punctuation

Weight the matrix to remove all the high and low frequency words.

##r chunk

#weight the matrix: mean term frequency scaled by log inverse document frequency
import_weight = tapply(import_mat$v/row_sums(import_mat)[import_mat$i], 
                       import_mat$j, 
                       mean) *
  log2(nDocs(import_mat)/col_sums(import_mat > 0))

#ignore very frequent and 0 terms
import_mat = import_mat[ , import_weight >= .05]
import_mat = import_mat[ row_sums(import_mat) > 0, ]

Run an LDA Fit model (only!).

##r chunk

k = 3 #set the number of topics

SEED = 2010 #set a random seed for reproducibility

LDA_fit = LDA(import_mat, k = k, 
              control = list(seed = SEED))

LDA_fixed = LDA(import_mat, k = k, 
                control = list(estimate.alpha = FALSE, seed = SEED))

LDA_gibbs = LDA(import_mat, k = k, method = "Gibbs", 
                control = list(seed = SEED, burnin = 1000, 
                               thin = 100, iter = 1000))

CTM_fit = CTM(import_mat, k = k, 
              control = list(seed = SEED, 
                             var = list(tol = 10^-4), 
                             em = list(tol = 10^-3)))

Comparison of the models.

##r chunk

LDA_fit@alpha
## [1] 0.04882677
LDA_fixed@alpha
## [1] 16.66667
LDA_gibbs@alpha
## [1] 16.66667
# Get the mean topic entropy for each model (lower values = documents concentrated in fewer topics)

sapply(list(LDA_fit, LDA_fixed, LDA_gibbs, CTM_fit), 
       function (x) 
         mean(apply(posterior(x)$topics, 1, function(z) - sum(z * log(z)))))
## [1] 0.1873166 1.0727754 1.0899554 0.4022338
# Actual Topics

topics(LDA_fit, k)
##      4 7 8 10 12 14 16 19 20 29 37 39 40 43 48 49 51 52 56 59 62 63 64 65 66 67
## [1,] 2 3 3  3  2  3  1  3  3  2  2  3  2  3  1  1  1  3  2  2  1  2  2  3  1  3
## [2,] 3 2 2  2  1  2  2  1  1  3  1  1  1  2  2  2  2  1  1  1  2  3  3  2  3  1
## [3,] 1 1 1  1  3  1  3  2  2  1  3  2  3  1  3  3  3  2  3  3  3  1  1  1  2  2
##      68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
## [1,]  2  3  3  1  1  1  2  1  2  2  1  2  2  3  2  1  1  2  1  2  3  2  2  2  3
## [2,]  3  1  2  3  3  2  1  2  1  1  2  1  1  1  1  2  3  1  2  3  2  1  1  3  1
## [3,]  1  2  1  2  2  3  3  3  3  3  3  3  3  2  3  3  2  3  3  1  1  3  3  1  2
##      93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## [1,]  3  3  1  1  3  1  2   2   3   2   2   2   2   2   2   3   3   3   2   1
## [2,]  2  1  3  2  2  3  3   3   2   1   1   1   3   3   1   1   1   2   3   3
## [3,]  1  2  2  3  1  2  1   1   1   3   3   3   1   1   3   2   2   1   1   2
##      113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130
## [1,]   2   1   2   2   3   1   3   3   3   1   1   3   1   3   3   3   3   1
## [2,]   3   3   3   1   2   2   2   2   2   2   3   2   2   1   1   2   1   3
## [3,]   1   2   1   3   1   3   1   1   1   3   2   1   3   2   2   1   2   2
##      131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
## [1,]   3   1   2   2   2   2   2   1   1   1   1   2   1   1   2   3   3   3
## [2,]   1   2   3   1   1   3   3   2   3   2   3   1   3   2   3   2   2   1
## [3,]   2   3   1   3   3   1   1   3   2   3   2   3   2   3   1   1   1   2
##      149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
## [1,]   1   1   1   2   2   2   3   1   3   1   2   2   1   1   1   1   1   1
## [2,]   2   3   2   1   3   1   1   3   2   2   1   3   3   3   3   3   3   3
## [3,]   3   2   3   3   1   3   2   2   1   3   3   1   2   2   2   2   2   2
##      167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184
## [1,]   1   2   1   1   3   1   3   2   2   2   1   3   3   3   3   3   2   1
## [2,]   2   3   2   2   1   3   2   3   1   1   3   2   2   2   1   1   1   3
## [3,]   3   1   3   3   2   2   1   1   3   3   2   1   1   1   2   2   3   2
##      185 186 187 188 189 190 191 192 193 194 195 197 198
## [1,]   3   2   2   3   1   2   1   1   2   2   3   3   1
## [2,]   1   1   1   1   2   3   2   3   1   3   1   2   3
## [3,]   2   3   3   2   3   1   3   2   3   1   2   1   2
topics(LDA_gibbs, k)
##      4 7 8 10 12 14 16 19 20 29 37 39 40 43 48 49 51 52 56 59 62 63 64 65 66 67
## [1,] 2 3 1  3  2  3  1  1  1  2  1  1  3  3  1  1  1  1  3  3  3  1  3  2  3  1
## [2,] 1 1 2  1  1  1  2  2  2  1  2  2  1  1  2  2  2  2  1  1  1  3  2  1  1  3
## [3,] 3 2 3  2  3  2  3  3  3  3  3  3  2  2  3  3  3  3  2  2  2  2  1  3  2  2
##      68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
## [1,]  3  2  1  1  3  1  1  3  2  2  3  1  1  2  1  1  1  3  3  2  1  3  1  1  2
## [2,]  1  1  2  2  2  2  3  1  1  3  1  2  2  1  3  3  3  1  2  1  2  1  3  2  3
## [3,]  2  3  3  3  1  3  2  2  3  1  2  3  3  3  2  2  2  2  1  3  3  2  2  3  1
##      93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## [1,]  2  2  1  2  1  3  3   3   3   2   1   2   2   1   3   3   2   3   1   1
## [2,]  3  3  2  3  3  2  1   2   1   1   3   3   1   2   1   2   1   1   3   3
## [3,]  1  1  3  1  2  1  2   1   2   3   2   1   3   3   2   1   3   2   2   2
##      113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130
## [1,]   2   1   3   1   2   1   1   3   3   2   2   2   2   3   1   2   2   3
## [2,]   1   2   1   2   1   2   3   1   2   1   1   1   3   1   3   3   1   1
## [3,]   3   3   2   3   3   3   2   2   1   3   3   3   1   2   2   1   3   2
##      131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
## [1,]   3   1   3   3   3   1   1   2   2   3   3   1   1   3   3   2   3   2
## [2,]   2   3   2   1   2   2   3   1   3   1   2   3   3   1   2   3   2   1
## [3,]   1   2   1   2   1   3   2   3   1   2   1   2   2   2   1   1   1   3
##      149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
## [1,]   1   3   2   3   3   1   3   3   1   2   1   2   3   2   2   2   2   2
## [2,]   3   1   1   2   1   3   1   1   2   3   2   3   1   1   1   3   3   3
## [3,]   2   2   3   1   2   2   2   2   3   1   3   1   2   3   3   1   1   1
##      167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184
## [1,]   2   2   3   1   3   2   1   1   1   1   2   2   2   2   2   1   2   2
## [2,]   1   1   2   3   1   1   2   2   2   2   1   3   1   3   1   2   1   1
## [3,]   3   3   1   2   2   3   3   3   3   3   3   1   3   1   3   3   3   3
##      185 186 187 188 189 190 191 192 193 194 195 197 198
## [1,]   1   3   2   1   2   2   1   1   2   1   1   3   2
## [2,]   2   2   1   2   1   1   2   2   3   3   3   1   1
## [3,]   3   1   3   3   3   3   3   3   1   2   2   2   3
# The terms for topics

terms(LDA_fit,10)
##       Topic 1    Topic 2      Topic 3       
##  [1,] "tangier"  "coliseum"   "vesuvius"    
##  [2,] "holmes"   "abelard"    "turkish"     
##  [3,] "rucastle" "malakoff"   "acropolis"   
##  [4,] "jaffa"    "pendulum"   "popular"     
##  [5,] "moorish"  "becoming"   "oracle"      
##  [6,] "toller"   "passports"  "annunciation"
##  [7,] "oyster"   "july"       "como"        
##  [8,] "oliver"   "cruise"     "numerous"    
##  [9,] "lamp"     "quarantine" "quarantine"  
## [10,] "turkish"  "nimrod"     "sailing"
terms(LDA_gibbs,10)
##       Topic 1        Topic 2      Topic 3   
##  [1,] "turkish"      "coliseum"   "vesuvius"
##  [2,] "abelard"      "tangier"    "holmes"  
##  [3,] "popular"      "jaffa"      "rucastle"
##  [4,] "oliver"       "legend"     "lamp"    
##  [5,] "ways"         "moorish"    "toller"  
##  [6,] "noted"        "oracle"     "oyster"  
##  [7,] "annunciation" "quarantine" "morocco" 
##  [8,] "catacombs"    "acropolis"  "numerous"
##  [9,] "complained"   "footprints" "bosporus"
## [10,] "splendor"     "treasures"  "bazaar"

Create a plot of the top ten terms for each topic.

##r chunk

#use tidytext to tidy up the LDA fit
LDA_fit_topics = tidy(LDA_fit, matrix = "beta")

#pull out the top ten terms for each topic
top_terms = LDA_fit_topics %>%
   group_by(topic) %>%
   top_n(10, beta) %>%
   ungroup() %>%
   arrange(topic, -beta)

cleanup = theme(panel.grid.major = element_blank(), 
                panel.grid.minor = element_blank(), 
                panel.background = element_blank(), 
                axis.line.x = element_line(color = "black"),
                axis.line.y = element_line(color = "black"),
                legend.key = element_rect(fill = "white"),
                text = element_text(size = 10))


#make the plot
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  cleanup +
  coord_flip()
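
Create a plot of the gamma values to see how strongly each chapter is associated with a single topic.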

##r chunk

LDA_gamma = tidy(LDA_fit, matrix = "gamma")

LDA_gamma %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_point() + 
  cleanup

Gensim Modeling in Python

Transfer the by_chapter to Python and convert it to a list for processing.

##python chunk


exam_answers = list(r.by_chapter["text"])
exam_answers[0]
## 'Chapter 1   It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.  However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.  "My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"  Mr. Bennet replied that he had not.  "But it is," returned she; "for Mrs. Long has just been here, and she told me all about it."  Mr. Bennet made no answer.  "Do you not want to know who has taken it?" cried his wife impatiently.  "_You_ want to tell me, and I have no objection to hearing it."  This was invitation enough.  "Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man of large fortune from the north of England; that he came down on Monday in a chaise and four to see the place, and was so much delighted with it, that he agreed with Mr. Morris immediately; that he is to take possession before Michaelmas, and some of his servants are to be in the house by the end of next week."  "What is his name?"  "Bingley."  "Is he married or single?"  "Oh! Single, my dear, to be sure! A single man of large fortune; four or five thousand a year. What a fine thing for our girls!"  "How so? How can it affect them?"  "My dear Mr. Bennet," replied his wife, "how can you be so tiresome! You must know that I am thinking of his marrying one of them."  "Is that his design in settling here?"  "Design! Nonsense, how can you talk so! But it is very likely that he _may_ fall in love with one of them, and therefore you must visit him as soon as he comes."  "I see no occasion for that. You and the girls may go, or you may send them by themselves, which perhaps will be still better, for as you are as handsome as any of them, Mr. Bingley may like you the best of the party."  "My dear, you flatter me. I certainly _have_ had my share of beauty, but I do not pretend to be anything extraordinary now. When a woman has five grown-up daughters, she ought to give over thinking of her own beauty."  "In such cases, a woman has not often much beauty to think of."  "But, my dear, you must indeed go and see Mr. Bingley when he comes into the neighbourhood."  "It is more than I engage for, I assure you."  "But consider your daughters. Only think what an establishment it would be for one of them. Sir William and Lady Lucas are determined to go, merely on that account, for in general, you know, they visit no newcomers. Indeed you must go, for it will be impossible for _us_ to visit him if you do not."  "You are over-scrupulous, surely. I dare say Mr. Bingley will be very glad to see you; and I will send a few lines by you to assure him of my hearty consent to his marrying whichever he chooses of the girls; though I must throw in a good word for my little Lizzy."  "I desire you will do no such thing. Lizzy is not a bit better than the others; and I am sure she is not half so handsome as Jane, nor half so good-humoured as Lydia. But you are always giving _her_ the preference."  "They have none of them much to recommend them," replied he; "they are all silly and ignorant like other girls; but Lizzy has something more of quickness than her sisters."  "Mr. Bennet, how _can_ you abuse your own children in such a way? You take delight in vexing me. You have no compassion for my poor nerves."  
"You mistake me, my dear. I have a high respect for your nerves. They are my old friends. I have heard you mention them with consideration these last twenty years at least."  "Ah, you do not know what I suffer."  "But I hope you will get over it, and live to see many young men of four thousand a year come into the neighbourhood."  "It will be no use to us, if twenty such should come, since you will not visit them."  "Depend upon it, my dear, that when there are twenty, I will visit them all."  Mr. Bennet was so odd a mixture of quick parts, sarcastic humour, reserve, and caprice, that the experience of three-and-twenty years had been insufficient to make his wife understand his character. _Her_ mind was less difficult to develop. She was a woman of mean understanding, little information, and uncertain temper. When she was discontented, she fancied herself nervous. The business of her life was to get her daughters married; its solace was visiting and news.   '

Process the text using Python.

##python chunk


##create a spot to save the processed text
processed_text = []

##loop through each item in the list
for answer in exam_answers:
  #lower case
  answer = answer.lower() 
  #create tokens
  answer = nltk.word_tokenize(answer) 
  #take out stop words
  answer = [word for word in answer if word not in stopwords.words('english')] 
  #stem the words
  answer = [ps.stem(word = word) for word in answer]
  #add it to our list
  processed_text.append(answer)

processed_text[0]
## ['chapter', '1', 'truth', 'univers', 'acknowledg', ',', 'singl', 'man', 'possess', 'good', 'fortun', ',', 'must', 'want', 'wife', '.', 'howev', 'littl', 'known', 'feel', 'view', 'man', 'may', 'first', 'enter', 'neighbourhood', ',', 'truth', 'well', 'fix', 'mind', 'surround', 'famili', ',', 'consid', 'right', 'properti', 'one', 'daughter', '.', '``', 'dear', 'mr.', 'bennet', ',', "''", 'said', 'ladi', 'one', 'day', ',', '``', 'heard', 'netherfield', 'park', 'let', 'last', '?', "''", 'mr.', 'bennet', 'repli', '.', '``', ',', "''", 'return', ';', '``', 'mrs.', 'long', ',', 'told', '.', "''", 'mr.', 'bennet', 'made', 'answer', '.', '``', 'want', 'know', 'taken', '?', "''", 'cri', 'wife', 'impati', '.', '``', '_you_', 'want', 'tell', ',', 'object', 'hear', '.', "''", 'invit', 'enough', '.', '``', ',', 'dear', ',', 'must', 'know', ',', 'mrs.', 'long', 'say', 'netherfield', 'taken', 'young', 'man', 'larg', 'fortun', 'north', 'england', ';', 'came', 'monday', 'chais', 'four', 'see', 'place', ',', 'much', 'delight', ',', 'agre', 'mr.', 'morri', 'immedi', ';', 'take', 'possess', 'michaelma', ',', 'servant', 'hous', 'end', 'next', 'week', '.', "''", '``', 'name', '?', "''", '``', 'bingley', '.', "''", '``', 'marri', 'singl', '?', "''", '``', 'oh', '!', 'singl', ',', 'dear', ',', 'sure', '!', 'singl', 'man', 'larg', 'fortun', ';', 'four', 'five', 'thousand', 'year', '.', 'fine', 'thing', 'girl', '!', "''", '``', '?', 'affect', '?', "''", '``', 'dear', 'mr.', 'bennet', ',', "''", 'repli', 'wife', ',', '``', 'tiresom', '!', 'must', 'know', 'think', 'marri', 'one', '.', "''", '``', 'design', 'settl', '?', "''", '``', 'design', '!', 'nonsens', ',', 'talk', '!', 'like', '_may_', 'fall', 'love', 'one', ',', 'therefor', 'must', 'visit', 'soon', 'come', '.', "''", '``', 'see', 'occas', '.', 'girl', 'may', 'go', ',', 'may', 'send', ',', 'perhap', 'still', 'better', ',', 'handsom', ',', 'mr.', 'bingley', 'may', 'like', 'best', 'parti', '.', "''", '``', 'dear', ',', 'flatter', '.', 'certainli', '_have_', 'share', 'beauti', ',', 'pretend', 'anyth', 'extraordinari', '.', 'woman', 'five', 'grown-up', 'daughter', ',', 'ought', 'give', 'think', 'beauti', '.', "''", '``', 'case', ',', 'woman', 'often', 'much', 'beauti', 'think', '.', "''", '``', ',', 'dear', ',', 'must', 'inde', 'go', 'see', 'mr.', 'bingley', 'come', 'neighbourhood', '.', "''", '``', 'engag', ',', 'assur', '.', "''", '``', 'consid', 'daughter', '.', 'think', 'establish', 'would', 'one', '.', 'sir', 'william', 'ladi', 'luca', 'determin', 'go', ',', 'mere', 'account', ',', 'gener', ',', 'know', ',', 'visit', 'newcom', '.', 'inde', 'must', 'go', ',', 'imposs', '_us_', 'visit', '.', "''", '``', 'over-scrupul', ',', 'sure', '.', 'dare', 'say', 'mr.', 'bingley', 'glad', 'see', ';', 'send', 'line', 'assur', 'hearti', 'consent', 'marri', 'whichev', 'choos', 'girl', ';', 'though', 'must', 'throw', 'good', 'word', 'littl', 'lizzi', '.', "''", '``', 'desir', 'thing', '.', 'lizzi', 'bit', 'better', 'other', ';', 'sure', 'half', 'handsom', 'jane', ',', 'half', 'good-humour', 'lydia', '.', 'alway', 'give', '_her_', 'prefer', '.', "''", '``', 'none', 'much', 'recommend', ',', "''", 'repli', ';', '``', 'silli', 'ignor', 'like', 'girl', ';', 'lizzi', 'someth', 'quick', 'sister', '.', "''", '``', 'mr.', 'bennet', ',', '_can_', 'abus', 'children', 'way', '?', 'take', 'delight', 'vex', '.', 'compass', 'poor', 'nerv', '.', "''", '``', 'mistak', ',', 'dear', '.', 'high', 'respect', 'nerv', '.', 'old', 'friend', '.', 'heard', 'mention', 'consider', 'last', 'twenti', 
'year', 'least', '.', "''", '``', 'ah', ',', 'know', 'suffer', '.', "''", '``', 'hope', 'get', ',', 'live', 'see', 'mani', 'young', 'men', 'four', 'thousand', 'year', 'come', 'neighbourhood', '.', "''", '``', 'use', 'us', ',', 'twenti', 'come', ',', 'sinc', 'visit', '.', "''", '``', 'depend', 'upon', ',', 'dear', ',', 'twenti', ',', 'visit', '.', "''", 'mr.', 'bennet', 'odd', 'mixtur', 'quick', 'part', ',', 'sarcast', 'humour', ',', 'reserv', ',', 'capric', ',', 'experi', 'three-and-twenti', 'year', 'insuffici', 'make', 'wife', 'understand', 'charact', '.', '_her_', 'mind', 'less', 'difficult', 'develop', '.', 'woman', 'mean', 'understand', ',', 'littl', 'inform', ',', 'uncertain', 'temper', '.', 'discont', ',', 'fanci', 'nervou', '.', 'busi', 'life', 'get', 'daughter', 'marri', ';', 'solac', 'visit', 'news', '.']
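
Note that the stemmed tokens above still include punctuation and quote tokens (',', '.', '``'), which is why punctuation dominates the gensim topics further below. A minimal optional sketch for dropping non-alphabetic tokens is shown next (it assumes the processed_text list from the chunk above; the results reported below were produced without this step).

##python chunk

#optional (hypothetical) extra cleaning step: keep only alphabetic tokens
#note: the output shown later in this report was generated without running this
processed_text = [[word for word in answer if word.isalpha()]
                  for answer in processed_text]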

Create the dictionary and term document matrix in Python.

##python chunk

dictionary = corpora.Dictionary(processed_text)

#create a TDM
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_text]

Create the LDA Topics model in Python using the same number of topics you picked for the LDA Fit R Model.

##python chunk

lda_model = gensim.models.ldamodel.LdaModel(corpus = doc_term_matrix, #TDM
                                           id2word = dictionary, #Dictionary
                                           num_topics = 3, 
                                           random_state = 100,
                                           update_every = 1,
                                           chunksize = 100,
                                           passes = 10,
                                           alpha = 'auto',
                                           per_word_topics = True)
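
As a quick sanity check on the Python model (a hedged sketch, assuming the lda_model and doc_term_matrix objects created above), you can inspect the topic mixture for a single chapter.

##python chunk

#optional check: topic proportions for the first chapter
#get_document_topics returns (topic, probability) pairs
print(lda_model.get_document_topics(doc_term_matrix[0]))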

Create the interactive graphics html file. Please note that this file saves in the same folder as your markdown document, and you should upload the knitted file and the LDA visualization html file.

##python chunk


print(lda_model.print_topics())
## [(0, '0.113*"," + 0.063*"." + 0.023*"``" + 0.022*"\'\'" + 0.019*";" + 0.009*"mr." + 0.008*"\'s" + 0.007*"elizabeth" + 0.006*"could" + 0.006*"!"'), (1, '0.088*"," + 0.057*"." + 0.029*"--" + 0.011*"\'\'" + 0.010*"``" + 0.009*";" + 0.008*"!" + 0.005*"one" + 0.005*"\'s" + 0.004*"said"'), (2, '0.098*"," + 0.058*"." + 0.016*"--" + 0.011*";" + 0.008*"one" + 0.005*"\'\'" + 0.005*"``" + 0.004*"!" + 0.004*"upon" + 0.004*"\'s"')]
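
The chunk above prints the topics but does not yet write the interactive visualization. A minimal sketch for saving the html file, assuming the lda_model, doc_term_matrix, and dictionary objects from above (the file name 'LDA_vis.html' is only an example):

##python chunk

#prepare the pyLDAvis data from the gensim model and save it as an html file
vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
pyLDAvis.save_html(vis, 'LDA_vis.html') #example output file name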

Interpretation

Interpret your space - can you see the differences between books/novels? Explain the results from your analysis (more than one sentence please).

The LDA fit model has a low alpha value, which indicates that each document tends to be dominated by a single topic. The high alpha values for the LDA fixed and LDA Gibbs models indicate that topics are spread more evenly within each document.

The LDA fit and CTM fit models have low entropy values, which shows that each document is concentrated in only a few topics, whereas the LDA fixed and LDA Gibbs models have high entropy values, indicating that topics are spread across the documents and the topic assignments are less distinct.

Across the four books ('The Adventures of Sherlock Holmes', 'The Odyssey', 'The Innocents Abroad', and 'Pride and Prejudice'), topics 1, 2, and 3 appear fairly uniformly distributed; the topics do not cleanly separate the individual novels.

Within topics 1, 2, and 3 there are no strongly dominant terms; the top words all appear with fairly similar weights.

The most prominent terms in topic 1 are 'tangier' and 'holmes'; for topic 2 the prominent terms are 'coliseum' and 'abelard'; and for topic 3 the two most prominent terms are 'vesuvius' and 'turkish'.

The gamma plot shows points spread across the whole range from 0 to 1, which indicates that the chapters are not assigned exclusively to one topic; each topic is roughly equally probable across the documents.