Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report so that others know what they need for the report to compile correctly.
##r chunk
library(gutenbergr)
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
#library(pdftools)
library(tm)
## Loading required package: NLP
library(lsa)
## Loading required package: SnowballC
library(reticulate)
library(ngram)
library(knitr)
py_config()
## python: C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python.exe
## libpython: C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python37.dll
## pythonhome: C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate
## version: 3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
## Architecture: 64bit
## numpy: C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/Lib/site-packages/numpy
## numpy_version: 1.17.3
##
## python versions found:
## C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python.exe
## C:/Users/rohan/AppData/Local/Continuum/anaconda3/python.exe
## C:/Users/rohan/AppData/Local/Programs/Python/Python38-32/python.exe
use_python("C://Users//rohan//AppData//Local//Continuum//anaconda3//envs//r-reticulate//python")
py_install("nltk")
py_install("gensim")
py_install("matplotlib")
py_install("rpy2")
Load the Python libraries or functions that you will use for this section.
##python chunk
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
import gensim
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt
You will want to use some books from Project Gutenberg to perform a Latent Semantic Analysis. The code to pick the books has been provided for you, so all you would need to do is change out the titles. Be sure to pick different books - these are just provided as an example. Check out the book titles at https://www.gutenberg.org/.
##r chunk
##pick some titles from project gutenberg
titles = c( "Little Woman", "War and Peace",
"A Modest Proposal", "A Tale of Two Cities")
##read in those books
books = gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title") %>%
mutate(document = row_number())
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
## Warning in .f(.x[[i]], ...): Could not download a book at http://
## aleph.gutenberg.org/1/9/5/0/19508/19508.zip
create_chapters = books %>%
group_by(title) %>%
mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, title, chapter)
by_chapter = create_chapters %>%
group_by(document) %>%
summarise(text=paste(text,collapse=' '))
by_chapter
## # A tibble: 777 x 2
## document text
## <chr> <chr>
## 1 A Tale of Two Cities_1 " Chapter I The Period"
## 2 A Tale of Two Cities_10 " Chapter IV Congratulatory"
## 3 A Tale of Two Cities_11 " Chapter V The Jackal"
## 4 A Tale of Two Cities_12 " Chapter VI Hundreds of People"
## 5 A Tale of Two Cities_13 " Chapter VII Monseigneur in Town"
## 6 A Tale of Two Cities_14 " Chapter VIII Monseigneur in the Country"
## 7 A Tale of Two Cities_15 " Chapter IX The Gorgon's Head"
## 8 A Tale of Two Cities_16 " Chapter X Two Promises"
## 9 A Tale of Two Cities_17 " Chapter XI A Companion Picture"
## 10 A Tale of Two Cities_18 " Chapter XII The Fellow of Delicacy"
## # … with 767 more rows
The by_chapter data.frame can be used to create a corpus with VectorSource by using the text column.
Use tm_map to clean up the text.
##r chunk
#read in a set of example answers (used later for the coherence check)
exam_answers = read.csv("C:\\Users\\rohan\\OneDrive\\540\\exam_answers.csv",
                        header = F, stringsAsFactors = F)
library(ngram)
#clean up the example answers with ngram::preprocess
exam_answers$processed = apply(exam_answers, 1, preprocess)
#LSAfun provides the coherence, neighbor, and plotting functions used below
library(LSAfun, quietly = T)
#create the corpus from the chapter text
corpus = Corpus(VectorSource(by_chapter$text))
corpus = tm_map(corpus, tolower)
## Warning in tm_map.SimpleCorpus(corpus, tolower): transformation drops documents
corpus = tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus = tm_map(corpus, function(x) removeWords(x, stopwords("english")))
## Warning in tm_map.SimpleCorpus(corpus, function(x) removeWords(x,
## stopwords("english"))): transformation drops documents
Create a latent semantic analysis model in R.
##r chunk
#Create the term by document matrix
book_mat = as.matrix(TermDocumentMatrix(corpus))
library(lsa)
#Weight the semantic space
book_weight = lw_logtf(book_mat) * gw_idf(book_mat)
#Run the SVD
book_lsa = lsa(book_weight)
## Warning in lsa(book_weight): [lsa] - there are singular values which are zero.
#Convert to textmatrix for coherence
book_lsa = as.textmatrix(book_lsa)
Explore the vector space:
- Include at least one graphic of the vector space that interests you.
- Include some statistics for your model: coherence, cosine, neighbors, etc.
##r chunk
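#load a previously saved LSA space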
load(file = "book_lsa.rda")
coherence(exam_answers$processed[5],
tvectors = book_lsa)
## $local
## [1] 0.8382780 0.5912498 0.7265681 0.7052255 0.6247112 0.4329691 0.5219851
## [8] 0.7849194 0.9018771 0.6324761 0.7202464 0.6815167
##
## $global
## [1] 0.6801685
plot_neighbors("attention", #single word
n = 10, #number of neighbors
tvectors = book_lsa, #matrix space
method = "MDS", #PCA or MDS
dims = 2) #number of dimensions
## x y
## attention -0.21577067 0.10472351
## pay -0.53624750 0.12263114
## 1992b 0.07825270 -0.08888274
## enabling 0.07825270 -0.08888274
## settings -0.07175133 -0.19237428
## joint 0.14599147 -0.05983595
## setting -0.13889655 -0.30670474
## hall 0.31443183 0.06327499
## woodward 0.29986067 0.05606942
## labelling 0.04587667 0.38998139
choose.target("information", #choose word
lower = .3, #lower cosine
upper = .4, #upper cosine
n = 10, #number of related words to get
tvectors = book_lsa)
## modalityspecific takes eye allowed
## 0.3346019 0.3384103 0.3489222 0.3141806
## understand select question 1998
## 0.3380527 0.3190996 0.3587158 0.3561607
## discourse end
## 0.3367944 0.3202899
#use a multiselect for lists of words
list1 = c("object", "basic", "attention", "approaches")
#plot all the words selected
plot_wordlist(list1, #put in the list above
method = "MDS",
dims = 2, #pick the number of dimensions
tvectors = book_lsa)
## x y
## object 0.2698205 0.43055794
## basic 0.1816531 -0.13959887
## attention 0.1028264 -0.38363000
## approaches -0.5543001 0.09267093
multicos(list1, tvectors = book_lsa)
## object basic attention approaches
## object 1.0000000 0.1686430 0.1499263 0.0944985
## basic 0.1686430 1.0000000 0.1817637 0.1138556
## attention 0.1499263 0.1817637 1.0000000 0.1194913
## approaches 0.0944985 0.1138556 0.1194913 1.0000000
Transfer the by_chapter to Python and convert it to a list for processing.
##python chunk
bychapter = list(r.by_chapter["text"])
bychapter[0]
## ' Chapter I The Period'
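A quick way to confirm that all 777 chapter documents transferred from R is to check the length of the list; a minimal sketch (output not shown):
##python chunk
#confirm how many chapter documents transferred from R
print(len(bychapter))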
Process the text using Python.
##python chunk
import nltk
nltk.download('punkt')
## True
##
## [nltk_data] Downloading package punkt to
## [nltk_data] C:\Users\rohan\AppData\Roaming\nltk_data...
## [nltk_data] Package punkt is already up-to-date!
nltk.download('stopwords')
## True
##
## [nltk_data] Downloading package stopwords to
## [nltk_data] C:\Users\rohan\AppData\Roaming\nltk_data...
## [nltk_data] Package stopwords is already up-to-date!
##create a spot to save the processed text
processed_text = []
##loop through each item in the list
for answer in bychapter:
    #lower case
    answer = answer.lower()
    #create tokens
    answer = nltk.word_tokenize(answer)
    #take out stop words
    answer = [word for word in answer if word not in stopwords.words('english')]
    #stem the words
    answer = [ps.stem(word = word) for word in answer]
    #add it to our list
    processed_text.append(answer)
processed_text[0]
## ['chapter', 'period']
Create the dictionary and term document matrix in Python.
##python chunk
dictionary = corpora.Dictionary(processed_text)
#create a TDM
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_text]
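Before fitting the model, it can help to confirm what doc2bow produced; a minimal sketch inspecting the dictionary size and the first chapter's (token id, count) pairs:
##python chunk
#number of unique tokens across all chapters
print(len(dictionary))
#first few (token id, count) pairs for the first chapter document
print(doc_term_matrix[0][:10])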
Find the most likely number of dimensions using the coherence functions from the lecture.
##python chunk
##figure out the coherence scores
def compute_coherence_values(dictionary, doc_term_matrix, clean_text, start = 2, stop = 10, step = 2):
    coherence_values = []
    model_list = []
    for num_topics in range(start, stop, step):
        #generate an LSA model with this number of topics
        model = LsiModel(doc_term_matrix, num_topics=num_topics, id2word = dictionary)
        model_list.append(model)
        #score the model with c_v coherence
        coherencemodel = CoherenceModel(model=model, texts=clean_text, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

def plot_graph(dictionary, doc_term_matrix, clean_text, start, stop, step):
    model_list, coherence_values = compute_coherence_values(dictionary, doc_term_matrix, clean_text, start, stop, step)
    #show the graph of coherence by number of topics
    x = range(start, stop, step)
    plt.plot(x, coherence_values)
    plt.xlabel("Number of Topics")
    plt.ylabel("Coherence score")
    plt.legend(["coherence_values"], loc='best')
    plt.show()

#the coherence search was not run here because it is slow on a corpus of this size
#start, stop, step = 2, 7, 1
#plot_graph(dictionary, doc_term_matrix, processed_text, start, stop, step)
Create the LSA model in Python with the optimal number of dimensions from the previous step.
##python chunk
#run the LSA
number_of_topics = 4
words = 10
lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)
print(lsamodel.print_topics(num_topics=number_of_topics, num_words=words))
## [(0, '0.860*"," + 0.393*"." + 0.183*"``" + 0.177*"\'\'" + 0.071*";" + 0.063*"!" + 0.060*"?" + 0.046*"--" + 0.044*"said" + 0.042*"\'s"'), (1, '-0.459*"“" + -0.458*"”" + 0.364*"``" + 0.353*"\'\'" + -0.346*"’" + -0.266*"." + 0.096*";" + -0.093*"princ" + -0.092*"..." + 0.091*"--"'), (2, '-0.376*"``" + -0.365*"\'\'" + -0.338*"”" + -0.334*"“" + -0.255*"!" + -0.241*"?" + 0.203*"," + -0.154*"said" + 0.126*"armi" + 0.100*"napoleon"'), (3, '0.730*"." + -0.381*"," + 0.277*"pierr" + -0.159*"’" + -0.136*"”" + -0.136*"rostóv" + -0.134*"“" + 0.113*"princess" + 0.106*"mari" + 0.076*"princ"')]
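The weights above are at the topic-word level; to see how individual chapters sit in the space, each chapter's bag-of-words can be projected through the fitted model. A minimal sketch, assuming processed_text, dictionary, and lsamodel are still in memory (the variable names below are only for illustration):
##python chunk
#project two chapters into the 4-dimensional LSA space
#each result is a list of (dimension, loading) pairs
first_chapter = lsamodel[dictionary.doc2bow(processed_text[0])]
last_chapter = lsamodel[dictionary.doc2bow(processed_text[-1])]
print(first_chapter)
print(last_chapter)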
Interpret your space - can you see the differences between books/novels? Explain the results from your analysis (more than one sentence please).
Because of the large number of chapter documents, the coherence plot could not be generated, so the four-dimension solution was chosen without that check and the model gives only a partial view of the processed text. Even so, differences between the books are visible in the printed topics: the later dimensions load heavily on terms such as "pierr", "rostóv", "napoleon", "armi", "princess", and "mari", which clearly come from War and Peace, while the first dimensions are dominated by punctuation and quotation-mark tokens, so they separate the books more by formatting than by content. Removing punctuation in the Python preprocessing, as was done in the R pipeline, would likely yield dimensions that distinguish the novels more cleanly.