Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

##r chunk
library(gutenbergr)
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
#library(pdftools)
library(tm)
## Loading required package: NLP
library(lsa)
## Loading required package: SnowballC
library(reticulate)
library(ngram)
library(knitr)

py_config()
## python:         C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python.exe
## libpython:      C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python37.dll
## pythonhome:     C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate
## version:        3.7.1 (default, Oct 28 2018, 08:39:03) [MSC v.1912 64 bit (AMD64)]
## Architecture:   64bit
## numpy:          C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/Lib/site-packages/numpy
## numpy_version:  1.17.3
## 
## python versions found: 
##  C:/Users/rohan/AppData/Local/Continuum/anaconda3/envs/r-reticulate/python.exe
##  C:/Users/rohan/AppData/Local/Continuum/anaconda3/python.exe
##  C:/Users/rohan/AppData/Local/Programs/Python/Python38-32/python.exe
use_python("C://Users//rohan//AppData//Local//Continuum//anaconda3//envs//r-reticulate//python")

py_install("nltk")

py_install("gensim")

py_install("matplotlib")

py_install("rpy2") 

Load the Python libraries or functions that you will use for that section.

##python chunk


import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
ps = PorterStemmer()


import gensim
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt

The Data

You will want to use some books from Project Gutenberg to perform a Latent Semantic Analysis. The code to pick the books has been provided for you, so all you would need to do is change out the titles. Be sure to pick different books - these are just provided as an example. Check out the book titles at https://www.gutenberg.org/.

##r chunk
##pick some titles from project gutenberg
titles = c( "Little Woman", "War and Peace",
           "A Modest Proposal", "A Tale of Two Cities")

##read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>% 
  mutate(document = row_number())
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
## Warning in .f(.x[[i]], ...): Could not download a book at http://
## aleph.gutenberg.org/1/9/5/0/19508/19508.zip
create_chapters = books %>% 
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>% 
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) 

by_chapter = create_chapters %>% 
  group_by(document) %>% 
  summarise(text=paste(text,collapse=' '))

by_chapter
## # A tibble: 777 x 2
##    document                text                                            
##    <chr>                   <chr>                                           
##  1 A Tale of Two Cities_1  "     Chapter I      The Period"                
##  2 A Tale of Two Cities_10 "     Chapter IV     Congratulatory"            
##  3 A Tale of Two Cities_11 "     Chapter V      The Jackal"                
##  4 A Tale of Two Cities_12 "     Chapter VI     Hundreds of People"        
##  5 A Tale of Two Cities_13 "     Chapter VII    Monseigneur in Town"       
##  6 A Tale of Two Cities_14 "     Chapter VIII   Monseigneur in the Country"
##  7 A Tale of Two Cities_15 "     Chapter IX     The Gorgon's Head"         
##  8 A Tale of Two Cities_16 "     Chapter X      Two Promises"              
##  9 A Tale of Two Cities_17 "     Chapter XI     A Companion Picture"       
## 10 A Tale of Two Cities_18 "     Chapter XII    The Fellow of Delicacy"    
## # … with 767 more rows

The by_chapter data.frame can be used to create a corpus with VectorSource by using the text column.

Create the Vector Space

Use tm_map to clean up the text.

##r chunk 

#read in the exam answers used later for the coherence statistic
exam_answers = read.csv("C:\\Users\\rohan\\OneDrive\\540\\exam_answers.csv", 
                        header = F, stringsAsFactors = F)

#ngram (loaded above) provides preprocess(), which lower cases the text and tidies the spacing
library(ngram)

exam_answers$processed = apply(exam_answers, 1, preprocess)

#LSAfun provides coherence(), plot_neighbors(), choose.target(), and multicos() used below
library(LSAfun, quietly = T)

#create the corpus from the chapter text
corpus = Corpus(VectorSource(by_chapter$text))

#lower case the text
corpus = tm_map(corpus, tolower)
## Warning in tm_map.SimpleCorpus(corpus, tolower): transformation drops documents
#remove punctuation
corpus = tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
#remove English stop words
corpus = tm_map(corpus, function(x) removeWords(x, stopwords("english")))
## Warning in tm_map.SimpleCorpus(corpus, function(x) removeWords(x,
## stopwords("english"))): transformation drops documents

Create a latent semantic analysis model in R.

##r chunk

#Create the term by document matrix
book_mat = as.matrix(TermDocumentMatrix(corpus))

library(lsa)

#Weight the semantic space
book_weight = lw_logtf(book_mat) * gw_idf(book_mat)

#Run the SVD
book_lsa = lsa(book_weight)
## Warning in lsa(book_weight): [lsa] - there are singular values which are zero.
#Convert to textmatrix for coherence
book_lsa = as.textmatrix(book_lsa)
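
The exploration chunk below reloads a semantic space from book_lsa.rda, but that file is never written in this report. Assuming the intent is to reuse the space built above, it could be saved first with something like:

##r chunk
#assumed step: save the weighted LSA space so it can be reloaded later
save(book_lsa, file = "book_lsa.rda")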

Explore the vector space:

- Include at least one graphic of the vector space that interests you.
- Include some statistics for your model: coherence, cosine, neighbors, etc.

##r chunk

load(file = "book_lsa.rda")

coherence(exam_answers$processed[5], 
          tvectors = book_lsa)
## $local
##  [1] 0.8382780 0.5912498 0.7265681 0.7052255 0.6247112 0.4329691 0.5219851
##  [8] 0.7849194 0.9018771 0.6324761 0.7202464 0.6815167
## 
## $global
## [1] 0.6801685
plot_neighbors("attention", #single word
               n = 10, #number of neighbors
               tvectors = book_lsa, #matrix space
               method = "MDS", #PCA or MDS
               dims = 2) #number of dimensions

##                     x           y
## attention -0.21577067  0.10472351
## pay       -0.53624750  0.12263114
## 1992b      0.07825270 -0.08888274
## enabling   0.07825270 -0.08888274
## settings  -0.07175133 -0.19237428
## joint      0.14599147 -0.05983595
## setting   -0.13889655 -0.30670474
## hall       0.31443183  0.06327499
## woodward   0.29986067  0.05606942
## labelling  0.04587667  0.38998139
choose.target("information", #choose word
              lower = .3, #lower cosine
              upper = .4, #upper cosine
              n = 10, #number of related words to get
              tvectors = book_lsa)
## modalityspecific            takes              eye          allowed 
##        0.3346019        0.3384103        0.3489222        0.3141806 
##       understand           select         question             1998 
##        0.3380527        0.3190996        0.3587158        0.3561607 
##        discourse              end 
##        0.3367944        0.3202899
#use a multiselect for lists of words
list1 = c("object", "basic", "attention", "approaches")

#plot all the words selected
plot_wordlist(list1, #put in the list above 
              method = "MDS", 
              dims = 2, #pick the number of dimensions
              tvectors = book_lsa)

##                     x           y
## object      0.2698205  0.43055794
## basic       0.1816531 -0.13959887
## attention   0.1028264 -0.38363000
## approaches -0.5543001  0.09267093
multicos(list1, tvectors = book_lsa)
##               object     basic attention approaches
## object     1.0000000 0.1686430 0.1499263  0.0944985
## basic      0.1686430 1.0000000 0.1817637  0.1138556
## attention  0.1499263 0.1817637 1.0000000  0.1194913
## approaches 0.0944985 0.1138556 0.1194913  1.0000000

Transfer the by_chapter to Python and convert it to a list for processing.


##python chunk

bychapter = list(r.by_chapter["text"])

bychapter[0]
## '     Chapter I      The Period'

Process the text using Python.

##python chunk

import nltk
nltk.download('punkt')
## True
## 
## [nltk_data] Downloading package punkt to
## [nltk_data]     C:\Users\rohan\AppData\Roaming\nltk_data...
## [nltk_data]   Package punkt is already up-to-date!
nltk.download('stopwords')
## True
## 
## [nltk_data] Downloading package stopwords to
## [nltk_data]     C:\Users\rohan\AppData\Roaming\nltk_data...
## [nltk_data]   Package stopwords is already up-to-date!

##create a spot to save the processed text
processed_text = []

##loop through each item in the list
for answer in bychapter:
  #lower case
  answer = answer.lower() 
  #create tokens
  answer = nltk.word_tokenize(answer) 
  #take out stop words
  answer = [word for word in answer if word not in stopwords.words('english')] 
  #stem the words
  answer = [ps.stem(word = word) for word in answer]
  #add it to our list
  processed_text.append(answer)

processed_text[0]
## ['chapter', 'period']

Create the dictionary and term document matrix in Python.

##python chunk

dictionary = corpora.Dictionary(processed_text)

#create a TDM
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_text]
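
To sanity check the dictionary and the term document matrix, one could peek at the encoding of the first chapter. A minimal sketch using the objects created above:

##python chunk
#assumed check: dictionary size and the (token id, count) pairs for the first chapter
print(len(dictionary))
print(doc_term_matrix[0])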

Find the most likely number of dimensions using the coherence functions from the lecture.

##python chunk

##figure out the coherence scores
##this chunk is left commented out: the coherence search could not be completed on the full corpus (see the Interpretation section)

#def compute_coherence_values(dictionary, doc_term_matrix, clean_text, start = 2, stop = 10, step = 2):
#    coherence_values = []
#    model_list = []
#    for num_topics in range(start, stop, step):
#        #generate an LSA model for each number of topics
#        model = LsiModel(doc_term_matrix, num_topics = num_topics, id2word = dictionary)
#        model_list.append(model)
#        coherencemodel = CoherenceModel(model = model, texts = clean_text, dictionary = dictionary, coherence = 'c_v')
#        coherence_values.append(coherencemodel.get_coherence())
#    return model_list, coherence_values

#def plot_graph(dictionary, doc_term_matrix, clean_text, start, stop, step):
#    model_list, coherence_values = compute_coherence_values(dictionary, doc_term_matrix, clean_text, start, stop, step)
#    #show the coherence score against the number of topics
#    x = range(start, stop, step)
#    plt.plot(x, coherence_values)
#    plt.xlabel("Number of Topics")
#    plt.ylabel("Coherence score")
#    plt.legend(("coherence_values"), loc = 'best')
#    plt.show()

#start, stop, step = 2, 7, 1
#plot_graph(dictionary, doc_term_matrix, processed_text, start, stop, step)
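
Since the full corpus was too large for the coherence loop, one option would be to estimate the number of topics on a subsample of chapters. A minimal sketch using the same approach as above (the subsample size of 100 chapters and the topic range are arbitrary assumptions):

##python chunk
#assumed approach: run the coherence search on a subsample to keep the runtime manageable
sample_docs = processed_text[:100]
sample_dict = corpora.Dictionary(sample_docs)
sample_dtm = [sample_dict.doc2bow(doc) for doc in sample_docs]

coherence_values = []
topic_range = range(2, 7)
for num_topics in topic_range:
  #train an LSA model and score it with c_v coherence
  model = LsiModel(sample_dtm, num_topics = num_topics, id2word = sample_dict)
  cm = CoherenceModel(model = model, texts = sample_docs, dictionary = sample_dict, coherence = 'c_v')
  coherence_values.append(cm.get_coherence())

#plot coherence against the number of topics
plt.plot(list(topic_range), coherence_values)
plt.xlabel("Number of Topics")
plt.ylabel("Coherence score")
plt.show()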

Create the LSA model in Python with the optimal number of dimensions from the previous step.

##python chunk

#run the LSA with a manually chosen number of topics (the coherence search above was not completed)
number_of_topics = 4
words = 10
lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)
print(lsamodel.print_topics(num_topics=number_of_topics, num_words=words))
## [(0, '0.860*"," + 0.393*"." + 0.183*"``" + 0.177*"\'\'" + 0.071*";" + 0.063*"!" + 0.060*"?" + 0.046*"--" + 0.044*"said" + 0.042*"\'s"'), (1, '-0.459*"“" + -0.458*"”" + 0.364*"``" + 0.353*"\'\'" + -0.346*"’" + -0.266*"." + 0.096*";" + -0.093*"princ" + -0.092*"..." + 0.091*"--"'), (2, '-0.376*"``" + -0.365*"\'\'" + -0.338*"”" + -0.334*"“" + -0.255*"!" + -0.241*"?" + 0.203*"," + -0.154*"said" + 0.126*"armi" + 0.100*"napoleon"'), (3, '0.730*"." + -0.381*"," + 0.277*"pierr" + -0.159*"’" + -0.136*"”" + -0.136*"rostóv" + -0.134*"“" + 0.113*"princess" + 0.106*"mari" + 0.076*"princ"')]

Interpretation

Interpret your space - can you see the differences between books/novels? Explain the results from your analysis (more than one sentence please).

Because of the large number of chapter documents, the coherence plot could not be produced, so the number of topics was chosen manually and the model gives only a partial view of the processed text. Even so, the printed topics do separate the books to some extent: the topics that load on "pierr", "rostóv", "princess", "napoleon", and "armi" clearly come from War and Peace, and the contrast between straight and curly quotation marks appears to fall along book lines as well. The remaining loadings are dominated by punctuation and "said", which suggests that removing punctuation tokens during the Python preprocessing would give a cleaner space. A sketch of one way to compare the books at the topic level follows.
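
As a rough check on the book-level differences, the topic weights could be averaged within each book. A minimal sketch, assuming the r.by_chapter, doc_term_matrix, lsamodel, and number_of_topics objects from the chunks above (the grouping relies on the document names ending in an underscore plus the chapter number):

##python chunk
#assumed approach: average each topic's weight per book using the document names
import collections

doc_names = list(r.by_chapter["document"])
book_totals = collections.defaultdict(lambda: [0.0] * number_of_topics)
book_counts = collections.defaultdict(int)

for name, bow in zip(doc_names, doc_term_matrix):
  book = name.rsplit("_", 1)[0]  #strip the trailing chapter number
  for topic_id, weight in lsamodel[bow]:
    book_totals[book][topic_id] += weight
  book_counts[book] += 1

for book, totals in book_totals.items():
  print(book, [round(total / book_counts[book], 3) for total in totals])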