Load the libraries + functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report so that others know what they need for the report to compile correctly.

##r chunk
library(gutenbergr)
library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(LSAfun)
## Loading required package: lsa
## Loading required package: SnowballC
## Loading required package: rgl
## Warning in rgl.init(initValue, onlyNULL): RGL: unable to open X11 display
## Warning: 'rgl.init' failed, running with 'rgl.useNULL = TRUE'.
library(lsa)

Load the Python libraries or functions that you will use for this section.

##python chunk
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
ps = PorterStemmer()

import gensim
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
import matplotlib
matplotlib.use('pdf')  #select a non-interactive backend before pyplot is imported
import matplotlib.pyplot as plt

The Data

You will want to use some books from Project Gutenberg to perform a Latent Semantic Analysis. The code to pick the books has been provided for you, so all you need to do is change the titles. Pick two titles from the list below. You can also try other titles not on the list, but they may not work. Check out other book titles at https://www.gutenberg.org/.

##r chunk
##pick 2 titles from project gutenberg, put in quotes and separate with commas
## DO NOT use the titles used in class, using those will be a 10 point deduction
titles = c("The Art of War", "The Iliad")

##read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title", mirror = "http://mirrors.xmission.com/gutenberg/") %>% 
  mutate(document = row_number())

create_chapters = books %>% 
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>% 
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) 

by_chapter = create_chapters %>% 
  group_by(document) %>% 
  summarise(text=paste(text,collapse=' '))
## `summarise()` ungrouping output (override with `.groups` argument)
by_chapter$document
##  [1] "The Art of War_1"  "The Art of War_10" "The Art of War_11"
##  [4] "The Art of War_12" "The Art of War_13" "The Art of War_14"
##  [7] "The Art of War_15" "The Art of War_16" "The Art of War_17"
## [10] "The Art of War_18" "The Art of War_19" "The Art of War_2" 
## [13] "The Art of War_20" "The Art of War_21" "The Art of War_22"
## [16] "The Art of War_23" "The Art of War_24" "The Art of War_25"
## [19] "The Art of War_26" "The Art of War_27" "The Art of War_28"
## [22] "The Art of War_29" "The Art of War_3"  "The Art of War_30"
## [25] "The Art of War_31" "The Art of War_32" "The Art of War_33"
## [28] "The Art of War_34" "The Art of War_35" "The Art of War_36"
## [31] "The Art of War_37" "The Art of War_38" "The Art of War_39"
## [34] "The Art of War_4"  "The Art of War_40" "The Art of War_41"
## [37] "The Art of War_42" "The Art of War_43" "The Art of War_44"
## [40] "The Art of War_45" "The Art of War_46" "The Art of War_47"
## [43] "The Art of War_48" "The Art of War_49" "The Art of War_5" 
## [46] "The Art of War_50" "The Art of War_51" "The Art of War_52"
## [49] "The Art of War_53" "The Art of War_54" "The Art of War_55"
## [52] "The Art of War_56" "The Art of War_57" "The Art of War_58"
## [55] "The Art of War_59" "The Art of War_6"  "The Art of War_60"
## [58] "The Art of War_61" "The Art of War_62" "The Art of War_63"
## [61] "The Art of War_64" "The Art of War_65" "The Art of War_66"
## [64] "The Art of War_67" "The Art of War_68" "The Art of War_7" 
## [67] "The Art of War_8"  "The Art of War_9"

The by_chapter data.frame can be used to create a corpus with VectorSource by using the text column.

Create the Vector Space

Use tm_map to clean up the text.

##r chunk 
library(tm)
## Loading required package: NLP
chapter_corpus  = Corpus(VectorSource(by_chapter$text))
chapter_corpus = tm_map(chapter_corpus, tolower)
## Warning in tm_map.SimpleCorpus(chapter_corpus, tolower): transformation drops
## documents
chapter_corpus = tm_map(chapter_corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(chapter_corpus, removePunctuation):
## transformation drops documents
chapter_corpus = tm_map(chapter_corpus, function(x) removeWords(x, stopwords("english")))
## Warning in tm_map.SimpleCorpus(chapter_corpus, function(x) removeWords(x, :
## transformation drops documents

Create a latent semantic analysis model in R.

##r chunk
chapter_matrix = as.matrix(TermDocumentMatrix(chapter_corpus))
#Weight the semantic space
chapter_weight = lw_logtf(chapter_matrix) * gw_idf(chapter_matrix)
#Run the SVD
chapter_lsa = lsa(chapter_weight)
#Convert to textmatrix for coherence
chapter_lsa = as.textmatrix(chapter_lsa)
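
The lsa() call keeps however many dimensions its default criterion (dimcalc_share() in the lsa package) selects, and as.textmatrix() then folds the reduced space back into a term-by-document matrix. An optional check of how large the reduced space actually is (a small sketch, not required by the assignment; chapter_svd is a name introduced here):

##r chunk
#re-run the SVD, keeping the raw LSA object so its parts can be inspected
chapter_svd = lsa(chapter_weight)
#number of dimensions retained under the default dimcalc_share() criterion
length(chapter_svd$sk)
#document coordinates in the reduced space: documents x dimensions
dim(chapter_svd$dk)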

Explore the vector space:

- Include at least one graphic of the vector space that interests you (a sketch of one option follows the statistics chunk below).
- Include at least two sets of statistics for your model: coherence, cosine, neighbors, etc.

##r chunk
cosine(chapter_lsa[,1], chapter_lsa[,3])
##           [,1]
## [1,] 0.8552614
cosine(chapter_lsa[,1], chapter_lsa[,2])
##           [,1]
## [1,] 0.9548858
#global coherence of the first chapter in the reduced space
coherence(by_chapter$text[1], tvectors = chapter_lsa)$global
## [1] 0.9071244
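
The prompt above also asks for a graphic of the vector space, which this chunk does not yet produce. One option is LSAfun's neighbors() and plot_neighbors(), which work from the term (row) vectors of the textmatrix. A minimal sketch, assuming the example term "war" survived the cleaning steps and is a row of chapter_lsa; dims = 2 keeps the plot in base graphics and avoids the rgl/X11 warning shown when LSAfun loaded:

##r chunk
#ten nearest neighbors of an example term in the reduced space
neighbors("war", n = 10, tvectors = chapter_lsa)
#2D plot of that term and its nearest neighbors
plot_neighbors("war", n = 10, dims = 2, tvectors = chapter_lsa)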

Transfer the by_chapter to Python and convert it to a list for processing.

##python chunk
chapters = list(r.by_chapter['text'])

Process the text using Python.

##python chunk
##create a spot to save the processed text
processed_text = []

##loop through each item in the list
for chapter in chapters:
  #lower case
  chapter = chapter.lower() 
  #remove punctuation
  chapter = chapter.translate(str.maketrans('', '', string.punctuation))
  #create tokens
  chapter = nltk.word_tokenize(chapter) 
  #take out stop words
  chapter = [word for word in chapter if word not in stopwords.words('english')] 
  #stem the words
  chapter = [ps.stem(word = word) for word in chapter]
  #add it to our list
  processed_text.append(chapter)

processed_text[0]
## ['tell', 'us', 'chapter', 'proce', 'give', 'biographi', 'descend', 'sun', 'pin', 'born', 'hundr', 'year', 'famou', 'ancestor', 'death', 'also', 'outstand', 'militari', 'geniu', 'time', 'historian', 'speak', 'sun', 'tzu', 'prefac', 'read', 'sun', 'tzu', 'feet', 'cut', 'yet', 'continu', 'discuss', 'art', 'war', '3', 'seem', 'like', 'pin', 'nicknam', 'bestow', 'mutil', 'unless', 'stori', 'invent', 'order', 'account', 'name', 'crown', 'incid', 'career', 'crush', 'defeat', 'treacher', 'rival', 'pang', 'chuan', 'found', 'briefli', 'relat']

Create the dictionary and term document matrix in Python.

##python chunk
#create a dictionary of the words
dictionary = corpora.Dictionary(processed_text)

#create a TDM
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_text]

Find the most likely number of dimensions using the coherence functions from the lecture.

##python chunk
##figure out the coherence scores
def compute_coherence_values(dictionary, doc_term_matrix, clean_text, start = 2, stop = 100, step = 2):
    coherence_values = []
    model_list = []
    for num_topics in range(start, stop, step):
        # generate LSA model
        model = LsiModel(doc_term_matrix, num_topics=num_topics, id2word = dictionary)  # train model
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, corpus = doc_term_matrix, texts=clean_text, dictionary=dictionary, coherence='u_mass')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

def plot_graph(dictionary, doc_term_matrix, clean_text, start, stop, step):
    model_list, coherence_values = compute_coherence_values(dictionary, doc_term_matrix, clean_text, start, stop, step)
    # Show graph
    x = range(start, stop, step)
    plt.plot(x, coherence_values)
    plt.xlabel("Number of Topics")
    plt.ylabel("Coherence score")
    plt.legend(["coherence_values"], loc='best')
    plt.show()
    
start, stop, step = 2, 12, 1
plot_graph(dictionary, doc_term_matrix, processed_text, start, stop, step)

Create the LSA model in Python with the optimal number of dimensions from the previous step.

##python chunk
#run the LSA
from pprint import pprint

number_of_topics = 2 #number of dimensions chosen from the coherence plot in the previous step
words = 10
lsamodel = LsiModel(doc_term_matrix, num_topics=number_of_topics, id2word = dictionary)

topics = lsamodel.show_topics(num_topics=number_of_topics, num_words = words, formatted=True)
pprint(topics)
## [(0,
##   '0.322*"armi" + 0.287*"line" + 0.203*"oper" + 0.174*"upon" + 0.170*"may" + '
##   '0.169*"enemi" + 0.168*"war" + 0.162*"thousand" + 0.155*"one" + '
##   '0.152*"battl"'),
##  (1,
##   '-0.311*"thousand" + 0.231*"may" + -0.222*"battl" + 0.215*"enemi" + '
##   '-0.215*"expedit" + 0.198*"armi" + -0.155*"hundr" + 0.117*"say" + '
##   '0.115*"gener" + 0.113*"sun"')]

Interpretation

Interpret your space: can you see differences between the books/novels? Explain the results of your analysis (at least one paragraph).
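
If a numeric or visual summary helps with this write-up, one possibility (a sketch in R, reusing chapter_lsa and by_chapter from above; how much between-book contrast shows up depends on which chapters survived the chapter-detection step) is a chapter-by-chapter cosine matrix labeled with the document names:

##r chunk
#cosine similarity between every pair of chapter vectors (columns of chapter_lsa)
doc_cos = cosine(chapter_lsa)
#label rows/columns with the book_chapter names from by_chapter
rownames(doc_cos) = by_chapter$document
colnames(doc_cos) = by_chapter$document
#heatmap of chapter-to-chapter similarity
heatmap(doc_cos, symm = TRUE)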

Discussion Question

Thinking of your current job or prospective career, propose a research project to address a problem or question in your industry that would use latent semantic analysis. Spend time thinking about this and write roughly a paragraph describing the problem/question and how you would address it with text data and latent semantic analysis.