Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
##r chunk
# install.packages("reticulate")
# install.packages("tm")
# install.packages("lsa")
# install.packages("LSAfun")
# setwd("C:\\Users\\JIANWEI LI\\Desktop\\anly540_temp\\rstudio_python\\anly540")
library(gutenbergr)
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(reticulate)
library(tm) # text mining
## Loading required package: NLP
library(lsa)
## Loading required package: SnowballC
library(LSAfun)
## Loading required package: rgl
## Warning: package 'rgl' was built under R version 4.0.4
# py_run_string("import os as os")
# py_run_string("os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:\ProgramData\Anaconda3\Library\platforms'")
# reticulate::py_config()
# py_discover_config('python')
use_virtualenv("python")
# py_install("matplotlib")
# py_install("pandas")
#
# #py_install("nltk")
# py_install("numpy")
# py_install("gensim")
Load the Python libraries or functions that you will use for this section.
##python chunk
import matplotlib
matplotlib.use('pdf')
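For reference, the Python libraries used in the later sections could also be loaded here in one place. This is only an optional sketch and assumes the packages were already installed with the commented py_install() calls above; each later chunk still imports what it needs.
##python chunk
# Optional consolidated imports for the Python sections below (assumes the
# packages were installed once with py_install() in the setup chunk).
import nltk                      # tokenizing, stemming, stopword lists
import gensim                    # dictionary, document-term matrix, LSI model
import matplotlib.pyplot as plt  # coherence plot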
You will want to use some books from Project Gutenberg to perform a Latent Semantic Analysis. The code to pick the books has been provided for you, so all you would need to do is change out the titles. Pick 2 titles from the list below. You can also try other titles not on the list, but they may not work. Check out other book titles at https://www.gutenberg.org/.
##r chunk
##pick 2 titles from project gutenberg, put in quotes and separate with commas
## DO NOT use the titles used in class, using those will be a 10 point deduction
titles = c("The Iliad","The Art of War")
##read in those books
books = gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title", mirror = "http://mirrors.xmission.com/gutenberg/") %>%
mutate(document = row_number())
create_chapters = books %>%
group_by(title) %>%
mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, title, chapter)
by_chapter = create_chapters %>%
group_by(document) %>%
summarise(text=paste(text,collapse=' '))
by_chapter
## # A tibble: 68 x 2
## document text
## * <chr> <chr>
## 1 The Art of War~ "tell us in this chapter. But he proceeds to give a biograp~
## 2 The Art of War~ "no. 11 [in Chapter VIII] is one that the people of this cou~
## 3 The Art of War~ "title of this chapter, says it refers to the deliberations ~
## 4 The Art of War~ "the subject of the chapter is not what we might expect from~
## 5 The Art of War~ "chapter is intended to enforce.\"] 20. Thus it may b~
## 6 The Art of War~ "the title of this chapter: \"marching and countermarching ~
## 7 The Art of War~ "here differently than anywhere else in this chapter. Thus ~
## 8 The Art of War~ " [The chief lesson of this chapter, in Tu Mu's opinion,~
## 9 The Art of War~ "[1] \"Forty-one Years in India,\" chapter 46. -----------~
## 10 The Art of War~ "follows: \"Chapter IV, on Tactical Dispositions, treated ~
## # ... with 58 more rows
The by_chapter data.frame can be used to create a corpus with VectorSource(), using the text column.
Use tm_map to clean up the text.
##r chunk
import_corpus = Corpus(VectorSource(by_chapter$text))
import_corpus = tm_map(import_corpus,tolower)
## Warning in tm_map.SimpleCorpus(import_corpus, tolower): transformation drops
## documents
import_corpus = tm_map(import_corpus,function(x) removeWords(x,stopwords("english")))
## Warning in tm_map.SimpleCorpus(import_corpus, function(x) removeWords(x, :
## transformation drops documents
Create a latent semantic analysis model in R.
##r chunk
book_matrix = as.matrix(TermDocumentMatrix(import_corpus))
head(book_matrix)
## Docs
## Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## "pin" 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## "sun 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0
## account 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0
## also 1 0 1 0 3 3 0 0 0 0 0 0 0 0 0 1 0 0 3 8 0 8 0 0 0
## ancestor's 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Docs
## Terms 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
## "pin" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## "sun 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0
## [3] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## account 1 1 0 0 0 0 0 0 2 0 0 1 1 2 1 0 3 4 3 1 5 0
## also 0 4 0 0 0 0 0 0 0 0 0 10 5 8 6 8 4 7 9 20 12 0
## ancestor's 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Docs
## Terms 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## "pin" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## "sun 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## account 0 0 1 1 3 2 1 4 0 3 1 0 1 2 0 0 5 11 0 0 0
## also 1 1 7 2 1 5 0 12 0 7 2 4 10 7 0 5 6 6 0 0 1
## ancestor's 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# Weight the semantic space
book_weight = lw_logtf(book_matrix)*gw_idf(book_matrix)
# Run the SVD
book_lsa = lsa(book_weight)
book_lsa = as.textmatrix(book_lsa)
head(book_lsa)
## 1 2 3 4 5
## "pin" 0.000362519 0.001100619 0.003657066 0.003330501 0.005589526
## "sun 0.059358064 0.169665585 0.087874685 0.113147848 0.125236348
## [3] 0.010169170 0.045425716 0.341074947 0.208505310 0.385000364
## account 0.028130781 0.050921321 0.391027523 0.404730858 0.660087241
## also 0.086213343 0.197360301 1.100959691 1.112179822 1.775727942
## ancestor's 0.001931820 0.006423920 0.015251229 0.014601918 0.022906060
## 6 7 8 9 10
## "pin" 0.01770399 0.001701275 2.841637e-05 0.0000652877 2.626797e-05
## "sun -0.16493682 0.064757497 5.099569e-03 0.0052240700 -1.157932e-03
## [3] 0.50300439 0.165622144 7.435279e-03 0.0097521647 2.366217e-03
## account 0.82833167 0.191701030 1.135399e-02 0.0062857930 1.687949e-03
## also 2.83382135 0.564443250 2.420527e-02 0.0206431194 7.458710e-03
## ancestor's 0.05342274 0.008334917 1.572315e-04 0.0003895894 1.421761e-04
## 11 12 13 14 15
## "pin" 0.0001795101 0.0006319974 3.397216e-05 0.003422985 0.0001193069
## "sun 0.0035909754 0.1036982195 8.952982e-04 0.111347712 -0.0016700047
## [3] 0.0020454221 0.0352504458 1.724805e-03 0.319878419 0.0024369814
## account 0.0402490895 0.0538788771 3.647082e-03 0.533547983 0.0129629277
## also 0.0761262162 0.1421657032 8.754280e-03 1.362868852 0.0329910350
## ancestor's 0.0005364399 0.0033356075 1.102284e-04 0.015496790 0.0003978972
## 16 17 18 19 20
## "pin" 0.006355441 0.004502818 0.0001694334 0.005153906 0.007953511
## "sun 0.111836832 0.194424904 0.0036754931 0.122255830 -0.076986786
## [3] 0.468138903 0.455243531 0.0128760653 0.436966050 0.248193669
## account 0.650320964 0.412434271 0.0158566551 0.574557937 0.198360610
## also 1.858904495 1.317161876 0.0464641292 1.653930844 3.799311964
## ancestor's 0.027824029 0.022094401 0.0007081733 0.022951268 0.079024090
## 21 22 23 24 25
## "pin" 9.447035e-05 0.002413102 0.0004732804 0.0002178576 0.0000838108
## "sun 1.082227e-02 -0.045735324 0.0967535597 0.0261838408 0.0108783301
## [3] 7.164258e-03 7.672142511 0.0169536035 0.0299832460 0.0084187677
## account 5.295553e-03 1.627869849 0.0535075183 0.0321797082 0.0087082347
## also 2.082949e-02 4.131278363 0.1527319210 0.0805037567 0.0259002795
## ancestor's 5.265962e-04 0.045788954 0.0026890500 0.0011444807 0.0004645736
## 26 27 28 29 30
## "pin" 0.007909615 0.01449791 0.0001193735 -1.115703e-05 7.000588e-05
## "sun 0.085158622 -0.03882618 0.0159822609 1.169803e-03 3.101357e-03
## [3] 0.535187511 0.11408452 -0.0028237390 -2.381736e-03 -4.463356e-03
## account 0.689336617 1.65218997 0.0390171597 1.762709e-02 2.799609e-02
## also 2.123690186 2.98778329 0.0690884256 4.840837e-02 5.172588e-02
## ancestor's 0.034248984 0.05465840 0.0004424400 3.310560e-05 6.370726e-05
## 31 32 33 34 35
## "pin" -4.448301e-05 -9.126356e-06 -8.030301e-06 0.002574986 3.897163e-05
## "sun -6.546986e-03 1.092204e-03 1.800514e-03 0.479282994 5.375807e-03
## [3] -1.215181e-02 2.478449e-04 -4.867491e-03 0.141973746 2.835861e-03
## account 9.308946e-02 3.343193e-02 2.242591e-02 0.217357858 1.580542e-02
## also 1.194256e-01 4.709280e-02 2.974250e-02 0.640958264 2.544843e-02
## ancestor's -3.771911e-04 -8.678693e-05 -8.402430e-05 0.014676240 1.437834e-04
## 36 37 38 39 40
## "pin" 0.0002194482 0.001023070 0.003210704 0.005760192 0.001598190
## "sun 0.0165861839 -0.008711361 0.084697780 -0.027870351 -0.093577172
## [3] -0.0048420743 0.003390183 -0.023111097 0.031848555 0.073352351
## account 0.1374849817 1.513798044 1.435896498 2.515438923 1.589442443
## also 0.2492872624 4.841244907 2.801578426 4.748008711 2.643577889
## ancestor's 0.0005266150 0.006747375 0.004202317 -0.007799898 0.001663443
## 41 42 43 44 45
## "pin" 0.0005379381 0.0005491163 -0.001662921 -0.001853919 0.02960747
## "sun -0.0363580012 -0.0469073604 0.007699856 0.009289367 9.75669157
## [3] 0.0025475586 -0.0290104562 -0.027363174 -0.018665479 0.01627075
## account 1.5674738453 1.8245766747 3.605752710 2.944127505 1.64819885
## also 2.4137454075 2.8385651453 4.872549642 4.715060267 5.90333341
## ancestor's 0.0002827124 -0.0008135787 -0.002690717 -0.009079588 0.19917539
## 46 47 48 49 50
## "pin" -0.001334280 8.075922e-05 0.0001282604 0.0009129838 0.001290475
## "sun 0.006962771 9.319710e-04 -0.0286981417 -0.0074806483 -0.054184502
## [3] -0.002581494 -6.479108e-04 -0.0138136400 0.0396564830 0.064041944
## account 3.804566907 5.921052e-02 0.4543683144 0.7337131453 1.191085534
## also 5.372788468 9.675359e-02 0.7053408054 1.3102243625 2.148288137
## ancestor's 0.001452673 1.447954e-04 -0.0005770523 0.0016369817 0.001227390
## 51 52 53 54 55
## "pin" 0.0004455256 0.0008861065 -0.0008168425 0.0002719441 -0.001196629
## "sun 0.0039350433 -0.0196141820 -0.0128123134 -0.0444270538 0.010266100
## [3] 0.0275853944 0.0397013864 0.0127699113 0.0340886876 -0.019504729
## account 0.5415299253 0.7880009525 2.8357072051 1.1609228558 3.438188248
## also 0.9669050769 1.4010778406 4.2573780414 1.8656766070 5.182498719
## ancestor's 0.0009727805 0.0009113263 -0.0129178807 -0.0012987890 -0.022219145
## 56 57 58 59 60
## "pin" 5.409204e-05 0.0002488809 -0.001314818 0.0004502019 0.002550524
## "sun 8.521822e-05 -0.0080453403 0.022740133 -0.0109915655 -0.051147698
## [3] 6.673229e-03 0.0132978699 -0.112925993 -0.0172353406 -0.006525284
## account 9.186518e-04 3.0541753183 1.804714775 0.6852425467 1.961784464
## also 1.005032e-02 4.0650166316 3.915033749 1.3254050634 4.509779459
## ancestor's 2.891667e-04 -0.0122578842 -0.003796288 0.0005536246 0.004257265
## 61 62 63 64 65
## "pin" 0.001655970 0.0003221677 0.001579031 0.001822736 0.0007735275
## "sun 0.047981167 -0.0053906187 0.035963164 0.054121019 -0.0032364184
## [3] 0.026330979 0.0106720035 0.044643366 0.023730320 0.0150773719
## account 1.313377267 0.2746947389 1.487363023 1.802509468 5.2971919459
## also 2.465420462 0.4843529549 2.483116172 3.153166713 3.8168701472
## ancestor's 0.002180295 0.0004246915 0.001485692 0.001739249 0.0220389950
## 66 67 68
## "pin" 0.0001950028 0.0000274320 0.001886645
## "sun 0.0294229195 0.0033433319 0.405044687
## [3] 0.0175737209 0.0023688321 0.105312826
## account 0.0088077685 0.0038210894 0.115331500
## also 0.0471391455 0.0100268194 0.420590665
## ancestor's 0.0012683620 0.0001516144 0.012311200
Explore the vector space:
- Include at least one graphic of the vector space that interests you.
- Include at least 2 sets of statistics for your model: coherence, cosine, neighbors, etc.
##r chunk
plot_neighbors("account", #single word
n = 10, #number of neighbors
tvectors = book_lsa, #matrix space
method = "MDS", #PCA or MDS
dims = 2) #number of dimensions
## x y
## account -0.015042352 -0.001794037
## important -0.002687394 -0.023056567
## point, -0.070919386 -0.007059265
## point -0.011861273 -0.032256363
## necessary 0.019390117 0.026983903
## greater 0.035196161 0.005023598
## since -0.022085206 0.041500344
## army, 0.036178512 -0.021870132
## still 0.009673532 0.019388944
## either 0.022157288 -0.006860424
##r chunk
choose.target("important", #choose word
lower = .3, #lower cosine
upper = .4, #upper cosine
n = 10, #number of related words to get
tvectors = book_lsa)
## brundusium tier. object: repaired prisoners social kiew,
## 0.3404906 0.3043622 0.3424747 0.3577153 0.3916178 0.3353763 0.3404906
## _snekars_, considers peril
## 0.3404906 0.3010743 0.3292257
#use a multiselect for lists of words
list1 = c("account", "important", "necessary", "greater")
#plot all the words selected
plot_wordlist(list1, #put in the list above
method = "MDS",
dims = 2, #pick the number of dimensions
tvectors = book_lsa)
## x y
## account -0.01086445 0.002795364
## important -0.02601601 0.002486168
## necessary 0.01373833 -0.023105131
## greater 0.02314213 0.017823599
multicos(list1,tvectors = book_lsa)
## account important necessary greater
## account 1.0000000 0.9808748 0.9634353 0.9618649
## important 0.9808748 1.0000000 0.9525630 0.9484065
## necessary 0.9634353 0.9525630 1.0000000 0.9579993
## greater 0.9618649 0.9484065 0.9579993 1.0000000
Transfer the by_chapter data.frame to Python and convert it to a list for processing.
##python chunk
bychapter = list(r.by_chapter['text'])
bychapter[0]
## 'tell us in this chapter. But he proceeds to give a biography of his descendant, Sun Pin, born about a hundred years after his famous ancestor\'s death, and also the outstanding military genius of his time. The historian speaks of him too as Sun Tzu, and in his preface we read: "Sun Tzu had his feet cut off and yet continued to discuss the art of war." [3] It seems likely, then, that "Pin" was a nickname bestowed on him after his mutilation, unless the story was invented in order to account for the name. The crowning incident of his career, the crushing defeat of his treacherous rival P`ang Chuan, will be found briefly related in'
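As a quick optional check, the length of the transferred list should match the 68 chapter documents shown in the by_chapter tibble above.
##python chunk
# Optional check: the list should contain the same 68 chapter documents
# that appeared in the by_chapter tibble.
len(bychapter)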
Process the text using Python.
##python chunk
import nltk
nltk.download('punkt')
## True
##
## [nltk_data] Downloading package punkt to C:\Users\JIANWEI
## [nltk_data] LI\AppData\Roaming\nltk_data...
## [nltk_data] Package punkt is already up-to-date!
nltk.download('stopwords')
## True
##
## [nltk_data] Downloading package stopwords to C:\Users\JIANWEI
## [nltk_data] LI\AppData\Roaming\nltk_data...
## [nltk_data] Package stopwords is already up-to-date!
##r chunk
library(stopwords)
##
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
##
## stopwords
stopwords = stopwords::stopwords()
##python chunk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
processed_text = []
##loop through each item in the list
for answer in bychapter:
    #lower case
    answer = answer.lower()
    #create tokens
    answer = word_tokenize(answer)
    #take out stop words
    answer = [word for word in answer if word not in r.stopwords]
    #stem the words
    answer = [ps.stem(word=word) for word in answer]
    #add it to our list
    processed_text.append(answer)
processed_text[0]
## ['tell', 'us', 'chapter', '.', 'proce', 'give', 'biographi', 'descend', ',', 'sun', 'pin', ',', 'born', 'hundr', 'year', 'famou', 'ancestor', "'s", 'death', ',', 'also', 'outstand', 'militari', 'geniu', 'time', '.', 'historian', 'speak', 'sun', 'tzu', ',', 'prefac', 'read', ':', '``', 'sun', 'tzu', 'feet', 'cut', 'yet', 'continu', 'discuss', 'art', 'war', '.', "''", '[', '3', ']', 'seem', 'like', ',', ',', '``', 'pin', "''", 'nicknam', 'bestow', 'mutil', ',', 'unless', 'stori', 'invent', 'order', 'account', 'name', '.', 'crown', 'incid', 'career', ',', 'crush', 'defeat', 'treacher', 'rival', 'p', '`', 'ang', 'chuan', ',', 'found', 'briefli', 'relat']
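The processed tokens still include punctuation such as ',' and '``', which shows up heavily in the topic loadings further down. One optional extra step, not part of the original pipeline, would be to keep only alphabetic tokens; the cleaned_text name below is just an illustration.
##python chunk
# Optional extra cleaning step (an assumption, not in the original pipeline):
# keep only alphabetic tokens so punctuation does not dominate the topics.
cleaned_text = [[word for word in doc if word.isalpha()]
                for doc in processed_text]
cleaned_text[0][:10]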
Create the dictionary and term document matrix in Python.
##python chunk
import gensim
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt
dictionary = corpora.Dictionary(processed_text)
#create a TDM
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_text]
#doc_term_matrix
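A quick way to sanity-check the dictionary and term document matrix (an optional sketch) is to look at the vocabulary size and map the first few (id, count) pairs of the first chapter document back to their tokens.
##python chunk
# Optional sanity check: vocabulary size and the first few
# (token, count) pairs for the first chapter document.
print(len(dictionary))
print([(dictionary[term_id], count) for term_id, count in doc_term_matrix[0][:5]])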
Find the most likely number of dimensions using the coherence functions from the lecture.
##python chunk
# #figure out the coherence scores
# def compute_coherence_values(dictionary, doc_term_matrix, clean_text, start = 2, stop = 10, step = 2):
#     coherence_values = []
#     model_list = []
#     for num_topics in range(start, stop, step):
#         #generate LSA model
#         model = LsiModel(doc_term_matrix, num_topics=num_topics, id2word = dictionary) # train model
#         model_list.append(model)
#         coherencemodel = CoherenceModel(model=model, texts=clean_text, dictionary=dictionary, coherence='c_v')
#         coherence_values.append(coherencemodel.get_coherence())
#     return model_list, coherence_values
#
# def plot_graph(dictionary, doc_term_matrix, clean_text, start, stop, step):
#     model_list, coherence_values = compute_coherence_values(dictionary, doc_term_matrix, clean_text, start, stop, step)
#     # Show graph
#     x = range(start, stop, step)
#     plt.plot(x, coherence_values)
#     plt.xlabel("Number of Topics")
#     plt.ylabel("Coherence score")
#     plt.legend(("coherence_values"), loc='best')
#     plt.show()
#
# start,stop,step=2,7,1
# plot_graph(dictionary, doc_term_matrix, processed_text, start, stop, step)
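Since the coherence search above is commented out, here is a minimal sketch of the same idea without the plot, assuming the LsiModel and CoherenceModel imports from the dictionary chunk above; it prints the c_v coherence for a few candidate dimension counts so the best value can be read off directly.
##python chunk
# Minimal coherence sketch (assumes LsiModel and CoherenceModel are already
# imported above); prints c_v coherence for 2 through 6 dimensions.
for num_topics in range(2, 7):
    model = LsiModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary)
    cm = CoherenceModel(model=model, texts=processed_text,
                        dictionary=dictionary, coherence='c_v')
    print(num_topics, round(cm.get_coherence(), 4))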
Create the LSA model in Python with the optimal number of dimensions from the previous step.
##python chunk
topics = 5
words = 10
lsamodel = LsiModel(doc_term_matrix,num_topics=topics,id2word=dictionary)
print(lsamodel.print_topics(num_topics = topics,num_words = words))
## [(0, '0.964*"," + 0.204*"." + 0.043*"armi" + 0.041*"line" + 0.037*";" + 0.037*"--" + 0.034*"thousand" + 0.030*"battl" + 0.030*"oper" + 0.030*"war"'), (1, '0.793*"." + -0.227*"," + 0.225*"--" + 0.138*";" + 0.134*"`" + 0.129*"[" + 0.128*"]" + 0.123*"\'\'" + 0.122*"``" + 0.119*"armi"'), (2, '-0.766*"--" + -0.245*"divis" + -0.225*"-" + 0.173*"`" + 0.143*"``" + 0.140*"\'\'" + -0.138*"corp" + -0.132*"line" + -0.108*"|" + -0.107*"two"'), (3, '-0.302*"armi" + 0.296*"--" + 0.275*"`" + -0.241*"line" + -0.223*"upon" + -0.210*"oper" + 0.204*"\'\'" + 0.200*"``" + -0.175*"may" + -0.175*";"'), (4, '-0.287*"`" + 0.272*"\'\'" + 0.271*"``" + -0.269*"sun" + -0.252*"wu" + -0.226*"war" + 0.211*"enemi" + -0.182*"ch" + 0.167*"ground" + 0.156*":"')]
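To see how the individual chapters load on these dimensions, the fitted model can be applied back to the document-term matrix. This is an optional sketch rather than part of the original output.
##python chunk
# Optional sketch: project each chapter into the LSI space; each entry is a
# list of (dimension, loading) pairs, shown here for the first three chapters.
corpus_lsi = lsamodel[doc_term_matrix]
for i, doc in enumerate(corpus_lsi):
    if i < 3:
        print(i, [(dim, round(weight, 3)) for dim, weight in doc])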
Interpret your space - can you see the differences between books/novels? Explain the results from your analysis (at least a 1 paragraph-length explanation).
Thinking of your current job or prospective career, propose a research project to address a problem or question in your industry that would use latent semantic analysis. Spend time thinking about this and write roughly a paragraph describing the problem/question and how you would address it with text data and latent semantic analysis.