Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
##r chunk
# install.packages("reticulate")
# install.packages("tm")
# install.packages("lsa")
# install.packages("LSAfun")
# setwd("C:\\Users\\JIANWEI LI\\Desktop\\anly540_temp\\rstudio_python\\anly540")
library(gutenbergr)
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(reticulate)
library(tm) # text mining
## Loading required package: NLP
library(lsa)
## Loading required package: SnowballC
library(LSAfun)
## Loading required package: rgl
## Warning: package 'rgl' was built under R version 4.0.4
# py_run_string("import os as os")
# py_run_string("os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:\ProgramData\Anaconda3\Library\platforms'")
# reticulate::py_config()
# py_discover_config('python')
use_virtualenv("python")
# py_install("matplotlib")
# py_install("pandas")
#
# #py_install("nltk")
# py_install("numpy")
# py_install("gensim")
Load the Python libraries or functions that you will use for this section.
##python chunk
import matplotlib
matplotlib.use('pdf')
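For reference, the Python libraries used in the later sections could also be loaded here in one place. This is only an optional sketch and assumes the packages were already installed with the commented py_install() calls above; each later chunk still imports what it needs.
##python chunk
# Optional consolidated imports for the Python sections below (assumes the
# packages were installed once with py_install() in the setup chunk).
import nltk                      # tokenizing, stemming, stopword lists
import gensim                    # dictionary, document-term matrix, LSI model
import matplotlib.pyplot as plt  # coherence plot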
You will want to use some books from Project Gutenberg to perform a Latent Semantic Analysis. The code to pick the books has been provided for you, so all you would need to do is change out the titles. Pick 2 titles from the list below. You can also try other titles not on the list, but they may not work. Check out other book titles at https://www.gutenberg.org/.
##r chunk
##pick 2 titles from project gutenberg, put in quotes and separate with commas
## DO NOT use the titles used in class, using those will be a 10 point deduction
titles = c("The Iliad","The Art of War")
##read in those books
books = gutenberg_works(title %in% titles) %>%
gutenberg_download(meta_fields = "title", mirror = "http://mirrors.xmission.com/gutenberg/") %>%
mutate(document = row_number())
create_chapters = books %>%
group_by(title) %>%
mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, title, chapter)
by_chapter = create_chapters %>%
group_by(document) %>%
summarise(text=paste(text,collapse=' '))
by_chapter
## # A tibble: 68 x 2
## document text
## * <chr> <chr>
## 1 The Art of War~ "tell us in this chapter. But he proceeds to give a biograp~
## 2 The Art of War~ "no. 11 [in Chapter VIII] is one that the people of this cou~
## 3 The Art of War~ "title of this chapter, says it refers to the deliberations ~
## 4 The Art of War~ "the subject of the chapter is not what we might expect from~
## 5 The Art of War~ "chapter is intended to enforce.\"] 20. Thus it may b~
## 6 The Art of War~ "the title of this chapter: \"marching and countermarching ~
## 7 The Art of War~ "here differently than anywhere else in this chapter. Thus ~
## 8 The Art of War~ " [The chief lesson of this chapter, in Tu Mu's opinion,~
## 9 The Art of War~ "[1] \"Forty-one Years in India,\" chapter 46. -----------~
## 10 The Art of War~ "follows: \"Chapter IV, on Tactical Dispositions, treated ~
## # ... with 58 more rows
The by_chapter data.frame can be used to create a corpus with VectorSource(), using the text column.
Use tm_map to clean up the text.
##r chunk
import_corpus = Corpus(VectorSource(by_chapter$text))
import_corpus = tm_map(import_corpus,tolower)
## Warning in tm_map.SimpleCorpus(import_corpus, tolower): transformation drops
## documents
import_corpus = tm_map(import_corpus,function(x) removeWords(x,stopwords("english")))
## Warning in tm_map.SimpleCorpus(import_corpus, function(x) removeWords(x, :
## transformation drops documents
Create a latent semantic analysis model in R.
##r chunk
book_matrix = as.matrix(TermDocumentMatrix(import_corpus))
head(book_matrix)
## Docs
## Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## "pin" 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## "sun 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0
## account 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 1 1 0
## also 1 0 1 0 3 3 0 0 0 0 0 0 0 0 0 1 0 0 3 8 0 8 0 0 0
## ancestor's 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Docs
## Terms 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
## "pin" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## "sun 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0
## [3] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## account 1 1 0 0 0 0 0 0 2 0 0 1 1 2 1 0 3 4 3 1 5 0
## also 0 4 0 0 0 0 0 0 0 0 0 10 5 8 6 8 4 7 9 20 12 0
## ancestor's 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Docs
## Terms 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## "pin" 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## "sun 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## account 0 0 1 1 3 2 1 4 0 3 1 0 1 2 0 0 5 11 0 0 0
## also 1 1 7 2 1 5 0 12 0 7 2 4 10 7 0 5 6 6 0 0 1
## ancestor's 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
# Weight the semantic space
book_weight = lw_logtf(book_matrix)*gw_idf(book_matrix)
# Run the SVD
book_lsa = lsa(book_weight)
book_lsa = as.textmatrix(book_lsa)
head(book_lsa)
## 1 2 3 4 5
## "pin" 0.000362519 0.001100619 0.003657066 0.003330501 0.005589526
## "sun 0.059358064 0.169665585 0.087874685 0.113147848 0.125236348
## [3] 0.010169170 0.045425716 0.341074947 0.208505310 0.385000364
## account 0.028130781 0.050921321 0.391027523 0.404730858 0.660087241
## also 0.086213343 0.197360301 1.100959691 1.112179822 1.775727942
## ancestor's 0.001931820 0.006423920 0.015251229 0.014601918 0.022906060
## 6 7 8 9 10
## "pin" 0.01770399 0.001701275 2.841637e-05 0.0000652877 2.626797e-05
## "sun -0.16493682 0.064757497 5.099569e-03 0.0052240700 -1.157932e-03
## [3] 0.50300439 0.165622144 7.435279e-03 0.0097521647 2.366217e-03
## account 0.82833167 0.191701030 1.135399e-02 0.0062857930 1.687949e-03
## also 2.83382135 0.564443250 2.420527e-02 0.0206431194 7.458710e-03
## ancestor's 0.05342274 0.008334917 1.572315e-04 0.0003895894 1.421761e-04
## 11 12 13 14 15
## "pin" 0.0001795101 0.0006319974 3.397216e-05 0.003422985 0.0001193069
## "sun 0.0035909754 0.1036982195 8.952982e-04 0.111347712 -0.0016700047
## [3] 0.0020454221 0.0352504458 1.724805e-03 0.319878419 0.0024369814
## account 0.0402490895 0.0538788771 3.647082e-03 0.533547983 0.0129629277
## also 0.0761262162 0.1421657032 8.754280e-03 1.362868852 0.0329910350
## ancestor's 0.0005364399 0.0033356075 1.102284e-04 0.015496790 0.0003978972
## 16 17 18 19 20
## "pin" 0.006355441 0.004502818 0.0001694334 0.005153906 0.007953511
## "sun 0.111836832 0.194424904 0.0036754931 0.122255830 -0.076986786
## [3] 0.468138903 0.455243531 0.0128760653 0.436966050 0.248193669
## account 0.650320964 0.412434271 0.0158566551 0.574557937 0.198360610
## also 1.858904495 1.317161876 0.0464641292 1.653930844 3.799311964
## ancestor's 0.027824029 0.022094401 0.0007081733 0.022951268 0.079024090
## 21 22 23 24 25
## "pin" 9.447035e-05 0.002413102 0.0004732804 0.0002178576 0.0000838108
## "sun 1.082227e-02 -0.045735324 0.0967535597 0.0261838408 0.0108783301
## [3] 7.164258e-03 7.672142511 0.0169536035 0.0299832460 0.0084187677
## account 5.295553e-03 1.627869849 0.0535075183 0.0321797082 0.0087082347
## also 2.082949e-02 4.131278363 0.1527319210 0.0805037567 0.0259002795
## ancestor's 5.265962e-04 0.045788954 0.0026890500 0.0011444807 0.0004645736
## 26 27 28 29 30
## "pin" 0.007909615 0.01449791 0.0001193735 -1.115703e-05 7.000588e-05
## "sun 0.085158622 -0.03882618 0.0159822609 1.169803e-03 3.101357e-03
## [3] 0.535187511 0.11408452 -0.0028237390 -2.381736e-03 -4.463356e-03
## account 0.689336617 1.65218997 0.0390171597 1.762709e-02 2.799609e-02
## also 2.123690186 2.98778329 0.0690884256 4.840837e-02 5.172588e-02
## ancestor's 0.034248984 0.05465840 0.0004424400 3.310560e-05 6.370726e-05
## 31 32 33 34 35
## "pin" -4.448301e-05 -9.126356e-06 -8.030301e-06 0.002574986 3.897163e-05
## "sun -6.546986e-03 1.092204e-03 1.800514e-03 0.479282994 5.375807e-03
## [3] -1.215181e-02 2.478449e-04 -4.867491e-03 0.141973746 2.835861e-03
## account 9.308946e-02 3.343193e-02 2.242591e-02 0.217357858 1.580542e-02
## also 1.194256e-01 4.709280e-02 2.974250e-02 0.640958264 2.544843e-02
## ancestor's -3.771911e-04 -8.678693e-05 -8.402430e-05 0.014676240 1.437834e-04
## 36 37 38 39 40
## "pin" 0.0002194482 0.001023070 0.003210704 0.005760192 0.001598190
## "sun 0.0165861839 -0.008711361 0.084697780 -0.027870351 -0.093577172
## [3] -0.0048420743 0.003390183 -0.023111097 0.031848555 0.073352351
## account 0.1374849817 1.513798044 1.435896498 2.515438923 1.589442443
## also 0.2492872624 4.841244907 2.801578426 4.748008711 2.643577889
## ancestor's 0.0005266150 0.006747375 0.004202317 -0.007799898 0.001663443
## 41 42 43 44 45
## "pin" 0.0005379381 0.0005491163 -0.001662921 -0.001853919 0.02960747
## "sun -0.0363580012 -0.0469073604 0.007699856 0.009289367 9.75669157
## [3] 0.0025475586 -0.0290104562 -0.027363174 -0.018665479 0.01627075
## account 1.5674738453 1.8245766747 3.605752710 2.944127505 1.64819885
## also 2.4137454075 2.8385651453 4.872549642 4.715060267 5.90333341
## ancestor's 0.0002827124 -0.0008135787 -0.002690717 -0.009079588 0.19917539
## 46 47 48 49 50
## "pin" -0.001334280 8.075922e-05 0.0001282604 0.0009129838 0.001290475
## "sun 0.006962771 9.319710e-04 -0.0286981417 -0.0074806483 -0.054184502
## [3] -0.002581494 -6.479108e-04 -0.0138136400 0.0396564830 0.064041944
## account 3.804566907 5.921052e-02 0.4543683144 0.7337131453 1.191085534
## also 5.372788468 9.675359e-02 0.7053408054 1.3102243625 2.148288137
## ancestor's 0.001452673 1.447954e-04 -0.0005770523 0.0016369817 0.001227390
## 51 52 53 54 55
## "pin" 0.0004455256 0.0008861065 -0.0008168425 0.0002719441 -0.001196629
## "sun 0.0039350433 -0.0196141820 -0.0128123134 -0.0444270538 0.010266100
## [3] 0.0275853944 0.0397013864 0.0127699113 0.0340886876 -0.019504729
## account 0.5415299253 0.7880009525 2.8357072051 1.1609228558 3.438188248
## also 0.9669050769 1.4010778406 4.2573780414 1.8656766070 5.182498719
## ancestor's 0.0009727805 0.0009113263 -0.0129178807 -0.0012987890 -0.022219145
## 56 57 58 59 60
## "pin" 5.409204e-05 0.0002488809 -0.001314818 0.0004502019 0.002550524
## "sun 8.521822e-05 -0.0080453403 0.022740133 -0.0109915655 -0.051147698
## [3] 6.673229e-03 0.0132978699 -0.112925993 -0.0172353406 -0.006525284
## account 9.186518e-04 3.0541753183 1.804714775 0.6852425467 1.961784464
## also 1.005032e-02 4.0650166316 3.915033749 1.3254050634 4.509779459
## ancestor's 2.891667e-04 -0.0122578842 -0.003796288 0.0005536246 0.004257265
## 61 62 63 64 65
## "pin" 0.001655970 0.0003221677 0.001579031 0.001822736 0.0007735275
## "sun 0.047981167 -0.0053906187 0.035963164 0.054121019 -0.0032364184
## [3] 0.026330979 0.0106720035 0.044643366 0.023730320 0.0150773719
## account 1.313377267 0.2746947389 1.487363023 1.802509468 5.2971919459
## also 2.465420462 0.4843529549 2.483116172 3.153166713 3.8168701472
## ancestor's 0.002180295 0.0004246915 0.001485692 0.001739249 0.0220389950
## 66 67 68
## "pin" 0.0001950028 0.0000274320 0.001886645
## "sun 0.0294229195 0.0033433319 0.405044687
## [3] 0.0175737209 0.0023688321 0.105312826
## account 0.0088077685 0.0038210894 0.115331500
## also 0.0471391455 0.0100268194 0.420590665
## ancestor's 0.0012683620 0.0001516144 0.012311200
Explore the vector space:
- Include at least one graphic of the vector space that interests you.
- Include at least 2 sets of statistics for your model: coherence, cosine, neighbors, etc.
##r chunk
plot_neighbors("account", #single word
n = 10, #number of neighbors
tvectors = book_lsa, #matrix space
method = "MDS", #PCA or MDS
dims = 2) #number of dimensions
## x y
## account -0.015042352 -0.001794037
## important -0.002687394 -0.023056567
## point, -0.070919386 -0.007059265
## point -0.011861273 -0.032256363
## necessary 0.019390117 0.026983903
## greater 0.035196161 0.005023598
## since -0.022085206 0.041500344
## army, 0.036178512 -0.021870132
## still 0.009673532 0.019388944
## either 0.022157288 -0.006860424
##r chunk
choose.target("important", #choose word
lower = .3, #lower cosine
upper = .4, #upper cosine
n = 10, #number of related words to get
tvectors = book_lsa)
## brundusium tier. object: repaired prisoners social kiew,
## 0.3404906 0.3043622 0.3424747 0.3577153 0.3916178 0.3353763 0.3404906
## _snekars_, considers peril
## 0.3404906 0.3010743 0.3292257
#use a multiselect for lists of words
list1 = c("account", "important", "necessary", "greater")
#plot all the words selected
plot_wordlist(list1, #put in the list above
method = "MDS",
dims = 2, #pick the number of dimensions
tvectors = book_lsa)
## x y
## account -0.01086445 0.002795364
## important -0.02601601 0.002486168
## necessary 0.01373833 -0.023105131
## greater 0.02314213 0.017823599
multicos(list1,tvectors = book_lsa)
## account important necessary greater
## account 1.0000000 0.9808748 0.9634353 0.9618649
## important 0.9808748 1.0000000 0.9525630 0.9484065
## necessary 0.9634353 0.9525630 1.0000000 0.9579993
## greater 0.9618649 0.9484065 0.9579993 1.0000000
Transfer the by_chapter data.frame to Python and convert it to a list for processing.
##python chunk
bychapter = list(r.by_chapter['text'])
bychapter[0]
## 'tell us in this chapter. But he proceeds to give a biography of his descendant, Sun Pin, born about a hundred years after his famous ancestor\'s death, and also the outstanding military genius of his time. The historian speaks of him too as Sun Tzu, and in his preface we read: "Sun Tzu had his feet cut off and yet continued to discuss the art of war." [3] It seems likely, then, that "Pin" was a nickname bestowed on him after his mutilation, unless the story was invented in order to account for the name. The crowning incident of his career, the crushing defeat of his treacherous rival P`ang Chuan, will be found briefly related in'
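As a quick optional check, the length of the transferred list should match the 68 chapter documents shown in the by_chapter tibble above.
##python chunk
# Optional check: the list should contain the same 68 chapter documents
# that appeared in the by_chapter tibble.
len(bychapter)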
Process the text using Python.
##python chunk
import nltk
nltk.download('punkt')
## True
##
## [nltk_data] Downloading package punkt to C:\Users\JIANWEI
## [nltk_data] LI\AppData\Roaming\nltk_data...
## [nltk_data] Package punkt is already up-to-date!
nltk.download('stopwords')
## True
##
## [nltk_data] Downloading package stopwords to C:\Users\JIANWEI
## [nltk_data] LI\AppData\Roaming\nltk_data...
## [nltk_data] Package stopwords is already up-to-date!
##r chunk
library(stopwords)
##
## Attaching package: 'stopwords'
## The following object is masked from 'package:tm':
##
## stopwords
stopwords = stopwords::stopwords()
##python chunk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
processed_text = []
##loop through each item in the list
for answer in bychapter:
    #lower case
    answer = answer.lower()
    #create tokens
    answer = word_tokenize(answer)
    #take out stop words
    answer = [word for word in answer if word not in r.stopwords]
    #stem the words
    answer = [ps.stem(word=word) for word in answer]
    #add it to our list
    processed_text.append(answer)
processed_text[0]
## ['tell', 'us', 'chapter', '.', 'proce', 'give', 'biographi', 'descend', ',', 'sun', 'pin', ',', 'born', 'hundr', 'year', 'famou', 'ancestor', "'s", 'death', ',', 'also', 'outstand', 'militari', 'geniu', 'time', '.', 'historian', 'speak', 'sun', 'tzu', ',', 'prefac', 'read', ':', '``', 'sun', 'tzu', 'feet', 'cut', 'yet', 'continu', 'discuss', 'art', 'war', '.', "''", '[', '3', ']', 'seem', 'like', ',', ',', '``', 'pin', "''", 'nicknam', 'bestow', 'mutil', ',', 'unless', 'stori', 'invent', 'order', 'account', 'name', '.', 'crown', 'incid', 'career', ',', 'crush', 'defeat', 'treacher', 'rival', 'p', '`', 'ang', 'chuan', ',', 'found', 'briefli', 'relat']
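The processed tokens still include punctuation such as ',' and '``', which shows up heavily in the topic loadings further down. One optional extra step, not part of the original pipeline, would be to keep only alphabetic tokens; the cleaned_text name below is just an illustration.
##python chunk
# Optional extra cleaning step (an assumption, not in the original pipeline):
# keep only alphabetic tokens so punctuation does not dominate the topics.
cleaned_text = [[word for word in doc if word.isalpha()]
                for doc in processed_text]
cleaned_text[0][:10]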
Create the dictionary and term document matrix in Python.
##python chunk
import gensim
from gensim import corpora
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt
dictionary = corpora.Dictionary(processed_text)
#create a TDM
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_text]
#doc_term_matrix
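A quick way to sanity-check the dictionary and term document matrix (an optional sketch) is to look at the vocabulary size and map the first few (id, count) pairs of the first chapter document back to their tokens.
##python chunk
# Optional sanity check: vocabulary size and the first few
# (token, count) pairs for the first chapter document.
print(len(dictionary))
print([(dictionary[term_id], count) for term_id, count in doc_term_matrix[0][:5]])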
Find the most likely number of dimensions using the coherence functions from the lecture.
##python chunk
# #figure out the coherence scores
# def compute_coherence_values(dictionary, doc_term_matrix, clean_text, start = 2, stop = 10, step = 2):
#     coherence_values = []
#     model_list = []
#     for num_topics in range(start, stop, step):
#         #generate LSA model
#         model = LsiModel(doc_term_matrix, num_topics=num_topics, id2word = dictionary) # train model
#         model_list.append(model)
#         coherencemodel = CoherenceModel(model=model, texts=clean_text, dictionary=dictionary, coherence='c_v')
#         coherence_values.append(coherencemodel.get_coherence())
#     return model_list, coherence_values
#
# def plot_graph(dictionary, doc_term_matrix, clean_text, start, stop, step):
#     model_list, coherence_values = compute_coherence_values(dictionary, doc_term_matrix, clean_text, start, stop, step)
#     # Show graph
#     x = range(start, stop, step)
#     plt.plot(x, coherence_values)
#     plt.xlabel("Number of Topics")
#     plt.ylabel("Coherence score")
#     plt.legend(("coherence_values"), loc='best')
#     plt.show()
#
# start,stop,step=2,7,1
# plot_graph(dictionary, doc_term_matrix, processed_text, start, stop, step)
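Since the coherence search above is commented out, here is a minimal sketch of the same idea without the plot, assuming the LsiModel and CoherenceModel imports from the dictionary chunk above; it prints the c_v coherence for a few candidate dimension counts so the best value can be read off directly.
##python chunk
# Minimal coherence sketch (assumes LsiModel and CoherenceModel are already
# imported above); prints c_v coherence for 2 through 6 dimensions.
for num_topics in range(2, 7):
    model = LsiModel(doc_term_matrix, num_topics=num_topics, id2word=dictionary)
    cm = CoherenceModel(model=model, texts=processed_text,
                        dictionary=dictionary, coherence='c_v')
    print(num_topics, round(cm.get_coherence(), 4))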
Create the LSA model in Python with the optimal number of dimensions from the previous step.
##python chunk
topics = 5
words = 10
lsamodel = LsiModel(doc_term_matrix,num_topics=topics,id2word=dictionary)
print(lsamodel.print_topics(num_topics = topics,num_words = words))
## [(0, '0.964*"," + 0.204*"." + 0.043*"armi" + 0.041*"line" + 0.037*";" + 0.037*"--" + 0.034*"thousand" + 0.030*"battl" + 0.030*"oper" + 0.030*"war"'), (1, '0.793*"." + -0.227*"," + 0.225*"--" + 0.138*";" + 0.134*"`" + 0.129*"[" + 0.128*"]" + 0.123*"\'\'" + 0.122*"``" + 0.119*"armi"'), (2, '-0.766*"--" + -0.245*"divis" + -0.225*"-" + 0.173*"`" + 0.143*"``" + 0.140*"\'\'" + -0.138*"corp" + -0.132*"line" + -0.108*"|" + -0.107*"two"'), (3, '-0.302*"armi" + 0.296*"--" + 0.275*"`" + -0.241*"line" + -0.223*"upon" + -0.210*"oper" + 0.204*"\'\'" + 0.200*"``" + -0.175*"may" + -0.175*";"'), (4, '-0.287*"`" + 0.272*"\'\'" + 0.271*"``" + -0.269*"sun" + -0.252*"wu" + -0.226*"war" + 0.211*"enemi" + -0.182*"ch" + 0.167*"ground" + 0.156*":"')]
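To see how the individual chapters load on these dimensions, the fitted model can be applied back to the document-term matrix. This is an optional sketch rather than part of the original output.
##python chunk
# Optional sketch: project each chapter into the LSI space; each entry is a
# list of (dimension, loading) pairs, shown here for the first three chapters.
corpus_lsi = lsamodel[doc_term_matrix]
for i, doc in enumerate(corpus_lsi):
    if i < 3:
        print(i, [(dim, round(weight, 3)) for dim, weight in doc])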
Interpret your space - can you see the differences between books/novels? Explain the results from your analysis (at least a 1 paragraph-length explanation).
Thinking of your current job or prospective career, propose a research project to address a problem or question in your industry that would use latent semantic analysis. Spend time thinking about this and write roughly a paragraph describing the problem/question and how you would address it with text data and latent semantic analysis.