To implement Empirical Topic Modeling from First Principles, we have chosen the following three subjects to analyse:

  1. Coffee
  2. Cloth
  3. Palaeontology

Using the Python code below, we extracted the text of the top 70 websites and their respective URLs for each subject:

# Let's import the packages (this script targets Python 2, since urllib2 is used below)
import os
import sys
import pip
package_name = 'django'
pip.main(['install', package_name])

reload(sys)
sys.setdefaultencoding('utf8')   # avoid UnicodeDecodeError when handling scraped text


from google import search
import pandas as pd
from bs4 import BeautifulSoup
import urllib2
import urllib
from urllib2 import Request,  urlopen
from urllib2 import HTTPError 
import re
import ssl

#%%
search_terms = ['coffee', 'palaeontology', 'cloth']

for sterm in search_terms:
    urllist = []
    for url in search(sterm, tld='co.in', lang='en', stop=70):
        url = re.sub("#.*$", "", url)                                 # drop URL fragments
        url = re.sub(r".*\.pdf$", "https://www.google.co.in/", url)   # replace pdf links
        urllist.append(url)

    # drop duplicate URLs
    urllist = list(set(urllist))

    text3 = []
    for url in urllist:
        req = Request(url, headers={'User-Agent': 'Mozilla/43.0.4'})
        # Some websites block automated requests, so we send a browser-like User-Agent string.
        # Note: your antivirus may flag the stream of requests; you may need to pause it briefly.
        try:
            page = urlopen(req).read()
            soup = BeautifulSoup(page, 'html.parser')
            [s.extract() for s in soup('script')]  # remove all scripts from the page source
            [s.extract() for s in soup('style')]   # remove all CSS styles from the page source
            temp3 = soup.get_text(' ')
            temp3 = re.sub('\n', ' ', temp3)
            temp3 = re.sub('\s+', ' ', temp3)
            text3.append(temp3)

        except HTTPError:
            print('HTTP error for ' + url)
            text3.append('NA')

    out = pd.DataFrame({'url': urllist,
                        'text': text3})

    out_text = pd.DataFrame({'text': out.text})

    out_xls = pd.DataFrame({'url': out.url})

    out_text.to_csv("C:\\Users\\Latitude Owner\\Desktop\\group_assign\\" + sterm + "_google_search.csv")

    out_xls.to_csv("C:\\Users\\Latitude Owner\\Desktop\\group_assign\\" + sterm + "_google_search_list.csv")

The website URLs and the extracted text are written into two separate CSV files for each subject. Using the above Python code we extracted up to 70 URLs and their text for each subject. The extracted data is then fed into the R code below to analyse it through Empirical Topic Modelling from First Principles.

Analysing the extracted data:

Let’s load the libraries and define the helper functions.

rm(list=ls())                   # Clear workspace

#--------------------------------------------------------#
# Step 0 - Assign Library & define functions             #
#--------------------------------------------------------#

library(text2vec)
library(data.table)
library(stringr)
library(tm)
library(RWeka)
library(tokenizers)
library(slam)
library(wordcloud)
library(igraph)
library(maptpx)

text.clean is a function defined to clean the data passed in through the argument x. It uses a regex to remove HTML tags, keeps only ASCII alphanumeric characters, converts the text to lower case, removes numbers, and strips extra, leading and trailing white space.

text.clean = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex for removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # Keep only ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # keep only alpha numeric 
  x  =  tolower(x)                          # convert to lower case characters
  x  =  removeNumbers(x)                    # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}
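
As a quick check, here is a minimal, hypothetical example of what text.clean does to a small snippet (the input string is made up purely for illustration):

# illustrative input only: HTML tags, punctuation and numbers are stripped,
# the text is lower-cased, and extra whitespace is collapsed
text.clean("<p>Hello,   World! 123</p>")
## [1] "hello world"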

The function distill.cog is defined to display the COGs (co-occurrence graphs). It takes four inputs: the TCM adjacency matrix (mat1), the title for the graph (title), the number of central nodes (s), and the maximum number of connections per node (k1). The Kamada-Kawai layout is used to draw the COGs.

distill.cog = function(mat1, # input TCM ADJ MAT
                       title, # title for the graph
                       s,    # no. of central nodes
                       k1){  # max no. of connections  
  
  a = colSums(mat1) # collect colsums into a vector obj a
  b = order(-a)     # nice syntax for ordering vector in decr order  
  
  mat2 = mat1[b,b]  #
  
  diag(mat2) =  0
  
  ## +++ go row by row and find top k adjacencies +++ ##
  
  wc = NULL
  
  for (i1 in 1:s){ 
    thresh1 = mat2[i1,][order(-mat2[i1, ])[k1]]
    mat2[i1, mat2[i1,] < thresh1] = 0   # zero out adjacencies below the k1-th largest in this row
    mat2[i1, mat2[i1,] > 0 ] = 1
    word = names(mat2[i1, mat2[i1,] > 0])
    mat2[(i1+1):nrow(mat2), match(word,colnames(mat2))] = 0
    wc = c(wc,word)
  } # i1 loop ends
  
  
  mat3 = mat2[match(wc, colnames(mat2)), match(wc, colnames(mat2))]
  ord = colnames(mat2)[which(!is.na(match(colnames(mat2), colnames(mat3))))]  # removed any NAs from the list
  mat4 = mat3[match(ord, colnames(mat3)), match(ord, colnames(mat3))]
  graph <- graph.adjacency(mat4, mode = "undirected", weighted=T)    # Create Network object
  graph = simplify(graph) 
  V(graph)$color[1:s] = "green"
  V(graph)$color[(s+1):length(V(graph))] = "pink"
  
  graph = delete.vertices(graph, V(graph)[ degree(graph) == 0 ])
  
  plot(graph, 
       layout = layout.kamada.kawai, 
      main = title)

  } # func ends

Let’s read in the extracted text to perform the analysis. The text for the three subjects is read from three separate files.

#--------------------------------------------------------#
# Step 1 - Reading text data                             #
#--------------------------------------------------------#

search_terms = c('coffee', 'cloth', 'palaeontology')

file.mi = read.csv(paste0("C:\\Users\\Latitude Owner\\Desktop\\group_assign\\palaeontology_google_search.csv"))
file.lin = read.csv(paste0("C:\\Users\\Latitude Owner\\Desktop\\group_assign\\cloth_google_search.csv"))
file.der = read.csv(paste0("C:\\Users\\Latitude Owner\\Desktop\\group_assign\\coffee_google_search.csv"))

Empty rows are dropped from each file, the minimum row count across the three files is found, and the data is consolidated row by row up to that minimum count.

file.mi = file.mi[!is.na(file.mi$text) & file.mi$text != '',]

file.lin = file.lin[!is.na(file.lin$text) & file.lin$text != '',]

file.der = file.der[!is.na(file.der$text) & file.der$text != '',]

n = min(nrow(file.mi),nrow(file.lin),nrow(file.der))

data = data.frame(id = 1:n, 
                      text1 = file.mi$text[1:n],
                      text2 = file.lin$text[1:n],
                      text3 = file.der$text[1:n],
                      stringsAsFactors = F)
data$text = paste(data$text1,data$text2,data$text3)

The consolidated text is written into the variable data. The dimensions of data are displayed below:

dim(data)
## [1] 59  5

Let’s read in the stopwords used to filter the data. The stopwords include common sentence-forming words such as pronouns, auxiliary verbs, conjunctions and other function words.

# Read stopwords list
stpw1 = readLines(file.choose()) # stopwords list (e.g. from github)
stpw2 = tm::stopwords('english') # tm package stopword list; the tokenizers package has a function of the same name
comn  = unique(c(stpw1, stpw2))  # union of the two lists
stopwords = unique(gsub("'"," ",comn))  # final stopword list after replacing apostrophes

Now let’s clean the text by invoking text.clean, remove the stopwords created above, and strip extra white space.

x  = text.clean(data$text)             # pre-process text corpus
x  =  removeWords(x,stopwords)         # removing stopwords created above
x  =  stripWhitespace(x)               # removing white space
# x  =  stemDocument(x)

Let’s create the Document Term Matrix (DTM) for the cleaned text using the text2vec package. The text is tokenised into terms (the bigram option is commented out in this run, so unigram tokens are used), and only terms occurring at least ten times are retained when building the vocabulary. The retained terms are then sorted in decreasing order of frequency.

#--------------------------------------------------------#
####### Create DTM using text2vec package                #
#--------------------------------------------------------#

t1 = Sys.time()

tok_fun = word_tokenizer

it_0 = itoken( x,
               #preprocessor = text.clean,
               tokenizer = tok_fun,
               ids = data$id,
               # progressbar = T
               )

vocab = create_vocabulary(it_0,
                          #ngram = c(2L, 2L)
                          #stopwords = stopwords
)

pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10)
# doc_proportion_max = 0.5,
# doc_proportion_min = 0.001)

vectorizer = vocab_vectorizer(pruned_vocab)

dtm_0  = create_dtm(it_0, vectorizer)

# Sort terms in decreasing order of frequency
tsum = as.matrix(t(rollup(dtm_0, 1, na.rm=TRUE, FUN = sum))) # find sum of freq for each term
tsum = tsum[order(tsum, decreasing = T),]       #terms in decreasing order of freq
head(tsum)
##           coffee            cloth    palaeontology      association 
##             2875             1435              801              543 
## palaeontological             view 
##              539              493
tail(tsum)
##  waiting windmill   vienna    satin  wowhead  serving 
##       10       10       10       10       10       10

From the terms sorted in decreasing order of frequency, the top 1000 are selected; any multi-word terms (joined by underscores) are recoded as single tokens in the clean text corpus. The re-tokenised corpus is then converted into a Document Term Matrix.

# select Top 1000 bigrams to unigram
if (length(tsum) > 1000) {n = 1000} else {n = length(tsum)}
tsum = tsum[1:n]

#-------------------------------------------------------
# Code bi-grams as unigram in clean text corpus

text2 = x
text2 = paste("",text2,"")

pb <- txtProgressBar(min = 1, max = (length(tsum)), style = 3) ; i = 0

for (term in names(tsum)){
  i = i + 1
  focal.term = gsub("_", " ",term)        # in case underscore was the word-separator
  replacement.term = term
  text2 = gsub(paste("",focal.term,""),paste("",replacement.term,""), text2)
  # setTxtProgressBar(pb, i)
}


it_m = itoken(text2,
              # preprocessor = text.clean,
              tokenizer = tok_fun,
              ids = data$id,
              # progressbar = T
              )

vocab = create_vocabulary(it_m
                          #ngram = c(2L, 2L),
                          #stopwords = stopwords
)

pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 1)
# doc_proportion_max = 0.5,
# doc_proportion_min = 0.001)

vectorizer = vocab_vectorizer(pruned_vocab)

dtm_m  = create_dtm(it_m, vectorizer)
dim(dtm_m)
## [1]    59 20763
dtm = as.DocumentTermMatrix(dtm_m, weighting = weightTf)

print(difftime(Sys.time(), t1, units = 'sec'))
## Time difference of 41.58838 secs

From the document term matrix, rows with 5 or fewer tokens are removed, and columns for terms occurring 6 or fewer times in the corpus are removed. Only the remaining rows and terms are retained.

# some basic clean-up ops
dim(dtm)
## [1]    59 20763
a0 = apply(dtm, 1, sum)   # apply sum operation to dtm's rows. i.e. get rowSum
  dtm = dtm[(a0 > 5),]    # retain only those rows with token rowSum >5, i.e. delete empty rows
  dim(dtm); rm(a0)        # delete a0 object
## [1]    59 20763
a0 = apply(dtm, 2, sum)   # use apply() to find colSums this time
  dtm = dtm[, (a0 > 6)]     # retain only those terms that occurred > 6 times in the corpus
  # dtm = dtm[, (a0 > 4)]     # alternative: retain terms that occurred > 4 times
  dim(dtm); rm(a0)
## [1]   59 3717

A summary word cloud of the pruned Document Term Matrix (terms occurring more than 6 times) is displayed below.

# view summary wordcloud
a0 = apply(dtm, 2, sum)     # colSum vector of dtm
  a0[1:5]                   # view what a0 obj is like
##       bru tropicana       taj      tata   locator 
##         7        20         7        23         9
  a1 = order(as.vector(a0), decreasing = TRUE)     # vector of token locations
  a0 = a0[a1]     # a0 ordered as per token locations
  a0[1:5]         # view a0 now
##           coffee            cloth    palaeontology      association 
##             2875             1435              801              543 
## palaeontological 
##              539
#windows() # opens new image window
wordcloud(names(a0), a0,     # invoke wordcloud() func. Use ?wordcloud for more info
          scale=c(4,1), 
          3, # min.freq 
          max.words = 100,
          colors = brewer.pal(8, "Dark2"))
title(sub = "Quick Summary Wordcloud")

Now let’s create the Term Co-occurrence Matrix (TCM) and view the distilled COG for the full corpus.

#------------------------------------------------------#
# Step 1a - Term Co-occurrence Matrix                   #
#------------------------------------------------------#

pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 5)

vectorizer = vocab_vectorizer(pruned_vocab, grow_dtm = FALSE, skip_grams_window = 3L)
tcm = create_tcm(it_m, vectorizer)

tcm.mat = as.matrix(tcm)
adj.mat = tcm.mat + t(tcm.mat)

diag(adj.mat) = 0     # set diagonals of the adj matrix to zero --> a node isn't its own neighbour
a0 = order(apply(adj.mat, 2, sum), decreasing = T)
adj.mat = as.matrix(adj.mat[a0[1:50], a0[1:50]])

# windows()
distill.cog(adj.mat, 'Distilled COG for full corpus',  10,  10)

Now that we have the DTM and the TCM for the full corpus, let’s run model-based text analytics on the text collected for the three subjects.

The size of the input matrix (the pruned DTM) is 59 documents by 3717 terms.

Here, since we have chosen three subjects, we override the Bayes Factor model selection and set K = 3. K defines the number of topics, and the topic model is run with that value.
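
As an aside, if we did want the Bayes Factor criterion to choose K, a hedged sketch (assuming maptpx's behaviour of fitting a vector of candidate K values and selecting by log Bayes factor; the range 2:10 is illustrative only) would look like this:

# simsel = topics(dtm, K = 2:10, verb = 2)   # fits each candidate K and keeps the one with the highest log BF
# simsel$K                                   # the selected number of topics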

#################################################
## --- Step 2: model based text analytics ------ ###
#################################################

K = 3     # overriding model fit criterion

# -- run topic model for selected K -- #
summary( simfit <- topics(dtm,  K=K, verb=2), nwrd = 12 )
## 
## Estimating on a 59 document collection.
## Fitting the 3 topic model.
## log posterior increase: 16845.7, 559.4, done. (L = -805106)
## 
## Top 12 phrases by topic-over-null term lift (and usage %):
## 
## [1] 'gashonga', 'faint', 'chapada', 'cupped', 'grapefruit', 'yukiro', 'nyamasheke', 'nib', 'finca', 'bodied', 'muscovado', 'mpemba' (72.3) 
## [2] 'inscite', 'wilmington', 'positions', 'booking', 'deadlines', 'derivative', 'equations', 'buzzword', 'applicants', 'euler', 'cubes', 'dt' (22.9) 
## [3] 'aberdeen', 'ocr', 'jp', 'scanned', 'creativecommons', 'qe', 'serial', 'bhl', 'digitized', 'mercury', 'avenel', 'periodicals' (4.8) 
## 
## Dispersion = 4.6
rownames1 = gsub(" ", ".", rownames(simfit$theta));  rownames(simfit$theta) = rownames1;  

The structure of the fitted topic model object, as shown by str(simfit), is displayed below:

## List of 6
##  $ K    : int 3
##  $ theta: num [1:3717, 1:3] 1.86e-05 5.32e-05 1.86e-05 6.11e-05 5.35e-05 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ phrase: chr [1:3717] "bru" "tropicana" "taj" "tata" ...
##   .. ..$ topic : chr [1:3] "1" "2" "3"
##  $ omega: num [1:59, 1:3] 0.234011 0.000183 0.000386 0.817266 0.577235 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ document: chr [1:59] "1" "2" "3" "4" ...
##   .. ..$ topic   : chr [1:3] "1" "2" "3"
##  $ BF   : NULL
##  $ D    :List of 3
##   ..$ dispersion: num 4.6
##   ..$ pvalue    : num 0
##   ..$ df        : num 195357
##  $ X    :List of 6
##   ..$ i       : int [1:32759] 59 59 59 59 58 57 57 57 57 57 ...
##   ..$ j       : int [1:32759] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ v       : num [1:32759] 7 20 7 23 9 19 41 76 7 11 ...
##   ..$ nrow    : int 59
##   ..$ ncol    : int 3717
##   ..$ dimnames:List of 2
##   .. ..$ Docs : chr [1:59] "1" "2" "3" "4" ...
##   .. ..$ Terms: chr [1:3717] "bru" "tropicana" "taj" "tata" ...
##   ..- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
##   ..- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
##  - attr(*, "class")= chr "topics"

From the topic modelling we obtain two key outputs, theta and omega. Theta is analogous to factor loadings and omega is analogous to factor scores. Their dimensions are:

dim(simfit$theta)   # analogous to factor loadings
## [1] 3717    3

From the dimensions of theta we can see that there are 3717 terms and three topics.

dim(simfit$omega)   # analogous to factor scores 
## [1] 59  3

From the dimensions of omega we can see that there are 59 documents and three topics.

The first 5 rows of the factor loadings and the factor scores are:

simfit$theta[1:5,]
##            topic
## phrase                 1            2            3
##   bru       1.860192e-05 4.983756e-09 4.223444e-04
##   tropicana 5.315651e-05 4.983762e-09 1.206623e-03
##   taj       1.860192e-05 4.983756e-09 4.223444e-04
##   tata      6.113064e-05 4.983762e-09 1.387610e-03
##   locator   5.353881e-05 2.551842e-04 8.115983e-09
simfit$omega[1:5,]
##         topic
## document            1         2            3
##        1 0.2340107723 0.4638817 0.3021075014
##        2 0.0001832721 0.9996619 0.0001548401
##        3 0.0003855060 0.9990870 0.0005275367
##        4 0.8172661647 0.1817034 0.0010304797
##        5 0.5772350636 0.4200335 0.0027314231

From the values of omega we can infer the proportion of each topic in each document.

Now let's compute the lift for all terms across all topics. First the total frequency of each term (the column sums of the DTM) is computed, working through the DTM in chunks of columns to keep memory usage manageable.

# ----------------------------------------------------------#
### Step 2a - compute LIFT for all terms across all topics ###
# ----------------------------------------------------------#

tst = round(ncol(dtm)/100)
a = rep(tst,99)
b = cumsum(a);rm(a)
b = c(0,b,ncol(dtm))

ss.col = c(NULL)
for (i in 1:(length(b)-1)) {
  tempdtm = dtm[,(b[i]+1):(b[i+1])]
  s = colSums(as.matrix(tempdtm))
  ss.col = c(ss.col,s)
  # print(i)
}

head(ss.col)
##       bru tropicana       taj      tata   locator     omaha 
##         7        20         7        23         9        19

Theta is the probability of a term under a particular topic, and omega is the proportion of a topic in a particular document.
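
Restating this in probabilistic notation (theta and omega are exactly the matrices returned by simfit above):

\theta_{t,k} = P(\text{term } t \mid \text{topic } k), \qquad \sum_{t} \theta_{t,k} = 1

\omega_{d,k} = P(\text{topic } k \mid \text{document } d), \qquad \sum_{k} \omega_{d,k} = 1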

Lift has the same dimensions as theta, so a blank lift matrix is initialised by multiplying theta by zero. We then calculate each term's marginal occurrence probability, i.e. the total frequency of the term divided by the total number of tokens in the DTM.

theta = simfit$theta
lift = theta*0;       # lift will have same dimn as the theta matrix

sum1 = sum(dtm)
pterms = ss.col/sum1     # each column's marginal occurrence probability

head(pterms)
##          bru    tropicana          taj         tata      locator 
## 6.257934e-05 1.787981e-04 6.257934e-05 2.056178e-04 8.045915e-05 
##        omaha 
## 1.698582e-04

We then calculate the lift for each term-topic pair. Lift is the probability of the term under a particular topic (theta) divided by the term's marginal occurrence probability in the whole corpus.
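
In equation form, the nested loop below computes, for every term t and topic k,

\mathrm{lift}_{t,k} = \frac{\theta_{t,k}}{p_t}, \qquad p_t = \frac{\sum_{d} \mathrm{DTM}_{d,t}}{\sum_{d} \sum_{t'} \mathrm{DTM}_{d,t'}}

so a lift above 1 means the term is more likely under that topic than in the corpus as a whole.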

for (i in 1:nrow(theta)){  
  for (j in 1:ncol(theta)){
    ptermtopic = 0; pterm = 0;
    ptermtopic = theta[i, j]
    pterm = pterms[i]
    lift[i, j] = ptermtopic/pterm     # divide each cell by the column's marg. occurr. proby.
  }
}   

dim(lift); head(lift, 15)
## [1] 3717    3
##             topic
## phrase               1            2            3
##   bru        0.2972534 7.963900e-05 6.748943e+00
##   tropicana  0.2972990 2.787368e-05 6.748520e+00
##   taj        0.2972534 7.963900e-05 6.748943e+00
##   tata       0.2973022 2.423798e-05 6.748490e+00
##   locator    0.6654160 3.171600e+00 1.008708e-04
##   omaha      1.3906849 1.152551e-04 4.068825e-05
##   franchise  1.3906867 7.674104e-05 1.885553e-05
##   scooters   1.3906874 6.142288e-05 1.017206e-05
##   phys       1.3906793 2.382980e-04 1.104395e-04
##   newsnow    1.3906825 1.674554e-04 7.027970e-05
##   khazana    1.3809963 1.769762e-04 5.991910e-02
##   sanjeev    1.3810014 1.415813e-04 5.987987e-02
##   batangas   1.3907125 7.067519e-05 1.005307e-04
##   franchises 1.3906998 1.474209e-04 1.116750e-04
##   tribal     1.3907146 8.077131e-05 1.148922e-04
lift[25:35,]
##              topic
## phrase               1            2            3
##   florals     1.390712 6.060900e-05 8.555898e-05
##   intense     1.390706 3.546685e-05 4.825628e-05
##   ethiopian   1.390709 5.022407e-05 7.110222e-05
##   bran        1.390714 6.818512e-05 9.625386e-05
##   faint       1.390716 7.792585e-05 1.100044e-04
##   apparent    1.390713 7.294335e-05 9.630130e-05
##   mouthfeel   1.390711 5.454810e-05 7.700309e-05
##   cultivars   1.390712 6.084302e-05 8.596505e-05
##   peach       1.390703 2.185026e-05 3.340670e-05
##   bittersweet 1.390702 1.474273e-05 2.081164e-05
##   malt        1.390711 5.454810e-05 7.700309e-05

Let's generate a censored lift matrix. For each term, every lift value except the largest is set to zero, so each token is hard-assigned to the single topic in which it has the highest lift.
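
Equivalently, the censoring step below keeps only each term's largest lift:

\mathrm{censored.lift}_{t,k} =
\begin{cases}
\mathrm{lift}_{t,k} & \text{if } k = \arg\max_{k'} \mathrm{lift}_{t,k'} \\
0 & \text{otherwise}
\end{cases}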

# Generate A censored Lift matrix
censored.lift = lift
for (i in 1:nrow(lift)){
  censored.lift[i,][censored.lift[i,] < max(censored.lift[i,])] = 0   # hard assigning tokens to topics
} 
head(censored.lift, 10); 
##            topic
## phrase             1      2        3
##   bru       0.000000 0.0000 6.748943
##   tropicana 0.000000 0.0000 6.748520
##   taj       0.000000 0.0000 6.748943
##   tata      0.000000 0.0000 6.748490
##   locator   0.000000 3.1716 0.000000
##   omaha     1.390685 0.0000 0.000000
##   franchise 1.390687 0.0000 0.000000
##   scooters  1.390687 0.0000 0.000000
##   phys      1.390679 0.0000 0.000000
##   newsnow   1.390683 0.0000 0.000000

Now let's calculate ETA, each document's score on each topic. ETA is similar to omega, but it is computed differently: omega comes from the fitted topic model itself, whereas ETA is computed directly from the data by weighting each document's term counts with the topic lifts, so terms with high lift for a topic pull the document's score towards that topic. The scores are then normalised into a topic proportion for each document.
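
In equation form, the chunked loop below computes, for each document d and topic k,

\eta_{d,k} = \sum_{t} \mathrm{DTM}_{d,t} \cdot \mathrm{lift}_{t,k}, \qquad
\eta^{\mathrm{propn}}_{d,k} = \frac{\eta_{d,k}}{\sum_{k'} \eta_{d,k'}}

where the proportion step is carried out later when eta.propn is built.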

#----------------------------------------------------------------#
# Step 2b - Calculate ETA - each document's score on each topic  #
#----------------------------------------------------------------#

t = Sys.time()

if(nrow(dtm) < 100) {k1 = 10} else {k1= 100}   # to avoid machine choking up in v small datasets

tst = ceiling(nrow(dtm)/k1)  # chunk size: process the dtm k1 row-chunks at a time
a = rep(tst, (k1 - 1))
b = cumsum(a);rm(a)    # cumsum() is cumulative sum
b = c(0, b, nrow(dtm))  # break the dtm into row chunks
  a0 = which(b > nrow(dtm));    # sometimes, rounding errors cause out of bound errors
  if (length(a0) > 0) {b = b[-a0]}

eta.new = NULL
for (i1 in 1:K){
  
  a2 = c(NULL)
  for (i in 1:(length(b)-1)) {
    tempdtm = dtm[(b[i]+1):(b[i+1]),]
    a = matrix(rep(lift[, i1], nrow(tempdtm)), nrow(tempdtm), ncol(tempdtm), byrow = TRUE)
    a1 = rowSums(as.matrix(tempdtm * a))
    a2 = c(a2, a1); rm(a, a1, tempdtm)
      } # i ends
  
  eta.new = cbind(eta.new, a2); rm(a2)
  
  } # i1 ends

Sys.time() - t  # takes longer than the lift computation because ncol is very large
## Time difference of 0.120007 secs
rownames(eta.new) = rownames(simfit$omega)
colnames(eta.new) = colnames(simfit$theta)

The dimensions of ETA (dim(eta.new)) are:

## [1] 59  3

The first few rows of ETA (head(eta.new)) look like this:

##           1          2         3
## 1 1097.9014  2250.4843 2201.4851
## 2 1658.9466 12716.6726  360.5073
## 3 1143.9519  6038.4616  313.0930
## 4 1206.0561  1237.2288  251.7239
## 5  447.8624   889.4596  126.1623
## 6  910.8961  1574.9957  249.1155

Now let's calculate the ETA proportions, i.e. the topic proportions for each document, and view the output.

eta.propn = eta.new / rowSums(eta.new)   # calc topic proportions for each document
  eta.propn [1:5,]
##           1         2          3
## 1 0.1978247 0.4055021 0.39667322
## 2 0.1125768 0.8629590 0.02446419
## 3 0.1526184 0.8056109 0.04177076
## 4 0.4475147 0.4590815 0.09340375
## 5 0.3060248 0.6077684 0.08620678

Now let's plot the word cloud and the term co-occurrence graph for each topic.

df.top.terms = data.frame(NULL)    # can't fit ALL terms in plot, so choose top ones with max loading

k = 3
for (i in 1:k){       # for each topic 
  a0 = which(censored.lift[,i] > 1) # terms with lift greater than 1 for topic i
  freq = theta[a0, i] # theta for those terms
  freq = sort(freq, decreasing = T) # terms with higher probabilities for topic i
  
  # Auto correction - sometimes fewer than 100 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 100, 100, length(freq))
  top_word = as.matrix(freq[1:n])
  
  top.terms = row.names(top_word)
  df.top.terms.t = data.frame(topic = i, top.terms =top.terms, stringsAsFactors = F )
  df.top.terms = rbind(df.top.terms, df.top.terms.t  )
  
} # i loop ends
#pdf(file = paste0(K,' Topic Model results.pdf')) # use pdf() func to save plot directly as PDFs in your getwd()

k = 3
for (i in 1:K){       # for each topic 
  
  a0 = which(censored.lift[,i] > 1) # terms with lift greater than 1 for topic i
  freq = theta[a0,i] # theta for those terms
  freq = sort(freq, decreasing = T) # terms with higher probabilities for topic i
  
  # Auto correction - sometimes fewer than 100 terms in a topic have lift above 1
  n = ifelse(length(freq) >= 100, 100, length(freq))
  top_word = as.matrix(freq[1:n])
  
  # SUB TCM
  sub.tcm = adj.mat[colnames(adj.mat) %in% names(a0),colnames(adj.mat) %in% names(a0)]
  
  #   Plot wordcloud
  # windows()
  wordcloud(rownames(top_word), top_word,  scale=c(4,.2), 1,
            random.order=FALSE, random.color=FALSE, 
            colors=brewer.pal(8, "Dark2"))
  mtext(paste("Latent Topic",i), side = 3, line = 2, cex=2)
  
  # PLot TCM
  # windows()
  distill.cog(sub.tcm, '',  5,  5)
  mtext(paste("Term co-occurrence - Topic",i), side = 3, line = 2, cex=2)
  
} # i loop ends

The function below displays, for each topic, the documents (rows) that load most heavily on it.

show.top.loading.rows = function(eta.obj, number.of.units){
  
  K = ncol(eta.obj)    # no. of topic factors
  n = number.of.units
  top.loaders = NULL
  for (i in 1:K){
    a0 = order(eta.obj[,i], decreasing = TRUE)
    a1 = rownames(eta.obj[a0[1:n],])
    top.loaders = cbind(a1, top.loaders)
  } # i loop ends

  a2 = matrix()
  return(top.loaders)
  
} # func ends

show.top.loading.rows(eta.propn, 20)
##       a1   a1   a1  
##  [1,] "34" "2"  "49"
##  [2,] "43" "3"  "11"
##  [3,] "59" "9"  "29"
##  [4,] "1"  "50" "25"
##  [5,] "51" "27" "42"
##  [6,] "33" "53" "31"
##  [7,] "55" "26" "40"
##  [8,] "39" "35" "38"
##  [9,] "13" "58" "54"
## [10,] "12" "5"  "41"
## [11,] "46" "56" "23"
## [12,] "32" "7"  "21"
## [13,] "28" "6"  "15"
## [14,] "17" "30" "32"
## [15,] "54" "20" "37"
## [16,] "58" "12" "10"
## [17,] "47" "8"  "16"
## [18,] "19" "17" "57"
## [19,] "48" "28" "52"
## [20,] "36" "18" "14"
show.top.loading.rows(eta.new, 10)
##       a1   a1   a1  
##  [1,] "34" "2"  "49"
##  [2,] "43" "3"  "57"
##  [3,] "59" "49" "54"
##  [4,] "49" "53" "11"
##  [5,] "1"  "57" "29"
##  [6,] "51" "9"  "40"
##  [7,] "33" "7"  "25"
##  [8,] "57" "26" "41"
##  [9,] "54" "58" "39"
## [10,] "39" "54" "59"

Comments:

  1. From the topic model results we can see that the model separated each subject from the others well. For this analysis we took three topics: Coffee, Cloth and Palaeontology. In Latent Topic 1, coffee is the most frequent term, followed by cup, full, city and espresso. From the COG we can also identify links between the terms good coffee, add coffee and history coffee. In Latent Topic 2, cloth is the most frequent term, followed by research, shop, contact, events, etc. From the COG we can also identify links between the terms fabric shop and cloth contact. Although the terms mostly matched correctly, a few terms from Palaeontology, such as study and research, were also linked with cloth. In Latent Topic 3, Palaeontology has been mapped well and correctly: the terms palaeontology, association, biodiversity, library, etc. are closely connected with each other, and the COG also displays clear relationships among them.

  2. When we look at the majority of the lift values in the censored lift matrix, we do not find lift values spread across more than one topic. This indicates that there are not many mixed tokens and that the lift values have been assigned correctly. The ETA scores in the document-topic proportions have also been clearly assigned by topic.

  3. From the above exercise we were able to learn: The method to extract the wesites and the related text into an excel and analyse the extracted data using the Topic Modelling Technique. We learnt to create the Wordcloud and the COG’s for the related text and infer data from the images. We have learnt to clean the raw data and to include stop words so as to perform better analysis. We have learnt on how to create the Document Term Matrix from the Text Corpus and on how to create the Term Co Matrix We have also learnt on how to compute the lift for each term in all the topics. We were able to relate the theta and the omega values to the lift matrix and to calculate the ETA for each document in the topic.