This vignette explains how to analyse unstructured data, such as text documents, using the tm and topicmodels packages and the Latent Dirichlet Allocation (LDA) method to discover the hidden topics within a set of documents and assign topics to each document.
This is especially useful when you have a large corpus of documents and want to explore them, and the relationships between them, without having to read, parse and categorise every one.
For more information, read this blog https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
Note: This vignette assumes that the tm and topicmodels packages are already installed.
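If they are not yet installed, a one-off installation step along these lines (run once, outside the notebook) would be needed:
#install the two packages used in this vignette (only required once)
install.packages(c("tm", "topicmodels"))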
Set the working directory to where the .txt documents are located, load the tm package and read the files into a corpus for analysis.
#set working directory to where the corpus is stored on c drive
setwd("C:/Users/Benjibex/Documents/assignment1")
#load tm library assuming tm package already installed
library(tm)
#load document files into corpus
filenames <- list.files(getwd(),pattern="*.txt")
files <- lapply(filenames,readLines)
docs <- Corpus(VectorSource(files))
#go to parent directory
setwd("../")
getwd()
[1] "C:/Users/Benjibex/Documents"
Format the files to enable analysis of the terms, then create the document-term matrix, which will be analysed with the topicmodels package in the next steps.
#Remove punctuation - replace punctuation marks with " "
docs <- tm_map(docs, removePunctuation)
#Transform to lower case
docs <- tm_map(docs,content_transformer(tolower))
#Strip digits
docs <- tm_map(docs, removeNumbers)
#Remove stopwords from standard stopword list
docs <- tm_map(docs, removeWords, stopwords("english"))
#Strip whitespace (cosmetic?)
docs <- tm_map(docs, stripWhitespace)
#Stem the documents so that words with the same root (e.g. different verb forms of the same word) aren't counted separately
docs <- tm_map(docs,stemDocument)
#Create document-term matrix
dtm <- DocumentTermMatrix(docs)
dtm
<<DocumentTermMatrix (documents: 28, terms: 4521)>>
Non-/sparse entries: 11938/114650
Sparsity : 91%
Maximal term length: 80
Weighting : term frequency (tf)
rownames(dtm) <- filenames
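Before moving on to topic modelling, it can help to glance at the most frequent terms in the matrix. A minimal sketch, where the lowfreq threshold of 50 is an arbitrary choice for illustration:
#list terms that occur at least 50 times across the corpus
findFreqTerms(dtm, lowfreq = 50)
#alternatively, rank all terms by total frequency and show the top 10
freq <- colSums(as.matrix(dtm))
head(sort(freq, decreasing = TRUE), 10)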
Use LDA to identify k topics (in this case I have chosen 5) through an iterative sampling process that allocates the words in each document to topics.
#Load Topic models
library(topicmodels)
#Run Latent Dirichlet Allocation (LDA) using Gibbs Sampling
#set burn in
burnin <-1000
#set iterations
iter<-2000
#set the thinning interval (keep every 500th sample)
thin <- 500
#set random starts at 5
nstart <-5
#set a seed for each of the 5 random starts (for reproducibility)
seed <- list(254672,109,122887,145629037,2)
#keep only the run with the highest posterior probability
best <-TRUE
#set number of topics
k <-5
#run the LDA model
ldaOut <- LDA(dtm,k, method="Gibbs", control=
list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
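The choice of k = 5 is a judgement call rather than a given. One rough way to compare candidate values of k, sketched here on the assumption that the model log-likelihood reported by logLik() is an adequate yardstick, is to refit the model for a few values and compare the results (higher is better, all else being equal):
#refit the model for a few candidate numbers of topics and compare log-likelihoods
candidate_k <- c(3, 5, 7, 10)
logliks <- sapply(candidate_k, function(kk)
  as.numeric(logLik(LDA(dtm, kk, method="Gibbs", control=
    list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin)))))
data.frame(k = candidate_k, logLik = logliks)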
#view the top 6 terms for each of the 5 topics, create a matrix and write to csv
terms(ldaOut,6)
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
[1,] "univers" "compani" "limit" "said" "peopl"
[2,] "will" "year" "audit" "firm" "work"
[3,] "public" "“" "servic" "tropfest" "busi"
[4,] "new" "entrepreneur" "qualiti" "analyt" "help"
[5,] "valu" "employe" "global" "brand" "world"
[6,] "market" "busi" "profession" "partner" "program"
ldaOut.terms <- as.matrix(terms(ldaOut,6))
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicsToTerms.csv"))
#view the topic assignment for each document
topics(ldaOut)
01_AFR1.txt 01_AFR2.txt 01_AFR3.txt 01_AFR4.txt 02_EY1.txt 02_EY2.txt
2 4 4 4 5 5
02_EY3.txt 02_EY4.txt 02_EY5.txt 02_EY6.txt 02_EY7.txt 02_EY8.txt
1 3 2 1 5 4
03_EY1.txt 03_EY10.txt 03_EY2.txt 03_EY3.txt 03_EY4.txt 03_EY5.txt
2 5 4 4 5 2
03_EY6.txt 03_EY7.txt 03_EY8.txt 03_EY9.txt 04_EY1.txt 05_UNSW1.txt
4 4 4 5 4 5
05_UQ1.txt 05_USYD1.txt 05_UTS1.txt 05_UWA1.txt
1 5 1 2
#create a matrix and write to csv
ldaOut.topics <-as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))
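A quick way to see how the corpus is spread across the topics is to tabulate the assignments:
#count how many documents were assigned to each topic
table(topics(ldaOut))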
Finally, calculate the probability of each document being associated with each topic.
#Find probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
#investigate topic probabilities data.frame
summary(topicProbabilities)
V1 V2 V3 V4
Min. :0.03176 Min. :0.01160 Min. :0.02826 Min. :0.01077
1st Qu.:0.09810 1st Qu.:0.07499 1st Qu.:0.06686 1st Qu.:0.07643
Median :0.12333 Median :0.10939 Median :0.08570 Median :0.16297
Mean :0.19975 Mean :0.19625 Mean :0.11739 Mean :0.24264
3rd Qu.:0.20316 3rd Qu.:0.28758 3rd Qu.:0.12788 3rd Qu.:0.42299
Max. :0.85475 Max. :0.61286 Max. :0.82767 Max. :0.61438
V5
Min. :0.05076
1st Qu.:0.11648
Median :0.17366
Mean :0.24399
3rd Qu.:0.34787
Max. :0.73058
str(topicProbabilities)
'data.frame': 28 obs. of 5 variables:
$ V1: num 0.1563 0.0951 0.2126 0.1446 0.117 ...
$ V2: num 0.4043 0.108 0.0817 0.0661 0.0755 ...
$ V3: num 0.0782 0.1028 0.0863 0.124 0.1396 ...
$ V4: num 0.164 0.612 0.484 0.533 0.325 ...
$ V5: num 0.1968 0.0823 0.1356 0.1322 0.3434 ...
The column-wise summary statistics of the topic probabilities are not very meaningful on their own, because each row (document) is a probability vector that sums to 1. Further analysis of how strongly each document is associated with its assigned topic would be required.
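To verify that each document's probabilities do sum to 1, and to get a crude measure of how strongly each document belongs to its assigned topic, something like the following can be run:
#confirm each document's topic probabilities sum to 1
rowSums(topicProbabilities)
#maximum topic probability per document (a rough indicator of assignment strength)
apply(topicProbabilities, 1, max)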
Kailash Awati and Sensanalytics Consulting Pty Ltd (2015, September 29). "A gentle introduction to topic modeling using R." https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/