


Table of Contents

The Problem
Our Solution

Section 1 - Data Import

Section 2 - Document Term Matrix

Section 3 - Topic Modeling

Section 4 - Tables and Charts

Section 5 - Conclusions

Section 6 - Appendix
Section 6a - Required Packages
Section 6b - Session Information

Section 7 - References



The Problem

The volume and complexity of text-based business data have increased exponentially and now greatly exceed the processing capabilities of humans. The vast majority of online business data is unstructured, existing in textual form such as emails, support tickets, chats, social media, surveys, articles, and documents. Manually sorting through this data to uncover hidden insights would be difficult, expensive, and impossibly time-consuming.

The internet has also made customer demands on businesses more complex, and that complexity has reduced the effectiveness of traditional marketing. Growing databases of customer responses make it difficult to interpret customers’ basic requirements, and the indicators of customer intent are becoming harder to read.

New Machine Learning methods are required for improved extraction of business knowledge, and new linguistic techniques are needed to mine customer response text data.






Our Solution

A result of this demand is the attention Topic Modeling has gained in recent years. ContextBase provides Topic Modeling of Client text data to precisely refine business policies and marketing material. Topic Modeling is a text mining method derived from Natural Language Processing and Machine Learning. Topic Models classify document terms into themes (or “topics”), and are applicable to the analysis of themes within novels, documents, reviews, forums, discussions, blogs, and micro-blogs.

ContextBase begins the Topic Modeling process with Data Scientist/Programmer awareness of the sensitivity of Topic Modeling algorithms. After the Topic Modeling of Client text data, ContextBase manually characterizes the resulting topics to reduce their arbitrariness. ContextBase also remains aware that Topic Models change as document contents vary.

The goal of ContextBase’s Topic Modeling of Client data is the programmatic deduction of stable Topic Models. As a result, ContextBase Topic Modeling enables improvement in the Client’s business processes.

This document presents a Topic Modeling analysis of customer response text data posted to https://www.yelp.com/. The programming language used is R. The analysis covers the required R packages, session information, data import, text normalization, creation of a document term matrix, the Topic Modeling code, and the output tables and graphs demonstrating the results.



Section 1 - Data Import

The data imported for this project is a collection of 10,000 customer feedback comments posted to https://www.yelp.com/. To reduce the extensive processing time the full dataset would require, only the first 1,000 comments were selected. The column of comment data within the dataframe was converted to character variables for the subsequent Natural Language Processing steps.

# Import Data
import_data <- read.csv("yelp.csv")

# Keep only the comment text column, then drop the full import to free memory
project_data <- data.frame(import_data$text)
rm(import_data)

# Rename the column and convert it from factor to character for the NLP steps below
names(project_data) <- "Data"
project_data$Data <- as.character(project_data$Data)
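
As a quick illustrative check (not part of the original pipeline), the structure of the imported dataframe can be confirmed before processing:

# Illustrative sanity check of the imported comments
str(project_data)           # expect one character column named "Data"
head(project_data$Data, 1)  # preview the first comment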



Section 2 - Document Term Matrix

The following function normalizes the text within the 1,000 selected customer responses by removing numbers, punctuation, and extra white space. Upper case letters are converted to lower case, and uninformative stop words (“the”, “a”, “an”, etc.) are removed. Lastly, a Document Term Matrix is created to tabulate how frequently each term appears in each document.

# Text Mining Function
library(tm)

dtmCorpus <- function(df) {
  df_corpus <- Corpus(VectorSource(df))
  # Coerce the text to ASCII, dropping characters that cannot be converted
  df_corpus <- tm_map(df_corpus, content_transformer(function(x) iconv(x, to = 'ASCII', sub = '')))
  df_corpus <- tm_map(df_corpus, removeNumbers)
  df_corpus <- tm_map(df_corpus, removePunctuation)
  df_corpus <- tm_map(df_corpus, stripWhitespace)
  # tolower is not a tm transformation, so it must be wrapped in content_transformer()
  df_corpus <- tm_map(df_corpus, content_transformer(tolower))
  df_corpus <- tm_map(df_corpus, removeWords, stopwords('english'))
  DocumentTermMatrix(df_corpus)
}
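
As a minimal usage sketch, the function can be applied to a handful of comments and the resulting matrix inspected; sample_dtm is an illustrative name:

# Illustrative usage: build a matrix from a few comments and inspect it
sample_dtm <- dtmCorpus(project_data$Data[1:10])
inspect(sample_dtm)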



Section 3 - Topic Modeling

Topic Modeling treats each document as a mixture of topics, and each topic as a mixture of words (or “terms”). Each document may contain words from several topics in particular proportions. The content of a document usually blends continuously with the content of other documents, rather than falling into discrete groups, in much the same way that individuals’ natural language use blends topics continuously.

An example of a two-topic Topic Model of a journalism document is the modeling of the document into “local” and “national” topics. The first topic, “local”, would contain terms like “traffic”, “mayor”, “city council”, and “neighborhood”, while the second topic might contain terms like “Congress”, “federal”, and “USA”. Topic Modeling would also statistically examine the terms that are shared between topics.

# Load the Topic Modeling package
library(topicmodels)

# Set parameters for Gibbs sampling
burnin <- 4000    # discard the first 4000 iterations as burn-in
iter <- 2000      # number of sampling iterations after burn-in
thin <- 500       # keep every 500th sample to reduce autocorrelation
seed <- list(2003, 5, 63, 100001, 765)   # one seed per restart
nstart <- 5       # number of independent restarts
best <- TRUE      # return only the highest-likelihood run

# Number of topics
k <- 4

# Create the Document Term Matrix (dtm) from the first 1,000 comments
dtm <- dtmCorpus(project_data$Data[1:1000])

# Find the sum of words in each document
# (for larger corpora, slam::row_sums(dtm) avoids converting to a dense matrix)
rowTotals <- apply(dtm, 1, sum)

# Remove all docs without words
dtm.new <- dtm[rowTotals > 0,]

# Run LDA using Gibbs sampling
ldaOut <- LDA(dtm.new, k, method="Gibbs", control=list(nstart=nstart, seed=seed, best=best, burnin=burnin, iter=iter, thin=thin))

# Save the variable, "ldaOut", for faster processing
saveRDS(ldaOut, "ldaOut_Yelp.rds")

# Docs to topics
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics_Yelp.csv"))

# Top 6 terms in each topic
ldaOut.terms <- as.matrix(terms(ldaOut,6))
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicsToTerms_Yelp.csv"))

# Probabilities associated with each topic assignment
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities_Yelp.csv"))
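
On later runs, the saved model can be reloaded instead of re-fitting, which skips the lengthy Gibbs sampling step:

# Reload the fitted model in later sessions
ldaOut <- readRDS("ldaOut_Yelp.rds")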



Section 4 - Tables and Charts

Below are tables and graphic visualizations of the results of Topic Modeling of the Yelp dataset.



Figure 1: “Top Terms Per Topic”

This graph displays the top ten terms for each of the Yelp dataset’s four topics. The probability of each term appearing in the topic is represented by the length of the bars. Each topic contains all terms (words) in the corpus, albeit with different probabilities.
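
A chart of this kind can be produced with the tidytext and ggplot2 packages. The sketch below assumes the fitted model ldaOut from Section 3 and shows one common recipe, not necessarily the exact code behind Figure 1:

library(tidytext)
library(dplyr)
library(ggplot2)

# Extract per-topic term probabilities (beta) and keep the top 10 terms per topic
top_terms <- tidy(ldaOut, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# One bar panel per topic, with terms ordered by probability within each panel
ggplot(top_terms, aes(x = reorder_within(term, beta, topic), y = beta)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL, y = "beta")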



Table 1: “Probability of Term being generated from Topic”

The first table displays the beta spread of terms across the four topics. The beta spread allows topics to be characterized by the terms that have a high probability of appearing within them.

Table 1: Probability of Term being generated from Topic
term      topic1      topic2      topic3      topic4      log_ratio
add       0.0000059   0.0018733   0.0000060   0.0000057    8.3029168
ago       0.0005991   0.0011147   0.0000060   0.0000057    0.8957047
almost    0.0026753   0.0001226   0.0000060   0.0002338   -4.4481789
along     0.0017262   0.0000058   0.0000060   0.0000057   -8.2083880
also      0.0098528   0.0030405   0.0003667   0.0000627   -1.6962095
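
A table of this shape can be derived from the fitted model with tidytext and tidyr. The sketch below assumes ldaOut from Section 3; the log_ratio column is shown as an illustrative base-2 contrast between two topics, since its exact definition is not given above:

library(tidytext)
library(tidyr)
library(dplyr)

# Spread the per-topic term probabilities (beta) into one column per topic
beta_spread <- tidy(ldaOut, matrix = "beta") %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  mutate(log_ratio = log2(topic2 / topic1))  # illustrative contrast of two topics

head(beta_spread)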



Table 2: Table of Topics Per Document

This table examines the first five dataset documents (customer response comments), matches each document with the topic category its terms indicate, and includes the probability with which each topic is assigned to the document. The gamma value of a topic per document represents how strongly the document corresponds to that topic. Each document is considered a mixture of all topics (4 in this case). The assignments in the first output file list the topic with the highest probability.

The highest probability in each row corresponds to the topic assigned to that document. The “goodness of fit” of the primary assignment can be assessed by taking the ratio of the highest to the second-highest probability, of the second-highest to the third-highest, and so on.

Table 2: Table of Topics Per Document
document  topic  gamma
1         1      0.1992188
2         1      0.1809816
3         1      0.2457627
4         1      0.3776596
5         1      0.2055556
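
The per-document topic probabilities, and the goodness-of-fit ratio described above, can be computed from the model. The sketch below assumes ldaOut from Section 3; fit_check is an illustrative name:

library(tidytext)
library(dplyr)

# Per-document topic probabilities (gamma)
doc_topics <- tidy(ldaOut, matrix = "gamma")

# The assigned topic has the highest gamma; the ratio of the highest to the
# second-highest gamma gauges the goodness of fit of that assignment
fit_check <- doc_topics %>%
  group_by(document) %>%
  arrange(desc(gamma), .by_group = TRUE) %>%
  summarise(topic = first(topic),
            gamma = first(gamma),
            fit_ratio = first(gamma) / nth(gamma, 2))

head(fit_check)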



Figure 2: Histogram of Topic Models

This figure is a histogram of the counts and spread of gamma values for each of the four topics throughout the entire Yelp dataset.
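
A histogram of this kind can be drawn from the gamma values with ggplot2. The sketch below shows one common recipe, assuming ldaOut from Section 3, not necessarily the exact code behind Figure 2:

library(tidytext)
library(ggplot2)

# One histogram panel of gamma values per topic
ggplot(tidy(ldaOut, matrix = "gamma"), aes(x = gamma)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ topic) +
  labs(x = "gamma", y = "count")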



Section 5 - Conclusions

An LDA Topic Model contains a large amount of useful information. This analysis outputs the top terms in each topic, the document-to-topic assignments, and the probabilities within the Topic Model.

Gibbs sampling usually finds a near-optimal solution, though the result varies mathematically from one analysis to another. Trialing a variety of parameter settings allows ContextBase to optimize the stability of Topic Modeling results.

Figure 1 demonstrates the arbitrariness of Topic Models. ContextBase makes a manual determination of the specific topics:

  1. The top words in Topic 1, “just”, “can”, “dont”, “one”, “get”, “also”, “make”, “better”, “think”, “around”, indicate that this topic within the Yelp dataset concerns resolving a complaint with a business.

  2. Topic 2’s top words, “good”, “food”, “try”, “chicken”, “restaurant”, “ordered”, “menu”, “cheese”, “pizza”, “lunch”, indicate Topic 2 refers to food ordered at restaurants.

  3. Topic 3’s top words, “place”, “great”, “good”, “service”, “food”, “love”, “ive”, “best”, “always”, “like”, indicate Topic 3 refers to reasons customers liked businesses.

  4. Topic 4’s top words, “back”, “time”, “really”, “even”, “like”, “got”, “didnt”, “first”, “much”, “want”, very possibly refer to reasons customers returned to businesses.

Table 1 allows for the verification of optimal beta spread generated by the LDA algorithm.

Table 2 provides a final determination of the topic of each document within the corpus of Yelp comments.

Figure 2 displays the gamma spread of the 4 topics determined by setting the value of “k”.

The generated Topic Model demonstrates that the primary topic assignments are most reliable when the ratios of probabilities are highest. Different values of “k” produce different topic distributions, so several values should be trialed to find the most stable model.
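
As an illustrative sketch of how candidate values of “k” can be compared, the log-likelihood of models fitted at each value can be inspected; dtm.new is the matrix from Section 3, and the shortened burn-in and iteration counts here are for speed only:

# Fit models for several candidate values of k and compare log-likelihoods
candidate_k <- c(2, 4, 6, 8)
log_liks <- sapply(candidate_k, function(k) {
  model <- LDA(dtm.new, k, method = "Gibbs",
               control = list(seed = 2003, burnin = 1000, iter = 1000))
  as.numeric(logLik(model))
})
data.frame(k = candidate_k, logLik = log_liks)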



Section 6 - Appendix

Section 6a - Required Packages

The R packages needed for this analysis are installed and loaded into the package library. The packages included provide Topic Modeling, Natural Language Processing, data manipulation, and plotting functionality.

List of Required Packages
Required Packages ‘topicmodels’ ‘tidytext’ ‘tidyr’ ‘tm’ ‘syuzhet’ ‘plyr’ ‘dplyr’ ‘data.table’ ‘stringr’ ‘ggplot2’ ‘knitr’
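
A minimal sketch for installing any missing packages and loading the full set:

# Install any packages not already present, then load them all
required_packages <- c("topicmodels", "tidytext", "tidyr", "tm", "syuzhet",
                       "plyr", "dplyr", "data.table", "stringr", "ggplot2", "knitr")
missing_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(missing_packages) > 0) install.packages(missing_packages)
invisible(lapply(required_packages, library, character.only = TRUE))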



Section 6b - Session Information

Session information is provided for reproducible research. The session information below serves as a reference when running the required packages and R code.
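
The information below can be regenerated at any time with R’s built-in call:

# Print the R version, platform, and loaded package versions
sessionInfo()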

Session Information
R Version R version 3.6.0 (2019-04-26)
Platform x86_64-w64-mingw32/x64 (64-bit)
Running Windows 10 x64 (build 17763)
RStudio Citation RStudio: Integrated Development Environment for R
RStudio Version 1.0.153



Section 7 - References

  1. https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/

  2. https://www.tidytextmining.com/topicmodeling.html

  3. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5028368/

  4. https://www.researchgate.net/publication/303563965_A_Text_Mining_Research_Based_on_LDA_Topic_Modelling