This lab explores latent categorization of texts in two steps. First, we compile the abstracts of 53,709 American sociology dissertations (1990–2013) into a corpus. Using topic modeling, a natural language processing technique, we then try to detect latent topics behind the manifest abstracts. The results are communicated via so-called loadings, which show the prevalence of the various topics in each abstract. Second, we explore the latent topical structure in a low-dimensional space using correspondence analysis. We do this by accumulating the topical loadings to the topics and to the pre-given labels of the abstracts. The accumulated loadings are aggregated in a two-way table, where the loadings in each cell can be thought of as (probable) frequencies of the intersection between a topic and a label.
By the end of the lab, you should (among other things) be able to produce an interactive, clustered map of the topical space.
Before we do all this, let us briefly consider the lab in an intellectual-historical context.
The two main techniques used here, topic modeling and correspondence analysis, are more closely related than one might expect. Both draw on a long-standing European tradition of latent classification, stretching from Linné via Dirichlet to Benzécri. Topic modeling is, if not a sibling, then at least a first cousin of correspondence analysis: manifest flowers have latent classes; sets of observations are explained by unobserved groups; expressed cultural preferences are organized along latent dimensions of socio-cultural-economic capital.
Theoretically, we will lean on Bourdieu's sociology. The lab is geared towards the concept of distinction: how is something (e.g. a topic) different, i.e. distinct, from something else?
A question could be: what is the exchange rate between the academic capital associated with a certain topical position in an academic field and other forms of capital?
Or: should we, in terms of our personal strategy as academics, be alike or distinct?
In other words:
i.e., how can I climb the academic ivory tower?
i.e., how can I wedge something new into the circle of academic hegemony?
Before you start, you're going to need some data:
https://mega.nz/#!P4sgTa5L!AWStLHQkD9RWESnE4gKRU_770GNRB-vPjtAqcPJCJLM
Unzip the archive and save the contents to your working directory.
The script below is a very basic introduction to topic modeling, a natural language processing technique for detecting the latent topics that guide the words used in a set of documents. Poetics (vol. 41, 2013) has a special issue on topic modeling that should get you started, at least conceptually.
The script has the following structure:
1. Preparations: load packages and get the data and create a corpus
2. Clean up: clean up the corpus and export to csv
3. First training: fit the model and export the results
4. Model evaluation and re-training: evaluate the appropriate number of topics using harmonic means
# Clear the workspace
rm(list=ls())
# Install the necessary packages (if you do not have them already):
# install.packages(c("NLP", "tm", "SnowballC", "topicmodels", "Rmpfr"))
#load necessary packages
# basic nlp:
library(NLP)
# text mining library:
library(tm)
# SnowballC for stemming
library(SnowballC)
# for modeling the topics:
library(topicmodels)
# Rmpfr is needed for evaluating the optimal number of topics
library(Rmpfr)
# Set the working directory using setwd.
# Load the sample data, treating the text strings as character strings
# rather than as factors (categories):
soctest <- read.csv("sociologysample.csv", stringsAsFactors=FALSE)
#Check it out
str(soctest)
# the data we are interested in is found in the column "text".
Topic modeling is computationally heavy and takes a long time. Typically, models are fitted on servers (rather than on personal laptops). The data set provided for this lab is only a very small (random) sample of the original corpus. Use it to test the scripts.
# Now we need to do three things:
# 1. extract the text column as a character vector,
soctest_text <- paste(soctest$text)
# 2. turn that vector into a vector source,
soctest_source <- VectorSource(soctest_text)
# 3. create a corpus from the documents in the source
corp <- Corpus(soctest_source)
# Time to clean up the corpus we just created:
# Depending on your version of R (and tm), you may have to add the
# argument "lazy=T" within the parentheses below, like so: ## corp <- tm_map(corp, removePunctuation, lazy=T) ##
# change everything to lower case
corp <- tm_map(corp, content_transformer(tolower))
# remove punctuation
corp <- tm_map(corp, removePunctuation)
# strip numbers from the corpus
corp <- tm_map(corp, removeNumbers)
# remove stopwords (in this case "english")
corp <- tm_map(corp, removeWords, stopwords("english"))
# Here, add stop words that "blur" your results.
myStopwords <- c("here", "add", "your", "own", "stop", "words")
corp <- tm_map(corp, removeWords, myStopwords)
# Remove redundant whitespace
corp <- tm_map(corp, stripWhitespace)
#Stem document
corp <- tm_map(corp,stemDocument)
Let's consider the cleaning procedure above. With the possible exception of stripping excess whitespace and punctuation, cleaning the corpus is by no means a neutral, objective process. Stemming, for example, is a rather coarse, heuristic process for decreasing the number of unique tokens in the corpus. E.g., "fishing", "fished", and "fisher" are reduced to the stem (and root word) "fish". Does that really make sense, semantically? The stoplist is an equally critical issue. We want it to be conservative (i.e. short) but effective (i.e. help increase the distinctiveness of the topics).
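If you want to see what the stemmer actually does to particular word forms, you can call SnowballC's wordStem() directly (a small illustrative check, not part of the lab script itself):
# wordStem() is the stemmer that tm's stemDocument() relies on
library(SnowballC)
wordStem(c("fishing", "fished", "fisher", "communities", "community"))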
# Write the corpus (corp) to a csv file. Why? To inspect and play around with it in other TM applications like Mallet or the S. toolbox.
corp_df <- data.frame(Text = do.call(rbind,lapply(corp, function(x) x$"content")),stringsAsFactors=F)
write.table(corp_df, "Corp_Dataframe.csv", sep=",", row.names=FALSE, col.names=FALSE)
# Now convert the corpus to a document-term matrix:
dtm <- DocumentTermMatrix(corp)
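A quick sanity check of the resulting matrix can be useful before modeling (illustrative only; it assumes the sample contains at least five documents and five terms):
# number of documents x number of unique (stemmed) terms
dim(dtm)
# peek at a small corner of the sparse matrix
inspect(dtm[1:5, 1:5])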
# Set parameters for Gibbs sampling, a randomized algorithm for obtaining a sequence
# of observations from a multivariate probability distribution.
# This sequence will be used to approximate the joint distribution of topics and words.
burnin <- 200 # number of omitted Gibbs iterations at beginning
iter <- 400 # number of iterations
thin <- 200 # number of omitted in-between Gibbs iterations
seed <-list(2003,5,63,100001,765) #seeds can be set to enable reproducibility
nstart <- 5 # number of repeated random starts
best <- TRUE # return only the best-fitting model of the repeated runs
k <- 5 # Set the number of topics (more on the number of topics below)
# Run LDA using Gibbs sampling. It's going to take a while.
ldaOut <-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
# Now write out the results:
# docs to topics (top topic per doc)
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))
# top 10 terms in each topic, our topic summary
ldaOut.terms <- as.matrix(terms(ldaOut,10))
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicsToTerms.csv"))
# probabilities associated with each topic assignment,
# i.e. topic loadings for the documents. This file can be joined with metadata from the original dataset
# to see how the topic loadings accumulate to e.g. year, institution or label.
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
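As a hedged sketch of that joining step (it assumes the original data frame has a document-level label column, here called "label", aligned row by row with the corpus; the actual column name may differ):
# attach a document label to the per-document topic loadings (column name "label" is an assumption)
doc_loadings <- cbind(label = soctest$label, topicProbabilities)
# sum the loadings per label to get a labels-by-topics table of "probable frequencies"
agg_loadings <- aggregate(. ~ label, data = doc_loadings, FUN = sum)
# a table of this shape is what the correspondence analysis below takes as input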
The single most important parameter of the topic model is K, the number of topics. Above, we used K=5. But why not some other number? How to determine the appropriate number of topics is a bit of an unfinished business in topic modeling practice. One straightforward approach is to rely on our knowledge interest: a quick overview of a corpus requires perhaps 10 topics, something more refined 25, and a really detailed topical inspection perhaps 100 topics. Though robust, this approach is still somewhat unsatisfying. Ponweiser (2012), ch. 3.7, lists a number of more structured approaches (http://epub.wu.ac.at/3558/1/main.pdf). Here we will just briefly look into harmonic means, which is a way to estimate goodness-of-fit, i.e. how well the model fits the training data. From Su & Liao (n.d.), p. 2 (available here: http://people.cs.umass.edu/~jcsu/reports/Project3.pdf) we have that:
“Since we want to discover distinct topics, the topic distribution of a document should have higher probability only on a small number of topics. In other words, if emk [the probability of topic k in document m] has higher probability only for some k, overall, the harmonic mean is small. We consider the model to have better goodness-of-fit on the training data by checking whether the harmonic mean becomes lesser with each epoch of Gibbs sampling.”
# First, create the harmonic mean function:
harmonicMean <- function(logLikelihoods, precision = 2000L) {
  library("Rmpfr")
  llMed <- median(logLikelihoods)
  as.double(llMed - log(mean(exp(-mpfr(logLikelihoods, prec = precision) + llMed))))
}
# Update the parameters:
k = 5
burnin = 200
iter = 400
keep = 50
# Fit the model again
fitted <- LDA(dtm, k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) )
# where "keep" indicates that the log-likelihood for every
# iteration is evaluated and stored. This returns all log-likelihood
# values including burnin (that we do not want), therefore these need to be omitted before
# calculating the harmonic mean:
logLiks <- fitted@logLiks[-c(1:(burnin/keep))]
# then calculate using our function created above:
harmonicMean(logLiks)
# generate numerous topic models with different numbers of topics
# (this takes a couple of minutes to run)
sequ <- seq(2, 30, 3) # candidate numbers of topics from 2 to 30, in steps of three
fitted_many <- lapply(sequ, function(k) LDA(dtm, k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))
# extract the log-likelihoods from each fitted model
logLiks_many <- lapply(fitted_many, function(L) L@logLiks[-c(1:(burnin/keep))])
# compute harmonic means again
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
# inspect
plot(sequ, hm_many, type = "l") # Higher --> Better...
# compute optimum number of topics
sequ[which.max(hm_many)]
# Refit the model from the top according to the result:
burnin <- 200
iter <- 400
thin <- 200
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
k <- 26
ldaOut <-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))
ldaOut.terms <- as.matrix(terms(ldaOut,10))
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicsToTerms.csv"))
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
The second script, for the correspondence analysis, has the following structure:
1. Simple CA: simple and straightforward correspondence analysis.
2. Elaborate diagnostics and visualizations: on the same data, add diagnostic tools and somewhat more sophisticated visualizations.
3. Add supplementary data: add a third, supplementary category to the analysis.
The data used for this part of the lab is accumulated from a topic model fitted on the whole corpus of 53,709 dissertation abstracts. The topic loadings of the documents are accumulated to our two variables: labels and topics. Inspect the file LAB1_TM_RL_CT.txt (once you've loaded it, see below) to get a sense of the data structure.
We need these packages:
“ca” to perform correspondence analysis
“FactoMineR” to perform correspondence analysis (yes, we need them both)
“vcd” to handle categorical data
“gplots” for visualization
“plotly” for even more visualization
“stats” for cluster analysis
“factoextra” to diagnose and visualize the results
“CAinterprTools” to diagnose and visualize the results
“CAinterprTools” and “factoextra” are available through GitHub, so we will need “devtools” to get them. Like so:
library("devtools")
install_github("gianmarcoalberti/CAinterprTools")
install_github("kassambara/factoextra")
# Clear the workspace:
rm(list=ls())
# Load packages:
library("ca")
library("FactoMineR")
library("vcd")
library("gplots")
library("plotly")
library("stats")
library("factoextra")
library("CAinterprTools")
# FIRST ROUND
# Set the working directory using e.g. setwd() and then load the data "LAB1_TM_RL_CT.txt":
mydata <- read.table("LAB1_TM_RL_CT.txt", header=TRUE)
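A quick look at the labels-by-topics table, as suggested above (the column selection assumes the table has at least five topic columns):
# rows are labels, columns are topics, cells hold accumulated loadings
str(mydata)
head(mydata[, 1:5])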
# First simplified round:
res.ca <- CA(mydata, graph = FALSE)
print(res.ca)
# Have a look at the output in R:
# nb.dec: number of decimals printed
# nbelements: number of row/column elements to be written. To print all elements, use nbelements = Inf.
# ncp: number of dimensions to be printed
summary(res.ca, nb.dec = 2, nbelements = 5, ncp = 3)
# let's plot the result:
fviz_ca_biplot(res.ca)
# and make it look better..:
fviz_ca_biplot(res.ca, col.row="orange", col.col ="steelblue") +
theme_minimal()
# Same, but labels and topics separately..:
fviz_ca_row(res.ca, col.row="orange", shape.row = 1) +
theme_minimal()
fviz_ca_col(res.ca, col.col ="steelblue", shape.col = 1) +
theme_minimal()
# Export the results by adding the "file" argument (best in txt format) and include all elements (nbelements = Inf):
summary(res.ca, nb.dec = 2, nbelements = Inf, ncp = 3, file = "result_round1.txt")
# Done!
# SECOND ROUND
# Second, slightly more complicated round:
# Let's do it properly, still same data
# Start with a closer look at the chisq statistics and the eigenvalues:
# Let's look at the so-called trace (the sum of the eigenvalues),
# which shows how much of the variance (inertia) is captured in the original two-way table:
eig <- get_eigenvalue(res.ca)
trace <- sum(eig$eigenvalue)
trace
# excellent...
# Chi-square statistics
chi2 <- trace*sum(as.matrix(mydata))
chi2
# Is it significant (i.e. are rows and columns interrelated?)
# Degrees of freedom:
df <- (nrow(mydata) - 1) * (ncol(mydata) - 1)
df
# P-value:
pval <- pchisq(chi2, df = df, lower.tail = FALSE)
pval
# The p-value is basically zero, so we are OK in the case of our data. But we want to know if the
# interrelatedness is also large enough to care about.
# A commonly accepted measure of the "size" of the interrelatedness is the square root of the total trace.
# If it is below 0.2, we can't really say anything about our data and should abort or restructure our study.
cor.coef <- sqrt(trace)
cor.coef
# Ok, so we're good. Given the fact that the two variables are related, how many dimensions do
# we need to capture and display that interrelatedness properly? Let's start by looking at eigenvalues per dimension:
eigenvalues <- get_eigenvalue(res.ca)
head(round(eigenvalues, 2))
# On first inspection, the model does not perform all that well: we need three dimensions to
# explain more than 50% of the trace.
# However, given the high value of the trace, we're still OK even with a 2-dimensional solution.
# Can we be more rigorous about this?
# Let's display the eigenvalues, scree plot style:
fviz_screeplot(res.ca)
# Let's apply the average rule, i.e. only include dimensions that explain more inertia than the average dimension [i.e.: 1/(34-1)].
aver.rule(mydata)
# The average rule thus overestimates the number of viable dimensions...
# How about Malinvaud's test? This sequential test checks the significance
# of the remaining dimensions once the first k have been selected.
malinvaud(mydata)
# Unfortunately, empirical tests have shown that Malinvaud's test, too, overestimates the number
# of dimensions as the table's grand total increases.
# This is certainly true in our case.
# We can also look into the significance of the dimensions using permutation tests.
# This can be done for any pair of dimensions, typically starting with 1,2,
# then 3,4, et cetera. Our dimensions are significant up until (but not including) the 24th dimension,
# so this doesn't give us much guidance:
sig.dim.perm(mydata,23,24)
# CONCLUSION: Go back to the scree plot and settle for the 2- or 3-dimensional solution.
# So far we have been looking into the quality of the whole model.
# We also need to assess every object (in our case, every topic and every label)
# We will look at the COSINE and the CONTRIBUTION.
# The total CONTRIBUTION (ctr) of a row to explaining the variation retained by Dim.1 and Dim.2
# is calculated as follows: (C1 * Eig1) + (C2 * Eig2),
# where C1 and C2 are the contributions of the row to dimensions 1 and 2, respectively,
# and Eig1 and Eig2 are the eigenvalues of dimensions 1 and 2 (a by-hand sketch of this follows after the contribution plots below).
# The quality of representation of the rows on the factor map is called the
# SQUARED COSINE (cos2) or the SQUARED CORRELATIONS.
# The cos2 measures the degree of association between rows/columns and a particular axis.
# ctr and cos2 values are included in the output (see above),
# but we can also visualize them.
# First, let's look at the contribution to the dimensions from the rows and columns.
# We can compare each singular contribution with the average contribution
# to get a grip on which topics/labels matter and in which dimension.
# ctr for the rows (labels)..:
fviz_contrib(res.ca, choice = "row", axes = 1)
fviz_contrib(res.ca, choice = "row", axes = 2)
fviz_contrib(res.ca, choice = "row", axes = 3)
fviz_contrib(res.ca, choice = "row", axes = 1:3)
# and for the columns (topics)..:
fviz_contrib(res.ca, choice = "col", axes = 1)
fviz_contrib(res.ca, choice = "col", axes = 2)
fviz_contrib(res.ca, choice = "col", axes = 3)
fviz_contrib(res.ca, choice = "col", axes = 1:3)
# cos2 for the rows (labels)..:
fviz_cos2(res.ca, choice = "row", axes = 1:2)
fviz_cos2(res.ca, choice = "row", axes = 1:3)
# looks like this
fviz_ca_row(res.ca, col.row ="cos2")+
scale_color_gradient2(low="white", mid="orange",
high="darkred", midpoint=0.5)+theme_minimal()
# and for the columns (topics)..:
fviz_cos2(res.ca, choice = "col", axes = 1:2)
fviz_cos2(res.ca, choice = "col", axes = 1:3)
# looks like this
fviz_ca_col(res.ca, col.col="cos2")+
scale_color_gradient2(low="white", mid="steelblue",
high="darkblue", midpoint=0.5)+theme_minimal()
# Now together:
fviz_ca_biplot(res.ca, col.row="cos2", col.col ="cos2") +
scale_color_gradient2(low="white", mid="steelblue", high="darkblue", midpoint=0.5) +
theme_minimal()
# Looks ok, albeit not perfect! Then again, this is real data.
# Since possibly 3 dimensions are in play here (judging by the scree plot), let's visualize them together.
# The file "LAB2_CAResult.txt" is basically just the file we exported above with marginal values
# ("mv", i.e. row and column totals) added and the cos2 sum for d1-3.
ca_res <- read.table("LAB2_CAResult.txt", header=TRUE)
summary(ca_res)
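# Note: newer versions of plotly require formula notation for mapping data columns
# (e.g. x = ~d1, color = ~cos2); adjust the plot_ly() calls below if your version complains.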
plot_ly(ca_res, x = d1, y = d2, z=d3, text = paste("Object: ", object), type="scatter3d", mode = "markers", color = cos2, colors = "RdPu", size = mv, opacity = cos2)
# Size is determined by mv, thus representing each topic's/label's accumulated loading
# (i.e. its importance in the topic model).
# Color represents cos2 (quality of fit): light = poor, dark = good.
Let’s stop here for a moment and consider the CA as an analytical tool. Notice that many of the labels and topics that are prominent in the topic model have low cos2 values.
# Best to export as a webpage.
# Back to a 2-dimensional representation, using the cos2 for the first two dimensions only:
plot_ly(ca_res, x = d1, y = d2, text = paste("Object: ", object),
mode = "markers", color = twodcos2, colors = "RdPu", size = mv, opacity = twodcos2)
# THIRD ROUND
# Can we introduce an additional variable? Yes, in CA this can be done using SUPPLEMENTARY CATEGORIES (sc).
# sc are included in our solution without affecting the original space or the positions of topics and labels.
# We are going to use the author's university affiliation (institution) as a supplementary category.
# The accumulated loadings per institution are added as extra rows to our dataset in the file "LAB3_TMSupLarge.txt".
mydataSUP <- read.table("LAB3_TMSupLarge.txt", header=TRUE)
# run the CA as before, but specify the supplementary rows:
ressup.ca <- CA (mydataSUP, row.sup = 26:152, col.sup = NULL,
graph = FALSE)
print(ressup.ca)
summary(ressup.ca, nb.dec = 2, nbelements = 5, ncp = 3)
# Note that our original solution is intact; the supplementary categories (i.e. institutions)
# only add interpretative information.
# Export the results:
summary(ressup.ca, nb.dec = 2, nbelements = Inf, ncp = 3, file = "result_round3.txt")
# and plot them:
fviz_ca_biplot(ressup.ca, col.row="orange", col.col ="steelblue", col.row.sup ="lightgrey") +
theme_minimal()
# Not much fun to look at... Can we improve the representation?
# Let's start by clustering the original objects (topics/labels) in the map based on their scores in dimensions 1-3.
# We need the score data arranged in one column per dimension, in both a 3- and a 2-dimensional version (a sketch of how such tables could be built directly from the CA object follows below).
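As a hedged sketch (using the factoextra accessors; the lab files used next were prepared separately):
row_coord <- get_ca_row(res.ca)$coord   # coordinates of the rows (labels)
col_coord <- get_ca_col(res.ca)$coord   # coordinates of the columns (topics)
scores3_diy <- rbind(row_coord[, 1:3], col_coord[, 1:3])   # joint 3-dimensional score table
scores2_diy <- rbind(row_coord[, 1:2], col_coord[, 1:2])   # joint 2-dimensional score table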
# Load "LAB4_Clus3D.txt":
scores3 <- read.table("LAB4_Clus3D.txt", header=TRUE)
# and "LAB5_Clus2D.txt":
scores2 <- read.table("LAB5_Clus2D.txt", header=TRUE)
# Hierarchical clustering based on Euclidean distances,
# minimizing the total within-cluster variance (Ward's method).
# First for the three dimensional solution
d <- dist(scores3, method = "euclidean") # create the distance matrix
fit <- hclust(d, method="ward.D") # fit the distances
plot(fit) # display dendrogram. To me, it looks like a 5 or 6 cluster solution.
groups <- cutree(fit, k=6) # cut tree into 6 clusters
# draw dendrogram with red borders around the 6 clusters
rect.hclust(fit, k=6, border="red")
print(groups)
# We will want to have this on a csv file.
write.csv(groups, "cl_result3.csv")
# Then for the two dimensional solution
d <- dist(scores2, method = "euclidean") # create the distance matrix
fit <- hclust(d, method="ward.D") # fit the distances
plot(fit) # display dendrogram. To me, it looks like a 5 or 6 cluster solution.
groups <- cutree(fit, k=6) # cut tree into 6 clusters
# draw dendrogram with red borders around the 6 clusters
rect.hclust(fit, k=6, border="red")
print(groups)
# We will want to have this on a csv file.
write.csv(groups, "cl_result2.csv")
# Some rearrangement is necessary in Excel, but nothing strange:
# just add the group number per label and topic from our cluster analysis.
# This has been done in the file "LAB6_CAresultSuplClus.txt".
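If you would rather do that merge in R than in Excel, a minimal sketch (it assumes the row names of the cluster output match the "object" column of the exported CA results; if they do not, the lookup yields NAs):
groups3 <- read.csv("cl_result3.csv", row.names = 1)         # the 3-D cluster result written above
# attach cluster membership to the exported CA results by object name (column name is an assumption)
ca_res$cl3D_diy <- groups3[as.character(ca_res$object), 1]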
Res_All <- read.table("LAB6_CAresultSuplClus.txt", header=TRUE)
# Plot the three dimensional solution:
plot_ly(Res_All, x = d1, y = d2, z=d3, text = paste("Object: ", object), type="scatter3d",
mode = "markers", color = cl3D, colors = "Accent", size = mv, opacity = cos23D)
# Save as webpage.
# Inspect "Res_All": the column "cldiff" shows the difference between the two solutions.
# Only two objects switches cluster between the two solutions.
# --> Let's go for the simpler solution and plot it:
plot_ly(Res_All, x = d1, y = d2, text = paste("Object: ", object),
mode = "markers", color = cl2D, colors = "Dark2", size = mv, opacity = cos22D)
# Save as webpage.
EXTRA CODE SNIPPETS FOR THE OVERACHIEVER.
Clustering objects in a joint space is disputed, since the distance between row and column objects does not have a clear statistical interpretation (i.e., it might not be Euclidean). Empirical tests support joint clustering, but for the purist, rows and columns are better clustered separately using the FactoMineR package.
ca.factom <- CA(mydata, ncp=2, graph= FALSE)
resclust.rows<-HCPC(ca.factom, nb.clust=-1, metric="euclidean", method="ward", order=TRUE, graph.scale="inertia", graph=FALSE, cluster.CA="rows")
resclust.cols<-HCPC(ca.factom, nb.clust=-1, metric="euclidean", method="ward", order=TRUE, graph.scale="inertia", graph=FALSE, cluster.CA="columns")
# plots for row clusters
#2D
for(i in 2:2){
plot(resclust.rows, axes=c(1,i), choice="map", draw.tree=FALSE, ind.names=TRUE, new.plot=TRUE)
}
#3D
for(i in 2:2){
plot(resclust.rows, axes=c(1,i), choice="3D.map", draw.tree=TRUE, ind.names=TRUE, new.plot=TRUE)
}
# dendrogram
plot(resclust.rows, choice="tree", rect=TRUE, new.plot=TRUE)
# plots for col clusters
#2D
for(i in 2:2){
plot(resclust.cols, axes=c(1,i), choice="map", draw.tree=FALSE, ind.names=TRUE, new.plot=TRUE)
}
#3D
for(i in 2:2){
plot(resclust.cols, axes=c(1,i), choice="3D.map", draw.tree=TRUE, ind.names=TRUE, new.plot=TRUE)
}
# dendrogram
plot(resclust.cols, choice="tree", rect=TRUE, new.plot=TRUE)
Extra credit :-)