This lab explores latent categorization of texts in two steps. First, we compile abstracts from 53,709 American sociology dissertations (yrs 1990-2013) into a corpus. Using topic modeling, a natural language processing technique, we then try to detect latent topics based on the manifest abstracts. The results are communicated via so called loadings that shows the prevalence of various topics in each abstract. Second, we will try to explore the latent topical structure in an x-dimensional space, using correspondence analysis. We do this by accumulating the topical loadings to topics and to the pre-given labels of the abstracts. The accumulated loadings are aggregated in a two-by-two table, where the loadings in each cell can be thought of as (probable) frequencies of the intersection between a topic and a label.
By the end of the lab, you should (among other things) get one of these:
Before we do all this, let us briefly consider the lab in an intellectual-historic context.
The two main techniques here, topic modeling and correspondence analysis, are more closely related than one might expect. They both fall back on a long-standing European tradition of latent classifications, stretching back to Linné, via Benzécri and Dirichlet. Topic modeling is, if not a sibling, so at least a first cousin to correspondence analysis: manifest flowers have latent classes; sets of observations are explained by unobserved groups; expressed cultural preferences are organized along latent dimensions of socio-cultural-economic capital.
Theoretically, we’ll brace against Bourdieu’s sociological theory. The lab is geared towards the term Distinction; how is something (e.g. a topic) different, i.e. distinct, from something else?
A question could be: what is the exchange rate between the academic capital associated with a certain topical position in an academic field and other forms of capital?
Or: should we, in terms of our personal strategy as academics, be alike or distinct?
In other words:
i.e., how can I climb the academic ivory tower?
i.e., how can I wedge something new into the circle of academic hegemony?
Befor you start, you’re going to need soma data:
https://mega.nz/#!P4sgTa5L!AWStLHQkD9RWESnE4gKRU_770GNRB-vPjtAqcPJCJLM Better unzip and save to your working directory.
The below script is a very basic introduction to topic modeling, a natural language processing technique to determine latent topics that guides the words used in a set of documents. “Poetics” (41 [2013]) has a special issues on topic modeling that should get you started, at least conceptually.
The script has the following structure:
1. Preparations: load packages and get the data and create a corpus
2. Clean up: clean up the corpus and export to csv
3. First training: fit the model and export the results
4. Model evaluation and re-training: evaluate appropriate no of topics using harmonic means
# Clear the workspace
rm(list=ls())
# Install necessary packages...
# "NLP"
# "tm"
# "SnowballC"
# "topicmodels"
# "Rmpfr"
#load necessary packages
# basic nlp:
library(NLP)
# text mining library:
library(tm)
# snowball C for stemmig
library(SnowballC)
# for modeling the topics:
library(topicmodels)
# Rmpfr needed for evaluating optimal no of topics
library(Rmpfr)
# Set the working directory using setwd.
# Load the data, treating our text strings as strings
# rather than as categories, then load the sample data:
soctest <- read.csv ("sociologysample.csv", stringsAsFactors=FALSE)
#Check it out
str(soctest)
## 'data.frame': 199 obs. of 2 variables:
## $ obs : int 39628 39745 49780 18794 17548 6848 34226 12683 1170 34437 ...
## $ text: chr "Ontogenic variables, such as the experience of the parent's abuse as a child and current_depression or substance abuse, were ex"| __truncated__ "This study_analyzes attitudes of area residents toward mineral_extraction and processing in the Northern Great_Plains_region of"| __truncated__ "This dissertation is an experiment in thinking with the story, not about the story in order to erase_the_boundaries between ana"| __truncated__ "The concepts of parent_empowerment and participation, currently popular in the disability field, that emerge from the American_"| __truncated__ ...
# the data we are interested in is found in the column "text".
Topic modeling is computationally heavy and takes long time. Typically, models are fitted using servers (rather than personal laptops). The data set provided for this lab is only a very small (random) sample of the original corpus. Use it to test the scripts.
# Now we need to do three things:
# 1. create a text source from the text coulmn,
soctest_text <- paste(soctest$text)
# 2. transform it to a vector,
soctest_source <- VectorSource(soctest_text)
# 3. create a corpus of the rows in the vector
corp <- Corpus(soctest_source)
# Time to clean up the corpus we just created:
# Depending on your version of R, you may have to use the
# gear "lazy=T" within the brackets below, as so: ## corp <- tm_map(corp, removePunctuation, lazy=T) ##
# change everything to lower case
corp <- tm_map(corp, content_transformer(tolower))
# remove punctuation
corp <- tm_map(corp, removePunctuation)
# strip from the corpus numbers
corp <- tm_map(corp, removeNumbers)
# remove stopwords (in this case "english")
corp <- tm_map(corp, removeWords, stopwords("english"))
# Here, add stop words that "blur" your results.
myStopwords <- c("here", "add", "your", "own", "stop", "words")
corp <- tm_map(corp, removeWords, myStopwords)
# Remove redundant whitespace
corp <- tm_map(corp, stripWhitespace)
#Stem document
corp <- tm_map(corp,stemDocument)
Let’s consider the cleaning procedure above. With the possible exception of stripping the corpus from excess whitespace and punctuation, cleaning the corpus is by no mean a neutral, objective process. Stemming , for example, is a rather coarse, heuristic process to decrease the number of unique tokens in the corpus. E.g., “fishing”, “fished”, and “fisher” are reduced to the stem (and root word), “fish”. Does that really make sense, semantically? The stoplist is an equally critical issue. We want it to be conservative (i.e. short) but effective (i.e. help increasing the distinctivness of the topics).
# Write the corpus (corp) to a cvs-file. Why? To inspect and play around with in other tm applications like Mallet or the S. toolbox.
corp_df <- data.frame(Text = do.call(rbind,lapply(corp, function(x) x$"content")),stringsAsFactors=F)
write.table(corp_df, "Corp_Dataframe.csv",row.names=F,col.names=F)
# Now write the corpus to a document-term matrix:
dtm <- DocumentTermMatrix(corp)
# Set parameters for Gibbs sampling, a randomized algorithm for obtaining a sequence
# of observations from a multivariate probability distribution.
# This sequence will be used to approximate the joint distribution of topics and words.
burnin <- 200 # number of omitted Gibbs iterations at beginning
iter <- 400 # number of iterations
thin <- 200 # number of omitted in-between Gibbs iterations
seed <-list(2003,5,63,100001,765) #seeds can be set to enable reproducibility
nstart <- 5 # number of repeated random starts
best <- TRUE # only continue model on the best model
k <- 5 # Set the number of topics (more on the number of topics below)
# Run LDA using Gibbs sampling. It's going to take a while.
ldaOut <-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
# Now write out the results:
# docs to topics (top topic per doc)
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))
# top 10 terms in each topic, our topic summary
ldaOut.terms <- as.matrix(terms(ldaOut,10))
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicsToTerms.csv"))
# probabilities associated with each topic assignment,
# i.e. topic loadings for the documents. This document can be glued with metadata from the original dataset
# to see how topic loadings accumulates to e.g. year, institution or labels.
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
The single most important parameter of the Gibbs sampler is K, the number of topics. Above, we used K=5. But why not some other number? The issue of how to determine the appropriate numbers of topics is a bit of an unfinished business of topic modeling practices. One straight forward approach is to rely on our knowledge interest: A quick overview of a corpus requires perhaps 10 topics, something more refined 25, and a really detailed topical inspection perhaps 100 topics. Though robust, this approach is still somewhat unsatisfying. Ponweiser, M. (2012) chap: 3.7 lists a number of more structured approaches (http://epub.wu.ac.at/3558/1/main.pdf). Here we will just briefly look into harmonic means, which is a way to estimate the Goodness-of-fit. Goodness-of-fit means how well the model fits the training data. From Su & Liao (n.d.), p.2 (aviable here: http://people.cs.umass.edu/~jcsu/reports/Project3.pdf) we have that:
“Since we want to discover distinct topics, the topic distribution of a document should have higher probability only on a small number of topics. In other words, if emk [the probability of topic k in document m] has higher probability only for some k, overall, the harmonic mean is small.We consider the model to have better goodness-of-fit on the training data by checking whether the harmonic mean becomes lesser with each epoch of Gibbs sampling.”
# Fist create the harmonic mean function:
harmonicMean <- function(logLikelihoods, precision=2000L) {
library("Rmpfr")
llMed <- median(logLikelihoods)
as.double(llMed - log(mean(exp(-mpfr(logLikelihoods,
prec = precision) + llMed))))
}
# Update the parameters:
k = 5
burnin = 200
iter = 400
keep = 50
# Fit the model again
fitted <- LDA(dtm, k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) )
# where "keep" indicates that the log-likelihood for every
# iteration is evaluated and stored. This returns all log-likelihood
# values including burnin (that we do not want), therefore these need to be omitted before
# calculating the harmonic mean:
logLiks <- fitted@logLiks[-c(1:(burnin/keep))]
# then calculate using our function created above:
harmonicMean(logLiks)
## [1] -208499.4
# generate numerous topic models with different numbers of topics
# (this takes a couple of minutes to run)
sequ <- seq(2, 30, 3) # in this case a sequence of numbers from 1 to 30, by threes
fitted_many <- lapply(sequ, function(k) LDA(dtm, k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))
# extract logliks from each topic
logLiks_many <- lapply(fitted_many, function(L) L@logLiks[-c(1:(burnin/keep))])
# compute harmonic means again
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
# inspect
plot(sequ, hm_many, type = "l") # Higher --> Better...
# compute optimum number of topics
sequ[which.max(hm_many)]
## [1] 26
# Remodel from top according to the result:
burnin <- 200
iter <- 400
thin <- 200
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
k <- 26
ldaOut <-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))
ldaOut.terms <- as.matrix(terms(ldaOut,10))
write.csv(ldaOut.terms,file=paste("LDAGibbs",k,"TopicsToTerms.csv"))
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities,file=paste("LDAGibbs",k,"TopicProbabilities.csv"))
The structure of the script is the following:
1. Simple CA: Simple and straight forward correspondence analysis.
2. Elaborate diagnostics and visualizations: On the same data, add diagnostic tools and a little more sophisticated visualizations.
3. Add supplementary data: Add a third supplementary category to the analysis.
The data used for this part of the lab is acumulated from a topic model on the whole sample of 53,709 dissertation abstracts. The topic loadings of the documents are accumulated to our two variables: labels and topics. Inspect the file LAB1_TM_RL_CT.txt (once you’ve loaded it, see below) to get a sense of the data structure.
We need these packages:
“ca” to perform correspondence analysis
“FactoMineR” to perform correspondence analysis (yes, we need them both)
“vcd” to handle categorical data
“gplots” for visualization
“plotly” for even more visualization
“stats” for cluster analysis
“factoextra” to diagnose and visualize the results
“CAinterprTools” to diagnose and visualize the results
“CAinterprTools” and “factoextra” are availiable through Github, so we will need “devtools” to get them. Like so:
library("devtools")
install_github("gianmarcoalberti/CAinterprTools")
install_github("kassambara/factoextra")
# Clear the workspace:
rm(list=ls())
# Load packages:
library("ca")
library("FactoMineR")
library("vcd")
library("gplots")
library("plotly")
library("stats")
library("factoextra")
library("CAinterprTools")
# FIRST ROUND
# Set working directory using e.g. the "setwd" and then get the data "LAB1_TM_RL_CT.txt":
mydata <- read.table("LAB1_TM_RL_CT.txt", header=TRUE)
# First simplified round:
res.ca <- CA(mydata, graph = FALSE)
print(res.ca)
## **Results of the Correspondence Analysis (CA)**
## The row variable has 25 categories; the column variable has 34 categories
## The chi square of independence between the two variables is equal to 21771.99 (p-value = 0 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$call" "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
# Have a look in R,
# nb.dec: number of decimal printed
# nbelements: number of row/column variables to be written. To have all the elements, use nbelements = Inf.
# ncp: Number of dimensions to be printed
summary(res.ca, nb.dec = 2, nbelements = 5, ncp = 3)
##
## Call:
## CA(X = mydata, graph = FALSE)
##
## The chi square of independence between the two variables is equal to 21771.99 (p-value = 0 ).
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.14 0.07 0.06 0.04 0.03 0.02 0.01
## % of var. 32.59 16.50 13.96 8.36 5.99 4.59 3.24
## Cumulative % of var. 32.59 49.08 63.04 71.40 77.39 81.98 85.22
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.01 0.01 0.01 0.01 0.01 0.01 0.00
## % of var. 2.79 2.59 1.84 1.75 1.51 1.20 0.80
## Cumulative % of var. 88.02 90.61 92.45 94.19 95.70 96.90 97.70
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## % of var. 0.75 0.48 0.36 0.26 0.19 0.15 0.06
## Cumulative % of var. 98.45 98.93 99.29 99.55 99.73 99.88 99.94
## Dim.22 Dim.23 Dim.24
## Variance 0.00 0.00 0.00
## % of var. 0.04 0.02 0.00
## Cumulative % of var. 99.97 100.00 100.00
##
## Rows (the 5 first)
## Iner*1000 Dim.1 ctr
## SBS__Sociology | 3.44 | -0.01 0.06
## H__Languages__Societies_and_Cultures | 13.63 | 0.13 0.76
## SBS__Social_and_Developmental_Psychology | 27.43 | -0.54 12.18
## H__History | 46.74 | 0.99 24.71
## SBS__Clinical_Psychology | 27.97 | -0.72 11.51
## cos2 Dim.2 ctr cos2
## SBS__Sociology 0.02 | -0.05 2.12 0.43
## H__Languages__Societies_and_Cultures 0.08 | 0.23 4.66 0.24
## SBS__Social_and_Developmental_Psychology 0.61 | 0.11 1.04 0.03
## H__History 0.73 | 0.04 0.06 0.00
## SBS__Clinical_Psychology 0.57 | 0.09 0.33 0.01
## Dim.3 ctr cos2
## SBS__Sociology | 0.01 0.13 0.02 |
## H__Languages__Societies_and_Cultures | 0.09 0.97 0.04 |
## SBS__Social_and_Developmental_Psychology | 0.22 4.74 0.10 |
## H__History | 0.04 0.08 0.00 |
## SBS__Clinical_Psychology | 0.41 8.88 0.19 |
##
## Columns (the 5 first)
## Iner*1000 Dim.1 ctr
## LIFE_COURSE | 0.55 | 0.02 0.01
## SOCIOLOGY_OF_DEVIANCE | 18.55 | -0.55 6.11
## SOCIOLOGY_OF_RACE | 2.94 | 0.07 0.07
## MEDICAL_SOCIOLOGY | 20.91 | -0.32 1.49
## CHILD_DEVELOPMENT | 15.16 | -0.51 6.61
## cos2 Dim.2 ctr cos2
## LIFE_COURSE 0.02 | -0.11 0.46 0.58
## SOCIOLOGY_OF_DEVIANCE 0.45 | 0.04 0.07 0.00
## SOCIOLOGY_OF_RACE 0.03 | 0.22 1.34 0.32
## MEDICAL_SOCIOLOGY 0.10 | -0.34 3.32 0.11
## CHILD_DEVELOPMENT 0.60 | 0.15 1.20 0.06
## Dim.3 ctr cos2
## LIFE_COURSE | 0.01 0.00 0.00 |
## SOCIOLOGY_OF_DEVIANCE | 0.42 8.49 0.27 |
## SOCIOLOGY_OF_RACE | -0.04 0.05 0.01 |
## MEDICAL_SOCIOLOGY | 0.35 4.07 0.12 |
## CHILD_DEVELOPMENT | 0.05 0.16 0.01 |
# lets plot the result:
fviz_ca_biplot(res.ca)
# and make it look better..:
fviz_ca_biplot(res.ca, col.row="orange", col.col ="steelblue") +
theme_minimal()
# Same, but labels and topics separately..:
fviz_ca_row(res.ca, col.row="orange", shape.row = 1) +
theme_minimal()
fviz_ca_col(res.ca, col.col ="steelblue", shape.row = 1) +
theme_minimal()
# Export the results, by adding "file" (best in txt-format) and include all elements (change to "Inf"):
summary(res.ca, nb.dec = 2, nbelements = Inf, ncp = 3, file = "result_round1.txt")
# Done!
# SECOND ROUND
# Second, slightly more complicated round:
# Let's do it properly, still same data
# Start with a closer look at the chisq statistics and the eigenvalues:
# Let's look at the so called trace (the sum of eigenvalues)
# which shows you how much of the variance is captured in the original two-by-two table:
eig <- get_eigenvalue(res.ca)
trace <- sum(eig$eigenvalue)
trace
## [1] 0.4236834
# excellent...
# Chi-square statistics
chi2 <- trace*sum(as.matrix(mydata))
chi2
## [1] 21771.99
# Is it significant (i.e. are rows and columns interrelated?)
# Degrees of freedom:
df <- (nrow(mydata) - 1) * (ncol(mydata) - 1)
df
## [1] 792
# P-value:
pval <- pchisq(chi2, df = df, lower.tail = FALSE)
pval
## [1] 0
# p-value basically zero, so ok in the case of our data. But we want to know if the
# interrelatedness is also large enough to care about.
# A commonly accepted measure of the "size" of the interrelatedness is the sqrt of the total trace
# If below 0.2, we can't really say anything about our data and we should abort or restructure our study.
cor.coef <- sqrt(trace)
cor.coef
## [1] 0.6509097
# Ok, so we're good. Given the fact that the two variables are related, how many dimensions do
# we need to capture and display that interrelatedness properly? Let's start by looking at eigenvalues per dimension:
eigenvalues <- get_eigenvalue(res.ca)
head(round(eigenvalues, 2))
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 0.14 32.59 32.59
## Dim.2 0.07 16.50 49.08
## Dim.3 0.06 13.96 63.04
## Dim.4 0.04 8.36 71.40
## Dim.5 0.03 5.99 77.39
## Dim.6 0.02 4.59 81.98
# By first inspection, the model does not perform all that well, we need three dimensions to
# explain more than 50% of the trace.
# However, given the high value of trace, we're still ok even with a 2-dimensional solution.
# Can we be more rigorous about this?
# Let's display the eigenvalues, scree plot style:
fviz_screeplot(res.ca)
# Lets apply the average rule, i.e. only include dimensions that explain more data than the average dimension, [i.e.: 1/(34-1)].
aver.rule(mydata)
# The average rule thus overestimates the number of viable dimensions...
# How about the Malinvaud's test? This sequential test checks the significance
# of the remaining dimensions once the first k ones have been selected.
malinvaud(mydata)
## K Dimension Eigen value Chi-square df p value
## [1,] 0 1 1.380719e-01 21771.990230 792 0.000000
## [2,] 1 2 6.988721e-02 14676.833023 736 0.000000
## [3,] 2 3 5.912926e-02 11085.510650 682 0.000000
## [4,] 3 4 3.540051e-02 8047.011411 630 0.000000
## [5,] 4 5 2.539022e-02 6227.870937 580 0.000000
## [6,] 5 6 1.944857e-02 4923.133153 532 0.000000
## [7,] 6 7 1.373788e-02 3923.721548 486 0.000000
## [8,] 7 8 1.183939e-02 3217.767588 442 0.000000
## [9,] 8 9 1.098634e-02 2609.371870 400 0.000000
## [10,] 9 10 7.798590e-03 2044.812296 360 0.000000
## [11,] 10 11 7.393798e-03 1644.062964 322 0.000000
## [12,] 11 12 6.388752e-03 1264.114877 286 0.000000
## [13,] 12 13 5.086631e-03 935.813483 252 0.000000
## [14,] 13 14 3.382218e-03 674.424706 220 0.000000
## [15,] 14 15 3.171794e-03 500.621301 190 0.000000
## [16,] 15 16 2.019720e-03 337.631027 162 0.000000
## [17,] 16 17 1.533044e-03 233.842832 136 0.000000
## [18,] 17 18 1.093068e-03 155.063690 112 0.004430
## [19,] 18 19 7.878105e-04 98.893740 90 0.244606
## [20,] 19 20 6.225466e-04 58.410203 70 0.837020
## [21,] 20 21 2.519994e-04 26.419146 52 0.998796
## [22,] 21 22 1.548279e-04 13.469551 36 0.999769
## [23,] 22 23 8.651542e-05 5.513348 22 0.999857
## [24,] 23 24 2.077445e-05 1.067545 10 0.999768
# Unfortunately, empirical tests has showed that the number of dimensions
# are overestimated as the table's grand total increases with Malinvaud's test as well.
# This is certainly true in our case.
# We can also look into the significance of the dimensions using permutation tests.
# This can be done for any pair of possible dimensions typically starting with 1,2,
# then 3,4, etcetera. Our dimensions are significant up until (but not including) the 24th dimension,
# so this doesn't give us any guidance:
sig.dim.perm(mydata,23,24)
## [1] "It is estimated that your iterations will take 6.49 minutes."
## [1] "R is not in interactive() mode. Resample-based tests will be conducted. Please take note of the progress bar."
## ===========================================================================
## [1] "p value of dim 23 = 0.002"
## [1] "p value of dim 24 = 0.559"
# CONCLUSION: Back to the scree plot and go for the 2- or 3-dimension solution.
# So far we have been looking into the quality of the whole model.
# We also need to assess every object (in our case, every topic and every label)
# We will look at the COSINE and the CONTRIBUTION.
# The total CONTRIBUTION (ctr) of a row, on explaining the variations retained by Dim.1 and Dim.2,
# is calculated as follow : (C1 * Eig1) + (C2 * Eig2),
# where C1 and C2 are the contributions of the row to dimensions 1 and 2,
# respectively. Eig1 and Eig2 are the eigenvalues of dimensions 1 and 2, respectively.
# The quality of representation of the rows on the factor map is called the
# SQUARED COSINE (cos2) or the SQUARED CORRELATIONS.
# The cos2 measures the degree of association between rows/columns and a particular axis.
# ctr and cos2 values are included in the output (see above),
# but we can also visualize them.
# First, lets look at the contribution to the dimensions from rows and coumns.
# We can compare each singular contribution with the average contribution
# to get a grip on which topics/labels matter and in which dimension.
# ctr for the rows (labels)..:
fviz_contrib(res.ca, choice = "row", axes = 1)
fviz_contrib(res.ca, choice = "row", axes = 2)
fviz_contrib(res.ca, choice = "row", axes = 3)
fviz_contrib(res.ca, choice = "row", axes = 1:3)
# and for the columns (topics)..:
fviz_contrib(res.ca, choice = "col", axes = 1)
fviz_contrib(res.ca, choice = "col", axes = 2)
fviz_contrib(res.ca, choice = "col", axes = 3)
fviz_contrib(res.ca, choice = "col", axes = 1:3)
# cos2 for the rows (labels)..:
fviz_cos2(res.ca, choice = "row", axes = 1:2)
fviz_cos2(res.ca, choice = "row", axes = 1:3)
# looks like this
fviz_ca_row(res.ca, col.row ="cos2")+
scale_color_gradient2(low="white", mid="orange",
high="darkred", midpoint=0.5)+theme_minimal()
# and for the columns (topics)..:
fviz_cos2(res.ca, choice = "col", axes = 1:2)
fviz_cos2(res.ca, choice = "col", axes = 1:3)
# looks like this
fviz_ca_col(res.ca, col.col="cos2")+
scale_color_gradient2(low="white", mid="steelblue",
high="darkblue", midpoint=0.5)+theme_minimal()
# Now together:
fviz_ca_biplot(res.ca, col.row="cos2", col.col ="cos2") +
scale_color_gradient2(low="white", mid="steelblue", high="darkblue", midpoint=0.5) +
theme_minimal()
# Looks ok, albeit not perfect! Then again, this is real data.
# Since possibly 3 dimensions are in play here (judging by the scree plot), lets visualize them together:
# The file "LAB2_CAResult.txt" is basically just the file we exported above with marginal values
# ("mv", i.e. row and column totals) added and the cos2 sum for d1-3.
ca_res <- read.table("LAB2_CAResult.txt", header=TRUE)
summary(ca_res)
## object d1 d2
## B__Business : 1 Min. :-0.71679 Min. :-0.76267
## c01LifeCourse : 1 1st Qu.:-0.33100 1st Qu.:-0.20198
## c02SociologyOfDeviance: 1 Median :-0.01385 Median : 0.04293
## c03SociologyOfRace : 1 Mean : 0.03038 Mean : 0.06320
## c04MedicalSociology : 1 3rd Qu.: 0.28437 3rd Qu.: 0.22376
## c05ChildDevelopment : 1 Max. : 1.17557 Max. : 1.36793
## (Other) :53
## d3 mv cos2 twodcos2
## Min. :-1.00760 Min. : 269.9 Min. :0.1200 Min. :0.0500
## 1st Qu.:-0.14124 1st Qu.: 947.5 1st Qu.:0.3500 1st Qu.:0.2600
## Median : 0.01790 Median : 1241.6 Median :0.5900 Median :0.4200
## Mean :-0.02575 Mean : 1741.9 Mean :0.5612 Mean :0.4415
## 3rd Qu.: 0.14195 3rd Qu.: 1670.8 3rd Qu.:0.7500 3rd Qu.:0.6200
## Max. : 0.77119 Max. :26763.4 Max. :0.9700 Max. :0.8600
##
plot_ly(ca_res, x = d1, y = d2, z=d3, text = paste("Object: ", object), type="scatter3d", mode = "markers", color = cos2, colors = "RdPu", size = mv, opacity = cos2)
# Size is determined by mv, thus the representing each topic's/label's accumulated loading
# (i.e. its importance in the TM).
# Color represent cos2 (quality of fit): light=poor, dark=good.
Let’s stop here for a moment and consider the CA as an analytical tool. Notice that many of the labels and topics that are prominent in the topic model have low cos2 values.
# Best to export as a webpage.
# Back to a 2 dimensional representation, using the cos2 only for the first two dimensions:
plot_ly(ca_res, x = d1, y = d2, text = paste("Object: ", object),
mode = "markers", color = twodcos2, colors = "RdPu", size = mv, opacity = twodcos2)
# THIRD ROUND
# Can we introduce an additional variable? Yes, in CA this can be done using SUPPLEMENTARY CATEGORIES (sc).
# sc are included in our solution without affecting the original space or positions of topics and labels.
# We are going to use author univ. affiliation (institution) as a supplementary category.
# The accumulated loadings/institution are added as extra rows in our dataset in the file "LAB3_TMSupLarge.txt".
mydataSUP <- read.table("LAB3_TMSupLarge.txt", header=TRUE)
# run the ca as before, but specify the supplementary rows:
ressup.ca <- CA (mydataSUP, row.sup = 26:152, col.sup = NULL,
graph = FALSE)
print(ressup.ca)
## **Results of the Correspondence Analysis (CA)**
## The row variable has 25 categories; the column variable has 34 categories
## The chi square of independence between the two variables is equal to 21771.99 (p-value = 0 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$row.sup$coord" "coord. for supplementary rows"
## 11 "$row.sup$cos2" "cos2 for supplementary rows"
## 12 "$call" "summary called parameters"
## 13 "$call$marge.col" "weights of the columns"
## 14 "$call$marge.row" "weights of the rows"
summary(ressup.ca, nb.dec = 2, nbelements = 5, ncp = 3)
##
## Call:
## CA(X = mydataSUP, row.sup = 26:152, col.sup = NULL, graph = FALSE)
##
## The chi square of independence between the two variables is equal to 21771.99 (p-value = 0 ).
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.14 0.07 0.06 0.04 0.03 0.02 0.01
## % of var. 32.59 16.50 13.96 8.36 5.99 4.59 3.24
## Cumulative % of var. 32.59 49.08 63.04 71.40 77.39 81.98 85.22
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.01 0.01 0.01 0.01 0.01 0.01 0.00
## % of var. 2.79 2.59 1.84 1.75 1.51 1.20 0.80
## Cumulative % of var. 88.02 90.61 92.45 94.19 95.70 96.90 97.70
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## % of var. 0.75 0.48 0.36 0.26 0.19 0.15 0.06
## Cumulative % of var. 98.45 98.93 99.29 99.55 99.73 99.88 99.94
## Dim.22 Dim.23 Dim.24
## Variance 0.00 0.00 0.00
## % of var. 0.04 0.02 0.00
## Cumulative % of var. 99.97 100.00 100.00
##
## Rows (the 5 first)
## Iner*1000 Dim.1 ctr
## SBS__Sociology | 3.44 | -0.01 0.06
## H__Languages__Societies_and_Cultures | 13.63 | 0.13 0.76
## SBS__Social_and_Developmental_Psychology | 27.43 | -0.54 12.18
## H__History | 46.74 | 0.99 24.71
## SBS__Clinical_Psychology | 27.97 | -0.72 11.51
## cos2 Dim.2 ctr cos2
## SBS__Sociology 0.02 | -0.05 2.12 0.43
## H__Languages__Societies_and_Cultures 0.08 | 0.23 4.66 0.24
## SBS__Social_and_Developmental_Psychology 0.61 | 0.11 1.04 0.03
## H__History 0.73 | 0.04 0.06 0.00
## SBS__Clinical_Psychology 0.57 | 0.09 0.33 0.01
## Dim.3 ctr cos2
## SBS__Sociology | 0.01 0.13 0.02 |
## H__Languages__Societies_and_Cultures | 0.09 0.97 0.04 |
## SBS__Social_and_Developmental_Psychology | 0.22 4.74 0.10 |
## H__History | 0.04 0.08 0.00 |
## SBS__Clinical_Psychology | 0.41 8.88 0.19 |
##
## Columns (the 5 first)
## Iner*1000 Dim.1 ctr
## LIFE_COURSE | 0.55 | 0.02 0.01
## SOCIOLOGY_OF_DEVIANCE | 18.55 | -0.55 6.11
## SOCIOLOGY_OF_RACE | 2.94 | 0.07 0.07
## MEDICAL_SOCIOLOGY | 20.91 | -0.32 1.49
## CHILD_DEVELOPMENT | 15.16 | -0.51 6.61
## cos2 Dim.2 ctr cos2
## LIFE_COURSE 0.02 | -0.11 0.46 0.58
## SOCIOLOGY_OF_DEVIANCE 0.45 | 0.04 0.07 0.00
## SOCIOLOGY_OF_RACE 0.03 | 0.22 1.34 0.32
## MEDICAL_SOCIOLOGY 0.10 | -0.34 3.32 0.11
## CHILD_DEVELOPMENT 0.60 | 0.15 1.20 0.06
## Dim.3 ctr cos2
## LIFE_COURSE | 0.01 0.00 0.00 |
## SOCIOLOGY_OF_DEVIANCE | 0.42 8.49 0.27 |
## SOCIOLOGY_OF_RACE | -0.04 0.05 0.01 |
## MEDICAL_SOCIOLOGY | 0.35 4.07 0.12 |
## CHILD_DEVELOPMENT | 0.05 0.16 0.01 |
##
## Supplementary rows (the 5 first)
## Dim.1 cos2 Dim.2
## Michigan_State_University | -0.09 0.11 | -0.09
## University_of_California,_Berkeley | 0.41 0.78 | -0.02
## The_University_of_Wisconsin_-_Madison | 0.18 0.25 | -0.20
## The_University_of_Texas_at_Austin | -0.01 0.00 | 0.08
## University_of_Michigan | 0.11 0.16 | -0.05
## cos2 Dim.3 cos2
## Michigan_State_University 0.13 | -0.03 0.01 |
## University_of_California,_Berkeley 0.00 | 0.03 0.00 |
## The_University_of_Wisconsin_-_Madison 0.30 | -0.08 0.05 |
## The_University_of_Texas_at_Austin 0.20 | -0.01 0.00 |
## University_of_Michigan 0.03 | 0.02 0.01 |
# Note that our original soultion is intact, the supplemntary categories (i.e. institutions)
# only add interpretative information.
# Export the results:
summary(ressup.ca, nb.dec = 2, nbelements = Inf, ncp = 3, file = "result_round3.txt")
# and plot them:
fviz_ca_biplot(ressup.ca, col.row="orange", col.col ="steelblue", col.row.sup ="lightgrey") +
theme_minimal()
# Not fun to look at... Can we improve the representation?
# Lets start by clustering the original objects (topics/labels) in the map based on their scores in d1-3.
# We need the score data arranged in 1 column per dimension, and we need this in a 3 AND 2 dimensional version.
# Load "LAB4_Clus3D.txt":
scores3 <- read.table("LAB4_Clus3D.txt", header=TRUE)
# and "LAB5_Clus2D.txt":
scores2 <- read.table("LAB5_Clus2D.txt", header=TRUE)
# Hierarchical Clustering based on Euclidian distances,
# minimizing the total within-cluster variance (Ward's method)
# First for the three dimensional solution
d <- dist(scores3, method = "euclidean") # create the distance matrix
fit <- hclust(d, method="ward.D") # fit the distances
plot(fit) # display dendrogram. To me, it looks like a 5 or 6 cluster solution.
groups <- cutree(fit, k=6) # cut tree into 6 clusters
# draw dendrogram with red borders around the 6 clusters
rect.hclust(fit, k=6, border="red")
print(groups)
## SBS:Sociology
## 1
## H:LanguagesSocietiesandCultures
## 1
## SBS:SocialandDevelopmentalPsychology
## 2
## H:History
## 3
## SBS:ClinicalPsychology
## 2
## HS:Nursing
## 2
## SBS:PublicAffairsPublicPolicyandPlanning
## 4
## Ed:SpecialEducationandGuidance
## 1
## SBS:PoliticalScience
## 3
## B:Business
## 4
## SBS:Anthropology
## 3
## Ed:HigherEducation
## 5
## SBS:Economics
## 4
## HS:PublicHealth
## 2
## SBS:CommunicationandInformationSciences
## 1
## Ed:Education
## 5
## H:Religion
## 3
## Ed:SupplementalCurricula
## 1
## Ed:SchoolAdministration
## 5
## H:EnglishLiterature
## 6
## Ed:MulticulturalEducation
## 5
## Ed:ProfessionalandContinuingEducation
## 1
## H:FineandPerformingArts
## 6
## HS:Medicine
## 2
## Ed:CurriculumDevelopment
## 5
## LifeCourse
## 1
## SociologyOfDeviance
## 2
## SociologyOfRace
## 1
## MedicalSociology
## 2
## ChildDevelopment
## 2
## GenderSexuality
## 1
## MethodExperiements
## 2
## MethodQuantitative
## 2
## Organizations
## 4
## SocialMovement
## 3
## Family
## 2
## HistoricalSociology
## 3
## Religion
## 3
## Marriage
## 2
## WorkAndOccupation
## 4
## Household
## 4
## Education
## 5
## PoliticalSociology
## 4
## MediaLaw
## 1
## SocialWork
## 2
## SocialTheory
## 1
## MethodModels
## 1
## CommunityDevelopment
## 1
## SociologyOfLiterature
## 6
## SocialPsychologyAttitudes
## 1
## UrbanSociology
## 4
## MethodQualitative
## 1
## SymbolicInteraction
## 3
## InternationalComparative
## 4
## Ethnicity
## 1
## SociologyOfAging
## 2
## Adolescence
## 2
## SociologyOfCulture
## 1
## Stratification
## 4
# We will want to have this on a csv file.
write.csv(groups, "cl_result3.csv")
# Then for the two dimensional solution
d <- dist(scores2, method = "euclidean") # create the distance matrix
fit <- hclust(d, method="ward.D") # fit the distances
plot(fit) # display dendrogram. To me, it looks like a 5 or 6 cluster solution.
groups <- cutree(fit, k=6) # cut tree into 6 clusters
# draw dendrogram with red borders around the 6 clusters
rect.hclust(fit, k=6, border="red")
print(groups)
## SBS:Sociology
## 1
## H:LanguagesSocietiesandCultures
## 1
## SBS:SocialandDevelopmentalPsychology
## 2
## H:History
## 3
## SBS:ClinicalPsychology
## 2
## HS:Nursing
## 2
## SBS:PublicAffairsPublicPolicyandPlanning
## 4
## Ed:SpecialEducationandGuidance
## 2
## SBS:PoliticalScience
## 3
## B:Business
## 4
## SBS:Anthropology
## 3
## Ed:HigherEducation
## 5
## SBS:Economics
## 4
## HS:PublicHealth
## 2
## SBS:CommunicationandInformationSciences
## 3
## Ed:Education
## 5
## H:Religion
## 3
## Ed:SupplementalCurricula
## 1
## Ed:SchoolAdministration
## 5
## H:EnglishLiterature
## 6
## Ed:MulticulturalEducation
## 5
## Ed:ProfessionalandContinuingEducation
## 1
## H:FineandPerformingArts
## 6
## HS:Medicine
## 2
## Ed:CurriculumDevelopment
## 5
## LifeCourse
## 1
## SociologyOfDeviance
## 2
## SociologyOfRace
## 1
## MedicalSociology
## 2
## ChildDevelopment
## 2
## GenderSexuality
## 1
## MethodExperiements
## 2
## MethodQuantitative
## 2
## Organizations
## 4
## SocialMovement
## 3
## Family
## 2
## HistoricalSociology
## 3
## Religion
## 3
## Marriage
## 2
## WorkAndOccupation
## 4
## Household
## 4
## Education
## 5
## PoliticalSociology
## 4
## MediaLaw
## 1
## SocialWork
## 2
## SocialTheory
## 1
## MethodModels
## 1
## CommunityDevelopment
## 1
## SociologyOfLiterature
## 6
## SocialPsychologyAttitudes
## 1
## UrbanSociology
## 4
## MethodQualitative
## 1
## SymbolicInteraction
## 3
## InternationalComparative
## 4
## Ethnicity
## 1
## SociologyOfAging
## 2
## Adolescence
## 2
## SociologyOfCulture
## 1
## Stratification
## 4
# We will want to have this on a csv file.
write.csv(groups, "cl_result2.csv")
# Some rearrangement necessary in excel, but nothing strange:
# just add the group no. per label and topic from our cluster analysis.
# This has been done in the file "LAB6_CAresultSuplClus.txt".
Res_All <- read.table("LAB6_CAresultSuplClus.txt", header=TRUE)
# Plot the three dimensional solution:
plot_ly(Res_All, x = d1, y = d2, z=d3, text = paste("Object: ", object), type="scatter3d",
mode = "markers", color = cl3D, colors = "Accent", size = mv, opacity = cos23D)
# Save as webpage.
# Inspect "Res_All": the column "cldiff" shows the difference between the two solutions.
# Only two objects switches cluster between the two solutions.
# --> Lets go for the simpler solution and plot it:
plot_ly(Res_All, x = d1, y = d2, text = paste("Object: ", object),
mode = "markers", color = cl2D, colors = "Dark2", size = mv, opacity = cos22D)
# Save as webpage.
EXTRA CODE SNIPPETS FOR THE OVERACHEIVER.
Clustering objects in a joint space is disputed since the distance between row- and column objects does not have a clear statistical interpretation (i.e., might not be Euclidian). Empirical tests supports joint clustering, but for the purist, rows and columns are better clustered separately using FactoMiner package.
ca.factom <- CA(mydata, ncp=2, graph= FALSE)
resclust.rows<-HCPC(ca.factom, nb.clust=-1, metric="euclidean", method="ward", order=TRUE, graph.scale="inertia", graph=FALSE, cluster.CA="rows")
resclust.cols<-HCPC(ca.factom, nb.clust=-1, metric="euclidean", method="ward", order=TRUE, graph.scale="inertia", graph=FALSE, cluster.CA="columns")
# plots for row clusters
#2D
for(i in 2:2){
plot(resclust.rows, axes=c(1,i), choice="map", draw.tree=FALSE, ind.names=TRUE, new.plot=TRUE)
}
#3D
for(i in 2:2){
plot(resclust.rows, axes=c(1,i), choice="3D.map", draw.tree=TRUE, ind.names=TRUE, new.plot=TRUE)
}
# dendrogram
plot(resclust.rows, choice="tree", rect=TRUE, new.plot=TRUE)
# plots for col clusters
#2D
for(i in 2:2){
plot(resclust.cols, axes=c(1,i), choice="map", draw.tree=FALSE, ind.names=TRUE, new.plot=TRUE)
}
#3D
for(i in 2:2){
plot(resclust.cols, axes=c(1,i), choice="3D.map", draw.tree=TRUE, ind.names=TRUE, new.plot=TRUE)
}
# dendrogram
plot(resclust.cols, choice="tree", rect=TRUE, new.plot=TRUE)
Extra credit :-)