Natural Language Processing and Topic Modeling

Chris Bail, PhD
Computational Sociology, Duke University

Recap

 

Last class, we learned how to collect vast amounts of text-based data (and non-text data).

Recap

 

But collecting large amounts of text-based data quickly becomes overwhelming once you realize you cannot possibly read it all.

Today's Agenda

 

Thankfully, computer scientists and computational linguists have produced a variety of exciting new tools for automated content analysis.

Today's Agenda

 

Unfortunately, these techniques require quite a few steps. First we need to transform words into numbers; then we will fit some topic models and go over their common pitfalls.

Today's Agenda

 

1) Creating a corpus
2) What is topic modeling?
3) Conventional LDA
4) Structural Topic Models

CREATING A CORPUS

First, Set Your Working Directory

 

setwd("/Users/christopherandrewbail/Desktop/Dropbox/TEACHING/Computational Soc Fall 2015/Course Dropbox")

The tm Package

 

install.packages("tm")
library(tm)

Let's Load some Data

 

Let's read in some political blog data I put in our Dropbox:

blog_data<-read.csv("poliblogs2008.csv", stringsAsFactors = FALSE)

Let's take a peek:

 

colnames(blog_data)

And another one:

 

blog_data$documents[12]
[1] "Historian Richard Brookhiser puts is succinctly at NRO's The Corner this morning:But a man who could not have used certain restrooms forty years ago is in the center ring, not as a freak in the manner of Alberto Fujimori or Sonia Gandhi, nor even as a faction fighter in the style of Jesse Jackson, but as a real player. One of our great national sins is being obliterated, as the years pass, by the virtues of our national system. I don't agree with Obama and I don't particularly like him, but I am proud of this moment. Those of us of a certain age may be surprised that an even bigger deal isn't being made out of the fact that an African American just won a huge victory in a state that is 96% white. Other pundits are marveling at the Obama phenomenon with equal surprise: David Brooks: Barack Obama has won the Iowa caucuses. You’d have to have a heart of stone not to feel moved by this. An African-American man wins a closely fought campaign in a pivotal state. He beats two strong opponents, including the mighty Clinton machine. He does it in a system that favors rural voters. He does it by getting young voters to come out to the caucuses. This is a huge moment. It’s one of those times when a movement that seemed ethereal and idealistic became a reality and took on political substance. Iowa won’t settle the race, but the rest of the primary season is going to be colored by the glow of this result. Whatever their political affiliations, Americans are going to feel good about the Obama victory, which is a story of youth, possibility and unity through diversity — the primordial themes of the American experience. Peggy Noonan: As for Sen. Obama, his victory is similarly huge. He won the five biggest counties in Iowa, from the center of the state to the South Dakota border. He carried the young in a tidal wave. He outpolled Mrs. Clinton among women. He did it with a classy campaign, an unruffled manner, and an appeal on the stump that said every day, through the lines: Look at who I am and see me, the change that you desire is right here, move on with me and we will bring it forward together.Andrew Sullivan: Look at their names: Huckabee and Obama. Both came from nowhere - from Arkansas and Hawaii. Both campaigned as human beings, not programmed campaign robots with messages honed in focus groups. Both faced powerful and monied establishments in both parties. And both are running two variants on the same message: change, uniting America again, saying goodbye to the bitterness of the polarized past, representing ordinary voters against the professionals. Neither has been ground down by long experience, but neither is a neophyte. You have a Republican educated in a Bible college; and a Democrat who is the most credible African-American candidate for the presidency in history. Their respective margins were far larger than most expected. And the hope they have unleashed is palpable. E.J. Dionne: Change, particularly generational change, was also at the heart of Barack Obama's victory over Hillary Rodham Clinton and John Edwards. Young voters and independents flocked to the Illinois senator. Media entrance polls showed that Obama defeated Clinton by better than 5 to 1 among voters under age 30, and such voters made up almost as large a share of the caucus electorate as voters over 65, a strongly pro-Clinton group. Among independents, Obama beat Clinton by better than 2 to 1. Matthew Yglesias: I think the manner of Barack Obama's win is pretty impressive. 
I can't be the only one who was a bit inclined toward a cynical roll of the eyes at the idea of winning on the back of unprecedented turnout, mobilizing new voters, brining in young people, etc. That sounds like the kind of thing that people say they're going to do but never deliver on. But he did deliver. That's impressive. Perhaps the best line written about last night's Obama win is a touch more negative. From Powerline: CONCLUDING THOUGHTS: Iowa has given its seal of approval to (1) a one-term Senator who stands for \"hope\" and \"change\" and (2) a tacky, big spending governor who doesn't know much about foreign policy but did stay at a Holiday Inn Express. The common demoninator here, other than a patent lack of qualifications for the presidency, is likeability. Well done, (small fraction of) Iowa."

The Joys of Character Encoding

 

# strip non-ASCII characters that can trip up tm's transformations
blog_data$documents <- iconv(blog_data$documents, "latin1", "ASCII", sub="")

Create a Corpus

 

blog_corpus <- Corpus(VectorSource(as.vector(blog_data$documents))) 

From Words to Numbers...

Topic Models

"Pre-Processing"" Text

 

blog_corpus <- tm_map(blog_corpus, content_transformer(removePunctuation)) 

"Pre-Processing"" Text

 

blog_corpus <- tm_map(blog_corpus, content_transformer(tolower))

"Pre-Processing"" Text

 

blog_corpus <- tm_map(blog_corpus, content_transformer(stripWhitespace))

Stop Words

 

stoplist <- read.csv("english_stopwords.csv", header=TRUE, stringsAsFactors = FALSE)
stoplist <- stoplist$stopword
blog_corpus <- tm_map(blog_corpus, content_transformer(removeWords), stoplist)
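
If you don't have a stopword file handy, tm ships with a built-in English list that works the same way (a minimal alternative, assuming the default list suits your corpus):

blog_corpus <- tm_map(blog_corpus, content_transformer(removeWords), stopwords("english"))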

Stemming

 

# stemming calls the SnowballC package under the hood, so install it first
blog_corpus <- tm_map(blog_corpus, content_transformer(stemDocument), language = "english")

Document Term Matrix

 

Blog_DTM <- DocumentTermMatrix(blog_corpus, control = list(wordLengths = c(2, Inf)))

Inspect the Document-Term Matrix

 

inspect(Blog_DTM[1:20,1:20])

Remove Sparse Terms

 

DTM <- removeSparseTerms(Blog_DTM, 0.990)

This removes terms that appear in fewer than 1% of all documents (i.e., terms whose sparsity exceeds 0.99).
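
As a quick sanity check (not part of the original pipeline), compare the dimensions before and after pruning:

dim(Blog_DTM)  # documents x terms before pruning
dim(DTM)       # documents x terms after pruning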

Inspect the Popular Terms

 

The following line finds all the words that occur at least 3,000 times in the dataset:

findFreqTerms(Blog_DTM, 3000)

Assigning the Number of Topics

 

k<-7

The Topic Models Package

 

library(topicmodels)

Setting Control Parameters

 

control_LDA_Gibbs <- list(alpha = 50/k, estimate.beta = T, 
                          verbose = 0, prefix = tempfile(), 
                          save = 0, 
                          keep = 50, 
                          seed = 980,  # for reproducibility
                          nstart = 1, best = T,
                          delta = 0.1,
                          iter = 2000, 
                          burnin = 100, 
                          thin = 2000) 

Our First Topic Model

 

my_first_topic_model <- LDA(Blog_DTM, k, method = "Gibbs", control = control_LDA_Gibbs)

Getting the most Popular Terms by Topic

 

terms(my_first_topic_model, 30)

Determining K (the Number of Topics)

 

library(parallel)  # mclapply lives here
many_models <- mclapply(seq(2, 35, by = 1), function(x) {LDA(Blog_DTM, x, method = "Gibbs", control = control_LDA_Gibbs)})

(Hat tip to Achim Edelmann for this nice function.)

Plotting the log likelihoods

 

many_models.logLik <- as.data.frame(as.matrix(lapply(many_models, logLik)))

We can then plot the results to see where we hit diminishing returns as we increase the number of topics:

plot(2:35, unlist(many_models.logLik), xlab="Number of Topics", ylab="Log-Likelihood")

Then We Repeat...

 

k<-10
my_first_topic_model <- LDA(Blog_DTM, k, method = "Gibbs", control = control_LDA_Gibbs)

Finding the Topic Assignments

 

This line gives us the most likely topic for each document:

topic_assignments_by_docs <- topics(my_first_topic_model)
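
For example, a quick way to see how the documents are distributed across topics:

table(topic_assignments_by_docs)  # number of documents assigned to each topic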

STRUCTURAL TOPIC MODELING

The STM Package

 

Structural topic models are a relatively recent innovation that lets you improve the identification of latent topics in unstructured text data using metadata that describe properties of each text (e.g., the year in which it was written).

Read-in texts from .csv

 

library(stm)
data <- read.csv("poliblogs2008.csv")
processed <- textProcessor(data$documents, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

Note: the following slides use code from a vignette written by the STM package's authors

Running the model

 

poliblogPrevFit <- stm(out$documents, out$vocab, K = 20, prevalence =~ rating + s(day),
                       max.em.its = 75, data = out$meta, init.type = "Spectral")

Searching for K...

 

storage <- searchK(out$documents, out$vocab, K = c(7, 10),
 prevalence =~ rating + s(day), data = meta)
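
searchK returns diagnostics (held-out likelihood, residuals, semantic coherence) for each candidate K, and the package provides a plot method to compare them:

plot(storage)  # one panel per diagnostic, by K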

We don't have time to run the models

So let's load a workspace into our RStudio session that contains the results of the models (in other words, we will create the objects that would have been produced had we waited for the models to finish running).

load(url("http://goo.gl/VPdxlS"))

This is a URL generated by the authors of the package

Get Top Words...

 

labelTopics(poliblogPrevFit, c(3, 7, 20))

Plot Quotes...

thoughts3 <- findThoughts(poliblogPrevFit, texts = shortdoc, n = 2, topics = 3)$docs[[1]]
thoughts20 <- findThoughts(poliblogPrevFit, texts = shortdoc, n = 2, topics = 20)$docs[[1]]
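
The shortdoc object comes from the workspace we loaded; it is just a truncated copy of the raw texts. If you are building it yourself, something like the sketch below works (the 200-character cutoff is my assumption), and plotQuote then displays the example passages:

shortdoc <- substr(out$meta$documents, 1, 200)    # assumed: truncate each text to 200 characters
par(mfrow = c(1, 2), mar = c(0.5, 0.5, 1, 0.5))   # show the two topics side by side
plotQuote(thoughts3, width = 30, main = "Topic 3")
plotQuote(thoughts20, width = 30, main = "Topic 20")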

Plot Top Words and Topic Frequency...

plot.STM(poliblogPrevFit, type = "summary", xlim = c(0, .3))

Plot Topics against Metadata!

 

Beyond the stm package itself, companion packages such as stmCorrViz and stmBrowser provide nice interactive tools for examining the relationship between topic assignments and metadata.
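
Within stm itself, the standard route is estimateEffect, which regresses topic proportions on metadata. A sketch following the package vignette (the covariate values "Liberal" and "Conservative" are how rating is coded in the poliblogs data):

prep <- estimateEffect(1:20 ~ rating + s(day), poliblogPrevFit,
                       metadata = out$meta, uncertainty = "Global")
plot(prep, covariate = "rating", topics = c(3, 7, 20),
     model = poliblogPrevFit, method = "difference",
     cov.value1 = "Liberal", cov.value2 = "Conservative")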

NOW YOU TRY IT

Run a topic model (either LDA or STM)

COMMON PITFALLS OF TOPIC MODELING

Choosing 'k'

 

With no single “correct” value of k, it is easy to “read the tea leaves” and settle on whichever solution best fits your priors.

Once you have assignments, what should you do with them?

 

Probabilities in regression models?

Dichotomous dummies?

What's the cut-off?
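
One option, sketched below: posterior() in topicmodels returns each document's full distribution over topics, which you can use as continuous measures or dichotomize (the 0.5 cut-off here is arbitrary, which is exactly the pitfall; this also assumes no empty documents were dropped when fitting):

doc_topic_probs <- posterior(my_first_topic_model)$topics           # documents x topics matrix
blog_data$topic1_prob  <- doc_topic_probs[, 1]                      # continuous measure
blog_data$topic1_dummy <- as.numeric(doc_topic_probs[, 1] > 0.5)    # arbitrary cut-off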

 

NEXT WEEK:

Visualization

Now that we've collected and classified data, this class will teach you how to analyze them using basic visualization techniques (e.g., scatterplots, line graphs, and bar charts). It is therefore designed to be a “launch point” for some of R's more stunning visualization capabilities (heatmaps, network diagrams, streamgraphs, etc.), which Kieran will cover in his visualization seminar this fall.