Using R to get insights from text

CIPD HR Analytics Conference - June 5, 2018

You can follow this presentation at http://rpubs.com/keithmcnulty/cipd_breakout

Code for this work is available at http://rpubs.com/keithmcnulty/facebook_analysis

Purpose of today

Who are we?

Dillon Nagle

  • Global Manager of Recruiting Analytics, McKinsey & Company
  • Gamer and frustrated Man Utd fan
  • Learned R 2 years ago

Keith McNulty

  • Global Head of People Analytics, McKinsey & Company
  • Wants to be a gamer, even more frustrated Everton fan
  • Learned R 2 years ago

Why are we here?

One of the most powerful features of statistical programming languages like R is the ability to analyze text.

Text analytics can help you

  • Process and handle large numbers of documents in seconds
  • Draw out important words or phrases that have greater meaning than others
  • Understand some of the big subjects or topics that appear in documents
  • Understand sentiment (positive or negative), and even emotions in written documents

Why is this useful?

One example of how we use these techniques is to rapidly analyze open text survey responses, and identify the main themes that are arising among employees. There are many other potential uses also.

We can’t show you real data from our work, so instead we are going to demonstrate by analyzing the posts from Keith’s Facebook timeline. Each post can be regarded as a short ‘document’. Keith posted almost 5,000 times in 10 years on Facebook.

How will this session work?

We don’t have very long. We will spend about 25 minutes demonstrating the methods and showing the results. We will then spend about 10 minutes discussing the results and taking questions.

We won’t go into too much technical detail because we don’t have time, but the code will be available for you to review afterwards at the link above.

Finding the most important words

Loading the data

We have pre-prepared a csv file containing a list of around 5,000 posts which Keith made to his Facebook timeline over ten years.

We will load this file into R, and use the tm text mining package to turn it into a corpus, or list of documents, with each post being a document.

library(tm)

keiths_posts <- read.csv("keiths_posts.csv", stringsAsFactors = FALSE)

# build the corpus from the post text
# (the column name 'post' is an assumption - use whichever column holds the post text)
corpus <- tm::VectorSource(keiths_posts$post)
corpus <- tm::Corpus(corpus)

Cleaning the data

For text analysis, we usually don’t care about

  • numbers and punctuation, so we usually remove them from the corpus
  • upper or lower case, so we usually just convert the entire corpus to lower case.
  • stopwords, which are common words like ‘and’ and ‘or’ that have little meaning in documents, so we want them removed as well

# remove numbers and punctuation

corpus <- tm::tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm::tm_map(corpus, content_transformer(removeNumbers))

# convert to lower case and remove stopwords

corpus <- tm::tm_map(corpus, content_transformer(tolower))
corpus <- tm::tm_map(corpus, content_transformer(removeWords), 
                     stopwords("english"))

TF-IDF

TF-IDF is a common technique to determine which words are important. For a given term in a given document, this can be calculated as follows:

  • the number of times the term appears in the document (its term frequency)
  • multiplied by a weight that shrinks as the term appears in more documents across the corpus (its inverse document frequency)

This formula means that terms which appear a lot in a document but less frequently across the whole corpus are considered to be more meaningful in that document.
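
For reference, the weighting used by tm’s weightTfIdf below follows the standard formulation (the base-2 logarithm is tm’s default; this formula is not shown in the original slides):

$$\mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \times \log_2\frac{N}{\mathrm{df}(t)}$$

where tf(t, d) is the number of times term t appears in document d, N is the total number of documents in the corpus, and df(t) is the number of documents containing t.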

Term Document Matrix

We usually use a Term-Document Matrix to calculate all the TF-IDF statistics for every word in every document.

The (i,j)-th entry is the importance of word i in document j.

# create a TF-IDF weighted term-document matrix

TDM_tfidf <- tm::TermDocumentMatrix(corpus, 
                                    control = list(weighting = function(x) 
                                      weightTfIdf(x, normalize = FALSE)))

Wordclouds

A wordcloud is usually a good way of displaying important or frequent words. Wordclouds are easy to create in R using the wordcloud package.

Let’s put the 100 most important terms, ranked by their TF-IDF scores, into a wordcloud.
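
The wordcloud call below expects a data frame d with a word column and a freq column. The original code does not show how d is built; a minimal sketch, assuming we simply sum each term’s TF-IDF weight across all documents in TDM_tfidf:

# sum each term's TF-IDF weight across all posts (assumed aggregation step)
m <- as.matrix(TDM_tfidf)
term_importance <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(term_importance), freq = term_importance)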

# generate TF-IDF wordcloud

cp <- RColorBrewer::brewer.pal(7, "YlOrRd")
wordcloud::wordcloud(d$word, d$freq, max.words = 100, 
                     random.order = FALSE, colors = cp)

Results

Working out topics in the corpus

Topic Modelling

Topic modelling uses a mathematical technique called Latent Dirichlet Allocation (LDA) to work out which words seem to appear together more frequently in documents. There is a good chance that these groups of words point to specific subjects or topics in the corpus.

You need to tell the LDA process exactly how many topics you want it to look for. There are ways to find out the optimal number of topics. We won’t look at them here, but it turns out that 8 topics is a good number for Keith’s Facebook corpus.

Running the LDA process

The topicmodels R package has easy-to-use functions for performing LDA on a document-term matrix. You just have to tell it how many topics you want it to find. You can also calculate how frequently each topic occurs in the corpus, so you can tell whether some subjects come up more than others.

# Generate Document Term Matrix

DTM <- tm::DocumentTermMatrix(corpus)

# Find 8 topics

corpus_lda <- topicmodels::LDA(DTM, k = 8, control = list(seed = 1234))

Calculating how often topics appear in the corpus

Once the topics have been discovered, you can find out how often each topic appears in the corpus, which is an indicator of the popularity of the topic.
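
One way to do this (a sketch using the posterior function from topicmodels, which is not shown in the original code) is to average each topic’s probability across all of the posts:

library(topicmodels)

# per-document topic probabilities: one row per post, one column per topic
topic_probs <- posterior(corpus_lda)$topics

# average probability of each topic across all posts, a rough measure of its popularity
sort(colMeans(topic_probs), decreasing = TRUE)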

But what are the topics?

You need to look at the results of the topic modelling process, and the lists of words in each topic, to try to work out what each topic is.

Some topics are obvious and easy to interpret; others are harder to work out unless you know the context of the situation.
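
For example, the highest-probability words in each topic can be listed straight from the fitted model; a minimal sketch (the choice of 10 words per topic is arbitrary):

library(topicmodels)

# list the 10 highest-probability words for each of the 8 topics
terms(corpus_lda, 10)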

Other things you can do with R

  • Analyze the sentiment and emotions of statements or documents (eg negative or positive, anger, joy, etc); see the short sketch after this list

  • Use the shiny package to make a web app that allows others to perform similar analysis even if they can’t use R

  • Much, much more!
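
As a quick illustration of the sentiment point above, a minimal sketch assuming the syuzhet package (not part of the analysis shown today) and two made-up statements:

library(syuzhet)

# score a couple of example statements against the NRC sentiment and emotion lexicon
statements <- c("I am delighted with these survey results",
                "another frustrating defeat for Everton")
get_nrc_sentiment(statements)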

Questions?