You can follow this presentation at http://rpubs.com/keithmcnulty/cipd_breakout
Code for this work is available at http://rpubs.com/keithmcnulty/facebook_analysis
Dillon Nagle
Keith McNulty
One of the most powerful features of statistical programming languages like R is the ability to analyze text.
Text analytics can help you make sense of large volumes of unstructured text quickly and at scale.
One example of how we use these techniques is to rapidly analyze open-text survey responses and identify the main themes arising among employees. There are many other potential uses too.
We can’t show you real data from our work, so instead we are going to demonstrate by analyzing the posts from Keith’s Facebook timeline. Each post can be regarded as a short ‘document’. Keith posted almost 5,000 times in 10 years on Facebook.
We don’t have very long. We will spend about 25 minutes demonstrating the methods and showing the results. We will then spend about 10 minutes discussing the results and in Q&A.
We won’t go into too much technical detail because we don’t have time, but the code will be available for you afterwards to review at this link.
We have prepared a CSV file containing around 5,000 posts that Keith made to his Facebook timeline over ten years.
We will load this file into R, and use the tm text mining package to turn it into a corpus, or list of documents, with each post being a document. For example:
library(tm)

# read the posts; we assume the post text is in the first column of the csv
keiths_posts <- read.csv("keiths_posts.csv", stringsAsFactors = FALSE)

# create a corpus with one document per post
corpus <- tm::VectorSource(keiths_posts[[1]])
corpus <- tm::Corpus(corpus)

For text analysis, we usually don’t care about punctuation, capitalization or very common ‘stopwords’ (like ‘and’ or ‘the’), so we strip these out:
# remove punctuation
corpus <- tm::tm_map(corpus, content_transformer(removePunctuation))
# convert to lower case and remove stopwords
corpus <- tm::tm_map(corpus, content_transformer(tolower))
corpus <- tm::tm_map(corpus, content_transformer(removeWords),
                     stopwords("english"))

TF-IDF (term frequency-inverse document frequency) is a common technique to determine which words are important. For a given term in a given document, it can be calculated as follows:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log_2\left(\frac{N}{\mathrm{df}(t)}\right)$$

where $\mathrm{tf}(t, d)$ is the number of times term $t$ appears in document $d$, $N$ is the total number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents containing $t$.
This formula means that terms which appear a lot in a document but less frequently across the whole corpus are considered to be more meaningful in that document.
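To make this concrete, here is a toy calculation with made-up numbers. Suppose ‘marathon’ appears 4 times in a post and occurs in 50 of the 5,000 posts overall, while ‘today’ also appears 4 times in that post but occurs in 2,500 posts overall. Using the base-2 logarithm that tm’s weightTfIdf uses:

$$\mathrm{tfidf}(\text{marathon}) = 4 \times \log_2\left(\frac{5000}{50}\right) \approx 26.6 \qquad \mathrm{tfidf}(\text{today}) = 4 \times \log_2\left(\frac{5000}{2500}\right) = 4$$

The rarer word scores much higher, so it is treated as more meaningful in that post.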
We usually use a Term-Document Matrix to calculate all the TF-IDF statistics for every word in every document.
The (i,j)-th entry is the importance of word i in document j.
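As a toy illustration (two made-up ‘posts’; the entries here are raw counts, and the weighting option in the next code block switches them to TF-IDF):

# two tiny documents
toy <- tm::Corpus(tm::VectorSource(c("pizza pizza beer", "beer run")))

# rows are terms, columns are documents
tm::inspect(tm::TermDocumentMatrix(toy))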
# create tf-idf weighted term-document matrix
TDM_tfidf <- tm::TermDocumentMatrix(corpus,
                                    control = list(weighting = function(x)
                                      weightTfIdf(x, normalize = FALSE)))

Usually a good way of demonstrating word frequency is to show a wordcloud. Wordclouds are easy to create in R using the wordcloud package.
Let’s put the 100 most important terms, ranked by TF-IDF, into a wordcloud.
# rank terms by their total tf-idf across all posts (this builds the d used below)
freq <- sort(rowSums(as.matrix(TDM_tfidf)), decreasing = TRUE)
d <- data.frame(word = names(freq), freq = freq)

# generate tf-idf wordcloud
cp <- RColorBrewer::brewer.pal(7, "YlOrRd")
wordcloud::wordcloud(d$word, d$freq, max.words = 100,
                     random.order = FALSE, colors = cp)

Topic modelling uses a mathematical technique called Latent Dirichlet Allocation (LDA) to work out which words seem to appear together more frequently in documents. There is a good chance that these groups of words point to specific subjects or topics in the corpus.
You need to tell the LDA process exactly how many topics to look for. There are ways to estimate the optimal number of topics; we won’t cover them in the talk (though one approach is sketched below), but it turns out that 8 topics is a good number for Keith’s Facebook corpus.
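If you are curious, here is a minimal sketch of one approach, using the ldatuning package (our choice for illustration, not something we cover in the talk). It assumes the document-term matrix DTM that we create in the next step:

library(ldatuning)

# score a range of candidate topic counts with two common metrics
k_scores <- ldatuning::FindTopicsNumber(
  DTM,
  topics = seq(2, 20, by = 2),
  metrics = c("Griffiths2004", "CaoJuan2009"),
  control = list(seed = 1234)
)

# plot the metrics to eyeball a sensible number of topics
ldatuning::FindTopicsNumber_plot(k_scores)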
The topicmodels R package has easy functions for performing LDA on a document-term matrix. You just have to tell it how many topics you want it to find. You can also ask it how frequently each topic occurs in the corpus, so you can tell whether some subjects come up more than others.
# generate document-term matrix
DTM <- tm::DocumentTermMatrix(corpus)

# find 8 topics
corpus_lda <- topicmodels::LDA(DTM, k = 8, control = list(seed = 1234))

Once the topics have been discovered, you can find out how often each topic appears in the corpus, which is an indicator of the popularity of the topic.
You need to look at the output of the topic modelling process, and the lists of words in each topic, to try to work out what each topic is. For example:
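# the most likely topic for each post; tabulating shows topic popularity
post_topics <- topicmodels::topics(corpus_lda)
table(post_topics)

# the top 10 terms in each of the 8 topics, to help with interpretation
topicmodels::terms(corpus_lda, 10)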
Some topics are obvious and easy to interpret; others are harder to work out unless you know the context of the situation.
There is much more you can do with text analytics in R. For example, you could:
Analyze the sentiment and emotions of statements or documents (eg negative or positive, anger, joy, etc; see the sketch after this list)
Use the shiny package to build a web app that lets others perform similar analyses even if they can’t use R
Much, much more!
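As a small taste of the first of these, here is a minimal sketch using the syuzhet package (our choice for illustration; packages like tidytext work too). It assumes the keiths_posts data frame loaded earlier, with the post text in its first column:

library(syuzhet)

# score each post for eight emotions plus positive/negative sentiment
sentiments <- syuzhet::get_nrc_sentiment(as.character(keiths_posts[[1]]))

# total counts of each emotion across all posts
colSums(sentiments)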