Abstract
Text analytics methods are powerful tools for extracting meaningful information from text string data. In this document we illustrate how methods including regexp-based string extraction, TF-IDF and Topic Modelling can draw insight from text data, using a Facebook timeline example. The methods illustrated here are transferable to any other form of text data, for example open text survey responses or comments. Masses of Human Resource data exist in text format, and methodologies are now available which allow us to extract meaningful information from this text. We demonstrate these methodologies using an example Facebook timeline htm file. In particular, we carry out the following types of analyses:
Use regular expression (regexp) grammar to extract meaningful text data from strings of code.
Perform time series analysis on date and time data extracted using regexp methods.
Perform frequency analysis on words in timeline posts to identify the most important or most meaningful words used over a specified time period.
Perform topic modelling on timeline posts to identify and interpret common topics of posting.
Here we load the R libraries needed for the analysis, and we load the base data file timeline.htm, which contains all posts made on the Facebook profile. This file can be obtained by downloading your Facebook data, extracting the zip archive and looking in the html directory.
# install libraries if needed
list.of.packages <- c("stringr",
"tm",
"RColorBrewer",
"topicmodels",
"tidytext",
"ggplot2",
"dplyr",
"ldatuning",
"openxlsx",
"webshot",
"slam",
"wordcloud",
"plotly")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if (length(new.packages)) {
install.packages(new.packages, repos = "http://cran.us.r-project.org")
}
# load libraries
library(stringr)
library(tm)
library(RColorBrewer)
library(topicmodels)
library(tidytext)
library(ggplot2)
library(dplyr)
library(ldatuning)
library(openxlsx)
library(webshot)
library(plotly)
# set directories
homedir <- getwd()
setwd("C:/R/facebook_analytics") # <----- change to directory where facebook data file was unzipped
if(!dir.exists("results")) {
dir.create("results")
}
# read in HTML code
doc.html <- readLines("html/timeline.htm", encoding = "UTF-8")
setwd("./results")
Here we use regexp grammar to identify the dates and content of individual posts. This is done by placing the regexp grammar (.+?) between specific strings of text, which indicates that we want to capture any string which falls between the indicated delimiters.
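To see what this lazy quantifier does, here is a minimal, self-contained illustration using an invented snippet of timeline HTML (the toy string and its contents are purely for demonstration); the same pattern is applied to the real file below.
# toy illustration: (.+?) lazily captures the shortest text between the two div tags
toy <- '<div class="meta">Monday, 1 January 2018 at 09:00</div><div class="comment">Hello world</div>'
stringr::str_match_all(toy, '<div class=\"meta\">(.+?)</div><div class=\"comment\">(.+?)</div>')
# returns one matrix per input string: full match, the date string, the post text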
We then separate the date data into hours, months and years. We perform simple dplyr group aggregation and use the results to graphically display Facebook activity over time using ggplot2 and plotly.
# unlist and create single character string from loaded file
doc.html.unlisted <- unlist(doc.html)
doc.html.combined <- paste(doc.html.unlisted, collapse = " ")
# create list of posts and dates using regexp
post_list <- stringr::str_match_all(doc.html.combined, '<div class=\"meta\">(.+?)</div><div class=\"comment\">(.+?)</div>')
#strip out dates and hours of activity
datetime_vec <- unlist(post_list[[1]][,2])
datetime_vecdate <- sub(' at.*', '', datetime_vec)
date_vec <- as.POSIXct(datetime_vecdate, format = "%A, %d %B %Y", tz="GMT")
year_vec <- format(date_vec, "%Y") %>% as.factor()
time_vec <- sub('.*at ', '', datetime_vec)
time_vec <- as.POSIXct(time_vec, format = "%H", tz="GMT")
hour_vec <- format(time_vec, "%H") %>% as.factor()
datemon_vec <- format(date_vec, "%B %Y") %>% as.factor()
date_df <- data.frame(year_vec, hour_vec, datemon_vec)
# aggregate totals by year, hour of day and month
year_totals <- date_df %>% dplyr::group_by(year_vec) %>% dplyr::summarize(total = n())
hour_totals <- date_df %>% dplyr::group_by(hour_vec) %>% dplyr::summarize(total = n())
date_totals <- date_df %>% dplyr::group_by(datemon_vec) %>% dplyr::summarize(total = n())
# create monthly activity scatter
date_totals$datemon_vec <- paste("01", date_totals$datemon_vec)
date_totals$datemon_vec <- as.Date(date_totals$datemon_vec, format = "%d %B %Y")
p <- ggplot2::qplot(data = date_totals, datemon_vec, total, xlab = "Year", ylab = "Posts")
p1 <- p + ggplot2::geom_smooth(method = "loess", size = 1.5) + ggplot2::ggtitle("Activity timeline")
# create activity by year bar chart
p2 <- plotly::plot_ly(year_totals, x = ~year_vec, y = ~total, type = 'bar') %>%
plotly::layout(title = "Posts by year", xaxis = list(title = "Year"), yaxis = list(title = "Posts")) %>%
plotly::add_lines(y = ~fitted(loess(total ~ as.numeric(year_vec))), line = list(color = 'red'),
name = "Loess Smoother", showlegend = TRUE)
# create hour of day bar chart
p3 <- plotly::plot_ly(hour_totals, x = ~hour_vec, y = ~total, type = 'bar', name = 'Posts') %>%
plotly::layout(title = "Posts by hour of day", xaxis = list(title = "Hour"), yaxis = list(title = "Posts")) %>%
plotly::add_lines(y = ~fitted(loess(total ~ as.numeric(hour_vec))), line = list(color = 'red'),
name = "Loess Smoother", showlegend = TRUE)
# display charts
p1
p2
p3
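Optionally, the charts can be written to the results directory created earlier. This is a minimal sketch rather than part of the original workflow; the file names are illustrative, ggsave() handles the ggplot2 chart, and htmlwidgets::saveWidget() (htmlwidgets is installed as a dependency of plotly) writes the interactive charts as standalone HTML files.
# optionally save the charts to the results directory (file names are illustrative)
ggplot2::ggsave("activity_timeline.png", plot = p1, width = 8, height = 5)
htmlwidgets::saveWidget(p2, "posts_by_year.html")
htmlwidgets::saveWidget(p3, "posts_by_hour.html")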
Here we use the tm package to form a corpus of documents, with each document being a timeline post. We then use the native functions of tm to remove stopwords, punctuation, white space and numbers, and to convert all text to lower case. We also remove sensitive personal content such as the names of family and friends using hidden code.
# create vector of posts stripped of html wrappers
post_vec <- unlist(post_list[[1]][,3])
# replace curly apostrophes arising from encoding issues with plain apostrophes
post_vec <- gsub("’", "'", post_vec)
# create corpus
corpus <- (tm::VectorSource(post_vec))
corpus <- tm::Corpus(corpus)
# sort out encoding
corpus<- tm::tm_map(corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
# remove punctuation
corpus <- tm::tm_map(corpus, content_transformer(removePunctuation))
# convert to lower case and remove stopwords
corpus <- tm::tm_map(corpus, content_transformer(tolower))
corpus <- tm::tm_map(corpus, content_transformer(removeWords),
stopwords("english"))
# strip whitespace
corpus <- tm::tm_map(corpus, stripWhitespace)
# remove numbers
corpus <- tm::tm_map(corpus, content_transformer(removeNumbers))
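As an optional sanity check, a few of the cleaned documents can be inspected to confirm that the transformations behaved as expected.
# optional sanity check: view a few cleaned posts
tm::inspect(corpus[1:3])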
Here we use native functions in the tm package to create a Term-Document Matrix (TDM), which counts how many times each word appears in each document. The TDM is weighted using a Term Frequency-Inverse Document Frequency (TF-IDF) formula, which is designed to give greater weight to words that appear more frequently in a particular document than they do in the overall corpus of documents. This is a standard measure of how important a word is to the meaning of a document. We then rank the words according to their average TF-IDF weighting and use a wordcloud to illustrate the 100 words of greatest importance in the corpus.
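To make the weighting concrete: with normalize = FALSE, tm's weightTfIdf() multiplies a term's raw count in a document by log2(N/df), where N is the number of documents and df is the number of documents containing the term. A tiny worked illustration with invented numbers:
# toy tf-idf calculation matching weightTfIdf(normalize = FALSE); numbers are invented
tf <- 3      # the term appears 3 times in this post
N  <- 1000   # total number of posts in the corpus
df <- 50     # number of posts containing the term
tf * log2(N / df)   # weight of about 12.97; a term appearing in every post would score 0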
# create weighted tf-idf term document matrix
TDM_tfidf <- tm::TermDocumentMatrix(corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
# rank terms by average tfidf weighting
m <- as.matrix(TDM_tfidf)
v <- sort(rowMeans(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq = v)
# generate tf_idf wordcloud
cp <- brewer.pal(7,"YlOrRd")
wordcloud::wordcloud(d$word, d$freq, max.words = 100, random.order = FALSE, colors = cp)
Topic Modelling is a mathematical technique, here using Latent Dirichlet Allocation (LDA), which determines which groups of terms appear together most frequently, giving a strong indication that they may all be associated with a similar topic or subject.
Topic Modelling requires the number of topics to generate from the corpus of documents to be specified in advance. A technique known as LDA tuning can be used to determine the number of topics that will give the clearest, most interpretable set of topics. We will not cover LDA tuning in detail in this document, but it is recommended to follow this process to avoid generating too many or too few topics for good interpretability.
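The ldatuning package loaded earlier offers one way to run this process. The following is a minimal sketch rather than part of the analysis: it assumes a Document-Term Matrix like the DTM created in the next code block, and the candidate topic range and metrics are illustrative choices.
# sketch of LDA tuning with ldatuning (assumes a DocumentTermMatrix called DTM)
tuning <- ldatuning::FindTopicsNumber(
DTM,
topics = seq(2, 20, by = 2), # candidate numbers of topics (illustrative)
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 1234),
verbose = TRUE
)
ldatuning::FindTopicsNumber_plot(tuning) # look for where the metrics level off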
LDA tuning on this corpus of Facebook posts reveals the optimal number of topics to be around 8-10. Therefore we proceed to generate 8 topics from the corpus. We use the package tm as before to generate a Document-Term Matrix (the transpose of a Term-Document Matrix). We ensure that we remove any empty documents from the corpus (which may have been created by the earlier text cleaning) and then use the package topicmodels to perform the LDA, determining the 8 groups of terms and the frequency of occurrence of each group. We then use tidytext, dplyr, ggplot2 and plotly to display the results graphically in order to allow interpretation.
# enter value of ntop (required number of topics - consider using LDA tuning functions to help)
ntop <- 8 # <--- desired number of topics
# Generate Document Term Matrix
DTM <- tm::DocumentTermMatrix(corpus)
# remove empty rows from DTM and corpus
rowTotals <- apply(DTM, 1, sum)
empty.rows <- DTM[rowTotals == 0, ]$dimnames[1][[1]]
if (length(empty.rows) > 0) {
corpus <- corpus[-as.numeric(empty.rows)]
}
DTM <- tm::DocumentTermMatrix(corpus)
# run the LDA process, setting a seed so that the output of the model is reproducible
corpus_lda <- topicmodels::LDA(DTM, k = ntop, control = list(seed = 1234))
# tidy words in topics
corpus_topics <- tidytext::tidy(corpus_lda, matrix = "beta")
# generate top ten word lists
topic_top_terms <- corpus_topics %>%
dplyr::group_by(topic) %>%
dplyr::top_n(10, beta) %>%
dplyr::ungroup() %>%
dplyr::arrange(topic, -beta)
# get proportion of posts by most likely topic
prop <- topicmodels::get_topics(corpus_lda)
prop_topics <- table(prop) %>%
prop.table()
# build top terms and topic proportion plots
p5 <- topic_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
p6 <- plotly::plot_ly(as.data.frame(prop_topics), x = ~prop, y = ~Freq, type = 'bar') %>%
plotly::layout(xaxis = list(title = "Topic ID"), yaxis = list(title = "Proportion of posts"))
# display charts
p5
p6
Time and date analysis is self-explanatory, except to say that times are in UTC/GMT. In the case of this timeline, I was resident in Australia for the majority of the period, so times could be adjusted for this by adding around 10 hours. This would render a very normal daily activity trend, with posting mostly in the mornings and early evenings.
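As a minimal sketch of one way to make this adjustment, the time_vec object created earlier can be re-formatted in a different timezone (Australia/Sydney is an illustrative choice):
# optional: re-express posting hours in Australian Eastern time rather than GMT
hour_vec_aest <- format(time_vec, "%H", tz = "Australia/Sydney") %>% as.factor()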
Wordcloud analysis reveals the most important/meaningful words in posts and is self-explanatory. child1 and child2 are placeholders for my real children’s names, protected for their privacy.
The topics produced by Topic Modelling can be interpreted by looking at the individual words commonly found in each topic grouping. Examples include Happy Birthday postings (likely by others onto my timeline), children and family related posts, television related posts (including the Mad Men TV show), football related posts (including my support of Everton FC) and some posts about Australian political news (abbott = Tony Abbott, Prime Minister of Australia 2013-15). The top three topics by frequency appear to relate to TV (topic 8), birthdays (topic 1) and family (topic 2).