Abstract
Text analytics methods are powerful tools for extracting meaningful information from text string data. In this document we illustrate how methods including regexp-based string extraction, TF-IDF and Topic Modelling can draw insight from text data, using a Facebook timeline example. The methods illustrated here are transferable to any other form of text data, for example open text survey responses or comments. Masses of Human Resource data exist in text format, and methodologies are now available which allow us to extract meaningful information from this text. We demonstrate these methodologies using an example Facebook timeline htm file. In particular, we carry out the following types of analyses:
Use regular expression (regexp) grammar to extract meaningful text data from strings of code.
Perform time series analysis on date and time data extracted using regexp methods.
Perform frequency analysis on words in timeline posts to identify the most important or most meaningful words used over a specified time period.
Perform topic modelling on timeline posts to identify and interpret common topics of posting.
Here we load the R libraries needed for the analysis, and we load the base data file timeline.htm, which contains all posts made on the Facebook profile. This file can be obtained by downloading your Facebook data, extracting the zip archive and looking in the html directory.
# install libraries if needed
list.of.packages <- c("stringr",
"tm",
"RColorBrewer",
"topicmodels",
"tidytext",
"ggplot2",
"dplyr",
"ldatuning",
"openxlsx",
"webshot",
"slam",
"wordcloud",
"plotly")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if (length(new.packages)) {
install.packages(new.packages, repos = "http://cran.us.r-project.org")
}
# load libraries
library(stringr)
library(tm)
library(RColorBrewer)
library(topicmodels)
library(tidytext)
library(ggplot2)
library(dplyr)
library(ldatuning)
library(openxlsx)
library(webshot)
library(plotly)
# set directories
homedir <- getwd()
setwd("C:/R/facebook_analytics") # <----- change to directory where facebook data file was unzipped
if(!dir.exists("results")) {
dir.create("results")
}
# read in HTML code
doc.html <- readLines("html/timeline.htm", encoding = "UTF-8")
setwd("./results")
Here we use regexp grammar to identify the dates and content of individual posts. This is done by placing the regexp grammar (.+?) between specific strings of text, which indicates that we want to capture any string which falls between the indicated delimiters.
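To see what this lazy quantifier does, here is a minimal, self-contained illustration using an invented snippet of timeline HTML (the toy string and its contents are purely for demonstration); the same pattern is applied to the real file below.
# toy illustration: (.+?) lazily captures the shortest text between the two div tags
toy <- '<div class="meta">Monday, 1 January 2018 at 09:00</div><div class="comment">Hello world</div>'
stringr::str_match_all(toy, '<div class=\"meta\">(.+?)</div><div class=\"comment\">(.+?)</div>')
# returns one matrix per input string: full match, the date string, the post text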
We then separate the date data into hours, months and years. We perform simple dplyr group aggregation and use the results to graphically display Facebook activity over time using ggplot2 and plotly.
# unlist and create single character string from loaded file
doc.html.unlisted <- unlist(doc.html)
doc.html.combined <- paste(doc.html.unlisted, collapse = " ")
# create list of posts and dates using regexp
post_list <- stringr::str_match_all(doc.html.combined, '<div class=\"meta\">(.+?)</div><div class=\"comment\">(.+?)</div>')
#strip out dates and hours of activity
datetime_vec <- unlist(post_list[[1]][,2])
datetime_vecdate <- sub(' at.*', '', datetime_vec)
date_vec <- as.POSIXct(datetime_vecdate, format = "%A, %d %B %Y", tz="GMT")
year_vec <- format(date_vec, "%Y") %>% as.factor()
time_vec <- sub('.*at ', '', datetime_vec)
time_vec <- as.POSIXct(time_vec, format = "%H", tz="GMT")
hour_vec <- format(time_vec, "%H") %>% as.factor()
datemon_vec <- format(date_vec, "%B %Y") %>% as.factor()
date_df <- data.frame(year_vec, hour_vec, datemon_vec)
# aggregate totals by year, hour of day and month
year_totals <- date_df %>% dplyr::group_by(year_vec) %>% dplyr::summarize(total = n())
hour_totals <- date_df %>% dplyr::group_by(hour_vec) %>% dplyr::summarize(total = n())
date_totals <- date_df %>% dplyr::group_by(datemon_vec) %>% dplyr::summarize(total = n())
# create monthly activity scatter
date_totals$datemon_vec <- paste("01", date_totals$datemon_vec)
date_totals$datemon_vec <- as.Date(date_totals$datemon_vec, format = "%d %B %Y")
p <- ggplot2::qplot(data = date_totals, datemon_vec, total, xlab = "Year", ylab = "Posts")
p1 <- p + ggplot2::geom_smooth(method = "loess", size = 1.5) + ggplot2::ggtitle("Activity timeline")
# create activity by year bar chart
p2 <- plotly::plot_ly(year_totals, x = ~year_vec, y = ~total, type = 'bar') %>%
plotly::layout(title = "Posts by year", xaxis = list(title = "Year"), yaxis = list(title = "Posts")) %>%
plotly::add_lines(y = ~fitted(loess(total ~ as.numeric(year_vec))), line = list(color = 'red'),
name = "Loess Smoother", showlegend = TRUE)
# create hour of day bar chart
p3 <- plotly::plot_ly(hour_totals, x = ~hour_vec, y = ~total, type = 'bar', name = 'Posts') %>%
plotly::layout(title = "Posts by hour of day", xaxis = list(title = "Hour"), yaxis = list(title = "Posts")) %>%
plotly::add_lines(y = ~fitted(loess(total ~ as.numeric(hour_vec))), line = list(color = 'red'),
name = "Loess Smoother", showlegend = TRUE)
# display charts
p1
p2
p3
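Optionally, the charts can be written to the results directory created earlier. This is a minimal sketch rather than part of the original workflow; the file names are illustrative, ggsave() handles the ggplot2 chart, and htmlwidgets::saveWidget() (htmlwidgets is installed as a dependency of plotly) writes the interactive charts as standalone HTML files.
# optionally save the charts to the results directory (file names are illustrative)
ggplot2::ggsave("activity_timeline.png", plot = p1, width = 8, height = 5)
htmlwidgets::saveWidget(p2, "posts_by_year.html")
htmlwidgets::saveWidget(p3, "posts_by_hour.html")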
Here we use the tm package to form a corpus of documents, with each document being a timeline post. We then use the native functions of tm to remove stopwords, punctuation, white space and numbers, and to convert all text to lower case. We also remove sensitive personal content such as the names of family and friends using hidden code.
# create vector of posts stripped of html wrappers
post_vec <- unlist(post_list[[1]][,3])
# replace curly apostrophes arising from encoding issues with plain apostrophes
post_vec <- gsub("’", "'", post_vec)
# create corpus
corpus <- (tm::VectorSource(post_vec))
corpus <- tm::Corpus(corpus)
# sort out encoding
corpus<- tm::tm_map(corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
# remove punctuation
corpus <- tm::tm_map(corpus, content_transformer(removePunctuation))
# convert to lower case and remove stopwords
corpus <- tm::tm_map(corpus, content_transformer(tolower))
corpus <- tm::tm_map(corpus, content_transformer(removeWords),
stopwords("english"))
# strip whitespace
corpus <- tm::tm_map(corpus, stripWhitespace)
# remove numbers
corpus <- tm::tm_map(corpus, content_transformer(removeNumbers))
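As an optional sanity check, a few of the cleaned documents can be inspected to confirm that the transformations behaved as expected.
# optional sanity check: view a few cleaned posts
tm::inspect(corpus[1:3])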
Here we use native functions in the tm package to create a Term-Document Matrix (TDM), which counts how many times each word appears in each document. The TDM is weighted using a Term Frequency-Inverse Document Frequency (TF-IDF) formula, which is designed to give greater weight to words that appear more frequently in a particular document than they do in the overall corpus of documents. This is a standard measure of how important a word is to the meaning of a document. We then rank the words according to their average TF-IDF weighting and use a wordcloud to illustrate the 100 words of greatest importance in the corpus.
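To make the weighting concrete: with normalize = FALSE, tm's weightTfIdf() multiplies a term's raw count in a document by log2(N/df), where N is the number of documents and df is the number of documents containing the term. A tiny worked illustration with invented numbers:
# toy tf-idf calculation matching weightTfIdf(normalize = FALSE); numbers are invented
tf <- 3      # the term appears 3 times in this post
N  <- 1000   # total number of posts in the corpus
df <- 50     # number of posts containing the term
tf * log2(N / df)   # weight of about 12.97; a term appearing in every post would score 0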
# create weighted tf-idf term document matrix
TDM_tfidf <- tm::TermDocumentMatrix(corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
# rank terms by average tfidf weighting
m <- as.matrix(TDM_tfidf)
v <- sort(rowMeans(m), decreasing=TRUE)
d <- data.frame(word = names(v), freq = v)
# generate tf_idf wordcloud
cp <- brewer.pal(7,"YlOrRd")
wordcloud::wordcloud(d$word, d$freq, max.words = 100, random.order = FALSE, colors = cp)
Topic Modelling is a mathematical technique, here using Latent Dirichlet Allocation (LDA), which determines which groups of terms appear together most frequently, giving a strong indication that they may all be associated with a similar topic or subject.
Topic Modelling requires the number of topics to generate from the corpus of documents to be specified in advance. A technique known as LDA tuning can be used to determine the number of topics that will give the clearest, most interpretable set of topics. We will not cover LDA tuning in detail in this document, but it is recommended to follow this process to avoid generating too many or too few topics for good interpretability.
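The ldatuning package loaded earlier offers one way to run this process. The following is a minimal sketch rather than part of the analysis: it assumes a Document-Term Matrix like the DTM created in the next code block, and the candidate topic range and metrics are illustrative choices.
# sketch of LDA tuning with ldatuning (assumes a DocumentTermMatrix called DTM)
tuning <- ldatuning::FindTopicsNumber(
DTM,
topics = seq(2, 20, by = 2), # candidate numbers of topics (illustrative)
metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 1234),
verbose = TRUE
)
ldatuning::FindTopicsNumber_plot(tuning) # look for where the metrics level off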
LDA tuning on this corpus of Facebook posts reveals the optimal number of topics to be around 8-10. Therefore we proceed to generate 8 topics from the corpus. We use the package tm as before to generate a Document-Term Matrix (the transpose of a Term-Document Matrix). We ensure that we remove any empty documents from the corpus (which may have been created by the earlier text cleaning) and then use the package topicmodels to perform the LDA, determining the 8 groups of terms and the frequency of occurrence of each group. We then use tidytext, dplyr, ggplot2 and plotly to display the results graphically in order to allow interpretation.
# enter value of ntop (required number of topics - consider using LDA tuning functions to help)
ntop <- 8 # <--- desired number of topics
# Generate Document Term Matrix
DTM <- tm::DocumentTermMatrix(corpus)
# remove empty rows from DTM and corpus
rowTotals <- apply(DTM, 1, sum)
empty.rows <- DTM[rowTotals == 0, ]$dimnames[1][[1]]
if (length(empty.rows) > 0) {
corpus <- corpus[-as.numeric(empty.rows)]
}
DTM <- tm::DocumentTermMatrix(corpus)
# run the LDA process, setting a seed so that the output of the model is reproducible
corpus_lda <- topicmodels::LDA(DTM, k = ntop, control = list(seed = 1234))
# tidy words in topics
corpus_topics <- tidytext::tidy(corpus_lda, matrix = "beta")
# generate top ten word lists
topic_top_terms <- corpus_topics %>%
dplyr::group_by(topic) %>%
dplyr::top_n(10, beta) %>%
dplyr::ungroup() %>%
dplyr::arrange(topic, -beta)
# get proportion of posts by most likely topic
prop <- topicmodels::get_topics(corpus_lda)
prop_topics <- table(prop) %>%
prop.table()
# build top terms and topic proportion plots
p5 <- topic_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
p6 <- plotly::plot_ly(as.data.frame(prop_topics), x = ~prop, y = ~Freq, type = 'bar') %>%
plotly::layout(xaxis = list(title = "Topic ID"), yaxis = list(title = "Proportion of posts"))
# display charts
p5
p6
Time and date analysis is self-explanatory, except to say that times are in UTC/GMT. In the case of this timeline, I was resident in Australia for the majority of the period, so times could be adjusted for this by adding around 10 hours. This would render a very normal daily activity trend, with posting mostly in the mornings and early evenings.
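As a minimal sketch of one way to make this adjustment, the time_vec object created earlier can be re-formatted in a different timezone (Australia/Sydney is an illustrative choice):
# optional: re-express posting hours in Australian Eastern time rather than GMT
hour_vec_aest <- format(time_vec, "%H", tz = "Australia/Sydney") %>% as.factor()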
Wordcloud analysis reveals the most important/meaningful words in posts and is self-explanatory. child1 and child2 are placeholders for my real children’s names, protected for their privacy.
The topics produced by Topic Modelling can be interpreted by looking at the individual words commonly found in each topic grouping. Examples include Happy Birthday postings (likely by others onto my timeline), children and family related posts, television related posts (including the Mad Men TV show), football related posts (including my support of Everton FC) and some posts about Australian political news (abbott = Tony Abbott, Prime Minister of Australia 2013-15). The top three topics by frequency appear to relate to TV (topic 8), birthdays (topic 1) and family (topic 2).