Intro

For my digital shadow project, I looked into the volume and the language of the messages between me and my girlfriend, and between me and my best friend. I’ll take you through my analysis step by step, but first, let’s grab some libraries and some data. Let’s assume we’ve already exported our Facebook messages to CSV and saved the file in our working directory as “fbData.csv”.

library(dplyr)
library(knitr)
library(tm)
library(ggplot2)
library(wordcloud)
library(scales)
library(SnowballC)
library(reshape2)

setwd("E:/DataViz2015")
messages = read.csv("fbData.csv")
kable(messages[25:35,], row.names = FALSE)
|date             |fname        |tname                     |message                                          |dateCut   |
|:----------------|:------------|:-------------------------|:------------------------------------------------|:---------|
|10/14/2014 15:34 |Lauren Ecker |Coby Sonnenschein Durling |I don’t have wifi now                             |10/1/2014 |
|10/14/2014 14:47 |Lauren Ecker |Coby Sonnenschein Durling |Oh                                                |10/1/2014 |
|10/14/2014 14:46 |Lauren Ecker |Coby Sonnenschein Durling |Well can’t you tell them you need more then that  |10/1/2014 |
|10/14/2014 14:44 |Lauren Ecker |Coby Sonnenschein Durling |I’m gonna loose wifi soon                         |10/1/2014 |
|10/14/2014 14:43 |Lauren Ecker |Coby Sonnenschein Durling |Well you’ll have to pay for gas too               |10/1/2014 |
|10/14/2014 14:42 |Lauren Ecker |Coby Sonnenschein Durling |Why                                               |10/1/2014 |
|10/14/2014 14:41 |Lauren Ecker |Coby Sonnenschein Durling |Got it                                            |10/1/2014 |
|10/14/2014 14:41 |Lauren Ecker |Coby Sonnenschein Durling |Ohhhhhh!                                          |10/1/2014 |
|10/14/2014 14:34 |Lauren Ecker |Coby Sonnenschein Durling |How far was it                                    |10/1/2014 |
|10/14/2014 14:30 |Lauren Ecker |Coby Sonnenschein Durling |Good! How did you get there                       |10/1/2014 |
|10/14/2014 14:04 |Lauren Ecker |Coby Sonnenschein Durling |did you like them?                                |10/1/2014 |

Who do I message?

My hypothesis is that I message my friend Luke and my girlfriend Lauren far more than any of my other contacts. Let’s take a look at my message frequency across all of my contacts to confirm that this is the case.

# Coby will always be the recipient or the sender, so we can combine to and from correspondence like this:
messages$name = ifelse(grepl("Coby", as.character(messages$fname)), as.character(messages$tname), as.character(messages$fname))
messagesGrouped = group_by(messages, name)
messagesGrouped = summarise(messagesGrouped, count = length(name))
messagesGrouped = filter(messagesGrouped, count > 100)
ggplot(data = messagesGrouped, aes(x = name, y = count)) + 
  geom_bar(stat = "identity", fill = "steelblue") + 
  theme(axis.text.x  = element_text(angle=90, hjust = 1, vjust = .4))

Seems like I was right. If you noticed that Luke has two entries here because he had a middle name on Facebook at one point, then you have a good eye! I’m not going to worry about that, since his name has been without the middle name for some time, and later we’ll filter the data so that we only see more recent messages.
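(If you did want to fold those two entries together instead of relying on the date filter, something like the regex below would do it. This is just a sketch, and the exact middle-name spelling is assumed rather than shown here.)

# Hypothetical cleanup: collapse any "Lucas <middle name> Padovani" variant into "Lucas Padovani"
messages$name = sub("^Lucas .+ Padovani$", "Lucas Padovani", messages$name)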

Timeframe

My message volume skyrocketed once I got Facebook Messenger on my phone, and it has become one of my main forms of communication. Let’s take a look at my message volume over time.

messagesGroupedDate = group_by(messages, dateCut)
messagesGroupedDate = summarise(messagesGroupedDate, count = length(dateCut))
messagesGroupedDate$dateCut = as.Date(as.character(messagesGroupedDate$dateCut), "%m/%d/%Y")
ggplot(data = messagesGroupedDate, aes(x = dateCut, y = count)) + 
  geom_area(fill = "steelblue") + 
  theme(axis.text.x  = element_text(angle=90)) +
  scale_x_date(labels = date_format("%m/%y"), breaks = date_breaks("2 months"))

Looks like my message volume really took off around June 2014 or so.

Trimming the fat

Now let’s keep just the messages to and from either Lauren or Luke since June 2014. This will give us a cleaner dataset.

messages$dateCut = as.Date(as.character(messages$dateCut), "%m/%d/%Y")
messages = filter(messages, name %in% c("Lauren Ecker", "Lucas Padovani") & dateCut > as.Date("2014-06-01","%Y-%m-%d"))

table(messages$name)
## 
##   Lauren Ecker Lucas Padovani 
##          24985          16970
table(messages$dateCut)
## 
## 2014-07-01 2014-08-01 2014-09-01 2014-10-01 2014-11-01 2014-12-01 
##        174        958       3352       3293       2882       1959 
## 2015-01-01 2015-02-01 2015-03-01 2015-04-01 2015-05-01 2015-06-01 
##       3392       2929       4801       5449       4285       3373 
## 2015-07-01 2015-08-01 2015-09-01 
##       1847       1338       1923
messagesGrouped = group_by(messages, name, dateCut)
messagesGrouped = summarise(messagesGrouped, count = length(message))

Visualizing volume over time

Now that we’ve got nice clean data, let’s see what our volume looks like over time.

ggplot(data = messagesGrouped, aes(y = count, x = dateCut, color = name)) + 
  geom_line(size = 2) +
  annotate("rect", xmin = as.Date("2014-09-01"), xmax = as.Date("2014-12-01"), 
           ymin = -Inf, ymax = Inf, alpha = 0.2, fill = "red") +
  annotate("text", x = as.Date("2014-10-15"), y = 3500, label = "Lauren studied abroad", size = 3) +
  annotate("rect", xmin = as.Date("2015-06-01"), xmax = as.Date("2015-08-01"), 
           ymin = -Inf, ymax = Inf, alpha = 0.2, fill = "red") +
  annotate("text", x = as.Date("2015-07-15"), y = 3000, label = "Lauren and I lived together", size = 3)

It seems that my volume of conversation with Lauren is more consistent than it is with Luke: during Lauren’s time abroad we spoke over Facebook more, and when we lived together this summer we spoke over Facebook less. My volume with Luke had a very large spike in the spring of 2015; we’ll look into this later.
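To put a rough number on “more consistent”, here’s a quick sketch (using the monthly counts we already computed) of the coefficient of variation of message volume per person; a lower value means steadier month-to-month volume.

# Coefficient of variation of monthly message counts, per person (lower = steadier)
group_by(messagesGrouped, name) %>%
  summarise(cv = sd(count) / mean(count))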

Content Analysis

We’ll need to monkey around with the data a little bit to do some content analysis. First, let’s create a document for each connection and each month (e.g. Lauren -> Coby, December 2014). We’ll treat the collection of these documents as our corpus.

# Replace periods with spaces so URLs like example.com/something don't turn into single garbage tokens
messages$message = gsub(".", " ", messages$message, fixed = TRUE)


# Concatenate all messages for each (sender, recipient, month) combination into one document
messagesCorpus = group_by(messages, fname, tname, dateCut)
messagesCorpus = summarise(messagesCorpus, doc = paste(message, collapse = " "))

# Write each document to its own text file, named e.g. "Coby Laure2014-12-01.txt"
setwd("E:/DataViz2015/Documents")
for (i in 1:length(messagesCorpus$doc)) {
  write(messagesCorpus$doc[i],
        paste(substr(messagesCorpus$fname[i], 1, 5), substr(messagesCorpus$tname[i], 1, 5),
              messagesCorpus$dateCut[i], ".txt", sep = ""))
}

We’ve now created a directory with all of our documents. Let’s load it back up as a corpus.

cname <- file.path("E:/DataViz2015", "Documents")
docs <- Corpus(DirSource(cname))

## Preprocessing      
docs <- tm_map(docs, removePunctuation)   # remove punctuation
docs <- tm_map(docs, removeNumbers)       # remove numbers
docs <- tm_map(docs, tolower)             # convert to lowercase
docs <- tm_map(docs, removeWords, c(stopwords("english"), "okay", "well", "just", "like", "dont", "good", "yeah", "thats", "get", "want", "can", "youre", "didnt"))   # remove stopwords and common filler words
docs <- tm_map(docs, stemDocument)        # stem common word endings (e.g. "ing", "es")
docs <- tm_map(docs, stripWhitespace)     # strip extra whitespace
docs <- tm_map(docs, PlainTextDocument)   # make sure everything is treated as plain text
dtm <- DocumentTermMatrix(docs)           # build the document-term matrix
dtms <- removeSparseTerms(dtm, 0.8)       # drop terms absent from more than 80% of the documents
rownames(dtms) = list.files(cname)        # name each row after its source file
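One note if you’re following along with a newer version of the tm package (0.6 or later): base R functions like tolower need to be wrapped in content_transformer(), otherwise the corpus entries stop being proper documents. A minimal sketch of that one step (the plain call above worked with the tm version used here):

# With tm >= 0.6, wrap base R functions so the corpus keeps its document class
docs <- tm_map(docs, content_transformer(tolower))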

## Inverse Document Frequency Calculation

# idf(x): x has one row per term and one column per document; for each term,
# idf = log(number of documents / number of documents containing the term)
idf = function(x) {
  idfVec = vector(mode = "numeric")
  for (i in 1:length(x[,1])) {        # loop over terms (rows)
    docCount = 0
    for (j in 1:length(x[1,])) {      # loop over documents (columns)
      docCount = docCount + ifelse(x[i,j] != 0, 1, 0)
    }
    idfVec = c(idfVec, log(ncol(x)/docCount))
  }
  return(idfVec)
}

tfMatrix = as.data.frame(t(as.matrix(dtms)))
tfIdfMatrix = as.data.frame(tfMatrix*idf(tfMatrix))
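As an aside, the double loop in idf() can be written as a vectorized equivalent. This is just a sketch for reference and isn’t used in the rest of the analysis:

# Same idf values, computed without the explicit loops
docFreq = rowSums(tfMatrix != 0)          # number of documents containing each term
idfVec = log(ncol(tfMatrix) / docFreq)    # idf = log(N / document frequency)
# all.equal(unname(idfVec), idf(tfMatrix)) should return TRUE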

# See what we did?
tfIdfMatrix[1:20, 1:2]
##            Coby Laure2014-07-01.txt Coby Laure2014-08-01.txt
## ability                           0                0.0000000
## able                              0                0.0000000
## absolutely                        0                0.0000000
## accept                            0                0.0000000
## account                           0                1.0986123
## across                            0                0.0000000
## act                               0                0.0000000
## action                            0                0.0000000
## actual                            0                0.0000000
## actually                          0                0.1625189
## add                               0                0.0000000
## afford                            0                2.0996442
## afternoon                         0                0.0000000
## age                               0                0.0000000
## ago                               0                0.0000000
## agree                             0                0.0000000
## agreed                            0                0.0000000
## ahaha                             0                0.0000000
## ahead                             0                0.0000000
## air                               0                0.0000000

Phew. Now that we have our text processed, let’s look at the results!

Let’s see which words are important. How about a quick wordcloud based on term frequency, and then we’ll reshape the data for a closer look?
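Here’s the quick wordcloud, built straight from raw term frequencies in the document-term matrix. This is a minimal sketch; the word limit and color palette are arbitrary choices.

# Raw term frequencies across all documents (not tf-idf weighted)
library(RColorBrewer)                     # for brewer.pal; usually comes along with wordcloud
freq = colSums(as.matrix(dtms))
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))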

tfIdfMatrix$word = rownames(tfIdfMatrix)
melted = melt(tfIdfMatrix)                     # one row per (word, document) pair
# Recover sender, recipient, and counterpart name from the document file names
melted$fname = ifelse(substr(melted$variable, 1, 2) == "La", "Lauren", 
                      ifelse(substr(melted$variable, 1, 2) == "Lu", "Luke", "Coby"))
melted$tname = ifelse(melted$fname == "Coby", ifelse(grepl("Laur", melted$variable), "Lauren", "Luke"), "Coby")
melted$name = ifelse(grepl("Coby", as.character(melted$fname)), as.character(melted$tname), as.character(melted$fname))
# Recover the month from the file name (the 10 characters before ".txt")
melted$dateCut = substr(as.character(melted$variable), 1, nchar(as.character(melted$variable)) - 4)
melted$dateCut = substr(melted$dateCut, nchar(melted$dateCut) - 9, nchar(melted$dateCut))

melted = melted[order(-melted$value),]         # sort by tf-idf, highest first
head(melted[,c("word", "dateCut", "name")], 10)
##          word    dateCut   name
## 37203    bean 2014-10-01 Lauren
## 27561    lmao 2015-04-01   Luke
## 3045    rokan 2014-09-01 Lauren
## 38328    bean 2014-11-01 Lauren
## 2328     bean 2014-09-01 Lauren
## 13170   rokan 2015-06-01 Lauren
## 37472 florida 2014-10-01 Lauren
## 3453     bean 2014-10-01 Lauren
## 4578     bean 2014-11-01 Lauren
## 45078    bean 2015-05-01 Lauren

Hmm… well, clearly “rokan” and “bean” are the key words here. But let’s take those out for a second to see if there’s anything else.

melted2 = subset(melted, !melted$word %in% c("rokan", "bean"))
head(melted2[,c("word", "dateCut", "name")], 10)
##           word    dateCut   name
## 27561     lmao 2015-04-01   Luke
## 37472  florida 2014-10-01 Lauren
## 28686     lmao 2015-05-01   Luke
## 30029   reddit 2015-06-01   Luke
## 38597  florida 2014-11-01 Lauren
## 27779   reddit 2015-04-01   Luke
## 36394    gonna 2014-09-01 Lauren
## 27462    httpi 2015-04-01   Luke
## 42842 birthday 2015-03-01 Lauren
## 27095    black 2015-04-01   Luke

That’s better. Let’s get a quick visual of these words.

ggplot(data = group_by(melted2, word) %>% summarise(val = sum(value)) %>% filter(val > 85), aes(x = word, y = val)) + 
  geom_bar(stat = "identity", fill = "steelblue") +
   theme(axis.text.x  = element_text(angle=90, hjust = 1, vjust = .4))

Let’s get a better look at these words over time. Remember the April 2015 spike with Luke? Let’s see if we can learn some more about that.

melted3 = subset(melted, melted$word %in% c("florida", "black", "cops", "love", "bean", "youtube", "reddit", "beanest", "media", "lmao", "police", "sad"))
melted3$nameword = paste(melted3$name, melted3$word, sep = "-")
melted3grouped = group_by(melted3, nameword)
melted3grouped = summarize(melted3grouped, maxVal = max(value))
melted3grouped = subset(melted3grouped, melted3grouped$maxVal > 5)
melted3 = inner_join(melted3, melted3grouped)
melted3$dateCut = as.Date(melted3$dateCut)
head(melted3grouped)
## Source: local data frame [6 x 2]
## 
##         nameword    maxVal
##            (chr)     (dbl)
## 1    Lauren-bean 58.684784
## 2 Lauren-beanest 12.543720
## 3 Lauren-florida 35.407634
## 4    Lauren-love  7.586881
## 5     Lauren-sad  8.724585
## 6     Luke-black 21.339921
# Above, we created a "name-word" field to have something unique to plot, and the maxVal > 5 filter removed the combinations that don't matter, e.g. "Luke-bean".
ggplot(data = melted3, aes(x = dateCut, y = value, fill = nameword)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ nameword) +
  theme(axis.text.x  = element_text(angle=90, hjust = 1, vjust = .4), legend.position = "none") +
  scale_x_date(labels = date_format("%m/%y"), breaks = date_breaks("2 months"))

It looks like the spike in Luke’s volume is potentially related to the theme of race and police violence that our country has been experiencing. April 2015 coincides with the death of Freddie Gray, and the Baltimore riots that followed. Words like youtube, reddit, and lmao also spiked during that time, but this is likely due to an increase in overall communication. What’s interesting is that our use of lmao seems to have begun during that time, and become a permanent part of our vocabulary.
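As a rough sanity check on that last claim, here’s a sketch that counts the raw messages with Luke containing “lmao” by month, using the messages data frame from before the corpus processing:

# Count raw messages with Luke that contain "lmao", by month
filter(messages, name == "Lucas Padovani", grepl("lmao", message, ignore.case = TRUE)) %>%
  group_by(dateCut) %>%
  summarise(count = n())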

On a more superficial note, it’s interesting to see the trend for Florida between Lauren and me; we went in December 2014. My girlfriend and I seem to have a downward trend in “sad”, which is always nice to see. And as far as nicknames go, “bean” is declining while “beanest” gains popularity.

Conclusion

Hopefully you’ve enjoyed analyzing my personal relationships with me. This has been a very interesting project for me, and I’ve gained a lot of experience working with text data, as well as extracting the data from HTML to CSV in Python (which was not covered here). I hope this has been relatively easy to follow, educational, and demonstrative of basic data visualization principles.