For my digital shadow project, I looked into the volume and language of the messages between me and my girlfriend, and between me and my best friend. I'll take you through my analysis step by step, but first, let's grab some libraries and some data. Let's assume we've already exported our Facebook messages to CSV and saved the file in our working directory as "fbData.csv".
library(dplyr)
library(knitr)
library(tm)
library(ggplot2)
library(wordcloud)
library(scales)
library(SnowballC)
library(reshape2)
setwd("E:/DataViz2015")
messages = read.csv("fbData.csv")
kable(messages[25:35,], row.names = FALSE)
| date | fname | tname | message | dateCut |
|---|---|---|---|---|
| 10/14/2014 15:34 | Lauren Ecker | Coby Sonnenschein Durling | I don’t have wifi now | 10/1/2014 |
| 10/14/2014 14:47 | Lauren Ecker | Coby Sonnenschein Durling | Oh | 10/1/2014 |
| 10/14/2014 14:46 | Lauren Ecker | Coby Sonnenschein Durling | Well can’t you tell them you need more then that | 10/1/2014 |
| 10/14/2014 14:44 | Lauren Ecker | Coby Sonnenschein Durling | I’m gonna loose wifi soon | 10/1/2014 |
| 10/14/2014 14:43 | Lauren Ecker | Coby Sonnenschein Durling | Well you’ll have to pay for gas too | 10/1/2014 |
| 10/14/2014 14:42 | Lauren Ecker | Coby Sonnenschein Durling | Why | 10/1/2014 |
| 10/14/2014 14:41 | Lauren Ecker | Coby Sonnenschein Durling | Got it | 10/1/2014 |
| 10/14/2014 14:41 | Lauren Ecker | Coby Sonnenschein Durling | Ohhhhhh! | 10/1/2014 |
| 10/14/2014 14:34 | Lauren Ecker | Coby Sonnenschein Durling | How far was it | 10/1/2014 |
| 10/14/2014 14:30 | Lauren Ecker | Coby Sonnenschein Durling | Good! How did you get there | 10/1/2014 |
| 10/14/2014 14:04 | Lauren Ecker | Coby Sonnenschein Durling | did you like them? | 10/1/2014 |
My hypothesis is that I message my friend Luke and my girlfriend Lauren far more than any of my other contacts. (Note that the dateCut column in the preview above bins each message to the first of its month; we'll use it for monthly aggregation later.) Let's take a look at my message frequency across all of my contacts to confirm that's the case.
# Coby will always be the recipient or the sender, so we can combine to and from correspondence like this:
messages$name = ifelse(grepl("Coby", as.character(messages$fname)), as.character(messages$tname), as.character(messages$fname))
messagesGrouped = group_by(messages, name)
messagesGrouped = summarise(messagesGrouped, count = length(name))
messagesGrouped = filter(messagesGrouped, count > 100)
ggplot(data = messagesGrouped, aes(x = name, y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust = .4))
Seems like I was right. If you noticed that Luke has two entries here because his Facebook name included a middle name at one point, then you have a good eye! I'm not going to worry about that, since his name has been without the middle name for some time, and later we'll filter the data down to only the more recent messages.
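If we did want to fold the two entries into one, though, a quick fix would be something like the following. This is just a sketch: it collapses any display name containing both "Lucas" and "Padovani" into the current one.
# Optional: merge the old middle-name variant into the current display name
messages$name = ifelse(grepl("Lucas", messages$name) & grepl("Padovani", messages$name),
"Lucas Padovani", messages$name)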
My message volume skyrocketed once I got Facebook Messenger on my phone, and it has become one of my main forms of communication. Let's take a look at my message volume over time.
messagesGroupedDate = group_by(messages, dateCut)
messagesGroupedDate = summarise(messagesGroupedDate, count = length(dateCut))
messagesGroupedDate$dateCut = as.Date(as.character(messagesGroupedDate$dateCut), "%m/%d/%Y")
ggplot(data = messagesGroupedDate, aes(x = dateCut, y = count)) +
geom_area(fill = "steelblue") +
theme(axis.text.x = element_text(angle=90)) +
scale_x_date(labels = date_format("%m/%y"), breaks = date_breaks("2 months"))
Looks like my message volume really took off around June 2014 or so.
Now let's keep only the messages exchanged with Lauren or Luke since June 2014. This will give us a cleaner dataset.
messages$dateCut = as.Date(as.character(messages$dateCut), "%m/%d/%Y")
messages = filter(messages, name %in% c("Lauren Ecker", "Lucas Padovani") & dateCut > as.Date("2014-06-01","%Y-%m-%d"))
table(messages$name)
##
## Lauren Ecker Lucas Padovani
## 24985 16970
table(messages$dateCut)
##
## 2014-07-01 2014-08-01 2014-09-01 2014-10-01 2014-11-01 2014-12-01
## 174 958 3352 3293 2882 1959
## 2015-01-01 2015-02-01 2015-03-01 2015-04-01 2015-05-01 2015-06-01
## 3392 2929 4801 5449 4285 3373
## 2015-07-01 2015-08-01 2015-09-01
## 1847 1338 1923
messagesGrouped = group_by(messages, name, dateCut)
messagesGrouped = summarise(messagesGrouped, count = length(message))
Now that we’ve got nice clean data, let’s see what our volume looks like over time.
ggplot(data = messagesGrouped, aes(y = count, x = dateCut, color = name)) +
geom_line(size = 2) +
annotate("rect", xmin = as.Date("2014-09-01"), xmax = as.Date("2014-12-01"),
ymin = -Inf, ymax = Inf, alpha = 0.2, fill = "red") +
annotate("text", x = as.Date("2014-10-15"), y = 3500, label = "Lauren studied abroad", size = 3) +
annotate("rect", xmin = as.Date("2015-06-01"), xmax = as.Date("2015-08-01"),
ymin = -Inf, ymax = Inf, alpha = 0.2, fill = "red") +
annotate("text", x = as.Date("2015-07-15"), y = 3000, label = "Lauren and I lived together", size = 3)
It seems that my volume of conversation with Lauren is more consistent than it is with Luke: during Lauren's time abroad we spoke over Facebook more, and when we lived together this summer we spoke over Facebook less. My volume with Luke had a very large spike in the spring of 2015; we'll look into this later.
We'll need to monkey around with the data a little bit to do some content analysis. First, let's create a document for each connection and each month, e.g. Lauren->Coby, December 2014. We'll treat the collection of these documents as our corpus.
# Replace periods with spaces so URL-like strings (e.g. example.com/something) break into separate, non-garbage tokens
messages$message = gsub(".", " ", messages$message, fixed = TRUE)
messagesCorpus = group_by(messages, fname, tname, dateCut)
messagesCorpus = summarise(messagesCorpus, doc = paste(message, collapse = " "))
setwd("E:/DataViz2015/Documents")
# Write one text file per (sender, recipient, month), e.g. "Coby Laure2014-07-01.txt"
for ( i in 1:length(messagesCorpus$doc)) {
write(messagesCorpus$doc[i], paste(substr(messagesCorpus$fname[i], 0,5), substr(messagesCorpus$tname[i], 0, 5),
messagesCorpus$dateCut[i], ".txt", sep = ""))
}
We’ve now created a directory with all of our documents. Let’s load it back up as a corpus.
cname <- file.path("E:/DataViz2015", "Documents")
docs <- Corpus(DirSource(cname))
## Preprocessing
docs <- tm_map(docs, removePunctuation) # *Removing punctuation:*
docs <- tm_map(docs, removeNumbers) # *Removing numbers:*
docs <- tm_map(docs, tolower) # *Converting to lowercase:* (newer tm versions prefer tm_map(docs, content_transformer(tolower)))
docs <- tm_map(docs, removeWords, c(stopwords("english"), "okay", "well", "just", "like", "dont", "good", "yeah", "thats", "get", "want", "can", "youre", "didnt")) # *Removing "stopwords"
docs <- tm_map(docs, stemDocument) # *Removing common word endings* (e.g., "ing", "es")
docs <- tm_map(docs, stripWhitespace) # *Stripping whitespace
docs <- tm_map(docs, PlainTextDocument) # *Re-wrapping as plain text documents*
dtm <- DocumentTermMatrix(docs)
dtms <- removeSparseTerms(dtm, 0.8) # keep only terms that appear in at least ~20% of documents
rownames(dtms) = list.files(cname) # label documents with their file names
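Before computing tf-idf, it's worth a quick sanity check on the matrix: its dimensions and a handful of the most frequent terms (output omitted here).
dim(dtms) # number of documents x number of terms kept
findFreqTerms(dtms, lowfreq = 50) # terms with a total count of at least 50 across all documents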
## Inverse Document Frequency Calculation
# For each term (row), count how many documents (columns) contain it,
# then compute IDF = log(total documents / documents containing the term)
idf = function(x) {
idfVec = vector(mode = "numeric")
for (i in 1:nrow(x)) {
docCount = 0
for (j in 1:ncol(x)) {
docCount = docCount + ifelse(x[i,j] != 0, 1, 0)
}
idfVec = c(idfVec, log(ncol(x)/docCount))
}
return(idfVec)
}
tfMatrix = as.data.frame(t(as.matrix(dtms))) # terms as rows, documents as columns
tfIdfMatrix = as.data.frame(tfMatrix*idf(tfMatrix)) # weight each term's counts by its IDF
# See what we did?
tfIdfMatrix[1:20, 1:2]
## Coby Laure2014-07-01.txt Coby Laure2014-08-01.txt
## ability 0 0.0000000
## able 0 0.0000000
## absolutely 0 0.0000000
## accept 0 0.0000000
## account 0 1.0986123
## across 0 0.0000000
## act 0 0.0000000
## action 0 0.0000000
## actual 0 0.0000000
## actually 0 0.1625189
## add 0 0.0000000
## afford 0 2.0996442
## afternoon 0 0.0000000
## age 0 0.0000000
## ago 0 0.0000000
## agree 0 0.0000000
## agreed 0 0.0000000
## ahaha 0 0.0000000
## ahead 0 0.0000000
## air 0 0.0000000
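As a quick cross-check on the loop above: for each term, rowSums can count the documents containing it, and the IDF is just the log of the total document count over that number. This isn't part of the pipeline, just a sanity check that the function behaves as intended.
# Vectorized IDF: log(number of documents / documents containing the term)
idfCheck = log(ncol(tfMatrix) / rowSums(tfMatrix != 0))
all.equal(idf(tfMatrix), unname(idfCheck)) # should return TRUE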
Phew. Now that we have our text processed, let’s look at the results!
Let's reshape the data and see which words are important. How about some wordclouds based on our tf-idf scores, plus a look at the top terms?
tfIdfMatrix$word = rownames(tfIdfMatrix)
melted = melt(tfIdfMatrix) # one row per word/document pair
# Recover the sender, recipient, and contact name from the document file names
melted$fname = ifelse(substr(melted$variable, 1, 2) == "La", "Lauren",
ifelse(substr(melted$variable, 1, 2) == "Lu", "Luke", "Coby"))
melted$tname = ifelse(melted$fname == "Coby", ifelse(grepl("Laur", melted$variable), "Lauren", "Luke"), "Coby")
melted$name = ifelse(grepl("Coby", as.character(melted$fname)), as.character(melted$tname), as.character(melted$fname))
# Recover the month by stripping ".txt" and keeping the trailing date portion of the file name
melted$dateCut = substr(as.character(melted$variable), 1, nchar(as.character(melted$variable)) - 4)
melted$dateCut = substr(melted$dateCut, nchar(melted$dateCut) - 9, nchar(melted$dateCut))
melted = melted[order(-melted$value),]
head(melted[,c("word", "dateCut", "name")], 10)
## word dateCut name
## 37203 bean 2014-10-01 Lauren
## 27561 lmao 2015-04-01 Luke
## 3045 rokan 2014-09-01 Lauren
## 38328 bean 2014-11-01 Lauren
## 2328 bean 2014-09-01 Lauren
## 13170 rokan 2015-06-01 Lauren
## 37472 florida 2014-10-01 Lauren
## 3453 bean 2014-10-01 Lauren
## 4578 bean 2014-11-01 Lauren
## 45078 bean 2015-05-01 Lauren
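Since we loaded the wordcloud package up front, here's one way to get those clouds: one per person, with word size driven by summed tf-idf across all months. This is a sketch built on the melted data frame above (plots not shown).
# One wordcloud per person, sized by total tf-idf
for (person in c("Lauren", "Luke")) {
personTerms = melted %>% filter(name == person) %>% group_by(word) %>% summarise(val = sum(value))
wordcloud(personTerms$word, personTerms$val, min.freq = 1, max.words = 50, colors = "steelblue")
}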
Hmm... well, clearly “rokan” and “bean” are key words here. But let's take those out for a second to see if there's anything else.
melted2 = subset(melted, !melted$word %in% c("rokan", "bean"))
head(melted2[,c("word", "dateCut", "name")], 10)
## word dateCut name
## 27561 lmao 2015-04-01 Luke
## 37472 florida 2014-10-01 Lauren
## 28686 lmao 2015-05-01 Luke
## 30029 reddit 2015-06-01 Luke
## 38597 florida 2014-11-01 Lauren
## 27779 reddit 2015-04-01 Luke
## 36394 gonna 2014-09-01 Lauren
## 27462 httpi 2015-04-01 Luke
## 42842 birthday 2015-03-01 Lauren
## 27095 black 2015-04-01 Luke
That’s better. Let’s get a quick visual of these words.
ggplot(data = group_by(melted2, word) %>% summarise(val = sum(value)) %>% filter(val > 85), aes(x = word, y = val)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust = .4))
Let’s get a better look at these words over time. Remember the April 2015 spike with Luke? Let’s see if we can learn some more about that.
melted3 = subset(melted, melted$word %in% c("florida", "black", "cops", "love", "bean", "youtube", "reddit", "beanest", "media", "lmao", "police", "sad"))
melted3$nameword = paste(melted3$name, melted3$word, sep = "-")
melted3grouped = group_by(melted3, nameword)
melted3grouped = summarize(melted3grouped, maxVal = max(value))
melted3grouped = subset(melted3grouped, melted3grouped$maxVal > 5)
melted3 = inner_join(melted3, melted3grouped)
melted3$dateCut = as.Date(melted3$dateCut)
head(melted3grouped)
## Source: local data frame [6 x 2]
##
## nameword maxVal
## (chr) (dbl)
## 1 Lauren-bean 58.684784
## 2 Lauren-beanest 12.543720
## 3 Lauren-florida 35.407634
## 4 Lauren-love 7.586881
## 5 Lauren-sad 8.724585
## 6 Luke-black 21.339921
# Above, we created a "name-word" field to have something unique to plot, and the maxVal filter dropped the combinations that don't matter, e.g. "Luke-bean".
ggplot(data = melted3, aes(x = dateCut, y = value, fill = nameword)) +
geom_bar(stat = "identity") +
facet_wrap(~ nameword) +
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust = .4), legend.position = "none") +
scale_x_date(labels = date_format("%m/%y"), breaks = date_breaks("2 months"))
It looks like the spike in Luke’s volume is potentially related to the theme of race and police violence that our country has been experiencing. April 2015 coincides with the death of Freddie Gray, and the Baltimore riots that followed. Words like youtube, reddit, and lmao also spiked during that time, but this is likely due to an increase in overall communication. What’s interesting is that our use of lmao seems to have begun during that time, and become a permanent part of our vocabulary.
On a more superficial note, it’s interesting to see the trend for Florida between Lauren and myself. We went in December 2014. My girlfriend and I seem to have a downward trend of “sad”, which is always nice to see. And as far as nicknames go, “bean” is declining, while “beanest” gains popularity.
Hopefully you've enjoyed analyzing my personal relationships along with me. This has been a very interesting project, and I've gained a lot of experience working with text data, as well as extracting the data from HTML to CSV in Python, which was not covered here. I hope this has been relatively easy to follow, educational, and demonstrative of basic data visualization principles.