Adapted from: http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

WhatsApp is one of the world’s most popular messaging apps, with billions of messages exchanged every day. I was curious about what treasures my own archives hold and decided to dig a little deeper. In this post I will demonstrate how you can conduct simple text analysis on any WhatsApp chat.

First, export all messages to a text document by going to “Group Info” for group chats or “Contact Info” for private chats. Media files are excluded from the export. I will perform the analysis on one of my private chats. Let’s read in the data file, convert it into a corpus and inspect a few lines.

library("tm")
text <- readLines("D:\\wiley\\yh.txt")
docs <- Corpus(VectorSource(text))
length(docs)
## [1] 8386
inspect(docs[10:15])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 6
## 
## [1] [8/2/17, 10:25:38 PM] YH: Yep from NIE! Do you want to cab tgt? :D                                    
## [2] [8/2/17, 10:35:10 PM] GHA: cab? woah                                                                  
## [3] [8/2/17, 10:35:16 PM] GHA: got time right?                                                            
## [4] [8/2/17, 10:35:27 PM] GHA: what time do you end?                                                      
## [5] [8/2/17, 10:35:37 PM] GHA: i end at 1230 but i think i can ask my tutor to let me off earlier         
## [6] [8/2/17, 11:49:27 PM] YH: Haha okay then! I don't have classes that day but I am still staying in hall

1. Word cloud analysis

We observe that there are 8386 entries in docs. Each entry corresponds to a message sent by either party. Each message starts with a date-time stamp, followed by the name of the sender and then the content. For the word cloud analysis we are only interested in the contents of the messages, so we would like to strip away the metadata and aggregate the contents of all messages into one body of text.

Let’s begin by performing some cleaning on the text to prepare it for analysis. We make use of the function tm_map to apply various built-in transformations to the text. Where we require finer control, we can create a content_transformer function to remove specific characters from the text.

toSpace <- content_transformer(function(y, pattern) gsub(pattern, "", y))

toSpace is a content_transformer function which itself wraps a call to gsub. Essentially, y is the text to which the function is applied and pattern represents the specific characters to look out for. When applied, toSpace finds all instances of the pattern and replaces them with an empty string "", which is R’s way of removing text.
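
To see the mechanics in isolation, here is gsub() with an empty replacement applied to a made-up string (not from the chat):

gsub("[0-9]+", "", "call me at 91234567 tonight")
## [1] "call me at  tonight"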

We start off by converting the text to lower case, so that “Please” and “please” count as the same word; otherwise, R treats them as separate words. We also want to remove numbers for this analysis.

docs = tm_map(docs, tolower)
docs = tm_map(docs, removeNumbers)
inspect(docs[10:15])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 6
## 
## [1] [//, :: pm] yh: yep from nie! do you want to cab tgt? :d                                    
## [2] [//, :: pm] gha: cab? woah                                                                  
## [3] [//, :: pm] gha: got time right?                                                            
## [4] [//, :: pm] gha: what time do you end?                                                      
## [5] [//, :: pm] gha: i end at  but i think i can ask my tutor to let me off earlier             
## [6] [//, :: pm] yh: haha okay then! i don't have classes that day but i am still staying in hall

We then proceed to remove the timestamp, the sender and any special characters in the text, using toSpace.

docs = tm_map(docs, toSpace, "\\[//, :: \\w{2}] ([a-z ]+|[^0-9A-Za-z///']+): ")
docs = tm_map(docs, toSpace, "[^0-9A-Za-z///' ]+")
inspect(docs[10:15])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 6
## 
## [1] yep from nie do you want to cab tgt d                                      
## [2] cab woah                                                                   
## [3] got time right                                                             
## [4] what time do you end                                                       
## [5] i end at  but i think i can ask my tutor to let me off earlier             
## [6] haha okay then i don't have classes that day but i am still staying in hall

The first transformation takes care of the timestamp and the sender’s name. The pattern that toSpace looks out for is given in the third argument of tm_map, which is a regular expression. It looks a little more complicated than it needs to be because I created a generic pattern that can be applied to all my analyses; I saved my contacts in different ways, which calls for a more general regular expression to identify them. In this case, since there are only two senders, “yh” and “gha”, I could actually hard-code the pattern to recognise both, as sketched below. The second transformation removes any special characters introduced by non-English characters or emoticons.
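
A sketch of that hard-coded alternative, which would take the place of the first tm_map call above, assuming “yh” and “gha” are the only senders:

docs = tm_map(docs, toSpace, "\\[//, :: \\w{2}] (yh|gha): ")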

docs = tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("image", "omitted", "can", "haha", "hahaha", "will", "just", "lol", "one"))
docs <- tm_map(docs, stemDocument)
inspect(docs[10:15])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 6
## 
## [1] yep nie want cab tgt d          cab woah                       
## [3] got time right                  time end                       
## [5] end think ask tutor let earlier okay class day still stay hall

In the next step, extra white space is removed, so that “Hi   there” becomes “Hi there”. Common English words such as “i”, “me” and “you” are also removed, as these are usually not meaningful. We can go on to remove more specific words using the same function, providing a vector of words as the third argument; the words “image” and “omitted” stem from the sending of images, which WhatsApp omits when exporting the chat. Last but not least, we use stemDocument to reduce each word to its root, so that “computational”, “computation” and “compute” all become the same root word “comput”. This is useful because we are interested in how frequently a certain idea, such as “compute”, is expressed, and less concerned with the form in which it is expressed.
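
To see the stemmer in isolation, stemDocument can also be applied directly to a character vector; the three forms mentioned above all collapse to the same root:

stemDocument(c("computational", "computation", "compute"))
## [1] "comput" "comput" "comput"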

The next step is to build a term-document matrix, which is a table listing the frequency of each word. By default, TermDocumentMatrix only counts words that are at least three characters long, but we can adjust this to include two-character words as well via the control argument. Let us plot a bar chart of the top 30 most frequent words.

library("ggplot2")
dtm <- TermDocumentMatrix(docs, control = list(wordLengths=c(2, Inf)))
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
ggplot(d[1:30,]) +
  geom_col(aes(reorder(word, freq), freq), fill = "darkblue") +
  xlab("Words") +
  ylab("Frequency") +
  coord_flip()
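
As a quick check on the matrix itself, tm’s findFreqTerms() lists every term that appears at least a given number of times; the threshold of 50 below is an arbitrary choice and the resulting terms will depend on your own chat:

findFreqTerms(dtm, lowfreq = 50)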

We can now generate the word cloud using the wordcloud library.

library("wordcloud")
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=150, random.order=FALSE, rot.per=0.2, 
          colors=brewer.pal(8, "Dark2"))

We can see that some strange words such as “chang” and “cours” appear, due to the effect of word stemming as well as the use of colloquial language. Adjusting the minimum word length to 4 yields the following results:

dtm <- TermDocumentMatrix(docs, control = list(wordLengths=c(4, Inf)))
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.2, 
          colors=brewer.pal(8, "Dark2"))

2. Distribution analysis

We can probe further to find out the messaging pattern over time. For example, on which day were the most messages exchanged? Who sent more messages? Let’s start by re-reading the text file.

library("stringr")
text <- readLines("D:\\wiley\\yh.txt")
date=c()
time=c()
for (i in seq_along(text)) {
  date[i] = str_extract(text[i], "\\d+/\\d+/\\d+")
  time[i] = str_extract(text[i], "\\d+:\\d+:\\d+ \\w{2}")
}
length(date)
## [1] 8386
length(time)
## [1] 8386

First, from each line in the text file, we extract the date and time stamps and place them in separate vectors. The function str_extract grabs only the first match in each line, so there will not be any duplicates and the vectors will have as many entries as there are lines in the text file.
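
To see the first-match behaviour in isolation, here it is applied to a made-up line in the same format; the date-like token inside the message body is ignored because only the first match is returned:

str_extract("[8/2/17, 10:25:38 PM] YH: meet on 7/11/19?", "\\d+/\\d+/\\d+")
## [1] "8/2/17"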

library(tidyverse)
df = tibble(date=seq(as.Date("2017-02-01"), as.Date("2019-11-03"), by = "day"),freq=NA)

Next, we initialise a dataframe with each row representing one day, from 1 Feb 2017 to 3 Nov 2019. Since there are 1006 days in this period, there are 1006 rows in the dataframe. Our goal here is to count the number of messages sent on each day and populate the freq column. We then initialise a second dataframe containing the dates and times extracted from the text file.
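
As a quick sanity check, the dataframe should indeed contain one row per day in this range:

nrow(df)
## [1] 1006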

df2 = tibble(date=date, time=time)
df2$date = parse_date(df2$date, "%d/%m/%y")
df2$time = parse_time(df2$time)
df2 = na.omit(df2)

We can now count the number of messages each day by grouping the second dataframe by date.

df2 = df2 %>%
  group_by(date) %>%
  summarise(freq=n()) 
df2[1:10,]

If we immediately try to visualise the distribution, we obtain a rather messy graph. This is because days with no messages exchanged are simply absent from df2, so the line connects straight across the gaps, giving the impression that messages were exchanged every day.

df2 %>%
  ggplot()+
  geom_line(aes(date, freq))
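
We can confirm the gaps by counting how many days in the full range never show up in df2 (a quick check using the two dataframes created above; the exact count will depend on your own chat):

sum(!df$date %in% df2$date)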

The solution is to go back to the first dataframe and populate its freq column, assigning 0 to dates with no messages. We can do this with a left join between the two dataframes. The join leaves a redundant second column (the all-NA freq column from df), which should be dropped.

df_final = left_join(df, df2, by="date")
df_final = df_final[,c(1,3)]
df_final[is.na(df_final$freq.y),"freq.y"] = 0
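
As an aside, tidyr’s complete() offers a more direct route to the same result, expanding df2 to the full date range and filling the missing counts with 0. A sketch of that alternative (df_alt is just an illustrative name, and it yields a single freq column rather than freq.y):

df_alt = df2 %>%
  complete(date = seq(as.Date("2017-02-01"), as.Date("2019-11-03"), by = "day"),
           fill = list(freq = 0))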

df_final %>%
  arrange(desc(freq.y)) %>%
  head(10)

Ranking the days by message count, we see that the most messages were exchanged on 18 Aug 2018, 19 July 2019 and 9 Oct 2019. We can now plot the daily distribution.

ggplot(df_final) +
  geom_line(aes(date,freq.y)) + ggtitle("Daily Number of Whatsapp Messages")

Now let’s find out who sent more messages. For simplicity, we will redo the analysis from the beginning; however, we can skip the text-cleaning steps since we are not interested in the contents.

The first few steps are to remove the date-time stamp, convert the text to lower case, read each entry of text into a dataframe and finally extract the name of the sender.

text <- readLines("D:\\wiley\\yh.txt")
text = gsub("\\[\\d+/\\d+/\\d+, \\d+:\\d+:\\d+ \\w{2}\\] ", "", text)
text = tolower(text)
df = data.frame(name=text)
df2 = df %>%
  mutate(name2 = str_extract(name, "[a-z0-9 ]+:"))
df2 = na.omit(df2)
df2 = df2 %>%
  select(name2) 

df2 %>%
  distinct(name2)

While we have successfully extracted the senders’ names, we have also inadvertently extracted some other content, because message text can contain structures similar to the regular expression (ordinary words followed by a colon). Let’s remove the irrelevant entries and do a frequency count for each sender. Finally, we do a simple visualisation to compare the difference.
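
As an aside, since we already know the two senders, the extraction could be hard-coded to them from the start, in the same spirit as the earlier regular expression. A sketch of that alternative (df2_alt is just an illustrative name; the filter below then becomes largely unnecessary):

df2_alt = df %>%
  mutate(name2 = str_extract(name, "^(yh|gha):"))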

df2 %>%
  filter(name2 == "yh:" | name2 == "gha:") %>%
  count(name2) %>%
  arrange(desc(n)) %>%
  ggplot(aes(reorder(name2, n), n)) +
    geom_col(fill = "dodgerblue") +
    xlab("Name") +
    ylab("Number of Messages") +
    geom_text(aes(label = n), nudge_y = 100)

Well, seems like my friend sent more messages than I did!

df2 %>% 
  filter(name2 == "yh:" | name2 == "gha:") %>%
  count(name2) %>%
  arrange(desc(n)) %>%
  mutate (pct = round(100 * n / sum(n)))