Ever get the feeling that one of your friends is a bit of a loose cannon in the group chat? Or that another friend is selectively non-responsive to you? These are testable hypotheses! In the following blog post, I show an example examination of group dynamics in an iMessage group chat using my family as a test case.
It’s 2020 and it seems that everyone has, at minimum, 1-2 group chats that punctuate their day with perpetual dings, chirps, buzzes – or whatever your text notification tone may be. For me, it is my family. Each member of the chat brings their own characteristic profile of emotion, response time, engagement, and writing style. I downloaded the group chat (sqlite database located at “~/Library/Messages/chat.db”) from May 2019 when I bought this laptop to January 2020 and examined each participant’s chat profile using the extensive plotting capabilities of R and accompanying packages.
library(knitr)
library(here)
library(ngram)
library(zoo)
library(stopwords)
library(patchwork)
library(scales)
library(wordcloud2)
library(textdata)
library(tidytext)
library(tidyverse)
theme_set(theme_minimal())
# Please excuse the wonky file path as I did not want to copy these data into my project folder :)
df_raw <- read_csv("../../../../../fun_projects/imessage/data/proc/fam_collapsed_011620.csv")
## Parsed with column specification:
## cols(
## index = col_double(),
## text = col_character(),
## handle_id = col_double(),
## date = col_date(format = ""),
## message_date = col_double(),
## timestamp = col_datetime(format = ""),
## month = col_double(),
## year = col_double(),
## is_sent = col_double(),
## message_id = col_double(),
## phone_number = col_character(),
## chat_id = col_double()
## )
# latency between messages in seconds to consider a message a new conversation, rather than a response
NEW_CONVO_LATENCY <- (2*60*60)
# stop words to exclude in text analysis
STOP_WORDS <- c(as.character(unlist(data_stopwords_smart)), "yeah", "he's", "he'll")
df <- df_raw %>%
mutate(date = as.Date(timestamp)) %>%
mutate(person = case_when(
phone_number == "+14087727115" ~ "Ashish",
phone_number == "+14088398996" ~ "Papa",
phone_number == "+14088393222" ~ "Mom",
phone_number == "+14088395957" ~ "Asha",
phone_number == "+14082039710" ~ "Anjali",
phone_number == "yathaarth.bhansali@autodesk.com" ~ "ram"
)) %>%
mutate(wc = map_dbl(text, wordcount)) %>%
filter(person != "ram") %>% # excluding Ram because he was only in the conversation momentarily
arrange(timestamp) %>%
mutate(time_diff_lag = difftime(timestamp, lag(timestamp))) %>%
mutate(time_diff_lead = difftime(lead(timestamp), timestamp)) %>%
mutate(response_to = lag(person)) %>%
mutate(response_to = ifelse(time_diff_lag > NEW_CONVO_LATENCY, "Start convo", response_to)) %>%
filter(!is.na(response_to)) %>%
mutate(msg_c_id = 1e5:(1e5+n()-1)) %>%
mutate(text = str_replace_all(text, "’", "'"))
wc_1 <- df %>%
group_by(person) %>%
summarize(wc = sum(wc)) %>%
ggplot(aes(x = fct_reorder(as.factor(person), wc, desc) , y = wc, fill = person)) +
geom_bar(stat="identity") +
labs(y = "Total word count", x = "") +
theme(legend.position = "none")
wc_2 <- df %>%
group_by(date, person) %>%
summarize(n = n(),
wc = sum(wc)) %>%
ungroup %>%
ggplot(aes(x = date, y = wc, color = person)) +
geom_point(alpha = .5) +
geom_smooth(se=F, size = .5, method="loess") +
labs(x = "", y= "Word count each day", color = "") +
scale_x_date(breaks = pretty_breaks(10))
wc_3 <- df %>%
group_by(date, person) %>%
summarize(n = n(),
wc = sum(wc)) %>%
ungroup %>%
ggplot(aes(x = date, y = wc, color = person)) +
# geom_point(alpha = .5) +
geom_smooth(se=F, size = .5, method="loess") +
labs(x = "", y= "", color = "") +
scale_x_date(breaks = pretty_breaks(10)) +
theme(legend.position = "none")
wc_1 / (wc_2 + wc_3) + plot_layout(guides = 'collect')
In this figure, we see the the number of words each participant is contributing to the chat in total and on a daily basis. (Please note the whole dataset used in this project is missing data in July and August when I was out of the country and my laptop was off.)
In these plots, we see a few notable features. First, we see that my younger sister, Asha, has much more important things to do during the day than sitting around on her phone. Next, we see an increase in communications from my Mom as she was sharing news from Boston on the condition of my aunt in her final days of life. Finally, we see another jump in communications from my Mom when the holidays come around and she is planning get-togethers, presents, and other family happenings.
senti_df <- df %>%
unnest_tokens(word, text) %>%
filter(!word %in% STOP_WORDS) %>%
inner_join(get_sentiments("nrc"), by="word") %>%
inner_join(get_sentiments("afinn"), by="word")
senti_df_day <- senti_df %>%
group_by(person, date) %>%
summarize(value = mean(value)) %>%
group_by(person) %>%
mutate(value_diff = c(NA, diff(value))) %>%
ungroup %>% ungroup
s_1 <- senti_df %>%
# filter(date < as.Date("2019-06-01") | date > as.Date("2019-07-01")) %>% # excluding dates around death of Mom's sister
ggplot(aes(x = person, y = value)) +
stat_summary(fun.data = "mean_cl_boot") +
labs(y = "Mean sentiment")
s_2 <- senti_df_day %>%
ggplot(aes(x = date, y = value, color = person)) +
geom_point(alpha = .5) +
geom_smooth(se=F, size = .5, method="loess") +
labs(y = "Sentiment", x = "", color = "") +
scale_x_date(breaks = pretty_breaks(10))
s_1 + s_2
This plot displays the positivity and negativity of the words used in texts from each member of the family. In this analysis, we use a simple bag-of-words approach with the NRC Emotion Lexicon dictionary to measure sentiment.
Again, we see the move towards negatively valenced speech during the time of my aunt’s illness and again during the holidays. Following the holiday rush, we see an unprecedented increase in positive valence from my mother and general increase in positive valence from most other members of the family.
word_freq_df <- df %>%
unnest_tokens(word, text) %>%
filter(!word %in% STOP_WORDS) %>%
mutate(word = str_replace(word, "'s", "")) %>%
mutate(word = str_replace(word, "[0-9]*", "")) %>%
select(word) %>%
table %>%
as.data.frame
wordcloud2(word_freq_df, minSize = 3)
No self-respecting text analysis would be complete without the inclusion of a word cloud. Very unsurprisingly, this plot reveals the popularity of my newborn nephew “Brady” (whose birth in March just preceded the beginning of data collection).
nsent <- df %>%
group_by(response_to) %>%
summarize(responded_to_n = n()) %>%
ungroup %>%
mutate(total_n = sum(responded_to_n))
# numerator: proportion of senders texts to a respondee (sent_to_n/sender_total_n)
# denominator: proportion of respondees texts to full data (respondee_total_n/total_n)
response_df <- df %>%
filter(response_to != person) %>%
group_by(person, response_to) %>%
summarize(sent_to_n = n()) %>%
group_by(person) %>%
mutate(sender_total_n = sum(sent_to_n)) %>%
mutate(numerator = sent_to_n/sender_total_n) %>%
ungroup %>%
group_by(response_to) %>%
mutate(respondee_total_n = sum(sent_to_n)) %>%
ungroup %>%
mutate(total_n = sum(respondee_total_n)) %>%
mutate(numerator = sent_to_n/sender_total_n) %>%
mutate(denominator = respondee_total_n/total_n) %>%
mutate(ratio = (numerator/denominator)/5)
response_df %>%
ggplot(aes(x = person, y = ratio, fill = response_to)) +
geom_bar(stat="identity", position = position_dodge()) +
labs(x = "", y = "Ratio", fill = "Response to:")
In this plot, I examine what proportion of a sender’s text messages are in response to another member of the group chat. This is determined by observing who sent the previous message prior to the sender entering their own. However, if it has been more than two hours since anyone sent a message, the sender is labeled as “starting a conversation”. These values are then scaled by how often the respondee sends a message. Thus, values close to 1 indicate that a sender responds to a respondee proportional to how often the respondee sends a message.
# this plot doesn't represent all responses because if there are no valenced words from the dictionary in a text it is dropped
senti_resp_df <- senti_df %>%
right_join(select(df, msg_c_id, response_to))
## Joining, by = c("response_to", "msg_c_id")
senti_resp_df %>%
filter(!is.na(person)) %>%
ggplot(aes(x = person, y = value, fill = response_to, color = response_to)) +
geom_jitter(alpha = .3, size = .4, position = position_jitter(w = .3)) +
stat_summary(fun.data = "mean_cl_boot", position = position_dodge(w = .4)) +
labs(x = "", y = "Sentiment", color = "Response to:", fill = "Response to:")
## Warning: Removed 1 rows containing missing values (geom_pointrange).
This plot shows the average sentiment of a sender’s response to a given person.
r_1 <- df %>%
filter(time_diff_lead < NEW_CONVO_LATENCY) %>%
ggplot(aes(x = person, y = as.numeric(time_diff_lead))) +
stat_summary(fun.data = "mean_cl_boot", size = .4, color = "brown") +
# geom_point(position = position_jitter(w = .2), alpha = .2) +
labs(x = "", y = "Seconds until receiving a response")
r_2 <- df %>%
filter(time_diff_lead < NEW_CONVO_LATENCY) %>%
ggplot(aes(x = person, y = as.numeric(time_diff_lead))) +
stat_summary(fun.data = "mean_cl_boot", size = .4, color = "brown") +
geom_point(position = position_jitter(w = .2), alpha = .2) +
labs(x = "", y = "")
r_1 + r_2
In this plot, I examined which members of the family get faster or slower responses from others. This is measured by the time between a sender sends a text message and when another person responds. However, if the time exceeds two hours it is not considered a response, but rather, a new conversation. It appears that my father may need to wait marginally longer than other member’s of the family to receive a response.
In this post I have demonstrated a few various idea to examine the dynamics of your group chats. These are just a small subset of fun hypotheses to look at. For deeper analysis, one may look at language style matching and topics models to gain more insight into the relations between speakers and the content of chats. Enjoy!
Information about this R session including which version of R was used, and what packages were loaded.
sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.3
## [5] readr_1.3.1 tidyr_1.0.0 tibble_2.1.3 ggplot2_3.2.1
## [9] tidyverse_1.3.0 tidytext_0.2.2 textdata_0.3.0 wordcloud2_0.2.2
## [13] scales_1.0.0 patchwork_1.0.0 stopwords_1.0 zoo_1.8-5
## [17] ngram_3.0.4 here_0.1 knitr_1.26
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-137 fs_1.3.1 lubridate_1.7.4
## [4] RColorBrewer_1.1-2 httr_1.4.1 rprojroot_1.3-2
## [7] SnowballC_0.6.0 tools_3.5.3 backports_1.1.5
## [10] R6_2.4.1 rpart_4.1-13 Hmisc_4.3-0
## [13] DBI_1.1.0 lazyeval_0.2.2 colorspace_1.4-1
## [16] nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5
## [19] gridExtra_2.3 compiler_3.5.3 cli_1.1.0
## [22] rvest_0.3.5 htmlTable_1.13.1 xml2_1.2.2
## [25] labeling_0.3 checkmate_1.9.1 rappdirs_0.3.1
## [28] digest_0.6.23 foreign_0.8-71 rmarkdown_1.16
## [31] base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.4.0
## [34] dbplyr_1.4.2 htmlwidgets_1.5.1 rlang_0.4.2
## [37] readxl_1.3.1 rstudioapi_0.10 generics_0.0.2
## [40] jsonlite_1.6 acepack_1.4.1 tokenizers_0.2.1
## [43] magrittr_1.5 Formula_1.2-3 Matrix_1.2-15
## [46] Rcpp_1.0.3 munsell_0.5.0 lifecycle_0.1.0
## [49] stringi_1.4.3 yaml_2.2.0 grid_3.5.3
## [52] crayon_1.3.4 lattice_0.20-38 haven_2.2.0
## [55] splines_3.5.3 hms_0.5.2 zeallot_0.1.0
## [58] pillar_1.4.2 reprex_0.3.0 glue_1.3.1
## [61] evaluate_0.14 latticeExtra_0.6-28 data.table_1.12.6
## [64] modelr_0.1.5 vctrs_0.2.1 cellranger_1.1.0
## [67] gtable_0.3.0 assertthat_0.2.1 xfun_0.10
## [70] broom_0.5.2 janeaustenr_0.1.5 survival_2.43-3
## [73] cluster_2.0.7-1 ellipsis_0.3.0