Introduction

This project is an exploration of text processing, using methods ranging from term frequency to unsupervised clustering. The data for this project come from the Persuasion for Good dataset (Wang et al. 2019). I am loading a number of packages here, each marked with the section it is used in. Some are essential [tidyverse] while others are nice to have [skimr]. Either way, I’ve split them into three distinct sections: tidying the text (we’ll talk about what that means), analyzing text data, and unsupervised machine learning (clustering).

Text data are being generated constantly: reviews, surveys and questionnaires, even text messages! Reading every word and deriving meaning from it by hand is very possible; in qualitative research, the technique is called coding. But what about when instead of five speakers there are 200? Or each person speaks for an hour instead of thirty minutes? How can we scale up our analysis of human text and speech using programming?

# tidying
library(tidyverse)
library(readxl)
library(broom)
library(dplyr)
# text analysis
library(stopwords)
library(tidytext)
library(skimr)
library(tidylo)
# unsupervised learning
library(furrr)
library(future)
library(stm)

There are a few definitions we’ll be using throughout. They will be defined as they come up, but if you just need to see a master list, I’ve provided one here.

Data Import & Tidying

The data are hosted on the project’s GitLab page. I pulled the raw data directly from the repository: just full_d_raw and full_i_raw, which are the full dialogue table and the participant information table (respectively).

# full_d for the full dialog, not annotated
full_d_raw = read_csv('data/FullData/full_dialog.csv',
         col_names = TRUE) %>% 
  select(
    ConvoId = B2,
    Turn,
    Role = B4, # (0 = persuader, 1 = persuadee)
    Dialogue = Unit
    )

full_i_raw = read_csv('data/FullData/full_info.csv',
         col_names = TRUE) %>% 
  rename_with(~str_remove(.,"\\.x$")) %>% 
  select(
    ConvoId = B2,
    Role = B4, # (0 = persuader, 1 = persuadee)
    User = B3,
    NumTurn = B7,
    Donation = B6,
    everything()
    )
# note: rename_with() above pairs an anonymous function (~) with a regular expression to strip the ".x" suffix

Reading the data in, there were some interesting formatting choices. In full_i_raw, columns about personality traits and personal characteristics were marked as ‘extrovert.x income.x conscientious.x religion.x’ et cetera, hence the rename_with() call above. This dataset contains some important variables we need to understand. First, in full_d_raw, we have ConvoId, the unique identifier of each conversation held between two participants. There is Role, a boolean (0/1) value that indicates whether a participant is the persuader (0) or the persuadee (1); Role alternates every row. Each row also contains Dialogue, everything that someone said (typed) during their speaking turn.

In full_i_raw, there are many more variables we need to know about, and we will look into them as they come up. What’s important to know is that full_i_raw also has ConvoId and Role, the two variables it shares with full_d_raw. These are our joining variables for the next step.

# join on ConvoId and Role, the shared columns
fullset = full_d_raw %>% left_join(full_i_raw)

# sampling individual utterances to see how standard the data are
fullset %>% slice_sample(n = 5) %>% pull(Dialogue)
## [1] "That is what I like to hear!"                                                                         
## [2] "neither.  it's a charity that provides resources for the children.  would you like to donate $1 today"
## [3] "You're welcome. I'm glad to be a part of a noble cause! Take care!"                                   
## [4] "Oh that is great. Also, I forgot to ask, do you like basketball or football?"                         
## [5] "Your welcome and thank you for your help"

Joining the two datasets together is an important step for simplifying the analysis to come! Peeking into the text container Dialogue, we see that each entry is a full speaking turn, anywhere from a single symbol to a sentence to multiple sentences. There were four datasets in the Persuasion for Good release; the remaining two are the annotated datasets, i.e. the original researchers’ analysis appended to the kind of data we just examined.

# ann_d for the full dialog, with annotations
ann_d_raw = read_xlsx('data/AnnotatedData/300_dialog.xlsx',
         col_names = TRUE) %>%
  select(
    ConvoId = B2,
    Role = B4, # (0 means persuader, 1 means persuadee)
    Dialogue = Unit,
    everything()
  )

# ann_i for the corresponding index file
ann_i_raw = read_xlsx('data/AnnotatedData/300_info.xlsx',
         col_names = TRUE) %>%
  select(
    ConvoId = B2,
    Role = B4, # (0 means persuader, 1 means persuadee)
    User = B3,
    Donation = B6,
    IntDonation = B5,
    NumTurn = B7,
    everything()
  )

annset = ann_d_raw %>% left_join(ann_i_raw) 

So what’s new and influential in the annotated data? The most relevant addition is IntDonation, the intended or stated donation that the participant verbally committed to during the conversation. We join these data the same way we did to create fullset.

Tidy Text

Here is where the tidying happens! Tidy text data has one observation per row, and in this text project, that means breaking up the text container Dialogue into one word per row. I used tidytext::unnest_tokens() to automatically parse the text into single-word rows. This is a simple, fast way to tokenize the text.

# What can we do with tidytext? Here are some transformations from chapter 1 of Silge's Text Mining with R to make token-per-row or 'tidy' data.

fulltidy = fullset %>% 
    # assigning labels to the levels
    mutate(
    # levels of religion
    religion = factor(religion, levels = c("Catholic","Protestant","Atheist","Other religion"), 
                      labels = c("Catholic","Protestant","Atheist","Other religions")), 
    # levels of sex
    sex = factor(sex, levels = c("Male","Female","Other"),
                 labels = c("Male","Female","Other Genders")), 
    # levels of political ideology
    ideology = factor(ideology, levels =  c("Conservative","Moderate","Liberal"), 
                      labels = c("Conservative","Moderate","Liberal"), ordered = FALSE), 
    # levels of Role
    Role = factor(Role, levels = c("0","1"),labels = c("Persuader","Persuadee")), # adding labels!!
    # levels of education
    edu = factor(edu, levels = c("Less than four-year college","Four-year college","Postgraduate"), 
                 labels = c("Less than four-year college","College","Postgraduate"))
  ) %>% 

  # tidying the Dialogue by word
  unnest_tokens(Word,Dialogue) %>% 
  # pulling out irrelevant words
  anti_join(get_stopwords(),by = c("Word" = "word")) 

tidytext::get_stopwords() is an important part of this process. It identifies words that do not usually carry meaningful information, so that the tens of thousands of occurrences of common little words won’t interfere with our analysis. Nesting the function within anti_join() removes these observations from the dataset.
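
For reference, get_stopwords() returns a small tibble with columns word and lexicon [Snowball English by default], which is why the anti_join() above matches Word against word:

# peek at the stopword lexicon behind the anti_join above
get_stopwords() %>% head()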

As you can see, there is a sizable mutate() containing factor() calls. These assign labels to the levels of the variables we’ll be examining: religion, sex, education, and political ideology.

So what does this all look like?

Exploratory Data Analysis

In this section, we will visualize all of the variables of interest, to determine what to expect further down the line. Let’s start with our newly parsed words.

# most popular words
fulltidy %>% count(Word, sort = TRUE) %>% 
  filter(n > 1000) %>% 
  # reorder words by count
  ggplot(aes(fct_reorder(Word, n),n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, title = "Most Common Words in Dialogue")

Using ‘tidy’ data, we can see which words were used the most across all the conversations. The words ‘children’, ‘save’, and ‘donate’ were used the most, most likely because the charity is called ‘Save the Children’ and the participants are tasked with convincing people to donate. Still interesting! What about the conversations?

# utterances per conversation
fullset %>% 
  count(ConvoId) %>% 
  ggplot(aes(n)) +
  geom_histogram(binwidth = 5) +
  labs(title = 'How long is each conversation?',x = '# of turns',y = '# of conversations',subtitle = 'On average, their conversations are 20 exchanges long.')

These conversations are about 20 exchanges long on average, but we can see that most of them were at least that long, with a few of them going up to 30! That’s a while to talk to a stranger. Who’s saying what?

fulltidy %>% count(Word,Role, sort = TRUE) %>% 
  # filter(n > 500) %>% 
  slice_max(n,n = 25) %>% 
  ggplot(aes(Word,n, fill = as.factor(Role))) +
  geom_col() +
  coord_flip() +
  labs(title = "Top Words by Role", x = NULL, fill = NULL,subtitle = "Only the persuaders said payment (and most everything else too)")

Out of the top words overall, persuaders said every word, and they said every word a lot; the only word that persuadees said more often than persuaders is ‘yes’. One of the major facets we examine is the participant’s role in the conversation, as it seems to shape which words each participant uses.

#### Bar Charts (n) By Role ####

fulltidy %>% 
  count(Word,Role,sort = TRUE) %>% 
  slice_max(n,n = 20) %>% 
  ggplot(aes(Word,n)) +
  geom_col(aes(fill = n),show.legend = FALSE) +
#  geom_text(aes(label = Word), check_overlap = TRUE, vjust = 1.5) +
  facet_wrap(~Role, scales = "free") +
  labs(x = NULL, y = 'count',title = "Most Common Words by Role", subtitle = "Persuaders talked waaaay more often...about the same things as the persuadees") +
  coord_flip()

Role also seems to have an impact on how much each participant says. Clearly, the persuaders are driving the conversation. This chart shows the most common words by Role. One would think, ‘Ah, this is it! We know what the most common words by Role are!’ Well, not exactly. Alas, this is only the beginning of text analysis. This chart is still important though!

Text analysis is built around frequency: how often the words occur. There is something meaningful to learn from each of the different methods of exploring frequency we go over in this analysis. Later, I’ll explain some better ways to examine frequency.

Donations Aside

So how well did participants do at persuading people?

# plot intended donations
annset %>% 
  filter(IntDonation < 200,
         Role == 1) %>% 
  ggplot() +
  geom_boxplot(aes(IntDonation)) +
  labs(title = "Show Me the Money", x = "stated donation amount (in USD)",subtitle = "Most people failed to get their partner to verbally commit to a donation amount")

Not a lot of people planned to donate! Our persuaders may not have been very persuasive. There were a few outliers not represented on this graph, but first let’s see how the money actually turned out.

# difference of intended donation and donation by frequency
annset %>% 
  # difference between IntDonation and Donation
  filter(IntDonation < 75,
         Role == 1) %>%
  ggplot(aes(Donation-IntDonation)) +
  geom_histogram(bins = 15) +
  labs(title = "But even out of those who did commit...",y = "# of participants", x = "difference in stated donation v actual donation amount (in USD)", subtitle = "Most either gave less or didn't actually give at all")

Most people actually donated how much they pledged to donate (which was nothing). As expected, some donated less, some donated more. Two big talkers promised to donate $10,000 and $500 respectively, but nothing came from it. Perhaps they weren’t so persuasive after all…

So who was doing the donating and persuading anyways?

Traits and Values

Let’s get to know our participants! Here we have plots of their different values (the big five) and traits (education, sex, etc). These variables form the basis of our grouping variables later on, so it’s a good idea to start here. This is also where I always start when playing with a dataset. While the dependent variable is always interesting, it’s important to explore all the variables you have at the beginning.

# the big five [extrovert, agreeable, conscientious, neurotic, open]
full_i_raw %>% 
  select(User,extrovert:open) %>% 
  pivot_longer(extrovert:open,names_to = "trait", values_to = "score") %>% 
  ggplot(aes(score, fill = trait)) +
  facet_wrap(~trait) +
  geom_histogram(bins = 5, show.legend = FALSE) +
  labs(title = 'The Big Five', y = '# of participants', subtitle = 'Participants\' personalities tended to be highly agreeable, conscientious and open')

Something cool about this dataset is that each participant took three personality tests before the discussion. The big five, a major psychometric personality test, is the one we focus on in this analysis, but a further analysis may evaluate the sample using the Schwartz portrait, moral foundations, and decision making model. The big five model here shows us the different sorts of people we have and their values.

Now we’ll look at our sample and their different traits. I made plots of each variable, simply visualizing the sample to understand which characteristics would be best to group by. Rather than examining the entire population as one group, we can break out traits to see if they are related to which words participants used. To figure out the best traits to examine, however, we look at distributions and the relationships between the different variables.

# one (or more) trait plots with religion, sex, ideology, role, education
# sex and role
fullset %>%
  count(sex,Role) %>% 
  replace_na(list(sex = 'Other')) %>% 
  mutate(Role = factor(Role, labels = c("Persuader","Persuadee"))) %>% 
  ggplot(aes(sex,n, fill = Role)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Role) +
  labs(title = "Who's talking?", subtitle = "Role was split up evenly across gender identity",y = '# of participants',x = "gender")

Role is arguably the most important grouping we have in this sample, because it tells us about the structure of the conversation. This grouping was the most obvious to analyze, because one could reasonably assume that the person being convinced and the person doing the convincing may speak differently. The gender of the participant seems to be evenly split up across male and female, and there are too few other genders participating to examine. What about ideology? Religion?

# religion/ideology
full_i_raw %>%
  count(religion,ideology) %>% 
  mutate(ideology = factor(ideology, c("Conservative","Moderate","Liberal"))) %>% 
  filter(!is.na(ideology),!is.na(religion)) %>% 
  ggplot(aes(ideology,n,fill = religion)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~religion) +
  labs(title = "Who's Talking?",x = "political ideology",y = "# of participants", subtitle = "Christians tend to be more conservative and Atheists tend more liberal")

Higher proportions of atheists are liberal, as are a large share of the other religions. This tells us about the group makeup of political ideology: the conservative label will tend to contain more Christians, while the liberals will be mostly atheists and other religions. What about education?

# ideology bar chart faceted by education
full_i_raw %>% 
  count(edu,ideology) %>% 
  # reorder education levels
  mutate(edu = factor(edu, c("Less than four-year college","Four-year college","Postgraduate")),
         ideology = factor(ideology, c("Conservative","Moderate","Liberal"))) %>% 
  filter(!is.na(edu), !is.na(ideology)) %>% 
  ggplot(aes(ideology,n,fill = ideology)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~edu) +
  labs(title = "Who's Talking?", x = 'political ideology', y = '# of participants', subtitle = "Ideological distributions hold across each level of education")

Most of our participants don’t have a college degree, and for this sample, it seems education is not very related to political ideology. There are pretty similar distributions of political ideology for each education level. Either way, our sample has more liberals than conservatives and moderates.

There are a number of traits that are not visualized here: income level, race, marital status. These variables were also interesting, but each also had other issues that made them a bad fit for this analysis. The variables we looked at here will be used as grouping variables throughout the rest of our analysis.

Back to the words!

Rank and Frequency

Text cannot be added or subtracted. Text is analyzed by frequency, and there are a surprisingly large number of ways to analyze frequency. In this project, we look at a relative document frequency [tf-idf] and a Bayesian relative document frequency [weighted log odds]. But first, simple frequency!

### preparation for Zipf's law and tf-idf
# create two grouped counts, one by conversation and one for total words in the sample
# tidy the Dialogue to one obs per row, count by ConvoId
words1 = fullset %>% 
  select(ConvoId,
         Dialogue) %>% 
  unnest_tokens(Word,Dialogue) %>% 
  count(ConvoId,Word,sort = TRUE, name = "Count")

# count the words per conversation
words2 = words1 %>% 
  group_by(ConvoId) %>% 
  summarize(ctotal = sum(Count))

wordcount <- left_join(words1, words2)

To look at frequency, we have to examine how often something happens. To use the count() function, we have to decide which category to count within. In the first chunk, creating words1, words2, and wordcount, we group by ConvoId, the unique identifier for each individual conversation held between two participants.

# tidy the Dialogue to one obs per row, count by total words in sample
wordcount1 = fullset %>%
  select(ConvoId,
         Dialogue) %>%
  unnest_tokens(Word,Dialogue) %>%
  count(Word,sort = TRUE, name = "Count")

And here we have one grouped by the word itself [this is simple word frequency].

What do the distributions of the words look like compared to their conversation?

wordcount %>% 
  ggplot(aes(Count/ctotal, fill = 16)) +
  geom_histogram(bins = 15, show.legend = FALSE) +
  labs(title = 'Word Frequency per Conversation', x = "proportion of word usage", y = '# of conversations',subtitle = 'Many words are used infrequently and a few words are used very frequently') 

This was where I felt like I started learning about my data. There are a lot of words used very few times in many conversations.

wordcount1 %>% 
  ggplot(aes(Count, fill = 13)) +
  geom_histogram(bins = 20, show.legend = FALSE) +
  scale_x_log10() +
  labs(title = 'Word Frequency', x = '# of occurrences per word', y = '# of words', subtitle = 'Many words are said a few times, some words are said a lot')

So there are a very high number of words that are used only one time. Some of these words are typos. There isn’t a really good way to filter those words out, but since they’re rather infrequent (e.g. ‘wuold’ is probably seen only once), they won’t disrupt the analysis too much. We can look into this phenomenon even more using Zipf’s law.
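
If we ever did want to drop those one-off words, a simple count filter would do it, at the cost of also removing legitimately rare words:

# drop words that appear exactly once in the whole corpus (mostly typos)
wordcount1 %>% filter(Count > 1)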

Zipf’s Law

The phenomenon we just saw, the inverse relationship between word frequency and rank, has a name: Zipf’s law, the observation that a word’s frequency is inversely proportional to its rank. \[ \text{frequency} \propto \frac{1}{\text{rank}} \]

Zipf’s law is based on rank, so here we build rank into our dataset freq_by_rank. Rank is each word’s position when words are ordered by how many times they appear relative to other words in the sample. This gives us a compact way of exploring how often each word appears.

# by convo_id
freq_by_rank = wordcount %>%
  group_by(ConvoId) %>%
  mutate(rank = row_number(), ## this works because the data are sorted already
         `term frequency` = Count/ctotal)

# corpus, no by variable
freq_by_rank1 = wordcount1 %>% 
  mutate(rank = row_number(),
         `term frequency` = Count)

Similarly to word frequency, we create a rank for each word based on how often it’s used within each group [ConvoId] or in the entire set. Creating a rank is rather simple, as count(sort = TRUE) organizes the data by frequency automatically. We also create a second table, freq_by_rank1, to measure the rank of words across the entire dataset, not just within ConvoId groups.

So what can we make with this?

# Zipf's law plot - group by ConvoId
freq_by_rank %>%
  ggplot(aes(rank, `term frequency`, color = ConvoId)) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = 'Zipf\'s law in action',subtitle = 'There are over 1,000 distinct conversations visualized here')

This graph draws a line for each conversation. It is mostly unreadable [there are 1,017 distinct conversations], but it shows that every conversation demonstrates the inverse relationship Zipf described. Here is a more readable version.

# Zipf's law plot - corpus
freq_by_rank1 %>% 
  ggplot(aes(rank, `term frequency`)) + 
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(title = 'Zipf\'s law in action', subtitle = 'Zipf\'s law says there is an inverse term frequency-rank relationship')

# very straight

This plot of the entire corpus is nearly perfect. With log-scaled x and y axes, a nearly linear relationship appears in the data. There is a bit of bowing at the tails: the very highest- and lowest-ranked words occur a little less often than a strict power law would predict. We’re going to take this idea a little further with tf-idf.
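
Before moving on, we can quantify that near-linearity: regressing log frequency on log rank estimates the exponent of the power law. This is a quick sketch; the rank window used to trim the bowed tails is my own choice.

# estimate the Zipf exponent on the corpus-level ranks
# a slope near -1 is the classic Zipf result
freq_by_rank1 %>%
  filter(rank > 10, rank < 1000) %>% # trim the bowed tails before fitting
  lm(log10(`term frequency`) ~ log10(rank), data = .) %>%
  tidy()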

Term Frequency - Inverse Document Frequency

Term frequency - inverse document frequency is a statistic intended to measure the relative importance of a word within a document, compared to similar documents. This method is very common in search engines, efficiently sorting out which words are common across all websites and which are common only on certain pages. \[ tf \times \log\left(\frac{N}{df}\right)\] where tf is term frequency, df is document frequency, and N is the number of documents. Term frequency is simple frequency within the document, and inverse document frequency is the logarithm of the number of documents over the number of documents containing the term.
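
To make the formula concrete, here is a hand computation on a tiny made-up corpus [the document names and counts are invented for illustration]. tidytext’s bind_tf_idf(), which we use below, automates exactly this:

# hand-computed tf-idf on a made-up two-document corpus
toy = tribble(
  ~doc, ~word,      ~n,
  "a",  "children",  5,
  "a",  "sir",       1,
  "b",  "children",  4,
  "b",  "rupees",    2
)

toy %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%    # term frequency within each document
  group_by(word) %>%
  mutate(idf = log(2 / n())) %>% # n() = number of documents containing the word
  ungroup() %>%
  mutate(tf_idf = tf * idf)
# 'children' appears in both documents, so its idf is log(2/2) = 0 and its
# tf-idf is 0; bind_tf_idf(toy, word, doc, n) reproduces the same columns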

We can apply this method to dig deeper into the data. We have already looked at which words are the most common by role and across the entire body of work. Now we can look at how common certain words are within each conversation.

To create a ranked set, we use ConvoId as a grouping factor, or document.

# rank grouped by ConvoId
ranktf = freq_by_rank %>% 
  bind_tf_idf(Word,ConvoId,Count)

# checking her out - maximize tf_idf
ranktf %>%
  select(-rank) %>%
  arrange(desc(tf_idf))
## # A tibble: 168,548 x 8
## # Groups:   ConvoId [1,017]
##    ConvoId             Word    Count ctotal `term frequency`     tf   idf tf_idf
##    <chr>               <chr>   <int>  <int>            <dbl>  <dbl> <dbl>  <dbl>
##  1 BAD_20180827-02031~ sir         4     98           0.0408 0.0408  4.09  0.167
##  2 20180824-024455_40~ rupees      4    168           0.0238 0.0238  6.92  0.165
##  3 BAD_20180827-02031~ r           3     98           0.0306 0.0306  5.13  0.157
##  4 BAD_20180827-02552~ u           5    109           0.0459 0.0459  3.34  0.153
##  5 20180827-012328_10~ sir         7    194           0.0361 0.0361  4.09  0.148
##  6 20180824-015542_82~ 500         5    167           0.0299 0.0299  4.62  0.138
##  7 20180807-033356_73~ u           5    122           0.0410 0.0410  3.34  0.137
##  8 BAD_20180827-02031~ u           4     98           0.0408 0.0408  3.34  0.136
##  9 BAD_20180827-02040~ insura~     3    116           0.0259 0.0259  5.13  0.133
## 10 20180831-191711_71~ gail       12    568           0.0211 0.0211  5.83  0.123
## # ... with 168,538 more rows

Let’s talk about bind_tf_idf(). It takes the arguments term, document, n and creates three variables: tf, idf, and tf_idf. tf [term frequency] is a measure of how frequently the word appears in the document: how common is a word within a given conversation? idf [inverse document frequency] measures how infrequently a word appears in the entire body of text [all documents]. tf-idf is then the product of those two measures, and it accounts for the frequency of a word within a conversation relative to the frequency of that word across all conversations.

# grouped by role - top 10 words by role
fulltidy %>% 
  count(Role,Word, sort = TRUE) %>% 
  bind_tf_idf(Word, Role,n) %>% 
  arrange(-tf_idf) %>% 
  group_by(Role) %>% 
  slice_max(tf_idf, n = 10)
## # A tibble: 26 x 6
## # Groups:   Role [2]
##    Role      Word             n       tf   idf    tf_idf
##    <fct>     <chr>        <int>    <dbl> <dbl>     <dbl>
##  1 Persuader 2018           120 0.00114  0.693 0.000791 
##  2 Persuader reportedly      82 0.000780 0.693 0.000541 
##  3 Persuader intensifying    73 0.000695 0.693 0.000481 
##  4 Persuader essential       45 0.000428 0.693 0.000297 
##  5 Persuader tangible        23 0.000219 0.693 0.000152 
##  6 Persuader preventable     22 0.000209 0.693 0.000145 
##  7 Persuader snack           22 0.000209 0.693 0.000145 
##  8 Persuader incoming        18 0.000171 0.693 0.000119 
##  9 Persuader ct              17 0.000162 0.693 0.000112 
## 10 Persuader smallest        15 0.000143 0.693 0.0000989
## # ... with 16 more rows

The persuader role has the largest tf-idf values, for ‘2018’, ‘reportedly’, and ‘intensifying’: words that appear fairly often in persuader turns and essentially never in persuadee turns [with only two ‘documents’, idf is nonzero only for words one role never says]. What if we want to look at which conversations are marked by important words specific to the Role?

# create tf-idf values for each individual (ConvoId by Role)
fulltidy %>% 
  unite('ConvoId', ConvoId,Role,sep = '--',remove = FALSE) %>% 
  count(ConvoId,Word, sort = TRUE) %>% 
  bind_tf_idf(Word, ConvoId,n) %>% 
  arrange(-tf_idf) %>% 
  separate(ConvoId,into = c('ConvoId','Role'),sep = '--') %>% 
  filter(n > 10) %>% ## the difference between this
  group_by(Role) %>% 
  slice_max(tf_idf, n = 10)
## # A tibble: 12 x 7
## # Groups:   Role [2]
##    ConvoId                  Role      Word         n     tf   idf tf_idf
##    <chr>                    <chr>     <chr>    <int>  <dbl> <dbl>  <dbl>
##  1 20180831-185408_698_live Persuadee children    13 0.0992 0.216 0.0214
##  2 20180825-092012_333_live Persuadee children    11 0.0719 0.216 0.0155
##  3 20180831-191711_71_live  Persuader gail        11 0.0444 6.23  0.276 
##  4 20180903-103939_512_live Persuader give        11 0.0940 1.35  0.127 
##  5 20180826-220528_999_live Persuader people      12 0.0432 1.34  0.0579
##  6 20180826-220528_999_live Persuader think       16 0.0576 0.898 0.0517
##  7 20180807-094907_712_live Persuader children    16 0.123  0.216 0.0266
##  8 20180824-122747_679_live Persuader children    14 0.116  0.216 0.0250
##  9 20180830-194021_927_live Persuader children    11 0.113  0.216 0.0245
## 10 20180807-213738_534_live Persuader children    13 0.113  0.216 0.0244
## 11 20180807-052301_924_live Persuader charity     11 0.0478 0.504 0.0241
## 12 20180827-021816_456_live Persuader children    17 0.110  0.216 0.0238
# create tf-idf values for each individual (ConvoId by Role)
convo_tf_idf = fulltidy %>% 
  unite('ConvoID', ConvoId,Role,sep = '--',remove = FALSE) %>% 
  count(ConvoID,Word, sort = TRUE) %>% 
  bind_tf_idf(Word,ConvoID,n) %>% 
  arrange(-tf_idf) %>% 
  separate(ConvoID,into = c('ConvoId','Role'),sep = '--') %>% 
  group_by(Role) %>% 
  slice_max(tf_idf, n = 10)

The code for both of these is very similar. Essentially, it unites ConvoId and Role into one column, runs bind_tf_idf(), and then splits them back apart [bind_tf_idf() takes only one document variable, so unite() makes this possible].

Ah, here we see our favorite words ‘children’ and ‘charity’. This is completely in line with what we saw from the simple frequency plots. But not everyone’s distinguishing conversation marker can be ‘children’, can it? Another observation is our counts: these words all have very low counts, fewer than 20 occurrences for each ‘distinguishing’ term. What’s going on here?

# grouping tf_idf by Role
role_words = convo_tf_idf %>% 
  group_by(Word) %>% 
  filter(sum(n) > 10) %>% ## only keep words used over 10 times total
  ungroup() %>% 
  group_by(Role,Word) %>% ## most distinct words by role
  summarise(tf_idf = mean(tf_idf)) %>% ## mean distinctive
  group_by(Role) %>% 
  slice_max(tf_idf,n = 15)

# plot tf_idf by Role
role_words %>% 
  ggplot(aes(tf_idf,Word, fill = Role)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Role, scales = "free_y") +
  labs(title = "Highest tf-idf words by Role", y = NULL, subtitle = " 'u sir' is not a vary good description of our dataset")

We’re left with a confusing plot. It has two words, ‘u’ and ‘sir’, claiming that the persuader used ‘sir’ more often than ‘u’ and that the persuadee used them about equally. This is obviously meaningless, but tf-idf is a widely celebrated technique. Why did it fail?

tf-idf, or term frequency-inverse document frequency, requires large enough documents, and in our case it was an ineffective approach, as the documents (individual conversations) had too few words in them. At an average of [avg words per ConvoId], there are not enough unique words to distinguish which words are valuable and which are not. Some words had such low counts (typos, idiosyncratic speaking styles) that tf-idf could not tell typos apart from genuinely important words.
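
That average is easy to compute directly from the tidied data [I’ll leave the number for the reader to run]:

# average number of (non-stopword) words per conversation
fulltidy %>%
  count(ConvoId, name = "words") %>%
  summarize(mean_words = mean(words))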

Methodologically, this makes sense. Each conversation has a common goal, and while there are different persuasion styles, tf-idf will only tell us which words stand out in each conversation. The conversations were not particularly specialized, so the most we could use tf-idf for is observing surface-level differences between larger groupings of speakers, such as religion, race, or personality trait.

The mark of a good programmer is being able to use her tools. The mark of a great one is knowing each tool’s limits and finding an alternative path. The next approach is log-odds.

Log-Odds

Let’s try log-odds. Log-odds is the logarithm of the probability of success over the probability of failure. Here we define success as a word appearing in a document and failure as the word not appearing in a document.

\[ \text{log odds} = \log{\frac{p}{1 - p}} \] Similarly to tf-idf, we measure the frequency of the term, but in log-odds it is the frequency of the term over the infrequency [the frequency of everything else] of the term. Instead of weighting by documents over document frequency, you weight by relative frequency. This keeps our most common words, ‘children’ and ‘save’, from falling through the cracks even though they’re important.
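
Before reaching for a package, here is what the unweighted version looks like by hand: a sketch with add-one smoothing to dodge log(0). tidylo’s bind_log_odds(), used below, goes further and shrinks these estimates with a prior.

# unweighted log odds of each word within each Role, with add-one smoothing
fulltidy %>%
  filter(!is.na(Role)) %>%
  count(Role, Word) %>%
  group_by(Role) %>%
  mutate(total = sum(n),
         log_odds = log((n + 1) / (total - n + 1))) %>% # log(p / (1 - p))
  ungroup() %>%
  arrange(-log_odds)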

With a slight modification, we can use log-odds to create a new frequency statistic.

# log-odds grouped by role
role_lo = fulltidy %>%
  filter(!is.na(Role)) %>%
  # count each Word by Role
  count(Word, Role, sort = TRUE) %>%
  # creates weighted log_odds by Role
  bind_log_odds(Role, Word, n) %>%
  # sort highest to lowest log_odds
  arrange(-log_odds_weighted) %>%
  # group by word so we can filter on total usage
  group_by(Word) %>%     
  # filter out words used 10 or fewer times in total
  filter(sum(n) > 10) %>%
  # ungroup by word
  ungroup %>%
  # regroup by Role
  group_by(Role) %>%
  # select the highest log odds
  slice_max(log_odds_weighted, n = 10)    ## most likely by role

Weighted log odds is a technique comparing the frequency of words in each grouping to their frequency in the text overall. It differs from term frequency-inverse document frequency: tf-idf compares raw word frequencies across documents to estimate importance, while the log-odds approach uses the logarithm of the odds, weighted across documents with Bayesian inference. The main difference is that rather than measuring simple frequency per Word per group, we measure the logarithm of the odds. What are we left with?

role_lo %>%
  ggplot(aes(log_odds_weighted, fct_reorder(Word, log_odds_weighted), fill = Role)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Role, scales = "free_y") +
  labs(title = "Log Odds: Role",y = NULL, subtitle = "Persuaders flexed their vocabulary. Persuadees agreed with them")

Now these are real words that we can interpret! Log-odds isn’t as dependent on raw frequency, so its weighting adjusts better for low counts. These plots look completely different: the persuaders tend towards specific, heavyweight vocabulary, while nearly all of the persuadees’ words are positively associated filler words, like ‘oh’, ‘ok’, ‘sure’. Let’s look at our four traits.

# log-odds by interest trait: religion, sex, ideology, role, education
# log-odds grouped by religion
religion_lo = fulltidy %>%
  filter(!is.na(religion)) %>%
  count(Word, religion, sort = TRUE) %>%
  bind_log_odds(religion, Word, n) %>% 
  arrange(-log_odds_weighted) %>%
  group_by(Word) %>%     ## only keep words used over 50 times total
  filter(sum(n) > 10) %>% # how robust is log-odds
  ungroup %>%
  group_by(religion) %>%
  slice_max(log_odds_weighted, n = 10)    ## most likely by religion

religion_lo %>%
  ggplot(aes(log_odds_weighted, fct_reorder(Word, log_odds_weighted),fill = religion)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~religion, scales = "free_y") +
  labs(title = "Log Odds: Religion", y = NULL, subtitle = "Distinct verbal differences seem to be apparent between every group")

Using log-odds to examine religion, we can learn a lot more about how people see the world. We have some interesting words, including ‘stc’, likely shorthand for the charity, Save the Children. Protestants also mention ‘starvation’ and ‘church’. The Catholics do not mention church; instead they seem to talk about themselves often with ‘iam’ and relate the cause to ‘veterans’. They also use ‘good’ and ‘super’ often. Atheists say ‘born’ and ‘rate’ more than any other words, and make reference to the ‘rohingya’ genocide. Other religions mention ‘foundations’ and ‘national’, seemingly referencing other aid campaigns. ‘disney’ also makes the list.

# log-odds grouped by sex
sex_lo = fulltidy %>%
  filter(!is.na(sex)) %>%
  count(Word, sex, sort = TRUE) %>%
  bind_log_odds(sex, Word, n) %>%
  arrange(-log_odds_weighted) %>%
  group_by(Word) %>%     ## only keep words used over 50 times total
  filter(sum(n) > 10) %>% # how robust is log-odds
  ungroup %>%
  group_by(sex) %>%
  slice_max(log_odds_weighted, n = 10)    ## most likely by sex

sex_lo %>%
  ggplot(aes(log_odds_weighted, fct_reorder(Word, log_odds_weighted),fill = sex)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sex, scales = "free") +
  labs(title = "Log Odds: Gender",y = NULL, subtitle = "Some crossovers between groups: women say 'wonderful' and men say 'ok'")

For gender, we can see that men are using some shorter words than women, and ‘hit’. Women are emphasizing ‘choose’. Something interesting is how ‘2’ makes the list for both groups, as do ‘going’ and ‘directly’. The other genders represented in our sample stay on par with the most popular words across the whole sample in terms of basic frequency.

# log-odds grouped by ideology
ideology_lo = fulltidy %>%
  filter(!is.na(ideology)) %>%
  count(Word, ideology, sort = TRUE) %>%
  bind_log_odds(ideology, Word, n) %>%
  arrange(-log_odds_weighted) %>%
  group_by(Word) %>%     ## only keep words used over 50 times total
  filter(sum(n) > 10) %>% # how robust is log-odds
  ungroup %>%
  group_by(ideology) %>%
  slice_max(log_odds_weighted, n = 10)    ## most likely by ideology

ideology_lo %>%
  ggplot(aes(log_odds_weighted, fct_reorder(Word, log_odds_weighted),fill = ideology)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ideology, scales = "free_y") +
  labs(title = "Log Odds: Political Ideology",y = NULL, subtitle = "Conservatives talk about flooding, liberals talk about mexico, moderates talk to each other")

With respect to political ideology, conservatives seem to be concerned with ‘flooding’, perhaps referring to natural disaster relief funds. They mention ‘church’ like the Protestants, but also ‘sleep’ and ‘dinner’. Moderates also have ‘iam’ [similar to the Catholics] but include ‘drinking’ and ‘virginia’. Liberals mention ‘mexico’ [the atheists do too] along with ‘corporations’ and ‘bot’.

# log-odds grouped by education
edu_lo = fulltidy %>%
  filter(!is.na(edu)) %>%
  count(Word, edu, sort = TRUE) %>%
  bind_log_odds(edu, Word, n) %>%
  arrange(-log_odds_weighted) %>%
  group_by(Word) %>%     ## only keep words used over 50 times total
  filter(sum(n) > 10) %>% # how robust is log-odds
  ungroup %>%
  group_by(edu) %>%
  slice_max(log_odds_weighted, n = 10) ## most likely by education

edu_lo %>%
  ggplot(aes(log_odds_weighted, fct_reorder(Word, log_odds_weighted), fill = edu)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~edu, scales = "free_y") +
  labs(title = "Log Odds: Educational Background",y = NULL, subtitle = "Education groups showed no overlap between common words")

Most words have a fairly uniform frequency within the groups, but it appears the postgraduates kept closely to the top words from the overall sample. The less-than-college group has ‘iam’ and ‘ur’, shorthand for I am and you are. ‘mexico’ also made a comeback [liberal/atheist], as did ‘iam’/‘ur’ [moderate]. The distinct words differ more between these groups than they did for the other traits, meaning this measure may be picking up differences in how people write depending on their level of education!

What else can we find out about who is saying which words?

Values Modelling

Another exploration we can do is into the likelihood of saying a certain word given a personal value: if you score high on a trait, how likely are you to say it?

For this last section, we don’t need to learn another technique (whew!), just some basic statistics and some darn good programming. We’re examining each component of the OCEAN big five model, which gives you a score in five categories: extroversion, agreeableness, conscientiousness, neuroticism, openness.

We’re going to go over each step of code here, and then reapply this same code to each value of the big five.

agree_freq <- fulltidy %>%
  filter(!is.na(agreeable)) %>%
  # creating a count of Word by agreeable score
  count(agreeable, Word) %>%
  # filling in implicit missing values
  complete(agreeable, Word, fill = list(n = 0)) %>%
  # grouping the values to transform by group
  group_by(agreeable) %>%
  # total word count within each score group
  mutate(value_total = sum(n)) %>%
  # ungrouping to transform by observation
  ungroup() %>%
  # grouping values to transform by Word
  group_by(Word) %>%
  # keep only words with more than 50 total occurrences
  filter(sum(n) > 50) %>%
  # ungrouping to transform by observation
  ungroup()

So we start by taking the fulltidy set we’ve been using [one word per row] and counting by agreeable score and word, making a dataset with one row per word per score. We then fill in every combination that may not exist, to make sure there is a row for every agreeable score for every word in the set. We group_by() agreeable score to total the number of words at each score and then ungroup(). Last, we regroup by Word, and filter out words with 50 or fewer occurrences, creating agree_freq.
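
complete() is doing quiet but important work there. On a made-up toy tibble, you can watch it turn implicit missing combinations into explicit zero rows:

# toy illustration of complete(): the missing (2, "hi") combination
# becomes an explicit row with n = 0
tibble(score = c(1, 1, 2),
       word  = c("hi", "yes", "yes"),
       n     = c(3, 5, 2)) %>%
  complete(score, word, fill = list(n = 0))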

Next we create a little regression for each word, based on the individual value scores. We are using a binomial regression, indicated by glm(cbind(n,value_total) ~ agreeable, family = "binomial"). This just means our regression outputs are likelihoods of success [the likelihood of a word appearing, p]. (Strictly speaking, the second column of cbind() should be the failure count, value_total - n; since each word is a tiny fraction of the total, using the total is a harmless approximation here.)
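
To see what one of these little models looks like, here is the fit for a single word on its own [the word ‘yes’ is just an illustrative choice]:

# fit the per-word model for one illustrative word
agree_freq %>%
  filter(Word == "yes") %>%
  glm(cbind(n, value_total) ~ agreeable, data = ., family = "binomial") %>%
  tidy()
# the "agreeable" row of this output is the slope we keep for every word below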

agree_slopes <- agree_freq %>%
  # making little tibbles in a tibble
  nest(tiny_tibbles = c(agreeable, n, value_total)) %>%
  # fitting a binomial regression for each word's counts against agreeable score
  mutate(models = map(tiny_tibbles, ~glm(cbind(n, value_total) ~ agreeable, data = ., family = "binomial"))) %>%
  # tidy each model into a tibble of coefficients
  mutate(models = map(models, tidy)) %>%
  # drop tiny_tibbles
  select(-tiny_tibbles) %>%
  # unnest the coefficient tibbles into regular columns
  unnest(models) %>%
  # selecting the estimates for agreeable effect
  filter(term == "agreeable") %>%
  # adjusting the p values
  mutate(p.value = p.adjust(p.value))

We pull the regression results back out of the nested tibbles by unnesting the models. We filter on the term ‘agreeable’ to get the slope for each word, and adjust the p-values so they can be compared.

To put it into a nice looking plot, we take the word slopes with the smallest p-values [the least likely to be coincidental], join back to agree_freq to recover the counts [n and value_total], and plot word frequency against agreeableness score.

agree_slopes %>%
  # selection criteria: the word slopes with the smallest p.values
  slice_min(p.value, n = 9) %>%
  # join the counts back in for plotting
  inner_join(agree_freq) %>%
  ## plotting the value slopes
  ggplot(aes(agreeable, n / value_total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~Word) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "The Big Five: Agreeable",x = "Agreeableness score", y = "Word frequency",subtitle = "Agreeable people are more likely to say donate and yes")

And now we can see the likelihood of saying a word as it relates to how agreeable someone is! People who are very agreeable are more likely to say ‘yes’. People who are less agreeable are more likely to say ‘hey’. We now know something about the words and the sorts of personalities that are more or less likely to use them.

You can apply this to any of the personality values in the dataset, including the Schwartz portrait values. There are some other interesting qualities to examine in the big five. Let’s look at conscientiousness.

conscience_freq <- fulltidy %>%
  filter(!is.na(conscientious)) %>%
  count(conscientious, Word) %>%
  complete(conscientious, Word, fill = list(n = 0)) %>%
  group_by(conscientious) %>%
  mutate(value_total = sum(n)) %>%
  ungroup() %>%
  group_by(Word) %>%
  filter(sum(n) > 50) %>%
  ungroup()

conscience_slopes <- conscience_freq %>%
  nest(tiny_tibbles = c(conscientious, n, value_total)) %>%
  mutate(models = map(tiny_tibbles, ~glm(cbind(n, value_total) ~ conscientious, data = ., family = "binomial"))) %>%
  mutate(models = map(models, tidy)) %>%
  select(-tiny_tibbles) %>%
  unnest(models) %>%
  filter(term == "conscientious") %>%
  mutate(p.value = p.adjust(p.value))

conscience_slopes %>%
  slice_min(p.value, n = 9) %>%
  inner_join(conscience_freq) %>%
  ggplot(aes(conscientious, n / value_total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~Word) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "The Big Five: Conscientiousness",x = "Conscientiousness score", y = "Word frequency", subtitle = "Conscientious people are more likely to say donate and know less but yes more")

Conscientiousness seems to be related to saying ‘yeah’ less but ‘yes’ more. It also shows a rather large negative relationship for ‘donate’ and ‘know’. Next comes extroversion.

extrovert_freq <- fulltidy %>%
  filter(!is.na(extrovert)) %>%
  count(extrovert, Word) %>%
  complete(extrovert, Word, fill = list(n = 0)) %>%
  group_by(extrovert) %>%
  mutate(value_total = sum(n)) %>%
  ungroup() %>%
  group_by(Word) %>%
  filter(sum(n) > 50) %>%
  ungroup()

extrovert_slopes <- extrovert_freq %>%
  nest(tiny_tibbles = c(extrovert, n, value_total)) %>%
  mutate(models = map(tiny_tibbles, ~glm(cbind(n, value_total) ~ extrovert, data = ., family = "binomial"))) %>%
  mutate(models = map(models, tidy)) %>%
  select(-tiny_tibbles) %>%
  unnest(models) %>%
  filter(term == "extrovert") %>%
  mutate(p.value = p.adjust(p.value))

#extrovert_slopes %>%
#  slice_min(p.value, n = 9) %>%
extrovert_slopes %>%
  arrange(p.value) %>% head(n=9) %>%   
  inner_join(extrovert_freq) %>%
  ggplot(aes(extrovert, n / value_total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~Word) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "The Big Five: Extroversion", x = "Extroversion score",
       y = "Word frequency", subtitle = "Extroverted people are less likely to say choose, much more likely to say time")

Important note: slice_min() keeps ties, and because the adjusted p-values from the ninth smallest onward are all 1, slice_min(p.value, n = 9) would return far more than nine words here. Instead we arrange() by p-value and take the first nine with head(). As we can see, extroverts are less likely to say many of these words, including ‘bad’ and ‘point’, but much more likely to say ‘time’ and ‘weekend’.
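
An equivalent fix, assuming dplyr 1.0 or later, is to turn off tie expansion directly:

# the same nine-word selection without tie expansion
extrovert_slopes %>%
  slice_min(p.value, n = 9, with_ties = FALSE) %>%
  inner_join(extrovert_freq)

Now for neuroticism.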

neurotic_freq <- fulltidy %>%
  filter(!is.na(neurotic)) %>%
  count(neurotic, Word) %>%
  complete(neurotic, Word, fill = list(n = 0)) %>%
  group_by(neurotic) %>%
  mutate(value_total = sum(n)) %>%
  ungroup() %>%
  group_by(Word) %>%
  filter(sum(n) > 50) %>%
  ungroup()

neurotic_slopes <- neurotic_freq %>%
  nest(tiny_tibbles = c(neurotic, n, value_total)) %>%
  mutate(models = map(tiny_tibbles, ~glm(cbind(n, value_total) ~ neurotic, data = ., family = "binomial"))) %>%
  mutate(models = map(models, tidy)) %>%
  select(-tiny_tibbles) %>%
  unnest(models) %>%
  filter(term == "neurotic") %>%
  mutate(p.value = p.adjust(p.value))

neurotic_slopes %>%
  slice_min(p.value, n = 9) %>%
  inner_join(neurotic_freq) %>%
  ggplot(aes(neurotic, n / value_total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~Word) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "The Big Five: Neuroticsm", x = "Neuroticsm score",
       y = "Word frequency", subtitle = "Neuroticism is associated with lower likelihood of saying help and big")

People with lower neuroticism scores are more likely to say ‘big’ but less likely to say ‘sorry’ or ‘oh’. They’re also more likely to say ‘hits’. And last, openness.

open_freq <- fulltidy %>%
  filter(!is.na(open)) %>%
  count(open, Word) %>%
  complete(open, Word, fill = list(n = 0)) %>%
  group_by(open) %>%
  mutate(value_total = sum(n)) %>%
  ungroup() %>%
  group_by(Word) %>%
  filter(sum(n) > 50) %>%
  ungroup()

open_slopes <- open_freq %>%
  nest(tiny_tibbles = c(open, n, value_total)) %>%
  mutate(models = map(tiny_tibbles, ~glm(cbind(n, value_total) ~ open, data = ., family = "binomial"))) %>%
  mutate(models = map(models, tidy)) %>%
  select(-tiny_tibbles) %>%
  unnest(models) %>%
  filter(term == "open") %>%
  mutate(p.value = p.adjust(p.value))

open_slopes %>%
  slice_min(p.value, n = 9) %>%
  inner_join(open_freq) %>%
  ggplot(aes(open, n / value_total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~Word) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "The Big Five: Openness",x = "Openess score",y = "Word frequency", subtitle = "Open people are less likely to say ok and yes")

People who score high on openness seem less likely to say a number of words, including ‘yes’ [which also appeared for agreeableness]. Each personality trait shows a completely different relationship to the words in the dataset. This could indicate that personality traits would make good predictors in a model of who says what.

Looking at the plots, many of these relationships appear flat. However, this is because every panel shares a y-axis scaled to the largest effect. The other panels still show significant effects, but some are milder than others [e.g. a 0.25% difference in likelihood rather than a 3% difference, though both can be strong effects].
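
To see those milder effects, a quick tweak is to free the y axis per panel, re-drawing the openness plot with scales = "free_y":

# re-plot openness with per-panel y scales so milder slopes are visible
open_slopes %>%
  slice_min(p.value, n = 9) %>%
  inner_join(open_freq) %>%
  ggplot(aes(open, n / value_total)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~Word, scales = "free_y") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "The Big Five: Openness (free y scales)",
       x = "Openness score", y = "Word frequency")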

Summary

For as many ways as we explored text in this project, we didn’t even touch on topic modelling or prediction. Primarily, we looked at the different ways that text interacted with different traits of our participants: ideology, gender, openness to new experiences. There are plenty of ways to analyze text and find something meaningful in it, and our two-part exploration was certainly not exhaustive.

References

Wang, Xuewei, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu. 2019. “Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good.” arXiv Preprint arXiv:1906.06725.