Analyzing unstructured texts

In the last tutorial we extracted information from text using patterns. In this tutorial we go a step further and explore meaningful patterns in text using word frequencies and the sentiments expressed in those texts. We do that by:

  • tokenizing text and removing stop words,
  • reducing words to their base forms through stemming and lemmatization,
  • extracting n-grams, and
  • performing sentiment analysis.

These techniques can be applied to understand large bodies of text in articles, blogs, social media, planning documents, and more. However, we should also note the limitations of analyzing text quantitatively: it helps us make sense of large volumes of material, but it trades away the nuance offered by qualitative analysis.

Background

We are going to analyze emails from a listserv (Cohousing-L) that focuses on cohousing. Cohousing is an intentional community of private homes clustered around shared space, with shared norms about voluntary contributions, management, and governance structures. US cohousing communities often comprise both rental and owner-occupied units; they are frequently multi-generational; they leverage existing legal structures, most often the homeowners association (HOA), though the lived experience is often very different from that of a conventional HOA; and they reflect a diversity of housing types, including apartment buildings, side-by-side duplexes and row homes, and detached single-family units.

The dataset was created by scraping messages from the listserv.

library(ids) #used to create random ids
library(tidyverse)
library(textclean)

Read in the data and explore what the texts look like.

msgs <- read_csv("./cohousingemails/cohousing_emails.csv")
Rows: 45000 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): subject, author, email, msg_body, thread, content
dttm (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The next step is cleaning the text. We use what we learned yesterday to extract email addresses from the message content, remove redundant whitespace, parse dates, and convert some fields to lower case. The following code takes a while to run; in the meantime you can work on the exercise below.

msgs2 <- msgs %>%
  mutate_all(as.character) %>% ## I (Will) added this because I was getting an error stemming from read_csv() returning all factor variables. 
  filter(!is.na(content)) %>%
  mutate(
    msg_id = random_id(n = nrow(.)), #  Create a random ID
    email = case_when(
      is.na(email) ~ content %>% 
        str_match("\\(.*\\..*\\)") %>%
        str_sub(2,-2),
      T ~ email
    ),
    content = content %>%
      str_replace_all("\\\n", " ") %>%
      str_squish(),

# Note that this section takes a long time; be patient. It may make sense to save intermediate steps
# instead of sequencing one long chain of pipes.

    msg_body = msg_body %>% 
                stringi::stri_trans_general("Latin-ASCII") %>%
                replace_html() %>%
                replace_emoticon() %>%
                replace_time(replacement = '<<TIME>>') %>%
                replace_number(remove = TRUE) %>%
                replace_url() %>%
                replace_tag() %>%
                replace_email(),
    
    date = as.POSIXct(date),
    
    date = case_when(
      is.na(date) ~ str_match(
        content,
        "[0-9]{1,2} [A-Za-z]{3} [0-9]{2,4} [0-9]{2}:[0-9]{2}"
      ) %>%
      lubridate::dmy_hm(),
      T ~ date
    ),
    author = author %>% tolower %>% str_replace_all("[^a-z]", " "),
    email = email %>% tolower %>% str_replace_all("[^a-z\\.@_\\d ]", "")
  )
Exercise
  • Describe how the data is cleaned at each step. Hint: use the R documentation or a web search to find out what each function is doing.

Now that we have a cleaner dataset, we can use the data for analysis.

Text preprocessing

Our data is not quite ready for analysis yet. The data needs preprocessing: transforming unstructured text data into a structured and standardized format that is easier for machines to understand and process.

Tokenization

This is typically the first step in text preprocessing. It is the process of splitting text into individual words, phrases, symbols, or other meaningful elements called tokens, which become the input for further processing and analysis. Tokenization gives machines discrete units through which to capture the context and semantics of a text. For instance, “I love ice cream” can be tokenized into “I”, “love”, “ice”, “cream”.

library(tidytext)
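
To see what tokenization produces, here is a minimal sketch (separate from the email analysis) that tokenizes the example sentence from above; note that unnest_tokens() lowercases its tokens by default.

tibble(text = "I love ice cream") %>%
  unnest_tokens(word, text)
# one row per token: "i", "love", "ice", "cream"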

A lot of text analysis comes down to counting word frequencies. Some words carry less meaning than others, such as “are” and “does”; we call these stop words.

data(stop_words)

stop_words
# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# ℹ 1,139 more rows

We can also add custom stop words. For example, when analyzing a document on energy planning, we might treat “electricity” as a stop word because we expect it to occur frequently without adding much unique value.

other_stop_words <- tibble(
  word = c(
    "cohousing",
    "mailing",
    "list",
    "unsubscribe",
    "mailman",
    "listinfo",
    "º",
    "org",
    "rob",
    "ann",
    "sharon",
    "villines",
    "sandelin",
    "zabaldo",
    "fholson"
  ),
  lexicon = "CUSTOM"
)

stop_words <- bind_rows(stop_words, other_stop_words)

Now, we can tokenize and remove stop words.

body_tokens <- msgs2 %>%
  unnest_tokens(word, msg_body, token='words') %>%
  anti_join(stop_words)
Joining with `by = join_by(word)`

From the tokens, we can count the top words, i.e. the most frequently occurring words.

top_word_counts <- body_tokens %>%
  filter(
    !str_detect(word, "\\d"),
    !str_detect(word, "_")
  ) %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
head(top_word_counts)
# A tibble: 6 × 2
  word      count
  <chr>     <int>
1 community 74963
2 people    50019
3 skeptical 47557
4 time      46116
5 sticking  36636
6 tongue    36570

Creating a word cloud

library(ggwordcloud)
# Create word cloud with top 15 most frequent words using ggplot2
ggplot(top_word_counts %>% head(n=15), aes(label = word, size = count)) +
  geom_text_wordcloud_area() +
  scale_size_area(max_size = 15) +
  theme_minimal()

Exercise
  • The above code only uses the top 15 results. Explore the top 25 words. Which of them do you think we could add to the list of custom stop words?
  • Did you expect to see “sticking” and “tongue” there? Why might they appear? (Hint: recall what replace_emoticon() does during cleaning.)

Stemming

This is the process of reducing a word to its base or root form. For example, the stem of the words “jumps”, “jumped”, and “jumping” is “jump”. This helps in reducing the corpus of words the model needs to know. Stemming can help with information retrieval tasks and text classification where the grammatical variations of the words aren’t significant.

library(SnowballC)

top_stem_counts <- body_tokens %>%
  select(word) %>%
  mutate(stem_word = wordStem(word)) %>%
  group_by(stem_word) %>%
  summarise(count = n()) %>%
  select(word = stem_word, count) %>%
  arrange(desc(count))
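
As a quick check of what the stemmer does to the example words above, here is a minimal sketch:

wordStem(c("jumps", "jumped", "jumping"))
# returns "jump" "jump" "jump"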

Lemmatization

Similar to stemming, lemmatization also reduces words to their base form, but unlike stemming it maps each word to its morphological base (its lemma), which is a linguistically valid word. For example, the word “better” after lemmatization becomes “good”. This is particularly useful when we want to preserve the semantic meaning of words in our analysis.

library(textstem)
Loading required package: koRpus.lang.en
Loading required package: koRpus
Loading required package: sylly
For information on available language packages for 'koRpus', run

  available.koRpus.lang()

and see ?install.koRpus.lang()

Attaching package: 'koRpus'
The following object is masked from 'package:readr':

    tokenize
top_lemm_counts <- body_tokens %>%
  select(word) %>%
  mutate(lemm_word = lemmatize_words(word)) %>%
  group_by(lemm_word) %>%
  summarise(count = n()) %>%
  select(word = lemm_word, count) %>%
  arrange(desc(count))
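
Similarly, here is a minimal sketch of what the lemmatizer does to a few individual words; the exact mappings depend on textstem's default lexicon.

lemmatize_words(c("better", "running", "mice"))
# with the default lexicon this maps to roughly "good" "run" "mouse"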

g1 <- top_lemm_counts %>%
  top_n(30) %>%
  ggplot() +
  geom_bar(aes(x = reorder(word, count), y = count), stat = 'identity') +
  coord_flip() +
  xlab("") +
  theme_bw()
Selecting by count
Exercise

Think about the potential limitations of lemmatization. When could the results of lemmatization be misleading? You can search the web for answers.

An alternative to the word cloud is a simple bar chart.

g1

N-grams

N-grams are a popular technique in Natural Language Processing (NLP) and computational linguistics. They represent a contiguous sequence of ‘n’ items from a given text or speech. In the context of language modeling, the ‘items’ usually refer to words, but they could also refer to characters, phonemes, syllables, etc.

The ‘n’ in n-gram refers to the number of grouped words. Here are the different types of n-grams:

  • A 1-gram (or unigram) is a single word. For example, in the sentence “I love to play soccer”, the unigrams would be “I”, “love”, “to”, “play”, “soccer”.

  • A 2-gram (or bigram) is a sequence of two words. Using the same sentence, the bigrams would be: “I love”, “love to”, “to play”, “play soccer”.

  • A 3-gram (or trigram) is a sequence of three words. For the given sentence, the trigrams would be: “I love to”, “love to play”, “to play soccer”.

And so on: an n-gram is a sequence of ‘n’ words (see the short sketch below).
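
As a minimal illustration (separate from the email analysis), we can generate bigrams from the example sentence with tidytext; as with word tokens, unnest_tokens() lowercases the output by default.

tibble(text = "I love to play soccer") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# yields "i love", "love to", "to play", "play soccer"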

The primary use of n-grams in NLP is language modeling and text prediction. For instance, when you type on your smartphone and it suggests the next word, those predictions can be made with n-gram models. N-grams are also used in machine translation, speech recognition, information retrieval, and various other applications.

bigrams <- msgs2 %>%
  select(-thread) %>%
  unnest_tokens(
    bigram,
    msg_body,
    token = "ngrams",
    n = 2
  )

negation_words <- c(
  "not",
  "no",
  "never", 
  "without",
  "don't",
  "cannot",
  "can't",
  "isn't",
  "wasn't",
  "hadn't",
  "couldn't",
  "wouldn't",
  "won't"
)

modified_stops <- stop_words %>%
  filter(!(word %in% negation_words))

refined_bigrams <- bigrams %>%
  separate(bigram, c("word1", "word2")) %>%
  filter(
    !word1 %in% modified_stops$word,
    !word2 %in% modified_stops$word
  ) %>%
  mutate(lemm_word1 = lemmatize_words(word1),
         lemm_word2 = lemmatize_words(word2))
Warning: Expected 2 pieces. Additional pieces discarded in 472936 rows [21, 22, 63, 64,
68, 69, 71, 72, 235, 236, 238, 334, 335, 347, 348, 354, 355, 380, 381, 411,
...].
refined_bigrams <- refined_bigrams %>%
  count(lemm_word1, lemm_word2, sort = T) %>%
  unite(bigram, lemm_word1, lemm_word2, sep = " ")

negation_words is a vector of words that express negation. These words are often important in sentiment analysis because they can substantially change the meaning of a phrase.

Exercise

Plot refined_bigrams as a horizontal bar chart using ggplot2.

Sentiment analysis

Sentiment analysis is a technique for assessing the sentiment of a given text. Sentiment could be a binary value (positive or negative) or a value on a spectrum (for example, -1 to +1).

Sentiment analyses use a reference that associates words with emotions, often known as a word-emotion association lexicon. For example, “grateful” is positive and “hateful” is negative. A lexicon can also represent each word's emotion associations numerically.

In the following code, we join our dataframe of words with a lexicon, group by message, and count the number of words associated with each sentiment. The ultimate goal is not to calculate an exact sentiment value for each text, but to find a meaningful relative proxy for sentiment.

library(textdata)
nrc_sentiment <- get_sentiments('nrc')
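
To get a feel for the lexicon's structure before applying it to the full token table, here is a minimal sketch; the specific words are just illustrative.

tibble(word = c("grateful", "hateful", "lunch")) %>%
  inner_join(nrc_sentiment, by = "word")
# each word found in the lexicon gets one row per associated sentiment;
# words absent from the lexicon are dropped by the inner join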

body_tokens %>%
  select(word, msg_id) %>%
  mutate(lemm_word = lemmatize_words(word)) %>%
  inner_join(nrc_sentiment, by = c('lemm_word' = 'word')) %>%
  group_by(msg_id, sentiment) %>%
  summarize(count = n()) %>%
  mutate(freq = count / sum(count)) %>%
  pivot_wider(id_cols = msg_id, values_from = freq, names_from = sentiment, values_fill = 0) %>%
  top_n(5)
Warning in inner_join(., nrc_sentiment, by = c(lemm_word = "word")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 8 of `x` matches multiple rows in `y`.
ℹ Row 6664 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
`summarise()` has grouped output by 'msg_id'. You can override using the
`.groups` argument.
Selecting by surprise
# A tibble: 44,609 × 11
# Groups:   msg_id [44,609]
   msg_id    anticipation disgust    joy negative positive  trust sadness  anger
   <chr>            <dbl>   <dbl>  <dbl>    <dbl>    <dbl>  <dbl>   <dbl>  <dbl>
 1 00002e86…       0.182   0.0303 0.152    0.121     0.303 0.212   0      0     
 2 0002eb25…       0.147   0      0.147    0.0588    0.471 0.147   0.0294 0     
 3 00039c32…       0.222   0.0222 0.0444   0.0889    0.311 0.111   0.0444 0.0222
 4 0008861e…       0.149   0      0.0851   0.213     0.213 0.149   0.0638 0.0213
 5 0008ae89…       0.0648  0.0278 0.0648   0.0833    0.361 0.231   0.0463 0.0463
 6 000998df…       0.126   0.0265 0.0795   0.106     0.311 0.139   0.0464 0.0795
 7 000ba801…       0.126   0      0.134    0.0756    0.345 0.160   0.0924 0     
 8 000fb81e…       0.2     0      0.1      0.0333    0.467 0.0667  0.0333 0.0333
 9 000fc05c…       0.0588  0      0.118    0.176     0.235 0.176   0.118  0.118 
10 00116fac…       0.128   0      0.128    0.0851    0.362 0.213   0      0     
# ℹ 44,599 more rows
# ℹ 2 more variables: fear <dbl>, surprise <dbl>

We can also calculate the sentiment of each author on particular dates. The values are polarity scores, typically ranging from about -1 (negative) to +1 (positive).

library(sentimentr)
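
Before applying this to the full dataset, here is a minimal sketch of how sentimentr scores text; the sentences are made up for illustration.

get_sentences("I really love this community. The meeting was not great.") %>%
  sentiment_by()
# returns element_id, word_count, sd, and ave_sentiment for each element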

msgs2 %>% 
  top_n(50) %>% 
  get_sentences() %>%
  sentiment_by(by=c('date', 'author')) %>% 
  top_n(20)
Selecting by msg_id
Selecting by ave_sentiment
                   date                  author word_count        sd
 1: 1998-05-19 20:43:00           judith linton        234 0.2584337
 2: 1999-07-26 00:00:00          graham meltzer        896 0.2697488
 3: 1999-09-21 04:00:00        michael mcintyre        208 0.3581448
 4: 2000-02-27 05:00:00                heidinys        110 0.2315481
 5: 2002-01-26 05:00:00           dave crawford       1532 0.1919800
 6: 2002-07-21 04:00:00        jim snyder grant        285 0.2612569
 7: 2003-10-30 05:00:00                 braford        276 0.1926719
 8: 2008-05-12 04:00:00         sharon villines        265 0.1884307
 9: 2010-01-21 05:00:00            alison etter        483 0.2133545
10: 2010-01-22 05:00:00             ann zabaldo        416 0.3644365
11: 2010-03-01 05:00:00            norman gauss        248 0.3178867
12: 2011-02-02 05:00:00               grace kim        528 0.2843765
13: 2012-08-27 04:00:00          marieke hensel        241 0.2577074
14: 2012-12-03 05:00:00             ruth hirsch        243 0.3530117
15: 2013-06-03 04:00:00             rob stewart         84 0.4922629
16: 2014-03-10 04:00:00             fern selzer        112 0.2807072
17: 2014-03-31 04:00:00          alan goldblatt        204 0.3660344
18: 2014-09-12 04:00:00            r n  johnson        137 0.4674799
19: 2015-03-07 05:00:00 pen sand   jim o connor        151 0.2672172
20: 2018-07-17 04:00:00            philip dowds       1044 0.3139954
    ave_sentiment
 1:     0.2266742
 2:     0.2106340
 3:     0.3379923
 4:     0.2363161
 5:     0.1830011
 6:     0.3952517
 7:     0.2676510
 8:     0.2483870
 9:     0.1936303
10:     0.4017483
11:     0.2711444
12:     0.3995262
13:     0.2295879
14:     0.2984803
15:     0.4654042
16:     0.4347890
17:     0.2673774
18:     0.2199707
19:     0.4174141
20:     0.2069616
Exercise
  • How do you interpret the general nature of communication on the listserv from these values? Write your interpretation after looking at all the values, not just the top 20.
  • The sd column is the standard deviation of the sentence-level sentiment values within each group. What do the values in sd tell you?

Sentiment analysis can be useful for gauging public opinion, market trends, consumer preferences, and public sentiment more broadly. However, it also has important limitations: it struggles with linguistic nuances such as sarcasm and irony, as well as with cultural differences, subjectivity, and emotional depth.

The analyst should be aware of both the strengths and the limitations.
