library(ids)       # used to create random ids
library(tidyverse)
library(textclean) # text replacement and cleaning helpers
Analyzing unstructured texts
In the last tutorial we extracted information from text using patterns. In this tutorial we go a step further and explore meaningful patterns in text using word frequencies and the sentiments expressed in those texts. We do that by:
- finding frequently occurring and co-occurring words
- analyzing the sentiment of texts
These techniques can be applied to understand large bodies of text in articles, blogs, social media, planning documents, and more. However, we should also be aware of the limitations of quantitative text analysis: it helps us make sense of large volumes of text, but it trades off against the nuance offered by qualitative analysis.
Background
We are going to analyze emails from a listserv (Cohousing-L) that focuses on cohousing. Cohousing is an intentional community of private homes clustered around shared space with some shared norms about voluntary contributions, management, and governance structures. US cohousing communities often comprise both rental and owner-occupied units; they frequently are multi-generational; they leverage existing legal structures, most often the homeowners association (HOA), but the lived experience is often very different from that of conventional HOAs; and they also reflect a diversity of housing types, including apartment buildings, side-by-side duplexes and row homes, and detached single-family units.
The dataset was created by scraping messages from the listserv.
Read in the data and explore what the texts look like.
msgs <- read_csv("./cohousingemails/cohousing_emails.csv")
Rows: 45000 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): subject, author, email, msg_body, thread, content
dttm (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The next part is cleaning the text. We use what we learned yesterday to extract email addresses from the email content, replace redundant spaces, format dates, and convert some text to lower case. The following code takes a while to run; in the meantime you can work on the following exercise.
msgs2 <- msgs %>%
mutate_all(as.character) %>% ## I (Will) added this because I was getting an error stemming from read_csv() returning all factor variables.
filter(!is.na(content)) %>%
mutate(
msg_id = random_id(n = nrow(.)), # Create a random ID
email = case_when(
is.na(email) ~ content %>%
str_match("\\(.*\\..*\\)") %>%
str_sub(2,-2),
T ~ email
),
content = content %>%
str_replace_all("\\\n", " ") %>%
str_squish(),
# Note that this section takes a long time; I recommend patience. It may make sense to save intermediate steps
# instead of sequencing a long chain of pipes.
msg_body = msg_body %>%
stringi::stri_trans_general("Latin-ASCII") %>%
replace_html() %>%
replace_emoticon() %>%
replace_time(replacement = '<<TIME>>') %>%
replace_number(remove = TRUE) %>%
replace_url() %>%
replace_tag() %>%
replace_email(),
date = as.POSIXct(date),
date = case_when(
is.na(date) ~ str_match(
content,
"[0-9]{1,2} [A-Za-z]{3} [0-9]{2,4} [0-9]{2}:[0-9]{2}"
) %>%
lubridate::dmy_hm(),
T ~ date
),
author = author %>% tolower %>% str_replace_all("[^a-z]", " "),
email = email %>% tolower %>% str_replace_all("[^a-z\\.@_\\d ]", "")
)
Now that we have a cleaner dataset, we can use the data for analysis.
Text preprocessing
Our data is not quite ready for analysis yet. It needs preprocessing: transforming unstructured text into a structured, standardized format that is easier for machines to understand and process.
Tokenization
This is typically the first step in text preprocessing. It is the process of splitting text into individual words, phrases, symbols, or other meaningful elements called tokens. These tokens become the input for other types of text processing or are used for further analysis. Tokenization helps machines understand the context and semantics of the text. For instance, “I love ice cream” can be tokenized into “I”, “love”, “ice”, “cream”.
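As a minimal sketch of what tokenization produces, we can run the toy sentence above through tidytext's unnest_tokens() (the toy data frame here is made up for illustration):
library(tidytext)

toy <- tibble(text = "I love ice cream")

toy %>%
  unnest_tokens(word, text, token = "words")
# returns one row per token; note that unnest_tokens() lowercases by default,
# so the tokens are "i", "love", "ice", "cream"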
library(tidytext)
A lot of text analysis involves counting the frequency of words. Some words are less meaningful than others, such as "are" and "does". We call those stop words.
data(stop_words)
stop_words
# A tibble: 1,149 × 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# ℹ 1,139 more rows
We can also add custom stop words. For example, when analyzing a document on energy planning, we might treat "electricity" as a stop word: we expect it to occur very frequently without providing much unique information.
other_stop_words <- tibble( word =
c("cohousing",
"mailing",
"list",
"unsubscribe",
"mailman",
"listinfo",
"list",
"º",
"org",
"rob",
"ann",
"sharon",
"villines",
"sandelin",
"zabaldo",
"fholson"),
lexicon = "CUSTOM")
stop_words <- bind_rows(stop_words, other_stop_words)
Now, we can tokenize and remove stop words.
body_tokens <- msgs2 %>%
unnest_tokens(word, msg_body, token='words') %>%
anti_join(stop_words)
Joining with `by = join_by(word)`
From the tokens, we can count the top words: the most frequently occurring words in the corpus.
top_word_counts <- body_tokens %>%
filter(
!str_detect(word,"\\d"),
!str_detect(word, "_")
) %>%
group_by(word) %>%
summarise(count = n()) %>%
select(word = word, count) %>%
arrange(desc(count))
head(top_word_counts)
# A tibble: 6 × 2
word count
<chr> <int>
1 community 74963
2 people 50019
3 skeptical 47557
4 time 46116
5 sticking 36636
6 tongue 36570
Creating a word cloud
library(ggwordcloud)
# Create word cloud with top 15 most frequent words using ggplot2
ggplot(top_word_counts %>% head(n=15), aes(label = word, size = count)) +
geom_text_wordcloud_area() +
scale_size_area(max_size = 15) +
theme_minimal()
Stemming
This is the process of reducing a word to its base or root form. For example, the stem of the words “jumps”, “jumped”, and “jumping” is “jump”. This helps in reducing the corpus of words the model needs to know. Stemming can help with information retrieval tasks and text classification where the grammatical variations of the words aren’t significant.
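As a quick sketch, SnowballC's wordStem() applied to the example words above illustrates the idea:
library(SnowballC)

# all three inflected forms reduce to the same stem, "jump"
wordStem(c("jumps", "jumped", "jumping"))
Note that stems are not always valid dictionary words, which is one reason to consider lemmatization (next).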
library(SnowballC)
top_stem_counts <- body_tokens %>%
select(word)%>%
mutate(stem_word = wordStem(word)) %>%
group_by(stem_word) %>%
summarise(count = n()) %>%
select(word = stem_word, count) %>%
arrange(desc(count))
Lemmatization
Similar to stemming, lemmatization also reduces words to their base form, but unlike stemming it maps each word to its dictionary form (lemma), which is a linguistically valid word. For example, the word "better" after lemmatization becomes "good". This is particularly useful when we want to preserve the semantic meaning of words in our analysis.
library(textstem)
Loading required package: koRpus.lang.en
Loading required package: koRpus
Loading required package: sylly
For information on available language packages for 'koRpus', run
available.koRpus.lang()
and see ?install.koRpus.lang()
Attaching package: 'koRpus'
The following object is masked from 'package:readr':
tokenize
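With textstem now loaded, a quick sketch on a few toy words illustrates the difference from stemming (the exact output depends on the default lemma dictionary, lexicon::hash_lemmas):
# inflected verb forms collapse to their base form, and irregular comparatives
# such as "better" should map back to "good" under the default dictionary
lemmatize_words(c("better", "ran", "running"))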
top_lemm_counts <- body_tokens %>%
select(word)%>%
mutate(lemm_word = lemmatize_words(word))%>%
group_by(lemm_word) %>%
summarise(count = n()) %>%
select(word = lemm_word, count) %>%
arrange(desc(count))
g1 <- top_lemm_counts %>%
top_n(30) %>%
ggplot() +
geom_bar(aes(x= reorder(word, count), y = count), stat = 'identity') +
coord_flip() +
xlab("")+
theme_bw()
Selecting by count
An alternative to the word cloud is a simple bar chart.
g1
N-grams
N-grams are a popular technique in Natural Language Processing (NLP) and computational linguistics. They represent a contiguous sequence of ‘n’ items from a given text or speech. In the context of language modeling, the ‘items’ usually refer to words, but they could also refer to characters, phonemes, syllables, etc.
The ‘n’ in n-gram refers to the number of grouped words. Here are the different types of n-grams:
- A 1-gram (or unigram) is a single word. For example, in the sentence "I love to play soccer", the unigrams would be "I", "love", "to", "play", "soccer".
- A 2-gram (or bigram) is a sequence of two words. Using the same sentence, the bigrams would be: "I love", "love to", "to play", "play soccer".
- A 3-gram (or trigram) is a sequence of three words. For the given sentence, the trigrams would be: "I love to", "love to play", "to play soccer".
- And so on, where an n-gram is a sequence of 'n' words.
The primary use of n-grams in NLP is language modeling and text prediction. For instance, when you type on your smartphone it often suggests the next word; such predictions can be made with n-gram models. N-grams are also used in machine translation, speech recognition, information retrieval, and various other applications.
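Before computing bigrams on the emails, here is a minimal sketch (tidytext is already loaded above; the toy sentence is the example from the list):
toy <- tibble(text = "I love to play soccer")

toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# one row per bigram: "i love", "love to", "to play", "play soccer"
# (lowercased by default)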
bigrams <- msgs2 %>%
select(-thread) %>%
unnest_tokens(
bigram,
msg_body,
token = "ngrams",
n = 2
)
negation_words <- c(
"not",
"no",
"never",
"without",
"don't",
"cannot",
"can't",
"isn't",
"wasn't",
"hadn't",
"couldn't",
"wouldn't",
"won't"
)
modified_stops <- stop_words %>%
filter(!(word %in% negation_words))
refined_bigrams <- bigrams %>%
# the default separator splits on any non-alphanumeric character, so a bigram
# like "don't know" is split into three pieces and the extra piece is
# discarded (hence the warning below)
separate(bigram, c("word1", "word2")) %>%
filter(
!word1 %in% modified_stops$word,
!word2 %in% modified_stops$word
) %>%
mutate(lemm_word1 = lemmatize_words(word1),
lemm_word2 = lemmatize_words(word2))
Warning: Expected 2 pieces. Additional pieces discarded in 472936 rows [21, 22, 63, 64,
68, 69, 71, 72, 235, 236, 238, 334, 335, 347, 348, 354, 355, 380, 381, 411,
...].
refined_bigrams <- refined_bigrams %>%
count(lemm_word1, lemm_word2, sort = T) %>%
unite(bigram, lemm_word1, lemm_word2, sep = " ")
negation_words is a vector of words that express negation. These words are often important in sentiment analysis because they can change the meaning of a phrase substantially.
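As a quick sketch of why they matter, we can reuse the bigrams and negation_words objects from above to see which words most often follow a negation word (here sep = " " keeps contractions such as "don't" intact):
negated_follows <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(
    word1 %in% negation_words,       # keep only bigrams that start with a negation
    !word2 %in% stop_words$word      # drop bigrams whose second word is a stop word
  ) %>%
  count(word1, word2, sort = TRUE)

head(negated_follows)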
Sentiment analysis
Sentiment analysis is a technique to assess the sentiment of a given text. Sentiment can be a binary value (positive or negative) or a score on a spectrum (for example, -1 to +1).
Sentiment analyses rely on references that associate words with emotions, often known as word-emotion association lexicons. For example, "grateful" is positive and "hateful" is negative. A lexicon can also represent the strength of each word's association numerically.
In the following code, we join our data frame of words with a lexicon and count the number of words associated with each sentiment in each message. The goal is not to calculate an exact sentiment value for each text, but to find a meaningful relative proxy for sentiment.
library(textdata)
nrc_sentiment <- get_sentiments('nrc')
body_tokens %>%
select(word, msg_id) %>%
mutate(lemm_word = lemmatize_words(word)) %>%
inner_join(nrc_sentiment, by=c('lemm_word' = 'word')) %>%
group_by(msg_id, sentiment) %>%
summarize(count = n()) %>%
mutate(freq = count/sum(count)) %>%
pivot_wider(id_cols = msg_id, values_from=freq, names_from=sentiment, values_fill = 0) %>%
top_n(5)
Warning in inner_join(., nrc_sentiment, by = c(lemm_word = "word")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 8 of `x` matches multiple rows in `y`.
ℹ Row 6664 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
`summarise()` has grouped output by 'msg_id'. You can override using the
`.groups` argument.
Selecting by surprise
# A tibble: 44,609 × 11
# Groups: msg_id [44,609]
msg_id anticipation disgust joy negative positive trust sadness anger
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 00002e86… 0.182 0.0303 0.152 0.121 0.303 0.212 0 0
2 0002eb25… 0.147 0 0.147 0.0588 0.471 0.147 0.0294 0
3 00039c32… 0.222 0.0222 0.0444 0.0889 0.311 0.111 0.0444 0.0222
4 0008861e… 0.149 0 0.0851 0.213 0.213 0.149 0.0638 0.0213
5 0008ae89… 0.0648 0.0278 0.0648 0.0833 0.361 0.231 0.0463 0.0463
6 000998df… 0.126 0.0265 0.0795 0.106 0.311 0.139 0.0464 0.0795
7 000ba801… 0.126 0 0.134 0.0756 0.345 0.160 0.0924 0
8 000fb81e… 0.2 0 0.1 0.0333 0.467 0.0667 0.0333 0.0333
9 000fc05c… 0.0588 0 0.118 0.176 0.235 0.176 0.118 0.118
10 00116fac… 0.128 0 0.128 0.0851 0.362 0.213 0 0
# ℹ 44,599 more rows
# ℹ 2 more variables: fear <dbl>, surprise <dbl>
We can also calculate the sentiment of each author on particular dates using the sentimentr package. The scores are polarity values: negative scores indicate negative sentiment and positive scores positive sentiment, with most values falling roughly between -1 and +1.
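Before applying this to the full dataset, a quick toy example (the sentences are made up for illustration) shows the kind of polarity scores sentimentr produces:
library(sentimentr)

toy_sentences <- get_sentences(c(
  "I love this community.",
  "This was not a good meeting."
))

sentiment(toy_sentences)
# sentimentr accounts for valence shifters such as "not", so the second
# sentence should receive a negative polarity score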
library(sentimentr)
msgs2 %>%
top_n(50) %>%
get_sentences() %>%
sentiment_by(by=c('date', 'author')) %>%
top_n(20)
Selecting by msg_id
Selecting by ave_sentiment
date author word_count sd
1: 1998-05-19 20:43:00 judith linton 234 0.2584337
2: 1999-07-26 00:00:00 graham meltzer 896 0.2697488
3: 1999-09-21 04:00:00 michael mcintyre 208 0.3581448
4: 2000-02-27 05:00:00 heidinys 110 0.2315481
5: 2002-01-26 05:00:00 dave crawford 1532 0.1919800
6: 2002-07-21 04:00:00 jim snyder grant 285 0.2612569
7: 2003-10-30 05:00:00 braford 276 0.1926719
8: 2008-05-12 04:00:00 sharon villines 265 0.1884307
9: 2010-01-21 05:00:00 alison etter 483 0.2133545
10: 2010-01-22 05:00:00 ann zabaldo 416 0.3644365
11: 2010-03-01 05:00:00 norman gauss 248 0.3178867
12: 2011-02-02 05:00:00 grace kim 528 0.2843765
13: 2012-08-27 04:00:00 marieke hensel 241 0.2577074
14: 2012-12-03 05:00:00 ruth hirsch 243 0.3530117
15: 2013-06-03 04:00:00 rob stewart 84 0.4922629
16: 2014-03-10 04:00:00 fern selzer 112 0.2807072
17: 2014-03-31 04:00:00 alan goldblatt 204 0.3660344
18: 2014-09-12 04:00:00 r n johnson 137 0.4674799
19: 2015-03-07 05:00:00 pen sand jim o connor 151 0.2672172
20: 2018-07-17 04:00:00 philip dowds 1044 0.3139954
ave_sentiment
1: 0.2266742
2: 0.2106340
3: 0.3379923
4: 0.2363161
5: 0.1830011
6: 0.3952517
7: 0.2676510
8: 0.2483870
9: 0.1936303
10: 0.4017483
11: 0.2711444
12: 0.3995262
13: 0.2295879
14: 0.2984803
15: 0.4654042
16: 0.4347890
17: 0.2673774
18: 0.2199707
19: 0.4174141
20: 0.2069616
Sentiment analysis can be useful for gauging public opinion, market trends, and consumer preferences. However, it also has several limitations: it struggles with linguistic nuances such as sarcasm and irony, with cultural differences, with subjectivity, and with emotional depth.
The analyst should be aware of both the strengths and the limitations.