---
title: 'Natural Language Features: Chapter 3 Stop Words'
output:
  html_document:
    css: style_7.css
    df_print: paged
    theme: flatly
    highlight: breezedark
    toc: yes
    toc_float: yes
    code_download: yes
---

**Stop words**: Common words that carry little (or perhaps no) meaningful information are called *stop words*. It is common advice and practice to remove stop words for various NLP tasks, but the task of stop word removal is more nuanced than many resources may lead you to believe.

```{r setup}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

```

```{r}
# Install pacman only if it is not already available
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")

pacman::p_load(tidyverse, tidytext, tokenizers, hcandersenr, here, stopwords)
```

## **3.1 Using premade stop word lists**

A quick option for using stop words is to get a list that has already been created. You should always inspect and verify the list you are using, both to make sure it hasn't changed since you used it last, and also to check that it is appropriate for your use case.

> The **stopwords** package contains a comprehensive collection of stop word lists in one place for ease of use in analysis and other packages.

Before we start delving into the content inside the lists, let's take a look at how many words are included in each.

```{r}
library(stopwords)
stopwords_getsources()

# Number of words
length(stopwords(source = "snowball"))
length(stopwords(source = "smart"))
length(stopwords(source = "stopwords-iso"))
```

The lengths of these lists are quite different, with the longest list being over seven times longer than the shortest!

-   Interpret **Upset plot**

```{r}
# Finding words in snowball but not in smart list
setdiff(x = stopwords(source = "snowball"),
        y = stopwords(source = "smart"))
```

All these words are contractions. This **is *not* because** the SMART lexicon doesn't include contractions; if we look, there are almost 50 of them.

> **str_detect()**: returns TRUE/FALSE for each string, which is handy inside `dplyr::filter()`
>
> **str_subset()**: returns the strings that match the pattern

```{r}
# Contractions in smart lexicon
str_subset(string = stopwords(source = "smart"),
           pattern = "'")
```

We seem to have stumbled upon an inconsistency: why does SMART include `"he's"` but not `"she's"` 🤷?

This is once again a reminder that we should always look carefully at any premade word list or other artifact we use, to make sure it fits our needs.

> **It is perfectly acceptable to start with a premade word list and remove or append additional words according to your particular use case.**
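
As a minimal sketch of that advice, the chunk below patches the SMART list with the contractions that Snowball has but SMART lacks (the gap found above). The object names are ours, and dropping `"little"` is only an illustrative assumption; whether any of these changes suit your analysis is something to check against your own data.

```{r}
library(stopwords)

snowball_words <- stopwords(source = "snowball")
smart_words <- stopwords(source = "smart")

# Words present in Snowball but absent from SMART (e.g. "she's")
missing_from_smart <- setdiff(snowball_words, smart_words)

# Append them; union() keeps the result free of duplicates
smart_patched <- union(smart_words, missing_from_smart)

# Removing a word works the same way: setdiff() drops it if it is present
smart_without_little <- setdiff(smart_patched, "little")
```

Because the lists are plain character vectors, `union()` and `setdiff()` are enough for most adjustments.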

### 3.1.1 Stop word removal in R

Now that we have seen stop word lists, we can move forward with removing these words. The particular way we remove stop words depends on the **shape of our data**.

-   If you have your text in a tidy format with one word per row and the stop words as a vector, you can use `filter()` from **dplyr** with a negated `%in%`.

-   If the stop words are in a `tibble()`, you can use `anti_join()` from **dplyr** instead.

Like in our previous chapter, let's examine the text of "The Fir-Tree" by Hans Christian Andersen, and use **tidytext** to tokenize the text into words.

```{r}
library(hcandersenr)
library(tidytext)
library(tidyverse)

# Tibble of the fir tree book
fir_tree <- hca_fairytales() %>% 
  filter(book == "The fir tree", language == "English")

# Tokenize into words
tidy_fir_tree <- fir_tree %>% 
  unnest_tokens(input = text, output = word, token = "words")


```

Let's use the Snowball stop word list as an example. Since `stopwords()` returns the stop words as a vector, we will use `filter()`.

```{r}
# Remove stop words using filter
tidy_fir_tree %>% 
  filter(!(word %in% stopwords(source = "snowball"))) %>% 
  slice_head(n = 10)
```

If we use the `get_stopwords()` function from **tidytext** instead, then we can use the `anti_join()` function.

> **`anti_join()` returns all rows from `x` without a match in `y`.**
>
> "Filtering" joins such as `anti_join()` keep cases from the left-hand side (LHS).

```{r}
# Remove stop words using anti_join since tidytext returns a tibble
tidy_fir_tree %>% 
  anti_join(get_stopwords(source = "snowball"), by = "word") %>% 
  slice_head(n = 10)
```
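
The two approaches should remove the same rows. The sketch below checks this on a small made-up tibble (`toy_tokens` is a hypothetical stand-in for `tidy_fir_tree`), so it can be run without the fairy tale data.

```{r}
library(dplyr)
library(tidytext)
library(stopwords)

# Hypothetical one-word-per-row data standing in for tidy_fir_tree
toy_tokens <- tibble(word = c("the", "fir", "tree", "was", "very", "little"))

# Approach 1: stop words as a character vector
by_filter <- toy_tokens %>%
  filter(!(word %in% stopwords(source = "snowball")))

# Approach 2: stop words as a tibble with a `word` column
by_anti_join <- toy_tokens %>%
  anti_join(get_stopwords(source = "snowball"), by = "word")

# Compare the words that survive each approach
identical(by_filter$word, by_anti_join$word)
```

Which one to use is mostly a matter of what shape your stop word list already has.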

## **3.2 Creating your own stop words list**

Another way to get a stop word list is to create one yourself. Let's explore a few different ways to find appropriate words to use. We will use the tokenized data from "The Fir-Tree" as a first example. Let's take the words and rank them by their count or frequency.

```{r}
# Most frequent words in the fir tree
tidy_fir_tree %>% 
  count(word, sort = TRUE) %>% 
  slice_head(n = 23)
```

We recognize many of what we would consider stop words in the first column here, with three big exceptions. We see `"tree"` at 3, `"fir"` at 12, and `"little"` at 22. These words appear high on our list, but they do provide valuable information as they all reference the main character. What went wrong with this approach?

> -   **Creating a stop word list using high-frequency words works best when it is created on a corpus of documents, not an individual document. This is because the words found in a single document will be document-specific and the overall pattern of words will not generalize that well.**
>
> -   The word `"tree"` does seem important as it is about the main character, but it could also be appearing so often that it stops providing any information.

Let's try a different approach, extracting high-frequency words from the **corpus of *all* English fairy tales by H.C. Andersen.**

```{r}
hca_fairytales() %>% 
  filter(language == "English") %>% 
  unnest_tokens(output = word, input = text) %>% 
  count(word, sort = TRUE) %>% 
  slice_head(n = 25)
```

This list is more appropriate for our concept of stop words, and now it is time for us to make some choices.

Which words should we add and/or remove based on prior information? Selecting the number of words to remove is best done on a case-by-case basis, as it can be difficult to determine a priori how many different "meaningless" words appear in a corpus. Our suggestion is to start with a low number like 20 and increase by 10 words until you get to words that are not appropriate as stop words for your analytical purpose.
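
One way to work through that "start small, then widen" routine is sketched below. The counts in `toy_counts` are invented, the helper `top_words()` is hypothetical, and the cut-offs are scaled down (2 and 4 instead of 20 and 30) so the toy data stays readable.

```{r}
library(dplyr)

# Hypothetical token counts standing in for a real corpus
toy_counts <- tibble(
  word = c("the", "and", "of", "a", "tree", "little"),
  n    = c(120, 95, 80, 76, 40, 12)
)

# Return the n_top most frequent words as candidate stop words
top_words <- function(counts, n_top) {
  counts %>%
    arrange(desc(n)) %>%
    slice_head(n = n_top) %>%
    pull(word)
}

# Start small, then widen the cut and inspect what each step adds
first_cut   <- top_words(toy_counts, n_top = 2)
second_cut  <- top_words(toy_counts, n_top = 4)
newly_added <- setdiff(second_cut, first_cut)
```

Looking only at `newly_added` after each step makes it easier to spot the point where informative words start to appear.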

It is worth keeping in mind that such a list is not perfect. Depending on how your text was generated or processed, strange tokens can surface as possible stop words due to encoding or optical character recognition errors. Further, these results are based on the corpus of documents we have available, which is potentially biased. In our example here, all the fairy tales were written by the same European white man from the early 1800s.

**Another approach is to use the [*inverse document frequency*](https://www.tidytextmining.com/tfidf.html) (IDF) of each word.** The IDF of a word is a quantity that is low for words used in many documents of a collection and high for words used in few of them. It is typically defined as

$$
\operatorname{idf}(\text{term}) = \ln\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)
$$

If the word "dog" appears in 4 out of 100 documents, it has `idf("dog") = log(100/4) = 3.22`; if the word "cat" appears in 99 out of 100 documents, it has `idf("cat") = log(100/99) = 0.01`. Notice how the IDF goes toward zero the more documents a term is contained in; a term that appears in every document has an IDF of exactly 0, since `log(100/100) = log(1) = 0`. What happens if we create a stop word list based on words with the lowest IDF? The following code takes a tokenized data frame and computes the IDF of each word, returning a data frame with one column for the word and one for its IDF.
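
The arithmetic for "dog" and "cat" can be reproduced directly. The helper `idf()` below is a hypothetical one-liner written for this illustration, not a function from any of the loaded packages.

```{r}
# IDF from the number of documents and the number containing the term
idf <- function(n_documents, n_containing) {
  log(n_documents / n_containing)  # log() is the natural logarithm in R
}

round(idf(100, 4), 2)    # "dog": 3.22
round(idf(100, 99), 2)   # "cat": 0.01
idf(100, 100)            # term in every document: 0
```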

```{r}
#library(rlang)
df <- hca_fairytales() %>% 
  filter(language == "English") %>% 
  unnest_tokens(output = word, input = text) 



# Unique words in the whole corpus
words <- df %>%
  pull(word) %>% 
  unique()

# Number of books in the corpus
n_docs <- df %>% 
  pull(book) %>% 
  unique() %>% 
  length()

# Unique words in the corpus found in each book
n_words <- df %>% 
  nest(data = c(word)) %>% 
  pull(data) %>% 
  map_dfc(~ words %in% unique(pull(.x, word))) %>% 
  rowSums()

tibble(word = words,
       idf = log(n_docs / n_words)) %>% 
  arrange(idf) %>% 
  slice_head(n = 25) 

```

This can be wrapped in a function:

```{r}
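# Compute the IDF of every word in a tokenized data frame.
# `word` and `document` are the unquoted columns holding tokens and document ids.
calc_idf <- function(df, word, document) {
  # Unique words in the whole corpus
  words <- df %>% pull({{word}}) %>% unique()
  # Number of documents in the corpus
  n_docs <- length(unique(pull(df, {{document}})))
  # Number of documents each word appears in
  n_words <- df %>%
    nest(data = c({{word}})) %>%
    pull(data) %>%
    map_dfc(~ words %in% unique(pull(.x, {{word}}))) %>%
    rowSums()
  
  tibble(word = words,
         idf = log(n_docs / n_words))
}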
```

> `{{ }}` (the embrace operator from **rlang**) lets a user-defined function pass a column name given as an argument on to tidyverse functions, so `{{word}}` and `{{document}}` refer to columns of `df`.
>
> See: <https://www.tidyverse.org/blog/2019/06/rlang-0-4-0/#a-simpler-interpolation-pattern-with->

This time we get better results. The list starts with "a," "the," "and," and "to" and continues with many more reasonable choices of stop words. We need to look at these results manually to turn this into a list. We need to go as far down in rank as we are comfortable with. You as a data practitioner are in full control of how you want to create the list. If you don't want to include "little" you are still able to add "are" to your list even though it is lower on the list.
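
For comparison, the same quantity can be sketched with a grouped count instead of nesting. This is an alternative formulation rather than the approach used above; `toy_corpus` is made-up data, and the column names `book` and `word` are assumed to match the tokenized fairy tale data.

```{r}
library(dplyr)

# Hypothetical tokenized corpus: one row per word occurrence
toy_corpus <- tibble(
  book = c("A", "A", "A", "B", "B", "C", "C"),
  word = c("the", "tree", "the", "the", "duck", "the", "tree")
)

# Number of documents in the corpus
n_docs <- n_distinct(toy_corpus$book)

toy_idf <- toy_corpus %>%
  distinct(book, word) %>%            # keep each word once per book
  count(word, name = "n_books") %>%   # number of books containing the word
  mutate(idf = log(n_docs / n_books)) %>%
  arrange(idf)
```

Here "the" appears in all three toy books and so gets an IDF of 0, which puts it first in the ranking.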

## **3.3 All stop word lists are context-specific**

**Context is important in text modeling**, so make sure that the stop word lexicon you use reflects the word space in which you plan to use it. One common concern is how pronouns bring information to your text. Pronouns are included in many stop word lists (although inconsistently), but they will often *not* be noise in text data.

On the other hand, sometimes you will have to adjust the list yourself, depending on the domain. If you are working with texts for dessert recipes, certain ingredients (sugar, eggs, water) and actions (whisking, baking, stirring) may be frequent enough to pass your stop word threshold, but you may want to keep them as they may be informative. Throwing away "eggs" as a common word would make it harder, or downright impossible, to determine whether certain recipes are vegan. Whisking and stirring, by contrast, may be fine to remove, as distinguishing between recipes that do and don't require a whisk might not matter much.
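
The recipe example might look like the sketch below. Both word vectors are invented for illustration; in practice `frequent_words` would come from a frequency or IDF ranking of your own corpus.

```{r}
# Hypothetical high-frequency words from a dessert recipe corpus
frequent_words <- c("the", "and", "sugar", "eggs", "water",
                    "whisking", "baking", "stirring")

# Words we judge informative for this analysis despite their frequency
keep_words <- c("sugar", "eggs", "water")

# Everything frequent that we did not decide to keep becomes a stop word
recipe_stop_words <- setdiff(frequent_words, keep_words)
```

Keeping `keep_words` as its own vector documents the domain decisions separately from the frequency threshold.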

## **3.4 What happens when you remove stop words**

-   Larger stop word lists remove more words than shorter stop word lists. In this example with fairy tales, over half of the words are removed, and the largest list removes over 80% of them.

-   Handling misspellings when using premade lists can be done by manually adding common misspellings. You could imagine creating all words that are a certain string distance away from the stop words, but we do not recommend this as you would quickly include informative words this way.

-   One of the downsides of creating your own stop word lists using frequencies is that you are limited to using words that you have already observed. It could happen that "she'd" is included in your training corpus but the word "he'd" did not reach the threshold. This is a case where you need to look at your words and adjust accordingly. Here the large premade stop word lists can serve as inspiration for missing words.

-   Given the right list of words, we see no harm to the model performance, and sometimes find improvement due to noise reduction ([Feldman and Sanger 2007](https://smltar.com/stopwords.html#ref-Feldman2007)).
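
The first point can be checked by computing the share of tokens each list would remove. The sketch below uses a short invented token vector so that it runs on its own; with the data from this document, `tidy_fir_tree$word` could be substituted for `tokens`. The proportions on real text will differ from those on this toy input.

```{r}
library(stopwords)

# Hypothetical tokens; a real analysis would use a tokenized corpus
tokens <- c("the", "tree", "was", "very", "little", "and", "wished",
            "to", "be", "tall")

sources <- c("snowball", "smart", "stopwords-iso")

# Share of tokens that each list would remove
prop_removed <- sapply(sources, function(src) {
  mean(tokens %in% stopwords(source = src))
})

prop_removed
```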

## **3.6 Summary**

In many standard NLP workflows, the removal of stop words is presented as a default or the correct choice without comment. Although removing stop words can improve the accuracy of machine learning models trained on text data, choices around such a step are complex. The content of existing stop word lists varies tremendously, and the available strategies for building your own can have subtle to not-so-subtle effects on your model results.

```{r eval=FALSE, include=FALSE}
# paletteer::paletteer_d("ggsci::category10_d3")
```
