The simplest way to use spaCy in R is with the spacyr package, which needs a working link between Python and R. The easiest route I know is reticulate, the R package that bridges RStudio and Python; pair it with Anaconda, the popular Python data science distribution, and most of the virtual environment headaches are solved for you. For brevity’s sake, we’ll assume that Anaconda has already been installed on your machine.
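In fact, spacyr can handle most of the plumbing itself. A minimal one-off setup sketch (not run here, and assuming you are happy for spacy_install() to create its own conda environment) looks something like this:
#One-off setup: spacy_install() creates a dedicated conda environment
#and installs spaCy into it; spacy_download_langmodel() fetches the
#Spanish model we initialise below.
install.packages("spacyr")
spacyr::spacy_install()
spacyr::spacy_download_langmodel("es_core_news_sm")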
To run spacyr we need to initialise a language model with the spacy_initialize() function. Running spaCy is memory intensive, so when you are finished remember to call spacy_finalize().
library(spacyr)
library(quanteda) #Many of spacyr's functions are strengthened by (and in some cases depend upon) quanteda, another R package/framework for dealing with text data
library(tidyverse)
library(ParseR)
#Initialising Spacy with a Spanish language model.
spacy_initialize("es_core_news_sm")
#get some data
df <- readr::read_csv("~/Google Drive/My Drive/Data Science Project Work/Sandbox/Danone/danone_mentions.csv")
#we take a non-random sample of the data set for size/speed & clean the column names:
df <- df[100:500,] %>%
  janitor::clean_names()
#Keep columns with less than 80% NAs
df <- df[colSums(is.na(df)) < nrow(df) * 0.8]
names(df) #text variable is mention_content
## [1] "id" "date" "time"
## [4] "media_type" "site_name" "site_domain"
## [7] "mention_url" "publisher_name" "publisher_username"
## [10] "title" "mention_content" "topics"
## [13] "subtopics" "sentiment" "country"
## [16] "state" "city" "language"
## [19] "potential_reach" "synthesio_rank"
We’ve got some data and we’ve initialised spacyr with a language model, so now what? One function we may be interested in is spacy_tokenize(). When working with spacyr we need to be sure that we are working on a TIF-compliant (text interchange format) corpus, a quanteda corpus, or, in our case, a simple character vector (our text variable).
sentences_df <- spacy_tokenize(df$mention_content,
                               what = "sentence", #Could use "word" instead
                               output = "data.frame") #Could be a list too
as_tibble(sentences_df)
## # A tibble: 1,952 x 2
## doc_id token
## <chr> <chr>
## 1 text1 Escarnio, diría yo.
## 2 text1 RT @RevolucionadoZ: La Casa Real tuvo a su disposición en 2018 un tot…
## 3 text1 Mientras tanto millones de familias no pueden gastar ni 50€ al mes en…
## 4 text1 Esto se llama violencia.
## 5 text1 #LárgateBorbón
## 6 text2 Laboratorios Viñas colabora con Amref Salud África en la mejora de lo…
## 7 text3 Recomendaciones sobre las comidas preparadas en casa para bebés y men…
## 8 text3 👇 RT @USDAFoodSafe_es: ¿Preparas comida para bebés fresca todos los …
## 9 text3 Sepa cuánto tiempo debe almacenar esta comida para mantenerla libre d…
## 10 text4 La UNESCO también implicada en la salud de la infancia y de las escue…
## # … with 1,942 more rows
Personally, I don’t find the tokenizer especially useful; we can get tokens from tidytext and other packages more quickly and more accurately, but it’s worth noting its existence. So what do we want spacyr for?
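For comparison, a word-level tokenisation with tidytext is a one-liner (a quick sketch, assuming tidytext is installed; unnest_tokens() lower-cases by default):
library(tidytext)
df %>%
  select(id, mention_content) %>%
  unnest_tokens(word, mention_content) #one row per word token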
The main function that we will be using is spacy_parse(), which takes several arguments. The most important for our needs are pos (part of speech) and lemma.
Other potentially useful arguments are entity, dependency and nounphrase. However, I do not find them to be as accessible or as accurate as pos and lemma, and parsing with spaCy can take a long time to run on larger data sets.
parsed_text <- spacy_parse(df$mention_content,
                           lemma = TRUE,
                           pos = TRUE,
                           entity = FALSE)
## Warning in spacy_parse.character(df$mention_content, lemma = TRUE, pos = TRUE, :
## lemmatization may not work properly in model 'es_core_news_sm'
parsed_text <- as_tibble(parsed_text)
parsed_text
## # A tibble: 45,315 x 6
## doc_id sentence_id token_id token lemma pos
## <chr> <int> <int> <chr> <chr> <chr>
## 1 text1 1 1 Escarnio Escarnio PROPN
## 2 text1 1 2 , , PUNCT
## 3 text1 1 3 diría decir VERB
## 4 text1 1 4 yo yo PRON
## 5 text1 1 5 . . PUNCT
## 6 text1 2 1 RT RT PROPN
## 7 text1 2 2 @RevolucionadoZ @revolucionadoz NOUN
## 8 text1 2 3 : : PUNCT
## 9 text1 2 4 La el DET
## 10 text1 2 5 Casa Casa PROPN
## # … with 45,305 more rows
view(parsed_text) #view our data in a separate pane in RStudio
Now that we have our parsed_text object, which tokenises our text variable and keeps a doc_id (for summarising or joining later) alongside token_id, token, lemma and pos columns, we have some interesting options.
We could group_by() the doc_id column, then summarise() with str_c() to create a new text variable of lemmatised posts, which, provided no rows have been dropped along the way, we could add back to our original data frame with a mutate() call.
Alternatively, we can investigate parsed_text directly: we could count lemma for individual word frequencies, or use our pos tags to inspect certain aspects of language. Most likely we will want to look at adjectives, nouns or adverbs. So let’s have a look…
parsed_text %>%
  filter(pos == "ADJ") %>%
  count(lemma, sort = TRUE)
## # A tibble: 916 x 2
## lemma n
## <chr> <int>
## 1 # 34
## 2 buen 33
## 3 falso 30
## 4 saludable 30
## 5 mejor 29
## 6 diferente 28
## 7 ignorante 28
## 8 dispuesto 27
## 9 emergente 27
## 10 fundado 27
## # … with 906 more rows
Let’s make a quick bar chart of our adjective lemmas and their frequencies:
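One way to draw it with ggplot2 (a minimal sketch: slice_max() keeps the ten most frequent lemmas, and the exact counts will depend on your data):
parsed_text %>%
  filter(pos == "ADJ") %>%
  count(lemma, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(lemma, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Adjective lemma", y = "Frequency")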
We could run a similar analysis for nouns, and adverbs etc:
parsed_text %>%
  filter(pos == "NOUN") %>%
  count(lemma, sort = TRUE)
We can combine parts of speech for different analyses and EDA, but I think we get the picture. What else can we do?
Let’s paste our lemmatised tokens back into sentences & create a new column with the parts of speech of our lemmatised sentences separated by commas:
#Important step: clean the doc_id column (and convert it to integer) so rows line up for summarisation & pasting
parsed_text <- parsed_text %>%
  mutate(doc_id = str_replace_all(doc_id, "text", "")) %>%
  mutate(doc_id = as.integer(doc_id))
tmp <- parsed_text %>%
  group_by(doc_id) %>%
  summarise(text = stringr::str_c(lemma, collapse = " "),
            pos = str_c(pos, collapse = ", "))
#Quick check that the dimensions stack up before we try to edit our original data frame.
nrow(tmp) == nrow(df)
## [1] TRUE
df <- df %>%
  mutate(lemma = tmp$text,
         pos = tmp$pos)
rm(tmp)
Now if we want to do some EDA, we could clean up our text variable some more: remove stop words, punctuation, excess spaces, numbers and so on. (We could also have filtered our tmp variable to exclude ‘PUNCT’ rows etc…)
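A sketch of what that clean-up might look like (not run here, which is why the bigram network below is still messy; the stopwords package’s Spanish list is just one option among several):
#Not run: strip punctuation & numbers, squish whitespace, lower-case,
#then drop Spanish stop words (here via the stopwords package)
stops <- stopwords::stopwords("es")
df_clean <- df %>%
  mutate(lemma = lemma %>%
           str_remove_all("[[:punct:]]|[[:digit:]]") %>%
           str_squish() %>%
           str_to_lower(),
         lemma = map_chr(str_split(lemma, " "),
                         ~ str_c(.x[!.x %in% stops], collapse = " ")))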
We could then move on to bigram networks and the rest of the ParseR stack (this bigram is totally useless as we haven’t carried out the standard pre-processing steps, but we get the gist):
df %>%
  count_ngram(text_var = lemma, n = 2, top_n = 50, min_freq = 5) %>%
  pluck("viz") %>%
  viz_ngram()
Let’s take a quick look at the entity recogniser:
entities <- spacy_parse(df$mention_content, pos = FALSE, tag = FALSE,
                        lemma = FALSE, entity = TRUE, dependency = FALSE,
                        nounphrase = TRUE)
We can see pretty quickly that spacyr’s named entity recogniser by itself is neither quick to use nor accurate; it is heavily dependent on capitalisation. (I think it needs to be combined with noun phrases and dependencies, but that is beyond our scope here…)
entities %>%
  filter(nchar(entity) > 0) %>%
  count(token, entity, sort = TRUE) %>%
  top_n(20)
## Selecting by n
## token entity n
## 1 A ORG_I 219
## 2 EL MISC_B 108
## 3 ES MISC_I 108
## 4 te MISC_I 100
## 5 en MISC_I 94
## 6 casa MISC_I 91
## 7 Negacionista LOC_B 88
## 8 Quédate MISC_B 88
## 9 Mientras MISC_B 87
## 10 ríes MISC_I 87
## 11 tú MISC_I 87
## 12 TU MISC_B 83
## 13 ENEMIGO MISC_B 82
## 14 EL MISC_I 81
## 15 ENEMIGO MISC_I 81
## 16 GENTE MISC_B 81
## 17 NI ORG_B 81
## 18 NO ORG_I 81
## 19 SE ORG_B 81
## 20 de MISC_I 69
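If you do want to take entities further, spacyr ships helpers that tidy those B/I tags for you. A quick sketch with entity_extract() (there is also entity_consolidate(), which merges multi-token entities back into the parsed tokens):
entities %>%
  entity_extract(type = "all") %>% #"named" and "extended" are the other options
  count(entity, entity_type, sort = TRUE)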
spacy_finalize()