The simplest way to use spaCy in R is with the spacyr package, which needs a working link between Python and R. The easiest route I know is reticulate, the R package that bridges RStudio and Python; pair it with Anaconda, the popular Python data science distribution, and most of the virtual environment headaches are solved for you. For brevity’s sake, we’ll assume that Anaconda has already been installed on your machine.
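In fact, spacyr can handle most of the plumbing itself. A minimal one-off setup sketch (not run here, and assuming you are happy for spacy_install() to create its own conda environment) looks something like this:
#One-off setup: spacy_install() creates a dedicated conda environment
#and installs spaCy into it; spacy_download_langmodel() fetches the
#Spanish model we initialise below.
install.packages("spacyr")
spacyr::spacy_install()
spacyr::spacy_download_langmodel("es_core_news_sm")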
To run spacyr we need to initialise a language model with the spacy_initialize() function. Running spaCy is memory intensive, so when you are finished remember to call spacy_finalize().
library(spacyr)
library(quanteda) #Many of spacyr's functions are strengthened by (and in some cases depend upon) quanteda, another R package/framework for dealing with text data
library(tidyverse)
library(ParseR)
#Initialising Spacy with a Spanish language model.
spacy_initialize("es_core_news_sm")
#get some data
df <- readr::read_csv("~/Google Drive/My Drive/Data Science Project Work/Sandbox/Danone/danone_mentions.csv")
#we take a non-random sample of the data set for size/speed & clean the column names:
df <- df[100:500,] %>%
  janitor::clean_names()
#Keep columns with less than 80% NAs
df <- df[colSums(is.na(df)) < nrow(df) * 0.8]
names(df) #text variable is mention_content
## [1] "id" "date" "time"
## [4] "media_type" "site_name" "site_domain"
## [7] "mention_url" "publisher_name" "publisher_username"
## [10] "title" "mention_content" "topics"
## [13] "subtopics" "sentiment" "country"
## [16] "state" "city" "language"
## [19] "potential_reach" "synthesio_rank"
We’ve got some data and we’ve initialised spacyr with a language model, so now what? One function we may be interested in is spacy_tokenize(). When working with spacyr we need to be sure that we are working on a TIF-compliant (text interchange format) corpus, a quanteda corpus, or, in our case, a simple character vector (our text variable).
sentences_df <- spacy_tokenize(df$mention_content,
                               what = "sentence", #Could use "word" instead
                               output = "data.frame") #Could be a list too
as_tibble(sentences_df)
## # A tibble: 1,952 x 2
## doc_id token
## <chr> <chr>
## 1 text1 Escarnio, diría yo.
## 2 text1 RT @RevolucionadoZ: La Casa Real tuvo a su disposición en 2018 un tot…
## 3 text1 Mientras tanto millones de familias no pueden gastar ni 50€ al mes en…
## 4 text1 Esto se llama violencia.
## 5 text1 #LárgateBorbón
## 6 text2 Laboratorios Viñas colabora con Amref Salud África en la mejora de lo…
## 7 text3 Recomendaciones sobre las comidas preparadas en casa para bebés y men…
## 8 text3 👇 RT @USDAFoodSafe_es: ¿Preparas comida para bebés fresca todos los …
## 9 text3 Sepa cuánto tiempo debe almacenar esta comida para mantenerla libre d…
## 10 text4 La UNESCO también implicada en la salud de la infancia y de las escue…
## # … with 1,942 more rows
Personally, I don’t find the tokenizer especially useful; we can get tokens from tidytext and other packages more quickly and more accurately, but it’s worth noting its existence. So what do we want spacyr for?
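For comparison, a word-level tokenisation with tidytext is a one-liner (a quick sketch, assuming tidytext is installed; unnest_tokens() lower-cases by default):
library(tidytext)
df %>%
  select(id, mention_content) %>%
  unnest_tokens(word, mention_content) #one row per word token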
The main function that we will be using is spacy_parse(), which takes several arguments. The most important for our needs are pos (part of speech) and lemma.
Other potentially useful arguments are entity, dependency and nounphrase. However, I do not find them to be as accessible or as accurate as pos and lemma, and parsing with spaCy can take a long time to run on larger data sets.
parsed_text <- spacy_parse(df$mention_content,
                           lemma = TRUE,
                           pos = TRUE,
                           entity = FALSE)
## Warning in spacy_parse.character(df$mention_content, lemma = TRUE, pos = TRUE, :
## lemmatization may not work properly in model 'es_core_news_sm'
parsed_text <- as_tibble(parsed_text)
parsed_text
## # A tibble: 45,315 x 6
## doc_id sentence_id token_id token lemma pos
## <chr> <int> <int> <chr> <chr> <chr>
## 1 text1 1 1 Escarnio Escarnio PROPN
## 2 text1 1 2 , , PUNCT
## 3 text1 1 3 diría decir VERB
## 4 text1 1 4 yo yo PRON
## 5 text1 1 5 . . PUNCT
## 6 text1 2 1 RT RT PROPN
## 7 text1 2 2 @RevolucionadoZ @revolucionadoz NOUN
## 8 text1 2 3 : : PUNCT
## 9 text1 2 4 La el DET
## 10 text1 2 5 Casa Casa PROPN
## # … with 45,305 more rows
view(parsed_text) #view our data in a separate pane in RStudio
Now that we have our parsed_text object, which tokenises our text variable and keeps a doc_id (for summarising or joining later) alongside token_id, token, lemma and pos columns, we have some interesting options.
We could group_by() the doc_id column, then summarise() with str_c() to create a new text variable of lemmatised posts, which, provided no rows have been dropped along the way, we could add back to our original data frame with a mutate() call.
Alternatively, we can investigate parsed_text directly: we could count lemma for individual word frequencies, or use our pos tags to inspect certain aspects of language. Most likely we will want to look at adjectives, nouns or adverbs. So let’s have a look…
parsed_text %>%
  filter(pos == "ADJ") %>%
  count(lemma, sort = TRUE)
## # A tibble: 916 x 2
## lemma n
## <chr> <int>
## 1 # 34
## 2 buen 33
## 3 falso 30
## 4 saludable 30
## 5 mejor 29
## 6 diferente 28
## 7 ignorante 28
## 8 dispuesto 27
## 9 emergente 27
## 10 fundado 27
## # … with 906 more rows
Let’s make a quick bar chart of our adjective lemmas and their frequencies:
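One way to draw it with ggplot2 (a minimal sketch: slice_max() keeps the ten most frequent lemmas, and the exact counts will depend on your data):
parsed_text %>%
  filter(pos == "ADJ") %>%
  count(lemma, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(lemma, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Adjective lemma", y = "Frequency")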
We could run a similar analysis for nouns, and adverbs etc:
parsed_text %>%
  filter(pos == "NOUN") %>%
  count(lemma, sort = TRUE)
We can combine parts of speech for different analyses and EDA, but I think we get the picture. What else can we do?
Let’s paste our lemmatised tokens back into sentences & create a new column with the parts of speech of our lemmatised sentences separated by commas:
#Important step: clean the doc_id column (and convert it to integer) so rows line up for summarisation & pasting
parsed_text <- parsed_text %>%
  mutate(doc_id = str_replace_all(doc_id, "text", "")) %>%
  mutate(doc_id = as.integer(doc_id))
tmp <- parsed_text %>%
  group_by(doc_id) %>%
  summarise(text = stringr::str_c(lemma, collapse = " "),
            pos = str_c(pos, collapse = ", "))
#Quick check that the dimensions stack up before we try to edit our original data frame.
nrow(tmp) == nrow(df)
## [1] TRUE
df <- df %>%
  mutate(lemma = tmp$text,
         pos = tmp$pos)
rm(tmp)
Now if we want to do some EDA, we could clean up our text variable some more: remove stop words, punctuation, excess spaces, numbers and so on. (We could also have filtered our tmp variable to exclude ‘PUNCT’ rows etc…)
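A sketch of what that clean-up might look like (not run here, which is why the bigram network below is still messy; the stopwords package’s Spanish list is just one option among several):
#Not run: strip punctuation & numbers, squish whitespace, lower-case,
#then drop Spanish stop words (here via the stopwords package)
stops <- stopwords::stopwords("es")
df_clean <- df %>%
  mutate(lemma = lemma %>%
           str_remove_all("[[:punct:]]|[[:digit:]]") %>%
           str_squish() %>%
           str_to_lower(),
         lemma = map_chr(str_split(lemma, " "),
                         ~ str_c(.x[!.x %in% stops], collapse = " ")))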
We could then move on to bigram networks and the rest of the ParseR stack (this bigram is totally useless as we haven’t carried out the standard pre-processing steps, but we get the gist):
df %>%
  count_ngram(text_var = lemma, n = 2, top_n = 50, min_freq = 5) %>%
  pluck("viz") %>%
  viz_ngram()
Let’s take a quick look at the entity recogniser:
entities <- spacy_parse(df$mention_content, pos = FALSE, tag = FALSE,
                        lemma = FALSE, entity = TRUE, dependency = FALSE,
                        nounphrase = TRUE)
We can see pretty quickly that spacyr’s named entity recogniser by itself is neither quick to use nor accurate; it is heavily dependent on capitalisation. (I think it needs to be combined with noun phrases and dependencies, but that is beyond our scope here…)
entities %>%
  filter(nchar(entity) > 0) %>%
  count(token, entity, sort = TRUE) %>%
  top_n(20)
## Selecting by n
## token entity n
## 1 A ORG_I 219
## 2 EL MISC_B 108
## 3 ES MISC_I 108
## 4 te MISC_I 100
## 5 en MISC_I 94
## 6 casa MISC_I 91
## 7 Negacionista LOC_B 88
## 8 Quédate MISC_B 88
## 9 Mientras MISC_B 87
## 10 ríes MISC_I 87
## 11 tú MISC_I 87
## 12 TU MISC_B 83
## 13 ENEMIGO MISC_B 82
## 14 EL MISC_I 81
## 15 ENEMIGO MISC_I 81
## 16 GENTE MISC_B 81
## 17 NI ORG_B 81
## 18 NO ORG_I 81
## 19 SE ORG_B 81
## 20 de MISC_I 69
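If you do want to take entities further, spacyr ships helpers that tidy those B/I tags for you. A quick sketch with entity_extract() (there is also entity_consolidate(), which merges multi-token entities back into the parsed tokens):
entities %>%
  entity_extract(type = "all") %>% #"named" and "extended" are the other options
  count(entity, entity_type, sort = TRUE)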
spacy_finalize()