This is the third tutorial for text mining on employees' opinions. See the first one at https://rpubs.com/vicenc/textmining-employees, and the second one at https://rpubs.com/vicenc/sentiment-analysis.
We have a data set with 149 positive and negative employees' opinions on how they feel about the organization where they are working. You can download the data set from https://github.com/vicencfernandez/WorkforceAnalytics. The data set contains three variables: commentID (a numerical identifier for each comment), comment (the text of the opinion), and assessment (whether the opinion is positive or negative).
The activity's objective is to predict the kind of opinion (positive or negative) based on the employee's comment. In this report, the analysis strategy focuses on a word embedding approach.
We need to install and load several packages to analyze our data set with text mining.
library(tidyverse)
library(tidymodels)
library(tidytext)
library(textrecipes)
library(textdata)
library(irlba)
library(widyr)
The tidyverse is an opinionated collection of R packages designed for data science. The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. The tidytext package makes many text mining tasks easier, more effective, and consistent with tidy data principles. The textrecipes package extends the recipes package (part of tidymodels) with text-processing steps. The textdata package includes various sentiment lexicons and labeled text data sets for classification and analysis. The irlba package provides truncated singular value decomposition (SVD) and principal components analysis for large sparse and dense matrices. The widyr package is useful for operations such as co-occurrence counts, correlations, or clustering that are mathematically convenient on wide matrices.
We also load some extra packages for visualization of some figures and tables in this document.
library(gridExtra)
library(knitr)
library(kableExtra)
The gridExtra package provides a number of user-level functions to work with "grid" graphics, notably to arrange multiple grid-based plots on a page and to draw tables. The knitr package supports dynamic report generation in R. Finally, the kableExtra package helps build complex tables with kable and pipe syntax.
The first step is to load the data set and check that everything is right. Instead of using a standard R data.frame, we have decided to use a tibble because this makes it much easier to work with large data.
opinions <- read.csv("employee_opinions.csv", sep = ";") %>% as_tibble()
opinions
## # A tibble: 149 x 3
## commentID comment assessment
## <int> <chr> <chr>
## 1 1 In my 30-year career, I’ve more never been proud and ho… positive
## 2 2 They will have to burn the building down before I will … positive
## 3 3 I’m surrounded by people who want to work and who love … positive
## 4 4 This is the second love of my life. positive
## 5 5 I have been an employee here for 45 years and will stay… positive
## 6 6 For personal reasons I have been forced to seek employm… positive
## 7 7 Happy employees don't go looking for other opportunities positive
## 8 8 These folks walk the walk. Seriously. Truly a company o… positive
## 9 9 Having this job has changed my life positive
## 10 10 I’ve been working for this company for only two years, … positive
## # … with 139 more rows
We can see that the variable \(assessment\) has been defined as a character, but we prefer to define it as a factor. So, let’s change it.
opinions$assessment <- as.factor(opinions$assessment)
opinions
## # A tibble: 149 x 3
## commentID comment assessment
## <int> <chr> <fct>
## 1 1 In my 30-year career, I’ve more never been proud and ho… positive
## 2 2 They will have to burn the building down before I will … positive
## 3 3 I’m surrounded by people who want to work and who love … positive
## 4 4 This is the second love of my life. positive
## 5 5 I have been an employee here for 45 years and will stay… positive
## 6 6 For personal reasons I have been forced to seek employm… positive
## 7 7 Happy employees don't go looking for other opportunities positive
## 8 8 These folks walk the walk. Seriously. Truly a company o… positive
## 9 9 Having this job has changed my life positive
## 10 10 I’ve been working for this company for only two years, … positive
## # … with 139 more rows
Now, our data set is ready for analysis. The first step will be to remove the stop words. Please, see the previous tutorials to know more about tokenization and stop words.
opinion_non_stop <- opinions %>%
unnest_tokens(output = word, input = comment, token = "words") %>%
anti_join(stop_words) %>%
group_by(commentID) %>%
summarise(comment = str_c(word, collapse = " "))
The basic idea behind word embedding is to find similarities between words in one or several documents by using some model to predict the co-occurrence of words within a small set of words; in other words, how often two words appear close to each other. But what do we mean by close? That is something we have to decide (and test). We will use the term 'context window' to refer to the maximum distance, in words, between the word we are analyzing and the words we consider close to it. Let's see an example.
Consider the sentence 'this company is the best place where I have worked in my life'. If the center word is 'place' and the context window (our definition of proximity) is three words, then the close words are 'is', 'the', 'best', 'where', 'I', and 'have'.
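A minimal sketch in base R (not part of the tutorial pipeline) that extracts those context words:
sentence <- "this company is the best place where i have worked in my life"
tokens <- unlist(strsplit(sentence, " "))
center <- which(tokens == "place")  # position of the center word
window <- 3                         # size of the context window
idx <- max(1, center - window):min(length(tokens), center + window)
tokens[setdiff(idx, center)]        # returns: "is" "the" "best" "where" "i" "have"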
There are two approaches for using word embedding techniques: CBOW (Continuous Bag of Words) and the Skip-Gram model.
The CBOW technique tries to predict a specific word based on a set of surrounding words. This approach is very common when implementing a predictive web search.
The Skip-Gram model tries to predict a set of surrounding words based on a specific word. This approach is very common when we want to identify the context of a word. In this document, we focus on this approach.
Most word embedding models are built with neural networks. In this case, we are going to create a simple model with tidytext. As we said before, we need to define our 'context window'. To present the process, we have decided to use a 'context window' with a length of six words.
The first step is to tokenize our data by words, but before that we need to create a variable (called skipgramID) that allows us to identify the opinion and the n-gram each word belongs to.
tidy_skipgrams <- opinion_non_stop %>%
unnest_tokens(ngram, comment, token = "ngrams", n = 6) %>%
filter(ngram != "") %>%
mutate(ngramID = row_number()) %>%
tidyr::unite(skipgramID, commentID, ngramID) %>%
unnest_tokens(word, ngram)
tidy_skipgrams
## # A tibble: 1,620 x 2
## skipgramID word
## <chr> <chr>
## 1 1_1 30
## 2 1_1 career
## 3 1_1 i’ve
## 4 1_1 proud
## 5 1_1 honored
## 6 1_1 carry
## 7 1_2 career
## 8 1_2 i’ve
## 9 1_2 proud
## 10 1_2 honored
## # … with 1,610 more rows
Now, we need to calculate the probabilities for the unigrams (how often we find each word in the original text) and for the skipgrams (how often we find each word next to every other word within the context window).
unigram_probs <- opinion_non_stop %>%
unnest_tokens(word, comment) %>%
count(word, sort = TRUE) %>%
mutate(p = n / sum(n))
unigram_probs
## # A tibble: 526 x 3
## word n p
## <chr> <int> <dbl>
## 1 company 32 0.0366
## 2 people 18 0.0206
## 3 employees 17 0.0194
## 4 management 13 0.0149
## 5 it’s 11 0.0126
## 6 feel 10 0.0114
## 7 job 10 0.0114
## 8 i’m 9 0.0103
## 9 time 9 0.0103
## 10 care 7 0.008
## # … with 516 more rows
skipgram_probs <- tidy_skipgrams %>%
pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>%
mutate(p = n / sum(n))
skipgram_probs
## # A tibble: 3,903 x 4
## item1 item2 n p
## <chr> <chr> <dbl> <dbl>
## 1 company company 49 0.00522
## 2 employees employees 31 0.00330
## 3 people people 30 0.00320
## 4 time time 30 0.00320
## 5 management management 24 0.00256
## 6 team team 21 0.00224
## 7 finding finding 18 0.00192
## 8 leave leave 14 0.00149
## 9 we’re we’re 14 0.00149
## 10 i’m i’m 13 0.00139
## # … with 3,893 more rows
Finally, we normalize the skipgram probabilities based on the unigram probabilities.
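Formally, for each pair of words \(w_1\) and \(w_2\), we compute \(p_{together}(w_1, w_2) = \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}\), where \(p(w_1, w_2)\) is the skipgram probability and \(p(w_1)\) and \(p(w_2)\) are the unigram probabilities. Its base-10 logarithm is the pointwise mutual information (PMI) value that we will later cast into a sparse matrix.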
normalized_prob <- skipgram_probs %>%
filter(n > 3) %>%
rename(word1 = item1, word2 = item2) %>%
left_join(unigram_probs %>%
select(word1 = word, p1 = p),
by = "word1"
) %>%
left_join(unigram_probs %>%
select(word2 = word, p2 = p),
by = "word2"
) %>%
mutate(p_together = p / p1 / p2)
normalized_prob
## # A tibble: 727 x 7
## word1 word2 n p p1 p2 p_together
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 company company 49 0.00522 0.0366 0.0366 3.90
## 2 employees employees 31 0.00330 0.0194 0.0194 8.75
## 3 people people 30 0.00320 0.0206 0.0206 7.56
## 4 time time 30 0.00320 0.0103 0.0103 30.2
## 5 management management 24 0.00256 0.0149 0.0149 11.6
## 6 team team 21 0.00224 0.00571 0.00571 68.5
## 7 finding finding 18 0.00192 0.00571 0.00571 58.7
## 8 leave leave 14 0.00149 0.00686 0.00686 31.7
## 9 we’re we’re 14 0.00149 0.00686 0.00686 31.7
## 10 i’m i’m 13 0.00139 0.0103 0.0103 13.1
## # … with 717 more rows
We can identify the words that appear more frequently together or the words that appear more regularly with one specific word (in this example, company).
normalized_prob %>%
filter(word1 != word2) %>%
arrange(-p_together)
## # A tibble: 566 x 7
## word1 word2 n p p1 p2 p_together
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 sad incredibly 5 0.000533 0.00114 0.00114 408.
## 2 incredibly sad 5 0.000533 0.00114 0.00114 408.
## 3 interviews sad 5 0.000533 0.00114 0.00114 408.
## 4 sad interviews 5 0.000533 0.00114 0.00114 408.
## 5 assistance offer 5 0.000533 0.00114 0.00114 408.
## 6 offer assistance 5 0.000533 0.00114 0.00114 408.
## 7 helping assistance 5 0.000533 0.00114 0.00114 408.
## 8 assistance helping 5 0.000533 0.00114 0.00114 408.
## 9 greener grass 5 0.000533 0.00114 0.00114 408.
## 10 grass greener 5 0.000533 0.00114 0.00114 408.
## # … with 556 more rows
normalized_prob %>%
filter(word1 == "company") %>%
arrange(-p_together)
## # A tibble: 24 x 7
## word1 word2 n p p1 p2 p_together
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 company appreciates 5 0.000533 0.0366 0.00114 12.7
## 2 company strive 5 0.000533 0.0366 0.00114 12.7
## 3 company goals 5 0.000533 0.0366 0.00114 12.7
## 4 company 10 4 0.000426 0.0366 0.00114 10.2
## 5 company professional 4 0.000426 0.0366 0.00114 10.2
## 6 company accord 4 0.000426 0.0366 0.00114 10.2
## 7 company motivates 4 0.000426 0.0366 0.00114 10.2
## 8 company tenure 6 0.000639 0.0366 0.00229 7.65
## 9 company culture 5 0.000533 0.0366 0.00229 6.37
## 10 company told 4 0.000426 0.0366 0.00229 5.10
## # … with 14 more rows
The best option to work with this information is to transform the data frame into a sparse matrix and reduce its dimensionality. To reduce the number of dimensions, we carry out a singular value decomposition (SVD), which provides a way to factorize a matrix into singular vectors and singular values. We keep 50 dimensions.
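As a reminder of what the decomposition gives us (standard linear algebra, not specific to this data set): SVD factorizes a matrix as \(M \approx U \Sigma V^{T}\); the truncated version computed by irlba keeps only the leading singular values, and we will use the corresponding 50 columns of \(U\) as our word vectors.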
pmi_matrix <- normalized_prob %>%
mutate(pmi = log10(p_together)) %>%
cast_sparse(word1, word2, pmi)
pmi_matrix@x[is.na(pmi_matrix@x)] <- 0
pmi_svd <- irlba(pmi_matrix, 50, maxit = 500)
word_vectors <- pmi_svd$u
rownames(word_vectors) <- rownames(pmi_matrix)
Now, we can build a function (called search_proximity) that allows us to find words close to a given word, based on the word_vectors matrix that we have just created.
search_proximity <- function(word_vectors, selected_vector) {
# compute the dot product between every word vector and the selected vector
product <- word_vectors %*% selected_vector
# tidy the result, attach the tokens, and sort by decreasing similarity
similarities <- product %>%
tidy() %>%
as_tibble() %>%
mutate(token = rownames(product)) %>%
rename(similarity = x) %>%
arrange(-similarity)
similarities
}
Let's see two examples: the words whose vectors are closest to company and to employees.
company_together <- search_proximity(word_vectors, word_vectors["company", ])
company_together
## # A tibble: 161 x 2
## similarity[,1] token
## <dbl> <chr>
## 1 0.217 company
## 2 0.141 accord
## 3 0.129 strive
## 4 0.111 goals
## 5 0.110 praise
## 6 0.102 tenure
## 7 0.0998 attachment
## 8 0.0993 professional
## 9 0.0950 motivates
## 10 0.0907 appreciates
## # … with 151 more rows
employee_together <- search_proximity(word_vectors, word_vectors["employees", ])
employee_together
## # A tibble: 161 x 2
## similarity[,1] token
## <dbl> <chr>
## 1 0.243 employees
## 2 0.181 safe
## 3 0.168 noticing
## 4 0.168 recognizing
## 5 0.153 customers
## 6 0.145 lay
## 7 0.127 valuing
## 8 0.107 cares
## 9 0.0781 business
## 10 0.0598 talk
## # … with 151 more rows
Finally, we can visualize these results.
company_together %>%
mutate(selected = "company") %>%
bind_rows(employee_together %>%
mutate(selected = "employee")) %>%
group_by(selected) %>%
top_n(15, similarity) %>%
ungroup() %>%
mutate(token = reorder(token, similarity)) %>%
ggplot(aes(token, similarity, fill = selected)) +
geom_col(show.legend = FALSE) +
facet_wrap(~selected, scales = "free") +
coord_flip() +
scale_y_continuous(expand = c(0, 0)) +
labs(
x = NULL, title = "What word vectors are most similar to Company and Employees?",
subtitle = "Calculated using counts and matrix factorization"
)
Using the word_vectors matrix, we can apply arithmetic operations such as \(+\) and \(-\). The typical example is \(King - Man + Woman = Queen\). Let's see an example with our data set.
mystery_product <- word_vectors["company", ] - word_vectors["motivates", ]
new_words <- search_proximity(word_vectors, mystery_product)
new_words
## # A tibble: 161 x 2
## similarity[,1] token
## <dbl> <chr>
## 1 0.284 means
## 2 0.192 goals
## 3 0.151 accord
## 4 0.122 company
## 5 0.121 praise
## 6 0.114 tenure
## 7 0.108 attachment
## 8 0.106 professional
## 9 0.0964 successes
## 10 0.0957 10
## # … with 151 more rows
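If we wanted to mimic the full \(King - Man + Woman\) pattern, we could combine one subtraction and one addition. A purely illustrative sketch; the words are chosen only because they appear in the skipgram counts above, and any words that are rows of word_vectors would do:
# illustrative analogy-style query combining subtraction and addition;
# assumes "company", "management", and "employees" survived the n > 3 filter
analogy_vector <- word_vectors["company", ] -
  word_vectors["management", ] +
  word_vectors["employees", ]
analogy_words <- search_proximity(word_vectors, analogy_vector)
analogy_words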
After introducing these concepts, let's try to build a model to predict the kind of opinion based on the employee comments. As we want to predict the value of a categorical variable (positive or negative), we will use a simple logistic regression model.
We will build, train, and test the logistic regression model with two different word vectors: first with a 'context window' of four words, and then with a 'context window' of six words.
The first step is to create our word vectors with a 'context window' of four words, following the same process as before.
set.seed(10)
tidy_skipgrams <- opinion_non_stop %>%
unnest_tokens(ngram, comment, token = "ngrams", n = 4) %>%
filter(ngram != "") %>%
mutate(ngramID = row_number()) %>%
tidyr::unite(skipgramID, commentID, ngramID) %>%
unnest_tokens(word, ngram)
unigram_probs <- opinion_non_stop %>%
unnest_tokens(word, comment) %>%
count(word, sort = TRUE) %>%
mutate(p = n / sum(n))
skipgram_probs <- tidy_skipgrams %>%
pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>%
mutate(p = n / sum(n))
normalized_prob <- skipgram_probs %>%
filter(n > 3) %>%
rename(word1 = item1, word2 = item2) %>%
left_join(unigram_probs %>%
select(word1 = word, p1 = p),
by = "word1"
) %>%
left_join(unigram_probs %>%
select(word2 = word, p2 = p),
by = "word2"
) %>%
mutate(p_together = p / p1 / p2)
pmi_matrix <- normalized_prob %>%
mutate(pmi = log10(p_together)) %>%
cast_sparse(word1, word2, pmi)
pmi_matrix@x[is.na(pmi_matrix@x)] <- 0
pmi_svd <- irlba(pmi_matrix, 50, maxit = 500)
word_vectors <- pmi_svd$u
rownames(word_vectors) <- rownames(pmi_matrix)
my_embeddings_vector <- tibble(
tokens = row.names(word_vectors),
word_vectors %>% as_tibble()
)
To make all our analyses comparable, we have decided to set the seed to 10210 (an arbitrary number).
set.seed(10210)
Before building our model, we need to split our data set into a training set and a testing set. In this case, we have decided that the training set will contain 75% of the original data set, stratifying the samples based on whether the opinion is positive or negative.
opinion_split <- initial_split(opinions, strata = "assessment", prop = 0.75)
train_data <- opinion_split %>% training()
test_data <- opinion_split %>% testing()
There are several ways to preprocess our data set (e.g., missing values imputation, removing predictors, centering, and scaling) before the analysis. In this case, we will create a recipe that allows us to handle all the data preprocessing.
In the recipe, we need to indicate the model formula, the training data, and the preprocessing steps (tokenization, stop word removal, and the aggregation of our word embeddings):
data_rec <- recipe(formula = assessment ~ comment, data = train_data) %>%
step_tokenize(comment) %>%
step_stopwords(comment, keep = FALSE) %>%
step_word_embeddings(comment, embeddings = my_embeddings_vector) %>%
prep(training = train_data)
data_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 1
##
## Training data contained 113 data points and no missing data.
##
## Operations:
##
## Tokenization for comment [trained]
## Stop word removal for comment [trained]
## Word embeddings aggregated from comment [trained]
After the initial definition of the recipe, we could add further preprocessing steps if we needed them, as in the sketch below.
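For instance (purely illustrative, and not used in the rest of this tutorial), we could center and scale the embedding columns by adding step_normalize() to the same recipe:
# illustrative only: the same recipe with an extra normalization step;
# step_normalize() centers and scales the numeric columns created by
# step_word_embeddings(), and this extended recipe is not used below
data_rec_extended <- recipe(formula = assessment ~ comment, data = train_data) %>%
  step_tokenize(comment) %>%
  step_stopwords(comment, keep = FALSE) %>%
  step_word_embeddings(comment, embeddings = my_embeddings_vector) %>%
  step_normalize(all_predictors()) %>%
  prep(training = train_data)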
Now, we need to carry out (bake) our recipe with our data sets.
train_baked <- data_rec %>% bake(new_data = train_data)
test_baked <- data_rec %>% bake(new_data = test_data)
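Before fitting anything, it can be useful to have a quick look at the baked training set: it should contain the assessment outcome plus one aggregated embedding column per dimension kept in the SVD. A small optional check (not part of the original pipeline):
dim(train_baked)     # number of rows and columns after baking
glimpse(train_baked) # the outcome plus the aggregated embedding columns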
With the baked recipe, we can start working on the predictive model. As we have said before, we have decided to use a logistic regression model. To create the predictive model structure, we need to define two elements: the type of model with its main arguments (logistic_reg with the mixture and penalty), and the computational engine (glmnet).
glm_model <- logistic_reg(mode = "classification", mixture = 0, penalty = 0.1) %>%
set_engine("glmnet")
glm_model
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.1
## mixture = 0
##
## Computational engine: glmnet
Now, it's time to fit the predictive model structure to our training data set, so we need to define two elements: the model formula (assessment as a function of all the embedding columns) and the training data (our baked training set).
final_model <- glm_model %>%
fit(assessment ~ ., data = train_baked)
The final step is to assess the performance of the model. The most straightforward way is to show the actual and predicted values of the testing data set.
predictions_glm <- final_model %>%
predict(new_data = test_baked) %>%
bind_cols(test_baked %>% select(assessment))
kable(head(predictions_glm),
col.names = c("Predicted Values", "Actual Values"),
table.attr = "style='width:40%;'",
caption = "Comparison between actual and predicted values"
) %>%
kable_styling(position = "center")
| Predicted Values | Actual Values |
|---|---|
| positive | positive |
| negative | positive |
| positive | positive |
| positive | positive |
| positive | positive |
| positive | positive |
We can see that not all predicted values match the actual values. A better way to show the results is a confusion matrix, where we can see the number of false positives, false negatives, true positives, and true negatives.
predictions_glm %>%
conf_mat(assessment, .pred_class) %>%
pluck(1) %>%
as_tibble() %>%
ggplot(aes(Prediction, Truth, alpha = n, fill = c("1", "2", "3", "4"))) +
geom_tile(show.legend = FALSE) +
geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)
Another way to evaluate the predictive model is to assess the accuracy. In other words, the fraction of predictions the model got right.
predictions_glm %>%
metrics(assessment, .pred_class) %>%
select(-.estimator) %>%
filter(.metric == "accuracy")
## # A tibble: 1 x 2
## .metric .estimate
## <chr> <dbl>
## 1 accuracy 0.667
As we can see, the result is not very good (66.7%).
We can also assess the predictive model with the ROC curve:
test_baked %>%
select(assessment) %>%
mutate(
my_class = parsnip:::predict_class(final_model, test_baked),
my_prop = parsnip:::predict_classprob(final_model, test_baked) %>% pull(`negative`)
) %>%
roc_curve(assessment, my_prop) %>%
autoplot()
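If we prefer a single number summarizing the ROC curve, we can compute the area under it with roc_auc() from yardstick. A sketch that mirrors the pattern above (the exact value depends on the split, so no output is shown here):
# area under the ROC curve for the same test-set class probabilities
test_baked %>%
  select(assessment) %>%
  mutate(
    my_prop = parsnip:::predict_classprob(final_model, test_baked) %>% pull(`negative`)
  ) %>%
  roc_auc(assessment, my_prop)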
Now, we are going to evaluate what happens when we use a 'context window' of six words (in this second word vector we also keep 25 SVD dimensions instead of 50). The first step is to create the new word vectors.
set.seed(10)
tidy_skipgrams <- opinion_non_stop %>%
unnest_tokens(ngram, comment, token = "ngrams", n = 6) %>%
filter(ngram != "") %>%
mutate(ngramID = row_number()) %>%
tidyr::unite(skipgramID, commentID, ngramID) %>%
unnest_tokens(word, ngram)
unigram_probs <- opinion_non_stop %>%
unnest_tokens(word, comment) %>%
count(word, sort = TRUE) %>%
mutate(p = n / sum(n))
skipgram_probs <- tidy_skipgrams %>%
pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>%
mutate(p = n / sum(n))
normalized_prob <- skipgram_probs %>%
filter(n > 3) %>%
rename(word1 = item1, word2 = item2) %>%
left_join(unigram_probs %>%
select(word1 = word, p1 = p),
by = "word1"
) %>%
left_join(unigram_probs %>%
select(word2 = word, p2 = p),
by = "word2"
) %>%
mutate(p_together = p / p1 / p2)
pmi_matrix <- normalized_prob %>%
mutate(pmi = log10(p_together)) %>%
cast_sparse(word1, word2, pmi)
pmi_matrix@x[is.na(pmi_matrix@x)] <- 0
pmi_svd <- irlba(pmi_matrix, 25, maxit = 500)
word_vectors <- pmi_svd$u
rownames(word_vectors) <- rownames(pmi_matrix)
my_embeddings_vector <- tibble(
tokens = row.names(word_vectors),
word_vectors %>% as_tibble()
)
To start, we set the same seed again.
set.seed(10210)
We split up the data set again.
opinion_split <- initial_split(opinions, strata = "assessment", prop = 0.75)
train_data <- opinion_split %>% training()
test_data <- opinion_split %>% testing()
The preprocessing of our data set is the same as before, but now we use the new word vectors.
data_rec <- recipe(formula = assessment ~ comment, data = train_data) %>%
step_tokenize(comment) %>%
step_stopwords(comment, keep = FALSE) %>%
step_word_embeddings(comment, embeddings = my_embeddings_vector) %>%
prep(training = train_data)
data_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 1
##
## Training data contained 113 data points and no missing data.
##
## Operations:
##
## Tokenization for comment [trained]
## Stop word removal for comment [trained]
## Word embeddings aggregated from comment [trained]
Now, we need to carry out (bake) our recipe with our data sets.
train_baked <- data_rec %>% bake(new_data = train_data)
test_baked <- data_rec %>% bake(new_data = test_data)
Now, it’s time to define the model structure, to fit the predictive model to our training data, and predict the values of the testing data set.
glm_model <- logistic_reg(mode = "classification", mixture = 0, penalty = 0.1) %>%
set_engine("glmnet")
glm_model
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.1
## mixture = 0
##
## Computational engine: glmnet
final_model <- glm_model %>%
fit(assessment ~ ., data = train_baked)
predictions_glm <- final_model %>%
predict(new_data = test_baked) %>%
bind_cols(test_baked %>% select(assessment))
The final step is to assess the performance of the model. Let's see the results in a confusion matrix.
predictions_glm %>%
conf_mat(assessment, .pred_class) %>%
pluck(1) %>%
as_tibble() %>%
ggplot(aes(Prediction, Truth, alpha = n, fill = c("1", "2", "3", "4"))) +
geom_tile(show.legend = FALSE) +
geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)
If we prefer to assess the model by its accuracy, the result is 58.3%. It's not a good predictor.
predictions_glm %>%
metrics(assessment, .pred_class) %>%
select(-.estimator) %>%
filter(.metric == "accuracy")
## # A tibble: 1 x 2
## .metric .estimate
## <chr> <dbl>
## 1 accuracy 0.583
Finally, we can assess the predictive model with the ROC curve:
test_baked %>%
select(assessment) %>%
mutate(
my_class = parsnip:::predict_class(final_model, test_baked),
my_prop = parsnip:::predict_classprob(final_model, test_baked) %>% pull(`negative`)
) %>%
roc_curve(assessment, my_prop) %>%
autoplot()
In this document, we have built two simple models to predict how employees feel about their organizations from their comments. We used a logistic regression model and word embedding vectors with two different 'context windows'. Neither of them has given us an excellent result: the first one could be interesting, but not the second one. We need to remember that word embedding models are well suited to predicting word co-occurrence, not necessarily to other prediction tasks.
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Retrieved from https://arxiv.org/pdf/1301.3781.pdf
Ball, C. (2018). Word Embeddings. Retrieved from https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html
Silge, J. (2017). Word Vectors with tidy data principles. Retrieved from https://juliasilge.com/blog/tidy-word-vectors/