1 The context and the data set

This is the third tutorial on text mining of employees’ opinions. See the first one at https://rpubs.com/vicenc/textmining-employees, and the second one at https://rpubs.com/vicenc/sentiment-analysis.

We have a data set with 149 positive and negative employees’ opinions about how they feel about the organization where they work. You can download the data set from https://github.com/vicencfernandez/WorkforceAnalytics. The data set contains three variables:

  • commentID: An integer variable with the opinion ID
  • comment: A character variable with the employee opinion about their organization
  • assessment: A factor variable indicating if the opinion is positive or negative

The activity’s objective is to predict the kind of opinion (positive or negative) based on the employee’s comment. In this report, the analysis strategy focuses on a word embedding approach.

2 Packages for the analysis and reporting

We need to install and load several packages to analyze our data set with text mining.

library(tidyverse)
library(tidymodels)
library(tidytext)
library(textrecipes)
library(textdata)
library(irlba)
library(widyr)

The tidyverse is an opinionated collection of R packages designed for data science. The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. The tidytext package makes many text mining tasks easier, more effective, and consistent with tidy data principles. The textrecipes package extends the recipes package (part of tidymodels) with text processing steps. The textdata package includes various sentiment lexicons and labeled text data sets for classification and analysis. The irlba package provides methods for truncated singular value decomposition (SVD) and principal components analysis of large sparse and dense matrices. The widyr package is useful for operations such as co-occurrence counts, correlations, or clustering that are mathematically convenient on wide matrices.

We also load some extra packages for visualization of some figures and tables in this document.

library(gridExtra)
library(knitr)
library(kableExtra)

The gridExtra package provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page and to draw tables. The knitr package supports dynamic report generation in R. Finally, the kableExtra package helps build complex tables with kable and pipe syntax.

3 Loading and cleaning the data set

The first step is to load the data set and check that everything is right. Instead of using a standard R data.frame, we have decided to use a tibble because this makes it much easier to work with large data.

opinions <- read.csv("employee_opinions.csv", sep = ";") %>% as_tibble()
opinions
## # A tibble: 149 x 3
##    commentID comment                                                  assessment
##        <int> <chr>                                                    <chr>     
##  1         1 In my 30-year career, I’ve more never been proud and ho… positive  
##  2         2 They will have to burn the building down before I will … positive  
##  3         3 I’m surrounded by people who want to work and who love … positive  
##  4         4 This is the second love of my life.                      positive  
##  5         5 I have been an employee here for 45 years and will stay… positive  
##  6         6 For personal reasons I have been forced to seek employm… positive  
##  7         7 Happy employees don't go looking for other opportunities positive  
##  8         8 These folks walk the walk. Seriously. Truly a company o… positive  
##  9         9 Having this job has changed my life                      positive  
## 10        10 I’ve been working for this company for only two years, … positive  
## # … with 139 more rows

We can see that the variable \(assessment\) has been defined as a character, but we prefer to define it as a factor. So, let’s change it.

opinions$assessment <- as.factor(opinions$assessment)
opinions
## # A tibble: 149 x 3
##    commentID comment                                                  assessment
##        <int> <chr>                                                    <fct>     
##  1         1 In my 30-year career, I’ve more never been proud and ho… positive  
##  2         2 They will have to burn the building down before I will … positive  
##  3         3 I’m surrounded by people who want to work and who love … positive  
##  4         4 This is the second love of my life.                      positive  
##  5         5 I have been an employee here for 45 years and will stay… positive  
##  6         6 For personal reasons I have been forced to seek employm… positive  
##  7         7 Happy employees don't go looking for other opportunities positive  
##  8         8 These folks walk the walk. Seriously. Truly a company o… positive  
##  9         9 Having this job has changed my life                      positive  
## 10        10 I’ve been working for this company for only two years, … positive  
## # … with 139 more rows

Now, our data set is ready for analysis. The first step is to remove the stop words; see the previous tutorials for more about tokenization and stop words.

opinion_non_stop <- opinions %>%
  unnest_tokens(output = word, input = comment, token = "words") %>% # tokenize by words
  anti_join(stop_words) %>% # drop the stop words
  group_by(commentID) %>%
  summarise(comment = str_c(word, collapse = " ")) # rebuild each comment without stop words

4 Word Embedding Approach

The basic idea behind word embedding is to find similarities between words in one or several documents by using some model to predict the co-occurrence of words within a small set of words; in other words, how often two words appear close to each other. But what do we mean by close? It’s something that we will have to decide (and test). We will use the term ‘context window’ for the maximum distance, in words, between the word that we are analyzing and the words that we consider close to it. Let’s see an example.

Consider the sentence ‘this company is the best place where I have worked in my life’. If the center word is ‘place’ and the context window (our definition of proximity) is three words, then the close words are ‘is’, ‘the’, ‘best’, ‘where’, ‘I’, and ‘have’.
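
A minimal sketch in base R of this idea (the sentence and window size are taken from the example above):

sentence <- c("this", "company", "is", "the", "best", "place",
              "where", "i", "have", "worked", "in", "my", "life")
center <- which(sentence == "place") # position of the center word
window <- 3                          # context window of three words
# keep the words within 'window' positions of the center, excluding the center itself
context <- sentence[setdiff(max(1, center - window):min(length(sentence), center + window), center)]
context
## [1] "is"    "the"   "best"  "where" "i"     "have"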

4.1 Word Embedding Approaches

There are two approaches for using Word Embedding techniques:

  • Continuous Bag of Words - CBOW
  • Skip-Gram Model

The CBOW technique tries to predict a specific word based on a set of words. This approach is very common when we are implementing a predictive web search.

The Skip-Gram technique tries to predict a set of words based on a specific word. This approach is very common when we want to identify the context of a word. In this document, we focus on this approach, illustrated by the sketch below.
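
To make the difference concrete, here is a small sketch (ours, not part of the tutorial’s pipeline) that generates the Skip-Gram training pairs (center word, context word) for a toy sentence with a window of one word:

tokens <- c("the", "best", "place", "where", "i")
window <- 1
pairs <- do.call(rbind, lapply(seq_along(tokens), function(i) {
  # indices of the words within 'window' positions of word i, excluding i itself
  context <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
  data.frame(center = tokens[i], context = tokens[context])
}))
pairs

Each row is one prediction task for the Skip-Gram model; CBOW would instead group the context words of each center word into a single input.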

4.2 Building a Word Embedding Model

Most word embedding models are built with neural networks. In this case, we are going to create a simple model with tidytext. As we said before, we need to define our ‘context window’. To present the process, we have decided to use a ‘context window’ with a length of six words.

The first step is to tokenize our data by words, but before that we need to create a variable (called skipgramID) that allows us to identify the opinion and the n-gram.

tidy_skipgrams <- opinion_non_stop %>%
  unnest_tokens(ngram, comment, token = "ngrams", n = 6) %>% # every 6-word window
  filter(ngram != "") %>%
  mutate(ngramID = row_number()) %>%
  tidyr::unite(skipgramID, commentID, ngramID) %>% # unique ID per window
  unnest_tokens(word, ngram) # one row per word and window
tidy_skipgrams
## # A tibble: 1,620 x 2
##    skipgramID word   
##    <chr>      <chr>  
##  1 1_1        30     
##  2 1_1        career 
##  3 1_1        i’ve   
##  4 1_1        proud  
##  5 1_1        honored
##  6 1_1        carry  
##  7 1_2        career 
##  8 1_2        i’ve   
##  9 1_2        proud  
## 10 1_2        honored
## # … with 1,610 more rows

Now, we need to calculate the probabilities for the unigrams (how often we find each word in the original text) and for the skipgrams (how often we find each word next to every other word within the context window).
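
Concretely, the unigram probability of a word \(w\) is \(p(w) = n_w / \sum_{w'} n_{w'}\), where \(n_w\) is the number of times \(w\) appears; the skipgram probability of a pair \((w_1, w_2)\) is the number of context windows in which the two words co-occur divided by the total number of pairs.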

unigram_probs <- opinion_non_stop %>%
  unnest_tokens(word, comment) %>%
  count(word, sort = TRUE) %>%
  mutate(p = n / sum(n))
unigram_probs
## # A tibble: 526 x 3
##    word           n      p
##    <chr>      <int>  <dbl>
##  1 company       32 0.0366
##  2 people        18 0.0206
##  3 employees     17 0.0194
##  4 management    13 0.0149
##  5 it’s          11 0.0126
##  6 feel          10 0.0114
##  7 job           10 0.0114
##  8 i’m            9 0.0103
##  9 time           9 0.0103
## 10 care           7 0.008 
## # … with 516 more rows
skipgram_probs <- tidy_skipgrams %>%
  pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>%
  mutate(p = n / sum(n))
skipgram_probs
## # A tibble: 3,903 x 4
##    item1      item2          n       p
##    <chr>      <chr>      <dbl>   <dbl>
##  1 company    company       49 0.00522
##  2 employees  employees     31 0.00330
##  3 people     people        30 0.00320
##  4 time       time          30 0.00320
##  5 management management    24 0.00256
##  6 team       team          21 0.00224
##  7 finding    finding       18 0.00192
##  8 leave      leave         14 0.00149
##  9 we’re      we’re         14 0.00149
## 10 i’m        i’m           13 0.00139
## # … with 3,893 more rows

Finally, we normalize the skipgram probabilities based on the unigram probabilities.
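
In other words, for each pair of words we compute

\[ p_{together}(w_1, w_2) = \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)} \]

A value greater than one means that the two words co-occur more often than we would expect if they were independent; its logarithm is the pointwise mutual information (PMI) that we will use below.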

normalized_prob <- skipgram_probs %>%
  filter(n > 3) %>%
  rename(word1 = item1, word2 = item2) %>%
  left_join(unigram_probs %>%
    select(word1 = word, p1 = p),
  by = "word1"
  ) %>%
  left_join(unigram_probs %>%
    select(word2 = word, p2 = p),
  by = "word2"
  ) %>%
  mutate(p_together = p / p1 / p2)
normalized_prob
## # A tibble: 727 x 7
##    word1      word2          n       p      p1      p2 p_together
##    <chr>      <chr>      <dbl>   <dbl>   <dbl>   <dbl>      <dbl>
##  1 company    company       49 0.00522 0.0366  0.0366        3.90
##  2 employees  employees     31 0.00330 0.0194  0.0194        8.75
##  3 people     people        30 0.00320 0.0206  0.0206        7.56
##  4 time       time          30 0.00320 0.0103  0.0103       30.2 
##  5 management management    24 0.00256 0.0149  0.0149       11.6 
##  6 team       team          21 0.00224 0.00571 0.00571      68.5 
##  7 finding    finding       18 0.00192 0.00571 0.00571      58.7 
##  8 leave      leave         14 0.00149 0.00686 0.00686      31.7 
##  9 we’re      we’re         14 0.00149 0.00686 0.00686      31.7 
## 10 i’m        i’m           13 0.00139 0.0103  0.0103       13.1 
## # … with 717 more rows

We can identify the words that appear more frequently together or the words that appear more regularly with one specific word (in this example, company).

normalized_prob %>%
  filter(word1 != word2) %>%
  arrange(-p_together)
## # A tibble: 566 x 7
##    word1      word2          n        p      p1      p2 p_together
##    <chr>      <chr>      <dbl>    <dbl>   <dbl>   <dbl>      <dbl>
##  1 sad        incredibly     5 0.000533 0.00114 0.00114       408.
##  2 incredibly sad            5 0.000533 0.00114 0.00114       408.
##  3 interviews sad            5 0.000533 0.00114 0.00114       408.
##  4 sad        interviews     5 0.000533 0.00114 0.00114       408.
##  5 assistance offer          5 0.000533 0.00114 0.00114       408.
##  6 offer      assistance     5 0.000533 0.00114 0.00114       408.
##  7 helping    assistance     5 0.000533 0.00114 0.00114       408.
##  8 assistance helping        5 0.000533 0.00114 0.00114       408.
##  9 greener    grass          5 0.000533 0.00114 0.00114       408.
## 10 grass      greener        5 0.000533 0.00114 0.00114       408.
## # … with 556 more rows
normalized_prob %>%
  filter(word1 == "company") %>%
  arrange(-p_together)
## # A tibble: 24 x 7
##    word1   word2            n        p     p1      p2 p_together
##    <chr>   <chr>        <dbl>    <dbl>  <dbl>   <dbl>      <dbl>
##  1 company appreciates      5 0.000533 0.0366 0.00114      12.7 
##  2 company strive           5 0.000533 0.0366 0.00114      12.7 
##  3 company goals            5 0.000533 0.0366 0.00114      12.7 
##  4 company 10               4 0.000426 0.0366 0.00114      10.2 
##  5 company professional     4 0.000426 0.0366 0.00114      10.2 
##  6 company accord           4 0.000426 0.0366 0.00114      10.2 
##  7 company motivates        4 0.000426 0.0366 0.00114      10.2 
##  8 company tenure           6 0.000639 0.0366 0.00229       7.65
##  9 company culture          5 0.000533 0.0366 0.00229       6.37
## 10 company told             4 0.000426 0.0366 0.00229       5.10
## # … with 14 more rows

The best option for working with this information is to transform the data frame into a sparse matrix and reduce its dimensionality. To reduce the number of dimensions, we have decided to carry out a singular value decomposition (SVD), which provides a way to factorize a matrix into singular vectors and singular values. We keep 50 dimensions.
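
Formally, the SVD factorizes our PMI matrix \(M\) as

\[ M = U \Sigma V^{\top} \]

where the columns of \(U\) and \(V\) are the singular vectors and the diagonal of \(\Sigma\) holds the singular values; keeping the first 50 columns of \(U\) gives us one 50-dimensional vector per word.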

pmi_matrix <- normalized_prob %>%
  mutate(pmi = log10(p_together)) %>%
  cast_sparse(word1, word2, pmi)

pmi_matrix@x[is.na(pmi_matrix@x)] <- 0

pmi_svd <- irlba(pmi_matrix, 50, maxit = 500)
word_vectors <- pmi_svd$u
rownames(word_vectors) <- rownames(pmi_matrix)
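
As an optional sanity check (ours, not part of the original pipeline), we can plot the singular values returned by irlba; a sharp decay suggests that most of the structure is captured by the first few dimensions.

# inspect how quickly the singular values decay
plot(pmi_svd$d, type = "b", xlab = "dimension", ylab = "singular value")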

Now, we can build a function (called search_proximity) that allows us to find close words based on the word_vectors that we have just created.

search_proximity <- function(word_vectors, selected_vector) {
  # dot product of every word vector with the selected word's vector
  product <- word_vectors %*% selected_vector

  # sort the words by similarity and return the result
  product %>%
    tidy() %>%
    as_tibble() %>%
    mutate(token = rownames(product)) %>%
    rename(similarity = x) %>%
    arrange(-similarity)
}
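
Note that search_proximity ranks words by raw dot products. A common variant, sketched below as a hypothetical alternative (it is not used in the rest of this tutorial), normalizes by the vector lengths to obtain the cosine similarity:

cosine_proximity <- function(word_vectors, selected_vector) {
  # product of the norms of each word vector and the selected vector,
  # so that long vectors do not dominate the ranking
  norms <- sqrt(rowSums(word_vectors^2)) * sqrt(sum(selected_vector^2))
  tibble(
    token = rownames(word_vectors),
    similarity = as.vector(word_vectors %*% selected_vector) / norms
  ) %>%
    arrange(-similarity)
}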

Let’s see two examples. For instance, we can find the words whose vectors are closest to those of company and employees.

company_together <- search_proximity(word_vectors, word_vectors["company", ])
company_together
## # A tibble: 161 x 2
##    similarity[,1] token       
##             <dbl> <chr>       
##  1         0.217  company     
##  2         0.141  accord      
##  3         0.129  strive      
##  4         0.111  goals       
##  5         0.110  praise      
##  6         0.102  tenure      
##  7         0.0998 attachment  
##  8         0.0993 professional
##  9         0.0950 motivates   
## 10         0.0907 appreciates 
## # … with 151 more rows
employee_together <- search_proximity(word_vectors, word_vectors["employees", ])
employee_together
## # A tibble: 161 x 2
##    similarity[,1] token      
##             <dbl> <chr>      
##  1         0.243  employees  
##  2         0.181  safe       
##  3         0.168  noticing   
##  4         0.168  recognizing
##  5         0.153  customers  
##  6         0.145  lay        
##  7         0.127  valuing    
##  8         0.107  cares      
##  9         0.0781 business   
## 10         0.0598 talk       
## # … with 151 more rows

Finally, we can visualize these results.

company_together %>%
  mutate(selected = "company") %>%
  bind_rows(employee_together %>%
    mutate(selected = "employee")) %>%
  group_by(selected) %>%
  top_n(15, similarity) %>%
  ungroup() %>%
  mutate(token = reorder(token, similarity)) %>%
  ggplot(aes(token, similarity, fill = selected)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~selected, scales = "free") +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0)) +
  labs(
    x = NULL, title = "What word vectors are most similar to Company and Employees?",
    subtitle = "Calculated using counts and matrix factorization"
  )

Using the word_vectors, we can apply some mathematical functions, such as \(+\) and \(-\). The typical example is: \(King - Man + Woman = Queen\). Let’s see an example with our data set.

mystery_product <- word_vectors["company", ] - word_vectors["motivates", ]
new_words <- search_proximity(word_vectors, mystery_product)
new_words
## # A tibble: 161 x 2
##    similarity[,1] token       
##             <dbl> <chr>       
##  1         0.284  means       
##  2         0.192  goals       
##  3         0.151  accord      
##  4         0.122  company     
##  5         0.121  praise      
##  6         0.114  tenure      
##  7         0.108  attachment  
##  8         0.106  professional
##  9         0.0964 successes   
## 10         0.0957 10          
## # … with 151 more rows

5 Predictive models based on word embeddings

After introducing some concepts, let’s try to build a model to predict the kind of opinion based on the employee comments. As we want to predict the value of a categorical variable (positive or negative), we will use a simple logistic regression model.
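
As a reminder, logistic regression models the probability of a positive opinion as

\[ \Pr(positive \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta^{\top}\mathbf{x})}} \]

where \(\mathbf{x}\) is the vector of word embedding features that represents a comment.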

We will follow two approaches to build, train, and test our logistic regression model:

  • A model based on a ‘context window’ with a length of 4 words
  • A model based on a ‘context window’ with a length of 6 words

5.1 Predictive model with a ‘context window’ of 4 words

The first step is to create our word vector with a ‘context window’ of four words, as we have seen before.

set.seed(10)

tidy_skipgrams <- opinion_non_stop %>%
  unnest_tokens(ngram, comment, token = "ngrams", n = 4) %>%
  filter(ngram != "") %>%
  mutate(ngramID = row_number()) %>%
  tidyr::unite(skipgramID, commentID, ngramID) %>%
  unnest_tokens(word, ngram)

unigram_probs <- opinion_non_stop %>%
  unnest_tokens(word, comment) %>%
  count(word, sort = TRUE) %>%
  mutate(p = n / sum(n))

skipgram_probs <- tidy_skipgrams %>%
  pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>%
  mutate(p = n / sum(n))

normalized_prob <- skipgram_probs %>%
  filter(n > 3) %>%
  rename(word1 = item1, word2 = item2) %>%
  left_join(unigram_probs %>%
    select(word1 = word, p1 = p),
  by = "word1"
  ) %>%
  left_join(unigram_probs %>%
    select(word2 = word, p2 = p),
  by = "word2"
  ) %>%
  mutate(p_together = p / p1 / p2)

pmi_matrix <- normalized_prob %>%
  mutate(pmi = log10(p_together)) %>%
  cast_sparse(word1, word2, pmi)

pmi_matrix@x[is.na(pmi_matrix@x)] <- 0

pmi_svd <- irlba(pmi_matrix, 50, maxit = 500)
word_vectors <- pmi_svd$u
rownames(word_vectors) <- rownames(pmi_matrix)

my_embeddings_vector <- tibble(
  tokens = row.names(word_vectors),
  word_vectors %>% as_tibble()
)

To compare all our analyses’ results, we have decided to set the seed to 10210 (an arbitrary number).

set.seed(10210)

Before building our model, we need to split up our data set into a training data set and a testing data set. In this case, we have decided that the training data set contains 75% of the original data set, and we stratify the samples based on whether the opinion is positive or negative.

opinion_split <- initial_split(opinions, strata = "assessment", prop = 0.75)
train_data <- opinion_split %>% training()
test_data <- opinion_split %>% testing()

There are several ways to preprocess our data set (e.g., missing values imputation, removing predictors, centering, and scaling) before the analysis. In this case, we will create a recipe that allows us to handle all the data preprocessing.

In the recipe, we need to indicate the following parameters:

  • The \(formula\), where \(assessment\) is the dependent variable and \(comment\) is the independent variable
  • The \(data\), where we have decided to use the training data set to build the model
data_rec <- recipe(formula = assessment ~ comment, data = train_data) %>%
  step_tokenize(comment) %>%
  step_stopwords(comment, keep = FALSE) %>%
  step_word_embeddings(comment, embeddings = my_embeddings_vector) %>%
  prep(training = train_data)
data_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Training data contained 113 data points and no missing data.
## 
## Operations:
## 
## Tokenization for comment [trained]
## Stop word removal for comment [trained]
## Word embeddings aggregated from comment [trained]

After the initial definition of the recipe, we add the following processing steps:

  • To tokenize the data set as we have seen before. As we haven’t indicated anything, the function will tokenize the data set by words (1-gram).
  • To remove the stop words.
  • To embed the words (tokens) with our word embedding model.
  • To prepare our data set based on the previous steps.

Now, we need to carry out (bake) our recipe with our data sets.

train_baked <- data_rec %>% bake(new_data = train_data)
test_baked <- data_rec %>% bake(new_data = test_data)

With the baked recipe, we can start working on the predictive model. As we have seen before, we have decided to use a logistic regression model. To create the predictive model structure, we need to define two elements:

  • The type of model and the mode - As we explained before, we will use a logistic regression for classification
  • The computational engine - There are several options, but here we have decided to use glmnet
glm_model <- logistic_reg(mode = "classification", mixture = 0, penalty = 0.1) %>%
  set_engine("glmnet")
glm_model
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = 0.1
##   mixture = 0
## 
## Computational engine: glmnet

Now, it’s time to fit the predictive model structure to our training data set, so we need to define two elements:

  • The dependent and independent variables of the model - The dependent variable is \(assessment\), and the independent variables are the word embedding features derived from \(comment\)
  • The data set to fit - As we are building the model, we will use the training data set
final_model <- glm_model %>%
  fit(assessment ~ ., data = train_baked)

The final step is to assess the performance of the model. The most straightforward way is to show the actual and predictive values of the testing data set.

predictions_glm <- final_model %>%
  predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% select(assessment))

kable(head(predictions_glm),
  col.names = c("Predictive Values", "Actual Values"),
  table.attr = "style='width:40%;'",
  caption = "Comparison between actual and predicted values"
) %>%
  kable_styling(position = "center")
Comparison between actual and predicted values

Predictive Values   Actual Values
positive            positive
negative            positive
positive            positive
positive            positive
positive            positive
positive            positive

We can see that not all predicted values match the actual values. A better way to show the results is the confusion matrix, where we can see the number of false positives, false negatives, true positives, and true negatives.
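
With Prediction in the rows and Truth in the columns, the four cells are laid out as follows:

                    Truth: positive     Truth: negative
Pred: positive      true positives      false positives
Pred: negative      false negatives     true negatives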

predictions_glm %>%
  conf_mat(assessment, .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  ggplot(aes(Prediction, Truth, alpha = n, fill = c("1", "2", "3", "4"))) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)

Another way to evaluate the predictive model is to assess its accuracy; in other words, the fraction of predictions that the model got right.

predictions_glm %>%
  metrics(assessment, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy")
## # A tibble: 1 x 2
##   .metric  .estimate
##   <chr>        <dbl>
## 1 accuracy     0.667

As we can see, the result is not very good (66.7%).
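
The same number can be computed by hand as a quick check (this line is ours, not part of the original pipeline): accuracy is simply the share of predictions that match the actual values.

mean(predictions_glm$.pred_class == predictions_glm$assessment)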

We can also assess the predictive model with the ROC curve:

test_baked %>%
  select(assessment) %>%
  mutate(
    my_class = parsnip:::predict_class(final_model, test_baked),
    my_prop = parsnip:::predict_classprob(final_model, test_baked) %>% pull(`negative`)
  ) %>%
  roc_curve(assessment, my_prop) %>%
  autoplot()

5.2 Predictive model with a ‘context window’ of 6 words

Now, we are going to evaluate what happens when we use a ‘context window’ of six words. The first step is to create a new word vector (note that this time we also reduce the PMI matrix to 25 dimensions instead of 50).

set.seed(10)

tidy_skipgrams <- opinion_non_stop %>%
  unnest_tokens(ngram, comment, token = "ngrams", n = 6) %>%
  filter(ngram != "") %>%
  mutate(ngramID = row_number()) %>%
  tidyr::unite(skipgramID, commentID, ngramID) %>%
  unnest_tokens(word, ngram)

unigram_probs <- opinion_non_stop %>%
  unnest_tokens(word, comment) %>%
  count(word, sort = TRUE) %>%
  mutate(p = n / sum(n))

skipgram_probs <- tidy_skipgrams %>%
  pairwise_count(word, skipgramID, diag = TRUE, sort = TRUE) %>%
  mutate(p = n / sum(n))

normalized_prob <- skipgram_probs %>%
  filter(n > 3) %>%
  rename(word1 = item1, word2 = item2) %>%
  left_join(unigram_probs %>%
    select(word1 = word, p1 = p),
  by = "word1"
  ) %>%
  left_join(unigram_probs %>%
    select(word2 = word, p2 = p),
  by = "word2"
  ) %>%
  mutate(p_together = p / p1 / p2)

pmi_matrix <- normalized_prob %>%
  mutate(pmi = log10(p_together)) %>%
  cast_sparse(word1, word2, pmi)

pmi_matrix@x[is.na(pmi_matrix@x)] <- 0

pmi_svd <- irlba(pmi_matrix, 25, maxit = 500)
word_vectors <- pmi_svd$u
rownames(word_vectors) <- rownames(pmi_matrix)

my_embeddings_vector <- tibble(
  tokens = row.names(word_vectors),
  word_vectors %>% as_tibble()
)

To start, we set the same seed again.

set.seed(10210)

We split up the data set again.

opinion_split <- initial_split(opinions, strata = "assessment", prop = 0.75)
train_data <- opinion_split %>% training()
test_data <- opinion_split %>% testing()

The preprocessing of our data set is the same as before, but now we use the new word vector.

data_rec <- recipe(formula = assessment ~ comment, data = train_data) %>%
  step_tokenize(comment) %>%
  step_stopwords(comment, keep = FALSE) %>%
  step_word_embeddings(comment, embeddings = my_embeddings_vector) %>%
  prep(training = train_data)
data_rec
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Training data contained 113 data points and no missing data.
## 
## Operations:
## 
## Tokenization for comment [trained]
## Stop word removal for comment [trained]
## Word embeddings aggregated from comment [trained]

Now, we need to carry out (bake) our recipe with our data sets.

train_baked <- data_rec %>% bake(new_data = train_data)
test_baked <- data_rec %>% bake(new_data = test_data)

Now, it’s time to define the model structure, to fit the predictive model to our training data, and predict the values of the testing data set.

glm_model <- logistic_reg(mode = "classification", mixture = 0, penalty = 0.1) %>%
  set_engine("glmnet")
glm_model
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = 0.1
##   mixture = 0
## 
## Computational engine: glmnet
final_model <- glm_model %>%
  fit(assessment ~ ., data = train_baked)

predictions_glm <- final_model %>%
  predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% select(assessment))

The final step is to assess the performance of the model. Let’s see the results in a confusion matrix.

predictions_glm %>%
  conf_mat(assessment, .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  ggplot(aes(Prediction, Truth, alpha = n, fill = c("1", "2", "3", "4"))) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = n), colour = "white", alpha = 1, size = 8)

If we prefer to assess the model by its accuracy, the result is 58.3%. It’s not a good predictor.

predictions_glm %>%
  metrics(assessment, .pred_class) %>%
  select(-.estimator) %>%
  filter(.metric == "accuracy")
## # A tibble: 1 x 2
##   .metric  .estimate
##   <chr>        <dbl>
## 1 accuracy     0.583

Finally, we can assess the predictive model with the ROC curve:

test_baked %>%
  select(assessment) %>%
  mutate(
    my_class = parsnip:::predict_class(final_model, test_baked),
    my_prop = parsnip:::predict_classprob(final_model, test_baked) %>% pull(`negative`)
  ) %>%
  roc_curve(assessment, my_prop) %>%
  autoplot()

6 Conclusions

This document has built two simple models to predict how employees feel about their organizations from their comments. We decided to use a logistic regression model and word embedding vectors with two different ‘context windows’. Neither of them has given us an excellent result. The first one could be interesting, but not the second one. We need to remember that word embedding models are good at predicting word co-occurrence, not necessarily at other prediction tasks.

7 References

Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Retrieved from https://arxiv.org/pdf/1301.3781.pdf

Ball, C. (2018). Word Embeddings. Retrieved from https://cbail.github.io/textasdata/word2vec/rmarkdown/word2vec.html

Silge, J. (2017). Word Vectors with tidy data principles. Retrieved from https://juliasilge.com/blog/tidy-word-vectors/