Build a model for Animal crossing user reviews TidyTuesday 2020-05-05 for predict the rating from the text of the user reviews.

Get the data and initial exploration.

user_reviews <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/user_reviews.tsv')

head(user_reviews, 15)
## # A tibble: 15 x 4
##    grade user_name     text                                           date      
##    <dbl> <chr>         <chr>                                          <date>    
##  1     4 mds27272      My gf started playing before me. No option to~ 2020-03-20
##  2     5 lolo2178      While the game itself is great, really relaxi~ 2020-03-20
##  3     0 Roachant      My wife and I were looking forward to playing~ 2020-03-20
##  4     0 Houndf        We need equal values and opportunities for al~ 2020-03-20
##  5     0 ProfessorFox  BEWARE!  If you have multiple people in your ~ 2020-03-20
##  6     0 tb726         The limitation of one island per Switch (not ~ 2020-03-20
##  7     0 Outryder86    I was very excited for this new installment o~ 2020-03-20
##  8     0 Subby89       It's 2020 and for some reason Nintendo has de~ 2020-03-20
##  9     0 RocketRon     This is so annoying. Only one player has the ~ 2020-03-20
## 10     0 chankills     I purchased this game for my household (me an~ 2020-03-20
## 11     1 rafilks182    they heavily marketed this game as a social e~ 2020-03-20
## 12     0 Gravy_Dreamb~ You can only have one island per Switch. If m~ 2020-03-20
## 13     0 Lazariel      Only ONE island per console!You can't create ~ 2020-03-20
## 14     0 novaaaa       You can only create one island per switch and~ 2020-03-20
## 15     0 sooners11     One island per console is a pretty terrible a~ 2020-03-20
user_reviews %>%
  count(grade) %>%
  ggplot(aes(grade, n)) +
  geom_col(fill = "midnightblue", alpha = 0.7)+
  labs(x="", y="",
       title="Number of reviews by rating")

We can see that lots of people give extreme scores in their rewiews, this distribution is nor suitable to make a prediction model so its a good idea to convert this scores to a label and build a model of binary clasification good vs. bad user reviews.

user_reviews %>%
  filter(grade > 8) %>%
  sample_n(5) %>%
  pull(text)
## [1] "Un juego fantástico, con muchas opciones, con el cual puedes relajarte y jugarlos con tu familia"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [2] "I dont think that one switch for one island is bad or good. I played the original games and hated them because it was boring to me. Now I get to go on and drop gifts off for my boyfriend in front of his house, hop over to my roommates island and hang out with either one and and drop gifts for them too. I think it's fun to see what's different when I wake up and discussing what to do orI dont think that one switch for one island is bad or good. I played the original games and hated them because it was boring to me. Now I get to go on and drop gifts off for my boyfriend in front of his house, hop over to my roommates island and hang out with either one and and drop gifts for them too. I think it's fun to see what's different when I wake up and discussing what to do or where to out things. It makes it more of a community game.… Expand"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [3] "Best game that I have EVER played! So much thought and care went into turning this into a genuinely immersive experience. Personally, the 1 island per switch problem didn’t affect me as I have my own personal switch and my brother who also plays has HIS own. The characters are  way more lifelike than in Animal Crossing New Leaf and I feel like my island is truly mine. You can placeBest game that I have EVER played! So much thought and care went into turning this into a genuinely immersive experience. Personally, the 1 island per switch problem didn’t affect me as I have my own personal switch and my brother who also plays has HIS own. The characters are  way more lifelike than in Animal Crossing New Leaf and I feel like my island is truly mine. You can place furniture inside OR outside as well as building shops, bridges, ramps and even a giant robot! With the shops you can find carefully designed furniture and clothing! Most of the furniture you buy can even be customized into multiple styles, truly making it feel like yours! You can also craft your own furniture such as chairs, tables and beds! It makes it feel like you have a part in your island life. With the clothing  store you can buy hats, shirts, pants and more! Unfortunately this game has been review BOMBED by people who in the end won’t make change! This is the best Switch game and I highly recommend. This game is a 10/10, not 5.7.… Expand"
## [4] "Best game ever if you don't have to share your island with anyone. I feel it was designed for people between 25 and 30 years old.Don't miss this game just because the low ratings due to the one island per console issue, if you live alone you won't regret!"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [5] "Excellent game, this title is a very good example of real entertainment for gamers, thanks Nintendo."

The reviews have some text problems, Let’s remove at least the final “Expand” from the reviews, and create a new categorical rating variable.

reviews_parsed <- user_reviews %>%
  mutate(text = str_remove(text, "Expand$")) %>% #remove the word Expand
  mutate(rating = case_when(
    grade > 7 ~ "good", #make a new variable rating, treshold in 7
    TRUE ~ "bad"
  ))

Now let’s prepare the reviews to make the model, for this we split the text column into tokens by user name.

library(tidytext)

words_per_review <- reviews_parsed %>%
  unnest_tokens(word, text) %>%
  count(user_name, name = "total_words")

words_per_review %>%
  ggplot(aes(total_words)) +
  geom_histogram(fill = "midnightblue", alpha = 0.8)+
  labs(y="", x= "Total words",
       title= "Histogram of Number of words")

This distribution of words is not natural, the gap in the midle of the distribution looks very strange and maybe is related to the data adquisition procedure. But the data is never perfect…


Let’s build a model

library(tidymodels)

set.seed(123)
review_split <- initial_split(reviews_parsed, strata = rating)
review_train <- training(review_split)
review_test <- testing(review_split)
library(textrecipes)

review_rec <- recipe(rating ~ text, data = review_train) %>% #make the recipe
  step_tokenize(text) %>% #tokenixe text 
  step_stopwords(text) %>% #remove stopwords
  step_tokenfilter(text, max_tokens = 500) %>% #filter tokens by frecuency, keep the top 500 most-used tokens
  step_tfidf(text) %>% #create a Inverted document frecuency. common or rare the word is across all the observations
  step_normalize(all_predictors()) #Center and scale numeric data

review_prep <- prep(review_rec) #The prep() function is where everything gets evaluated

review_prep
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Training data contained 2250 data points and no missing data.
## 
## Operations:
## 
## Tokenization for text [trained]
## Stop word removal for text [trained]
## Text filtering for text [trained]
## Term frequency-inverse document frequency with text [trained]
## Centering and scaling for tfidf_text_0, tfidf_text_1, ... [trained]
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lasso_wf <- workflow() %>% #use the workflow to manage modeling pipelines more easily
  add_recipe(review_rec) %>%
  add_model(lasso_spec)

lasso_wf
## == Workflow ==================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------
## 5 Recipe Steps
## 
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
## 
## -- Model -----------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

First, we need a set of possible regularization parameters to try.

lambda_grid <- grid_regular(penalty(), levels = 40) 

Next, we need a set of resampled data to fit and evaluate all these models.

set.seed(123)
review_folds <- bootstraps(review_train, strata = rating)
review_folds
## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits             id         
##    <list>             <chr>      
##  1 <split [2.2K/812]> Bootstrap01
##  2 <split [2.2K/850]> Bootstrap02
##  3 <split [2.2K/814]> Bootstrap03
##  4 <split [2.2K/814]> Bootstrap04
##  5 <split [2.2K/853]> Bootstrap05
##  6 <split [2.2K/840]> Bootstrap06
##  7 <split [2.2K/816]> Bootstrap07
##  8 <split [2.2K/826]> Bootstrap08
##  9 <split [2.2K/804]> Bootstrap09
## 10 <split [2.2K/809]> Bootstrap10
## # ... with 15 more rows

Implement the tunning

set.seed(2020)
lasso_grid <- tune_grid(lasso_wf,
                        resamples = review_folds,
                        grid = lambda_grid,
                        metrics = metric_set(roc_auc, ppv, npv) #set the metrics to compute
)

Once we have our tuning results, we can examine them in detail.

lasso_grid %>%
  collect_metrics() %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")%>%
  scroll_box(height = "200px")
penalty .metric .estimator mean n std_err .config
0.0000000 npv binary 0.7404002 25 0.0051755 Model01
0.0000000 ppv binary 0.8635544 25 0.0030250 Model01
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model01
0.0000000 npv binary 0.7404002 25 0.0051755 Model02
0.0000000 ppv binary 0.8635544 25 0.0030250 Model02
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model02
0.0000000 npv binary 0.7404002 25 0.0051755 Model03
0.0000000 ppv binary 0.8635544 25 0.0030250 Model03
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model03
0.0000000 npv binary 0.7404002 25 0.0051755 Model04
0.0000000 ppv binary 0.8635544 25 0.0030250 Model04
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model04
0.0000000 npv binary 0.7404002 25 0.0051755 Model05
0.0000000 ppv binary 0.8635544 25 0.0030250 Model05
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model05
0.0000000 npv binary 0.7404002 25 0.0051755 Model06
0.0000000 ppv binary 0.8635544 25 0.0030250 Model06
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model06
0.0000000 npv binary 0.7404002 25 0.0051755 Model07
0.0000000 ppv binary 0.8635544 25 0.0030250 Model07
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model07
0.0000000 npv binary 0.7404002 25 0.0051755 Model08
0.0000000 ppv binary 0.8635544 25 0.0030250 Model08
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model08
0.0000000 npv binary 0.7404002 25 0.0051755 Model09
0.0000000 ppv binary 0.8635544 25 0.0030250 Model09
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model09
0.0000000 npv binary 0.7404002 25 0.0051755 Model10
0.0000000 ppv binary 0.8635544 25 0.0030250 Model10
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model10
0.0000000 npv binary 0.7404002 25 0.0051755 Model11
0.0000000 ppv binary 0.8635544 25 0.0030250 Model11
0.0000000 roc_auc binary 0.8784799 25 0.0027573 Model11
0.0000001 npv binary 0.7404002 25 0.0051755 Model12
0.0000001 ppv binary 0.8635544 25 0.0030250 Model12
0.0000001 roc_auc binary 0.8784799 25 0.0027573 Model12
0.0000001 npv binary 0.7404002 25 0.0051755 Model13
0.0000001 ppv binary 0.8635544 25 0.0030250 Model13
0.0000001 roc_auc binary 0.8784799 25 0.0027573 Model13
0.0000002 npv binary 0.7404002 25 0.0051755 Model14
0.0000002 ppv binary 0.8635544 25 0.0030250 Model14
0.0000002 roc_auc binary 0.8784799 25 0.0027573 Model14
0.0000004 npv binary 0.7404002 25 0.0051755 Model15
0.0000004 ppv binary 0.8635544 25 0.0030250 Model15
0.0000004 roc_auc binary 0.8784799 25 0.0027573 Model15
0.0000007 npv binary 0.7404002 25 0.0051755 Model16
0.0000007 ppv binary 0.8635544 25 0.0030250 Model16
0.0000007 roc_auc binary 0.8784799 25 0.0027573 Model16
0.0000013 npv binary 0.7404002 25 0.0051755 Model17
0.0000013 ppv binary 0.8635544 25 0.0030250 Model17
0.0000013 roc_auc binary 0.8784799 25 0.0027573 Model17
0.0000023 npv binary 0.7404002 25 0.0051755 Model18
0.0000023 ppv binary 0.8635544 25 0.0030250 Model18
0.0000023 roc_auc binary 0.8784799 25 0.0027573 Model18
0.0000041 npv binary 0.7404002 25 0.0051755 Model19
0.0000041 ppv binary 0.8635544 25 0.0030250 Model19
0.0000041 roc_auc binary 0.8784799 25 0.0027573 Model19
0.0000074 npv binary 0.7404002 25 0.0051755 Model20
0.0000074 ppv binary 0.8635544 25 0.0030250 Model20
0.0000074 roc_auc binary 0.8784799 25 0.0027573 Model20
0.0000134 npv binary 0.7403643 25 0.0051784 Model21
0.0000134 ppv binary 0.8634938 25 0.0030462 Model21
0.0000134 roc_auc binary 0.8785062 25 0.0027564 Model21
0.0000242 npv binary 0.7421082 25 0.0051740 Model22
0.0000242 ppv binary 0.8644417 25 0.0029595 Model22
0.0000242 roc_auc binary 0.8807108 25 0.0025899 Model22
0.0000438 npv binary 0.7441448 25 0.0049054 Model23
0.0000438 ppv binary 0.8648689 25 0.0029972 Model23
0.0000438 roc_auc binary 0.8831642 25 0.0025112 Model23
0.0000790 npv binary 0.7496878 25 0.0049223 Model24
0.0000790 ppv binary 0.8666593 25 0.0027715 Model24
0.0000790 roc_auc binary 0.8857648 25 0.0023844 Model24
0.0001425 npv binary 0.7515045 25 0.0048764 Model25
0.0001425 ppv binary 0.8675778 25 0.0029213 Model25
0.0001425 roc_auc binary 0.8885057 25 0.0022577 Model25
0.0002572 npv binary 0.7585626 25 0.0045795 Model26
0.0002572 ppv binary 0.8692375 25 0.0027288 Model26
0.0002572 roc_auc binary 0.8918078 25 0.0021223 Model26
0.0004642 npv binary 0.7683244 25 0.0042001 Model27
0.0004642 ppv binary 0.8719790 25 0.0025633 Model27
0.0004642 roc_auc binary 0.8964992 25 0.0019420 Model27
0.0008377 npv binary 0.7808043 25 0.0041640 Model28
0.0008377 ppv binary 0.8770188 25 0.0026343 Model28
0.0008377 roc_auc binary 0.9025998 25 0.0017409 Model28
0.0015118 npv binary 0.7925645 25 0.0041186 Model29
0.0015118 ppv binary 0.8803496 25 0.0022640 Model29
0.0015118 roc_auc binary 0.9097560 25 0.0016924 Model29
0.0027283 npv binary 0.8083005 25 0.0039963 Model30
0.0027283 ppv binary 0.8820145 25 0.0022477 Model30
0.0027283 roc_auc binary 0.9178767 25 0.0015415 Model30
0.0049239 npv binary 0.8252221 25 0.0040321 Model31
0.0049239 ppv binary 0.8817243 25 0.0024599 Model31
0.0049239 roc_auc binary 0.9256296 25 0.0013162 Model31
0.0088862 npv binary 0.8358877 25 0.0032728 Model32
0.0088862 ppv binary 0.8756299 25 0.0025199 Model32
0.0088862 roc_auc binary 0.9299922 25 0.0011654 Model32
0.0160372 npv binary 0.8488131 25 0.0034799 Model33
0.0160372 ppv binary 0.8560464 25 0.0025205 Model33
0.0160372 roc_auc binary 0.9279997 25 0.0011384 Model33
0.0289427 npv binary 0.8641609 25 0.0047379 Model34
0.0289427 ppv binary 0.8110164 25 0.0025144 Model34
0.0289427 roc_auc binary 0.9140430 25 0.0012484 Model34
0.0522335 npv binary 0.8851403 25 0.0043334 Model35
0.0522335 ppv binary 0.7353761 25 0.0014981 Model35
0.0522335 roc_auc binary 0.8756876 25 0.0021750 Model35
0.0942668 npv binary 0.9201675 25 0.0107314 Model36
0.0942668 ppv binary 0.6584426 25 0.0020148 Model36
0.0942668 roc_auc binary 0.7758855 25 0.0032418 Model36
0.1701254 npv binary NaN 0 NA Model37
0.1701254 ppv binary 0.6373748 25 0.0013901 Model37
0.1701254 roc_auc binary 0.5000000 25 0.0000000 Model37
0.3070291 npv binary NaN 0 NA Model38
0.3070291 ppv binary 0.6373748 25 0.0013901 Model38
0.3070291 roc_auc binary 0.5000000 25 0.0000000 Model38
0.5541020 npv binary NaN 0 NA Model39
0.5541020 ppv binary 0.6373748 25 0.0013901 Model39
0.5541020 roc_auc binary 0.5000000 25 0.0000000 Model39
1.0000000 npv binary NaN 0 NA Model40
1.0000000 ppv binary 0.6373748 25 0.0013901 Model40
1.0000000 roc_auc binary 0.5000000 25 0.0000000 Model40

But i’m a visual person so…

lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_line(size = 1.5, show.legend = FALSE) +
  facet_wrap(~.metric) +
  scale_x_log10()+
  theme_minimal()
## Warning: Removed 4 row(s) containing missing values (geom_path).

This shows us a lot. We see clearly that AUC and PPV have benefited from the regularization and we could identify the best value of penalty for each of those metrics. The same is not true for NPV.


Final model

Let’s keep our model as is for now, and choose a final model based on AUC. We can use select_best() to find the best AUC and then update our workflow lasso_wf with this value.

best_auc <- lasso_grid %>%
  select_best("roc_auc")

best_auc %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")
penalty .config
0.0088862 Model32
final_lasso <- finalize_workflow(lasso_wf, best_auc) #functions take a list or tibble of tuning parameter values and update objects with those values

final_lasso
## == Workflow ==================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------
## 5 Recipe Steps
## 
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
## 
## -- Model -----------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = 0.00888623816274339
##   mixture = 1
## 
## Computational engine: glmnet

This is the tuned workflow.

To find the most important variables we can use the vip package

library(vip)

final_lasso %>% 
  fit(review_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = best_auc$penalty) %>%
  group_by(Sign) %>%
  top_n(20, wt = abs(Importance)) %>%
  ungroup() %>%
  mutate(
    Importance = abs(Importance),
    Variable = str_remove(Variable, "tfidf_text_"),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = "free_y") +
  labs(y = NULL)+
  theme_minimal()


Model test.

Finally, let’s return to our test data. The tune package has a function last_fit() which is nice for situations when you have tuned and finalized a model or workflow and want to fit it one last time on your training data and evaluate it on your testing data. You only have to pass this function your finalized model/workflow and your split.

review_final <- last_fit(final_lasso, review_split)

review_final %>%
  collect_metrics()%>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")
.metric .estimator .estimate
accuracy binary 0.8918558
roc_auc binary 0.9414445

No overfit during our tuning process, and the overall accuracy is not bad. Let’s create a confusion matrix for the testing data.

review_final %>%
  collect_predictions() %>%
  conf_mat(rating, .pred_class)
##           Truth
## Prediction bad good
##       bad  449   55
##       good  26  219

Although our overall accuracy isn’t so bad, we find that it is easier to detect the negative reviews than the positive ones.