Build a model for Animal crossing user reviews TidyTuesday 2020-05-05 for predict the rating from the text of the user reviews.
user_reviews <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/user_reviews.tsv')
head(user_reviews, 15)
## # A tibble: 15 x 4
## grade user_name text date
## <dbl> <chr> <chr> <date>
## 1 4 mds27272 My gf started playing before me. No option to~ 2020-03-20
## 2 5 lolo2178 While the game itself is great, really relaxi~ 2020-03-20
## 3 0 Roachant My wife and I were looking forward to playing~ 2020-03-20
## 4 0 Houndf We need equal values and opportunities for al~ 2020-03-20
## 5 0 ProfessorFox BEWARE! If you have multiple people in your ~ 2020-03-20
## 6 0 tb726 The limitation of one island per Switch (not ~ 2020-03-20
## 7 0 Outryder86 I was very excited for this new installment o~ 2020-03-20
## 8 0 Subby89 It's 2020 and for some reason Nintendo has de~ 2020-03-20
## 9 0 RocketRon This is so annoying. Only one player has the ~ 2020-03-20
## 10 0 chankills I purchased this game for my household (me an~ 2020-03-20
## 11 1 rafilks182 they heavily marketed this game as a social e~ 2020-03-20
## 12 0 Gravy_Dreamb~ You can only have one island per Switch. If m~ 2020-03-20
## 13 0 Lazariel Only ONE island per console!You can't create ~ 2020-03-20
## 14 0 novaaaa You can only create one island per switch and~ 2020-03-20
## 15 0 sooners11 One island per console is a pretty terrible a~ 2020-03-20
user_reviews %>%
count(grade) %>%
ggplot(aes(grade, n)) +
geom_col(fill = "midnightblue", alpha = 0.7)+
labs(x="", y="",
title="Number of reviews by rating")
We can see that lots of people give extreme scores in their rewiews, this distribution is nor suitable to make a prediction model so its a good idea to convert this scores to a label and build a model of binary clasification good vs. bad user reviews.
user_reviews %>%
filter(grade > 8) %>%
sample_n(5) %>%
pull(text)
## [1] "Un juego fantástico, con muchas opciones, con el cual puedes relajarte y jugarlos con tu familia"
## [2] "I dont think that one switch for one island is bad or good. I played the original games and hated them because it was boring to me. Now I get to go on and drop gifts off for my boyfriend in front of his house, hop over to my roommates island and hang out with either one and and drop gifts for them too. I think it's fun to see what's different when I wake up and discussing what to do orI dont think that one switch for one island is bad or good. I played the original games and hated them because it was boring to me. Now I get to go on and drop gifts off for my boyfriend in front of his house, hop over to my roommates island and hang out with either one and and drop gifts for them too. I think it's fun to see what's different when I wake up and discussing what to do or where to out things. It makes it more of a community game.… Expand"
## [3] "Best game that I have EVER played! So much thought and care went into turning this into a genuinely immersive experience. Personally, the 1 island per switch problem didn’t affect me as I have my own personal switch and my brother who also plays has HIS own. The characters are way more lifelike than in Animal Crossing New Leaf and I feel like my island is truly mine. You can placeBest game that I have EVER played! So much thought and care went into turning this into a genuinely immersive experience. Personally, the 1 island per switch problem didn’t affect me as I have my own personal switch and my brother who also plays has HIS own. The characters are way more lifelike than in Animal Crossing New Leaf and I feel like my island is truly mine. You can place furniture inside OR outside as well as building shops, bridges, ramps and even a giant robot! With the shops you can find carefully designed furniture and clothing! Most of the furniture you buy can even be customized into multiple styles, truly making it feel like yours! You can also craft your own furniture such as chairs, tables and beds! It makes it feel like you have a part in your island life. With the clothing store you can buy hats, shirts, pants and more! Unfortunately this game has been review BOMBED by people who in the end won’t make change! This is the best Switch game and I highly recommend. This game is a 10/10, not 5.7.… Expand"
## [4] "Best game ever if you don't have to share your island with anyone. I feel it was designed for people between 25 and 30 years old.Don't miss this game just because the low ratings due to the one island per console issue, if you live alone you won't regret!"
## [5] "Excellent game, this title is a very good example of real entertainment for gamers, thanks Nintendo."
The reviews have some text problems, Let’s remove at least the final “Expand” from the reviews, and create a new categorical rating variable.
reviews_parsed <- user_reviews %>%
mutate(text = str_remove(text, "Expand$")) %>% #remove the word Expand
mutate(rating = case_when(
grade > 7 ~ "good", #make a new variable rating, treshold in 7
TRUE ~ "bad"
))
Now let’s prepare the reviews to make the model, for this we split the text column into tokens by user name.
library(tidytext)
words_per_review <- reviews_parsed %>%
unnest_tokens(word, text) %>%
count(user_name, name = "total_words")
words_per_review %>%
ggplot(aes(total_words)) +
geom_histogram(fill = "midnightblue", alpha = 0.8)+
labs(y="", x= "Total words",
title= "Histogram of Number of words")
This distribution of words is not natural, the gap in the midle of the distribution looks very strange and maybe is related to the data adquisition procedure. But the data is never perfect…
library(tidymodels)
set.seed(123)
review_split <- initial_split(reviews_parsed, strata = rating)
review_train <- training(review_split)
review_test <- testing(review_split)
library(textrecipes)
review_rec <- recipe(rating ~ text, data = review_train) %>% #make the recipe
step_tokenize(text) %>% #tokenixe text
step_stopwords(text) %>% #remove stopwords
step_tokenfilter(text, max_tokens = 500) %>% #filter tokens by frecuency, keep the top 500 most-used tokens
step_tfidf(text) %>% #create a Inverted document frecuency. common or rare the word is across all the observations
step_normalize(all_predictors()) #Center and scale numeric data
review_prep <- prep(review_rec) #The prep() function is where everything gets evaluated
review_prep
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 1
##
## Training data contained 2250 data points and no missing data.
##
## Operations:
##
## Tokenization for text [trained]
## Stop word removal for text [trained]
## Text filtering for text [trained]
## Term frequency-inverse document frequency with text [trained]
## Centering and scaling for tfidf_text_0, tfidf_text_1, ... [trained]
penalty = tune() since we don’t yet know the best value for the regularization parameter and mixture = 1 for lasso. The lasso has proved to be a good baseline for text modeling.lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet")
lasso_wf <- workflow() %>% #use the workflow to manage modeling pipelines more easily
add_recipe(review_rec) %>%
add_model(lasso_spec)
lasso_wf
## == Workflow ==================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------
## 5 Recipe Steps
##
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
##
## -- Model -----------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
First, we need a set of possible regularization parameters to try.
lambda_grid <- grid_regular(penalty(), levels = 40)
Next, we need a set of resampled data to fit and evaluate all these models.
set.seed(123)
review_folds <- bootstraps(review_train, strata = rating)
review_folds
## # Bootstrap sampling using stratification
## # A tibble: 25 x 2
## splits id
## <list> <chr>
## 1 <split [2.2K/812]> Bootstrap01
## 2 <split [2.2K/850]> Bootstrap02
## 3 <split [2.2K/814]> Bootstrap03
## 4 <split [2.2K/814]> Bootstrap04
## 5 <split [2.2K/853]> Bootstrap05
## 6 <split [2.2K/840]> Bootstrap06
## 7 <split [2.2K/816]> Bootstrap07
## 8 <split [2.2K/826]> Bootstrap08
## 9 <split [2.2K/804]> Bootstrap09
## 10 <split [2.2K/809]> Bootstrap10
## # ... with 15 more rows
Implement the tunning
set.seed(2020)
lasso_grid <- tune_grid(lasso_wf,
resamples = review_folds,
grid = lambda_grid,
metrics = metric_set(roc_auc, ppv, npv) #set the metrics to compute
)
Once we have our tuning results, we can examine them in detail.
lasso_grid %>%
collect_metrics() %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position = "left")%>%
scroll_box(height = "200px")
| penalty | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model01 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model01 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model01 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model02 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model02 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model02 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model03 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model03 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model03 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model04 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model04 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model04 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model05 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model05 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model05 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model06 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model06 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model06 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model07 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model07 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model07 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model08 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model08 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model08 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model09 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model09 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model09 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model10 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model10 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model10 |
| 0.0000000 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model11 |
| 0.0000000 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model11 |
| 0.0000000 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model11 |
| 0.0000001 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model12 |
| 0.0000001 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model12 |
| 0.0000001 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model12 |
| 0.0000001 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model13 |
| 0.0000001 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model13 |
| 0.0000001 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model13 |
| 0.0000002 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model14 |
| 0.0000002 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model14 |
| 0.0000002 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model14 |
| 0.0000004 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model15 |
| 0.0000004 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model15 |
| 0.0000004 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model15 |
| 0.0000007 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model16 |
| 0.0000007 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model16 |
| 0.0000007 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model16 |
| 0.0000013 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model17 |
| 0.0000013 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model17 |
| 0.0000013 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model17 |
| 0.0000023 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model18 |
| 0.0000023 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model18 |
| 0.0000023 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model18 |
| 0.0000041 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model19 |
| 0.0000041 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model19 |
| 0.0000041 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model19 |
| 0.0000074 | npv | binary | 0.7404002 | 25 | 0.0051755 | Model20 |
| 0.0000074 | ppv | binary | 0.8635544 | 25 | 0.0030250 | Model20 |
| 0.0000074 | roc_auc | binary | 0.8784799 | 25 | 0.0027573 | Model20 |
| 0.0000134 | npv | binary | 0.7403643 | 25 | 0.0051784 | Model21 |
| 0.0000134 | ppv | binary | 0.8634938 | 25 | 0.0030462 | Model21 |
| 0.0000134 | roc_auc | binary | 0.8785062 | 25 | 0.0027564 | Model21 |
| 0.0000242 | npv | binary | 0.7421082 | 25 | 0.0051740 | Model22 |
| 0.0000242 | ppv | binary | 0.8644417 | 25 | 0.0029595 | Model22 |
| 0.0000242 | roc_auc | binary | 0.8807108 | 25 | 0.0025899 | Model22 |
| 0.0000438 | npv | binary | 0.7441448 | 25 | 0.0049054 | Model23 |
| 0.0000438 | ppv | binary | 0.8648689 | 25 | 0.0029972 | Model23 |
| 0.0000438 | roc_auc | binary | 0.8831642 | 25 | 0.0025112 | Model23 |
| 0.0000790 | npv | binary | 0.7496878 | 25 | 0.0049223 | Model24 |
| 0.0000790 | ppv | binary | 0.8666593 | 25 | 0.0027715 | Model24 |
| 0.0000790 | roc_auc | binary | 0.8857648 | 25 | 0.0023844 | Model24 |
| 0.0001425 | npv | binary | 0.7515045 | 25 | 0.0048764 | Model25 |
| 0.0001425 | ppv | binary | 0.8675778 | 25 | 0.0029213 | Model25 |
| 0.0001425 | roc_auc | binary | 0.8885057 | 25 | 0.0022577 | Model25 |
| 0.0002572 | npv | binary | 0.7585626 | 25 | 0.0045795 | Model26 |
| 0.0002572 | ppv | binary | 0.8692375 | 25 | 0.0027288 | Model26 |
| 0.0002572 | roc_auc | binary | 0.8918078 | 25 | 0.0021223 | Model26 |
| 0.0004642 | npv | binary | 0.7683244 | 25 | 0.0042001 | Model27 |
| 0.0004642 | ppv | binary | 0.8719790 | 25 | 0.0025633 | Model27 |
| 0.0004642 | roc_auc | binary | 0.8964992 | 25 | 0.0019420 | Model27 |
| 0.0008377 | npv | binary | 0.7808043 | 25 | 0.0041640 | Model28 |
| 0.0008377 | ppv | binary | 0.8770188 | 25 | 0.0026343 | Model28 |
| 0.0008377 | roc_auc | binary | 0.9025998 | 25 | 0.0017409 | Model28 |
| 0.0015118 | npv | binary | 0.7925645 | 25 | 0.0041186 | Model29 |
| 0.0015118 | ppv | binary | 0.8803496 | 25 | 0.0022640 | Model29 |
| 0.0015118 | roc_auc | binary | 0.9097560 | 25 | 0.0016924 | Model29 |
| 0.0027283 | npv | binary | 0.8083005 | 25 | 0.0039963 | Model30 |
| 0.0027283 | ppv | binary | 0.8820145 | 25 | 0.0022477 | Model30 |
| 0.0027283 | roc_auc | binary | 0.9178767 | 25 | 0.0015415 | Model30 |
| 0.0049239 | npv | binary | 0.8252221 | 25 | 0.0040321 | Model31 |
| 0.0049239 | ppv | binary | 0.8817243 | 25 | 0.0024599 | Model31 |
| 0.0049239 | roc_auc | binary | 0.9256296 | 25 | 0.0013162 | Model31 |
| 0.0088862 | npv | binary | 0.8358877 | 25 | 0.0032728 | Model32 |
| 0.0088862 | ppv | binary | 0.8756299 | 25 | 0.0025199 | Model32 |
| 0.0088862 | roc_auc | binary | 0.9299922 | 25 | 0.0011654 | Model32 |
| 0.0160372 | npv | binary | 0.8488131 | 25 | 0.0034799 | Model33 |
| 0.0160372 | ppv | binary | 0.8560464 | 25 | 0.0025205 | Model33 |
| 0.0160372 | roc_auc | binary | 0.9279997 | 25 | 0.0011384 | Model33 |
| 0.0289427 | npv | binary | 0.8641609 | 25 | 0.0047379 | Model34 |
| 0.0289427 | ppv | binary | 0.8110164 | 25 | 0.0025144 | Model34 |
| 0.0289427 | roc_auc | binary | 0.9140430 | 25 | 0.0012484 | Model34 |
| 0.0522335 | npv | binary | 0.8851403 | 25 | 0.0043334 | Model35 |
| 0.0522335 | ppv | binary | 0.7353761 | 25 | 0.0014981 | Model35 |
| 0.0522335 | roc_auc | binary | 0.8756876 | 25 | 0.0021750 | Model35 |
| 0.0942668 | npv | binary | 0.9201675 | 25 | 0.0107314 | Model36 |
| 0.0942668 | ppv | binary | 0.6584426 | 25 | 0.0020148 | Model36 |
| 0.0942668 | roc_auc | binary | 0.7758855 | 25 | 0.0032418 | Model36 |
| 0.1701254 | npv | binary | NaN | 0 | NA | Model37 |
| 0.1701254 | ppv | binary | 0.6373748 | 25 | 0.0013901 | Model37 |
| 0.1701254 | roc_auc | binary | 0.5000000 | 25 | 0.0000000 | Model37 |
| 0.3070291 | npv | binary | NaN | 0 | NA | Model38 |
| 0.3070291 | ppv | binary | 0.6373748 | 25 | 0.0013901 | Model38 |
| 0.3070291 | roc_auc | binary | 0.5000000 | 25 | 0.0000000 | Model38 |
| 0.5541020 | npv | binary | NaN | 0 | NA | Model39 |
| 0.5541020 | ppv | binary | 0.6373748 | 25 | 0.0013901 | Model39 |
| 0.5541020 | roc_auc | binary | 0.5000000 | 25 | 0.0000000 | Model39 |
| 1.0000000 | npv | binary | NaN | 0 | NA | Model40 |
| 1.0000000 | ppv | binary | 0.6373748 | 25 | 0.0013901 | Model40 |
| 1.0000000 | roc_auc | binary | 0.5000000 | 25 | 0.0000000 | Model40 |
But i’m a visual person so…
lasso_grid %>%
collect_metrics() %>%
ggplot(aes(penalty, mean, color = .metric)) +
geom_line(size = 1.5, show.legend = FALSE) +
facet_wrap(~.metric) +
scale_x_log10()+
theme_minimal()
## Warning: Removed 4 row(s) containing missing values (geom_path).
This shows us a lot. We see clearly that AUC and PPV have benefited from the regularization and we could identify the best value of penalty for each of those metrics. The same is not true for NPV.
Let’s keep our model as is for now, and choose a final model based on AUC. We can use select_best() to find the best AUC and then update our workflow lasso_wf with this value.
best_auc <- lasso_grid %>%
select_best("roc_auc")
best_auc %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position = "left")
| penalty | .config |
|---|---|
| 0.0088862 | Model32 |
final_lasso <- finalize_workflow(lasso_wf, best_auc) #functions take a list or tibble of tuning parameter values and update objects with those values
final_lasso
## == Workflow ==================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------
## 5 Recipe Steps
##
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
##
## -- Model -----------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.00888623816274339
## mixture = 1
##
## Computational engine: glmnet
This is the tuned workflow.
To find the most important variables we can use the vip package
library(vip)
final_lasso %>%
fit(review_train) %>%
pull_workflow_fit() %>%
vi(lambda = best_auc$penalty) %>%
group_by(Sign) %>%
top_n(20, wt = abs(Importance)) %>%
ungroup() %>%
mutate(
Importance = abs(Importance),
Variable = str_remove(Variable, "tfidf_text_"),
Variable = fct_reorder(Variable, Importance)
) %>%
ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Sign, scales = "free_y") +
labs(y = NULL)+
theme_minimal()
Finally, let’s return to our test data. The tune package has a function last_fit() which is nice for situations when you have tuned and finalized a model or workflow and want to fit it one last time on your training data and evaluate it on your testing data. You only have to pass this function your finalized model/workflow and your split.
review_final <- last_fit(final_lasso, review_split)
review_final %>%
collect_metrics()%>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
position = "left")
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.8918558 |
| roc_auc | binary | 0.9414445 |
No overfit during our tuning process, and the overall accuracy is not bad. Let’s create a confusion matrix for the testing data.
review_final %>%
collect_predictions() %>%
conf_mat(rating, .pred_class)
## Truth
## Prediction bad good
## bad 449 55
## good 26 219
Although our overall accuracy isn’t so bad, we find that it is easier to detect the negative reviews than the positive ones.