Sentiment Analysis

Build a model for Animal crossing user reviews TidyTuesday 2020-05-05 for predict the rating from the text of the user reviews.

Get the data and initial exploration.

user_reviews <- readr::read_tsv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/user_reviews.tsv')

head(user_reviews, 15)

## # A tibble: 15 x 4
##    grade user_name     text                                           date      
##    <dbl> <chr>         <chr>                                          <date>    
##  1     4 mds27272      My gf started playing before me. No option to~ 2020-03-20
##  2     5 lolo2178      While the game itself is great, really relaxi~ 2020-03-20
##  3     0 Roachant      My wife and I were looking forward to playing~ 2020-03-20
##  4     0 Houndf        We need equal values and opportunities for al~ 2020-03-20
##  5     0 ProfessorFox  BEWARE!  If you have multiple people in your ~ 2020-03-20
##  6     0 tb726         The limitation of one island per Switch (not ~ 2020-03-20
##  7     0 Outryder86    I was very excited for this new installment o~ 2020-03-20
##  8     0 Subby89       It's 2020 and for some reason Nintendo has de~ 2020-03-20
##  9     0 RocketRon     This is so annoying. Only one player has the ~ 2020-03-20
## 10     0 chankills     I purchased this game for my household (me an~ 2020-03-20
## 11     1 rafilks182    they heavily marketed this game as a social e~ 2020-03-20
## 12     0 Gravy_Dreamb~ You can only have one island per Switch. If m~ 2020-03-20
## 13     0 Lazariel      Only ONE island per console!You can't create ~ 2020-03-20
## 14     0 novaaaa       You can only create one island per switch and~ 2020-03-20
## 15     0 sooners11     One island per console is a pretty terrible a~ 2020-03-20

user_reviews %>%
  count(grade) %>%
  ggplot(aes(grade, n)) +
  geom_col(fill = "midnightblue", alpha = 0.7)+
  labs(x="", y="",
       title="Number of reviews by rating")

We can see that lots of people give extreme scores in their rewiews, this distribution is nor suitable to make a prediction model so its a good idea to convert this scores to a label and build a model of binary clasification good vs. bad user reviews.

Explore the structure of the “good” reviews.

user_reviews %>%
  filter(grade > 8) %>%
  sample_n(5) %>%
  pull(text)

## [1] "Un juego fantástico, con muchas opciones, con el cual puedes relajarte y jugarlos con tu familia"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [2] "I dont think that one switch for one island is bad or good. I played the original games and hated them because it was boring to me. Now I get to go on and drop gifts off for my boyfriend in front of his house, hop over to my roommates island and hang out with either one and and drop gifts for them too. I think it's fun to see what's different when I wake up and discussing what to do orI dont think that one switch for one island is bad or good. I played the original games and hated them because it was boring to me. Now I get to go on and drop gifts off for my boyfriend in front of his house, hop over to my roommates island and hang out with either one and and drop gifts for them too. I think it's fun to see what's different when I wake up and discussing what to do or where to out things. It makes it more of a community game.… Expand"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [3] "Best game that I have EVER played! So much thought and care went into turning this into a genuinely immersive experience. Personally, the 1 island per switch problem didn’t affect me as I have my own personal switch and my brother who also plays has HIS own. The characters are  way more lifelike than in Animal Crossing New Leaf and I feel like my island is truly mine. You can placeBest game that I have EVER played! So much thought and care went into turning this into a genuinely immersive experience. Personally, the 1 island per switch problem didn’t affect me as I have my own personal switch and my brother who also plays has HIS own. The characters are  way more lifelike than in Animal Crossing New Leaf and I feel like my island is truly mine. You can place furniture inside OR outside as well as building shops, bridges, ramps and even a giant robot! With the shops you can find carefully designed furniture and clothing! Most of the furniture you buy can even be customized into multiple styles, truly making it feel like yours! You can also craft your own furniture such as chairs, tables and beds! It makes it feel like you have a part in your island life. With the clothing  store you can buy hats, shirts, pants and more! Unfortunately this game has been review BOMBED by people who in the end won’t make change! This is the best Switch game and I highly recommend. This game is a 10/10, not 5.7.… Expand"
## [4] "Best game ever if you don't have to share your island with anyone. I feel it was designed for people between 25 and 30 years old.Don't miss this game just because the low ratings due to the one island per console issue, if you live alone you won't regret!"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [5] "Excellent game, this title is a very good example of real entertainment for gamers, thanks Nintendo."

The reviews have some text problems, Let’s remove at least the final “Expand” from the reviews, and create a new categorical rating variable.

reviews_parsed <- user_reviews %>%
  mutate(text = str_remove(text, "Expand$")) %>% #remove the word Expand
  mutate(rating = case_when(
    grade > 7 ~ "good", #make a new variable rating, treshold in 7
    TRUE ~ "bad"
  ))

Now let’s prepare the reviews to make the model, for this we split the text column into tokens by user name.

library(tidytext)

words_per_review <- reviews_parsed %>%
  unnest_tokens(word, text) %>%
  count(user_name, name = "total_words")

words_per_review %>%
  ggplot(aes(total_words)) +
  geom_histogram(fill = "midnightblue", alpha = 0.8)+
  labs(y="", x= "Total words",
       title= "Histogram of Number of words")

This distribution of words is not natural, the gap in the midle of the distribution looks very strange and maybe is related to the data adquisition procedure. But the data is never perfect…

Let’s build a model

Split the data into training and test datasets

library(tidymodels)

set.seed(123)
review_split <- initial_split(reviews_parsed, strata = rating)
review_train <- training(review_split)
review_test <- testing(review_split)

Preprocess the data

library(textrecipes)

review_rec <- recipe(rating ~ text, data = review_train) %>% #make the recipe
  step_tokenize(text) %>% #tokenixe text 
  step_stopwords(text) %>% #remove stopwords
  step_tokenfilter(text, max_tokens = 500) %>% #filter tokens by frecuency, keep the top 500 most-used tokens
  step_tfidf(text) %>% #create a Inverted document frecuency. common or rare the word is across all the observations
  step_normalize(all_predictors()) #Center and scale numeric data

review_prep <- prep(review_rec) #The prep() function is where everything gets evaluated

review_prep

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Training data contained 2250 data points and no missing data.
## 
## Operations:
## 
## Tokenization for text [trained]
## Stop word removal for text [trained]
## Text filtering for text [trained]
## Term frequency-inverse document frequency with text [trained]
## Centering and scaling for tfidf_text_0, tfidf_text_1, ... [trained]

Specify the model. Here we can set up the model specification for lasso regression with penalty = tune() since we don’t yet know the best value for the regularization parameter and mixture = 1 for lasso. The lasso has proved to be a good baseline for text modeling.

lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lasso_wf <- workflow() %>% #use the workflow to manage modeling pipelines more easily
  add_recipe(review_rec) %>%
  add_model(lasso_spec)

lasso_wf

## == Workflow ==================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------
## 5 Recipe Steps
## 
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
## 
## -- Model -----------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet

Tune model parameters.

First, we need a set of possible regularization parameters to try.

lambda_grid <- grid_regular(penalty(), levels = 40)

Next, we need a set of resampled data to fit and evaluate all these models.

set.seed(123)
review_folds <- bootstraps(review_train, strata = rating)
review_folds

## # Bootstrap sampling using stratification 
## # A tibble: 25 x 2
##    splits             id         
##    <list>             <chr>      
##  1 <split [2.2K/812]> Bootstrap01
##  2 <split [2.2K/850]> Bootstrap02
##  3 <split [2.2K/814]> Bootstrap03
##  4 <split [2.2K/814]> Bootstrap04
##  5 <split [2.2K/853]> Bootstrap05
##  6 <split [2.2K/840]> Bootstrap06
##  7 <split [2.2K/816]> Bootstrap07
##  8 <split [2.2K/826]> Bootstrap08
##  9 <split [2.2K/804]> Bootstrap09
## 10 <split [2.2K/809]> Bootstrap10
## # ... with 15 more rows

Implement the tunning

set.seed(2020)
lasso_grid <- tune_grid(lasso_wf,
                        resamples = review_folds,
                        grid = lambda_grid,
                        metrics = metric_set(roc_auc, ppv, npv) #set the metrics to compute
)

Once we have our tuning results, we can examine them in detail.

lasso_grid %>%
  collect_metrics() %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")%>%
  scroll_box(height = "200px")

penalty	.metric	.estimator	mean	n	std_err	.config
0.0000000	npv	binary	0.7404002	25	0.0051755	Model01
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model01
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model01
0.0000000	npv	binary	0.7404002	25	0.0051755	Model02
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model02
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model02
0.0000000	npv	binary	0.7404002	25	0.0051755	Model03
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model03
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model03
0.0000000	npv	binary	0.7404002	25	0.0051755	Model04
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model04
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model04
0.0000000	npv	binary	0.7404002	25	0.0051755	Model05
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model05
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model05
0.0000000	npv	binary	0.7404002	25	0.0051755	Model06
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model06
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model06
0.0000000	npv	binary	0.7404002	25	0.0051755	Model07
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model07
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model07
0.0000000	npv	binary	0.7404002	25	0.0051755	Model08
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model08
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model08
0.0000000	npv	binary	0.7404002	25	0.0051755	Model09
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model09
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model09
0.0000000	npv	binary	0.7404002	25	0.0051755	Model10
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model10
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model10
0.0000000	npv	binary	0.7404002	25	0.0051755	Model11
0.0000000	ppv	binary	0.8635544	25	0.0030250	Model11
0.0000000	roc_auc	binary	0.8784799	25	0.0027573	Model11
0.0000001	npv	binary	0.7404002	25	0.0051755	Model12
0.0000001	ppv	binary	0.8635544	25	0.0030250	Model12
0.0000001	roc_auc	binary	0.8784799	25	0.0027573	Model12
0.0000001	npv	binary	0.7404002	25	0.0051755	Model13
0.0000001	ppv	binary	0.8635544	25	0.0030250	Model13
0.0000001	roc_auc	binary	0.8784799	25	0.0027573	Model13
0.0000002	npv	binary	0.7404002	25	0.0051755	Model14
0.0000002	ppv	binary	0.8635544	25	0.0030250	Model14
0.0000002	roc_auc	binary	0.8784799	25	0.0027573	Model14
0.0000004	npv	binary	0.7404002	25	0.0051755	Model15
0.0000004	ppv	binary	0.8635544	25	0.0030250	Model15
0.0000004	roc_auc	binary	0.8784799	25	0.0027573	Model15
0.0000007	npv	binary	0.7404002	25	0.0051755	Model16
0.0000007	ppv	binary	0.8635544	25	0.0030250	Model16
0.0000007	roc_auc	binary	0.8784799	25	0.0027573	Model16
0.0000013	npv	binary	0.7404002	25	0.0051755	Model17
0.0000013	ppv	binary	0.8635544	25	0.0030250	Model17
0.0000013	roc_auc	binary	0.8784799	25	0.0027573	Model17
0.0000023	npv	binary	0.7404002	25	0.0051755	Model18
0.0000023	ppv	binary	0.8635544	25	0.0030250	Model18
0.0000023	roc_auc	binary	0.8784799	25	0.0027573	Model18
0.0000041	npv	binary	0.7404002	25	0.0051755	Model19
0.0000041	ppv	binary	0.8635544	25	0.0030250	Model19
0.0000041	roc_auc	binary	0.8784799	25	0.0027573	Model19
0.0000074	npv	binary	0.7404002	25	0.0051755	Model20
0.0000074	ppv	binary	0.8635544	25	0.0030250	Model20
0.0000074	roc_auc	binary	0.8784799	25	0.0027573	Model20
0.0000134	npv	binary	0.7403643	25	0.0051784	Model21
0.0000134	ppv	binary	0.8634938	25	0.0030462	Model21
0.0000134	roc_auc	binary	0.8785062	25	0.0027564	Model21
0.0000242	npv	binary	0.7421082	25	0.0051740	Model22
0.0000242	ppv	binary	0.8644417	25	0.0029595	Model22
0.0000242	roc_auc	binary	0.8807108	25	0.0025899	Model22
0.0000438	npv	binary	0.7441448	25	0.0049054	Model23
0.0000438	ppv	binary	0.8648689	25	0.0029972	Model23
0.0000438	roc_auc	binary	0.8831642	25	0.0025112	Model23
0.0000790	npv	binary	0.7496878	25	0.0049223	Model24
0.0000790	ppv	binary	0.8666593	25	0.0027715	Model24
0.0000790	roc_auc	binary	0.8857648	25	0.0023844	Model24
0.0001425	npv	binary	0.7515045	25	0.0048764	Model25
0.0001425	ppv	binary	0.8675778	25	0.0029213	Model25
0.0001425	roc_auc	binary	0.8885057	25	0.0022577	Model25
0.0002572	npv	binary	0.7585626	25	0.0045795	Model26
0.0002572	ppv	binary	0.8692375	25	0.0027288	Model26
0.0002572	roc_auc	binary	0.8918078	25	0.0021223	Model26
0.0004642	npv	binary	0.7683244	25	0.0042001	Model27
0.0004642	ppv	binary	0.8719790	25	0.0025633	Model27
0.0004642	roc_auc	binary	0.8964992	25	0.0019420	Model27
0.0008377	npv	binary	0.7808043	25	0.0041640	Model28
0.0008377	ppv	binary	0.8770188	25	0.0026343	Model28
0.0008377	roc_auc	binary	0.9025998	25	0.0017409	Model28
0.0015118	npv	binary	0.7925645	25	0.0041186	Model29
0.0015118	ppv	binary	0.8803496	25	0.0022640	Model29
0.0015118	roc_auc	binary	0.9097560	25	0.0016924	Model29
0.0027283	npv	binary	0.8083005	25	0.0039963	Model30
0.0027283	ppv	binary	0.8820145	25	0.0022477	Model30
0.0027283	roc_auc	binary	0.9178767	25	0.0015415	Model30
0.0049239	npv	binary	0.8252221	25	0.0040321	Model31
0.0049239	ppv	binary	0.8817243	25	0.0024599	Model31
0.0049239	roc_auc	binary	0.9256296	25	0.0013162	Model31
0.0088862	npv	binary	0.8358877	25	0.0032728	Model32
0.0088862	ppv	binary	0.8756299	25	0.0025199	Model32
0.0088862	roc_auc	binary	0.9299922	25	0.0011654	Model32
0.0160372	npv	binary	0.8488131	25	0.0034799	Model33
0.0160372	ppv	binary	0.8560464	25	0.0025205	Model33
0.0160372	roc_auc	binary	0.9279997	25	0.0011384	Model33
0.0289427	npv	binary	0.8641609	25	0.0047379	Model34
0.0289427	ppv	binary	0.8110164	25	0.0025144	Model34
0.0289427	roc_auc	binary	0.9140430	25	0.0012484	Model34
0.0522335	npv	binary	0.8851403	25	0.0043334	Model35
0.0522335	ppv	binary	0.7353761	25	0.0014981	Model35
0.0522335	roc_auc	binary	0.8756876	25	0.0021750	Model35
0.0942668	npv	binary	0.9201675	25	0.0107314	Model36
0.0942668	ppv	binary	0.6584426	25	0.0020148	Model36
0.0942668	roc_auc	binary	0.7758855	25	0.0032418	Model36
0.1701254	npv	binary	NaN	0	NA	Model37
0.1701254	ppv	binary	0.6373748	25	0.0013901	Model37
0.1701254	roc_auc	binary	0.5000000	25	0.0000000	Model37
0.3070291	npv	binary	NaN	0	NA	Model38
0.3070291	ppv	binary	0.6373748	25	0.0013901	Model38
0.3070291	roc_auc	binary	0.5000000	25	0.0000000	Model38
0.5541020	npv	binary	NaN	0	NA	Model39
0.5541020	ppv	binary	0.6373748	25	0.0013901	Model39
0.5541020	roc_auc	binary	0.5000000	25	0.0000000	Model39
1.0000000	npv	binary	NaN	0	NA	Model40
1.0000000	ppv	binary	0.6373748	25	0.0013901	Model40
1.0000000	roc_auc	binary	0.5000000	25	0.0000000	Model40

But i’m a visual person so…

lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_line(size = 1.5, show.legend = FALSE) +
  facet_wrap(~.metric) +
  scale_x_log10()+
  theme_minimal()

## Warning: Removed 4 row(s) containing missing values (geom_path).

This shows us a lot. We see clearly that AUC and PPV have benefited from the regularization and we could identify the best value of penalty for each of those metrics. The same is not true for NPV.

Final model

Let’s keep our model as is for now, and choose a final model based on AUC. We can use select_best() to find the best AUC and then update our workflow lasso_wf with this value.

best_auc <- lasso_grid %>%
  select_best("roc_auc")

best_auc %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")

penalty	.config
0.0088862	Model32

final_lasso <- finalize_workflow(lasso_wf, best_auc) #functions take a list or tibble of tuning parameter values and update objects with those values

final_lasso

## == Workflow ==================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------
## 5 Recipe Steps
## 
## * step_tokenize()
## * step_stopwords()
## * step_tokenfilter()
## * step_tfidf()
## * step_normalize()
## 
## -- Model -----------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Main Arguments:
##   penalty = 0.00888623816274339
##   mixture = 1
## 
## Computational engine: glmnet

This is the tuned workflow.

To find the most important variables we can use the vip package

library(vip)

final_lasso %>% 
  fit(review_train) %>%
  pull_workflow_fit() %>%
  vi(lambda = best_auc$penalty) %>%
  group_by(Sign) %>%
  top_n(20, wt = abs(Importance)) %>%
  ungroup() %>%
  mutate(
    Importance = abs(Importance),
    Variable = str_remove(Variable, "tfidf_text_"),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Sign, scales = "free_y") +
  labs(y = NULL)+
  theme_minimal()

Model test.

Finally, let’s return to our test data. The tune package has a function last_fit() which is nice for situations when you have tuned and finalized a model or workflow and want to fit it one last time on your training data and evaluate it on your testing data. You only have to pass this function your finalized model/workflow and your split.

review_final <- last_fit(final_lasso, review_split)

review_final %>%
  collect_metrics()%>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                position = "left")

.metric	.estimator	.estimate
accuracy	binary	0.8918558
roc_auc	binary	0.9414445

No overfit during our tuning process, and the overall accuracy is not bad. Let’s create a confusion matrix for the testing data.

review_final %>%
  collect_predictions() %>%
  conf_mat(rating, .pred_class)

##           Truth
## Prediction bad good
##       bad  449   55
##       good  26  219

Although our overall accuracy isn’t so bad, we find that it is easier to detect the negative reviews than the positive ones.

Sentiment Analysis

Julian Avila-Jimenez

16/8/2020

Get the data and initial exploration.

Let’s build a model

Final model

Model test.