Text Analysis at IQMR 2

Author

Michelle Bueno Vásquez

Published

June 21, 2024

Note

All credit for these materials goes to Professor Ben Noble; I just compiled them into a nice HTML format.

Lab 4: Predicting Nostalgia

You will need to install the dataverse package (and the other packages mentioned):

Code
library(glmnet)
library(tidyverse)
library(dataverse)
library(quanteda)

Data comes from Müller and Proksch (2023).

Code
nostalgia <- get_dataframe_by_name(
  filename = "data_coded_all.tab",
  dataset = "https://doi.org/10.7910/DVN/W8VGJF", 
  server = "dataverse.harvard.edu")
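
Before splitting the data, it is worth a quick look to confirm the structure and the columns used throughout this lab (doc_id, text, and the binary nostalgic label); a quick sketch:

Code
# quick look at the downloaded data frame
glimpse(nostalgia)
# how common is nostalgia? (it is rare, which matters later)
table(nostalgia$nostalgic)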

Here, we will create three different sets: training, test, and validation.

Code
# sets
set.seed(20240612)
# randomly sample 30% of the data for testing and validating
test_ids <- sample(nostalgia$doc_id, nrow(nostalgia) * 0.3)
# hold out 10% of the total df for our final validation
held_out_ids <- sample(test_ids, length(test_ids) * (1/3))
# get the other 20% as test data
test_set <- nostalgia %>% filter(doc_id %in% test_ids & !doc_id %in% held_out_ids)
# finally, get our training data
train_set <- nostalgia %>% filter(!doc_id %in% test_ids)
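
As a sanity check, the three sets should be roughly 70%, 20%, and 10% of the data (exact counts depend on the seed):

Code
# confirm the approximate 70/20/10 split
nrow(train_set) / nrow(nostalgia)
nrow(test_set) / nrow(nostalgia)
length(held_out_ids) / nrow(nostalgia)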

Then we use some standard pre-processing steps to construct our document-feature matrix (dfm).

Code
nostalgia_train_dfm <- corpus(train_set) %>% 
    tokens(remove_numbers = TRUE, 
        remove_punct = TRUE, 
        remove_symbols = TRUE,
        remove_separators = TRUE) %>% 
    tokens_tolower() %>%
    tokens_remove(c(stopwords("english"))) %>%
    tokens_select(min_nchar = 3) %>% 
    dfm() %>% 
    dfm_wordstem() %>% 
    dfm_trim(min_termfreq = 10, termfreq_type = 'count',
        min_docfreq = 10, docfreq_type = 'count')
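
It can be useful to inspect the resulting dfm before modeling, for example its dimensions and most frequent stems:

Code
# documents x features after trimming
dim(nostalgia_train_dfm)
# the 20 most frequent stems
topfeatures(nostalgia_train_dfm, 20)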

Now, we train our lasso model:

Because the outcome is a binary class (nostalgic or not), we use a logit (binomial) model.

Note

Setting alpha = 1 specifies the lasso penalty.

cv.glmnet() performs cross-validation to select the best \(\lambda\).

Code
cv_model <- cv.glmnet(nostalgia_train_dfm, train_set$nostalgic, alpha = 1, family = "binomial")  

We can view this cross-validation process. Notice that the first vertical dotted line falls at a log(\(\lambda\)) of about -4.5; this value of \(\lambda\) minimizes cross-validated prediction error.

Code
plot(cv_model)

We can confirm the exact value inside the model object:

Code
log(cv_model$lambda.min)
[1] -4.468698
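
The model object also stores lambda.1se, the largest \(\lambda\) within one standard error of the minimum cross-validated error; choosing it gives a sparser, more conservative model if you prefer:

Code
# a more regularized alternative to lambda.min
log(cv_model$lambda.1se)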

We can also visualize the shrinkage process, that is, how coefficients are pushed toward zero as \(\lambda\) increases:

Code
plot(cv_model$glmnet.fit,
     xvar = "lambda",
     label = TRUE) 

Let’s look at the most important coefficients as identified by the model

Code
best_coefs <- coef(cv_model,
                   s = "lambda.min")

Positive coefficients are stronger predictors of nostalgia:

- We see words like history, heritage, and tradition show up (makes sense!)
- We also see words like new, women, and european are more predictive of a document being not nostalgic (makes sense!)

Code
head(sort(best_coefs[,1]))
(Intercept)       rural         new       first        fund     protect 
 -3.3559306  -1.5207569  -0.8580014  -0.7058376  -0.4666257  -0.3511841 
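
Because sort() is ascending, head() shows the most negative coefficients, i.e., the strongest predictors of not nostalgic. To see the strongest positive predictors (the nostalgia words discussed above), look at the other end:

Code
# strongest positive predictors of nostalgia
tail(sort(best_coefs[,1]))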

Testing model on test data

Let’s turn to testing how well our model performs on our test data

To do so, we need to create a dfm that only includes words used to train the model (our model wouldn’t know what to do with a word outside its vocabulary).

We pre-process the test data using the same steps as before:

Code
test_dfm <- corpus(test_set)  %>% 
    tokens(remove_numbers = TRUE, 
        remove_punct = TRUE, 
        remove_symbols = TRUE,
        remove_separators = TRUE) %>% 
    tokens_tolower() %>%
    tokens_remove(c(stopwords("english"))) %>%
    tokens_select(min_nchar = 3) %>% 
    dfm() %>% 
    dfm_wordstem() %>% 
    dfm_trim(min_termfreq = 10, termfreq_type = 'count',
        min_docfreq = 10, docfreq_type = 'count')

Match terms between training and test

Code
matched_test_terms <- dfm_match(test_dfm, 
                                features = featnames(nostalgia_train_dfm)) 
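
As a quick check (not part of the original lab), the matched dfm should now have exactly the training vocabulary, in the same order:

Code
# TRUE if the test features now line up with the training features
identical(featnames(matched_test_terms), featnames(nostalgia_train_dfm))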

Predict a probability

We can now predict the probability that any given document in our test set is nostalgic; with type = "response", the output is a probability.

Code
predictions <- predict(cv_model,
                       newx = matched_test_terms,
                       s = "lambda.min",
                       type = "response")
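
Before looking at individual documents, a glance at the distribution of predicted probabilities is informative:

Code
# distribution of predicted probabilities of nostalgia
summary(as.numeric(predictions))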

Let’s look at some representative texts.

This is the text that is predicted to be most nostalgic:

Code
nostalgia %>% filter(doc_id == 
                       rownames(predictions)[
                         which.max(predictions)
                         ]) %>% 
  pull(text)
[1] "Left also wants the return of Norwegian Cultural Heritage Fund used as grants / loans to cultural heritage projects."

and here is the least nostalgic:

Code
nostalgia %>% filter(doc_id == 
                       rownames(predictions)[
                         which.min(predictions)
                         ]) %>% pull(text)
[1] "In the face of this grave employment situation, and that of young people in particular, it is necessary, to prevent that this becomes an excuse for new environmental havoc instead necessary that environmental policies are the occasion of new jobs and no longer a constraint."

Seems pretty good!

“Round” off the probabilities

We can also use predict() to “round” off the probabilities and return a 0/1 class label:

Code
predictions_class <- predict(cv_model, 
                             newx = matched_test_terms, 
                             s = "lambda.min", 
                             type = "class")
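
For a binomial model, type = "class" amounts to applying a 0.5 cutoff to the predicted probabilities; we could do the same by hand:

Code
# equivalent by hand, with an explicit (and adjustable) cutoff
predictions_class_manual <- ifelse(as.numeric(predictions) > 0.5, 1, 0)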

Confusion Matrix

We can create a confusion matrix to see how we did

Code
conf_mat <- table(true = test_set$nostalgic, 
                  pred = as.numeric(predictions_class))

Measuring accuracy:

Code
(conf_mat[1,1] + conf_mat[2,2])/sum(conf_mat)
[1] 0.9291667

Accuracy is about 93%, which is quite good.

Precision, the proportion of predicted 1s that were actually correct:

Code
conf_mat[2,2]/sum(conf_mat[,2])
[1] 0.8666667
Tip

But note that nostalgia is rare, so we could get good accuracy just by defaulting to 0. Precision tells us the proportion of predicted 1s that were actually correct: here 87%, because the model very rarely said 1 when the truth was 0.

Recall, the proportion of actual positives:
Code
conf_mat[2,2]/sum(conf_mat[2,])
[1] 0.4642857
Tip

This tells us the proportion of actual positives that were identified correctly… only 46%, very bad!

So the model under-predicts nostalgia. What should we do?

We could try different pre-processing, use different models, or hand-code more documents; two quick fixes are sketched below.
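
Both options below are illustrative sketches rather than part of the original lab: lower the classification cutoff so the model predicts 1 more often (the 0.3 value is arbitrary), or refit with observation weights that upweight the rare positive class:

Code
# option 1: lower the cutoff to trade some precision for recall
predictions_class_03 <- as.numeric(as.numeric(predictions) > 0.3)

# option 2: refit with observation weights that upweight the rare class
w <- ifelse(train_set$nostalgic == 1,
    sum(train_set$nostalgic == 0) / sum(train_set$nostalgic == 1), 1)
cv_model_weighted <- cv.glmnet(nostalgia_train_dfm, train_set$nostalgic,
    alpha = 1, family = "binomial", weights = w)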

(Time permitting) Try to replicate the training process on the combined training and test data and predict on the held-out validation data. Here are the datasets you’ll need to get you started:

Code
held_out <- nostalgia %>% filter(doc_id %in% held_out_ids)
train_and_test <- nostalgia %>% filter(!doc_id %in% held_out_ids)
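
To get started on the dfm side, the same pipeline as above can be applied to the combined data (a sketch; the remaining steps are refitting cv.glmnet(), matching a held-out dfm with dfm_match(), and predicting):

Code
# the same pre-processing pipeline, applied to the combined data
train_and_test_dfm <- corpus(train_and_test) %>%
    tokens(remove_numbers = TRUE,
        remove_punct = TRUE,
        remove_symbols = TRUE,
        remove_separators = TRUE) %>%
    tokens_tolower() %>%
    tokens_remove(c(stopwords("english"))) %>%
    tokens_select(min_nchar = 3) %>%
    dfm() %>%
    dfm_wordstem() %>%
    dfm_trim(min_termfreq = 10, termfreq_type = 'count',
        min_docfreq = 10, docfreq_type = 'count')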