All credit for these materials goes to Professor Ben Noble; I just compiled it into a nice HTML format.
You will need to install the dataverse package (and the other packages mentioned):
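If they are not installed yet, a one-time install along these lines should work (a sketch; the original only shows the library() calls):

install.packages(c("glmnet", "tidyverse", "dataverse", "quanteda"))

Then load the packages: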
library(glmnet)
library(tidyverse)
library(dataverse)
library(quanteda)

Data comes from Müller and Proksch (2023):
nostalgia <- get_dataframe_by_name(
filename = "data_coded_all.tab",
dataset = "https://doi.org/10.7910/DVN/W8VGJF",
server = "dataverse.harvard.edu")

# set the seed for reproducibility
set.seed(20240612)
# randomly sample 30% of the data for testing and validating
test_ids <- sample(nostalgia$doc_id, nrow(nostalgia) * 0.3)
# hold out 10% of the total df for our final validation
held_out_ids <- sample(test_ids, length(test_ids) * (1/3))
# get the other 20% as test data
test_set <- nostalgia %>% filter(doc_id %in% test_ids & !doc_id %in% held_out_ids)
# finally, get our training data
train_set <- nostalgia %>% filter(!doc_id %in% test_ids)

Then we use some standard pre-processing to construct our dfm:
nostalgia_train_dfm <- corpus(train_set) %>%
tokens(remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_remove(c(stopwords("english"))) %>%
tokens_select(min_nchar = 3) %>%
dfm() %>%
dfm_wordstem() %>%
dfm_trim(min_termfreq = 10, termfreq_type = 'count',
min_docfreq = 10, docfreq_type = 'count')

Now, we train our lasso model:
We use a logit model here to predict the binary class (family = "binomial"), and alpha = 1 specifies the lasso penalty. cv.glmnet() runs cross-validation to select the best lambda:

cv_model <- cv.glmnet(nostalgia_train_dfm, train_set$nostalgic, alpha = 1, family = "binomial")

We can view this cross-validation process; notice how the first vertical line is at about -4.5. This value of lambda minimizes prediction error:
plot(cv_model)

We can view inside the model object here:
log(cv_model$lambda.min)

[1] -4.468698
We can also visualize the shrinkage process
plot(cv_model$glmnet.fit,
xvar = "lambda",
label = TRUE)

Let’s look at the most important coefficients as identified by the model:
best_coefs <- coef(cv_model,
s = "lambda.min")Positive coefficients are stronger predictors of nostalgia - We see words like history, heritage, and tradition show up (makes sense!) - We also see words like new, women, and european are more predictive of a document being not nostalgia (makes sense!)
head(sort(best_coefs[,1]))

(Intercept)       rural         new       first        fund     protect 
 -3.3559306  -1.5207569  -0.8580014  -0.7058376  -0.4666257  -0.3511841 
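We can sort from the other end to see the largest positive coefficients, which is where the stemmed forms of words like history, heritage, and tradition should appear (a quick sketch; its output is not shown here):

tail(sort(best_coefs[,1]))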
Let’s turn to testing how well our model performs on our test data
To do so, we need to create a dfm that only includes words used to train the model (our model wouldn’t know what to do with a word outside of its vocab)
We will pre-process the test data using the same steps as before:
test_dfm <- corpus(test_set) %>%
tokens(remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_remove(c(stopwords("english"))) %>%
tokens_select(min_nchar = 3) %>%
dfm() %>%
dfm_wordstem() %>%
dfm_trim(min_termfreq = 10, termfreq_type = 'count',
min_docfreq = 10, docfreq_type = 'count')

Match terms between training and test:
matched_test_terms <- dfm_match(test_dfm,
features = featnames(nostalgia_train_dfm))

We can now predict the probability that any given document in our test set is nostalgic; the output is a probability:
predictions <- predict(cv_model,
newx = matched_test_terms,
s = "lambda.min",
type = "response")This is the text that is predicted to be most nostalgic:
nostalgia %>% filter(doc_id ==
rownames(predictions)[
which.max(predictions)
]) %>%
pull(text)

[1] "Left also wants the return of Norwegian Cultural Heritage Fund used as grants / loans to cultural heritage projects."
and here is the least nostalgic:
nostalgia %>% filter(doc_id ==
rownames(predictions)[
which.min(predictions)
]) %>% pull(text)

[1] "In the face of this grave employment situation, and that of young people in particular, it is necessary, to prevent that this becomes an excuse for new environmental havoc instead necessary that environmental policies are the occasion of new jobs and no longer a constraint."
Seems pretty good!
We can also use predict to “round” off the probabilities and give us a 0/1
predictions_class <- predict(cv_model,
newx = matched_test_terms,
s = "lambda.min",
type = "class")We can create a confusion matrix to see how we did
conf_mat <- table(true = test_set$nostalgic,
pred = as.numeric(predictions_class))

(conf_mat[1,1] + conf_mat[2,2])/sum(conf_mat)

[1] 0.9291667
Accuracy is about 93%, which is quite good.
conf_mat[2,2]/sum(conf_mat[,2])

[1] 0.8666667
But note that nostalgia is rare, so we could get good accuracy just by always predicting 0. Precision tells us the proportion of predicted 1s that were actually correct: 87%, because the model very rarely said 1 when the true label was 0.
conf_mat[2,2]/sum(conf_mat[2,])

[1] 0.4642857
This is recall: it tells us the proportion of actual positives that were identified correctly … only 46%, which is quite bad!
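If we want a single number that balances precision and recall, we could also compute an F1 score (not part of the original output; a sketch using the confusion matrix above):

precision <- conf_mat[2,2] / sum(conf_mat[,2])  # correct 1s among predicted 1s
recall <- conf_mat[2,2] / sum(conf_mat[2,])     # correct 1s among actual 1s
2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall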
To improve performance, we could try different pre-processing, use different models, or code more documents.
(Time permitting) Try to replicate the training process on the combined training and test data and predict on the held-out data. Here are the datasets you’ll need to get you started:
held_out <- nostalgia %>% filter(doc_id %in% held_out_ids)
train_and_test <- nostalgia %>% filter(!doc_id %in% held_out_ids)
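One possible skeleton for that exercise, assuming the same pre-processing and modeling choices used above (a sketch, not the original solution):

# build a dfm from the combined training and test data
full_dfm <- corpus(train_and_test) %>%
  tokens(remove_numbers = TRUE, remove_punct = TRUE,
         remove_symbols = TRUE, remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_select(min_nchar = 3) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_trim(min_termfreq = 10, termfreq_type = 'count',
           min_docfreq = 10, docfreq_type = 'count')

# refit the lasso with cross-validation
cv_model_full <- cv.glmnet(full_dfm, train_and_test$nostalgic,
                           alpha = 1, family = "binomial")

# pre-process the held-out data the same way, match features, and predict
held_out_dfm <- corpus(held_out) %>%
  tokens(remove_numbers = TRUE, remove_punct = TRUE,
         remove_symbols = TRUE, remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_select(min_nchar = 3) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_match(features = featnames(full_dfm))

held_out_class <- predict(cv_model_full, newx = held_out_dfm,
                          s = "lambda.min", type = "class")

# confusion matrix on the held-out data
table(true = held_out$nostalgic, pred = as.numeric(held_out_class))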