All credit for these materials goes to Professor Ben Noble; I just compiled it into a nice HTML format.
You will need to install the dataverse package (and the other packages mentioned):
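If they are not installed yet, a one-time install along these lines should work (a sketch; the original only shows the library() calls):

install.packages(c("glmnet", "tidyverse", "dataverse", "quanteda"))

Then load the packages: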
library(glmnet)
library(tidyverse)
library(dataverse)
library(quanteda)

Data comes from Müller and Proksch (2023):
nostalgia <- get_dataframe_by_name(
filename = "data_coded_all.tab",
dataset = "https://doi.org/10.7910/DVN/W8VGJF",
server = "dataverse.harvard.edu")

# set the seed for reproducibility
set.seed(20240612)
# randomly sample 30% of the data for testing and validating
test_ids <- sample(nostalgia$doc_id, nrow(nostalgia) * 0.3)
# hold out 10% of the total df for our final validation
held_out_ids <- sample(test_ids, length(test_ids) * (1/3))
# get the other 20% as test data
test_set <- nostalgia %>% filter(doc_id %in% test_ids & !doc_id %in% held_out_ids)
# finally, get our training data
train_set <- nostalgia %>% filter(!doc_id %in% test_ids)

Then we use some standard pre-processing to construct our dfm:
nostalgia_train_dfm <- corpus(train_set) %>%
tokens(remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_remove(c(stopwords("english"))) %>%
tokens_select(min_nchar = 3) %>%
dfm() %>%
dfm_wordstem() %>%
dfm_trim(min_termfreq = 10, termfreq_type = 'count',
min_docfreq = 10, docfreq_type = 'count')

Now, we train our lasso model:
We use a logit model here to predict the binary class (family = "binomial"), and alpha = 1 specifies the lasso penalty. cv.glmnet() runs cross-validation to select the best lambda:

cv_model <- cv.glmnet(nostalgia_train_dfm, train_set$nostalgic, alpha = 1, family = "binomial")

We can view this cross-validation process; notice how the first vertical line is at about -4.5. This value of lambda minimizes prediction error:
plot(cv_model)

We can view inside the model object here:
log(cv_model$lambda.min)

[1] -4.468698
We can also visualize the shrinkage process
plot(cv_model$glmnet.fit,
xvar = "lambda",
label = TRUE)

Let’s look at the most important coefficients as identified by the model:
best_coefs <- coef(cv_model,
s = "lambda.min")Positive coefficients are stronger predictors of nostalgia - We see words like history, heritage, and tradition show up (makes sense!) - We also see words like new, women, and european are more predictive of a document being not nostalgia (makes sense!)
head(sort(best_coefs[,1]))

(Intercept)       rural         new       first        fund     protect 
 -3.3559306  -1.5207569  -0.8580014  -0.7058376  -0.4666257  -0.3511841 
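We can sort from the other end to see the largest positive coefficients, which is where the stemmed forms of words like history, heritage, and tradition should appear (a quick sketch; its output is not shown here):

tail(sort(best_coefs[,1]))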
Let’s turn to testing how well our model performs on our test data
To do so, we need to create a dfm that only includes words used to train the model (our model wouldn’t know what to do with a word outside of its vocab)
We will pre-process the test data using the same steps as before:
test_dfm <- corpus(test_set) %>%
tokens(remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_remove(c(stopwords("english"))) %>%
tokens_select(min_nchar = 3) %>%
dfm() %>%
dfm_wordstem() %>%
dfm_trim(min_termfreq = 10, termfreq_type = 'count',
min_docfreq = 10, docfreq_type = 'count')

Match terms between training and test:
matched_test_terms <- dfm_match(test_dfm,
features = featnames(nostalgia_train_dfm))

We can now predict the probability that any given document in our test set is nostalgic; the output is a probability:
predictions <- predict(cv_model,
newx = matched_test_terms,
s = "lambda.min",
type = "response")This is the text that is predicted to be most nostalgic:
nostalgia %>% filter(doc_id ==
rownames(predictions)[
which.max(predictions)
]) %>%
pull(text)

[1] "Left also wants the return of Norwegian Cultural Heritage Fund used as grants / loans to cultural heritage projects."
and here is the least nostalgic:
nostalgia %>% filter(doc_id ==
rownames(predictions)[
which.min(predictions)
]) %>% pull(text)

[1] "In the face of this grave employment situation, and that of young people in particular, it is necessary, to prevent that this becomes an excuse for new environmental havoc instead necessary that environmental policies are the occasion of new jobs and no longer a constraint."
Seems pretty good!
We can also use predict to “round” off the probabilities and give us a 0/1
predictions_class <- predict(cv_model,
newx = matched_test_terms,
s = "lambda.min",
type = "class")We can create a confusion matrix to see how we did
conf_mat <- table(true = test_set$nostalgic,
pred = as.numeric(predictions_class))

(conf_mat[1,1] + conf_mat[2,2])/sum(conf_mat)

[1] 0.9291667
Accuracy is about 93%, which is quite good.
conf_mat[2,2]/sum(conf_mat[,2])

[1] 0.8666667
But note that nostalgia is rare, so we could get good accuracy just by always predicting 0. Precision tells us the proportion of predicted 1s that were actually correct: 87%, because the model very rarely said 1 when the true label was 0.
conf_mat[2,2]/sum(conf_mat[2,])

[1] 0.4642857
This is recall: it tells us the proportion of actual positives that were identified correctly … only 46%, which is quite bad!
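If we want a single number that balances precision and recall, we could also compute an F1 score (not part of the original output; a sketch using the confusion matrix above):

precision <- conf_mat[2,2] / sum(conf_mat[,2])  # correct 1s among predicted 1s
recall <- conf_mat[2,2] / sum(conf_mat[2,])     # correct 1s among actual 1s
2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall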
To improve performance, we could try different pre-processing, use different models, or code more documents.
(Time permitting) Try to replicate the training process on the combined training and test data and predict on the held-out data. Here are the datasets you’ll need to get you started:
held_out <- nostalgia %>% filter(doc_id %in% held_out_ids)
train_and_test <- nostalgia %>% filter(!doc_id %in% held_out_ids)
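One possible skeleton for that exercise, assuming the same pre-processing and modeling choices used above (a sketch, not the original solution):

# build a dfm from the combined training and test data
full_dfm <- corpus(train_and_test) %>%
  tokens(remove_numbers = TRUE, remove_punct = TRUE,
         remove_symbols = TRUE, remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_select(min_nchar = 3) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_trim(min_termfreq = 10, termfreq_type = 'count',
           min_docfreq = 10, docfreq_type = 'count')

# refit the lasso with cross-validation
cv_model_full <- cv.glmnet(full_dfm, train_and_test$nostalgic,
                           alpha = 1, family = "binomial")

# pre-process the held-out data the same way, match features, and predict
held_out_dfm <- corpus(held_out) %>%
  tokens(remove_numbers = TRUE, remove_punct = TRUE,
         remove_symbols = TRUE, remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_select(min_nchar = 3) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_match(features = featnames(full_dfm))

held_out_class <- predict(cv_model_full, newx = held_out_dfm,
                          s = "lambda.min", type = "class")

# confusion matrix on the held-out data
table(true = held_out$nostalgic, pred = as.numeric(held_out_class))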