All credit for these materials goes to Professor Ben Noble; I just compiled them into a nice HTML format.
You will need to install the dataverse package (and the other packages mentioned):
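If you haven't installed them yet, a one-time install along these lines should work (package names taken from the library() calls below):

install.packages(c("glmnet", "tidyverse", "dataverse", "quanteda"))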
library(glmnet)
library(tidyverse)
library(dataverse)
library(quanteda)
Data comes from Muller and Proksch (2023)
nostalgia <- get_dataframe_by_name(
  filename = "data_coded_all.tab",
  dataset = "https://doi.org/10.7910/DVN/W8VGJF",
  server = "dataverse.harvard.edu")
# set the seed for reproducibility
set.seed(20240612)
# randomly sample 30% of the data for testing and validating
test_ids <- sample(nostalgia$doc_id, nrow(nostalgia) * 0.3)
# hold out 10% of the total df for our final validation
held_out_ids <- sample(test_ids, length(test_ids) * (1/3))
# get the other 20% as test data
test_set <- nostalgia %>% filter(doc_id %in% test_ids & !doc_id %in% held_out_ids)
# finally, get our training data
train_set <- nostalgia %>% filter(!doc_id %in% test_ids)
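As a quick sanity check (not part of the original code), the three pieces should be roughly 70%, 20%, and 10% of the full data:

# sanity check: roughly 70% train, 20% test, 10% held out
nrow(train_set)
nrow(test_set)
length(held_out_ids)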
Then we use some standard pre-processing to construct our dfm
nostalgia_train_dfm <- corpus(train_set) %>%
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(c(stopwords("english"))) %>%
  tokens_select(min_nchar = 3) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_trim(min_termfreq = 10, termfreq_type = 'count',
           min_docfreq = 10, docfreq_type = 'count')
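It can be useful to check how many documents and features survive trimming (a small addition, not in the original code):

# documents x features remaining after trimming
dim(nostalgia_train_dfm)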
Now, we train our lasso model:
- We use a logit model here since we are predicting a binary class (family = "binomial").
- alpha = 1 specifies the lasso penalty.
- cv.glmnet() performs cross-validation to select the best \(\lambda\).
cv_model <- cv.glmnet(nostalgia_train_dfm, train_set$nostalgic,
                      alpha = 1, family = "binomial")
We can view this cross-validation process. Notice how the first vertical line is at about \(\log(\lambda) = -4.5\); this value of \(\lambda\) minimizes prediction error.
plot(cv_model)
We can look inside the model object to see the selected \(\lambda\):
log(cv_model$lambda.min)
[1] -4.468698
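cv.glmnet also stores lambda.1se, the largest \(\lambda\) whose cross-validated error is within one standard error of the minimum (the second vertical line in the plot); a quick look, not in the original code:

log(cv_model$lambda.1se)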
We can also visualize the shrinkage process
plot(cv_model$glmnet.fit,
xvar = "lambda",
label = TRUE)
Let’s look at the most important coefficients as identified by the model
best_coefs <- coef(cv_model,
                   s = "lambda.min")
Positive coefficients are stronger predictors of nostalgia:
- We see words like history, heritage, and tradition show up (makes sense!)
- We also see words like new, women, and european are more predictive of a document not being nostalgic (makes sense!)
head(sort(best_coefs[,1]))
(Intercept) rural new first fund protect
-3.3559306 -1.5207569 -0.8580014 -0.7058376 -0.4666257 -0.3511841
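The output above shows the most negative coefficients; to see the words most predictive of nostalgia, we can look at the other end of the sorted vector (a small addition, not in the original code):

tail(sort(best_coefs[,1]))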
Let’s turn to testing how well our model performs on our test data.
To do so, we need to create a dfm that only includes words used to train the model (our model wouldn’t know what to do with a word outside of its vocabulary).
We will pre-process the test data using the same pre-processing steps:
test_dfm <- corpus(test_set) %>%
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_symbols = TRUE,
         remove_separators = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(c(stopwords("english"))) %>%
  tokens_select(min_nchar = 3) %>%
  dfm() %>%
  dfm_wordstem() %>%
  dfm_trim(min_termfreq = 10, termfreq_type = 'count',
           min_docfreq = 10, docfreq_type = 'count')
Match terms between training and test
matched_test_terms <- dfm_match(test_dfm,
                                features = featnames(nostalgia_train_dfm))
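As a quick check (not in the original code), the matched test dfm should now have exactly the training dfm's feature set, in the same order:

# should be TRUE: test features now line up with the training vocabulary
identical(featnames(matched_test_terms), featnames(nostalgia_train_dfm))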
We can now predict the probability that any given document in our test set is nostalgic; with type = "response", the output is a probability:
predictions <- predict(cv_model,
                       newx = matched_test_terms,
                       s = "lambda.min",
                       type = "response")
This is the text that is predicted to be most nostalgic:
nostalgia %>%
  filter(doc_id == rownames(predictions)[which.max(predictions)]) %>%
  pull(text)
[1] "Left also wants the return of Norwegian Cultural Heritage Fund used as grants / loans to cultural heritage projects."
And here is the least nostalgic:
nostalgia %>%
  filter(doc_id == rownames(predictions)[which.min(predictions)]) %>%
  pull(text)
[1] "In the face of this grave employment situation, and that of young people in particular, it is necessary, to prevent that this becomes an excuse for new environmental havoc instead necessary that environmental policies are the occasion of new jobs and no longer a constraint."
Seems pretty good!
We can also use predict() with type = "class" to “round” off the probabilities and give us a 0/1 classification:
predictions_class <- predict(cv_model,
                             newx = matched_test_terms,
                             s = "lambda.min",
                             type = "class")
We can create a confusion matrix to see how we did
conf_mat <- table(true = test_set$nostalgic,
                  pred = as.numeric(predictions_class))
(conf_mat[1,1] + conf_mat[2,2]) / sum(conf_mat)
[1] 0.9291667
Accuracy is about 93%, which is quite good
conf_mat[2,2] / sum(conf_mat[,2])
[1] 0.8666667
But note that nostalgia is rare, so we could get good accuracy just by defaulting to 0. Precision tells us the proportion of predicted 1s that were actually correct: 87%, because the model very rarely said 1 when the true label was 0.
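To make that "defaulting to 0" baseline concrete, here is the accuracy of always predicting 0 (a small addition, assuming nostalgic is coded 0/1):

# accuracy of a "classifier" that always predicts 0
mean(test_set$nostalgic == 0)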
conf_mat[2,2] / sum(conf_mat[2,])
[1] 0.4642857
This is recall: the proportion of actual positives that were identified correctly … only 46%, very bad!
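If we want a single number that balances precision and recall, we can compute an F1 score from the confusion matrix above (a small addition, not part of the original walkthrough):

# precision and recall as computed above
precision <- conf_mat[2,2] / sum(conf_mat[,2])
recall <- conf_mat[2,2] / sum(conf_mat[2,])
# harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
f1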
We could try different pre-processing, use different model(s), or code more documents.
(Time permitting) Try to replicate the training process on the combined training and test data and predict on the held-out data. Here are the datasets you’ll need to get you started:
held_out <- nostalgia %>% filter(doc_id %in% held_out_ids)
train_and_test <- nostalgia %>% filter(!doc_id %in% held_out_ids)