“Clickbait” is online content whose main purpose is to attract attention and encourage visitors to click on a link to a particular web page. The dataset for this assignment consists of clickbait titles (drawn from known clickbait websites such as Buzzfeed) and non-clickbait article titles drawn from reputable sources. The goal of this assignment will be to train predictive models to differentiate between clickbait and non-clickbait headlines.
It also provides hands-on practice with text featurization, advanced predictive modeling techniques such as Ridge, Lasso, and ensemble methods, and constructing ROC curves.
Please answer the questions below clearly and concisely, providing tables or plots where applicable. Turn in a well-formatted compiled HTML document using R Markdown, containing clear answers to the questions and R code in the appropriate places.
RUBRIC: This assignment is graded on accuracy and is out of 50 points. There is no single right answer to each question; we grade your submission based on your understanding of the concepts discussed in class. At the very least, your submission must satisfy the following:
Note that this assignment is somewhat open-ended and there are many ways to answer these questions. Your answers do not have to match mine exactly in order to receive full credit.
The following code block does some initial setup: reading in the data, keeping a numeric copy of the clickbait target alongside a factor version, and splitting the data into training and validation sets.
cb_data <- read_csv("clickbait_headlines.csv") %>%
mutate(cb_numeric = clickbait,
clickbait = as.factor(clickbait))
## Rows: 10000 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): article_title
## dbl (2): clickbait, article_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(1)
train_rows <- sample(nrow(cb_data),.7*nrow(cb_data))
cb_train <- cb_data[train_rows,]
cb_valid <- cb_data[-train_rows,]
What is the base rate (percent of clickbait articles) in the training data?
ANSWER: 48.74% of the headlines in the training data are clickbait.
counts <- table(cb_train$clickbait)
counts[[2]]/sum(counts)
## [1] 0.4874286
ANSWER TO QUESTION 1a HERE:
# Custom tokenizer function
cleaning_tokenizer <- function(v) {
v %>%
removeNumbers() %>%
removePunctuation() %>%
removeWords(stopwords("en")) %>%
stemDocument() %>%
word_tokenizer()
}
# Create iterators for training and validation
it_train <- itoken(cb_train$article_title, preprocessor = tolower, tokenizer = cleaning_tokenizer, ids = cb_train$article_id, progressbar = FALSE)
it_valid <- itoken(cb_valid$article_title, preprocessor = tolower, tokenizer = cleaning_tokenizer, ids = cb_valid$article_id, progressbar = FALSE)
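As a quick sanity check (a sketch, not required by the assignment), the cleaning tokenizer can be applied to a single headline to see the tokens it produces:
# Sketch: tokens produced for the first training headline
cleaning_tokenizer(tolower(cb_train$article_title[1]))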
ANSWER TO QUESTION 1b HERE:
# Generate vocabulary including bigrams
vocab <- create_vocabulary(it_train, ngram = c(1L, 2L))
pruned_vocab <- prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.5)
# Vectorizer based on the pruned vocabulary
vectorizer <- vocab_vectorizer(pruned_vocab)
ANSWER TO QUESTION 1c HERE:
# Load necessary libraries
library(tidyverse)
library(text2vec)
library(tm)
library(SnowballC)
library(glmnet)
library(vip)
library(naivebayes)
library(ranger)
library(xgboost)
library(ROCR)
library(Matrix)
library(data.table)
## Warning: package 'data.table' was built under R version 4.3.2
##
## Attaching package: 'data.table'
## The following object is masked from 'package:naivebayes':
##
## tables
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.3.3
# Define your custom tokenizer function
cleaning_tokenizer <- function(v) {
v %>%
removeNumbers() %>%
removePunctuation() %>%
removeWords(stopwords("en")) %>%
stemDocument() %>%
word_tokenizer()
}
# Create iterators for training and validation
it_train <- itoken(cb_train$article_title, preprocessor = tolower, tokenizer = cleaning_tokenizer, ids = cb_train$article_id, progressbar = FALSE)
# Create vocabulary and vectorizer
vocab <- create_vocabulary(it_train, ngram = c(1L, 2L))
vocab_pruned <- prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.5)
vectorizer <- vocab_vectorizer(vocab_pruned)
# Create document-term matrices
dtm_train <- create_dtm(it_train, vectorizer)
# Load Bing sentiment dictionary
bing_sentiments <- get_sentiments("bing")
# Convert dtm_train to a data.table for easier manipulation and convert to long format
dtm_data <- as.data.table(as.matrix(dtm_train), keep.rownames = "document")
dtm_long <- melt(dtm_data, id.vars = "document", variable.name = "word", value.name = "count")
# Merge the long format DTM with Bing sentiments and calculate the sentiment score for each document
dtm_sentiment <- merge(dtm_long[dtm_long$count > 0, ], bing_sentiments, by = "word", all.x = TRUE)
dtm_sentiment[, sentiment_score := count * ifelse(sentiment == "positive", 1, -1)]
document_sentiment <- dtm_sentiment[, .(sentiment_score = sum(sentiment_score)), by = document]
# Convert document IDs back to numeric so they can be merged with article_id
document_sentiment$document <- as.numeric(document_sentiment$document)
# Merge with training data to get clickbait status
document_sentiment <- merge(document_sentiment, cb_train[, c("article_id", "clickbait")], by.x = "document", by.y = "article_id")
# Calculate average sentiment score by clickbait status
avg_sentiment_scores <- document_sentiment[, .(average_sentiment = mean(sentiment_score, na.rm = TRUE)), by = clickbait]
# Print average sentiment scores
print(avg_sentiment_scores)
## clickbait average_sentiment
## 1: 1 0.1944444
## 2: 0 -0.3095238
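An alternative way to get per-headline Bing sentiment scores (a sketch, not part of the graded answer) is to work directly on the raw titles with tidytext. The inner join simply drops words that are not in the Bing lexicon, and no stemming is applied, so the averages may differ slightly from the table above; the object name sentiment_by_class is illustrative.
# Sketch: per-headline Bing sentiment via tidytext, averaged by clickbait status
sentiment_by_class <- cb_train %>%
  select(article_id, clickbait, article_title) %>%
  unnest_tokens(word, article_title) %>%
  inner_join(bing_sentiments, by = "word") %>%
  mutate(score = ifelse(sentiment == "positive", 1, -1)) %>%
  group_by(article_id, clickbait) %>%
  summarise(sentiment_score = sum(score), .groups = "drop") %>%
  group_by(clickbait) %>%
  summarise(average_sentiment = mean(sentiment_score))
sentiment_by_class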
Your goal going forward is to predict the “clickbait” target variable using a few different types of models. You’ll compare them to pick the best-performing model in terms of validation AUC.
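Since every model below is judged by validation AUC, a small helper built on the already-loaded ROCR package can be reused for each one (a sketch; the function name calc_auc is my own, not part of the assignment):
# Sketch: compute AUC from predicted probabilities and true labels with ROCR
calc_auc <- function(preds, actual) {
  pred_obj <- ROCR::prediction(as.numeric(preds), as.numeric(as.character(actual)))
  ROCR::performance(pred_obj, measure = "auc")@y.values[[1]]
}
# e.g. calc_auc(preds_ridge, cb_valid$clickbait) once preds_ridge has been created below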
Make predictions with your best lambda ridge model and store them as a variable called preds_ridge.
ANSWER TO QUESTION 2a HERE:
# Define the range of lambda values
lambda_range <- 10^seq(-7, 7, length.out = 100)
# Train the Ridge model
cv_ridge <- cv.glmnet(dtm_train, cb_train$clickbait,
alpha = 0, lambda = lambda_range,
family = "binomial", nfolds = 5)
# Plot the fitting curve
plot(cv_ridge)
# Get the optimal lambda
optimal_lambda_ridge <- cv_ridge$lambda.min
cat("Optimal lambda for Ridge: ", optimal_lambda_ridge)
## Optimal lambda for Ridge: 0.01707353
# Create an iterator for the validation set
it_valid <- itoken(cb_valid$article_title, preprocessor = tolower, tokenizer = cleaning_tokenizer, ids = cb_valid$article_id, progressbar = FALSE)
# Create the Document-Term Matrix for the validation set
dtm_valid <- create_dtm(it_valid, vectorizer)
# Check the class of dtm_train and dtm_valid
class(dtm_train)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
class(dtm_valid)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
# Make predictions with the best lambda ridge model
preds_ridge <- predict(cv_ridge, dtm_valid, s = optimal_lambda_ridge, type = "response")
ANSWER TO QUESTION 2b HERE:
# Train the Lasso model
cv_lasso <- cv.glmnet(dtm_train, cb_train$clickbait,
alpha = 1, lambda = lambda_range,
family = "binomial", nfolds = 5)
# Plot the fitting curve
plot(cv_lasso)
# Get the optimal lambda
optimal_lambda_lasso <- cv_lasso$lambda.min
cat("Optimal lambda for Lasso: ", optimal_lambda_lasso)
## Optimal lambda for Lasso: 0.001747528
# Make predictions with the best lambda lasso model
preds_lasso <- predict(cv_lasso, dtm_valid, s = optimal_lambda_lasso, type = "response")
ANSWER TO QUESTION 2c HERE:
# Compare optimal lambda values
print(paste("Optimal lambda for Ridge: ", optimal_lambda_ridge))
## [1] "Optimal lambda for Ridge: 0.0170735264747069"
print(paste("Optimal lambda for Lasso: ", optimal_lambda_lasso))
## [1] "Optimal lambda for Lasso: 0.00174752840000768"
same_lambda <- optimal_lambda_ridge == optimal_lambda_lasso
print(paste("Do the Ridge and Lasso models have the same best lambda? ", same_lambda))
## [1] "Do the Ridge and Lasso models have the same best lambda? FALSE"
# Compare coefficients
coef_ridge <- coef(cv_ridge, s = optimal_lambda_ridge)
coef_lasso <- coef(cv_lasso, s = optimal_lambda_lasso)
same_coef <- all(coef_ridge == coef_lasso)
print(paste("Do the Ridge and Lasso models have identical coefficients? ", same_coef))
## [1] "Do the Ridge and Lasso models have identical coefficients? FALSE"
# Calculate and compare validation AUC
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
preds_ridge_prob <- as.vector(preds_ridge) # Convert matrix to vector
preds_lasso_prob <- as.vector(preds_lasso)
auc_ridge <- roc(cb_valid$clickbait, preds_ridge_prob)$auc
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_lasso <- roc(cb_valid$clickbait, preds_lasso_prob)$auc
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(paste("Validation AUC for Ridge: ", auc_ridge))
## [1] "Validation AUC for Ridge: 0.9724167421225"
print(paste("Validation AUC for Lasso: ", auc_lasso))
## [1] "Validation AUC for Lasso: 0.971659367167195"
better_model <- ifelse(auc_ridge > auc_lasso, "Ridge", "Lasso")
print(paste("Model with better validation AUC: ", better_model))
## [1] "Model with better validation AUC: Ridge"
Store predictions from your random forest model as preds_rf.
ANSWER TO QUESTION 3a HERE:
library(ggplot2)
library(ranger)
library(ROCR)
library(vip)
# Convert the document-term matrices (dtm_train and dtm_valid from the vectorization above) to data frames for ranger
dtm_train_df <- as.data.frame(as.matrix(dtm_train))
colnames(dtm_train_df) <- make.names(colnames(dtm_train_df)) # Ensure valid column names
dtm_train_df$clickbait <- cb_train$clickbait # Add the target variable
dtm_valid_df <- as.data.frame(as.matrix(dtm_valid))
colnames(dtm_valid_df) <- make.names(colnames(dtm_valid_df)) # Ensure valid column names
dtm_valid_df$clickbait <- cb_valid$clickbait # Add the target variable
# Fit the Random Forest model
rf_model <- ranger(clickbait ~ ., data = dtm_train_df, num.trees = 500, mtry = 15, importance = 'impurity')
# Make predictions on the validation set
preds_rf <- predict(rf_model, dtm_valid_df)$predictions
# Convert predictions to numeric for AUC calculation
preds_rf_numeric <- as.numeric(levels(preds_rf))[preds_rf]
# Calculate AUC
auc_result <- roc(response = as.numeric(as.character(cb_valid$clickbait)), predictor = preds_rf_numeric)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(paste("Validation AUC for the random forest model: ", auc_result$auc))
## [1] "Validation AUC for the random forest model: 0.895093117538545"
# Check and plot variable importance
# Extract variable importance (ranger returns a named numeric vector)
importance_matrix <- importance(rf_model)
str(importance_matrix) # Check the structure of importance_matrix
## Named num [1:1079] 0.1496 0.2631 0.4195 0.0561 0.1445 ...
## - attr(*, "names")= chr [1:1079] "act" "activist" "adam" "admit" ...
# Convert the named importance vector to a data frame
importance_data <- as.data.frame(importance_matrix) # Convert to a dataframe
str(importance_data) # Check the structure of importance_data
## 'data.frame': 1079 obs. of 1 variable:
## $ importance_matrix: num 0.1496 0.2631 0.4195 0.0561 0.1445 ...
# Add the variable names as a column and rename the importance column
importance_data$variable <- rownames(importance_data) # Add a new variable column
colnames(importance_data)[1] <- "importance" # Rename the first column
# Select the top N most important variables (sorted by importance) for a cleaner plot
top_n <- 20 # Choose a number that works well for your dataset
importance_data_top <- importance_data[order(-importance_data$importance), ][1:top_n, ]
# Plot the variable importance for the top N variables
ggplot(importance_data_top, aes(x = reorder(variable, importance), y = importance)) +
geom_col() +
coord_flip() + # Flip coordinates for a horizontal plot
labs(x = "Variables", y = "Importance", title = "Variable Importance in Random Forest Model") +
theme_minimal() +
theme(
axis.text.y = element_text(size = 8), # Adjust size of the y-axis labels
axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis labels for better fit
plot.margin = unit(c(1, 1, 1, 1), "cm"), # Adjust plot margins
panel.grid.major.y = element_blank(), # Remove grid lines
panel.grid.minor.y = element_blank(),
panel.background = element_blank(), # Remove panel background
axis.ticks.y = element_blank() # Remove y-axis ticks
) +
theme(legend.position = "none") # Hide the legend, if not needed
# Save the plot to a file with increased size
ggsave("variable_importance_plot.png", width = 10, height = 8)
Store predictions from your boosting model as preds_bst.
ANSWER TO QUESTION 3b HERE:
# Check unique values of the target variable
print(unique(dtm_train_df$clickbait))
## [1] 0 1
## Levels: 0 1
print(unique(dtm_valid_df$clickbait))
## [1] 0 1
## Levels: 0 1
# Convert the factor target to numeric 0/1 for xgboost
dtm_train_df$clickbait <- as.numeric(as.character(dtm_train_df$clickbait))
dtm_valid_df$clickbait <- as.numeric(as.character(dtm_valid_df$clickbait))
# Confirm the conversion
print(unique(dtm_train_df$clickbait))
## [1] 0 1
print(unique(dtm_valid_df$clickbait))
## [1] 0 1
# Convert data to xgb.DMatrix format
dtrain <- xgb.DMatrix(data = as.matrix(dtm_train_df[-ncol(dtm_train_df)]), label = dtm_train_df$clickbait)
dvalid <- xgb.DMatrix(data = as.matrix(dtm_valid_df[-ncol(dtm_valid_df)]), label = dtm_valid_df$clickbait)
# Set parameters
params <- list(
objective = "binary:logistic",
eta = 0.3,
max_depth = 10
)
# Train the model
bst_model <- xgb.train(params = params, data = dtrain, nrounds = 2500)
# Make predictions
preds_bst <- predict(bst_model, newdata = dvalid)
# Calculate AUC
bst_auc <- roc(cb_valid$clickbait, preds_bst)$auc
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(paste("Validation AUC for the boosting model: ", bst_auc))
## [1] "Validation AUC for the boosting model: 0.5"
# Extract feature importance
importance_matrix <- xgb.importance(model = bst_model)
# Print the first few rows
head(importance_matrix)
# Load ggplot2
library(ggplot2)
# Compute feature importance scores
importance_matrix <- xgb.importance(feature_names = colnames(dtrain), model = bst_model)
# Check if the importance matrix is empty
if(nrow(importance_matrix) == 0){
print("The importance matrix is empty. This suggests that the model did not find any feature important.")
} else {
# Print the first few rows
head(importance_matrix)
# Create importance plot
xgb.plot.importance(importance_matrix)
}
## [1] "The importance matrix is empty. This suggests that the model did not find any feature important."
ANSWER TO QUESTION 4a HERE:
# Load necessary package
library(pROC)
# Use the validation-set labels and each model's predicted probabilities
actual_values <- as.numeric(as.character(cb_valid$clickbait))
predictions_ridge <- as.vector(preds_ridge)
predictions_lasso <- as.vector(preds_lasso)
predictions_rf <- preds_rf_numeric
predictions_boosted <- preds_bst
# Create ROC objects
roc_obj_ridge <- roc(actual_values, predictions_ridge)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_obj_lasso <- roc(actual_values, predictions_lasso)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_obj_rf <- roc(actual_values, predictions_rf)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_obj_boosted <- roc(actual_values, predictions_boosted)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC
plot(roc_obj_ridge, col = "blue")
lines(roc_obj_lasso, col = "red")
lines(roc_obj_rf, col = "green")
lines(roc_obj_boosted, col = "purple")
# Add legend
legend("bottomright", legend = c("Ridge", "Lasso", "Random Forest", "Boosted"), col = c("blue", "red", "green", "purple"), lty = 1)
ANSWER TO QUESTION 4b HERE:
# Use the ridge model's validation-set predictions and the true labels
ridge_predictions <- as.vector(preds_ridge)
actual_values <- as.numeric(as.character(cb_valid$clickbait))
# Convert prediction scores to binary classifications using a threshold of 0.5
predicted_classes <- ifelse(ridge_predictions > 0.5, 1, 0)
# Create a data frame comparing actual and predicted classes alongside the headlines
comparison_df <- data.frame(title = cb_valid$article_title, Actual = actual_values, Predicted = predicted_classes)
# Find false positives (predicted = 1, actual = 0)
false_positives <- comparison_df[comparison_df$Predicted == 1 & comparison_df$Actual == 0, ]
# Find false negatives (predicted = 0, actual = 1)
false_negatives <- comparison_df[comparison_df$Predicted == 0 & comparison_df$Actual == 1, ]
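For example (a sketch), the misclassified headlines can then be inspected directly:
# Sketch: look at a few false positives and false negatives
head(false_positives, 5)
head(false_negatives, 5)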
ANSWER TO QUESTION 5a HERE:
# Load the required libraries
library(tidytext)
library(dplyr)
library(tokenizers)
## Warning: package 'tokenizers' was built under R version 4.3.3
# This is an example data frame
df <- data.frame(title = c("Article 1: Machine Learning", "Article 2: Deep Learning", "Article 3: AI and Robotics"))
# Convert the dataframe to a list of character vectors
texts <- as.list(df$title)
# Get the n-grams (from 1 to 4)
ngrams <- lapply(1:4, function(n) {
unlist(tokenize_ngrams(texts, n = n))
})
# Combine all the ngrams into a single vector
all_ngrams <- unlist(ngrams)
# Convert to a dataframe
df_ngrams <- data.frame(ngram = all_ngrams)
# Count the n-grams
df_counts <- df_ngrams %>% count(ngram, sort = TRUE)
# Check the number of unique n-grams
vocab_size <- nrow(df_counts)
vocab_size
## [1] 31
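The count above is for the toy data frame. A sketch of the same calculation on the actual training headlines, reusing the it_train iterator and text2vec from earlier (not evaluated here):
# Sketch: number of unique 1- to 4-grams in the training headlines
vocab_1to4 <- create_vocabulary(it_train, ngram = c(1L, 4L))
nrow(vocab_1to4)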
Hint: try pruning the vocabulary using a range of max_vocab_size values from 10 to the total size of the vocabulary (you don’t have to try every single size). Create a list of your vocabulary sizes and loop over each size. You will have to re-vectorize the training and validation data before training each new model. Plot the log of the resulting vocabulary size vs. validation AUC.
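A sketch of the procedure the hint describes, using prune_vocabulary’s vocab_term_max argument to cap the vocabulary at each size (object names are illustrative; it reuses vocab, it_train, and it_valid from above):
# Sketch: re-vectorize at several vocabulary sizes and record validation AUC
sizes <- c(10, 50, 100, 250, 500, nrow(vocab))
auc_by_size <- data.frame(vocab_size = integer(), auc = numeric())
for (s in sizes) {
  vocab_s <- prune_vocabulary(vocab, vocab_term_max = s)
  vec_s <- vocab_vectorizer(vocab_s)
  dtm_tr_s <- create_dtm(it_train, vec_s)
  dtm_va_s <- create_dtm(it_valid, vec_s)
  fit_s <- cv.glmnet(dtm_tr_s, cb_train$cb_numeric, alpha = 0, family = "binomial")
  p_s <- predict(fit_s, dtm_va_s, s = fit_s$lambda.min, type = "response")
  auc_s <- roc(cb_valid$cb_numeric, as.numeric(p_s))$auc
  auc_by_size <- rbind(auc_by_size, data.frame(vocab_size = nrow(vocab_s), auc = as.numeric(auc_s)))
}
plot(log(auc_by_size$vocab_size), auc_by_size$auc, type = "b",
     xlab = "log(vocabulary size)", ylab = "Validation AUC")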
ANSWER TO QUESTION 5b HERE:
library(glmnet)
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR)
# Initial preparation, assuming previous steps are done correctly
# Suppose dtm_train and dtm_valid are already prepared with a full vocabulary
# Define range for vocabulary sizes
vocab_size_range <- seq(10, ncol(dtm_train), by = 100) # Adjust step size as needed
# Store results
results <- data.frame(vocab_size = integer(), auc = numeric())
for (size in vocab_size_range) {
# Prune vocabulary to the current size
top_terms <- names(sort(colSums(dtm_train > 0), decreasing = TRUE)[1:size])
dtm_train_pruned <- dtm_train[, top_terms]
dtm_valid_pruned <- dtm_valid[, top_terms]
# Train Ridge model
model <- cv.glmnet(dtm_train_pruned, cb_train$cb_numeric, alpha = 0,
lambda = 10^seq(-4, 1, length = 100), family = "binomial")
best_lambda <- model$lambda.min
# Make predictions and calculate AUC
preds <- predict(model, newx = dtm_valid_pruned, s = best_lambda, type = "response")
roc_curve <- roc(cb_valid$cb_numeric, as.numeric(preds))
auc_value <- auc(roc_curve)
# Append results
results <- rbind(results, data.frame(vocab_size = size, auc = auc_value))
}
# Plotting the results
ggplot(results, aes(x = log(vocab_size), y = auc)) +
geom_line() +
labs(title = "Effect of Vocabulary Size on Validation AUC",
x = "Log of Vocabulary Size", y = "Validation AUC")