“Clickbait” is online content whose main purpose is to attract attention and encourage visitors to click on a link to a particular web page. The dataset for this assignment consists of clickbait titles (drawn from known clickbait websites such as Buzzfeed) and non-clickbait article titles drawn from reputable sources. The goal of this assignment will be to train predictive models to differentiate between clickbait and non-clickbait headlines.
It also provides hands-on practice with text featurization, advanced predictive modeling techniques such as Ridge, Lasso, and ensemble methods, and constructing ROC curves.
Please answer the questions below clearly and concisely, providing tables or plots where applicable. Turn in a well-formatted compiled HTML document using R Markdown, containing clear answers to the questions and R code in the appropriate places.
RUBRIC: This assignment is graded on accuracy and is out of 50 points. There is no single right answer to each question; we grade your submission based on your understanding of the concepts discussed in class. At the very least, your submission must satisfy the following:
Note that this assignment is somewhat open-ended and there are many ways to answer these questions. Your answers do not have to match mine exactly in order to receive full credit.
The following code block does some initial setup: reading in the data, keeping a numeric copy of the clickbait target alongside a factor version, and splitting the data into training and validation sets.
cb_data <- read_csv("clickbait_headlines.csv") %>%
mutate(cb_numeric = clickbait,
clickbait = as.factor(clickbait))
## Rows: 10000 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): article_title
## dbl (2): clickbait, article_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(1)
train_rows <- sample(nrow(cb_data),.7*nrow(cb_data))
cb_train <- cb_data[train_rows,]
cb_valid <- cb_data[-train_rows,]
What is the base rate (percent of clickbait articles) in the training data?
ANSWER: 48.74% of the headlines in the training data are clickbait.
counts <- table(cb_train$clickbait)
counts[[2]]/sum(counts)
## [1] 0.4874286
ANSWER TO QUESTION 1a HERE:
# Custom tokenizer function
cleaning_tokenizer <- function(v) {
v %>%
removeNumbers() %>%
removePunctuation() %>%
removeWords(stopwords("en")) %>%
stemDocument() %>%
word_tokenizer()
}
# Create iterators for training and validation
it_train <- itoken(cb_train$article_title, preprocessor = tolower, tokenizer = cleaning_tokenizer, ids = cb_train$article_id, progressbar = FALSE)
it_valid <- itoken(cb_valid$article_title, preprocessor = tolower, tokenizer = cleaning_tokenizer, ids = cb_valid$article_id, progressbar = FALSE)
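As a quick sanity check (a sketch, not required by the assignment), the cleaning tokenizer can be applied to a single headline to see the tokens it produces:
# Sketch: tokens produced for the first training headline
cleaning_tokenizer(tolower(cb_train$article_title[1]))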
ANSWER TO QUESTION 1b HERE:
# Generate vocabulary including bigrams
vocab <- create_vocabulary(it_train, ngram = c(1L, 2L))
pruned_vocab <- prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.5)
# Vectorizer based on the pruned vocabulary
vectorizer <- vocab_vectorizer(pruned_vocab)
ANSWER TO QUESTION 1c HERE:
# Load necessary libraries
library(tidyverse)
library(text2vec)
library(tm)
library(SnowballC)
library(glmnet)
library(vip)
library(naivebayes)
library(ranger)
library(xgboost)
library(ROCR)
library(Matrix)
library(data.table)
## Warning: package 'data.table' was built under R version 4.3.2
##
## Attaching package: 'data.table'
## The following object is masked from 'package:naivebayes':
##
## tables
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.3.3
# Define your custom tokenizer function
cleaning_tokenizer <- function(v) {
v %>%
removeNumbers() %>%
removePunctuation() %>%
removeWords(stopwords("en")) %>%
stemDocument() %>%
word_tokenizer()
}
# Create iterators for training and validation
it_train <- itoken(cb_train$article_title, preprocessor = tolower, tokenizer = cleaning_tokenizer, ids = cb_train$article_id, progressbar = FALSE)
# Create vocabulary and vectorizer
vocab <- create_vocabulary(it_train, ngram = c(1L, 2L))
vocab_pruned <- prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.5)
vectorizer <- vocab_vectorizer(vocab_pruned)
# Create document-term matrices
dtm_train <- create_dtm(it_train, vectorizer)
# Load Bing sentiment dictionary
bing_sentiments <- get_sentiments("bing")
# Convert dtm_train to a data.table for easier manipulation and convert to long format
dtm_data <- as.data.table(as.matrix(dtm_train), keep.rownames = "document")
dtm_long <- melt(dtm_data, id.vars = "document", variable.name = "word", value.name = "count")
# Merge the long format DTM with Bing sentiments and calculate the sentiment score for each document
dtm_sentiment <- merge(dtm_long[dtm_long$count > 0, ], bing_sentiments, by = "word", all.x = TRUE)
dtm_sentiment[, sentiment_score := count * ifelse(sentiment == "positive", 1, -1)]
document_sentiment <- dtm_sentiment[, .(sentiment_score = sum(sentiment_score)), by = document]
# Convert document IDs back to numeric so they can be merged with article_id
document_sentiment$document <- as.numeric(document_sentiment$document)
# Merge with training data to get clickbait status
document_sentiment <- merge(document_sentiment, cb_train[, c("article_id", "clickbait")], by.x = "document", by.y = "article_id")
# Calculate average sentiment score by clickbait status
avg_sentiment_scores <- document_sentiment[, .(average_sentiment = mean(sentiment_score, na.rm = TRUE)), by = clickbait]
# Print average sentiment scores
print(avg_sentiment_scores)
## clickbait average_sentiment
## 1: 1 0.1944444
## 2: 0 -0.3095238
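An alternative way to get per-headline Bing sentiment scores (a sketch, not part of the graded answer) is to work directly on the raw titles with tidytext. The inner join simply drops words that are not in the Bing lexicon, and no stemming is applied, so the averages may differ slightly from the table above; the object name sentiment_by_class is illustrative.
# Sketch: per-headline Bing sentiment via tidytext, averaged by clickbait status
sentiment_by_class <- cb_train %>%
  select(article_id, clickbait, article_title) %>%
  unnest_tokens(word, article_title) %>%
  inner_join(bing_sentiments, by = "word") %>%
  mutate(score = ifelse(sentiment == "positive", 1, -1)) %>%
  group_by(article_id, clickbait) %>%
  summarise(sentiment_score = sum(score), .groups = "drop") %>%
  group_by(clickbait) %>%
  summarise(average_sentiment = mean(sentiment_score))
sentiment_by_class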
Your goal going forward is to predict the “clickbait” target variable using a few different types of models. You’ll compare them to pick the best-performing model in terms of validation AUC.
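Since every model below is judged by validation AUC, a small helper built on the already-loaded ROCR package can be reused for each one (a sketch; the function name calc_auc is my own, not part of the assignment):
# Sketch: compute AUC from predicted probabilities and true labels with ROCR
calc_auc <- function(preds, actual) {
  pred_obj <- ROCR::prediction(as.numeric(preds), as.numeric(as.character(actual)))
  ROCR::performance(pred_obj, measure = "auc")@y.values[[1]]
}
# e.g. calc_auc(preds_ridge, cb_valid$clickbait) once preds_ridge has been created below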
Make predictions with your best lambda ridge model and store them as a variable called preds_ridge.
ANSWER TO QUESTION 2a HERE:
# Define the range of lambda values
lambda_range <- 10^seq(-7, 7, length.out = 100)
# Train the Ridge model
cv_ridge <- cv.glmnet(dtm_train, cb_train$clickbait,
alpha = 0, lambda = lambda_range,
family = "binomial", nfolds = 5)
# Plot the fitting curve
plot(cv_ridge)
# Get the optimal lambda
optimal_lambda_ridge <- cv_ridge$lambda.min
cat("Optimal lambda for Ridge: ", optimal_lambda_ridge)
## Optimal lambda for Ridge: 0.01707353
# Create an iterator for the validation set
it_valid <- itoken(cb_valid$article_title, preprocessor = tolower, tokenizer = cleaning_tokenizer, ids = cb_valid$article_id, progressbar = FALSE)
# Create the Document-Term Matrix for the validation set
dtm_valid <- create_dtm(it_valid, vectorizer)
# Check the class of dtm_train and dtm_valid
class(dtm_train)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
class(dtm_valid)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
# Make predictions with the best lambda ridge model
preds_ridge <- predict(cv_ridge, dtm_valid, s = optimal_lambda_ridge, type = "response")
ANSWER TO QUESTION 2b HERE:
# Train the Lasso model
cv_lasso <- cv.glmnet(dtm_train, cb_train$clickbait,
alpha = 1, lambda = lambda_range,
family = "binomial", nfolds = 5)
# Plot the fitting curve
plot(cv_lasso)
# Get the optimal lambda
optimal_lambda_lasso <- cv_lasso$lambda.min
cat("Optimal lambda for Lasso: ", optimal_lambda_lasso)
## Optimal lambda for Lasso: 0.001747528
# Make predictions with the best lambda lasso model
preds_lasso <- predict(cv_lasso, dtm_valid, s = optimal_lambda_lasso, type = "response")
ANSWER TO QUESTION 2c HERE:
# Compare optimal lambda values
print(paste("Optimal lambda for Ridge: ", optimal_lambda_ridge))
## [1] "Optimal lambda for Ridge: 0.0170735264747069"
print(paste("Optimal lambda for Lasso: ", optimal_lambda_lasso))
## [1] "Optimal lambda for Lasso: 0.00174752840000768"
same_lambda <- optimal_lambda_ridge == optimal_lambda_lasso
print(paste("Do the Ridge and Lasso models have the same best lambda? ", same_lambda))
## [1] "Do the Ridge and Lasso models have the same best lambda? FALSE"
# Compare coefficients
coef_ridge <- coef(cv_ridge, s = optimal_lambda_ridge)
coef_lasso <- coef(cv_lasso, s = optimal_lambda_lasso)
same_coef <- all(coef_ridge == coef_lasso)
print(paste("Do the Ridge and Lasso models have identical coefficients? ", same_coef))
## [1] "Do the Ridge and Lasso models have identical coefficients? FALSE"
# Calculate and compare validation AUC
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
preds_ridge_prob <- as.vector(preds_ridge) # Convert matrix to vector
preds_lasso_prob <- as.vector(preds_lasso)
auc_ridge <- roc(cb_valid$clickbait, preds_ridge_prob)$auc
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_lasso <- roc(cb_valid$clickbait, preds_lasso_prob)$auc
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(paste("Validation AUC for Ridge: ", auc_ridge))
## [1] "Validation AUC for Ridge: 0.9724167421225"
print(paste("Validation AUC for Lasso: ", auc_lasso))
## [1] "Validation AUC for Lasso: 0.971659367167195"
better_model <- ifelse(auc_ridge > auc_lasso, "Ridge", "Lasso")
print(paste("Model with better validation AUC: ", better_model))
## [1] "Model with better validation AUC: Ridge"
Store predictions from your random forest model as preds_rf.
ANSWER TO QUESTION 3a HERE:
library(ggplot2)
library(ranger)
library(ROCR)
library(vip)
# Convert the document-term matrices (dtm_train and dtm_valid from the vectorization above) to data frames for ranger
dtm_train_df <- as.data.frame(as.matrix(dtm_train))
colnames(dtm_train_df) <- make.names(colnames(dtm_train_df)) # Ensure valid column names
dtm_train_df$clickbait <- cb_train$clickbait # Add the target variable
dtm_valid_df <- as.data.frame(as.matrix(dtm_valid))
colnames(dtm_valid_df) <- make.names(colnames(dtm_valid_df)) # Ensure valid column names
dtm_valid_df$clickbait <- cb_valid$clickbait # Add the target variable
# Fit the Random Forest model
rf_model <- ranger(clickbait ~ ., data = dtm_train_df, num.trees = 500, mtry = 15, importance = 'impurity')
# Make predictions on the validation set
preds_rf <- predict(rf_model, dtm_valid_df)$predictions
# Convert predictions to numeric for AUC calculation
preds_rf_numeric <- as.numeric(levels(preds_rf))[preds_rf]
# Calculate AUC
auc_result <- roc(response = as.numeric(as.character(cb_valid$clickbait)), predictor = preds_rf_numeric)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(paste("Validation AUC for the random forest model: ", auc_result$auc))
## [1] "Validation AUC for the random forest model: 0.895093117538545"
# Check and plot variable importance
# Extract variable importance (ranger returns a named numeric vector)
importance_matrix <- importance(rf_model)
str(importance_matrix) # Check the structure of importance_matrix
## Named num [1:1079] 0.1496 0.2631 0.4195 0.0561 0.1445 ...
## - attr(*, "names")= chr [1:1079] "act" "activist" "adam" "admit" ...
# Convert the named importance vector to a data frame
importance_data <- as.data.frame(importance_matrix) # Convert to a dataframe
str(importance_data) # Check the structure of importance_data
## 'data.frame': 1079 obs. of 1 variable:
## $ importance_matrix: num 0.1496 0.2631 0.4195 0.0561 0.1445 ...
# Add the variable names as a column and rename the importance column
importance_data$variable <- rownames(importance_data) # Add a new variable column
colnames(importance_data)[1] <- "importance" # Rename the first column
# Select the top N most important variables (sorted by importance) for a cleaner plot
top_n <- 20 # Choose a number that works well for your dataset
importance_data_top <- importance_data[order(-importance_data$importance), ][1:top_n, ]
# Plot the variable importance for the top N variables
ggplot(importance_data_top, aes(x = reorder(variable, importance), y = importance)) +
geom_col() +
coord_flip() + # Flip coordinates for a horizontal plot
labs(x = "Variables", y = "Importance", title = "Variable Importance in Random Forest Model") +
theme_minimal() +
theme(
axis.text.y = element_text(size = 8), # Adjust size of the y-axis labels
axis.text.x = element_text(angle = 90, hjust = 1), # Rotate x-axis labels for better fit
plot.margin = unit(c(1, 1, 1, 1), "cm"), # Adjust plot margins
panel.grid.major.y = element_blank(), # Remove grid lines
panel.grid.minor.y = element_blank(),
panel.background = element_blank(), # Remove panel background
axis.ticks.y = element_blank() # Remove y-axis ticks
) +
theme(legend.position = "none") # Hide the legend, if not needed
# Save the plot to a file with increased size
ggsave("variable_importance_plot.png", width = 10, height = 8)
Store predictions from your boosting model as preds_bst.
ANSWER TO QUESTION 3b HERE:
# Check unique values of the target variable
print(unique(dtm_train_df$clickbait))
## [1] 0 1
## Levels: 0 1
print(unique(dtm_valid_df$clickbait))
## [1] 0 1
## Levels: 0 1
# Convert the factor target to numeric 0/1 for xgboost
dtm_train_df$clickbait <- as.numeric(as.character(dtm_train_df$clickbait))
dtm_valid_df$clickbait <- as.numeric(as.character(dtm_valid_df$clickbait))
# Confirm the conversion
print(unique(dtm_train_df$clickbait))
## [1] 0 1
print(unique(dtm_valid_df$clickbait))
## [1] 0 1
# Convert data to xgb.DMatrix format
dtrain <- xgb.DMatrix(data = as.matrix(dtm_train_df[-ncol(dtm_train_df)]), label = dtm_train_df$clickbait)
dvalid <- xgb.DMatrix(data = as.matrix(dtm_valid_df[-ncol(dtm_valid_df)]), label = dtm_valid_df$clickbait)
# Set parameters
params <- list(
objective = "binary:logistic",
eta = 0.3,
max_depth = 10
)
# Train the model
bst_model <- xgb.train(params = params, data = dtrain, nrounds = 2500)
# Make predictions
preds_bst <- predict(bst_model, newdata = dvalid)
# Calculate AUC
bst_auc <- roc(cb_valid$clickbait, preds_bst)$auc
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(paste("Validation AUC for the boosting model: ", bst_auc))
## [1] "Validation AUC for the boosting model: 0.5"
# Extract feature importance
importance_matrix <- xgb.importance(model = bst_model)
# Print the first few rows
head(importance_matrix)
# Load ggplot2
library(ggplot2)
# Compute feature importance scores
importance_matrix <- xgb.importance(feature_names = colnames(dtrain), model = bst_model)
# Check if the importance matrix is empty
if(nrow(importance_matrix) == 0){
print("The importance matrix is empty. This suggests that the model did not find any feature important.")
} else {
# Print the first few rows
head(importance_matrix)
# Create importance plot
xgb.plot.importance(importance_matrix)
}
## [1] "The importance matrix is empty. This suggests that the model did not find any feature important."
ANSWER TO QUESTION 4a HERE:
# Load necessary package
library(pROC)
# Use the validation-set labels and each model's predicted probabilities
actual_values <- as.numeric(as.character(cb_valid$clickbait))
predictions_ridge <- as.vector(preds_ridge)
predictions_lasso <- as.vector(preds_lasso)
predictions_rf <- preds_rf_numeric
predictions_boosted <- preds_bst
# Create ROC objects
roc_obj_ridge <- roc(actual_values, predictions_ridge)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_obj_lasso <- roc(actual_values, predictions_lasso)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_obj_rf <- roc(actual_values, predictions_rf)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_obj_boosted <- roc(actual_values, predictions_boosted)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot ROC
plot(roc_obj_ridge, col = "blue")
lines(roc_obj_lasso, col = "red")
lines(roc_obj_rf, col = "green")
lines(roc_obj_boosted, col = "purple")
# Add legend
legend("bottomright", legend = c("Ridge", "Lasso", "Random Forest", "Boosted"), col = c("blue", "red", "green", "purple"), lty = 1)
ANSWER TO QUESTION 4b HERE:
# Use the ridge model's validation-set predictions and the true labels
ridge_predictions <- as.vector(preds_ridge)
actual_values <- as.numeric(as.character(cb_valid$clickbait))
# Convert prediction scores to binary classifications using a threshold of 0.5
predicted_classes <- ifelse(ridge_predictions > 0.5, 1, 0)
# Create a data frame comparing actual and predicted classes alongside the headlines
comparison_df <- data.frame(title = cb_valid$article_title, Actual = actual_values, Predicted = predicted_classes)
# Find false positives (predicted = 1, actual = 0)
false_positives <- comparison_df[comparison_df$Predicted == 1 & comparison_df$Actual == 0, ]
# Find false negatives (predicted = 0, actual = 1)
false_negatives <- comparison_df[comparison_df$Predicted == 0 & comparison_df$Actual == 1, ]
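For example (a sketch), the misclassified headlines can then be inspected directly:
# Sketch: look at a few false positives and false negatives
head(false_positives, 5)
head(false_negatives, 5)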
ANSWER TO QUESTION 5a HERE:
# Load the required libraries
library(tidytext)
library(dplyr)
library(tokenizers)
## Warning: package 'tokenizers' was built under R version 4.3.3
# This is an example data frame
df <- data.frame(title = c("Article 1: Machine Learning", "Article 2: Deep Learning", "Article 3: AI and Robotics"))
# Convert the dataframe to a list of character vectors
texts <- as.list(df$title)
# Get the n-grams (from 1 to 4)
ngrams <- lapply(1:4, function(n) {
unlist(tokenize_ngrams(texts, n = n))
})
# Combine all the ngrams into a single vector
all_ngrams <- unlist(ngrams)
# Convert to a dataframe
df_ngrams <- data.frame(ngram = all_ngrams)
# Count the n-grams
df_counts <- df_ngrams %>% count(ngram, sort = TRUE)
# Check the number of unique n-grams
vocab_size <- nrow(df_counts)
vocab_size
## [1] 31
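The count above is for the toy data frame. A sketch of the same calculation on the actual training headlines, reusing the it_train iterator and text2vec from earlier (not evaluated here):
# Sketch: number of unique 1- to 4-grams in the training headlines
vocab_1to4 <- create_vocabulary(it_train, ngram = c(1L, 4L))
nrow(vocab_1to4)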
Hint: try pruning the vocabulary using a range of max_vocab_size values from 10 to the total size of the vocabulary (you don’t have to try every single size). Create a list of your vocabulary sizes and loop over each size. You will have to re-vectorize the training and validation data before training each new model. Plot the log of the resulting vocabulary size vs. validation AUC.
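A sketch of the procedure the hint describes, using prune_vocabulary’s vocab_term_max argument to cap the vocabulary at each size (object names are illustrative; it reuses vocab, it_train, and it_valid from above):
# Sketch: re-vectorize at several vocabulary sizes and record validation AUC
sizes <- c(10, 50, 100, 250, 500, nrow(vocab))
auc_by_size <- data.frame(vocab_size = integer(), auc = numeric())
for (s in sizes) {
  vocab_s <- prune_vocabulary(vocab, vocab_term_max = s)
  vec_s <- vocab_vectorizer(vocab_s)
  dtm_tr_s <- create_dtm(it_train, vec_s)
  dtm_va_s <- create_dtm(it_valid, vec_s)
  fit_s <- cv.glmnet(dtm_tr_s, cb_train$cb_numeric, alpha = 0, family = "binomial")
  p_s <- predict(fit_s, dtm_va_s, s = fit_s$lambda.min, type = "response")
  auc_s <- roc(cb_valid$cb_numeric, as.numeric(p_s))$auc
  auc_by_size <- rbind(auc_by_size, data.frame(vocab_size = nrow(vocab_s), auc = as.numeric(auc_s)))
}
plot(log(auc_by_size$vocab_size), auc_by_size$auc, type = "b",
     xlab = "log(vocabulary size)", ylab = "Validation AUC")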
ANSWER TO QUESTION 5b HERE:
library(glmnet)
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR)
# Initial preparation, assuming previous steps are done correctly
# Suppose dtm_train and dtm_valid are already prepared with a full vocabulary
# Define range for vocabulary sizes
vocab_size_range <- seq(10, ncol(dtm_train), by = 100) # Adjust step size as needed
# Store results
results <- data.frame(vocab_size = integer(), auc = numeric())
for (size in vocab_size_range) {
# Prune vocabulary to the current size
top_terms <- names(sort(colSums(dtm_train > 0), decreasing = TRUE)[1:size])
dtm_train_pruned <- dtm_train[, top_terms]
dtm_valid_pruned <- dtm_valid[, top_terms]
# Train Ridge model
model <- cv.glmnet(dtm_train_pruned, cb_train$cb_numeric, alpha = 0,
lambda = 10^seq(-4, 1, length = 100), family = "binomial")
best_lambda <- model$lambda.min
# Make predictions and calculate AUC
preds <- predict(model, newx = dtm_valid_pruned, s = best_lambda, type = "response")
roc_curve <- roc(cb_valid$cb_numeric, as.numeric(preds))
auc_value <- auc(roc_curve)
# Append results
results <- rbind(results, data.frame(vocab_size = size, auc = auc_value))
}
# Plotting the results
ggplot(results, aes(x = log(vocab_size), y = auc)) +
geom_line() +
labs(title = "Effect of Vocabulary Size on Validation AUC",
x = "Log of Vocabulary Size", y = "Validation AUC")