Due April 28, 2022

Worth 40 points total

Problem Overview

“Clickbait” is online content whose main purpose is to attract attention and encourage visitors to click on a link to a particular web page. The dataset for this assignment consists of clickbait titles (drawn from known clickbait websites such as Buzzfeed) and non-clickbait article titles drawn from reputable sources. The goal of this assignment will be to train predictive models to differentiate between clickbait and non-clickbait headlines.

This assignment also gives you hands-on practice with text featurization and advanced predictive modeling techniques, including Ridge, Lasso, ensemble methods, and Naive Bayes.

RUBRIC: There are three possible grades on this assignment: Fail (F), Pass (P), and High Pass (H). If you receive an F then you will have one more chance to turn it in to receive a P. If you receive H on 3 out of the 4 assignments this semester you’ll get a bonus point on your final average.

  1. Turn in a well-formatted compiled HTML document using R markdown. If you turn in a different file type or your code doesn’t compile, you will be asked to redo the assignment.
  2. Provide clear answers to the questions and the correct R commands as necessary, in the appropriate places. You may answer up to three sub-questions incorrectly and still receive a P on this assignment (for example, 1(a) counts as one sub-question). If you answer all sub-questions correctly on your first submission you will receive an H.
  3. The entire document must be clear, concise, readable, and well-formatted. If your assignment is unreadable or if you include more output than necessary to answer the questions you will be asked to redo the assignment.

Note that this assignment is somewhat open-ended and there are many ways to answer these questions. I don’t require that we have exactly the same answers in order for you to receive full credit.

The following code block does some initial setup, including:

  1. Reading the dataset (make sure to set your working directory)
  2. Creating the target variable
  3. Setting the random seed
  4. Splitting into 70% training and 30% validation data
# Load the packages used throughout the assignment
library(tidyverse)   # read_csv(), mutate(), pipes
library(text2vec)    # itoken(), create_vocabulary(), create_dtm(), TfIdf
library(tm)          # removeNumbers(), removePunctuation(), removeWords(), stemDocument()
library(glmnet)      # cv.glmnet()
library(ranger); library(vip)          # random forest and variable importance plots
library(xgboost); library(naivebayes)  # boosting and multinomial naive Bayes

cb_data <- read_csv("clickbait_headlines.csv") %>%
  mutate(cb_numeric = clickbait,
         clickbait = as.factor(clickbait))
## Rows: 10000 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): article_title
## dbl (2): clickbait, article_id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(1)
train_rows <- sample(nrow(cb_data),.7*nrow(cb_data))
cb_train <- cb_data[train_rows,]
cb_valid <- cb_data[-train_rows,]

tr_y <- cb_train$clickbait
va_y <- cb_valid$clickbait

0: Example answer

What is the base rate (percent of clickbait articles) in the training data?

ANSWER: 51.26% of the headlines in the training data are clickbait.

counts <- table(tr_y)
counts["1"]/sum(counts)  # proportion of clickbait (level "1") headlines in the training data

1: Text Featurization

  1. Create the clickbait article vocabulary from their titles using the following parameters: lowercase the words, remove numbers and punctuation, remove stopwords, perform stemming. Include both unigrams and bigrams. Prune the resulting vocabulary to only include terms that occur in at least 10 article titles.

ANSWER TO QUESTION 1a HERE:

# Convert all words to lowercase
prep_fun = tolower

# Define a "tokenizer" - a way to convert documents to tokens (i.e. features)
cleaning_tokenizer <- function(v) {
  v %>%
    removeNumbers %>% #remove all numbers
    removePunctuation %>% #remove all punctuation
    removeWords(stopwords(kind="en")) %>% #remove stopwords
    stemDocument %>%
    word_tokenizer 
}

tok_fun = cleaning_tokenizer

# Iterate over the individual documents and convert them to tokens
# Uses the functions defined above.
it_train = itoken(cb_train$article_title, 
                  preprocessor = prep_fun, 
                  tokenizer = tok_fun, 
                  ids = cb_train$article_id, 
                  progressbar = FALSE)

# Create the vocabulary from the itoken object, including unigrams and bigrams
vocab = create_vocabulary(it_train, ngram = c(1L, 2L))

# Prune the vocabulary to terms that appear at least 10 times
vocab = prune_vocabulary(vocab, term_count_min = 10)

  2. Vectorize the training and validation article titles and convert them into a TFIDF representation.

ANSWER TO QUESTION 1b HERE:

# Create a vectorizer object using the vocabulary we learned
vectorizer = vocab_vectorizer(vocab)

# Convert the training documents into a DTM
dtm_train = create_dtm(it_train, vectorizer)

# Tokenize the validation documents and convert them into a DTM
it_valid = tok_fun(prep_fun(cb_valid$article_title))
it_valid = itoken(it_valid, 
                  ids = cb_valid$article_id, 
                  progressbar = FALSE)

dtm_valid = create_dtm(it_valid, vectorizer)

# Make a TFIDF DTM: fit the IDF weights on the training data only,
# then apply the same weights to the validation data
tfidf = TfIdf$new()
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
dtm_valid_tfidf = transform(dtm_valid, tfidf)

2: Ridge and Lasso

  1. Train 5-fold cross validated Ridge and lasso with lambda selected from a grid of 100 values ranging between 10^-7 and 10^7 (hint: use cv.glmnet). Include the plots showing the effect of lambda.

ANSWER TO QUESTION 2a HERE:

# Define a common lambda grid and number of folds
grid <- 10^seq(7, -7, length = 100)
k <- 5

# Train a 5-fold cross-validated ridge model (alpha = 0)
cv.out.ridge <- cv.glmnet(dtm_train, tr_y, family="binomial", alpha=0, lambda=grid, nfolds=k)
plot(cv.out.ridge)

# Train a 5-fold cross-validated lasso model (alpha = 1)
cv.out.lasso <- cv.glmnet(dtm_train, tr_y, family="binomial", alpha=1, lambda=grid, nfolds=k)
plot(cv.out.lasso)

  2. Do the ridge and lasso models have the same best lambda? Inspect the coefficients of the best lasso and ridge models. Are the coefficients exactly the same?

ANSWER TO QUESTION 2b HERE:

No, the ridge and lasso models do not have the same best lambda. The best lambda is 0.01707353 for the ridge model and 0.001747528 for the lasso model.

No, the coefficients of the best ridge and lasso models are not the same. For example, the intercept is -0.2719953 for the best ridge model and -0.45843068 for the best lasso model. In addition, the lasso sets several coefficients exactly to zero (e.g., 'activist', 'admit', 'africa'), while the ridge model keeps them small but nonzero.

bestlambda_ridge <- cv.out.ridge$lambda.min
bestlambda_ridge
## [1] 0.01707353
#Determining the best lambda for lasso model

bestlambda_lasso <- cv.out.lasso$lambda.min
bestlambda_lasso
## [1] 0.001747528
#Inspecting the coefficients of the best ridge model

coef_bestridge <- coef(cv.out.ridge, s=bestlambda_ridge)[1:15,]
coef_bestridge
## (Intercept)         act    activist        adam       admit      africa 
##  -0.2719953  -1.3360211  -0.0520564   0.6718300  -0.6162645  -0.7800130 
##       alert       alway      amazon       angel      apolog         bar 
##  -1.0484121   2.0304607  -0.3281494   0.3113925  -2.8483251   0.5268706 
##         bid         bin birth_month 
##  -1.8267505  -1.5691325   1.9255937
#Inspecting the coefficients of the best lasso model

coef_bestlasso <- coef(cv.out.lasso, s=bestlambda_lasso)[1:15,]
coef_bestlasso
## (Intercept)         act    activist        adam       admit      africa 
## -0.45843068 -1.51175051  0.00000000  0.07070444  0.00000000  0.00000000 
##       alert       alway      amazon       angel      apolog         bar 
## -0.98402434  1.38154279 -0.06558884  0.00000000 -2.89875864  0.00000000 
##         bid         bin birth_month 
## -1.54588292 -1.23200938  1.01085263
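
To make the ridge/lasso contrast concrete, one can count how many coefficients each model sets exactly to zero at its best lambda; a minimal sketch (these counts were not part of the original output):

# Count coefficients shrunk exactly to zero at each model's best lambda (illustrative sketch)
sum(coef(cv.out.ridge, s = bestlambda_ridge)[, 1] == 0)
sum(coef(cv.out.lasso, s = bestlambda_lasso)[, 1] == 0)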

  3. Using the best lasso and ridge models, make predictions in the validation set. What are the accuracies of your best ridge and lasso models?

ANSWER TO QUESTION 2c HERE:

The accuracy of the best ridge model is 0.9143333.

The accuracy of the best lasso model is 0.9146667.

#Making the predictions in the validation set using the best ridge model

pred_r <- predict(cv.out.ridge, s=bestlambda_ridge, newx = dtm_valid,type="response")
class_r <- ifelse(pred_r > 0.5, 1, 0)
acc_r = mean(ifelse(class_r == va_y, 1, 0))
acc_r
## [1] 0.9143333
#Making the predictions in the validation set using the best lasso model

pred_l <- predict(cv.out.lasso, s=bestlambda_lasso, newx = dtm_valid,type="response")
class_l <- ifelse(pred_l > 0.5, 1, 0)
acc_l = mean(ifelse(class_l == va_y, 1, 0))
acc_l
## [1] 0.9146667
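
Beyond a single accuracy number, a confusion matrix shows how the errors split between false positives and false negatives; a quick sketch for the lasso classifications (nothing new is fit here):

# Confusion matrix for the best lasso model's validation classifications (sketch)
table(predicted = as.vector(class_l), actual = va_y)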

3: Ensemble Methods

  1. Use ranger() to train a random forest model with 500 trees and m = 15. (Be patient, this one takes a few minutes to run). Do the predictions/classifications in the validation set and report the accuracy. Create a variable importance plot. Which are the most important terms?

ANSWER TO QUESTION 3a HERE:

The accuracy of the random forest model is 0.901.

The most important terms according to the variable importance plot are: ‘thing’, ‘peopl’, ‘will’, ‘know’, ‘can’, ‘actual’, ‘time’, ‘kill’, ‘us’, and ‘here’.

#Training a random forest model
rf.mod <- ranger(x = dtm_train, y = tr_y,
                 mtry=15, num.trees=500,
                 importance="impurity",
                 probability = TRUE)

#Making predictions and classifications in the validation set
rf_preds <- predict(rf.mod, data=dtm_valid)$predictions[,2]
rf_classifications <- ifelse(rf_preds>0.5, 1, 0)
rf_acc <- mean(ifelse(rf_classifications == va_y, 1, 0))
rf_acc
## [1] 0.901
#Plotting the variable importance plot
vip(rf.mod)
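
The importance scores can also be read off numerically from the variable.importance element that ranger() populates when importance = "impurity" is set; a short sketch:

# Top 10 terms by impurity importance from the fitted random forest (sketch)
sort(rf.mod$variable.importance, decreasing = TRUE)[1:10]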

  2. Use xgboost() to train a boosting model with max.depth = 2, eta = 1, and nrounds = 1000. Make predictions and do the classifications in the validation set. Report the accuracy of your boosting model. Create another variable importance plot. Are the most important terms the same as for the random forest model?

ANSWER TO QUESTION 3b HERE:

The accuracy of the boosting model is 0.9103333.

The most important terms are similar between the random forest and boosting models, but the order of importance differs. For both models, the most important term is 'thing'.

# xgboost needs a numeric 0/1 label rather than a factor
tr_y_num <- cb_train$cb_numeric
va_y_num <- cb_valid$cb_numeric

bst <- xgboost(data = dtm_train, label = tr_y_num, max.depth = 2, eta = 1, nrounds = 1000,  objective = "binary:logistic", verbose = 0)

bst_pred <- predict(bst, dtm_valid)
bst_classifications <- ifelse(bst_pred > 0.5, 1, 0)
bst_acc <- mean(ifelse(bst_classifications == va_y_num, 1, 0))
bst_acc
## [1] 0.9103333
vip_plot <- vip(bst)
vip_plot
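
For a numeric comparison with the random forest ranking, xgboost's xgb.importance() returns the importance values as a table; a brief sketch:

# Gain-based importance table for the boosting model; head() keeps the output short (sketch)
head(xgb.importance(model = bst), 10)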

4: Naive Bayes

  1. Train two Naive Bayes models using multinomial_naive_bayes(): one with laplace = 3 and one with laplace = 0.

ANSWER TO QUESTION 4a HERE:

NB_smoothed_lp3 <- multinomial_naive_bayes(x = dtm_train, y = tr_y, laplace = 3)
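
The question asks for two models but only the smoothed one is fit above; a minimal sketch of the unsmoothed fit (the name NB_unsmoothed_lp0 is introduced here):

# Unsmoothed multinomial Naive Bayes (laplace = 0) -- name introduced here for illustration
NB_unsmoothed_lp0 <- multinomial_naive_bayes(x = dtm_train, y = tr_y, laplace = 0)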

  2. For both models, make predictions in the validation set, classify using a cutoff of 0.5, and report the accuracy. Do the two models have different performance?

ANSWER TO QUESTION 4b HERE:

The accuracy for the smoothed (laplace = 3) multinomial Naive Bayes model is 0.9223333.

nb_preds_lp3 <- predict(NB_smoothed_lp3, dtm_valid, type = "prob")[,2]
nb_class_lp3 <- ifelse(nb_preds_lp3 > 0.5, 1, 0)
nb_acc_lp3 <- mean(ifelse(nb_class_lp3 == va_y, 1, 0))
nb_acc_lp3
## [1] 0.9223333
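
To answer whether the two models perform differently, the unsmoothed model can be evaluated with the same steps used above; a sketch (its accuracy is not reported here because it was not part of the original output):

# Validation accuracy for the unsmoothed (laplace = 0) model, mirroring the steps above (sketch)
nb_preds_lp0 <- predict(NB_unsmoothed_lp0, dtm_valid, type = "prob")[,2]
nb_class_lp0 <- ifelse(nb_preds_lp0 > 0.5, 1, 0)
nb_acc_lp0 <- mean(ifelse(nb_class_lp0 == va_y, 1, 0))
nb_acc_lp0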

  3. Inspect some of the misclassifications. Find two false positives and two false negatives and explain why you think they may have been misclassified.

ANSWER TO QUESTION 4c HERE:

False Positives

The following are two titles that were misclassified as ‘Clickbait’ even though they are not clickbait:

  • Beatles’ ‘Hey Jude’ becomes their most downloaded song on iTunes
  • Republicans Choose First Black Party Chairman

Clickbait titles tend to contain superlative or extreme words. Both of the examples above contain such words (‘most’ and ‘first’), which likely led the model to misclassify them as clickbait even though they are not.

False Negatives

The following are two titles that were misclassified as ‘Not Clickbait’ even though they are clickbait:

  • 14 Of The Hottest Topics From 2015
  • 18 Reasons Spotify Ads Are Worse Than Dying A Horrifying, Painful Death

Clickbait “listicle” titles usually begin with a number indicating how many items are on the list. Because numbers were removed while creating the vocabulary, the models lose this strong signal and fail to recognize these titles as clickbait (a quick check of this appears after the code below).

#Putting the actual values as well as the predictions into a separate dataframe
clickbait_valid <- cb_valid
clickbait_valid$prediction <- nb_class_lp3

#Determining the false positives
false_positive <- clickbait_valid %>%
  filter(prediction == 1 & clickbait == 0) %>%
  select(article_title)
false_positive
#Determining the false negatives
false_negative <- clickbait_valid %>%
  filter(prediction == 0 & clickbait == 1)%>%
  select(article_title)
false_negative
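
The false-negative explanation can be checked directly: running one of the misclassified listicle titles through the tokenizer from question 1a shows the leading number being stripped before the model ever sees it (a small illustrative check, not part of the original answer):

# The cleaning tokenizer from 1a removes the "14" that signals a list-style title (sketch)
cleaning_tokenizer(prep_fun("14 Of The Hottest Topics From 2015"))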

5: Variable Selection

  1. Re-tokenize the article titles using the following parameters: DON’T remove numbers, DON’T remove punctuation, DON’T remove stop words, and DON’T stem the document. Create the vocabulary, including up to 4-grams. DON’T prune the vocabulary (yet).

ANSWER TO QUESTION 5a HERE:

#Creating a new tokenizer
cleaning_tokenizer_new <- function(v) {
  v %>%
    word_tokenizer 
}

tok_fun_new = cleaning_tokenizer_new

# Iterate over the individual documents and convert them to tokens

it_train_new = itoken(cb_train$article_title, 
                  preprocessor = prep_fun, 
                  tokenizer = tok_fun_new, 
                  ids = cb_train$article_id,
                  progressbar = FALSE)

#Creating the vocabulary from the itoken object, including up to 4-grams

vocab_new <- create_vocabulary(it_train_new, ngram = c(1L, 4L))

# Tokenizing the validation documents

it_valid_new = tok_fun_new(prep_fun(cb_valid$article_title))
it_valid_new = itoken(it_valid_new, ids = cb_valid$article_id, progressbar = FALSE)

  2. Using smoothed Naive Bayes, make a plot showing the effect of vocabulary size on validation accuracy. What is the effect of vocabulary size on predictive performance? Does it appear that including “too many terms” will cause the model to overfit?

Hint: try pruning the vocabulary using a range of max_vocab_size values from 10 to the total size of the vocabulary. Create a list of your vocabulary sizes and loop over each size. You will have to re-vectorize the training and validation data before training each new model. Plot the log of the resulting vocabulary size vs. validation accuracy.

ANSWER TO QUESTION 5b HERE:

The graph shows that validation accuracy increases as the vocabulary size increases, but after roughly log(vocabulary_size) ~ 9 the accuracy starts dropping. So it does appear that beyond a certain vocabulary size the model begins to overfit: including “too many terms” hurts validation performance.

#Specifying the vocabulary sizes

vocab_size<- c(100,500,1000,1500,2000,2500,5000,10000,15000,50000,100000,nrow(vocab_new))

va_acc <- rep(0, length(vocab_size))

#looping through the vocabulary sizes and determining the accuracy at each size

for(i in c(1:length(vocab_size))){
  l <- vocab_size[i]
  
  vocab_pruned = prune_vocabulary(vocab_new, vocab_term_max = l)
  
  vectorizer_new = vocab_vectorizer(vocab_pruned)
  
  dtm_train_new = create_dtm(it_train_new, vectorizer_new)
  dim(dtm_train_new)
  
  dtm_valid_new = create_dtm(it_valid_new, vectorizer_new)
  dim(dtm_valid_new)
  
  smoothed_model <- multinomial_naive_bayes(x = dtm_train_new, y = tr_y, laplace = 3)
  
  pred_va <- predict(smoothed_model, dtm_valid_new, type = "prob")[,2]
  class_va <- ifelse(pred_va > 0.5, 1, 0)
  va_acc[i] <- mean(ifelse(va_y == class_va, 1, 0))
}

#Plotting the vocab size vs the accuracy
plot(log(vocab_size), va_acc, type = 'l', col = 'red', ylim = c(0.5,1), xlab = 'Log(Vocabulary Size)', ylab = 'Validation Accuracy')
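
To pin down where the curve peaks, the vocabulary size with the highest validation accuracy can be pulled straight from the loop results; a one-line sketch:

# Vocabulary size from the grid above that achieved the highest validation accuracy (sketch)
vocab_size[which.max(va_acc)]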