“Clickbait” is online content whose main purpose is to attract attention and encourage visitors to click through to a particular web page. The dataset for this assignment consists of clickbait titles (drawn from known clickbait websites such as Buzzfeed) and non-clickbait article titles drawn from reputable sources. The goal of this assignment is to train predictive models that differentiate between clickbait and non-clickbait headlines.
Along the way, you will get hands-on practice with text featurization and with advanced predictive modeling techniques, including ridge regression, the lasso, ensemble methods, and Naive Bayes.
RUBRIC: There are three possible grades on this assignment: Fail (F), Pass (P), and High Pass (H). If you receive an F then you will have one more chance to turn it in to receive a P. If you receive H on 3 out of the 4 assignments this semester you’ll get a bonus point on your final average.
Note that this assignment is somewhat open-ended and there are many ways to answer these questions. I don’t require that we have exactly the same answers in order for you to receive full credit.
The following code block does some initial setup, including reading in the headline data, keeping a numeric copy of the clickbait label alongside a factor version, and splitting the data into training and validation sets.
cb_data <- read_csv("clickbait_headlines.csv") %>%
  mutate(cb_numeric = clickbait,
         clickbait = as.factor(clickbait))
## Rows: 10000 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): article_title
## dbl (2): clickbait, article_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(1)
train_rows <- sample(nrow(cb_data),.7*nrow(cb_data))
cb_train <- cb_data[train_rows,]
cb_valid <- cb_data[-train_rows,]
tr_y <- cb_train$clickbait
va_y <- cb_valid$clickbait
What is the base rate (percent of clickbait articles) in the training data?
ANSWER: 51.26% of the headlines in the training data are clickbait.
# Index by level name: "1" is the clickbait class (counts[0] returns an empty vector because R is 1-indexed)
counts <- table(tr_y)
counts["1"] / sum(counts)
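As an optional cross-check (a minimal sketch using only objects already created above), the same base rate can be computed directly from the numeric copy of the label:
# Base rate as the mean of the 0/1 label; should match the proportion from the table above
mean(cb_train$cb_numeric)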
ANSWER TO QUESTION 1a HERE:
# Convert all words to lowercase
prep_fun = tolower
# Define a "tokenizer" - a way to convert documents to tokens (i.e. features)
cleaning_tokenizer <- function(v) {
  v %>%
    removeNumbers %>%                        # remove all numbers
    removePunctuation %>%                    # remove all punctuation
    removeWords(stopwords(kind = "en")) %>%  # remove stopwords
    stemDocument %>%                         # stem each word
    word_tokenizer                           # split each document into word tokens
}
tok_fun = cleaning_tokenizer
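# Optional sanity check (a sketch, not required): run the preprocessing pipeline
# on a single training headline to see the tokens it produces.
tok_fun(prep_fun(cb_train$article_title[1]))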
# Iterate over the individual documents and convert them to tokens
# Uses the functions defined above.
it_train = itoken(cb_train$article_title,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = cb_train$article_id,
                  progressbar = FALSE)
# Create the vocabulary from the itoken object
vocab = create_vocabulary(it_train)
#Include ngrams
vocab <- create_vocabulary(it_train, ngram = c(1L, 2L))
#Prune vocabulary
vocab = prune_vocabulary(vocab, term_count_min = 10)
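As an optional check (a sketch), the vocabulary returned by create_vocabulary() is a data frame with term, term_count, and doc_count columns, so we can see how many terms survive pruning and which are the most frequent:
# Number of terms kept after pruning, and the ten most common terms
nrow(vocab)
head(vocab[order(-vocab$term_count), c("term", "term_count", "doc_count")], 10)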
ANSWER TO QUESTION 1b HERE:
# Create a vectorizer object using the vocabulary we learned
vectorizer = vocab_vectorizer(vocab)
# Convert the training documents into a DTM
dtm_train = create_dtm(it_train, vectorizer)
# Convert the validation documents into a DTM
it_valid = tok_fun(prep_fun(cb_valid$article_title))
it_valid = itoken(it_valid,
                  ids = cb_valid$article_id,  # use the validation ids here
                  progressbar = FALSE)
#Converting the validation documents into a DTM
dtm_valid = create_dtm(it_valid, vectorizer)
# Make a TFIDF DTM
tfidf = TfIdf$new()
dtm_train_tfidf = fit_transform(dtm_train, tfidf)  # learn the IDF weights on the training DTM
dtm_valid_tfidf = transform(dtm_valid, tfidf)      # apply the training IDF weights to the validation DTM
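A small optional check (a sketch): because both DTMs come from the same vectorizer and the IDF weights were learned only on the training set, the two TF-IDF matrices should share the same columns.
# Train and validation TF-IDF matrices should have identical column sets
dim(dtm_train_tfidf)
dim(dtm_valid_tfidf)
identical(colnames(dtm_train_tfidf), colnames(dtm_valid_tfidf))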
ANSWER TO QUESTION 2a HERE:
# Train a ridge model
grid <- 10^seq(7,-7,length=100)
k <- 5
cv.out.ridge <- cv.glmnet(dtm_train, tr_y, family="binomial", alpha=0, lambda=grid, nfolds=k)
plot(cv.out.ridge)
# Training a lasso model
grid <- 10^seq(7,-7,length=100)
k<-5
cv.out.lasso <- cv.glmnet(dtm_train, tr_y, family="binomial", alpha=1, lambda=grid, nfolds=k)
plot(cv.out.lasso)
ANSWER TO QUESTION 2b HERE:
No, the ridge and lasso models do not have the same best lambda: the best lambda for the ridge model is 0.01707353, while for the lasso model it is 0.001747528.
No, the coefficients of the best ridge and lasso models are not the same either. For example, the intercept of the best ridge model is -0.2719953, while for the best lasso model it is -0.45843068. The lasso also sets several coefficients exactly to zero (e.g., ‘activist’, ‘admit’, ‘africa’), whereas ridge only shrinks them toward zero.
bestlambda_ridge <- cv.out.ridge$lambda.min
bestlambda_ridge
## [1] 0.01707353
#Determining the best lambda for lasso model
bestlambda_lasso <- cv.out.lasso$lambda.min
bestlambda_lasso
## [1] 0.001747528
#Inspecting the coefficients of the best ridge model
coef_bestridge <- coef(cv.out.ridge, s=bestlambda_ridge)[1:15,]
coef_bestridge
## (Intercept) act activist adam admit africa
## -0.2719953 -1.3360211 -0.0520564 0.6718300 -0.6162645 -0.7800130
## alert alway amazon angel apolog bar
## -1.0484121 2.0304607 -0.3281494 0.3113925 -2.8483251 0.5268706
## bid bin birth_month
## -1.8267505 -1.5691325 1.9255937
#Inspecting the coefficients of the best lasso model
coef_bestlasso <- coef(cv.out.lasso, s=bestlambda_lasso)[1:15,]
coef_bestlasso
## (Intercept) act activist adam admit africa
## -0.45843068 -1.51175051 0.00000000 0.07070444 0.00000000 0.00000000
## alert alway amazon angel apolog bar
## -0.98402434 1.38154279 -0.06558884 0.00000000 -2.89875864 0.00000000
## bid bin birth_month
## -1.54588292 -1.23200938 1.01085263
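To make this contrast concrete, an optional count of exact zeros in each coefficient vector (a sketch):
# Number of coefficients shrunk exactly to zero in each model
sum(coef(cv.out.ridge, s = bestlambda_ridge) == 0)
sum(coef(cv.out.lasso, s = bestlambda_lasso) == 0)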
ANSWER TO QUESTION 2c HERE:
The validation accuracy of the best ridge model is 0.9143333.
The validation accuracy of the best lasso model is 0.9146667.
#Making the predictions in the validation set using the best ridge model
pred_r <- predict(cv.out.ridge, s=bestlambda_ridge, newx = dtm_valid,type="response")
class_r <- ifelse(pred_r > 0.5, 1, 0)
acc_r = mean(ifelse(class_r == va_y, 1, 0))
acc_r
## [1] 0.9143333
#Making the predictions in the validation set using the best lasso model
pred_l <- predict(cv.out.lasso, s=bestlambda_lasso, newx = dtm_valid,type="response")
class_l <- ifelse(pred_l > 0.5, 1, 0)
acc_l = mean(ifelse(class_l == va_y, 1, 0))
acc_l
## [1] 0.9146667
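Accuracy alone can hide asymmetric errors, so a confusion matrix for the lasso classifications is a useful optional check (a sketch):
# Confusion matrix for the lasso model on the validation set
table(predicted = as.vector(class_l), actual = va_y)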
ANSWER TO QUESTION 3a HERE:
The validation accuracy of the random forest model is 0.901.
The most important terms according to the variable importance plot are: ‘thing’, ‘peopl’, ‘will’, ‘know’, ‘can’, ‘actual’, ‘time’, ‘kill’, ‘us’, and ‘here’.
#Training a random forest model
rf.mod <- ranger(x = dtm_train, y = tr_y,
                 mtry = 15, num.trees = 500,
                 importance = "impurity",
                 probability = TRUE)
#Making predictions and classifications in the validation set
rf_preds <- predict(rf.mod, data=dtm_valid)$predictions[,2]
rf_classifications <- ifelse(rf_preds>0.5, 1, 0)
rf_acc <- mean(ifelse(rf_classifications == va_y, 1, 0))
rf_acc
## [1] 0.901
#Plotting the variable importance plot
vip(rf.mod)
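The raw importance scores behind the plot can also be pulled from the ranger fit (an optional sketch):
# Ten largest impurity importances from the random forest
head(sort(ranger::importance(rf.mod), decreasing = TRUE), 10)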
ANSWER TO QUESTION 3b HERE:
The validation accuracy of the boosting model is 0.9103333.
The most important terms for the xgboost model are similar to those of the random forest model, but their order of importance differs. For both models, the most important term is ‘thing’.
tr_y_num <- cb_data[train_rows,]$cb_numeric
va_y_num <- cb_data[-train_rows,]$cb_numeric
bst <- xgboost(data = dtm_train, label = tr_y_num,
               max.depth = 2, eta = 1, nrounds = 1000,
               objective = "binary:logistic", verbose = 0)
bst_pred <- predict(bst, dtm_valid)
bst_classifications <- ifelse(bst_pred > 0.5, 1, 0)
bst_acc <- mean(ifelse(bst_classifications == va_y_num, 1, 0))
bst_acc
## [1] 0.9103333
vip_plot <- vip(bst)
vip_plot
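The same check for the boosted model (an optional sketch): xgb.importance() returns the gain-based importance table that vip() plots.
# Ten most important terms by gain in the xgboost model
head(xgb.importance(model = bst), 10)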
ANSWER TO QUESTION 4a HERE:
NB_smoothed_lp3 <- multinomial_naive_bayes(x = dtm_train, y = tr_y, laplace = 3)
ANSWER TO QUESTION 4b HERE:
The validation accuracy of the multinomial Naive Bayes model is 0.9223333.
nb_preds_lp3 <- predict(NB_smoothed_lp3, dtm_valid, type = "prob")[,2]
nb_class_lp3 <- ifelse(nb_preds_lp3 > 0.5, 1, 0)
nb_acc_lp3 <- mean(ifelse(nb_class_lp3 == va_y, 1, 0))
nb_acc_lp3
## [1] 0.9223333
ANSWER TO QUESTION 4c HERE:
The following are titles that were misclassified as clickbait even though they are not clickbait titles (see the false_positive output below):
Clickbait titles tend to contain extreme words, and these examples contain words such as ‘most’ and ‘first’, which may be why the model misclassified them as clickbait even though they are not.
The following are two titles that were misclassified as not clickbait even though they are clickbait titles (see the false_negative output below):
These clickbait articles usually have titles containing a number that states how many items are on the list. In my opinion, removing numbers while creating the vocabulary kept the models from recognizing these titles as clickbait (a quick check of this idea appears after the false-negative listing below).
#Putting the actual values as well as the predictions into a separate dataframe
clickbait_valid <- cb_valid
clickbait_valid$prediction <- nb_class_lp3
#Determining the false positives
false_positive <- clickbait_valid %>%
  filter(prediction == 1 & clickbait == 0) %>%
  select(article_title)
false_positive
#Determining the false negatives
false_negative <- clickbait_valid %>%
  filter(prediction == 0 & clickbait == 1) %>%
  select(article_title)
false_negative
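To probe the explanation above (an optional sketch, not part of the required answer): if list-style clickbait titles rely on leading numbers that the tokenizer strips out, clickbait headlines should start with a digit far more often than non-clickbait headlines.
# Share of headlines that start with a digit, by class, in the training data
cb_train %>%
  group_by(clickbait) %>%
  summarize(starts_with_digit = mean(grepl("^[0-9]", article_title)))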
ANSWER TO QUESTION 5a HERE:
#Creating a new tokenizer
cleaning_tokenizer_new <- function(v) {
  v %>%
    word_tokenizer
}
tok_fun_new = cleaning_tokenizer_new
# Iterating over the individual documents and convert them to tokens
it_train_new = itoken(cb_train$article_title,
                      preprocessor = prep_fun,
                      tokenizer = tok_fun_new,
                      ids = cb_train$article_id,
                      progressbar = FALSE)
#Creating the vocabulary from the itoken object and including up to 4-grams
vocab_new <- create_vocabulary(it_train_new, ngram = c(1L, 4L))
# Tokenizing the validation documents
it_valid_new = tok_fun_new(prep_fun(cb_valid$article_title))
it_valid_new = itoken(it_valid_new, ids = cb_valid$article_id, progressbar = FALSE)
Hint: try pruning the vocabulary using a range of max_vocab_size values from 10 to the total size of the vocabulary. Create a list of your vocabulary sizes and loop over each size. You will have to re-vectorize the training and validation data before training each new model. Plot the log of the resulting vocabulary size vs. validation accuracy.
ANSWER TO QUESTION 5b HERE:
We can observe in the graph that accuracy increases as the vocabulary size increases, but after roughly log(vocabulary_size) ≈ 9 the accuracy starts to drop. So it does appear that beyond a certain vocabulary size the model begins to overfit, and having “too many terms” hurts validation performance.
#Specifying the vocabulary sizes
vocab_size<- c(100,500,1000,1500,2000,2500,5000,10000,15000,50000,100000,nrow(vocab_new))
va_acc <- rep(0, length(vocab_size))
#Looping through the vocabulary sizes and determining the validation accuracy at each size
for (i in 1:length(vocab_size)) {
  l <- vocab_size[i]
  vocab_pruned = prune_vocabulary(vocab_new, vocab_term_max = l)
  vectorizer_new = vocab_vectorizer(vocab_pruned)
  dtm_train_new = create_dtm(it_train_new, vectorizer_new)
  dtm_valid_new = create_dtm(it_valid_new, vectorizer_new)
  smoothed_model <- multinomial_naive_bayes(x = dtm_train_new, y = tr_y, laplace = 3)
  pred_va <- predict(smoothed_model, dtm_valid_new, type = "prob")[,2]
  class_va <- ifelse(pred_va > 0.5, 1, 0)
  va_acc[i] <- mean(ifelse(va_y == class_va, 1, 0))
}
#Plotting the vocab size vs the accuracy
plot(log(vocab_size), va_acc, type = 'l', col = 'red', ylim = c(0.5,1), xlab = 'Log(Vocabulary Size)', ylab = 'Validation Accuracy')
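To read the best point off the curve numerically (an optional sketch):
# Vocabulary size with the highest validation accuracy, and that accuracy
vocab_size[which.max(va_acc)]
max(va_acc)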