The goal of this project is to use natural language processing on tweets and build a model to figure out whether a tweet is about a disaster or not.
The original dataset comes from the “Real or Not? NLP with Disaster Tweets” Kaggle competition. The actual labels are public (which explains the abnormally high number of 100%-accuracy submissions on the leaderboard), so our goal here is simply to use efficient NLP tools and spend time on feature engineering and model building to maximize our submission scores.
Throughout this project, we will use the following packages:
library(ggplot2)
library(ggrepel)
library(gridExtra)
library(stringi)
library(stringr)
library(dplyr)
library(data.table)
library(quanteda)
library(quanteda.textmodels)
library(RColorBrewer)
library(irlba)
library(caret)
library(randomForest)
library(gbm)
library(xgboost)
library(e1071)
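If any of these packages are missing, a short sketch like the following can install them from CRAN first (the vector simply mirrors the library calls above):
pkgs <- c("ggplot2", "ggrepel", "gridExtra", "stringi", "stringr", "dplyr",
"data.table", "quanteda", "quanteda.textmodels", "RColorBrewer",
"irlba", "caret", "randomForest", "gbm", "xgboost", "e1071")
# Install only the packages that are not already available locally
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)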
As a personal habit, I automate the downloading and loading processes.
# This will create a "data" folder in the current directory and store the data in it
if (!file.exists("data")){
dir.create("data")
}
trainurl <- "https://storage.googleapis.com/kagglesdsdata/competitions/17777/869809/train.csv?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1602402299&Signature=bP6vq%2FzMH7wwZfHDce5GKr2WaTSDQZuEtqX%2FNRtX%2B5qA0m10jVcrDUsj555XRdeijGUnJ6HkY%2BtIgLpbIsgjrFC5UUimgqwbexpPi8VaL0gFOIAa88e01FMSMNlk7lBDkzDsHmLR1pYjqY5SCX4YXFCNMnuXYbU5D87cQHMxgDZn1BFz2dJgNtj1yoympLH4UMIfvICKj8SD3xAfSFU913i6qqvVRoX19Jlmw4ejJOoJO2wFLexoaQ6UlezBDavpUpZUaxp1YzM6Y4NHNvIuIslSz45DJuYtBlDwoHOaJtrMnggAohZqHPHoz51BI%2Fsr6ImCnqVVlM5Beme%2BLTCsEw%3D%3D&response-content-disposition=attachment%3B+filename%3Dtrain.csv"
trainfile <- "./data/train.csv"
testurl <- "https://storage.googleapis.com/kagglesdsdata/competitions/17777/869809/test.csv?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1602402306&Signature=KUi2ioNjD0fmj%2BNFnndPkeVjMzaV8CwlG90ZwrwC1x6rdwS8Cxz5P7a%2BdEhhfqrWk5N7cirMTcekO21t2mSHJ0b0cyH5hhxZvMdT0is9Q5oQqC852MZ5DRGFWxHHRmyqiaoqTznNn5qp6TIoqGDqu7iQkLbejBVJx57ledBVQASqN%2BvQJNJNwMduqGKWdNKnjKuG0Q9B%2F830jZS5AxQHyiSlXqIG5j9QyotLJU0MjX%2BI920vfmj6seeG57ls9ENmb7Fd5VgWdwr0Wo%2BKLGizjszLL%2BfVMmW%2F4OZC%2FlOWgvn4LpH5NrTNwPTZGA5VLUhcnHpYHH2rtxsi0Ax3tEq6zQ%3D%3D&response-content-disposition=attachment%3B+filename%3Dtest.csv"
testfile <- "./data/test.csv"
if (!file.exists(trainfile)){
download.file(trainurl, trainfile, method = "curl")
}
if (!file.exists(testfile)){
download.file(testurl, testfile, method = "curl")
}
testing <- read.csv(testfile)
training <- read.csv(trainfile)
It’s now time to take a look at our data. The first thing we’ll do is to convert both train and test sets to data tables to make future modifications easier.
training <- data.table(training)
testing <- data.table(testing)
For future data cleaning and processing, we’ll combine the training and testing sets into one complete set:
comb <- data.table(rbind(training, mutate(testing, target = NA)))
We’ll also store the testing set IDs, which will be needed for future submissions:
test_id <- testing$id
Let’s have a look at the dimensions of our training set:
dim(training)
## [1] 7613 5
Our response variable is target, which takes the value 1 if a tweet is about a real disaster and 0 if not.
ct <- c(nrow(filter(comb[!is.na(target), ], target == 0)),
nrow(filter(comb[!is.na(target), ], target == 1)))
ggplot(data = comb[!is.na(target), ], aes(x = factor(target), fill = target)) +
geom_bar(stat = "count") + theme(legend.position = "none") +
labs(x = "Target (1 if disaster related tweet, 0 if not)", y = "Count")+
ggtitle("Response variable repartition in the training set") +
geom_text(stat = "count", aes(label = ..count..), vjust = -1) +
coord_cartesian(ylim = c(0, 5000))
In our training set, the response variable’s distribution is fairly balanced, so accuracy is a meaningful metric and we don’t need to rebalance the classes before modeling.
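A quick way to quantify that balance is to look at the class proportions directly (a small check, equivalent to the counts shown in the plot):
# Share of non-disaster (0) and disaster (1) tweets among the labelled rows
prop.table(table(comb[!is.na(target), target]))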
Here we’ll create a few count-based features, starting with the length of each tweet:
comb[, text_length := nchar(text), by = seq_len(nrow(comb))]
ggplot(comb[!is.na(target), ], aes(x = factor(target), y = text_length, fill = target)) + geom_boxplot()
Disaster tweets seem to be a bit longer.
comb[, ncaps := str_count(text, "[A-Z]"), by = seq_len(nrow(comb))]
comb[, linkcount := str_count(text, "http"), by = seq_len(nrow(comb))]
# hashtag_count is needed later when assembling the final feature table
comb[, hashtag_count := str_count(text, "#"), by = seq_len(nrow(comb))]
comb[, punc_count := str_count(text, "[\\!\\?\\:\\.]"), by = seq_len(nrow(comb))]
comb[, mention_count := str_count(text, "@"), by = seq_len(nrow(comb))]
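A quick peek at the new columns confirms they look sensible (an optional check):
head(comb[, .(text_length, ncaps, linkcount, hashtag_count, punc_count, mention_count)])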
The first thing we will do with our training set is tokenization: we will split our lines of text into chunks of words. For instance, we want the sentence “This watch is Mike’s” to become [“this” “watch” “is” “mike” “s”].
This is where the quanteda package comes in; it can automate this process with its tokens() function. We will set the following parameters:
comb_corpus <- corpus(comb$text)
comb_corpus <- stri_replace_all_regex(comb_corpus, "[\\p{p}\\p{S}]", "")
comb_corpus <- stri_replace_all_regex(comb_corpus, "http.*", "")
comb_tokens <- tokens(comb_corpus, what = "word", remove_numbers = T,
remove_punct = T, remove_symbols = T, split_hyphens = T,
remove_url = T, remove_twitter = T)
comb_tokens <- tokens_tolower(comb_tokens)
comb_tokens <- tokens_select(comb_tokens, stopwords(), selection = "remove")
comb_tokens <- tokens_wordstem(comb_tokens)
comb_dfm <- dfm(comb_tokens)
comb_dfm <- dfm_trim(comb_dfm, min_termfreq = 5)
We obtain our document-feature matrix (DFM), from which we dropped words that appear fewer than 5 times overall.
We still end up with a very large matrix :
dim(comb_dfm)
## [1] 10876 3024
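Each of those 3024 columns is a stemmed term; featnames() lets us peek at a few of them (an optional check):
head(featnames(comb_dfm), 10)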
Now that we created our tokens and DFM, we can find out what the most frequent single words - or unigrams - are for disaster tweets and non-disaster tweets. After that, we’ll add new variables that count how many of these words appear in each tweet.
We’ll also do that for 2-word associations - or bigrams.
dis_ss <- comb$target == 1 ; dis_ss[is.na(dis_ss)] <- F
nodis_ss <- comb$target == 0 ; nodis_ss[is.na(nodis_ss)] <- F
dis_unidfm <- comb_dfm[dis_ss, ]
dis_unicount <- colSums(dis_unidfm)
dis_unicount <- dis_unicount[dis_unicount > 80]
dis_freq1gs <- names(dis_unicount)
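Before plotting, quanteda’s topfeatures() gives a quick numeric view of the same counts (an optional check):
# Ten most frequent stemmed unigrams among disaster tweets
topfeatures(dis_unidfm, 10)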
We can now plot a wordcloud which contains the most frequent unigrams for disaster tweets:
textplot_wordcloud(dis_unidfm, min.freq = 80, color = brewer.pal(10, "BrBG"))
title("Most frequent words in disaster tweets", col.main = "grey14")
And we do the same for non-disaster tweets:
nodis_unidfm <- comb_dfm[nodis_ss, ]
nodis_unicount <- colSums(nodis_unidfm)
nodis_unicount <- nodis_unicount[nodis_unicount > 80]
nodis_freq1gs <- names(nodis_unicount)
textplot_wordcloud(nodis_unidfm, min.freq = 80, color = brewer.pal(10, "BrBG"))
title("Most frequent words in non-disaster tweets", col.main = "grey14")
Before creating our new features, we’ll remove frequent unigrams that are common to both disaster and non-disaster tweets:
red_dis1g <- which(dis_freq1gs %in% nodis_freq1gs)
red_nodis1g <- which(nodis_freq1gs %in% dis_freq1gs)
dis_freq1gs <- dis_freq1gs[-red_dis1g]
nodis_freq1gs <- nodis_freq1gs[-red_nodis1g]
comb[, nr_disasterWords := sum(str_count(text, dis_freq1gs)), by = seq_len(nrow(comb))]
comb[, nr_noDisasterWords := sum(str_count(text, nodis_freq1gs)), by = seq_len(nrow(comb))]
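To see how these counts work: str_count() applied to one tweet and a vector of patterns returns one count per pattern, and sum() collapses them into a single feature value. A toy illustration with made-up words (not taken from the actual frequent-word lists):
# One match for "fire", one for "flood", none for "storm" -> feature value of 2
sum(str_count("forest fire and flash flood warning", c("fire", "flood", "storm")))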
We’ll repeat the exact same process on bigrams.
We’ll start by creating the bigram DFM:
comb_bigrams <- tokens_ngrams(comb_tokens, n = 2)
comb_dfm2g <- dfm(comb_bigrams)
We can now isolate the bigrams from disaster tweets and plot the wordcloud:
dis_dfm2g <- comb_dfm2g[dis_ss, ]
dis_count2g <- colSums(dis_dfm2g)
dis_count2g <- dis_count2g[dis_count2g > 25]
textplot_wordcloud(dis_dfm2g, min.freq = 25, color = brewer.pal(10, "BrBG"))
title("Most frequent bigrams (2Gs) in disaster tweets", col.main = "grey14")
Now we do the same for non-disaster tweets:
nodis_dfm2g <- comb_dfm2g[nodis_ss, ]
nodis_count2g <- colSums(nodis_dfm2g)
nodis_count2g <- nodis_count2g[nodis_count2g > 15]
textplot_wordcloud(nodis_dfm2g, min.freq = 15, color = brewer.pal(10, "BrBG"))
title("Most frequent bigrams (2Gs) in non-disaster tweets", col.main = "grey14")
We’ll now create a list of these frequent bigrams and remove those that are common to both disaster and non-disaster tweets:
dis_freq2gs <- names(dis_count2g)
nodis_freq2gs <- names(nodis_count2g)
red_dis2g <- which(dis_freq2gs %in% nodis_freq2gs)
red_nodis2g <- which(nodis_freq2gs %in% dis_freq2gs)
dis_freq2gs <- dis_freq2gs[-red_dis2g]
nodis_freq2gs <- nodis_freq2gs[-red_nodis2g]
When we look at our bigram list, for instance in the wordclouds, we can see that the words are stemmed and joined by underscores:
dis_freq2gs[1:5]
## [1] "debri_found" "latest_home" "home_raze"
## [4] "raze_northern" "northern_california"
Therefore, when we want to count how many times those bigrams appear in a tweet, we can’t use the raw data, in which the tweets are neither stemmed nor joined. Luckily, we can use the words from our tokens list, which have already been stemmed:
comb_tokens[[1]][1:3]
## [1] "deed" "reason" "earthquak"
We just have to pre-process them so that the words are joined by underscores, and then we can find the matches:
# Join each tweet's stemmed tokens into one underscore-separated string
preprocessed_tokens <- vector("list", length(comb_tokens))
for (i in seq_along(comb_tokens)){
preprocessed_tokens[[i]] <- paste(comb_tokens[[i]], collapse = "_")
}
bigram_match <- function(textinput, bigram_list){
num_output <- sum(str_count(textinput, bigram_list))
num_output
}
nr_disaster2gs <- sapply(preprocessed_tokens, bigram_match, dis_freq2gs)
nr_noDisaster2gs <- sapply(preprocessed_tokens, bigram_match, nodis_freq2gs)
comb$nr_disaster2gs <- nr_disaster2gs
comb$nr_noDisaster2gs <- nr_noDisaster2gs
Given the size of our DFMs, we’ll want to identify the features that capture the most variability, which means computing a singular value decomposition.
Out of the 3024 unigrams and 61866 bigrams we have, we’ll only keep, for each DFM, the 100 components that capture the most variability. This truncated SVD can be computed with the irlba() function from the package of the same name:
comb_svd1g <- irlba(t(comb_dfm), nv = 100, maxit = 1000)
comb_topfeatures1g <- comb_svd1g$v
comb_svd2g <- irlba(t(comb_dfm2g), nv = 100, maxit = 1000)
comb_topfeatures2g <- comb_svd2g$v
comb_topfeatures1g <- data.table(comb_topfeatures1g)
names(comb_topfeatures1g) <- paste("U", as.character(1:100), sep = "")
comb_topfeatures2g <- data.table(comb_topfeatures2g)
names(comb_topfeatures2g) <- paste("B", as.character(1:100), sep = "")
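The singular values returned by irlba give a rough idea of how quickly the captured variability falls off across components (a quick diagnostic, assuming the objects computed above):
# Decay of the 100 retained singular values for the unigram DFM
plot(comb_svd1g$d, type = "b", xlab = "Component", ylab = "Singular value")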
Most machine learning models require numeric predictors, which means categorical variables need to be encoded as dummy variables. We’ll do that for our keyword variable.
First, we’ll change empty keyword values to “None”:
comb[keyword == ""]$keyword <- "None"
Then we do the actual encoding:
kw <- data.table(comb$keyword) ; names(kw) <- "keyword_"
ohe <- dummyVars(~ ., data = kw)
ohecat_kw <- predict(ohe, kw)
ohecat_kw <- data.table(ohecat_kw)
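As a toy illustration of what dummyVars() produces, here is the encoding of a tiny made-up keyword column (hypothetical values, not rows from the dataset):
toy_kw <- data.table(keyword_ = c("fire", "flood", "fire"))
predict(dummyVars(~ ., data = toy_kw), toy_kw)
# -> one indicator column per keyword level, with a 1 marking the observed value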
Encoding added many variables to our dataset, and some of them may have near-zero occurrence counts. Not only would these bring little predictive value to our model, they could even make it overfit the training set and hurt its performance.
We’ll therefore remove dummy variables that appear fewer than 15 times.
low_occurence <- names(which(ohecat_kw[, lapply(.SD, sum) < 15]))
ohecat_kw <- select(ohecat_kw, -all_of(low_occurence))
comb_complete <- data.table(target = comb$target,
ncaps = comb$ncaps,
text_length = comb$text_length,
linkcount = comb$linkcount,
hashtag_count = comb$hashtag_count,
mention_count = comb$mention_count,
punc_count = comb$punc_count,
nr_disasterWords = comb$nr_disasterWords,
nr_noDisasterWords = comb$nr_noDisasterWords,
nr_disaster2gs = comb$nr_disaster2gs,
nr_noDisaster2gs = comb$nr_noDisaster2gs,
ohecat_kw,
comb_topfeatures1g,
comb_topfeatures2g)
training_range <- 1:nrow(training)
testing_range <- (nrow(training)+1):nrow(comb)
model_train <- comb_complete[training_range]
preproc_test <- comb_complete[testing_range]
preproc_test <- select(preproc_test, -target)
We’ll use caret’s cross-validation to find the best tune for an XGBoost model:
library(parallel)
library(doParallel)
cluster1 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster1)
trctrl <- trainControl(method = "cv", number = 5, allowParallel = T,
verboseIter = F)
xgbGrid <- expand.grid(nrounds = 750,
max_depth = seq(2, 10, by = 2),
colsample_bytree = seq(0.2, 0.6, length.out = 5),
eta = seq(0.005, 0.015, length.out = 3),
gamma=0,
min_child_weight = c(0.8, 1, 1.2),
subsample = 1
)
set.seed(69420)
modfit_xgbm <- caret::train(target ~ ., data = model_train,
method = "xgbTree",
trControl = trctrl,
tuneGrid = xgbGrid, verbose = F)
stopCluster(cluster1)
registerDoSEQ()
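Cross-validating this grid is expensive, which is why it helps to run it on a parallel cluster: 5 values of max_depth × 5 of colsample_bytree × 3 of eta × 3 of min_child_weight gives 5 × 5 × 3 × 3 = 225 candidate settings, each fitted 5 times. A quick check:
nrow(xgbGrid)       # 225 hyperparameter combinations
nrow(xgbGrid) * 5   # 1125 model fits across the 5 CV folds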
Let’s see what the best tune for the XGBoost model is:
modfit_xgbm$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 126 750 6 0.01 0 0.5 1 1
We set these as default hyperparameters:
xgbparam <- list(objective = "binary:logistic",
booster = "gbtree",
max_depth = 6,
colsample_bytree = 0.5,
eta = 0.01,
gamma=0,
min_child_weight = 1,
subsample = 1
)
Now we’ll run another series of cross-validations, this time to figure out the best number of boosting rounds:
set.seed(1234)
xgb_cv <- xgb.cv(params = xgbparam, data = as.matrix(model_train),
nrounds = 10000, nfold = 5, showsd = T, stratified = T,
print_every_n = 15, early_stopping_rounds = 5, maximize = F,
label = model_train$target)
## [1] train-error:0.045328+0.090655 test-error:0.054199+0.108399
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [2] train-error:0.000000+0.000000 test-error:0.000000+0.000000
We now have all the parameters we need and can fit the final XGBoost model:
dmat_train <- xgb.DMatrix(data = as.matrix(select(model_train, -target)), label = model_train$target)
modfit_xgbm2 <- xgb.train(dmat_train, params = xgbparam, nrounds = 20)
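As a quick optional sanity check, predictions from this model come back as probabilities (from the binary:logistic objective), so they need rounding before being compared with the labels:
train_prob <- predict(modfit_xgbm2, dmat_train)    # predicted probabilities
mean(round(train_prob) == model_train$target)      # training-set accuracy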
XGBoost models can also be used to plot feature importance graphs:
imp_mat <- xgb.importance(feature_names = colnames(select(model_train, -target)), model = modfit_xgbm2)
xgb.ggplot.importance(importance_matrix = imp_mat[1:20], rel_to_first = T)
library(parallel)
library(doParallel)
cluster1 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster1)
set.seed(69420)
modfit_rf <- caret::train(target ~ ., data = model_train,
method = "rf",
trControl = trctrl)
stopCluster(cluster1)
registerDoSEQ()
Just like with XGBoost, we can plot a feature importance graph for the Random Forest model:
imp_rf <- importance(modfit_rf$finalModel)
imp_rf <- data.table(variables = row.names(imp_rf), inc_mse = imp_rf[, 1])
imp_rf <- arrange(imp_rf, desc(inc_mse))
imp_rf <- imp_rf[1:20]
ggplot(data = imp_rf,
aes(x = reorder(variables, inc_mse), y = inc_mse, fill = inc_mse)) +
geom_bar(stat = "identity") +
theme(legend.position = "none")+
labs(x = "Features",
y = "% increase in MSE if variable is shuffled")+
coord_flip()
So far, our models disagree on which features are the most important. We can expect a difference in accuracies between the two.
Lastly, we’ll build a classification SVM model:
modfit_svm <- svm(target ~ ., data = model_train,
type = "C-classification",
kernel = "radial",
cost = 2,
coef0 = 0.8)
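The cost value here was picked by hand; e1071’s tune() could search over a small grid instead (a sketch on the same data; it can take a while to run):
# 10-fold cross-validation over a few cost values for the radial-kernel SVM
svm_tune <- tune(svm, target ~ ., data = model_train,
type = "C-classification", kernel = "radial",
ranges = list(cost = c(0.5, 1, 2, 4)))
svm_tune$best.parameters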
dmat_pptest <- xgb.DMatrix(data = as.matrix(preproc_test))
answer_xgbm <- predict(modfit_xgbm2, dmat_pptest)
answer_xgbm_r <- round(answer_xgbm)
answer_rf <- predict(modfit_rf, preproc_test)
answer_rf_r <- round(answer_rf)
answer_svm <- predict(modfit_svm, preproc_test)
answer_ensemble <- (answer_xgbm_r + answer_rf_r + (as.numeric(answer_svm)) - 1 ) / 3
answer_ensemble_r <- round(answer_ensemble)
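The ensemble is a simple majority vote: the three 0/1 predictions are averaged and rounded, so a class wins whenever at least two of the three models agree. A toy check:
round((1 + 0 + 1) / 3)   # two of the three models predict "disaster" -> ensemble predicts 1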
solution_xgbm <- data.frame(Id = test_id, target = answer_xgbm_r)
solution_rf <- data.frame(Id = test_id, target = answer_rf_r)
solution_svm <- data.frame(Id = test_id, target = answer_svm)
solution_ensemble <- data.frame(Id = test_id, target = answer_ensemble_r)
write.csv(solution_xgbm, "disaster_tweet_solution_xgbm.csv", row.names = F)
write.csv(solution_rf, "disaster_tweet_solution_rf.csv", row.names = F)
write.csv(solution_svm, "disaster_tweet_solution_svm.csv", row.names = F)
write.csv(solution_ensemble, "disaster_tweet_solution_ensemble.csv", row.names = F)
We obtain the following public leaderboard scores:
Random Forest : 0.77382
XGBM : 0.75237
SVM : 0.78118
Ensemble : 0.78148