The goal of this project is to use natural language processing on tweets and build a model to figure out whether a tweet is about a disaster or not.
The original dataset comes from the “Real or Not? NLP with Disaster Tweets” Kaggle competition. The actual labels are public (which explains the abnormally high number of 100%-accuracy submissions on the leaderboard), so our goal here is simply to use efficient NLP tools and spend time on feature engineering and model building to maximize our submission scores.
Throughout this project, we will use the following packages:
library(ggplot2)
library(ggrepel)
library(gridExtra)
library(stringi)
library(stringr)
library(dplyr)
library(data.table)
library(quanteda)
library(quanteda.textmodels)
library(RColorBrewer)
library(irlba)
library(caret)
library(randomForest)
library(gbm)
library(xgboost)
library(e1071)
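If any of these packages are missing, a short sketch like the following can install them from CRAN first (the vector simply mirrors the library calls above):
pkgs <- c("ggplot2", "ggrepel", "gridExtra", "stringi", "stringr", "dplyr",
"data.table", "quanteda", "quanteda.textmodels", "RColorBrewer",
"irlba", "caret", "randomForest", "gbm", "xgboost", "e1071")
# Install only the packages that are not already available locally
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)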
As a personal habit, I automate the downloading and loading processes.
# This will create a "data" folder in the current directory and store the data in it
if (!file.exists("data")){
dir.create("data")
}
trainurl <- "https://storage.googleapis.com/kagglesdsdata/competitions/17777/869809/train.csv?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1602402299&Signature=bP6vq%2FzMH7wwZfHDce5GKr2WaTSDQZuEtqX%2FNRtX%2B5qA0m10jVcrDUsj555XRdeijGUnJ6HkY%2BtIgLpbIsgjrFC5UUimgqwbexpPi8VaL0gFOIAa88e01FMSMNlk7lBDkzDsHmLR1pYjqY5SCX4YXFCNMnuXYbU5D87cQHMxgDZn1BFz2dJgNtj1yoympLH4UMIfvICKj8SD3xAfSFU913i6qqvVRoX19Jlmw4ejJOoJO2wFLexoaQ6UlezBDavpUpZUaxp1YzM6Y4NHNvIuIslSz45DJuYtBlDwoHOaJtrMnggAohZqHPHoz51BI%2Fsr6ImCnqVVlM5Beme%2BLTCsEw%3D%3D&response-content-disposition=attachment%3B+filename%3Dtrain.csv"
trainfile <- "./data/train.csv"
testurl <- "https://storage.googleapis.com/kagglesdsdata/competitions/17777/869809/test.csv?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1602402306&Signature=KUi2ioNjD0fmj%2BNFnndPkeVjMzaV8CwlG90ZwrwC1x6rdwS8Cxz5P7a%2BdEhhfqrWk5N7cirMTcekO21t2mSHJ0b0cyH5hhxZvMdT0is9Q5oQqC852MZ5DRGFWxHHRmyqiaoqTznNn5qp6TIoqGDqu7iQkLbejBVJx57ledBVQASqN%2BvQJNJNwMduqGKWdNKnjKuG0Q9B%2F830jZS5AxQHyiSlXqIG5j9QyotLJU0MjX%2BI920vfmj6seeG57ls9ENmb7Fd5VgWdwr0Wo%2BKLGizjszLL%2BfVMmW%2F4OZC%2FlOWgvn4LpH5NrTNwPTZGA5VLUhcnHpYHH2rtxsi0Ax3tEq6zQ%3D%3D&response-content-disposition=attachment%3B+filename%3Dtest.csv"
testfile <- "./data/test.csv"
if (!file.exists(trainfile)){
download.file(trainurl, trainfile, method = "curl")
}
if (!file.exists(testfile)){
download.file(testurl, testfile, method = "curl")
}
testing <- read.csv(testfile)
training <- read.csv(trainfile)
It’s now time to take a look at our data. The first thing we’ll do is to convert both train and test sets to data tables to make future modifications easier.
training <- data.table(training)
testing <- data.table(testing)
For future data cleaning and processing, we’ll combine the training and testing sets into one complete set:
comb <- data.table(rbind(training, mutate(testing, target = NA)))
We’ll also store the testing set IDs, which will be needed for future submissions:
test_id <- testing$id
Let’s have a look at the dimensions of our training set:
dim(training)
## [1] 7613 5
Our response variable is target, which takes the value 1 if a tweet is about a real disaster and 0 if not.
ct <- c(nrow(filter(comb[!is.na(target), ], target == 0)),
nrow(filter(comb[!is.na(target), ], target == 1)))
ggplot(data = comb[!is.na(target), ], aes(x = factor(target), fill = target)) +
geom_bar(stat = "count") + theme(legend.position = "none") +
labs(x = "Target (1 if disaster related tweet, 0 if not)", y = "Count")+
ggtitle("Response variable repartition in the training set") +
geom_text(stat = "count", aes(label = ..count..), vjust = -1) +
coord_cartesian(ylim = c(0, 5000))
In our training set, the response variable’s distribution is fairly balanced, so accuracy is a meaningful metric and we don’t need to rebalance the classes before modeling.
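A quick way to quantify that balance is to look at the class proportions directly (a small check, equivalent to the counts shown in the plot):
# Share of non-disaster (0) and disaster (1) tweets among the labelled rows
prop.table(table(comb[!is.na(target), target]))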
Here we’ll create a few count-based features, starting with the length of each tweet:
comb[, text_length := nchar(text), by = seq_len(nrow(comb))]
ggplot(comb[!is.na(target), ], aes(x = factor(target), y = text_length, fill = target)) + geom_boxplot()
Disaster tweets seem to be a bit longer.
comb[, ncaps := str_count(text, "[A-Z]"), by = seq_len(nrow(comb))]
comb[, linkcount := str_count(text, "http"), by = seq_len(nrow(comb))]
# hashtag_count is needed later when assembling the final feature table
comb[, hashtag_count := str_count(text, "#"), by = seq_len(nrow(comb))]
comb[, punc_count := str_count(text, "[\\!\\?\\:\\.]"), by = seq_len(nrow(comb))]
comb[, mention_count := str_count(text, "@"), by = seq_len(nrow(comb))]
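A quick peek at the new columns confirms they look sensible (an optional check):
head(comb[, .(text_length, ncaps, linkcount, hashtag_count, punc_count, mention_count)])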
The first thing we will do with our training set is tokenization: we will split our lines of text into chunks of words. For instance, we want the sentence “This watch is Mike’s” to become [“this” “watch” “is” “mike” “s”].
This is where the quanteda package comes in; it can automate this process with its tokens() function. We will set the following parameters:
comb_corpus <- corpus(comb$text)
comb_corpus <- stri_replace_all_regex(comb_corpus, "[\\p{p}\\p{S}]", "")
comb_corpus <- stri_replace_all_regex(comb_corpus, "http.*", "")
comb_tokens <- tokens(comb_corpus, what = "word", remove_numbers = T,
remove_punct = T, remove_symbols = T, split_hyphens = T,
remove_url = T, remove_twitter = T)
comb_tokens <- tokens_tolower(comb_tokens)
comb_tokens <- tokens_select(comb_tokens, stopwords(), selection = "remove")
comb_tokens <- tokens_wordstem(comb_tokens)
comb_dfm <- dfm(comb_tokens)
comb_dfm <- dfm_trim(comb_dfm, min_termfreq = 5)
We obtain our document-feature matrix (DFM), from which we dropped words that appear fewer than 5 times overall.
We still end up with a very large matrix :
dim(comb_dfm)
## [1] 10876 3024
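Each of those 3024 columns is a stemmed term; featnames() lets us peek at a few of them (an optional check):
head(featnames(comb_dfm), 10)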
Now that we created our tokens and DFM, we can find out what the most frequent single words - or unigrams - are for disaster tweets and non-disaster tweets. After that, we’ll add new variables that count how many of these words appear in each tweet.
We’ll also do that for 2-word associations - or bigrams.
dis_ss <- comb$target == 1 ; dis_ss[is.na(dis_ss)] <- F
nodis_ss <- comb$target == 0 ; nodis_ss[is.na(nodis_ss)] <- F
dis_unidfm <- comb_dfm[dis_ss, ]
dis_unicount <- colSums(dis_unidfm)
dis_unicount <- dis_unicount[dis_unicount > 80]
dis_freq1gs <- names(dis_unicount)
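Before plotting, quanteda’s topfeatures() gives a quick numeric view of the same counts (an optional check):
# Ten most frequent stemmed unigrams among disaster tweets
topfeatures(dis_unidfm, 10)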
We can now plot a wordcloud which contains the most frequent unigrams for disaster tweets:
textplot_wordcloud(dis_unidfm, min.freq = 80, color = brewer.pal(10, "BrBG"))
title("Most frequent words in disaster tweets", col.main = "grey14")
And we do the same for non-disaster tweets:
nodis_unidfm <- comb_dfm[nodis_ss, ]
nodis_unicount <- colSums(nodis_unidfm)
nodis_unicount <- nodis_unicount[nodis_unicount > 80]
nodis_freq1gs <- names(nodis_unicount)
textplot_wordcloud(nodis_unidfm, min.freq = 80, color = brewer.pal(10, "BrBG"))
title("Most frequent words in non-disaster tweets", col.main = "grey14")
Before creating our new features, we’ll remove frequent unigrams that are common to both disaster and non-disaster tweets:
red_dis1g <- which(dis_freq1gs %in% nodis_freq1gs)
red_nodis1g <- which(nodis_freq1gs %in% dis_freq1gs)
dis_freq1gs <- dis_freq1gs[-red_dis1g]
nodis_freq1gs <- nodis_freq1gs[-red_nodis1g]
comb[, nr_disasterWords := sum(str_count(text, dis_freq1gs)), by = seq_len(nrow(comb))]
comb[, nr_noDisasterWords := sum(str_count(text, nodis_freq1gs)), by = seq_len(nrow(comb))]
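To see how these counts work: str_count() applied to one tweet and a vector of patterns returns one count per pattern, and sum() collapses them into a single feature value. A toy illustration with made-up words (not taken from the actual frequent-word lists):
# One match for "fire", one for "flood", none for "storm" -> feature value of 2
sum(str_count("forest fire and flash flood warning", c("fire", "flood", "storm")))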
We’ll repeat the exact same process on bigrams.
We’ll start by creating the bigram DFM:
comb_bigrams <- tokens_ngrams(comb_tokens, n = 2)
comb_dfm2g <- dfm(comb_bigrams)
We can now isolate the bigrams from disaster tweets and plot the wordcloud:
dis_dfm2g <- comb_dfm2g[dis_ss, ]
dis_count2g <- colSums(dis_dfm2g)
dis_count2g <- dis_count2g[dis_count2g > 25]
textplot_wordcloud(dis_dfm2g, min.freq = 25, color = brewer.pal(10, "BrBG"))
title("Most frequent bigrams (2Gs) in disaster tweets", col.main = "grey14")
Now we do the same for non-disaster tweets:
nodis_dfm2g <- comb_dfm2g[nodis_ss, ]
nodis_count2g <- colSums(nodis_dfm2g)
nodis_count2g <- nodis_count2g[nodis_count2g > 15]
textplot_wordcloud(nodis_dfm2g, min.freq = 15, color = brewer.pal(10, "BrBG"))
title("Most frequent bigrams (2Gs) in non-disaster tweets", col.main = "grey14")
We’ll now create a list of these frequent bigrams and remove those that are common to both disaster and non-disaster tweets:
dis_freq2gs <- names(dis_count2g)
nodis_freq2gs <- names(nodis_count2g)
red_dis2g <- which(dis_freq2gs %in% nodis_freq2gs)
red_nodis2g <- which(nodis_freq2gs %in% dis_freq2gs)
dis_freq2gs <- dis_freq2gs[-red_dis2g]
nodis_freq2gs <- nodis_freq2gs[-red_nodis2g]
When we look at our bigram list, for instance in the wordclouds, we can see that the words are stemmed and joined by underscores:
dis_freq2gs[1:5]
## [1] "debri_found" "latest_home" "home_raze"
## [4] "raze_northern" "northern_california"
Therefore, when we want to count how many times those bigrams appear in a tweet, we can’t use the raw data, in which the tweets are neither stemmed nor joined. Luckily, we can use the words from our tokens list, which have already been stemmed:
comb_tokens[[1]][1:3]
## [1] "deed" "reason" "earthquak"
We just have to pre-process them so that the words are joined by underscores, and then we can find the matches:
# Join each tweet's stemmed tokens into one underscore-separated string
preprocessed_tokens <- vector("list", length(comb_tokens))
for (i in seq_along(comb_tokens)){
preprocessed_tokens[[i]] <- paste(comb_tokens[[i]], collapse = "_")
}
bigram_match <- function(textinput, bigram_list){
num_output <- sum(str_count(textinput, bigram_list))
num_output
}
nr_disaster2gs <- sapply(preprocessed_tokens, bigram_match, dis_freq2gs)
nr_noDisaster2gs <- sapply(preprocessed_tokens, bigram_match, nodis_freq2gs)
comb$nr_disaster2gs <- nr_disaster2gs
comb$nr_noDisaster2gs <- nr_noDisaster2gs
Given the size of our DFMs, we’ll want to identify the features that capture the most variability, which means computing a singular value decomposition.
Out of the 3024 unigrams and 61866 bigrams we have, we’ll only keep, for each DFM, the 100 components that capture the most variability. This truncated SVD can be computed with the irlba() function from the package of the same name:
comb_svd1g <- irlba(t(comb_dfm), nv = 100, maxit = 1000)
comb_topfeatures1g <- comb_svd1g$v
comb_svd2g <- irlba(t(comb_dfm2g), nv = 100, maxit = 1000)
comb_topfeatures2g <- comb_svd2g$v
comb_topfeatures1g <- data.table(comb_topfeatures1g)
names(comb_topfeatures1g) <- paste("U", as.character(1:100), sep = "")
comb_topfeatures2g <- data.table(comb_topfeatures2g)
names(comb_topfeatures2g) <- paste("B", as.character(1:100), sep = "")
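The singular values returned by irlba give a rough idea of how quickly the captured variability falls off across components (a quick diagnostic, assuming the objects computed above):
# Decay of the 100 retained singular values for the unigram DFM
plot(comb_svd1g$d, type = "b", xlab = "Component", ylab = "Singular value")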
Most machine learning models require numeric predictors, which means categorical variables need to be encoded as dummy variables. We’ll do that for our keyword variable.
First, we’ll change empty keyword values to “None”:
comb[keyword == ""]$keyword <- "None"
Then we do the actual encoding:
kw <- data.table(comb$keyword) ; names(kw) <- "keyword_"
ohe <- dummyVars(~ ., data = kw)
ohecat_kw <- predict(ohe, kw)
ohecat_kw <- data.table(ohecat_kw)
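As a toy illustration of what dummyVars() produces, here is the encoding of a tiny made-up keyword column (hypothetical values, not rows from the dataset):
toy_kw <- data.table(keyword_ = c("fire", "flood", "fire"))
predict(dummyVars(~ ., data = toy_kw), toy_kw)
# -> one indicator column per keyword level, with a 1 marking the observed value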
Encoding added many variables to our dataset, and some of them may have near-zero occurrence counts. Not only would these bring little predictive value to our model, they could even make it overfit the training set and hurt its performance.
We’ll therefore remove dummy variables that appear fewer than 15 times.
low_occurence <- names(which(ohecat_kw[, lapply(.SD, sum) < 15]))
ohecat_kw <- select(ohecat_kw, -all_of(low_occurence))
comb_complete <- data.table(target = comb$target,
ncaps = comb$ncaps,
text_length = comb$text_length,
linkcount = comb$linkcount,
hashtag_count = comb$hashtag_count,
mention_count = comb$mention_count,
punc_count = comb$punc_count,
nr_disasterWords = comb$nr_disasterWords,
nr_noDisasterWords = comb$nr_noDisasterWords,
nr_disaster2gs = comb$nr_disaster2gs,
nr_noDisaster2gs = comb$nr_noDisaster2gs,
ohecat_kw,
comb_topfeatures1g,
comb_topfeatures2g)
training_range <- 1:nrow(training)
testing_range <- (nrow(training)+1):nrow(comb)
model_train <- comb_complete[training_range]
preproc_test <- comb_complete[testing_range]
preproc_test <- select(preproc_test, -target)
We’ll use caret’s cross-validation to find the best tune for an XGBoost model:
library(parallel)
library(doParallel)
cluster1 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster1)
trctrl <- trainControl(method = "cv", number = 5, allowParallel = T,
verboseIter = F)
xgbGrid <- expand.grid(nrounds = 750,
max_depth = seq(2, 10, by = 2),
colsample_bytree = seq(0.2, 0.6, length.out = 5),
eta = seq(0.005, 0.015, length.out = 3),
gamma=0,
min_child_weight = c(0.8, 1, 1.2),
subsample = 1
)
set.seed(69420)
modfit_xgbm <- caret::train(target ~ ., data = model_train,
method = "xgbTree",
trControl = trctrl,
tuneGrid = xgbGrid, verbose = F)
stopCluster(cluster1)
registerDoSEQ()
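Cross-validating this grid is expensive, which is why it helps to run it on a parallel cluster: 5 values of max_depth × 5 of colsample_bytree × 3 of eta × 3 of min_child_weight gives 5 × 5 × 3 × 3 = 225 candidate settings, each fitted 5 times. A quick check:
nrow(xgbGrid)       # 225 hyperparameter combinations
nrow(xgbGrid) * 5   # 1125 model fits across the 5 CV folds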
Let’s see what the best tune for the XGBoost model is:
modfit_xgbm$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 126 750 6 0.01 0 0.5 1 1
We set these as default hyperparameters:
xgbparam <- list(objective = "binary:logistic",
booster = "gbtree",
max_depth = 6,
colsample_bytree = 0.5,
eta = 0.01,
gamma=0,
min_child_weight = 1,
subsample = 1
)
Now we’ll run another series of cross-validations, this time to figure out the best number of boosting rounds:
set.seed(1234)
xgb_cv <- xgb.cv(params = xgbparam, data = as.matrix(model_train),
nrounds = 10000, nfold = 5, showsd = T, stratified = T,
print_every_n = 15, early_stopping_rounds = 5, maximize = F,
label = model_train$target)
## [1] train-error:0.045328+0.090655 test-error:0.054199+0.108399
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 5 rounds.
##
## Stopping. Best iteration:
## [2] train-error:0.000000+0.000000 test-error:0.000000+0.000000
We now have all the parameters we need and can fit the final XGBoost model:
dmat_train <- xgb.DMatrix(data = as.matrix(select(model_train, -target)), label = model_train$target)
modfit_xgbm2 <- xgb.train(dmat_train, params = xgbparam, nrounds = 20)
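As a quick optional sanity check, predictions from this model come back as probabilities (from the binary:logistic objective), so they need rounding before being compared with the labels:
train_prob <- predict(modfit_xgbm2, dmat_train)    # predicted probabilities
mean(round(train_prob) == model_train$target)      # training-set accuracy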
XGBoost models can also be used to plot feature importance graphs:
imp_mat <- xgb.importance(feature_names = colnames(select(model_train, -target)), model = modfit_xgbm2)
xgb.ggplot.importance(importance_matrix = imp_mat[1:20], rel_to_first = T)
library(parallel)
library(doParallel)
cluster1 <- makeCluster(detectCores() - 2)
registerDoParallel(cluster1)
set.seed(69420)
modfit_rf <- caret::train(target ~ ., data = model_train,
method = "rf",
trControl = trctrl)
stopCluster(cluster1)
registerDoSEQ()
Just like with XGBoost, we can plot a feature importance graph for the Random Forest model:
imp_rf <- importance(modfit_rf$finalModel)
imp_rf <- data.table(variables = row.names(imp_rf), inc_mse = imp_rf[, 1])
imp_rf <- arrange(imp_rf, desc(inc_mse))
imp_rf <- imp_rf[1:20]
ggplot(data = imp_rf,
aes(x = reorder(variables, inc_mse), y = inc_mse, fill = inc_mse)) +
geom_bar(stat = "identity") +
theme(legend.position = "none")+
labs(x = "Features",
y = "% increase in MSE if variable is shuffled")+
coord_flip()
So far, our models disagree on which features are the most important. We can expect a difference in accuracies between the two.
Lastly, we’ll build a classification SVM model:
modfit_svm <- svm(target ~ ., data = model_train,
type = "C-classification",
kernel = "radial",
cost = 2,
coef0 = 0.8)
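The cost value here was picked by hand; e1071’s tune() could search over a small grid instead (a sketch on the same data; it can take a while to run):
# 10-fold cross-validation over a few cost values for the radial-kernel SVM
svm_tune <- tune(svm, target ~ ., data = model_train,
type = "C-classification", kernel = "radial",
ranges = list(cost = c(0.5, 1, 2, 4)))
svm_tune$best.parameters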
dmat_pptest <- xgb.DMatrix(data = as.matrix(preproc_test))
answer_xgbm <- predict(modfit_xgbm2, dmat_pptest)
answer_xgbm_r <- round(answer_xgbm)
answer_rf <- predict(modfit_rf, preproc_test)
answer_rf_r <- round(answer_rf)
answer_svm <- predict(modfit_svm, preproc_test)
answer_ensemble <- (answer_xgbm_r + answer_rf_r + (as.numeric(answer_svm)) - 1 ) / 3
answer_ensemble_r <- round(answer_ensemble)
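The ensemble is a simple majority vote: the three 0/1 predictions are averaged and rounded, so a class wins whenever at least two of the three models agree. A toy check:
round((1 + 0 + 1) / 3)   # two of the three models predict "disaster" -> ensemble predicts 1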
solution_xgbm <- data.frame(Id = test_id, target = answer_xgbm_r)
solution_rf <- data.frame(Id = test_id, target = answer_rf_r)
solution_svm <- data.frame(Id = test_id, target = answer_svm)
solution_ensemble <- data.frame(Id = test_id, target = answer_ensemble_r)
write.csv(solution_xgbm, "disaster_tweet_solution_xgbm.csv", row.names = F)
write.csv(solution_rf, "disaster_tweet_solution_rf.csv", row.names = F)
write.csv(solution_svm, "disaster_tweet_solution_svm.csv", row.names = F)
write.csv(solution_ensemble, "disaster_tweet_solution_ensemble.csv", row.names = F)
We obtain the following public leaderboard scores:
Random Forest : 0.77382
XGBM : 0.75237
SVM : 0.78118
Ensemble : 0.78148