Installing Packages

# install.packages("extraTrees")
# install.packages("randomForest")
# install.packages("ranger")
require("klaR")
## Loading required package: klaR
## Loading required package: MASS

Loading Libraries

library(readr)
library(purrr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caTools)
# Extremely Randomized Trees

library(extraTrees)
## Loading required package: rJava
library(MASS) # Modern Applied Statistics with S (4th edition, 2002).
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(ranger)
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
library(ModelMetrics)
## 
## Attaching package: 'ModelMetrics'
## The following objects are masked from 'package:caret':
## 
##     confusionMatrix, precision, recall, sensitivity, specificity
## The following object is masked from 'package:base':
## 
##     kappa
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(e1071) # e1071 for Naïve Bayes Classifier

Text Classification

Real or Not? NLP with Disaster Tweets

What am I predicting?

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

My motivation for using this dataset was simply the ease of finding it on Kaggle; in addition, the idea of predicting earthquakes prompted me to dig into tweets of this sort.

Here is the layout of my analysis for this text classification problem.

  • Exploratory Data Analysis (examining the data)
  • Corpus building and Data Parsing (Pre-Processing)
  • Feature Extraction
  • Model Fitting
    • Random Forest via Ranger, an alternative package for fitting a random forest
    • XGBoost, an alternative boosting package
    • Naïve Bayes Classifier (NBC)
  • Conclusions and Next Steps
df <- read_csv('https://raw.githubusercontent.com/metis-macys-66898/data_607_sp2020/master/train.csv')
## Parsed with column specification:
## cols(
##   id = col_double(),
##   keyword = col_character(),
##   location = col_character(),
##   text = col_character(),
##   target = col_double()
## )

Columns
id - a unique identifier for each tweet
text - the text of the tweet
location - the location the tweet was sent from (may be blank)
keyword - a particular keyword from the tweet (may be blank)
target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

head(df)
## # A tibble: 6 x 5
##      id keyword location text                                             target
##   <dbl> <chr>   <chr>    <chr>                                             <dbl>
## 1     1 <NA>    <NA>     Our Deeds are the Reason of this #earthquake Ma…      1
## 2     4 <NA>    <NA>     Forest fire near La Ronge Sask. Canada                1
## 3     5 <NA>    <NA>     All residents asked to 'shelter in place' are b…      1
## 4     6 <NA>    <NA>     13,000 people receive #wildfires evacuation ord…      1
## 5     7 <NA>    <NA>     Just got sent this photo from Ruby #Alaska as s…      1
## 6     8 <NA>    <NA>     #RockyFire Update => California Hwy. 20 closed …      1
# percentages of NAs for each of the columns in df
map_dbl(df, ~ sum(is.na(.))/nrow(df)*100)
##        id   keyword  location      text    target 
##  0.000000  0.801261 33.285170  0.000000  0.000000
df <- df %>% dplyr::select(-c(id, keyword, location))
head(df)
## # A tibble: 6 x 2
##   text                                                                    target
##   <chr>                                                                    <dbl>
## 1 Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all        1
## 2 Forest fire near La Ronge Sask. Canada                                       1
## 3 All residents asked to 'shelter in place' are being notified by office…      1
## 4 13,000 people receive #wildfires evacuation orders in California             1
## 5 Just got sent this photo from Ruby #Alaska as smoke from #wildfires po…      1
## 6 #RockyFire Update => California Hwy. 20 closed in both directions due …      1
df %>% group_by(target) %>% tally() %>% mutate(Percentage=n/sum(n)*100) %>% round(digits = 2)
## # A tibble: 2 x 3
##   target     n Percentage
##    <dbl> <dbl>      <dbl>
## 1      0  4342       57.0
## 2      1  3271       43.0

We see that 57% of the tweets are labeled 0 (not about a disaster), while 43% are labeled 1 (about a disaster).

Next, we'll build a corpus using tm and perform two data cleaning steps before applying the standard control options that come with DocumentTermMatrix, which I also used in project 3 (haha!):

  1. Remove whitespace
  2. Convert the whole document to lowercase for uniformity
corpus <- tm::VCorpus(tm::VectorSource(df$text))

# 1. remove whitespaces
corpus.ws_rm <- tm::tm_map(corpus, tm::stripWhitespace)

# 2. to_lowercase 
corpus.tolower <- tm::tm_map(corpus.ws_rm , tm::content_transformer(tolower))
control_list <- list(removePunctuation = TRUE,
                     stopwords = TRUE,
                     stemming = TRUE
                    )
  3. Remove punctuation
  4. Remove stopwords
  5. Stem words using Porter's stemming algorithm
# 3-5. remove punctuation, remove stopwords, and stem (all via control_list)
dtm <- tm::DocumentTermMatrix(corpus.tolower, control = control_list)
dtm
## <<DocumentTermMatrix (documents: 7613, terms: 19114)>>
## Non-/sparse entries: 71632/145443250
## Sparsity           : 100%
## Maximal term length: 52
## Weighting          : term frequency (tf)
tm::inspect(dtm[1:13, 1:15])
## <<DocumentTermMatrix (documents: 13, terms: 15)>>
## Non-/sparse entries: 0/195
## Sparsity           : 100%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs 0011 001116 0025 005225 010156 010217 0104 010401 0106 0111 012032 012624
##   1     0      0    0      0      0      0    0      0    0    0      0      0
##   2     0      0    0      0      0      0    0      0    0    0      0      0
##   3     0      0    0      0      0      0    0      0    0    0      0      0
##   4     0      0    0      0      0      0    0      0    0    0      0      0
##   5     0      0    0      0      0      0    0      0    0    0      0      0
##   6     0      0    0      0      0      0    0      0    0    0      0      0
##   7     0      0    0      0      0      0    0      0    0    0      0      0
##   8     0      0    0      0      0      0    0      0    0    0      0      0
##   9     0      0    0      0      0      0    0      0    0    0      0      0
##   10    0      0    0      0      0      0    0      0    0    0      0      0
##   11    0      0    0      0      0      0    0      0    0    0      0      0
##   12    0      0    0      0      0      0    0      0    0    0      0      0
##   13    0      0    0      0      0      0    0      0    0    0      0      0
##     Terms
## Docs 015025 0206 0215
##   1       0    0    0
##   2       0    0    0
##   3       0    0    0
##   4       0    0    0
##   5       0    0    0
##   6       0    0    0
##   7       0    0    0
##   8       0    0    0
##   9       0    0    0
##   10      0    0    0
##   11      0    0    0
##   12      0    0    0
##   13      0    0    0

The Document Term Matrix has 19,114 terms across a total of 7,613 documents. The reported sparsity of 100% (a rounded figure) means nothing has been removed from the Document Term Matrix yet. I'm going to set the sparsity threshold to 99.9%, meaning a sparse term that appears in fewer than 0.1% of the documents will be removed.

# remove sparse terms that appear in less than 0.1% of documents, i.e. setting the sparsity threshold at 99.9%
dtm <- tm::removeSparseTerms(dtm, sparse = 0.999) 
# Converting DocumentTermMatrix into a data frame 
tweet_final_df <- data.frame(as.matrix(dtm), stringsAsFactors = FALSE)

# Adding the target column to enable modeling. Each obs in tweet_final_df is at the tweet level.
tweet_final_df$target <- df$target

Splitting the data into train and test sets, 80/20.

  • train_df is the training set of 6,085 observations
  • test_df is the test set of 1,528 observations
set.seed(8898)
split = sample.split(tweet_final_df, SplitRatio = 0.80)

train_df = subset(tweet_final_df, split == TRUE)
test_df  = subset(tweet_final_df, split == FALSE)
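
Note that sample.split() is usually given the outcome vector rather than a whole data frame, which makes the split stratified on the label. A hedged sketch of that more conventional variant (not the split used for the results below; the _strat names are illustrative):

# sketch only: stratified 80/20 split on the outcome, preserving the 57/43 class balance
# set.seed(8898)
# split_strat <- sample.split(tweet_final_df$target, SplitRatio = 0.80)
# train_strat <- subset(tweet_final_df, split_strat == TRUE)
# test_strat  <- subset(tweet_final_df, split_strat == FALSE)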



# QA step: 
unique(train_df$target)
## [1] 1 0
tally(train_df)
##      n
## 1 6085
unique(test_df$target)
## [1] 1 0
tally(test_df)
##      n
## 1 1528

Model selection and comparison

There are two criteria for evaluating the models. One is the confusion matrix. The other is run time, for practicality's sake: in real-world situations, a model has to pass a run-time test to meet response-time thresholds.
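
To make that comparison concrete, here is a minimal helper sketch (my own illustration, not part of the original analysis; evaluate_fit and its arguments are hypothetical names) that captures both criteria at once:

# times a training call and scores 0/1 predictions on test_df against test_df$target
evaluate_fit <- function(train_fun, predict_fun) {
  elapsed <- system.time(fit <- train_fun())[["elapsed"]]   # run-time criterion
  preds   <- predict_fun(fit)                               # 0/1 predictions on test_df
  cm <- caret::confusionMatrix(data      = factor(preds, levels = c(1, 0)),
                               reference = factor(test_df$target, levels = c(1, 0)))
  list(elapsed_secs = elapsed, confusion = cm)              # confusion-matrix criterion
}
# e.g. evaluate_fit(function() ranger(target ~ ., data = train_df),
#                   function(f) ifelse(predict(f, test_df)$predictions > 0.5, 1, 0))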

str(train_df)
## 'data.frame':    6085 obs. of  1573 variables:
##  $ X100             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1000            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X11yearold       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X12000           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X16yr            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1980            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1st             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X2011            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X2013            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X2014            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X2015            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X2nd             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X4000            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X4x4             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X500             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X5km             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X600             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X70th            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X911             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ aba              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abandon          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abc              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abl              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ablaz            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ absolut          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abstorm          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abus             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ access           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ accid            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ accord           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ account          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ accus            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ acr              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ across           : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ act              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ action           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ activ            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ actual           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ admit            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ adult            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ advanc           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ advisori         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ affect           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ afghanistan      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ africa           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ afternoon        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ aftershock       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ age              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ago              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ agre             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ahead            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ aint             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ air              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ aircraft         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ airlin           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ airplan          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ airport          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ alabama          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ alarm            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ album            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ aliv             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ allah            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ alleg            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ allow            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ almost           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ alon             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ along            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ alreadi          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ also             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ alway            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ amaz             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ amazon           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ambul            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ america          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ american         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ amid             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ among            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ amp              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ancient          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ angel            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ angri            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anim             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ annihil          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anniversari      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ announc          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anoth            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ answer           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anthrax          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ antioch          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anymor           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anyon            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anyth            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ apart            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ apc              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ apocalyps        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ apollo           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ app              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ appar            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ appear           : num  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]
# cv6 = trainControl(method = "cv", number = 6)
# cv4 = trainControl(method = "cv", number = 4)
# cv2 = trainControl(method = "cv", number = 2)
# rf_grid =  expand.grid(mtry = 1:13) # 1573 terms
# rf_grid =  expand.grid(mtry = 1:14)

Train the model

Ranger

A fast implementation of random forests (Breiman 2001) or recursive partitioning, particularly suited for high dimensional data.
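
The (uncommented) call below fits a regression forest on the numeric 0/1 target and thresholds its predictions afterwards. A hedged classification variant, not what was run here (ranger_clf is an illustrative name), would return class probabilities directly:

# sketch only: a probability forest on a factor target
# ranger_clf <- ranger(as.factor(target) ~ ., data = train_df, probability = TRUE)
# predict(ranger_clf, test_df)$predictions   # n x 2 matrix of class probabilities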

# set.seed(43)
# system.time({ranger_fit = train(as.factor(target) ~ ., data = train_df,
#                                 method = "ranger",
#                                 trControl = cv4,
#                                 num.threads = 1,
#                                 tuneGrid = rf_grid)
#             }
#           )

time <- system.time({
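        # note: target is numeric (0/1), so ranger grows a regression forest here;
        # its continuous predictions are thresholded at 0.5 below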
        ranger_fit = ranger(target ~ ., data = train_df)
})
## Growing trees.. Progress: 39%. Estimated remaining time: 48 seconds.
## Growing trees.. Progress: 80%. Estimated remaining time: 15 seconds.
mins = unlist(sum(time[2] + time[3]))%/%60
secs = unlist(sum(time[2] + time[3]))%%60
cat("Time taken is:", mins , " mins and ", secs, " seconds.")
## Time taken is: 1  mins and  17.752  seconds.
# calculating Root-Mean Square Error (RMSE)
cat("Ranger's RMSE is:", rmse(predict(ranger_fit, test_df)$predictions, test_df$target) )
## Ranger's RMSE is: 0.3850353
# Confusion Matrix
Ranger_pred <- predict(ranger_fit, test_df)

Ranger_pred <- ifelse(Ranger_pred$predictions > 0.5, 1,0)

caret::confusionMatrix(data = factor(Ranger_pred, levels=c(1,0)),
                        reference =factor(test_df$target, levels=c(1,0))
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1 494 151
##          0 171 712
##                                           
##                Accuracy : 0.7893          
##                  95% CI : (0.7679, 0.8095)
##     No Information Rate : 0.5648          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5699          
##                                           
##  Mcnemar's Test P-Value : 0.2897          
##                                           
##             Sensitivity : 0.7429          
##             Specificity : 0.8250          
##          Pos Pred Value : 0.7659          
##          Neg Pred Value : 0.8063          
##              Prevalence : 0.4352          
##          Detection Rate : 0.3233          
##    Detection Prevalence : 0.4221          
##       Balanced Accuracy : 0.7839          
##                                           
##        'Positive' Class : 1               
## 
# Accuracy: 0.7893 (Balanced Accuracy: 0.7839)

Time taken: 1 min and 17.8 seconds. Accuracy: 0.7893 (Balanced Accuracy: 0.7839).

XGBoost : eXtreme Gradient Boosting

A gradient boosting method that builds an ensemble of decision trees sequentially, with each new tree fitted to the residual errors of the trees before it; the parameter nrounds controls the maximum number of boosting iterations.
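
The run below uses xgboost's default squared-error regression objective on the 0/1 label. A hedged variant (not the run shown here) would make the classification objective explicit so that predictions come back as probabilities:

# sketch only: a binary logistic objective for the 0/1 target
# XGBoost_clf <- xgboost(as.matrix(train_df[-which(colnames(train_df) == "target")]),
#                        label = as.vector(train_df$target),
#                        nrounds = 100, objective = "binary:logistic")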

# removing the target variable (the dependent variable), found at column location 1573.
# which( colnames(train_df)=="target" )
time <- system.time({
            XGBoost <-xgboost(as.matrix(train_df[-which( colnames(train_df)=="target" )]), label=as.vector(train_df$target),nrounds=100)
})
## [1]  train-rmse:0.483764 
## [2]  train-rmse:0.474891 
## [3]  train-rmse:0.467010 
## [4]  train-rmse:0.460302 
## [5]  train-rmse:0.453979 
## [6]  train-rmse:0.449407 
## [7]  train-rmse:0.445298 
## [8]  train-rmse:0.440646 
## [9]  train-rmse:0.437026 
## [10] train-rmse:0.432906 
## [11] train-rmse:0.429013 
## [12] train-rmse:0.425450 
## [13] train-rmse:0.422317 
## [14] train-rmse:0.419028 
## [15] train-rmse:0.416314 
## [16] train-rmse:0.414121 
## [17] train-rmse:0.411551 
## [18] train-rmse:0.409194 
## [19] train-rmse:0.406875 
## [20] train-rmse:0.404494 
## [21] train-rmse:0.401546 
## [22] train-rmse:0.399059 
## [23] train-rmse:0.396652 
## [24] train-rmse:0.394843 
## [25] train-rmse:0.392673 
## [26] train-rmse:0.390988 
## [27] train-rmse:0.388743 
## [28] train-rmse:0.386863 
## [29] train-rmse:0.384953 
## [30] train-rmse:0.383363 
## [31] train-rmse:0.381809 
## [32] train-rmse:0.380407 
## [33] train-rmse:0.378879 
## [34] train-rmse:0.377575 
## [35] train-rmse:0.376003 
## [36] train-rmse:0.374952 
## [37] train-rmse:0.373539 
## [38] train-rmse:0.372328 
## [39] train-rmse:0.370357 
## [40] train-rmse:0.369158 
## [41] train-rmse:0.367992 
## [42] train-rmse:0.366645 
## [43] train-rmse:0.365715 
## [44] train-rmse:0.364529 
## [45] train-rmse:0.363445 
## [46] train-rmse:0.361855 
## [47] train-rmse:0.360839 
## [48] train-rmse:0.359638 
## [49] train-rmse:0.358672 
## [50] train-rmse:0.357513 
## [51] train-rmse:0.356794 
## [52] train-rmse:0.356075 
## [53] train-rmse:0.355024 
## [54] train-rmse:0.354131 
## [55] train-rmse:0.353301 
## [56] train-rmse:0.352410 
## [57] train-rmse:0.351152 
## [58] train-rmse:0.349905 
## [59] train-rmse:0.348713 
## [60] train-rmse:0.347928 
## [61] train-rmse:0.347172 
## [62] train-rmse:0.346478 
## [63] train-rmse:0.345863 
## [64] train-rmse:0.345050 
## [65] train-rmse:0.344324 
## [66] train-rmse:0.343661 
## [67] train-rmse:0.342598 
## [68] train-rmse:0.341967 
## [69] train-rmse:0.341097 
## [70] train-rmse:0.340472 
## [71] train-rmse:0.339706 
## [72] train-rmse:0.339176 
## [73] train-rmse:0.338287 
## [74] train-rmse:0.337354 
## [75] train-rmse:0.336538 
## [76] train-rmse:0.335854 
## [77] train-rmse:0.335189 
## [78] train-rmse:0.334598 
## [79] train-rmse:0.333663 
## [80] train-rmse:0.333162 
## [81] train-rmse:0.332370 
## [82] train-rmse:0.331946 
## [83] train-rmse:0.331268 
## [84] train-rmse:0.330723 
## [85] train-rmse:0.330015 
## [86] train-rmse:0.329277 
## [87] train-rmse:0.328396 
## [88] train-rmse:0.327616 
## [89] train-rmse:0.327189 
## [90] train-rmse:0.326606 
## [91] train-rmse:0.325664 
## [92] train-rmse:0.325031 
## [93] train-rmse:0.324385 
## [94] train-rmse:0.323741 
## [95] train-rmse:0.323354 
## [96] train-rmse:0.322666 
## [97] train-rmse:0.321842 
## [98] train-rmse:0.320970 
## [99] train-rmse:0.320565 
## [100]    train-rmse:0.320040
mins = unlist(sum(time[2] + time[3]))%/%60
secs = unlist(sum(time[2] + time[3]))%%60
cat("Time taken is:", mins , " mins and ", secs, " seconds.")
## Time taken is: 0  mins and  57.078  seconds.
XGBoostpred <- predict(XGBoost, as.matrix(test_df[-which( colnames(train_df)=="target" )]))


# XGBoostpred

# recoding the continuous prediction (a score between 0 and 1) into the categorical labels 1 and 0 using a 0.5 cutoff
XGBoostpred <- ifelse(XGBoostpred >0.5, 1,0)
caret::confusionMatrix(data = factor(XGBoostpred, levels=c(1,0)),
                        reference =factor(test_df$target, levels=c(1,0))
                        
                      )
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   0
##          1 453  85
##          0 212 778
##                                           
##                Accuracy : 0.8056          
##                  95% CI : (0.7849, 0.8252)
##     No Information Rate : 0.5648          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5958          
##                                           
##  Mcnemar's Test P-Value : 2.646e-13       
##                                           
##             Sensitivity : 0.6812          
##             Specificity : 0.9015          
##          Pos Pred Value : 0.8420          
##          Neg Pred Value : 0.7859          
##              Prevalence : 0.4352          
##          Detection Rate : 0.2965          
##    Detection Prevalence : 0.3521          
##       Balanced Accuracy : 0.7914          
##                                           
##        'Positive' Class : 1               
## 
# Accuracy by nrounds:
# 10 = 0.7297 (Balanced Accuracy: 0.7000)
# 51 = 0.7971 (Balanced Accuracy: 0.7809)  
# 94 = 0.7991 (Balanced Accuracy: 0.7849)
# 100 = 0.8056 (Balanced Accuracy: 0.7914)
# 104 = 0.8037 (Balanced Accuracy: 0.7891)
# 204 = 0.7997 (Balanced Accuracy: 0.7877)

Time taken: 57.1 seconds. Accuracy: 0.8056 (Balanced Accuracy: 0.7914).

Naïve Bayes Classifier

A simple probabilistic classifier based on Bayes' theorem with strong (naïve) independence assumptions between features. Historically, this technique became popular in email filtering, spam detection, and document categorization. Although it is often outperformed by other techniques, and despite its naïve design and oversimplified assumptions, this classifier can perform well on many complex real-world problems. And since it is a resource-efficient algorithm that is fast and scales well, it is definitely a machine learning algorithm to have in your toolkit.
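
Concretely, the naïve independence assumption means the class posterior factorizes into a product of per-term likelihoods (a textbook statement of the model, stated here for reference rather than computed in this analysis):

P(y | x1, ..., xn) ∝ P(y) × P(x1 | y) × ... × P(xn | y)

where y is the target class (0 or 1) and x1, ..., xn are the term features from the DTM.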

# create response and feature data
features <- setdiff(names(train_df), "target") # "target" must match the column name's case, or the label would stay in the feature set
x <- train_df[, features]
y <- train_df$target


# set up 10-fold cross validation procedure
# train_control <- trainControl(
#   method = "cv", 
#   number = 10
#   )
# 
# # train model
# nb.m1 <- train(
#   x = x,
#   y = factor(y, levels = c(1,0)),
#   method = "nb",
#   trControl = train_control
#   )
# 
# # results
# confusionMatrix(nb.m1)

time <- system.time({
              NBC <- naiveBayes(train_df[, features], as.factor(y))
})

cat("Time taken is:", sum(time[2] + time[3]), " seconds.")
## Time taken is: 1.034  seconds.
NBC_pred <-predict(NBC,test_df[, features])

caret::confusionMatrix(data = NBC_pred, reference = as.factor(test_df$target), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 765 392
##          1  98 273
##                                           
##                Accuracy : 0.6793          
##                  95% CI : (0.6553, 0.7027)
##     No Information Rate : 0.5648          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3128          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4105          
##             Specificity : 0.8864          
##          Pos Pred Value : 0.7358          
##          Neg Pred Value : 0.6612          
##              Prevalence : 0.4352          
##          Detection Rate : 0.1787          
##    Detection Prevalence : 0.2428          
##       Balanced Accuracy : 0.6485          
##                                           
##        'Positive' Class : 1               
## 
# Accuracy:  0.6793 (Balanced Accuracy: 0.6485)

Time taken: 1.03 seconds. Accuracy: 0.6793 (Balanced Accuracy: 0.6485).

Conclusions

For this classification modeling exercise, we clearly have to evaluate not only accuracy (or balanced accuracy, which accounts for both classes) but also the time taken to train each model.

  • XGBoost: accuracy 0.8056 (balanced accuracy 0.7914), about 57 seconds to train
  • Ranger: accuracy 0.7893 (balanced accuracy 0.7839), about 78 seconds to train
  • Naïve Bayes: accuracy 0.6793 (balanced accuracy 0.6485), about 1 second to train

From the accuracy perspective, XGBoost (eXtreme Gradient Boosting) is by far the most accurate, at roughly 81% accuracy and 79% balanced accuracy. The Naïve Bayes Classifier is by far the fastest for this problem, taking only about a second to train.

Ranger, which implements the random forest algorithm (Breiman 2001), lands in between: a decent 79% accuracy and 78% balanced accuracy, but the longest training time of the three.

Next Steps

I only used the default algorithm for the Naïve Bayes Classifier. I believe there are more ways to tune the model, and that's something I can invest some time into as the next stage of this project.
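
As a starting point, here is a hedged sketch of what that tuning could look like via caret's "nb" method (backed by klaR, which is loaded at the top of this document); fL is Laplace smoothing, usekernel toggles kernel density estimates, and adjust scales the kernel bandwidth:

# sketch only: 5-fold CV over a small Naive Bayes tuning grid
# nb_grid  <- expand.grid(fL = c(0, 0.5, 1), usekernel = c(TRUE, FALSE), adjust = 1)
# nb_tuned <- caret::train(x = train_df[, features],
#                          y = factor(train_df$target, levels = c(1, 0)),
#                          method = "nb",
#                          trControl = caret::trainControl(method = "cv", number = 5),
#                          tuneGrid = nb_grid)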

Weighting is something I intentionally left out this time, after using weights in project 3 (the group project). I believe there is value in it, but I wanted to keep everything as simple and as smooth-running as possible before comparing the models. Of course, some terms deserve extra weight: if a tweet contains strings like USGS or earthquake.usgs.gov, I'd definitely weight those terms more heavily when determining whether the disaster is real.
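
A hedged sketch of one way to get there with tm: rebuild the DTM with tf-idf weighting, which automatically up-weights distinctive terms (such as a stemmed "usgs") relative to common ones:

# sketch only: tf-idf weighting instead of raw term frequency
# dtm_tfidf <- tm::DocumentTermMatrix(corpus.tolower,
#                                     control = list(removePunctuation = TRUE,
#                                                    stopwords = TRUE,
#                                                    stemming = TRUE,
#                                                    weighting = tm::weightTfIdf))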

References

  1. Real or Not? NLP with Disaster Tweets|Data
  2. Naïve Bayes Classifier
  3. Extremely Randomized Trees, Ranger, XGBoost
  4. Naïve Bayes – Specific Classifier Optimizations