Summary

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data (Wikipedia).

A particular challenge in this field is structuring the data so that it fits the requirements of ML models. In this notebook we will use the data from Real or Not? NLP with Disaster Tweets to predict whether a tweet announcing a disaster is real or not. We will begin by exploring each of the variables, then clean and process the text strings to give them an appropriate structure for building an ML model.

Exploratory analysis

We will start by exploring what the dataset contains. Below are the first 6 observations of the data set, which show that it consists of 5 columns:

  • id: a unique identifier for each tweet.
  • keyword: a particular keyword from the tweet (may be blank).
  • location: the location the tweet was sent from (may be blank).
  • text: the text of the tweet.
  • target: denotes whether a tweet is about a real disaster (1) or not (0).
Table 1. Dataset sample.
id keyword location text target
1 1 Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all 1
2 4 Forest fire near La Ronge Sask. Canada 1
3 5 All residents asked to ‘shelter in place’ are being notified by officers. No other evacuation or shelter in place orders are expected 1
4 6 13,000 people receive #wildfires evacuation orders in California 1
5 7 Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 1
6 8 #RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires 1
† first 6 observations of the data set

Target

The target distribution is relatively balanced: 43% of the observations are disasters and 57% are not.

Table 2. Target distribution.
target count
1 0 4342
2 1 3271

Keywords

The keyword column contains a representative word from the tweet; the first 10 values are shown below.

Table 3. keyword listing.
keyword
1
2 ablaze
3 accident
4 aftershock
5 airplane%20accident
6 ambulance
7 annihilated
8 annihilation
9 apocalypse
10 armageddon
† 10 keywords of 222

Looking at the keywords, some contain a %20 in their description; we will correct this by replacing every %20 with a space. Then we will review the number of times each word appears in the keyword column; the graph shows 40 keywords sorted by number of occurrences. A minimal sketch of these steps is shown below.
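
A minimal sketch of this cleanup and count, assuming the training data has been loaded into a data frame called train (the same object used in the tokenization code further below); the keyword_count and disaster names are illustrative:

library(dplyr)

# decode the URL-encoded spaces in the keyword column
train$keyword <- gsub("%20", " ", train$keyword)

# occurrences of each keyword and percentage of disaster tweets per keyword
keyword_count <- train %>%
   filter(keyword != "") %>%
   group_by(keyword) %>%
   summarise(count = n(), disaster = mean(target) * 100) %>%
   arrange(desc(count))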

It is useful to observe which are the keywords that are most associated with disasters or non-disasters.

Looking at the keywords for which more than 75% of the tweets are disasters, the terms are associated with disasters that actually occur (e.g. outbreak) and with related activities (e.g. rescuers).

On the other hand, among the keywords for which fewer than 16% of the tweets are disasters there are many ambiguous terms, i.e. terms that a human might associate with a disaster from the keyword alone but not from the context (e.g. deluge). There are also terms related to feelings of fear (e.g. fear or panicking). This makes it difficult to estimate whether a tweet is about a disaster using only the keyword.

Now we will check whether the sentiment associated with each keyword indicates a relationship with the target. For this we will perform a sentiment analysis of each keyword.
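
A sketch of this step, assuming the NRC lexicon via tidytext (its 10 categories match the 10 sentiment variables listed later); the exact lexicon and helper names in the notebook may differ:

library(dplyr)
library(tidytext)
library(tidyr)

# one sentiment count column per NRC category for each keyword
keyword_sent <- train %>%
   select(id, keyword) %>%
   unnest_tokens(word, keyword) %>%
   inner_join(get_sentiments("nrc"), by = "word") %>%
   count(id, sentiment) %>%
   pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)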

The correlation matrix shows essentially zero correlation between the target and each of the sentiments.

Reviewing in greater detail, most keywords do not have an associated sentiment. The keywords that do have a sentiment value are mostly associated with negative, fear or sadness, which is quite consistent with the problem, since we are trying to predict disasters.

Location

The location column contains the location from which the tweet was sent; the first 10 values are shown below.

Table 4. location listing.
location
1
2 Birmingham
3 Est. September 2012 - Bristol
4 AFRICA
5 Philadelphia, PA
6 London, UK
7 Pretoria
8 World Wide!!
9 Paranaque City
10 Live On Webcam
† 10 locations of 3294

There are 3294 distinct locations; some appear several times, but most have fewer than 12 observations. This makes the variable not very reliable for prediction.

As with keyword, it is interesting to review how the target variable behaves with respect to location.

This variable will not be used to predict disasters, as it is unreliable: there are nonsense locations (e.g. ??? Dreamz) and there are many distinct locations relative to the number of observations (3294 out of 7613, ~43%).

Text

Now we will review the text variable and take the opportunity to create auxiliary variables that typically help extract useful information from texts: in this case nword0 and nchar0, the number of words and characters before processing the text.
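
A minimal sketch of these auxiliary variables, assuming stringr; the exact counting method used in the notebook may differ:

library(stringr)

train$nword0 <- str_count(train$text, "\\S+")   # number of words before processing
train$nchar0 <- nchar(train$text)               # number of characters before processing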

Table 5. text listing.
text
1 Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
2 Forest fire near La Ronge Sask. Canada
3 All residents asked to ‘shelter in place’ are being notified by officers. No other evacuation or shelter in place orders are expected
4 13,000 people receive #wildfires evacuation orders in California
5 Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
6 #RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
7 #flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas
8 I’m on top of the hill and I can see a fire in the woods…
9 There’s an emergency evacuation happening now in the building across the street
10 I’m afraid that the tornado is coming to our area…

Now we proceed to clean the text according to the following steps (a minimal sketch of the cleaning function follows the list):

  • Eliminate punctuation marks.
  • Remove web pages.
  • Delete numbers.
  • Eliminate line breaks.
  • Remove single letters.
  • Remove stopwords.
  • Remove extra spaces.
  • Transform to lower case.
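
A minimal sketch of a cleaning function covering the steps above, assuming base R regular expressions and the tm stopword list; the exact regexes used in the notebook may differ:

library(tm)   # for stopwords() and removeWords()

clean_text <- function(x) {
   x <- gsub("http\\S+|www\\.\\S+", " ", x)    # remove web pages (URLs)
   x <- gsub("[[:punct:]]+", " ", x)           # eliminate punctuation marks
   x <- gsub("[[:digit:]]+", " ", x)           # delete numbers
   x <- gsub("[\r\n]+", " ", x)                # eliminate line breaks
   x <- tolower(x)                             # transform to lower case
   x <- removeWords(x, stopwords("en"))        # remove stopwords
   x <- gsub("\\b[a-z]\\b", " ", x)            # remove single letters
   x <- gsub("\\s+", " ", trimws(x))           # remove extra spaces
   x
}

train$text <- clean_text(train$text)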

In addition, 4 more auxiliary variables will be created: nword1, nchar1, rel_0 and rel_1. The first two are the number of words and characters after processing, and the last two are the ratios between the number of words and the number of characters before and after processing, respectively.
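
A sketch of these variables, assuming the same counting helpers as before; the direction of the ratio (words per character) is an assumption:

train$nword1 <- str_count(train$text, "\\S+")   # words after processing
train$nchar1 <- nchar(train$text)               # characters after processing
train$rel_0  <- train$nword0 / train$nchar0     # assumed ratio: words per character, before processing
train$rel_1  <- train$nword1 / train$nchar1     # assumed ratio: words per character, after processing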

Table 6. text listing (after processing).
text
1 deeds reason earthquake may allah forgive us
2 forest fire near la ronge sask canada
3 residents asked shelter place notified officers evacuation shelter place orders expected
4 people receive wildfires evacuation orders california
5 just got sent photo ruby alaska smoke wildfires pours school
6 rockyfire update california hwy closed directions due lake county fire cafire wildfires
7 flood disaster heavy rain causes flash flooding streets manitou colorado springs areas
8 top hill can see fire woods
9 emergency evacuation happening now building across street
10 afraid tornado coming area

These new variables have a low correlation with target, although it is higher than the correlations obtained from the keyword information.

Looking at the 1-grams, the words most associated with disasters name either a type of disaster or tasks related to disasters.

For the 2-grams and 3-grams we see the same pattern as with the 1-grams, plus a temporal context: the tweets were written while there was a fire in California and a disaster in Malaysia.

Looking at the distributions in detail, there is no clear separation between disaster and non-disaster.

Next we will perform a sentiment analysis of the text variable to check if we can extract any relationship with target.

The correlation between the text sentiments and target is low. Observing the distributions in detail, the fear and negative variables of text could be useful, since the other sentiments have very few values different from 0.

Now we will tokenize the observations so that we can work with them, eliminating the least common words.

# Tokenizer functions that build 1-, 2- and 3-grams from each document
# (words() and ngrams() come from the NLP package)
token1gram <- function(x) {
   unlist(lapply(NLP::ngrams(words(x), 1), paste, collapse = " "),
          use.names = FALSE)}
token2gram <- function(x) {
   unlist(lapply(NLP::ngrams(words(x), 2), paste, collapse = " "),
          use.names = FALSE)}
token3gram <- function(x) {
   unlist(lapply(NLP::ngrams(words(x), 3), paste, collapse = " "),
          use.names = FALSE)}


# Document-term matrix of 1-grams (tm): numbers and stopwords removed, stemming applied
dtm <- train$text %>% VectorSource() %>% VCorpus() %>%
   DocumentTermMatrix(control = list(removeNumbers = TRUE, stopwords = TRUE,
                                     stemming = TRUE, tokenize = token1gram)) %>%
   removeSparseTerms(1-(1-0.995)/1) %>% as.matrix() %>% as.data.frame() %>% cbind(target = train$target)

# Same for 2-grams, with a stricter sparsity threshold
dtm2 <- train$text %>% VectorSource() %>% VCorpus() %>%
   DocumentTermMatrix(control = list(removeNumbers = TRUE, stopwords = TRUE,
                                     stemming = TRUE, tokenize = token2gram)) %>%
   removeSparseTerms(1-(1-0.995)/3) %>% as.matrix() %>% as.data.frame() %>% cbind(target = train$target)

# Same for 3-grams, with the strictest sparsity threshold
dtm3 <- train$text %>% VectorSource() %>% VCorpus() %>%
   DocumentTermMatrix(control = list(removeNumbers = TRUE, stopwords = TRUE,
                                     stemming = TRUE, tokenize = token3gram)) %>%
   removeSparseTerms(1-(1-0.995)/5) %>% as.matrix() %>% as.data.frame() %>% cbind(target = train$target)


# For each term: total count across tweets and the percentage of those
# occurrences that belong to disaster tweets
col_dtm <- data.frame(word = colnames(dtm), colsum = colSums(dtm), 
                      disaster = colSums(dtm*dtm$target)*100/colSums(dtm), row.names = NULL)

# 40 most frequent 1-grams, colored by the disaster percentage
# (assumes dplyr and ggplot2 are loaded and cols() is a palette function defined earlier)
col_dtm %>% arrange(desc(colsum)) %>% filter(word != "target") %>% head(40) %>% 
   ggplot(aes(x = reorder(word,colsum), y = colsum, fill = disaster)) + geom_col() +
   scale_fill_gradientn(colours = cols(20)) +
   geom_text(aes(label = as.character(round(disaster,0))), 
             position = position_dodge(width = 0.9), size=2.5, vjust=0, hjust=-0.5 ,col="black")+
   labs(y = "count", x = "1-grams", title = "1-grams count in training set", 
        caption = "color represents the percentage at which the word appears related to a disaster") +
   coord_flip()

Predictions

Split data set

After all the analyses performed above, the variables are as follows:

  • nword0: word count of the original variable text before processing.
  • nword1: word count of the original variable text after processing.
  • nchar0: character count of the original variable text before processing.
  • nchar1: character count of the original variable text after processing.
  • rel_0: relationship between nword0 and nchar0.
  • rel_1: relationship between nword1 and nchar1.
  • key_x: one variable per keyword, where x represents the keyword. In total there are 221 variables.
  • keyword_sent: each of the sentiments associated with the keywords. In total there are 10 variables associated with sentiments.
  • text_sent: each of the sentiments associated with the variable text. In total there are 10 variables associated with sentiments.
  • 1-grams: 1-grams of the variable text remaining after removing sparse terms. There are 327 variables.
  • 2-grams: 2-grams of the variable text remaining after removing sparse terms. There are 242 variables.
  • 3-grams: 3-grams of the variable text remaining after removing sparse terms. There are 305 variables.
  • target: the original target variable.

The dataset is split into a training set and a test set with approximately 80% and 20% of the observations, respectively (see Table 7).
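
A sketch of the split, assuming caret::createDataPartition and that all the variables listed above have been combined into a data frame called full (both the name and the seed are illustrative):

library(caret)

set.seed(123)                                              # illustrative seed
idx      <- createDataPartition(full$target, p = 0.8, list = FALSE)
trainset <- full[idx, ]
testset  <- full[-idx, ]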

Table 7. train and test set dimensions.
rows cols
trainset 6091 1122
testset 1522 1122

Model

For the predictions, an xgbTree model will be used, tuned according to the ROC metric.

The training configuration of the model is as follows (a sketch of the training call is shown after the list):

  • 5-fold cross validation.
  • random search with 50 random points per fold.
  • up-sampling, with centering and scaling as preprocessing.
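
A sketch of the training call with caret under the settings above; the control object and argument names follow caret's API, but the exact call in the notebook may differ:

library(caret)

ctrl <- trainControl(method = "cv", number = 5,            # 5-fold cross validation
                     search = "random",                    # random hyperparameter search
                     sampling = "up",                      # up-sampling
                     classProbs = TRUE,                    # needed for the ROC metric
                     summaryFunction = twoClassSummary)

# target is assumed to be a factor with levels X0/X1, as in the confusion matrices below
fit <- train(target ~ ., data = trainset,
             method = "xgbTree",
             metric = "ROC",
             tuneLength = 50,                              # 50 random points
             preProcess = c("center", "scale"),
             trControl = ctrl)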

The model delivers good ROC values, in a range between 0.7 and 0.83.

The most important variables are those created from the number of words and characters, as well as some derived from the sentiment analysis. The importance of the variable rel_0 is remarkable: to identify a disaster, the relationship between the number of words and their length matters.

Observing the confusion matrices of the training and test sets below, the results are very similar, so we can rule out overfitting.
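
The matrices below were presumably produced with caret::confusionMatrix; a sketch, assuming the fitted model is called fit:

confusionMatrix(predict(fit, trainset), trainset$target)   # training set
confusionMatrix(predict(fit, testset),  testset$target)    # test set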

Training set:

Confusion Matrix and Statistics

          Reference
Prediction   X0   X1
        X0 2958  639
        X1  516 1978
                                          
               Accuracy : 0.8104          
                 95% CI : (0.8003, 0.8202)
    No Information Rate : 0.5703          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6108          
                                          
 Mcnemar's Test P-Value : 0.0003309       
                                          
            Sensitivity : 0.8515          
            Specificity : 0.7558          
         Pos Pred Value : 0.8224          
         Neg Pred Value : 0.7931          
             Prevalence : 0.5703          
         Detection Rate : 0.4856          
   Detection Prevalence : 0.5905          
      Balanced Accuracy : 0.8036          
                                          
       'Positive' Class : X0              
                                          
Test set:

Confusion Matrix and Statistics

          Reference
Prediction  X0  X1
        X0 721 175
        X1 147 479
                                         
               Accuracy : 0.7884         
                 95% CI : (0.767, 0.8087)
    No Information Rate : 0.5703         
    P-Value [Acc > NIR] : <2e-16         
                                         
                  Kappa : 0.566          
                                         
 Mcnemar's Test P-Value : 0.1324         
                                         
            Sensitivity : 0.8306         
            Specificity : 0.7324         
         Pos Pred Value : 0.8047         
         Neg Pred Value : 0.7652         
             Prevalence : 0.5703         
         Detection Rate : 0.4737         
   Detection Prevalence : 0.5887         
      Balanced Accuracy : 0.7815         
                                         
       'Positive' Class : X0