Executive Summary

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships that can be observed in the data and to prepare for building our first linguistic models.

This work is part of the Capstone project with SwiftKey, which provided the dataset. The main library used here is quanteda.

This report aims to cover and answer the following:

  1. Exploratory analysis, understanding the distribution of words and relationship between them
  2. Understand frequencies of words and word pairs

There is no clear conclusion in this analysis, but rather some intelligence gathering for the rest of our project. Mostly, we identified which preparatory steps and parameter values (i.e. cleaning) we should use.


Exploratory Analysis

File-checks & Corpus

First of all, using a homemade function (built around quanteda::corpus(readtext(file))), we create a quanteda corpus and load it with samples of the 3 files from the dataset (samples speed things up, as we are still in discovery mode, not yet modelling). The function also prints some basic summaries of each loaded file.
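
For reference, here is a minimal sketch of what loadFile2Quanteda could look like, based on the description above; the actual implementation is not shown in this report, so this is only an assumption:

library(quanteda)
library(readtext)

loadFile2Quanteda <- function(file) {
    # print basic summaries of the file being loaded
    print(paste("filename:", basename(file)))
    print(paste("size:", format(file.size(file) / 2^20, digits=3), "MB"))
    print(paste("nb. lines:", length(readLines(file, skipNul=TRUE))))
    # build the quanteda corpus from the (pre-sampled) file
    corpus(readtext(file))
}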

crps <- loadFile2Quanteda(filepath_blogs)
## [1] "filename: en_US.blogs.sample.txt"
## [1] "size: 2.26 MB"
## [1] "nb. lines: 30000"
crps <- crps + loadFile2Quanteda(filepath_news)
## [1] "filename: en_US.news.sample.txt"
## [1] "size: 1.95 MB"
## [1] "nb. lines: 30000"
crps <- crps + loadFile2Quanteda(filepath_twitter)
## [1] "filename: en_US.twitter.sample.txt"
## [1] "size: 2.02 MB"
## [1] "nb. lines: 30000"
summary(crps, showmeta=T)

Tokens and TDM/DFM

We now proceed to the tokenization of the corpus documents and build the document-feature matrix (DFM), quanteda's equivalent of a term-document matrix (TDM).
In order to better understand the corpus, several tokenizations and DFMs are created by playing with the available options:

  1. raw: untouched text from the dataset
  2. lowercase: uniformisation of the text (so ‘House’ and ‘house’ can be treated as the same word)
  3. clean: removing punctuation and numbers; a "twitless" variant additionally removes hashtags from tweets
  4. filter: filtering out strong language (using the WordPress profanity list; see the note below)
  5. stopword: removing low-value words (e.g. the, to, of, in, …) to better identify meaningful words
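
Note that the profanities object used in the code below is assumed to be a character vector containing the terms of the WordPress profanity list; its construction is not shown in this report. A minimal sketch could be:

profanities <- readLines("profanity_list.txt", skipNul=TRUE)  # hypothetical local copy of the WordPress list
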
tkns_raw   <- tokens(crps)
tkns_clean <- tokens(crps, remove_numbers=T, remove_punct=T)
crps_dfm          <- dfm(crps)                                  # lowercase only (dfm() lowercases by default)
crps_dfm_clean    <- dfm(crps, remove_numbers=T, remove_punct=T)
crps_dfm_twitless <- dfm(crps, remove_numbers=T, remove_punct=T, remove="#*") # also remove twitter hashtags
crps_dfm_filter   <- dfm(crps, remove_numbers=T, remove_punct=T, remove=c("#*", profanities))
crps_dfm_stopword <- dfm(crps, remove_numbers=T, remove_punct=T, remove=c("#*", profanities, stopwords('english')))

Word counts

We can now get the number of words per document for our 5 settings:
The next check is equivalent, but counts unique words per document:
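
A possible way to obtain both counts with quanteda (a sketch; these exact calls are not shown in the report):

dfms <- list(lowercase=crps_dfm, clean=crps_dfm_clean, twitless=crps_dfm_twitless,
             filter=crps_dfm_filter, stopword=crps_dfm_stopword)
sapply(dfms, ntoken)  # total words per document, for each setting
sapply(dfms, ntype)   # unique words (types) per document, for each setting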

Finally, we can look at the most frequent words in each setting:

cbind(lowercase=textstat_frequency(crps_dfm, n=10)[,2:1],
       filtered=textstat_frequency(crps_dfm_filter, n=10)[,2:1],
      stopwords=textstat_frequency(crps_dfm_stopword, n=10)[,2:1])

It is quite easy to spot the differences between each of our categories (they are nothing alike!), meaning that poorly chosen parameters can ruin a study.

Vocabulary

As a bonus, we can display a word cloud, which is a nice visualization to get a glimpse of the main vocabulary. This one is based on our last setting (clean + filter + stopword), so we get an idea of the meaningful words.

    textplot_wordcloud(crps_dfm_stopword, max_words = 100)

rm(tkns_raw)
rm(crps_dfm, crps_dfm_clean, crps_dfm_twitless, crps_dfm_filter, crps_dfm_stopword)

n-Grams

Because the final project’s objective is to predict a word based on 1 to 3 pre-existing words, we need to build a large set of n-grams: bigrams (n-grams with n=2), trigrams (n=3) and tetragrams (n=4).

ngram     <- tokens_ngrams(tkns_clean, n=2:4)  # n=2:4 to generate bi,tri and tetragrams
ngram_dfm <- dfm(ngram, remove=c(stopwords('english'), profanities))

We now have access to the number of n-grams per type (bi-, tri- and tetragrams):
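
A hedged sketch of how this count could be obtained (it assumes quanteda's default "_" concatenator for n-gram features; the report does not show its own code for this step):

table(lengths(strsplit(featnames(ngram_dfm), "_", fixed=TRUE)))  # number of features per n-gram order (2, 3, 4)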

Plotting n-grams

Plotting the most frequent n-grams for our 3 combinations is probably the best way to see how they are distributed.

ggplot(data=textstat_frequency(ngram_dfm, n=15)[,1:2], 
             aes(y=frequency, x=reorder(feature, -frequency), fill=feature)) +
             geom_bar(stat='identity') + labs(x="top ngram") +
             theme(axis.text.x=element_text(angle = 60, hjust = 1)) + guides(fill=F)

The disparity seems to smooth out as the n-grams get bigger (frequency differences range from about 1k to 4k for bigrams, 100 to 300 for trigrams and 30 to 80 for tetragrams). This should be reflected in the final results and seems quite logical: the more words of context we have, the more accurately we should predict.

Vocabulary (bis)

We can also try to get an idea of the vocabulary used per document source; we can look at the top tetragrams, since they carry more complexity/meaning than bi- and trigrams. It appears that Twitter leans toward a conversational register, while blogs and news seem to have more complex structures.
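
One hedged way to look at this (assuming the corpus keeps one document per source file, and quanteda's default "_" n-gram concatenator) is to keep only the tetragram features and list the top ones per document:

tetra_dfm <- dfm_select(ngram_dfm, pattern="*_*_*_*", valuetype="glob")  # keep only 4-token features
sapply(docnames(tetra_dfm), function(d) names(topfeatures(tetra_dfm[d, ], 5)))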


Coverage Tests

First, let’s define a small function to compute the coverage from the token frequencies in a unigram:

ngram_1 <- tokens_ngrams(tkns_clean, n=1)
coverage <- function (ngram, cover) {
    # feature frequencies, sorted in decreasing order by textstat_frequency()
    dfm_freq <- textstat_frequency(dfm(ngram, remove=c(stopwords('english'), profanities)))[,2]
    total <- cover * sum(dfm_freq) ; counts <- 0 ; i <- 1
    # accumulate the most frequent words until the coverage target is reached
    while (counts < total & i <= length(dfm_freq)) {
      counts <- counts + dfm_freq[i]
      i <- i+1
    }
    i
}
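
As an illustration (these exact calls are not shown in the report), the function can be used directly for a single target, and the plot mentioned below could be produced by sweeping over a range of coverage targets:

coverage(ngram_1, 0.5)   # number of most frequent words needed for 50% coverage
coverage(ngram_1, 0.9)   # ... and for 90% coverage

# sketch of the coverage plot: words required for a range of coverage targets
targets <- seq(0.1, 0.95, by=0.05)
plot(targets, sapply(targets, function(x) coverage(ngram_1, x)),
     type="b", xlab="coverage target", ylab="nb. of words required")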

According to this homemade function, we need 1,000 and 15,881 (1.5881 × 10^4) words to get 50% and 90% coverage respectively.
The plot below displays more values that could help us tune our model for the best accuracy/performance ratio. Unsurprisingly, above 90% coverage it takes a lot of effort (many more words) to gain additional percentage points, but this already seems to start being the case around 80%…


Conclusions & Next Steps


As stated in the executive summary, this exploration was mainly intelligence gathering: it told us which cleaning steps (lowercasing, punctuation/number/hashtag removal, profanity filtering, stopword handling) and which coverage trade-offs to use when building the actual n-gram prediction model, which is the next step of the project.

That’s all folks! Thanks for reading all the way to the end.