The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships that can be observed in the data and prepare to build your first linguistic models.
This work is part of the Capstone project with SwiftKey, which provided the dataset. The main library used here is quanteda.
This report aims to cover and answer the following:
There is no clear conclusion in this analysis, but rather some intelligence gathering for the rest of our project. Mostly, we identified which preparatory steps and parameter values (i.e. cleaning options) we should use.
First of all, using a homemade function (built around quanteda::corpus(readtext(file))), we create a quanteda corpus and load it with samples of the 3 files from the dataset (to speed things up, as we are just in discovery mode, not yet modelling). The function also prints some basic summaries of each loaded file.
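The body of loadFile2Quanteda is not reproduced in this report; below is a minimal, hypothetical sketch of what it could look like, assuming the pre-sampled *.sample.txt files and the readtext package (the real implementation may differ).

library(quanteda)
library(readtext)
# Hypothetical sketch of the homemade loader: print a few facts about the
# (pre-sampled) file, then return it as a quanteda corpus.
loadFile2Quanteda <- function(file) {
  print(paste("filename:", basename(file)))
  print(paste("size:", round(file.size(file) / 2^20, 2), "MB"))
  print(paste("nb. lines:", length(readLines(file, skipNul=TRUE, warn=FALSE))))
  corpus(readtext(file))
}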
crps <- loadFile2Quanteda(filepath_blogs)
## [1] "filename: en_US.blogs.sample.txt"
## [1] "size: 2.26 MB"
## [1] "nb. lines: 30000"
crps <- crps + loadFile2Quanteda(filepath_news)
## [1] "filename: en_US.news.sample.txt"
## [1] "size: 1.95 MB"
## [1] "nb. lines: 30000"
crps <- crps + loadFile2Quanteda(filepath_twitter)
## [1] "filename: en_US.twitter.sample.txt"
## [1] "size: 2.02 MB"
## [1] "nb. lines: 30000"
summary(crps, showmeta=T)
We now proceed to tokenize the corpus documents and build the document-feature matrix (DFM, quanteda's equivalent of a term-document matrix, TDM).
In order to better understand the corpus, several tokenizations and DFMs are created by playing with the available options:
tkns_raw <- tokens(crps)
tkns_clean <- tokens(crps, remove_numbers=T, remove_punct=T)
crps_dfm <- dfm(crps)                                         # lowercase only
crps_dfm_clean <- dfm(crps, remove_numbers=T, remove_punct=T) # + no numbers/punctuation
crps_dfm_twitless <- dfm(crps, remove_numbers=T, remove_punct=T, remove="#*") # + remove tweet hashtags
crps_dfm_filter <- dfm(crps, remove_numbers=T, remove_punct=T, remove=c("#*", profanities)) # + remove profanities
crps_dfm_stopword <- dfm(crps, remove_numbers=T, remove_punct=T, remove=c("#*", profanities, stopwords('english'))) # + remove stopwords
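To quantify the effect of each cleaning step, we can also compare the vocabulary sizes of these DFMs; a quick sketch (assuming the objects created above):

sapply(list(raw=crps_dfm, clean=crps_dfm_clean, no_hashtag=crps_dfm_twitless,
            filtered=crps_dfm_filter, no_stopword=crps_dfm_stopword),
       nfeat) # nfeat() counts the unique features (words) left in each DFM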
Finally, we can look at the most recurrent words:
cbind(lowercase=textstat_frequency(crps_dfm, n=10)[,2:1],
      filtered=textstat_frequency(crps_dfm_filter, n=10)[,2:1],
      stopwords=textstat_frequency(crps_dfm_stopword, n=10)[,2:1])
It is quite easy to spot the differences between each of our categories (they are nothing alike!), which shows that badly chosen parameters can ruin a study.
As a bonus, we can display a word cloud, which is a nice visualization to get a glimpse of the main vocabulary. This one is based on our last setting (clean + filter + stopwords removed), so we get an idea of the meaningful words.
textplot_wordcloud(crps_dfm_stopword, max_words = 100)
rm(tkns_raw)
rm(crps_dfm, crps_dfm_clean, crps_dfm_twitless, crps_dfm_filter, crps_dfm_stopword)
Because the final project’s objective is to predict a word based on 1 to 3 preceding words, we need to build a huge set of n-grams containing: bigrams (n-grams with n=2), trigrams (n=3) and tetragrams (n=4).
ngram <- tokens_ngrams(tkns_clean, n=2:4) # n=2:4 to generate bi,tri and tetragrams
ngram_dfm <- dfm(ngram, remove=c(stopwords('english'), profanities))
We now have access to the number of n-grams per type:
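The chunk producing those counts is not reproduced here; one possible way to get them is to derive each feature's size from the "_" separator inserted by tokens_ngrams() (a sketch based on the ngram_dfm built above):

ngram_size <- lengths(strsplit(featnames(ngram_dfm), "_", fixed=TRUE)) # words per feature
table(ngram_size) # number of distinct bigrams (2), trigrams (3) and tetragrams (4)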
Plotting the most recurrent n-grams for our 3 sizes is probably the best way to see their distribution:
ggplot(data=textstat_frequency(ngram_dfm, n=15)[,1:2],
       aes(y=frequency, x=reorder(feature, -frequency), fill=feature)) +
  geom_bar(stat='identity') + labs(x="top ngram") +
  theme(axis.text.x=element_text(angle = 60, hjust = 1)) + guides(fill=F)
The disparity seems to get smoother the bigger the n-grams are (differences in frequency range between 1k and 4k for bigrams, 100 and 300 for trigrams, and 30 and 80 for tetragrams). This should be reflected in the final results and seems quite logical: the more words we have, the more accurately we should predict.
We can also try to get an idea of the vocabulary used per document source; we can look at the top tetragrams since they carry more complexity/meaning than bi- or trigrams. It appears that Twitter leans more towards a conversational register, while blogs and news seem to have more complex structures.
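The corresponding chunk is not shown either; a hypothetical sketch of such a per-source comparison, assuming ngram_dfm keeps one document per source file:

tetra_dfm <- dfm_select(ngram_dfm, pattern="*_*_*_*")                  # keep 4-word features only
textstat_frequency(tetra_dfm, n=5, groups=factor(docnames(tetra_dfm))) # top tetragrams per source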
First, let's define a small function to compute the coverage from the token frequencies of the unigrams:
ngram_1 <- tokens_ngrams(tkns_clean, n=1)
coverage <- function (ngram, cover) {
  # frequencies come back sorted in decreasing order
  dfm_freq <- textstat_frequency(dfm(ngram, remove=c(stopwords('english'), profanities)))$frequency
  total <- cover * sum(dfm_freq)   # token occurrences needed to reach the target coverage
  counts <- 0 ; i <- 0
  while (counts < total && i < length(dfm_freq)) {
    i <- i + 1
    counts <- counts + dfm_freq[i]
  }
  i                                # number of top-frequency words needed
}
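For instance, the values quoted below can be obtained like this (using the coverage() helper and the unigram tokens ngram_1 defined above):

coverage(ngram_1, 0.5) # unique words needed to cover 50% of all token occurrences
coverage(ngram_1, 0.9) # unique words needed to cover 90% of all token occurrences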
According to this homemade function, we need 1000 and roughly 1.588 × 10^4 words to get 50% and 90% coverage respectively.
The plot below displays more values that could help us tune our model for the best accuracy/performance ratio. Unsurprisingly, above 90% it takes a lot of effort (many more words) to gain additional percentage points, but it seems that this already starts to be the case around 80%…
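The chunk generating that plot is not included; one possible way to produce such a curve is sketched below (it reuses the coverage() helper, so it is slow but simple; the original plotting code may differ):

cover_targets <- seq(0.1, 0.95, by=0.05)                          # coverage levels to evaluate
words_needed <- sapply(cover_targets, function(p) coverage(ngram_1, p))
plot(cover_targets, words_needed, type="b",
     xlab="coverage", ylab="unique words needed")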
It is clear that parameters and cleaning steps have a huge impact on the results; therefore, it will be a priority to confirm our various settings with model comparisons.
Obviously, given our objectives, we will not be using the stopword filtering (we want every highly recurring word combination regardless of its meaning), but I will definitely try to use the resulting top unique words as a way to predict when we don’t have a relevant n-gram in store.
It goes without saying that I allowed myself to use rather small samples to speed things up, as we were just in discovery mode, not yet modelling; that won’t be possible in the next phase, so…
Except for the creation of the big n-gram tokens and DFMs, there were no memory problems, but memory will most probably be an issue during the next phases (modelling), so the tests made on coverage could be of use to optimize the balance between model size and performance.
This EDA has been performed on a sample of the original dataset. The first step to optimise it was to find an adequate file-loading function (I started with readLines but switched to the more efficient readtext). The next step, if necessary, would be to split the corpus with corpus_segment.
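As a purely illustrative sketch of that last idea (using corpus_reshape, a close alternative to corpus_segment, to break the corpus into sentence-level documents; not part of the original analysis):

crps_sentences <- corpus_reshape(crps, to="sentences") # one document per sentence
ndoc(crps_sentences)                                   # number of resulting chunks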
Looking to optimise the loading led me to discover the quanteda library; I could not be more thankful (although I would have loved to do that before reading the still-mentor-recommended-but-outdated 50-page tutorial for the more complex tm library).
More (literate) content could be added to the corpus to improve the accuracy of our future model. I will try to find some relevant sources (while always staying cautious about the memory/model size).
Playing with dictionaries to either detect and remove words from foreign languages, or keep only English words found in a thesaurus, was quite disappointing: it was time- and resource-consuming (not displayed here for that reason) and did not help reduce the n-gram size much.
That’s all folks! Thanks for reading it ’till the end.