The goal of this project is to create a Shiny application that predicts the next word from the one, two, or three preceding words in a sentence. This report presents an exploratory data analysis of a text corpus made up of several text sources (Twitter, blogs, news) provided by Coursera in partnership with SwiftKey. It is not meant to be a technical document; rather, it summarizes the main insights from the corpus that will guide the design of an algorithm capable of reaching the project goal with reasonable accuracy.

Loading data, summary tables

##   text.type Number.of.lines
## 1   Twitter         2360148
## 2     Blogs          899288
## 3      News           77259

Text sampling and merging into a single text corpus:

To keep the Shiny app efficient, it is necessary to subset the text corpus while still obtaining a meaningful sample of each text source. To achieve this, I randomly sampled the lines of each source, keeping each line independently with a probability of 20% (one Bernoulli trial per line, so the number of lines kept follows a binomial distribution). This reduces the size of the corpus by about 80% without a significant loss of accuracy. Finally, I merged the samples from all sources into a single text corpus that will be used as the training set for my future algorithm.

##        text.type Number.of.lines
## 1 Sample Twitter          471431
## 2   Sample Blogs          179676
## 3    Sample News           15338

## [1] "total number of sentences in the original corpus: 3336695 lines"

## [1] "total number of sentences in the sampled corpus: 666445 lines"
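The sampling step described above can be sketched as follows. This is a minimal illustration in Python, not the code used to produce this report; the function name `sample_lines`, the `seed` value, and the in-memory list of lines are all assumptions made for the example.

```python
import random


def sample_lines(lines, keep_prob=0.2, seed=42):
    """Keep each line independently with probability keep_prob.

    Each line is one Bernoulli trial, so the number of lines kept
    follows a binomial distribution with mean keep_prob * len(lines).
    The seed is fixed (an arbitrary choice) so the sample is reproducible.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < keep_prob]


# Hypothetical stand-in for one of the text sources.
lines = [f"line {i}" for i in range(10000)]
sample = sample_lines(lines)
print(len(sample))  # close to 2000, i.e. about 20% of the input
```

Sampling each source separately and then concatenating the results keeps the proportions between Twitter, blogs, and news roughly the same as in the original corpus.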

Tokenization and text pre-processing:

The next step is to “tokenize” the text corpus and preprocess the text. Tokenization consists of splitting the sentences of the corpus into basic components, in this case single words. I included some preprocessing steps to remove unwanted tokens and to group together words that carry the same meaning. These steps included converting the text to lowercase and removing punctuation, hashtags, numbers, separators, URLs, etc.

Because some of the text uses informal language, it was necessary to do some profanity cleaning. I downloaded a dictionary of English profane words and removed these words from the corpus.
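The cleaning and tokenization steps above could look roughly like the sketch below. It is written in Python for illustration only and is not the pipeline used for this report; the regular expressions, the `tokenize` function, and the `banned_words` argument (standing in for the downloaded profanity dictionary) are assumptions.

```python
import re

# Patterns are illustrative; a real pipeline may need broader rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Anything that is not a lowercase ASCII letter, apostrophe, or whitespace.
# This also drops numbers, punctuation, and non-ASCII characters.
NON_LETTER_RE = re.compile(r"[^a-z'\s]")


def tokenize(sentence, banned_words=frozenset()):
    """Lowercase, strip URLs/hashtags/numbers/punctuation, split into words."""
    text = sentence.lower()
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # Splitting on whitespace also removes the extra separators left behind.
    tokens = [t.strip("'") for t in text.split()]
    return [t for t in tokens if t and t not in banned_words]


print(tokenize("Check https://t.co/abc NOW!! #winning 2 days, don't wait"))
# ['check', 'now', 'days', "don't", 'wait']
```

Removing URLs and hashtags before stripping punctuation matters: otherwise the punctuation pass would break them into fragments that look like ordinary words.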

The following plot shows the frequency of the unique words that make up our text corpus vocabulary. It is important to notice that, even though we have a very large collection of unique words, just a few thousand of them account for 90% of all word occurrences in the corpus. This means that the frequency distribution is highly skewed and most words are rare; hence, to reduce the size of our app, it will be a good idea to restrict the words used in our algorithm to the most frequent vocabulary in our corpus.

## [1] "Total unique words in the text corpus 207946"

## [1] "Number of unique words that gather 90% of the corpus vocabulary: 6366"
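One way to arrive at a figure like the one above is to sort the words by frequency and accumulate their counts until the target share of the corpus is reached. The Python sketch below illustrates that idea under the assumption that the corpus is available as a flat list of tokens; the function name `words_needed_for_coverage` is hypothetical.

```python
from collections import Counter


def words_needed_for_coverage(tokens, coverage=0.9):
    """Return how many of the most frequent unique words are needed
    to account for `coverage` of all word occurrences in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= coverage * total:
            return rank
    return len(counts)


# Toy corpus: "the" x7, "of" x2, "cat" x1 (10 occurrences in total).
tokens = ["the"] * 7 + ["of"] * 2 + ["cat"]
print(words_needed_for_coverage(tokens, coverage=0.8))  # 2
```

In the toy corpus, "the" alone covers 70% of the occurrences and adding "of" brings the total to 90%, so two words are enough to pass the 80% mark.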

Obtaining n-grams from text corpus:

The term n-gram refers to a sequence of “n” consecutive words. In the previous step we already obtained what are called “unigrams”, which are n-grams where n = 1 (one word). From these we can build the remaining n-grams from our text corpus: bigrams, made up of two consecutive words, and trigrams, made up of three consecutive words.

Splitting our corpus into n-grams is necessary when we want to predict the next word given the preceding words. We can use the relative frequency of each n-gram in our corpus as a maximum likelihood estimate (MLE) of the probability that it appears in the language. The proposed algorithm is explained at the end of this file.
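To make the n-gram and MLE ideas concrete, here is a small Python sketch. It is illustrative only and assumes the corpus is a list of already tokenized sentences; `ngrams` and `mle_next_word` are hypothetical helper names. The MLE of a word given its context is the n-gram count divided by the count of the (n-1)-word context.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def mle_next_word(sentences, n):
    """Map each n-gram to count(n-gram) / count(its first n-1 words).

    N-grams are built within each sentence, so they never span
    a sentence boundary.
    """
    ngram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        for gram in ngrams(tokens, n):
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return {
        gram: count / context_counts[gram[:-1]]
        for gram, count in ngram_counts.items()
    }


sentences = [["i", "love", "you"], ["i", "love", "data"], ["i", "like", "you"]]
probs = mle_next_word(sentences, 2)
print(probs[("i", "love")])    # 2 of the 3 bigrams starting with "i"
print(probs[("love", "you")])  # 1 of the 2 bigrams starting with "love"
```

Any n-gram that never occurs in the corpus gets no entry at all, which is the zero-probability problem that smoothing and back-off methods are meant to address.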

The following histograms show the 100 most frequent n-grams in our corpus, together with their respective word clouds. Note that the same principle as before will be applied: I am only going to use the n-grams with the highest frequency in the text corpus so that the app runs faster and does not exhaust our cloud storage.

Top 100 unigrams

Top 100 bigrams

Top 100 trigrams

Once I obtained the frequency tables of the different n-grams, it was easier to identify unwanted tokens; hence the pre-processing was carried out iteratively until I got a set of “clean” tokens.

Next steps: Building an n-gram algorithm and a Shiny app.

The algorithm that will be used is called the Katz back-off model. As explained before, this algorithm uses our n-gram frequency tables and, through the chain rule of probability for dependent events together with the Markov assumption, lets us predict the next word given the last one or two words.

By default, this algorithm starts from the highest-order n-gram and looks for the candidate with the highest probability given its n-1 preceding words. If there is no match, it backs off to a lower-order n-gram and repeats until it finds a match using the shorter context. One of the problems with this kind of method is estimating the probability of unobserved n-grams. To address this, the algorithm discounts some probability mass from the observed n-grams and redistributes it to the unobserved ones, thus avoiding zero probabilities, which would be far from reality.
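The discount-and-redistribute idea can be sketched in Python as below. This is a deliberately simplified illustration, not the model that will power the app: it only backs off from bigrams to unigrams, and it uses a fixed absolute discount of 0.5 (an assumption) instead of the Good-Turing discounts of the original Katz formulation. The function names are hypothetical.

```python
from collections import Counter


def train(sentences):
    """Count unigrams and bigrams in a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def backoff_bigram_probs(prev, unigrams, bigrams, discount=0.5):
    """Return {word: P(word | prev)} over the whole vocabulary."""
    seen = {w: c for (p, w), c in bigrams.items() if p == prev}
    context_count = sum(seen.values())
    if context_count == 0:
        # Context never observed: fall back entirely to unigram MLE.
        total = sum(unigrams.values())
        return {w: c / total for w, c in unigrams.items()}
    # Observed bigrams: subtract the discount from every count.
    probs = {w: (c - discount) / context_count for w, c in seen.items()}
    # Probability mass freed by discounting.
    left_over = discount * len(seen) / context_count
    # Share the freed mass among unseen words, proportionally to their
    # unigram counts.
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen)
    for w, c in unigrams.items():
        if w not in seen:
            probs[w] = left_over * c / unseen_mass
    return probs


sentences = [["i", "love", "you"], ["i", "love", "data"], ["i", "like", "you"]]
unigrams, bigrams = train(sentences)
probs = backoff_bigram_probs("i", unigrams, bigrams)
print(max(probs, key=probs.get))  # love
```

In this toy corpus, "data" never follows "i", yet it still receives a small non-zero probability taken from the discounted counts of "love" and "like". A full implementation would apply the same step again from trigrams down to bigrams.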

Exploratory Data Analysis Capstone Project Coursera

Pablo Rueda

2/15/2021
