Overview
This project is part of the Data Science Capstone and aims to develop a text prediction algorithm. This first report provides an overview of the data, presents exploratory analysis and visualizations, and outlines the steps to be taken to build the prediction algorithm and Shiny app.
Please note that for efficiency, this report reads in only 5% of the data provided; the percentage used for the final algorithm will be higher, depending on the trade-off between accuracy and speed discussed throughout the project.
Step 1. Reading in the dataset and preprocessing
Loading the data in
We begin by reading in only a sample of the data, creating a training set to develop our prediction model and a held-out set to fine-tune and test it. Both sets are subject to the same transformations.
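As an illustration, here is a minimal sketch of the sampling and splitting step; the file paths, helper names and split fraction are ours for illustration, not necessarily the exact code used:

```r
set.seed(42)                       # reproducible sampling
sample_frac <- 0.05                # read in only 5% of each file for this report

# Keep each line with probability `frac`
read_sample <- function(path, frac) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, frac) == 1]
}

blogs   <- read_sample("final/en_US/en_US.blogs.txt",   sample_frac)
twitter <- read_sample("final/en_US/en_US.twitter.txt", sample_frac)
news    <- read_sample("final/en_US/en_US.news.txt",    sample_frac)

# Split each sample into a training set and a held-out set
split_data <- function(lines, train_frac = 0.95) {
  train <- rbinom(length(lines), 1, train_frac) == 1
  list(train = lines[train], heldout = lines[!train])
}
blogs_split <- split_data(blogs)
```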
Below are some summary statistics for the portion of the data we have read in.
Main statistics
##                       Blogs     Twitter        News
## Total lines       42791.000  112578.000   3679.0000
## Training lines     2139.550    5628.900    183.9500
## Held-out lines     2032.572    5347.455    174.7525
## Total words     1796330.000 1449153.000 123823.0000
## Training words  1798620.000 1454967.000 124033.0000
## Held-out words    93234.000   75360.000   7237.0000
Step 2. Text Normalization and Processing
We now subject the text to several processing and normalization rounds, including language detection and string operations such as lowercasing and contraction expansion. Additionally, we remove profanity from the datasets.
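A hedged sketch of what these normalization rounds look like; the contraction list is a small illustrative subset, and the profanity argument stands in for whatever word list is used (language detection, e.g. with a package such as cld2, would be applied separately):

```r
normalize_text <- function(lines, profanity) {
  lines <- tolower(lines)
  # Expand a few common contractions (illustrative subset)
  lines <- gsub("won't", "will not", lines, fixed = TRUE)
  lines <- gsub("can't", "cannot",   lines, fixed = TRUE)
  lines <- gsub("n't",   " not",     lines, fixed = TRUE)
  lines <- gsub("'re",   " are",     lines, fixed = TRUE)
  # Keep only letters, apostrophes and spaces
  lines <- gsub("[^a-z' ]", " ", lines)
  # Remove profane words using a word-boundary match
  pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  lines <- gsub(pattern, "", lines)
  # Collapse the repeated whitespace left behind by the substitutions
  trimws(gsub("\\s+", " ", lines))
}
```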
Step 3. Tokenization
At this stage, we tokenize the text into words and combine the text from the three datasets (blogs, Twitter and news) into a single corpus. We also perform an additional profanity check and homogenize English spellings using British, Australian, and other dictionaries.
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : number of items read is not a multiple of the number of columns
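A minimal sketch of the tokenization and merging step; the brit2us lookup table below is a tiny hypothetical stand-in for the British/Australian dictionaries mentioned above:

```r
# Simple whitespace tokenizer
tokenize_words <- function(lines) {
  unlist(strsplit(lines, "\\s+"))
}

# Combine the three sources into a single corpus of tokens
corpus_tokens <- c(tokenize_words(blogs_split$train),
                   tokenize_words(twitter_split$train),
                   tokenize_words(news_split$train))

# Homogenize spellings via a lookup table, e.g. colour -> color
brit2us <- c(colour = "color", favourite = "favorite", realise = "realize")
hits <- corpus_tokens %in% names(brit2us)
corpus_tokens[hits] <- brit2us[corpus_tokens[hits]]
```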
Step 4. N-grams and word frequencies per order
In this step, we create n-grams up to five-grams and obtain data tables with frequencies for each order.
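A sketch of how these frequency tables can be built with data.table; corpus_tokens comes from the previous step, and this simple version ignores sentence boundaries:

```r
library(data.table)

# Build the n-grams of order n by pasting shifted copies of the token vector
make_ngrams <- function(tokens, n) {
  if (n == 1) return(tokens)
  len   <- length(tokens) - n + 1L
  parts <- lapply(0:(n - 1L), function(k) tokens[(1L + k):(len + k)])
  do.call(paste, c(parts, sep = " "))
}

# One frequency table per order, from unigrams up to five-grams
ngram_freq <- lapply(1:5, function(n) {
  dt <- data.table(word = make_ngrams(corpus_tokens, n))
  dt[, .(frequency = .N), by = word][order(-frequency)]
})
```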
Analysis. Understanding the distribution of and relationships among words, tokens and phrases
1. Exploratory analysis
We now analyze n-gram frequencies for orders one through three to understand the word distributions and patterns relevant to our prediction algorithm.
Trigrams
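The trigram wordcloud below was produced with the wordcloud package. A minimal sketch of the call, assuming trigramg is the trigram frequency table (ngram_freq[[3]] in the sketch above):

```r
library(wordcloud)
library(RColorBrewer)

# Plot the 90 most frequent trigrams; less frequent ones are drawn smaller
set.seed(1234)
wordcloud(words = trigramg$word, freq = trigramg$frequency,
          max.words = 90, colors = brewer.pal(8, "Dark2"))
```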

## Warning in wordcloud(words = trigramg$word, freq = trigramg$frequency,
## max.words = 90, : laughing_out_loud could not be fit on page. It will not
## be plotted.

[Wordcloud of the 90 most frequent trigrams]
2. Frequency and word coverage
Using the unigram frequency table plotted in the section above, we note that the corpus contains 3,327,099 word instances in total. We need only 104 words to cover 50% of all word instances in the corpus; that is, only 0.11% of unique words cover 50% of instances.
What about 90%?
We need only 5,935 words to cover 90% of all word instances in the corpus; that is, only 6.48% of unique words cover 90% of instances.
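These coverage figures can be reproduced from the unigram frequency table; a sketch, assuming the tables from Step 4:

```r
# Unigram counts sorted in decreasing order of frequency
freqs <- ngram_freq[[1]]$frequency
total <- sum(freqs)                        # 3,327,099 word instances

# Smallest number of top-ranked words whose counts cover a fraction p
coverage_words <- function(p) which(cumsum(freqs) / total >= p)[1]

n50 <- coverage_words(0.50)                # 104 words
n90 <- coverage_words(0.90)                # 5,935 words
100 * n50 / length(freqs)                  # ~0.11% of the vocabulary
```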
3. Words from foreign languages
To evaluate how many words come from foreign languages, we look up every word in the dictionary we created.
The percentage of foreign-language words is thus 0.39%. We shall convert these words to the placeholder token UNK.
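A sketch of this check; english_dict stands in for the combined dictionary built in Step 3:

```r
# Flag tokens that do not appear in the English dictionary
is_foreign <- !(corpus_tokens %in% english_dict)
100 * mean(is_foreign)                     # ~0.39% in our sample

# Replace foreign words with the UNK placeholder token
corpus_tokens[is_foreign] <- "UNK"
```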
Next steps: Prediction algorithm
As a next step, our first goal will be to create a simple prediction model: a basic n-gram model using word frequencies. We calculate n-gram probabilities and return the highest-probability continuation given the number of words input, with no interpolation or smoothing, using n-grams from unigrams up to five-grams.
Our second goal will be to create a model that includes back-off or interpolation and smoothing.
Thirdly, our goal will be to compare the performance of both models.
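To make the plan concrete, here is a hedged sketch of the baseline lookup (goal one), with the unweighted back-off of goal two as a fallback; table and helper names follow the earlier sketches:

```r
predict_next <- function(input, ngram_freq, max_order = 5) {
  words <- tokenize_words(tolower(input))
  if (length(words) == 0) return(ngram_freq[[1]][1, word])
  # Try the longest usable history first, then back off to shorter ones
  for (n in seq(min(max_order, length(words) + 1), 2)) {
    history <- paste(tail(words, n - 1), collapse = " ")
    hits <- ngram_freq[[n]][startsWith(word, paste0(history, " "))]
    if (nrow(hits) > 0) {
      # Tables are sorted by frequency, so the first hit is the MLE choice
      best <- hits[1, word]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  ngram_freq[[1]][1, word]   # last resort: the most frequent unigram
}
```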
Next steps: Shiny App
Our first goal in this section will be to measure how much memory our current best model requires and whether it fits within the Shiny app's limits. We will then tweak the model by removing singletons or reading in less data, sacrificing some accuracy in order to improve its speed and memory footprint in the Shiny app.
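A sketch of how we might measure and shrink the model's footprint; the singleton threshold is illustrative:

```r
# Memory footprint of the frequency tables, in megabytes
model_mb <- sum(sapply(ngram_freq, object.size)) / 1024^2

# Prune singletons: n-grams seen only once add little predictive power
ngram_freq_small <- lapply(ngram_freq, function(dt) dt[frequency > 1])
pruned_mb <- sum(sapply(ngram_freq_small, object.size)) / 1024^2
```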