Overview
This project is part of the Data Science Capstone and aims to develop a text prediction algorithm. This first report provides an overview of the data, presents exploratory analysis and visualizations, and outlines the steps to be taken to build the prediction algorithm and Shiny app.
Please note that for efficiency, this report reads in only 5% of the data provided; the percentage used for the final algorithm will be higher, depending on the trade-off between accuracy and speed discussed throughout the project.
Step 1. Reading in the dataset and preprocessing
Loading the data in
We begin by reading in only a sample of the data, creating a training set to develop our prediction model and a held-out set to fine-tune and test it. Both sets are subject to the same transformations.
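As an illustration, here is a minimal sketch of the sampling and splitting step; the file paths, helper names and split fraction are ours for illustration, not necessarily the exact code used:

```r
set.seed(42)                       # reproducible sampling
sample_frac <- 0.05                # read in only 5% of each file for this report

# Keep each line with probability `frac`
read_sample <- function(path, frac) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, frac) == 1]
}

blogs   <- read_sample("final/en_US/en_US.blogs.txt",   sample_frac)
twitter <- read_sample("final/en_US/en_US.twitter.txt", sample_frac)
news    <- read_sample("final/en_US/en_US.news.txt",    sample_frac)

# Split each sample into a training set and a held-out set
split_data <- function(lines, train_frac = 0.95) {
  train <- rbinom(length(lines), 1, train_frac) == 1
  list(train = lines[train], heldout = lines[!train])
}
blogs_split <- split_data(blogs)
```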
Below are some summary statistics for the portion of the data we have read in.
Main statistics
##                       Blogs     Twitter        News
## Total lines       42791.000  112578.000   3679.0000
## Training lines     2139.550    5628.900    183.9500
## Held-out lines     2032.572    5347.455    174.7525
## Total words     1796330.000 1449153.000 123823.0000
## Training words  1798620.000 1454967.000 124033.0000
## Held-out words    93234.000   75360.000   7237.0000
Step 2. Text Normalization and Processing
We now subject the text to several processing and normalization rounds, including language detection and string operations such as lowercasing and contraction expansion. Additionally, we remove profanity from the datasets.
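A hedged sketch of what these normalization rounds look like; the contraction list is a small illustrative subset, and the profanity argument stands in for whatever word list is used (language detection, e.g. with a package such as cld2, would be applied separately):

```r
normalize_text <- function(lines, profanity) {
  lines <- tolower(lines)
  # Expand a few common contractions (illustrative subset)
  lines <- gsub("won't", "will not", lines, fixed = TRUE)
  lines <- gsub("can't", "cannot",   lines, fixed = TRUE)
  lines <- gsub("n't",   " not",     lines, fixed = TRUE)
  lines <- gsub("'re",   " are",     lines, fixed = TRUE)
  # Keep only letters, apostrophes and spaces
  lines <- gsub("[^a-z' ]", " ", lines)
  # Remove profane words using a word-boundary match
  pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  lines <- gsub(pattern, "", lines)
  # Collapse the repeated whitespace left behind by the substitutions
  trimws(gsub("\\s+", " ", lines))
}
```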
Step 3. Tokenization
At this stage, we tokenize the text into words and combine the text from the three datasets (blogs, Twitter and news) into a single corpus. We also perform an additional profanity check and homogenize English spellings using British, Australian, and other dictionaries.
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : number of items read is not a multiple of the number of columns
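A minimal sketch of the tokenization and merging step; the brit2us lookup table below is a tiny hypothetical stand-in for the British/Australian dictionaries mentioned above:

```r
# Simple whitespace tokenizer
tokenize_words <- function(lines) {
  unlist(strsplit(lines, "\\s+"))
}

# Combine the three sources into a single corpus of tokens
corpus_tokens <- c(tokenize_words(blogs_split$train),
                   tokenize_words(twitter_split$train),
                   tokenize_words(news_split$train))

# Homogenize spellings via a lookup table, e.g. colour -> color
brit2us <- c(colour = "color", favourite = "favorite", realise = "realize")
hits <- corpus_tokens %in% names(brit2us)
corpus_tokens[hits] <- brit2us[corpus_tokens[hits]]
```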
Step 4. N-grams and word frequencies per order
In this step, we create n-grams up to five-grams and obtain data tables with frequencies for each order.
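A sketch of how these frequency tables can be built with data.table; corpus_tokens comes from the previous step, and this simple version ignores sentence boundaries:

```r
library(data.table)

# Build the n-grams of order n by pasting shifted copies of the token vector
make_ngrams <- function(tokens, n) {
  if (n == 1) return(tokens)
  len   <- length(tokens) - n + 1L
  parts <- lapply(0:(n - 1L), function(k) tokens[(1L + k):(len + k)])
  do.call(paste, c(parts, sep = " "))
}

# One frequency table per order, from unigrams up to five-grams
ngram_freq <- lapply(1:5, function(n) {
  dt <- data.table(word = make_ngrams(corpus_tokens, n))
  dt[, .(frequency = .N), by = word][order(-frequency)]
})
```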
Analysis. Understanding the distribution of and relationships among words, tokens and phrases
1. Exploratory analysis
We now analyze n-gram frequencies for orders one through three to understand the word distributions and patterns relevant to our prediction algorithm.
Trigrams
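The trigram wordcloud below was produced with the wordcloud package. A minimal sketch of the call, assuming trigramg is the trigram frequency table (ngram_freq[[3]] in the sketch above):

```r
library(wordcloud)
library(RColorBrewer)

# Plot the 90 most frequent trigrams; less frequent ones are drawn smaller
set.seed(1234)
wordcloud(words = trigramg$word, freq = trigramg$frequency,
          max.words = 90, colors = brewer.pal(8, "Dark2"))
```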

## Warning in wordcloud(words = trigramg$word, freq = trigramg$frequency,
## max.words = 90, : laughing_out_loud could not be fit on page. It will not
## be plotted.

[Wordcloud of the 90 most frequent trigrams]
2. Frequency and word coverage
Using the unigram frequency table plotted in the section above, we note that the corpus contains 3,327,099 word instances in total. We need only 104 words to cover 50% of all word instances in the corpus; that is, only 0.11% of unique words cover 50% of instances.
What about 90%?
We need only 5,935 words to cover 90% of all word instances in the corpus; that is, only 6.48% of unique words cover 90% of instances.
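These coverage figures can be reproduced from the unigram frequency table; a sketch, assuming the tables from Step 4:

```r
# Unigram counts sorted in decreasing order of frequency
freqs <- ngram_freq[[1]]$frequency
total <- sum(freqs)                        # 3,327,099 word instances

# Smallest number of top-ranked words whose counts cover a fraction p
coverage_words <- function(p) which(cumsum(freqs) / total >= p)[1]

n50 <- coverage_words(0.50)                # 104 words
n90 <- coverage_words(0.90)                # 5,935 words
100 * n50 / length(freqs)                  # ~0.11% of the vocabulary
```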
3. Words from foreign languages
To evaluate how many words come from foreign languages, we look up every word in the dictionary we created.
The percentage of foreign-language words is thus 0.39%. We shall convert these words to the placeholder token UNK.
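A sketch of this check; english_dict stands in for the combined dictionary built in Step 3:

```r
# Flag tokens that do not appear in the English dictionary
is_foreign <- !(corpus_tokens %in% english_dict)
100 * mean(is_foreign)                     # ~0.39% in our sample

# Replace foreign words with the UNK placeholder token
corpus_tokens[is_foreign] <- "UNK"
```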
Next steps: Prediction algorithm
As a next step, our first goal will be to create a simple prediction model: a basic n-gram model using word frequencies. We calculate n-gram probabilities and return the highest-probability continuation given the number of words input, with no interpolation or smoothing, using n-grams from unigrams up to five-grams.
Our second goal will be to create a model that includes back-off or interpolation and smoothing.
Thirdly, our goal will be to compare the performance of both models.
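To make the plan concrete, here is a hedged sketch of the baseline lookup (goal one), with the unweighted back-off of goal two as a fallback; table and helper names follow the earlier sketches:

```r
predict_next <- function(input, ngram_freq, max_order = 5) {
  words <- tokenize_words(tolower(input))
  if (length(words) == 0) return(ngram_freq[[1]][1, word])
  # Try the longest usable history first, then back off to shorter ones
  for (n in seq(min(max_order, length(words) + 1), 2)) {
    history <- paste(tail(words, n - 1), collapse = " ")
    hits <- ngram_freq[[n]][startsWith(word, paste0(history, " "))]
    if (nrow(hits) > 0) {
      # Tables are sorted by frequency, so the first hit is the MLE choice
      best <- hits[1, word]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  ngram_freq[[1]][1, word]   # last resort: the most frequent unigram
}
```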
Next steps: Shiny App
Our first goal in this section will be to measure how much memory our current best model requires and whether it fits within the Shiny app's limits. We will then tweak the model by removing singletons or reading in less data, sacrificing some accuracy in order to improve its speed and memory footprint in the Shiny app.
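A sketch of how we might measure and shrink the model's footprint; the singleton threshold is illustrative:

```r
# Memory footprint of the frequency tables, in megabytes
model_mb <- sum(sapply(ngram_freq, object.size)) / 1024^2

# Prune singletons: n-grams seen only once add little predictive power
ngram_freq_small <- lapply(ngram_freq, function(dt) dt[frequency > 1])
pruned_mb <- sum(sapply(ngram_freq_small, object.size)) / 1024^2
```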