**Please use Chrome or Firefox to view this report, as it uses rCharts that cannot be displayed in IE.**

Executive Summary

The capstone project allows us (students) to create a usable, public data product that showcases the skills developed throughout the nine courses of the Data Science Specialization. On this occasion, we will work on understanding and building predictive text models like the ones used by SwiftKey, Coursera's corporate partner for this capstone project.

This Milestone Report will cover:

  1. A general overview of the HC Corpora dataset
  2. An exploratory review of the English (en_US) files
  3. A first look at tokenization and word frequencies
  4. Next steps towards the predictive model and Shiny app

Note: If you are interested in the code used to create this .Rmd file, you can find it on GitHub

General Overview

The data used in this project comes from a corpus called HC Corpora. The files have been language filtered by Coursera but may still contain some foreign text. The *.zip file contains the following language folders:

| Language | Folder Name | Files Included                   |
|----------|-------------|----------------------------------|
| German   | de_DE       | blogs.txt, news.txt, twitter.txt |
| English  | en_US       | blogs.txt, news.txt, twitter.txt |
| Finnish  | fi_FI       | blogs.txt, news.txt, twitter.txt |
| Russian  | ru_RU       | blogs.txt, news.txt, twitter.txt |

Each language folder contains .txt files from three different sources: blogs, news, and Twitter.

English folder review

Looking at the files contained in en_US gives the following characteristics:

| Name    | File Name         | File Size | Lines     | Words      |
|---------|-------------------|-----------|-----------|------------|
| blogs   | en_US.blogs.txt   | 200.42 MB | 899,288   | 37,510,168 |
| news    | en_US.news.txt    | 196.28 MB | 77,259    | 2,673,480  |
| twitter | en_US.twitter.txt | 159.36 MB | 2,360,148 | 30,088,564 |
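
A minimal sketch of how these figures could be computed in R (the `files` vector, the `en_US/` folder layout, and the use of the stringi package for word counts are assumptions; the report's actual code may differ):

```r
library(stringi)

# Hypothetical paths to the three en_US files
files <- c(blogs   = "en_US/en_US.blogs.txt",
           news    = "en_US/en_US.news.txt",
           twitter = "en_US/en_US.twitter.txt")

stats <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(size_mb = file.info(f)$size / 1024^2,    # file size in megabytes
    lines   = length(lines),                 # number of lines
    words   = sum(stri_count_words(lines)))  # total word count
}))
stats
```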

As expected, Twitter has the most lines despite being the smallest file in the folder, which is consistent with Twitter's 140-character limit per tweet.

The following histogram shows how the word count per line is distributed across the entire en_US folder (i.e., all files within the folder). It is interesting to see how Twitter skews the plot to the right up to the 28-word mark.

English Corpus Word Histogram

Note: If we were to use the number of characters instead of words, Twitter would still skew the plot, but now close to the 140-character mark.
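
The interactive histogram above was built with rCharts; as a rough, non-interactive sketch of the same idea (reusing the hypothetical `files` vector from the earlier sketch and assuming ggplot2 as a stand-in):

```r
library(ggplot2)
library(stringi)

# Words per line for each en_US source
word_counts <- do.call(rbind, lapply(names(files), function(n) {
  lines <- readLines(files[n], encoding = "UTF-8", skipNul = TRUE)
  data.frame(source = n, words = stri_count_words(lines))
}))

ggplot(word_counts, aes(x = words, fill = source)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.5) +
  coord_cartesian(xlim = c(0, 60)) +
  labs(title = "Words per line, en_US corpus",
       x = "Words per line", y = "Number of lines")
```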

Tokenization

From this point on, we are going to sample the dataset to control the processing power required for the next operations. For now, we will create a subset with 3,000 lines per file.
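
A minimal sketch of the sampling step, reusing the hypothetical `files` vector from above (the seed and exact approach are assumptions):

```r
set.seed(42)  # assumed seed, only to make the sample reproducible

sample_lines <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, 3000)                  # 3,000 random lines per file
})
sample_corpus <- unlist(sample_lines)  # combined sample used below
```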

We start by describing tokenization as the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
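
As an illustration, here is one way to tokenize the sampled lines and compute the word frequencies behind the plot below, assuming the tidytext package (the report's actual tokenizer may differ, e.g. tm or RWeka):

```r
library(dplyr)
library(tibble)
library(tidytext)

# One word (token) per row from the combined sample built above
sample_df <- tibble(line = sample_corpus)

tokens <- sample_df %>%
  unnest_tokens(word, line)  # lower-cases text and strips punctuation by default

# Word frequencies used for the top-25 plot that follows
word_freq <- count(tokens, word, sort = TRUE)
head(word_freq, 25)
```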

The following plot shows the 25 most frequent words from our sample data set:

Most of the words shown above are also called stop words. In general, stop words are the most common words in a language.

If we remove the stop words from our data set and recreate the previous plot, we get the following top 25 words:

The word said moved from the 14th position to the 1st position once the stop words were removed. Removing stop words is useful in other NLP tasks, but here it is only meant to show the difference between the two datasets. For this specific project we will keep all stop words in the dataset, since we are trying to predict full phrases.
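
A sketch of the stop-word removal, assuming the tidytext `stop_words` list (the report may use a different list, e.g. tm's English stop words):

```r
library(dplyr)
library(tidytext)

# Drop English stop words before recounting frequencies
word_freq_nostop <- tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

head(word_freq_nostop, 25)  # in the report's sample, "said" rises to the top
```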

Next Steps

  1. N-Grams: continue working on n-grams and improve the accuracy of the predictive model (a minimal bigram sketch is shown after this list)
    • This should take most of the upcoming weeks
    • Special focus on improving model performance vs. processing time
  2. Research: understand how to leverage various resources (e.g., Google, Microsoft, Stanford, among others)
  3. Shiny App: explore the best user interface for the predictive model
    • Request a higher processing time limit from Shiny
    • If needed, set up an RStudio Server on Amazon Web Services (AWS)
  4. Slide Deck: work ahead of time on the slide deck and try to embed the Shiny app into the presentation
    • See if it's possible to leverage the RStudio Server on AWS as an alternative for the slide deck
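
As a pointer to the n-gram work in item 1, here is a minimal bigram-counting sketch on the same sample (an illustration only, not the final model; it reuses the hypothetical `sample_df` from the tokenization sketch):

```r
library(dplyr)
library(tidytext)

# Count bigrams (2-grams) in the sampled lines
bigrams <- sample_df %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

head(bigrams, 10)
```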