The capstone project allows us (students) to create a usable, public data product that showcases the skills developed throughout the nine courses of the data science specialization. On this occasion, we’ll work on understanding and building Predictive Text Models like the ones used by SwiftKey, Coursera’s corporate partner for this capstone project.
This Milestone Report will cover:
Note: If you are interested in the code used to create this .Rmd file, you can find it on GitHub
The data used in this project comes from a corpus called HC Corpora. The files have been language-filtered by Coursera but may still contain some foreign text. The *.zip file contains the following language folders:
| Language | Folder Name | Files Included |
|---|---|---|
| German | de_DE | blogs.txt, news.txt, twitter.txt |
| English | en_US | blogs.txt, news.txt, twitter.txt |
| Finnish | fi_FI | blogs.txt, news.txt, twitter.txt |
| Russian | ru_RU | blogs.txt, news.txt, twitter.txt |
Each Language folder contains .txt files from 3 different sources: blogs, news and twitter.
Looking at the files contained in en_US gives the following characteristics:
| Name | File Name | File Size | Lines | Words |
|---|---|---|---|---|
| blogs | en_US.blogs.txt | 200.4242 MB | 899,288 | 37,510,168 |
| news | en_US.news.txt | 196.2775 MB | 77,259 | 2,673,480 |
| twitter | en_US.twitter.txt | 159.3641 MB | 2,360,148 | 30,088,564 |
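For reference, a minimal sketch of how the table above could be generated, assuming the files were unzipped into a `final/en_US/` folder (the standard layout of the Coursera zip) and using `stri_count_words()` from the stringi package to count words:

```r
library(stringi)  # fast word counting

# Summarise one file: short name, size in MB, number of lines, number of words
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    name    = sub("en_US\\.(\\w+)\\.txt", "\\1", basename(path)),
    file    = basename(path),
    size_mb = round(file.size(path) / 1024^2, 4),
    lines   = length(lines),
    words   = sum(stri_count_words(lines))
  )
}

files <- file.path("final", "en_US",
                   c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
do.call(rbind, lapply(files, summarise_file))
```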
As expected, twitter has the most lines despite being the smallest file in the folder. This is most likely related to twitter’s 140-character limit.
The following histogram shows how the word count is distributed across the entire en_US folder (i.e. all files within the folder). It is interesting to see how twitter skews the plot to the right up to the 28-word mark.
Note: If we were to use the number of characters instead of words, twitter would still skew the plot, but the cutoff would be closer to the 140-character mark
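A sketch of how the words-per-line distribution could be plotted with ggplot2, reusing the `files` vector from the chunk above (the axis limit is an assumption made purely for readability; the exact binning in the report may differ):

```r
library(ggplot2)
library(stringi)

# Words per line across all three en_US files
words_per_line <- unlist(lapply(files, function(path) {
  stri_count_words(readLines(path, encoding = "UTF-8", skipNul = TRUE))
}))

ggplot(data.frame(words = words_per_line), aes(x = words)) +
  geom_histogram(binwidth = 1) +
  xlim(0, 100) +                       # trim the long tail for readability
  labs(x = "Words per line", y = "Number of lines")
```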
From this point on we are going to sample the dataset to control the processing power required for the next operations. For now, we are going to create a subset with 3,000 lines per file.
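A possible sampling step, again assuming the `files` vector defined earlier (the seed is arbitrary and only there for reproducibility):

```r
set.seed(1234)  # make the random sample reproducible

# Draw 3,000 random lines from each en_US file
sample_lines <- function(path, n = 3000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

sample_data <- unlist(lapply(files, sample_lines))
length(sample_data)  # 9,000 lines in total
```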
We start by describing tokenization as the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
The following plot shows the 25 most frequent words from our sample data set:
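One way such a frequency plot can be built is sketched below, using tidytext for tokenization and ggplot2 for plotting (these packages are an assumption; tm or quanteda would work just as well). `sample_data` is the 9,000-line sample created above:

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Tokenize: one word per row, lowercased and stripped of punctuation
tokens <- tibble(text = sample_data) %>%
  unnest_tokens(word, text)

# Count word frequencies and plot the top 25
tokens %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 25) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency")
```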
Most of the words shown above are also called stop words. In general, stop words are the most common words in a language.
If we remove the stop words from our data set and recreate the previous plot we get the following top 25 words:
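A sketch of the stop-word removal, reusing the `tokens` table from the previous chunk and tidytext’s built-in `stop_words` lexicon (an assumption; `tm::stopwords("en")` would serve the same purpose):

```r
# Drop rows whose word appears in the stop-word lexicon, then recount
tokens %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 25) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency")
```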
The word *said* moved from the 14th position to the 1st position after removing the stop words. Removing stop words is useful for other NLP tasks, but here it is only meant to show the difference between the two datasets. For this specific project we’ll need to leave all stop words in the dataset, as we are trying to predict phrases.