Summary

In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

The goal of this project is to demonstrate that you have become comfortable working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for the prediction algorithm and Shiny app in a way that a non-data-scientist manager could understand. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Data Processing

The data files are provided by SwiftKey, a Coursera corporate partner. The database files were downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

In this exercise we use the English database, but there are three other databases in German, Russian and Finnish. In the following steps we read the files and load a subset of the data into memory.

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.

In this step the database files, en_US.twitter.txt, en_US.blogs.txt and en_US.news.txt, are downloaded and placed on the local PC. The files in the English database are then opened and read to compute the statistics for each file. A file connection is opened and closed for each file to save memory. The top 10 lines are printed for each file.
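A minimal sketch of this step, assuming the zip is extracted into the default `final/en_US/` folder and using `stringi` for the word and sentence counts (the exact counting functions used in the report are not shown, so these are assumptions):

```r
library(stringi)

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip <- "Coursera-SwiftKey.zip"
if (!file.exists(zip)) download.file(url, zip, mode = "wb")
if (!dir.exists("final")) unzip(zip)

file_stats <- function(path) {
  con   <- file(path, "r")                 # open a connection for this file ...
  lines <- readLines(con, skipNul = TRUE)  # ... read it ...
  close(con)                               # ... and close it to free memory
  data.frame(
    Characters = sum(nchar(lines)),
    Words      = sum(stri_count_words(lines)),
    Sentences  = sum(stri_count_boundaries(lines, type = "sentence")),
    FileSizeMB = round(file.size(path) / 1024^2)
  )
}

files <- c(twitter = "final/en_US/en_US.twitter.txt",
           blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt")

do.call(rbind, lapply(files, file_stats))   # summary statistics per file
head(readLines(files["twitter"], n = 10))   # top 10 lines of one file
```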

Summary statistics per file

| File    | Characters  | Words      | Sentences | File size (MB) |
|---------|-------------|------------|-----------|----------------|
| twitter | 167,105,119 | 30,373,995 | 4,346,303 | 159            |
| blogs   | 210,160,012 | 37,334,450 | 2,530,958 | 200            |
| news    | 15,838,281  | 2,643,972  | 174,879   | 196            |

Read the big files and obtain a small sample of about 10 percent of each file. Display the top 6 sentences from each sample: twitter, blogs and news.
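A sketch of the sampling step; the 10 percent rate follows the text, while the seed and the binomial sampling approach are assumptions:

```r
set.seed(1234)

sample_file <- function(path, rate = 0.10) {
  lines <- readLines(path, skipNul = TRUE)
  lines[rbinom(length(lines), 1, rate) == 1]   # keep roughly 10% of the lines at random
}

twitter_sample <- sample_file("final/en_US/en_US.twitter.txt")
blogs_sample   <- sample_file("final/en_US/en_US.blogs.txt")
news_sample    <- sample_file("final/en_US/en_US.news.txt")

head(twitter_sample, 6); head(blogs_sample, 6); head(news_sample, 6)
```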

Data Cleaning and Exploratory Analysis

Clean up the data by removing punctuation, numbers and separators. Build a document-feature matrix, which performs tokenization and tabulates the extracted features into a matrix of documents by features. This helps to understand the n-grams created and their frequency in the data subset. N-grams of order 1 to 4 were created, and their frequencies were plotted to better visualize the distribution of the top 10 features of each order; a sketch of this step follows the package-loading output below.

## quanteda version 0.9.9.50
## Using 3 of 4 cores for parallel computing
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
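A sketch of the cleaning and n-gram construction with quanteda. Note that the report was built with quanteda 0.9.9.50, whose `dfm()` accepted cleaning arguments directly; the tokens-based pipeline shown here follows the current API and may need to be adapted to that version:

```r
library(quanteda)

corp <- corpus(c(twitter_sample, blogs_sample, news_sample))

## tokenize once, removing punctuation, numbers and separators
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
               remove_separators = TRUE)

## build a document-feature matrix for each n-gram order 1..4
ngram_dfms <- lapply(1:4, function(n) dfm(tokens_ngrams(toks, n = n)))

## top 10 features (with counts) for each n-gram order
top_features <- lapply(ngram_dfms, topfeatures, n = 10)
top_features[[2]]   # e.g. the ten most frequent bigrams
```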

Plot the top features of each of the four generated n-gram orders individually.
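A sketch of the plotting step using ggplot2 and gridExtra (attached above); the panel titles and two-column layout are assumptions:

```r
library(ggplot2)
library(gridExtra)

plot_top10 <- function(d, title) {
  freq <- topfeatures(d, 10)
  df   <- data.frame(feature = reorder(names(freq), freq), count = freq)
  ggplot(df, aes(feature, count)) +
    geom_col() +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}

plots <- Map(plot_top10, ngram_dfms,
             c("Unigrams", "Bigrams", "Trigrams", "4-grams"))
grid.arrange(grobs = plots, ncol = 2)
```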

Findings

- Reading the big files is a memory- and time-consuming task.

- Even with these three files, the corpus cannot cover every sentence a user might type, so the prediction algorithm will have to handle unseen word combinations.

Plan for creating the prediction model

The four n-gram frequency tables have been saved as files (or a database) that can be loaded whenever they are needed and removed from memory when they are not. The algorithm will try to find a match for the provided input words; if there are one or more matches, they will be returned immediately.
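A sketch of how the n-gram tables could be written to disk with data.table and loaded on demand; it reuses the `ngram_dfms` objects from the earlier sketch, and the CSV file names and the underscore-joined prefix lookup are assumptions:

```r
library(data.table)

## turn each dfm into a sorted frequency table and save it to disk
ngram_tables <- lapply(ngram_dfms, function(d) {
  freq <- colSums(d)
  data.table(ngram = names(freq), count = as.numeric(freq))[order(-count)]
})
for (i in seq_along(ngram_tables))
  fwrite(ngram_tables[[i]], sprintf("ngram_%d.csv", i))

## later, load only the table that is needed and look up the input prefix
tri    <- fread("ngram_3.csv")
prefix <- "one_of"                                # quanteda joins n-gram tokens with "_"
tri[startsWith(ngram, paste0(prefix, "_"))][1:3]  # top candidate continuations
```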

If there is no match, the algorithm will use a backoff method to compare the word list against the next-lower-order n-gram (the (n-1)-gram). Probabilities will be assigned to each occurrence of an n-gram based on its frequency. For occurrences with zero frequency, a smoothing method will be used so that their probability can still be calculated and will not result in an error or exception in the code.
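A minimal sketch of the backoff lookup, in the spirit of "stupid backoff", over the frequency tables built above; the 0.4 penalty, the function name and returning the top three candidates are assumptions, and the smoothing step for zero-frequency n-grams described above is not yet implemented here:

```r
library(data.table)

predict_next <- function(words, tables, lambda = 0.4) {
  words <- tolower(words)
  for (n in rev(2:length(tables))) {                      # 4-gram down to bigram
    prefix <- paste(tail(words, n - 1), collapse = "_")
    hits   <- tables[[n]][startsWith(ngram, paste0(prefix, "_"))]
    if (nrow(hits) > 0)
      return(hits[, .(word  = sub(".*_", "", ngram),      # last token of the n-gram
                      score = lambda^(length(tables) - n) * count / sum(count))
                 ][order(-score)][1:min(3, .N)])
  }
  tables[[1]][1:3, .(word = ngram)]                       # last resort: top unigrams
}

predict_next(c("one", "of", "the"), ngram_tables)
```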