Capstone Report

Executive Summary

The goal of this project is to show that we are working with the data and that we have a plan for creating a prediction algorithm. The report below explains our exploratory analysis and our goals for the eventual app and algorithm. This document is intended for non-data scientists; it is therefore concise, explains only the major features of the data, and briefly summarizes our plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager.

Motivation

The motivation for this project is to:

  1. Demonstrate that we have downloaded the data and successfully loaded it.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings we have amassed so far.
  4. Get feedback on our plans for creating a prediction algorithm and Shiny app.

Limitations for the purpose of this project (even though justifications could be made to do otherwise):

  1. Displaying R code is beyond the objective, since the intended audience is non-data-scientist managers.
  2. Due to the size of the files and possible constraints on performance, we display only small samples of the data set.
  3. Smaller samples will be used for deriving observations, statistics and conclusions.
  4. Samples are randomly selected to avoid possible bias.
  5. For the purpose of this project we limit the language to English.

Packages

Below is the list of packages we’re using to deliver the functionality we need for this report:

# library("NLP") #Generics NLP Function set
# library("openNLP") #Generics NLP Function set
# library("tm") #For Text Mining & Corpus workings
# library("RWeka") #For n-gram vector generation
# library("qdap") #For Text Mining & Corpus workings
# library("ggplot2") #Charting functionality
# library(stringi)    #String Processing Package
# library(pander)# R Doc writer package
# library(wordcloud) # Plot a word cloud
# library(RCurl) # General network (HTTP/FTP/...) client interface for R

Downloading the data

To demonstrate that we successfully downloaded and unzipped the data set, below are the metadata and file size for each of its contents. The source data is located at: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
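
A minimal sketch of the download and extraction step (only the en_US files are used in this report):

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip")
}
unzip("Coursera-SwiftKey.zip")  # extracts the language sub-folders under final/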

Meta data from the downloaded files

# Upon downloading the data set, unzipping all files and segregating the English data files, we summarize the following metadata:

## Available meta data pairs are:
##   Author       : 
##   DateTimeStamp: 2015-03-14 11:20:41
##   Description  : 
##   Heading      : 
##   ID           : en_US.blogs.txt
##   Language     : en_US
##   Origin       : 
## NULL

## Available meta data pairs are:
##   Author       : 
##   DateTimeStamp: 2014-03-14 11:20:43
##   Description  : 
##   Heading      : 
##   ID           : en_US.news.txt
##   Language     : en_US
##   Origin       : 
## NULL

## Available meta data pairs are:
##   Author       : 
##   DateTimeStamp: 2015-03-14 11:20:42
##   Description  : 
##   Heading      : 
##   ID           : en_US.twitter.txt
##   Language     : en_US
##   Origin       : 
## NULL
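
The metadata above can be listed with tm's meta() accessor once the files are read into a corpus; a minimal sketch (reading the full files is slow and is shown only to illustrate the call):

library(tm)
en_corpus <- VCorpus(DirSource("final/en_US", encoding = "UTF-8"))  # the three en_US files
lapply(en_corpus, meta)  # prints the metadata pairs shown above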

Summary of File Size, Word and Line Counts

# What do the data look like?
# Here are the summary statistics for the three English files: blogs, news and twitter:
#                    blogs    news twitters
# number of lines   899288   77259  2360148
# number of words 37334131 2643969 30373543
# file size in MB   205.24  200.99   163.19
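
A minimal sketch of how these statistics can be computed with stringi (file paths assume the final/en_US/ layout from the extraction step above):

library(stringi)

files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt",
           twitter = "final/en_US/en_US.twitter.txt")

stats <- sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c("number of lines" = length(lines),
    "number of words" = sum(stri_count_words(lines)),
    "file size in MB" = round(file.info(f)$size / 1024^2, 2))
})
stats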

Common steps in natural language processing

  1. Sentence detection: finding sentence boundaries
  2. Tokenization: identifying the words in each sentence
  3. Tagging: labelling words as verbs, nouns, etc. (part-of-speech tagging)
  4. Decomposition: finding compound words
  5. Parsing: analyzing the grammatical structure of sentences
  6. Segmentation: grouping text into meaningful units
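
A minimal sketch of the first three steps using the NLP and openNLP packages listed above (the pre-trained English models from the openNLPmodels.en package are assumed to be installed):

library(NLP)
library(openNLP)

s <- as.String("The quick brown fox jumps over the lazy dog. It never looked back.")

sent_ann <- Maxent_Sent_Token_Annotator()  # 1. sentence detection
word_ann <- Maxent_Word_Token_Annotator()  # 2. tokenization
pos_ann  <- Maxent_POS_Tag_Annotator()     # 3. part-of-speech tagging

a <- annotate(s, list(sent_ann, word_ann))
a <- annotate(s, pos_ann, a)

words <- subset(a, type == "word")
data.frame(token = s[words], POS = sapply(words$features, `[[`, "POS"))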

Common issues in the analysis of text data

  1. Spelling and Grammar error identification
  2. Named Entity Recognition

Factors that complicate the above include word/phrase variation, inflection, synonymy, homography and polysemy.

NLP is a sub-branch of data science that combines statistics with linguistics, with applications in speech recognition, OCR/ICR, translation, text suggestion/prediction, summarization and segmentation across many domains.

Loading a Smaller Sample of the Data Set

Term Document Matrix

## A 93872x3 simple triplet matrix.
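
A minimal sketch of the sampling and Term Document Matrix construction (the sample file names match those used below; the 1% sampling fraction is an assumption):

set.seed(1234)  # reproducible random sampling to avoid selection bias

sample_lines <- function(path, frac = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, round(length(lines) * frac))
}

dir.create("sample", showWarnings = FALSE)
writeLines(sample_lines("final/en_US/en_US.blogs.txt"),   "sample/blogsample.txt")
writeLines(sample_lines("final/en_US/en_US.news.txt"),    "sample/newssample.txt")
writeLines(sample_lines("final/en_US/en_US.twitter.txt"), "sample/twittersample.txt")

library(tm)
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"))
tdm <- TermDocumentMatrix(corpus)  # stored internally as a simple triplet matrix
tdm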

Samples from the data sets.

# Here is an example from the sample data set: an excerpt of the Term Document
# Matrix for the three sampled files (blogs, news and twitter), before cleaning:

#               blogsample.txt newssample.txt twittersample.txt
# ''and                      1              0                 0
# ''he                       0              2                 0
# ''he's                     0              2                 0
# ''it's                     1              0                 0
# ''lazy''                   0              1                 0
# ''lespecial''              0              0                 1
# ''really?"                 0              0                 1
# ''so                       0              1                 0
# ''the                      0              2                 0
# ''when                     0              1                 0
# '$1.99                     0              1                 0
# '(expletive),              0              1                 0
# '08                        0              1                 0
# '08,                       0              2                 1
# '08.                       0              1                 0
# '09                        0              0                 1
# '12                        0              0                 1
# '13!!                      0              0                 1
# '14,                       0              0                 1
# '1960s'                    1              0                 0

Building the Corpus Sample and Data Cleaning

  1. Punctuation: Although punctuation could carry value, for the purpose of this project we eliminate it.
  2. Times, dates, numbers and currency values: In some business-specific fields these could be useful, but for this project we eliminate them to reduce complexity.
  3. Finding typos: This is a tricky issue; with text messaging there are pros and cons. Using a dictionary eliminates typos, but in today’s environment shortened words could be mistaken for typos. A possible solution is to add new words to the dictionary; updating their occurrences will increase their weight as they become mainstream.
  4. Identifying garbage or the wrong language: The tm package is very useful for eliminating garbage and foreign-language text. However, there are pros and cons; in today’s global trade, multi-lingual communication is a necessity.
  5. Profanity: We found a profanity list and used it to exclude unwanted words.
  6. Upper and lower case: To keep memory use to a minimum, we convert all upper case to lower case (see the cleaning sketch after this list).
  7. Best set of features to use in predicting the next word: Use the n-gram of order (number of typed words) + 1 and find the matching entry with the highest probability/frequency value to suggest the next word. In short, the n-grams drive the prediction intelligence.
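
A minimal sketch of cleaning steps 1, 2, 5 and 6 using tm's built-in transformations (the profanity file name is hypothetical; any plain-text word list will do):

library(tm)

profanity <- readLines("profanity.txt", encoding = "UTF-8", skipNul = TRUE)  # hypothetical word list

clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))  # 6. upper to lower case
  corpus <- tm_map(corpus, removePunctuation)             # 1. punctuation
  corpus <- tm_map(corpus, removeNumbers)                 # 2. times, dates, numbers, currency digits
  corpus <- tm_map(corpus, removeWords, profanity)        # 5. profanity filtering
  tm_map(corpus, stripWhitespace)                         # collapse the gaps left behind
}

corpus <- clean_corpus(corpus)  # corpus as built from the samples above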

Exploring N-gram Statistics

Frequency distributions for the first four n-gram orders are shown below. As the first graph demonstrates, the word “the” has the highest probability of being the first single word in the prediction.

[Figures: frequency distributions of the top 1- to 4-grams, and a word cloud of the most frequent 3-grams. Two long 3-grams, “one of the” and “a lot of”, could not be fit on the word-cloud page and were not plotted.]
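
A minimal sketch of how the n-gram frequency tables behind these figures can be built with RWeka's tokenizer (the data frame names df.1g … df.4g follow the naming visible in the word-cloud call above):

library(RWeka)
library(wordcloud)

ngram_freq <- function(corpus, n) {
  tok <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tok))
  m   <- as.matrix(tdm)  # fine on a small sample; too large for the full data set
  data.frame(words = rownames(m), total = rowSums(m), row.names = NULL)
}

df.1g <- ngram_freq(corpus, 1)
df.2g <- ngram_freq(corpus, 2)
df.3g <- ngram_freq(corpus, 3)
df.4g <- ngram_freq(corpus, 4)

head(df.1g[order(-df.1g$total), ], 10)  # "the" tops the unigram list

wordcloud(words = df.3g$words, freq = df.3g$total, random.order = FALSE, max.words = 50)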

Interesting Findings.

  1. As demonstrated in the first graph above, “the” has the highest probability of being the first single word in the prediction.
  2. The subsequent 2-, 3- and 4-gram plots show the top 10 corresponding grams, highest frequency first.
  3. Using the 1-gram frequency-sorted dictionary, the number of words needed to cover 50% of all word instances is shown first below, followed by the number needed to cover 90% of the language.
  4. The graph below shows the percent-coverage curve in relation to the number of words (a sketch of how it is computed follows the output).
## [1] 322
## [1] 10408
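
A minimal sketch of the coverage computation from the unigram frequencies (df.1g as built above):

freqs    <- sort(df.1g$total, decreasing = TRUE)
coverage <- cumsum(freqs) / sum(freqs)

words_needed <- function(p) which(coverage >= p)[1]
words_needed(0.5)  # number of words required to cover 50% of all word instances
words_needed(0.9)  # number of words required to cover 90%

plot(coverage * 100, type = "l",
     xlab = "Number of Words", ylab = "Percent Coverage")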

Plans for creating a prediction algorithm and Shiny app.

  1. Modeling: Using the serial dependence of a Markov chain, take the n-gram of order (number of known/typed words) + 1 and find the matching entry with the highest probability/frequency value to suggest the next word. For example, if the user types “I am running”, we use the 4-grams to find records matching those first 3 words, select the record with the highest frequency, in this case “I am running late”, and present it to the user for selection (a sketch of this lookup follows the list).

  2. Managing memory: If time permits, LM smoothing, the use of K LM files, and caching can be applied from the 2-gram up to the n-gram tables to generate an interpolation; we then apply compilation and quantization to reduce size. Caching of probabilities will optimize memory use, and memory mapping can be deployed to further reduce dependence on memory size.

  3. N-grams: We went up to 10-grams for increased observation purposes. However, above order 5 the table size started decreasing, though more slowly than it had increased from order 1 to 5. Another challenge with order 6 and up is that we found entries with the same word repeated many times, e.g. “fellow fellow fellow …”. Only 1- to 4-grams will be deployed, for size considerations.

  4. Unobserved n-grams: As new, unobserved n-grams are encountered, the add-one method can be applied to the corresponding n-gram table, assigning a frequency of 1. However, if a backoff model is used, the n-gram is given a conditional probability value based on the related entries and their probability history in the neighboring grams. This model has its own drawback of overestimation.

  5. Model evaluation: To evaluate the model, application performance will need to be embedded in it, as well as the tightness of its fit to the test data and its predictive capability.

  6. Profanity: Moving forward, building intelligence into managing the profanity dictionary and its updates to the n-grams is the best approach to dealing with this issue.
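
A minimal sketch of the lookup described in items 1 and 4, with a simple backoff from 4-grams down to 2-grams (assumes the df.2g … df.4g frequency tables built earlier):

predict_next <- function(phrase, tables = list(df.4g, df.3g, df.2g)) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  for (df in tables) {
    n <- length(strsplit(as.character(df$words[1]), " ")[[1]])  # order of this n-gram table
    if (length(tokens) < n - 1) next                            # back off to a lower order
    prefix  <- paste(tail(tokens, n - 1), collapse = " ")
    matches <- df[startsWith(as.character(df$words), paste0(prefix, " ")), ]
    if (nrow(matches) > 0) {
      best <- as.character(matches$words[which.max(matches$total)])
      return(tail(strsplit(best, " ")[[1]], 1))                 # last word of the best-scoring n-gram
    }
  }
  "the"  # fall back to the most frequent unigram
}

predict_next("I am running")  # e.g. "late", if "i am running late" is the most frequent matching 4-gram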

References

The data is from a corpus called HC Corpora: http://www.corpora.heliohost.org/
For details on the corpora, see the readme file at: http://www.corpora.heliohost.org/aboutcorpus.html
Natural language processing Wikipedia page: http://en.wikipedia.org/wiki/Natural_language_processing
Text mining infrastructure in R: http://www.jstatsoft.org/v25/i05/
CRAN Task View: Natural Language Processing: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Coursera course on NLP (not in R): https://www.coursera.org/course/nlp