The goal of this project is to show that we are working with the data and that we have a plan for creating a prediction algorithm. The report below explains our exploratory analysis and our goals for the eventual app and algorithm. This document is intended for non-data scientists; it is therefore concise, explains only the major features of the data, and briefly summarizes our plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager.
The motivation for this project is to: demonstrate that we have downloaded and successfully loaded the data, report summary statistics and interesting findings from our exploratory analysis, and get feedback on our plans for the prediction algorithm and Shiny app.
Below is the list of packages we’re using to deliver the functionality we need for this report:
# library("NLP") #Generics NLP Function set
# library("openNLP") #Generics NLP Function set
# library("tm") #For Text Mining & Corpus workings
# library("RWeka") #For n-gram vector generation
# library("qdap") #For Text Mining & Corpus workings
# library("ggplot2") #Charting functionality
# library(stringi) #String Processing Package
# library(pander)# R Doc writer package
# library(wordcloud) # Plot a word cloud
# library(RCurl) # General network (HTTP/FTP/...) client interface for R
To demonstrate that we successfully downloaded and unzipped the dataset, the metadata and file size for each of its contents are shown below. The source data is located at: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
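A minimal sketch of the acquisition step follows; the file paths inside the archive are assumed from the standard final/en_US folder layout of this zip.

# Download the archive once, then extract only the English files used here
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip",
      files = c("final/en_US/en_US.blogs.txt",
                "final/en_US/en_US.news.txt",
                "final/en_US/en_US.twitter.txt"))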
# Upon downloading the dataset, unzipping all files and segregating the English data files, we summarize the following metadata:
## Available meta data pairs are:
## Author :
## DateTimeStamp: 2015-03-14 11:20:41
## Description :
## Heading :
## ID : en_US.blogs.txt
## Language : en_US
## Origin :
## NULL
## Available meta data pairs are:
## Author :
## DateTimeStamp: 2014-03-14 11:20:43
## Description :
## Heading :
## ID : en_US.news.txt
## Language : en_US
## Origin :
## NULL
## Available meta data pairs are:
## Author :
## DateTimeStamp: 2015-03-14 11:20:42
## Description :
## Heading :
## ID : en_US.twitter.txt
## Language : en_US
## Origin :
## NULL
# Here are the dataset stats
# What do the data look like?
# Summary stats for the English set of 3 files: blogs, news and twitter:
#                      blogs      news   twitter
# number of lines     899288     77259   2360148
# number of words   37334131   2643969  30373543
# file size in MB     205.24    200.99    163.19
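The counts above can be reproduced with base R and stringi. A minimal sketch, assuming the unzipped final/en_US file layout:

# Line counts, word counts and file sizes for the three English files
library(stringi)
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           news    = "final/en_US/en_US.news.txt",
           twitter = "final/en_US/en_US.twitter.txt")
stats <- sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(lines   = length(lines),
    words   = sum(stri_count_words(lines)),
    size_MB = round(file.info(f)$size / 1024^2, 2))
})
stats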
Challenges inherent in this kind of text data include word/phrase variation, inflection, synonymy, and homograph polysemy.
NLP is a subbranch of data science that combines statistics with linguistics, with applications in speech recognition, OCR/ICR, translation, text suggestion/prediction, summarization and segmentation across many domains.
## A 93872x3 simple triplet matrix.
# Term counts from the sampled English set of 3 files: blogs, news and twitter
# (an excerpt from the term-document matrix built on the sample dataset)
# blogsample.txt newssample.txt twittersample.txt
# ''and 1 0 0
# ''he 0 2 0
# ''he's 0 2 0
# ''it's 1 0 0
# ''lazy'' 0 1 0
# ''lespecial'' 0 0 1
# ''really?" 0 0 1
# ''so 0 1 0
# ''the 0 2 0
# ''when 0 1 0
# '$1.99 0 1 0
# '(expletive), 0 1 0
# '08 0 1 0
# '08, 0 2 1
# '08. 0 1 0
# '09 0 0 1
# '12 0 0 1
# '13!! 0 0 1
# '14, 0 0 1
# '1960s' 1 0 0
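A matrix like the excerpt above can be built with tm, with RWeka supplying the n-gram tokenizer. A minimal sketch, assuming the three sample files sit in a samples/ directory (the directory name is illustrative):

# Build a cleaned corpus from the sample files and tokenize into bigrams
library(tm)
library(RWeka)
corpus <- VCorpus(DirSource("samples", encoding = "UTF-8"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# Change min/max in Weka_control for other n-gram orders
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
inspect(tdm[1:10, ])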
Frequency distributions for the first 4 n-gram orders are shown below. As the first graph demonstrates, the word “the” has the highest probability of being the first single word in the prediction.
(In the 3-gram word cloud, a few frequent phrases such as “one of the” and “a lot of” could not be fit on the page and were not plotted.)
## [1] 322
## [1] 10408
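A sketch of the unigram frequency chart referenced above, assuming a unigram term-document matrix tdm.1g built as in the previous step (the df.1g name mirrors the df.3g naming used for the word clouds; slam is a dependency of tm):

# Bar chart of the 20 most frequent unigrams
library(ggplot2)
library(slam)  # row_sums on the sparse term-document matrix
freq <- sort(row_sums(tdm.1g), decreasing = TRUE)[1:20]
df.1g <- data.frame(words = factor(names(freq), levels = rev(names(freq))),
                    total = as.numeric(freq))
ggplot(df.1g, aes(x = words, y = total)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 20 unigrams", x = NULL, y = "Frequency")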
Modeling: We will use the serial dependence of a Markov chain. Given n known/typed words, we look up the (n+1)-gram table for entries whose first n words match, and suggest the continuation with the highest probability/frequency, as sketched below. For example, if the user types “I am running”, we use the 4-gram table to find records whose first 3 words match, select the record with the highest frequency, in this case “I am running late”, and present “late” to the user for selection.
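A minimal sketch of this lookup, assuming a 4-gram frequency data frame ngram4 with columns w1..w4 and freq (the names are illustrative):

# Suggest the next word from the highest-frequency matching 4-gram
predict_next <- function(typed, ngram4) {
  words <- tail(strsplit(tolower(typed), "\\s+")[[1]], 3)
  if (length(words) < 3) return(NA_character_)  # the full app backs off to lower orders
  hits <- ngram4[ngram4$w1 == words[1] &
                 ngram4$w2 == words[2] &
                 ngram4$w3 == words[3], ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$w4[which.max(hits$freq)]  # highest-frequency continuation
}
# predict_next("I am running", ngram4)  # e.g. might return "late"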
Managing memory: If time permits, language-model smoothing, splitting the model across k LM files, and caching can be applied from the 2-gram up to the n-gram tables to generate an interpolated model (sketched below); we can then apply compilation and quantization to reduce size. Caching of probabilities will optimize memory use, and memory mapping can be deployed to further reduce dependence on memory size.
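For illustration, linear interpolation combines estimates from several n-gram orders with weights that sum to 1; the function and default weights below are assumptions to be tuned on held-out data.

# Interpolate unigram, bigram and trigram probability estimates
interp_prob <- function(p1, p2, p3, lambdas = c(0.1, 0.3, 0.6)) {
  stopifnot(isTRUE(all.equal(sum(lambdas), 1)))
  sum(lambdas * c(p1, p2, p3))
}
interp_prob(p1 = 0.01, p2 = 0.05, p3 = 0.20)  # weighted toward the trigram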
N-grams: We built tables up to 10-grams for observation purposes. Above 5-grams the table size starts decreasing, but more slowly than it grows from 1- to 5-grams. Another challenge from 6-grams up is degenerate entries with the same word repeated many times, e.g. “fellow fellow fellow …”; a filter for these is sketched below. Only 1- to 4-grams will be deployed, for size considerations.
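A small sketch of such a filter (the sample vector is illustrative):

# Drop degenerate n-grams whose tokens are all the same word
ngrams <- c("fellow fellow fellow", "thanks for the follow", "i am running late")
keep <- vapply(strsplit(ngrams, " "),
               function(w) length(unique(w)) > 1, logical(1))
ngrams[keep]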
Unobserved n-grams: As new, unobserved n-grams are encountered, the add-one method can be applied: the n-gram is added to the corresponding table and assigned a frequency of 1 (a minimal sketch follows). Alternatively, with a backoff model, it is given a conditional probability value based on the related entries and their probability history in the neighboring lower-order grams. The backoff model has its own drawback of overestimation.
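A minimal add-one (Laplace) sketch; N and V below are placeholder totals for illustration:

# count = observed frequency, N = total n-gram tokens, V = vocabulary size
add_one_prob <- function(count, N, V) (count + 1) / (N + V)
add_one_prob(0, N = 1e6, V = 5e4)  # an unseen n-gram still gets probability mass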
Model evaluation: To evaluate the model, application performance will need to be measured alongside it, as well as the tightness of its fit to the test data and its predictive capability.
Profanity: Moving forward, the best approach to this issue is to maintain a profanity dictionary, keep it updated, and filter matching terms from the n-grams; a filtering sketch follows.
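A sketch of the filtering step, assuming the tm corpus built earlier; "profanity.txt" is a hypothetical one-word-per-line list we would maintain.

# Remove profane terms from the corpus before n-gram generation
library(tm)
profanity <- readLines("profanity.txt", encoding = "UTF-8")
corpus <- tm_map(corpus, removeWords, profanity)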
The data is from a corpus called HC Corpora: http://www.corpora.heliohost.org/.
For details on the corpora, see the readme file located at: http://www.corpora.heliohost.org/aboutcorpus.html.
Natural language processing Wikipedia page: http://en.wikipedia.org/wiki/Natural_language_processing
Text mining infrastructure in R: http://www.jstatsoft.org/v25/i05/
CRAN Task View: Natural Language Processing: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Coursera course on NLP (not in R): https://www.coursera.org/course/nlp