Initial Insights

Executive Summary

This document provides our initial insights into the language corpus files supplied for building a natural language predictor. We focused on the English corpus and used the tm package for some initial analysis. The analysis below shows interesting patterns and word occurrences that need to be examined more deeply in order to arrive at an efficient NLP solution.

Data Sample

Each of the files in the language directories is too large for a meaningful initial review. A random sample of 20,000 lines was therefore extracted from each English file using the following Unix commands (gshuf is the GNU shuf utility). This gives a more manageable data set for initial discovery.

gshuf -n 20000 en_US.twitter.txt > sample/en_US.twitter.txt.2K

gshuf -n 20000 en_US.blogs.txt > sample/en_US.blogs.txt.2K

gshuf -n 20000 en_US.news.txt > sample/en_US.news.txt.2K
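
A comparable sample can also be drawn from within R. The sketch below is an illustration, not the command used for this report: sample_lines is a hypothetical helper, the seed value is arbitrary, and reading the whole file with readLines assumes the file fits in memory.

```r
# Hypothetical helper: draw n random lines from a text file, without replacement.
sample_lines <- function(path, n, seed = 1234) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  set.seed(seed)                                  # make the sample reproducible
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Demonstration on a small temporary file instead of the real corpus.
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:100), tmp)
picked <- sample_lines(tmp, 10)
length(picked)
```

Fixing the seed makes the sample repeatable between runs, which the shell commands above do not guarantee.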

Data Prep

The tm package for the R language was used. It provides many tools and internal functions for cleaning up a corpus. We applied a series of transformations (see the Appendix) and then obtained the following summary of the loaded corpus, which we use to study the data.

##                      Length Class             Mode
## en_US.blogs.txt.2K   2      PlainTextDocument list
## en_US.news.txt.2K    2      PlainTextDocument list
## en_US.twitter.txt.2K 2      PlainTextDocument list

Corpus Statistics

Document-term matrices (DTMs) are among the most common formats for organizing text for computation. Some important statistics are given in the subsections below.

Corpus Document Statistics

## <<DocumentTermMatrix (documents: 3, terms: 69175)>>
## Non-/sparse entries: 105636/101889
## Sparsity           : 49%
## Maximal term length: 79
## Weighting          : term frequency (tf)
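
To make the sparsity figure concrete: it is the share of cells in the documents-by-terms matrix that are zero. The sketch below builds a document-term matrix from a made-up three-document corpus (the texts are illustrative assumptions, not taken from the sample files) and then repeats the arithmetic on the figures printed above.

```r
library(tm)

# Toy corpus standing in for the three sampled files (contents are made up).
toy <- VCorpus(VectorSource(c(
  "the cat sat on the mat",
  "the dog sat on the log",
  "cats and dogs"
)))
dtm <- DocumentTermMatrix(toy)

m <- as.matrix(dtm)               # dense documents x terms matrix of counts
zero_cells    <- sum(m == 0)      # the "sparse" entries
nonzero_cells <- sum(m != 0)      # the "non-sparse" entries
sparsity <- zero_cells / length(m)

# The same arithmetic applied to the figures reported above.
report_sparsity <- 101889 / (105636 + 101889)
round(report_sparsity, 2)
```

The two entry counts in the report add up to 3 x 69175 = 207525 cells, which is why 101889 zero cells correspond to roughly 49% sparsity.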

Word Counts

##   rowSums.dtm_matrix.
## 1              427403
## 2              389929
## 3              138949

Line Count

## [1] 316908

Word Frequencies (Top 10)

## said will  one just like  can time  get  new  now 
## 5842 5602 5236 4556 4366 4185 3825 3309 3267 2748
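
Per-document word counts and corpus-wide term frequencies of this kind can both be read off the matrix form of the DTM. The following is a minimal sketch on a made-up toy corpus; the variable names are illustrative and the exact code behind the figures above is not shown in this report.

```r
library(tm)

# Toy corpus (contents are made up for illustration).
toy <- VCorpus(VectorSource(c(
  "time will tell said one",
  "one said time will fly",
  "just one time"
)))
m <- as.matrix(DocumentTermMatrix(toy))

words_per_doc <- rowSums(m)                        # total term count per document
term_freq <- sort(colSums(m), decreasing = TRUE)   # corpus-wide count per term
head(term_freq, 3)
```

For the full corpus a dense matrix may not fit in memory. The slam package, which tm builds on, provides row_sums and col_sums that operate on the sparse representation directly.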

Word Associations

We looked at additional tools and functions in the package with the goal of analyzing the word corpus in a more systematic fashion. The findFreqTerms function has proven useful for gaining more insight into the data.

Currently we limit our analysis of frequently occurring terms to the nine words that occur extremely frequently (more than 3000 times). For those words we examine word associations with the findAssocs function.

##       [,1]  [,2]  [,3]
## can   2698 26609 28022
## get  20662 24189 26598
## just   786 23152 24927
## like 20896 24328 26873
## new   1064  8584 11203
## one   2739  7099 31481
## said 19504 23130 24494
## time  2413 26649 27999
## will  1563  8845 12088
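
As a small illustration of how the two functions fit together, the sketch below runs findFreqTerms and findAssocs on a made-up five-document corpus. The texts and the thresholds (lowfreq = 3, corlimit = 0.5) are assumptions chosen for the toy data, not the settings used for the table above.

```r
library(tm)

# Toy corpus (contents are made up for illustration).
toy <- VCorpus(VectorSource(c(
  "new time new year",
  "new time again",
  "old year",
  "new time soon",
  "old news"
)))
dtm <- DocumentTermMatrix(toy)

# Terms that occur at least 3 times across the toy corpus.
frequent <- findFreqTerms(dtm, lowfreq = 3)

# For each frequent term, the other terms whose per-document counts
# correlate with it at 0.5 or above.
assocs <- findAssocs(dtm, frequent, corlimit = 0.5)
assocs
```

findAssocs computes correlations across documents, so with only three documents, as in the DTM above, each correlation rests on three data points. Treating smaller units such as individual lines as documents would likely give more informative associations.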

Graphing Word Relationships

A visual display of the data characteristics helps significantly. We draw a line plot of the most common word frequencies and their associations across the corpus for a series of correlations.
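
A minimal sketch of such a line plot, using base R graphics and the top-10 frequencies listed in the Word Frequencies (Top 10) section, is given here. It covers the frequencies only, not the associations; the axis labels and title are illustrative choices, and pdf(NULL) is used only so the sketch runs without a display.

```r
# Top-10 term frequencies as reported in the Word Frequencies (Top 10) section.
top10 <- c(said = 5842, will = 5602, one = 5236, just = 4556, like = 4366,
           can = 4185, time = 3825, get = 3309, new = 3267, now = 2748)

pdf(NULL)   # null device: nothing is written to disk
plot(seq_along(top10), top10, type = "b", xaxt = "n",
     xlab = "Term", ylab = "Frequency",
     main = "Most frequent terms (sample)")
axis(1, at = seq_along(top10), labels = names(top10), las = 2)  # term labels on the x axis
invisible(dev.off())
```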

Next Steps

Our goal is to explore and create an effective model to predict text entry. At our disposal is the existing corpus of words. We plan to a) expand beyond the current sample of the corpus, b) come up with more sophisticated word associations, c) examine deeper language characteristics such as n-grams, d) review the current text prediction literature, and e) construct and deploy a model for text prediction.

Following the completion of the above steps, we will start construction of our Shiny app, which will integrate the inference engine with a GUI.
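
To illustrate what step c) might involve, the sketch below counts bigrams (adjacent word pairs) in base R and uses them to suggest a next word. It is a toy under stated assumptions, not the planned model: tokenize, bigram_counts and predict_next are hypothetical helpers, the tokenizer keeps only the letters a to z and apostrophes, and the three input lines are made up.

```r
# Split text into lowercase word tokens (letters and apostrophes only).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Count adjacent word pairs (bigrams) within each line.
bigram_counts <- function(lines) {
  pairs <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
  }))
  sort(table(pairs), decreasing = TRUE)
}

# Suggest the most frequent follower of `word` seen in the counts.
predict_next <- function(counts, word) {
  hits <- counts[startsWith(names(counts), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^[^ ]+ ", "", names(hits)[which.max(hits)])
}

lines <- c("I will go now", "I will see you", "you will go far")
counts <- bigram_counts(lines)
predict_next(counts, "will")
```

On these three lines the pair "will go" occurs twice and "will see" once, so the sketch suggests "go" after "will". A word that never appears as the first half of a pair yields NA.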

Conclusion and Goals

We have analyzed a sample of the English corpus and have unearthed some initial statistics and some initial word relationships. As we move towards a full solution, we believe our analysis will become more sophisticated and encompass larger portions of the data.

Appendix

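# Paths are specific to the original author's machine; adjust as needed.
setwd("/Users/rajmaddali/GitHub/Capstone/final/en_US")
cname <- "/Users/rajmaddali/GitHub/Capstone/final/en_US/en_US.twitter.txt.2K"   # not used in the code shown
cdirname <- "/Users/rajmaddali/GitHub/Capstone/final/en_US/sample"
library(tm)        # corpus handling and text transformations
library(utils)
library(R.utils)

# Corpus: one document per file in the sample directory
docs <- Corpus(DirSource(cdirname))
# summary(docs)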
# Cleanup
# Content transformers: toSpace replaces a pattern with a space;
# toEnglish drops characters that cannot be converted to ASCII
# (defined here but not applied in the steps below).
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
toEnglish <- content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = ""))

docs <- tm_map(docs, toSpace, ",")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, '"')
docs <- tm_map(docs, removePunctuation)
# Base functions such as tolower are wrapped in content_transformer so
# that the documents keep their tm classes.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, stopwords("english"))   # remove English stopwords
docs <- tm_map(docs, PlainTextDocument)   # ensure every document is a PlainTextDocument