Data Science Capstone - Milestone Report

Introduction

This document describes the data acquisition, cleaning and exploratory analysis that I have done so far for the Coursera Data Science Capstone Project. It also outlines my thinking on the remaining tasks in the project, chiefly building the Shiny app for word prediction. Here is a summary of the line and word counts for the three US files.

The US blogs file has 899,288 lines and approximately 4,799,000 words.

The US news file has 1,010,242 lines and approximately 886,300 words.

The US Twitter file has 2,360,148 lines and approximately 4,424,800 words.

Data Acquisition and Pre-Processing

Prepared the environment by loading the needed packages and enabling multicore processing
Pulled in the data
Created a function to pull a random percentage of the records from each of the three US files and write them to a separate file
Read in the three sample files and combined them into one
Cleaned up some contractions in the files
Created a corpus and pre-processed the text data
Created a TermDocumentMatrix

##   Dataset Filesize   Lines    Words
## 1   Blogs       NA  899288 37334131
## 2    News       NA   77259  2643969
## 3 Twitter       NA 2360148 30373583
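The sampling step listed above is not shown in the report. The following is a minimal base-R sketch of one way to do it, assuming a hypothetical helper `sample_lines()`, an illustrative sampling fraction and an arbitrary seed. The connection is opened in binary mode as a precaution: reading in text mode can stop early at an embedded control character on some platforms, which is possibly why the News line count in the table (77,259) is lower than the figure quoted in the introduction (1,010,242).

```r
# Hypothetical helper (not from the report): keep roughly `fraction`
# of the lines of one file and write them to a separate file.
sample_lines <- function(in_path, out_path, fraction = 0.05, seed = 1234) {
  # Binary mode so embedded control characters do not truncate the read
  con <- file(in_path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

  set.seed(seed)  # reproducible sample
  keep <- rbinom(length(lines), size = 1, prob = fraction) == 1
  writeLines(lines[keep], out_path, useBytes = TRUE)
  invisible(sum(keep))  # number of lines written
}

# Usage with a small temporary file standing in for one of the US files
demo_in  <- tempfile(fileext = ".txt")
demo_out <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), demo_in)
n_kept <- sample_lines(demo_in, demo_out, fraction = 0.1)
```

Drawing one coin flip per line keeps the sample proportional across files of very different sizes. Calling the helper once per file and concatenating the outputs would give the combined sample.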

Exploratory Analysis:

After a bit of cleaning and inspecting the data sets, I did some text mining and analytics using the following steps:

- Inspected the Term Document Matrix (TDM)
- Prepared the TDM for analysis, combining the sample files into one
- Plotted the top terms by word frequency
- Created functions to tokenize the n-grams using the NLP package
- Transformed the text data with the tokenizer function and plotted the top bigrams
- Transformed the text data with the tokenizer function and plotted the top trigrams
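The tokenizer functions themselves are not reproduced in this report. As a rough illustration of what an n-gram tokenizer does, here is a sketch that uses base R rather than the NLP package, so that it runs without extra dependencies. The helper name `ngram_tokens()` and the example sentence are assumptions made for this illustration.

```r
# Hypothetical helper (not from the report): split one string into
# word n-grams using base R only.
ngram_tokens <- function(text, n = 2) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]  # drop empty strings left by leading separators
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "Happy New Year from New York City"
bigrams  <- ngram_tokens(sentence, n = 2)
trigrams <- ngram_tokens(sentence, n = 3)
```

Applying such a function to every document and tabulating the result, for example with `sort(table(...), decreasing = TRUE)`, gives the n-gram frequencies that the bigram and trigram plots are based on.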

Building the Corpus and Preparing the data for plotting
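The report does not show which contractions were handled or how the text was normalised before the corpus was built. The sketch below is only one plausible version of that cleaning, written in base R. The helper `clean_text()` and the three contraction rules are assumptions, not the rules actually used.

```r
# Hypothetical helper (not from the report): normalise raw text
# before building a corpus.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("won't", "will not", x, fixed = TRUE)  # irregular contractions first
  x <- gsub("can't", "cannot", x, fixed = TRUE)
  x <- gsub("n't", " not", x, fixed = TRUE)        # generic "-n't" ending
  x <- gsub("[^a-z ]+", " ", x)                    # drop digits, punctuation, stray bytes
  x <- gsub(" +", " ", x)                          # collapse repeated spaces
  trimws(x)
}

cleaned <- clean_text("I can't wait, it won't be long!")
```

The irregular forms are replaced before the generic rule so that "won't" does not turn into "wo not".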

Creating a Term Document Matrix

## Term Document Matrix and Exploratory analysis of the Corpus 
cleanTDM <- TermDocumentMatrix(cleanedSample)

# inspect Term Document Matrix 
inspect(cleanTDM)

## <<TermDocumentMatrix (terms: 146458, documents: 333667)>>
## Non-/sparse entries: 3411713/48864789773
## Sparsity           : 100%
## Maximal term length: 105
## Weighting          : term frequency (tf)
## Sample             :
##       Docs
## Terms  10851 16237 37360 53216 53970 54046 61677 62684 81830 88237
##   can      1     0     0     2     6    31     2     1     0     2
##   day      0     0     0     0     0     5     0     0     0     3
##   get      1     4     6     3     0     7     1     0     0     2
##   just     0     1     2     6     1     2     5     0     3     3
##   like     2     3     2     5     0     1     2     0     3     1
##   love     0     1     1     2     0     1     0     0     0     0
##   make     3     0     1     3     1    14     2     2     1     1
##   one      2     1     0     1     0    15     0     1     1     1
##   time     0     2     5     1     1     6    11     0     0     1
##   will     9     0     1     1     0    41     4     1     2     4

dim(cleanTDM)

## [1] 146458 333667

terms <- Terms(cleanTDM)
length(terms)

## [1] 146458

unique(Encoding(terms))

## [1] "unknown"

Preparing the TDM for plotting and analysis

## [1]    140 333667

## tomorrow    final     away    check    tweet    didnt 
##     3471     3534     3536     3578     3606     3617

## today   say thing great  back  need peopl  come  dont  year  want  look 
## 10050 10306 10522 10915 11194 11432 11534 11567 11784 11801 12474 12573 
## think   new  work   see   now thank  good  know  make   day  love   can 
## 12819 12820 12849 13454 14269 14746 15364 15497 16100 18175 19052 19377 
##  time  will   one  like   get  just 
## 19468 22072 22467 24316 24455 25339

## [1] 140

##   [1] "back"     "good"     "like"     "look"     "man"      "need"    
##   [7] "talk"     "tri"      "feel"     "first"    "part"     "someth"  
##  [13] "didnt"    "one"      "stop"     "well"     "home"     "hous"    
##  [19] "best"     "call"     "cant"     "come"     "day"      "friend"  
##  [25] "girl"     "help"     "made"     "morn"     "next"     "peopl"   
##  [31] "tell"     "thing"    "think"    "time"     "week"     "work"    
##  [37] "around"   "dont"     "much"     "realli"   "sinc"     "world"   
##  [43] "can"      "find"     "give"     "itÃ"      "littl"    "mani"    
##  [49] "will"     "â<U+0080><U+009C>"      "even"     "great"    "live"     "take"    
##  [55] "ever"     "everi"    "get"      "keep"     "mean"     "old"     
##  [61] "place"    "play"     "start"    "still"    "tomorrow" "want"    
##  [67] "way"      "also"     "fun"      "got"      "make"     "put"     
##  [73] "thought"  "just"     "new"      "person"   "right"    "yes"     
##  [79] "said"     "see"      "use"      "know"     "post"     "today"   
##  [85] "never"    "though"   "year"     "end"      "last"     "long"    
##  [91] "big"      "game"     "guy"      "head"     "lot"      "famili"  
##  [97] "book"     "pleas"    "anoth"    "show"     "someon"   "follow"  
## [103] "love"     "chang"    "ask"      "sure"     "say"      "seem"    
## [109] "night"    "that"     "thank"    "final"    "may"      "away"    
## [115] "let"      "now"      "two"      "check"    "life"     "everyon" 
## [121] "ill"      "run"      "alway"    "ive"      "school"   "kid"     
## [127] "open"     "happen"   "wait"     "your"     "happi"    "better"  
## [133] "miss"     "watch"    "read"     "word"     "hope"     "tonight" 
## [139] "lol"      "tweet"
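The chunk that reduced the matrix to 140 terms and produced the sorted frequencies is hidden in this report. With the tm package, a common route is `removeSparseTerms()` followed by `rowSums()` on the dense matrix, but that is an assumption about what was done here. The base-R sketch below shows the same idea on a toy term-by-document matrix; the documents and the `min_docs` threshold are made up for illustration.

```r
# Toy documents standing in for the cleaned sample (assumed data)
docs <- c("just love this day", "just one more day",
          "just love it", "one more time")

# Build a small term-by-document count matrix
tokens   <- strsplit(docs, " ", fixed = TRUE)
vocab    <- sort(unique(unlist(tokens)))
term_doc <- vapply(tokens,
                   function(tk) as.integer(table(factor(tk, levels = vocab))),
                   integer(length(vocab)))
rownames(term_doc) <- vocab

# Keep terms that occur in at least `min_docs` documents
# (a crude stand-in for a sparsity filter)
min_docs <- 2
dense    <- term_doc[rowSums(term_doc > 0) >= min_docs, , drop = FALSE]

# Total frequency per term in ascending order, like the listing above
freq <- sort(rowSums(dense))
```

Filtering on document frequency before summing keeps the matrix small enough to convert to a dense form, which matters when the full matrix is almost entirely zeros.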

Plotting of Word Frequencies
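The plots themselves are not reproduced here. As a sketch of how such a chart can be drawn, the following base-R code plots the ten most frequent terms, with the counts copied from the frequency listing above. The colour and title are arbitrary choices, and the original plots may have used a different plotting package.

```r
# Ten most frequent terms, counts taken from the frequency listing above
top_terms <- c(just = 25339, get = 24455, like = 24316, one = 22467,
               will = 22072, time = 19468, can = 19377, love = 19052,
               day = 18175, make = 16100)

# barplot() returns the bar midpoints, which is handy for adding labels later
bar_mids <- barplot(top_terms, las = 2, col = "steelblue",
                    main = "Top 10 terms in the sample corpus",
                    ylab = "Frequency")
```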

Observation:

Some observations stood out in the preliminary findings. The top 30 single terms are mainly common words of one or two syllables. The largest part of the corpus comes from Twitter, which is written in brief and simple language. "New York" and "New York City" both feature among the bigrams and trigrams, which suggests that the city is a popular topic. "Happy New Year" and "Happy Mother's Day" also rank well in popularity and frequency of use.

Plans for Shiny WebApp

The model still needs some adjustments for speed and precision tuning. Once the right approach is chosen, I will build an n-gram model and calculate the probabilities of trigrams given a specific unigram or bigram. A PC with plenty of RAM or a home server will help when deciding between faster predictions and better predictions.
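To make the planned calculation concrete, here is a minimal base-R sketch of a maximum-likelihood trigram estimate: the probability of a word given the two preceding words is the count of that trigram divided by the number of times the two-word context is followed by any word. The training sentences and the helper `predict_next()` are assumptions for illustration; the final model may use a different estimator, for example one with smoothing or back-off.

```r
# Toy training text (assumed data, not taken from the corpus)
train <- c("happy new year", "happy new year", "happy new home",
           "new york city", "new york city", "new york times")

# Collect every trigram as a (two-word context, next word) pair
pairs <- do.call(rbind, lapply(strsplit(train, " ", fixed = TRUE), function(w) {
  if (length(w) < 3) return(NULL)
  i <- seq_len(length(w) - 2)
  data.frame(context = paste(w[i], w[i + 1]), nxt = w[i + 2],
             stringsAsFactors = FALSE)
}))

# Hypothetical helper: P(next | context) = count(trigram) / count(context)
predict_next <- function(context, pairs) {
  hits <- pairs$nxt[pairs$context == context]
  if (length(hits) == 0) return(NULL)  # unseen context; a real model would back off
  sort(table(hits) / length(hits), decreasing = TRUE)
}

p <- predict_next("happy new", pairs)
```

In this toy data, `p` ranks "year" ahead of "home" for the context "happy new". Storing only the top few continuations per context is one way to trade prediction quality for memory and speed in the Shiny app.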

Word Prediction Milestone report

Damjan Stefanovski

October 28, 2017