Introduction

This document describes the data acquisition, cleaning, and exploratory analysis that I have done so far for the Coursera Data Science Capstone Project. It also outlines my thinking on the remaining tasks, chiefly building the Shiny app for word prediction. Here is a summary of the line and word counts for the three US files.
The US blogs file has 899,288 lines and approximately 37.3 million words.
The US news file has 1,010,242 lines; the table below is based on the 77,259 lines that were read, which contain approximately 2.6 million words.
The US Twitter file has 2,360,148 lines and approximately 30.4 million words.
##   Dataset Filesize   Lines    Words
## 1   Blogs       NA  899288 37334131
## 2    News       NA   77259  2643969
## 3 Twitter       NA 2360148 30373583
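A minimal sketch of how line and word counts like these can be obtained in base R. The function name `count_lines_words` and the demo file are hypothetical and not part of the original analysis. Opening the connection in binary mode is an assumption on my side: it is a common precaution against a read stopping early on stray control characters, which can make a line count come out lower than expected.

```r
# Count lines and whitespace-separated words in a text file.
# The connection is opened in binary mode ("rb") as a precaution against
# the read stopping early on embedded control characters.
count_lines_words <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Split each line on runs of whitespace and count the non-empty tokens
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  words <- sum(vapply(tokens, function(tok) sum(nzchar(tok)), integer(1)))
  data.frame(Lines = length(lines), Words = words)
}

# Hypothetical demo file standing in for one of the corpus files
demo_path <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", "", "the lazy dog"), demo_path)
demo_counts <- count_lines_words(demo_path)
print(demo_counts)
```

Counting words by splitting on whitespace is only an approximation; a different tokenizer will give somewhat different totals.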
After a bit of cleaning and inspecting the data sets, I did some text mining using the following steps:
- Inspected the Term Document Matrix (TDM)
- Prepared the TDM for analysis, combining the sample files into one
- Plotted the top terms by word frequency
- Created functions to tokenize the n-grams using the NLP package
- Transformed the text data with the tokenizer function and plotted the top bigrams
- Transformed the text data with the tokenizer function and plotted the top trigrams
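The tokenizing steps above can be sketched as follows. This is a base R stand-in, not the NLP-package tokenizer used in the analysis; the function name `make_ngrams` and the sample sentences are hypothetical and chosen only to illustrate the idea.

```r
# Build word n-grams from a single character string using base R only.
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(trimws(text)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  # A text with fewer than n words yields no n-grams
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical sample sentences
sample_text <- c("happy new year to you",
                 "happy new year everyone",
                 "new york city is busy")

bigrams  <- unlist(lapply(sample_text, make_ngrams, n = 2))
trigrams <- unlist(lapply(sample_text, make_ngrams, n = 3))

# Frequency tables, most frequent first
bigram_freq  <- sort(table(bigrams), decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)
print(head(bigram_freq, 5))
```

Each sentence is tokenized separately so that no n-gram spans two documents.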
## Term Document Matrix and Exploratory analysis of the Corpus
cleanTDM <- TermDocumentMatrix(cleanedSample)
# inspect Term Document Matrix
inspect(cleanTDM)
## <<TermDocumentMatrix (terms: 146458, documents: 333667)>>
## Non-/sparse entries: 3411713/48864789773
## Sparsity : 100%
## Maximal term length: 105
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms 10851 16237 37360 53216 53970 54046 61677 62684 81830 88237
## can 1 0 0 2 6 31 2 1 0 2
## day 0 0 0 0 0 5 0 0 0 3
## get 1 4 6 3 0 7 1 0 0 2
## just 0 1 2 6 1 2 5 0 3 3
## like 2 3 2 5 0 1 2 0 3 1
## love 0 1 1 2 0 1 0 0 0 0
## make 3 0 1 3 1 14 2 2 1 1
## one 2 1 0 1 0 15 0 1 1 1
## time 0 2 5 1 1 6 11 0 0 1
## will 9 0 1 1 0 41 4 1 2 4
dim(cleanTDM)
## [1] 146458 333667
terms <- Terms(cleanTDM)
length(terms)
## [1] 146458
unique(Encoding(terms))
## [1] "unknown"
## [1] 140 333667
## tomorrow final away check tweet didnt
## 3471 3534 3536 3578 3606 3617
## today say thing great back need peopl come dont year want look
## 10050 10306 10522 10915 11194 11432 11534 11567 11784 11801 12474 12573
## think new work see now thank good know make day love can
## 12819 12820 12849 13454 14269 14746 15364 15497 16100 18175 19052 19377
## time will one like get just
## 19468 22072 22467 24316 24455 25339
## [1] 140
## [1] "back" "good" "like" "look" "man" "need"
## [7] "talk" "tri" "feel" "first" "part" "someth"
## [13] "didnt" "one" "stop" "well" "home" "hous"
## [19] "best" "call" "cant" "come" "day" "friend"
## [25] "girl" "help" "made" "morn" "next" "peopl"
## [31] "tell" "thing" "think" "time" "week" "work"
## [37] "around" "dont" "much" "realli" "sinc" "world"
## [43] "can" "find" "give" "itÃ" "littl" "mani"
## [49] "will" "â<U+0080><U+009C>" "even" "great" "live" "take"
## [55] "ever" "everi" "get" "keep" "mean" "old"
## [61] "place" "play" "start" "still" "tomorrow" "want"
## [67] "way" "also" "fun" "got" "make" "put"
## [73] "thought" "just" "new" "person" "right" "yes"
## [79] "said" "see" "use" "know" "post" "today"
## [85] "never" "though" "year" "end" "last" "long"
## [91] "big" "game" "guy" "head" "lot" "famili"
## [97] "book" "pleas" "anoth" "show" "someon" "follow"
## [103] "love" "chang" "ask" "sure" "say" "seem"
## [109] "night" "that" "thank" "final" "may" "away"
## [115] "let" "now" "two" "check" "life" "everyon"
## [121] "ill" "run" "alway" "ive" "school" "kid"
## [127] "open" "happen" "wait" "your" "happi" "better"
## [133] "miss" "watch" "read" "word" "hope" "tonight"
## [139] "lol" "tweet"
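The code that reduced the matrix to 140 terms and produced the frequency listing is not shown above. A reduction of this kind is commonly done with `removeSparseTerms()` from the tm package, followed by row sums. The sketch below illustrates the same idea in base R on a made-up matrix; the counts and the 0.6 threshold are assumptions for illustration, not the values used in the analysis.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# The counts are made up for illustration only.
toy_tdm <- matrix(
  c(2, 0, 1, 3,
    0, 0, 0, 1,
    1, 1, 0, 2,
    0, 4, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("just", "zebra", "like", "tweet"),
                  Docs  = c("d1", "d2", "d3", "d4"))
)

# Sparsity of a term: share of documents in which it does not occur
term_sparsity <- rowMeans(toy_tdm == 0)

# Keep terms whose sparsity is below the chosen threshold (0.6 is arbitrary)
kept <- toy_tdm[term_sparsity < 0.6, , drop = FALSE]

# Total frequency of each remaining term, in increasing order
term_freq <- sort(rowSums(kept))
print(dim(kept))
print(term_freq)
```

A lower threshold keeps fewer, more widespread terms; a higher one keeps more of the rare vocabulary at the cost of a larger matrix.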
Some observations stood out in the preliminary findings. The top 30 single terms are mainly common one- and two-syllable words. The largest part of the corpus comes from Twitter, which is written in brief and simple language. "New York" and "New York City" both feature among the top bigrams and trigrams, which suggests that the city is a popular topic. "Happy New Year" and "Happy Mothers Day" also hold a respectable place in frequency of use.
The next steps are to tune the model for speed and precision, choose the right approach for building the n-gram model, and calculate the probability of each trigram given a specific unigram or bigram. A PC with plenty of RAM or a home server is helpful when you need to trade off prediction speed against prediction quality.
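As a rough sketch of the probability step, the conditional probability of a third word given a two-word history can be estimated from relative frequencies. The trigram counts and the function name `predict_next` below are hypothetical; the final model may well use a different estimator.

```r
# Hypothetical trigram counts (made up for illustration)
trigram_counts <- c("happy new year"  = 8, "happy new home"  = 2,
                    "new york city"   = 6, "new york times"  = 3,
                    "new york state"  = 1)

# Estimate P(w3 | w1 w2) as count(w1 w2 w3) divided by the total count
# of all trigrams that start with "w1 w2".
predict_next <- function(history, counts, top = 3) {
  prefix <- paste0(tolower(trimws(history)), " ")
  hits <- counts[startsWith(names(counts), prefix)]
  # Unseen history: no estimate available from the trigram table
  if (length(hits) == 0) return(numeric(0))
  probs <- sort(hits / sum(hits), decreasing = TRUE)
  # Keep only the predicted word as the name of each probability
  names(probs) <- sub(prefix, "", names(probs), fixed = TRUE)
  head(probs, top)
}

york_pred <- predict_next("new york", trigram_counts)
print(york_pred)
```

Storing only the most frequent n-grams keeps lookups fast and memory use low, but it leaves more histories without any prediction, which is the speed versus quality trade-off mentioned above.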