This report explains the exploratory analysis I carried out for the Capstone text prediction project. It describes the data clean-up done to date and sets out my goals for the eventual prediction algorithm and application.
The project uses three US English text files:
• en_US.twitter.txt: a 163 189 KB file. In R the character vector takes up 319 MB and holds 2 360 148 text lines.
• en_US.news.txt: a 200 989 KB file. In R the character vector takes up 257 MB and holds 1 010 242 text lines.
• en_US.blogs.txt: a 200 235 KB file. In R the character vector takes up 255.4 MB and holds 899 288 text lines.
I explored and cleaned the text in the three files by removing profane words, all punctuation marks and special characters, all capitalisation, all double spacing and all non-English words.
To compensate for limited available memory, I cleaned each of the three files in turn, wrote each cleaned file to a new text file with a "clean" prefix, and then released the memory used. The cleaned files can then be imported as needed and used for token generation. For each file I ran a custom-built clean.text() function, which in turn calls five functions:
• removeBad(): a custom-built profanity filter.
• str_replace_all(): removes all punctuation marks and special characters using a regular expression.
• to_lower(): converts all text to lower case.
• str_squish(): removes all unnecessary spaces.
• is.word(): a custom-built function that checks each word against the GradyAugmented dictionary and removes any word not found in it.
Each file lost a few sentences: one-word sentences whose only word was not in the English dictionary. Although much of the cleaning could have been done in quanteda, this approach gave me greater control over memory use at each step.
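The cleaning steps described above can be sketched in base R as follows. This is a minimal illustration, not the clean.text() function used for the report: the names clean_text() and clean_file() are hypothetical, the two small word vectors stand in for the real profanity list and the GradyAugmented dictionary, and base R functions replace the stringr calls. The steps also run in a slightly different order (lower-casing first) so that one simple regular expression is enough.

```r
# Hypothetical stand-ins for the real profanity list and dictionary.
bad_words  <- c("darn", "heck")
dictionary <- c("the", "cat", "sat", "on", "mat", "what", "a", "day", "darn")

clean_text <- function(lines, bad_words, dictionary) {
  lines <- tolower(lines)                     # remove capitalisation
  lines <- gsub("[^a-z ]", " ", lines)        # punctuation, digits, special characters
  lines <- gsub("\\s+", " ", trimws(lines))   # squeeze repeated spaces
  words <- strsplit(lines, " ", fixed = TRUE)
  # Keep only dictionary words that are not on the profanity list.
  kept  <- lapply(words, function(w) w[w %in% dictionary & !(w %in% bad_words)])
  lines <- vapply(kept, paste, character(1), collapse = " ")
  lines[nzchar(lines)]                        # drop lines left empty
}

# Clean one file, write the result to a new file, then release the memory.
clean_file <- function(in_path, out_path, bad_words, dictionary) {
  lines <- readLines(in_path, encoding = "UTF-8", skipNul = TRUE)
  writeLines(clean_text(lines, bad_words, dictionary), out_path)
  rm(lines)
  invisible(gc())
}

clean_text(c("The CAT sat, on the mat!!", "Zzyzx", "What a darn   day"),
           bad_words, dictionary)
# "the cat sat on the mat" "what a day"
```

The second input line disappears entirely because its only word is not in the dictionary, which is the same effect that removed a few one-word sentences from each file.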
Within quanteda I did additional cleaning, removing single characters and English stop-words, and converted each of the three objects to a document-feature matrix to explore each corpus.
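To make the token and type counts reported below concrete, here is a small base R sketch of the same kind of exploration. It is not the quanteda code used for the report: explore_corpus() is a hypothetical helper, and the short stop_words vector is an assumed stand-in for quanteda's English stop-word list.

```r
# Hypothetical, very short stop-word list for illustration only.
stop_words <- c("the", "a", "and", "of", "to")

explore_corpus <- function(lines, stop_words, top_n = 10) {
  words <- unlist(strsplit(lines, " ", fixed = TRUE))
  # Drop single characters and stop-words, as in the additional cleaning step.
  words <- words[nchar(words) > 1 & !(words %in% stop_words)]
  freq  <- sort(table(words), decreasing = TRUE)
  list(
    tokens = length(words),    # every remaining word occurrence
    types  = length(freq),     # distinct words
    top    = head(freq, top_n) # most frequent words, for a bar chart or wordcloud
  )
}

res <- explore_corpus(c("the cat sat on the mat", "a cat can nap"), stop_words)
res$tokens  # 7
res$types   # 6
```

Tokens count every occurrence, while types count each distinct word once, so the type count can never exceed the token count.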
Blogs text
17 811 278 tokens (words) with 16 153 188 types (different words).
Wordcloud of Blogs tokens
The first six words dominate the Blogs vocabulary, including ‘can’, ‘one’, ‘just’, ‘like’ and ‘time’, after which the word counts even out.
Histogram of ten most frequent Blogs tokens
News text
17 934 852 tokens (words) with 16 872 177 types (different words).
Wordcloud of News tokens
The word ‘said’ dominates the News vocabulary. It also appears in the top 120 words for Blogs (36 712) and Twitter (18 156).
Histogram of ten most frequent News tokens
Twitter text
14 661 439 tokens (words) with 14 269 748 types (different words).
Wordcloud of Twitter tokens
Although the top six words have high counts, the Twitter words are much more evenly distributed.
Histogram of ten most frequent Twitter tokens
Two words appear in the top 10 of all three groups: ‘one’ (no. 1 in Blogs, no. 2 in News) and ‘can’ (no. 2 in Blogs and no. 2 in Twitter). ‘Just’ and ‘like’ are both top words for Twitter and Blogs. I expect these four words, along with ‘said’, to appear frequently in the n-grams.
I’ll use the quanteda package to build the n-grams, from 1-grams to 5-grams, storing the data in a data.table for the Shiny app to reference.
I’ll use SQL with the sqldf package to extract the most frequently occurring prediction for each base and save these into an output data.table for my Shiny app.
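The extraction step amounts to a "group by base, keep the row with the highest frequency" query. The sketch below shows that logic in base R rather than SQL, so it is a swapped-in technique and not the planned implementation. The function name top_prediction() and the column names base, predicted and freq are assumptions made for illustration.

```r
# Keep, for every base, the single predicted word with the highest frequency.
top_prediction <- function(ngrams) {
  # Sort by base, then by descending frequency within each base.
  ordered <- ngrams[order(ngrams$base, -ngrams$freq), ]
  # The first row seen for each base is now its most frequent prediction.
  best <- ordered[!duplicated(ordered$base), ]
  rownames(best) <- NULL
  best
}

# Hypothetical bigram frequencies.
ngrams <- data.frame(
  base      = c("i", "i", "i", "thank", "thank"),
  predicted = c("am", "have", "can", "you", "god"),
  freq      = c(40, 25, 10, 90, 5),
  stringsAsFactors = FALSE
)

top_prediction(ngrams)
# One row per base: "i" with "am" (40) and "thank" with "you" (90)
```

Storing only the top prediction per base keeps the lookup table small, at the cost of not being able to offer second or third suggestions.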
I am developing the solution in RStudio on a Windows PC with 4 GB of memory and a 2.00 GHz CPU. Given the limited memory and processing power (the advice on the discussion forum is to have 16 GB of memory available), I plan to follow an iterative approach to my high-level steps:
• Clean up the incoming documents and break them up into units.
• For each unit: build a corpus, tokenize the corpus and generate the n-grams.
• Assemble the subcomponents by n-gram size, then break the n-grams into base and predicted words.
• Aggregate to summarise each n-gram file into frequencies by base.
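The iterative steps above can be sketched in base R as follows. This is an illustration of the idea under stated assumptions, not the quanteda-based pipeline planned for the project: the function names and the unit_size parameter are hypothetical, and the input is assumed to be already cleaned, space-separated text.

```r
# All n-grams of size n from one cleaned line of text.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# N-gram counts for one unit (a block of lines small enough for memory).
count_unit <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  if (length(grams) == 0) {
    return(data.frame(ngram = character(0), freq = integer(0),
                      stringsAsFactors = FALSE))
  }
  counts <- table(grams)
  data.frame(ngram = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Process unit by unit, then assemble and aggregate into frequencies by base.
ngram_frequencies <- function(lines, n, unit_size = 1000) {
  units    <- split(lines, ceiling(seq_along(lines) / unit_size))
  per_unit <- do.call(rbind, lapply(units, count_unit, n = n))
  if (is.null(per_unit) || nrow(per_unit) == 0) {
    return(data.frame(base = character(0), predicted = character(0),
                      freq = integer(0), stringsAsFactors = FALSE))
  }
  totals <- aggregate(freq ~ ngram, data = per_unit, FUN = sum)
  totals$base      <- sub(" ?[^ ]+$", "", totals$ngram)  # all but the last word
  totals$predicted <- sub("^.* ", "", totals$ngram)      # the last word
  totals <- totals[order(-totals$freq), c("base", "predicted", "freq")]
  rownames(totals) <- NULL
  totals
}

ngram_frequencies(c("the cat sat", "the cat ran", "a dog"), n = 2, unit_size = 2)
```

Only one unit's n-grams are generated at a time, so peak memory depends on unit_size rather than on the size of the whole corpus. In this sketch the per-unit count tables are still combined in memory before aggregation. On a 4 GB machine they could instead be written to disk per unit and summed afterwards.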
I’ll measure and manage memory use, guided by references such as http://adv-r.had.co.nz/memory.html
The free shinyapps.io account I’ll be using limits the app to 1 GB. If I run out of space, I will most likely remove the lower-frequency n-grams. When building the application, I’ll keep in mind the limits of the shinyapps.io account (load time, memory and response time) and the trade-off they impose on accuracy.
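One way to remove lower-frequency n-grams until the table fits a memory budget is sketched below in base R. The function prune_to_budget() and its max_bytes parameter are hypothetical names, and the table is assumed to have a numeric freq column. object.size() is the base R function for estimating the memory an object occupies.

```r
# Raise the minimum frequency step by step until the table fits the budget.
prune_to_budget <- function(ngrams, max_bytes) {
  min_freq <- 1
  while (nrow(ngrams) > 0 && as.numeric(object.size(ngrams)) > max_bytes) {
    min_freq <- min_freq + 1
    ngrams   <- ngrams[ngrams$freq >= min_freq, , drop = FALSE]
  }
  ngrams
}

# Hypothetical table: half the n-grams occur once, half occur five times.
tbl <- data.frame(
  base = paste0("base", seq_len(1000)),
  freq = rep(c(1, 5), times = 500),
  stringsAsFactors = FALSE
)

full_size <- as.numeric(object.size(tbl))
pruned    <- prune_to_budget(tbl, max_bytes = full_size - 1)
nrow(pruned)  # 500: the n-grams seen only once were dropped
```

Dropping the rarest n-grams first costs the least accuracy per byte saved, since those entries are the ones least likely to be looked up.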
• Learn the functions available in the quanteda package to build the n-grams.
• Get a better understanding of smoothing techniques.
• Understand how to use gc() to better manage memory issues.
• Build the complete collection of n-grams, pruning if necessary.
• Determine what a good prediction algorithm looks like and build efficiency tests around it.
• Build the prediction algorithm.
• Test and deploy to Shiny.
[Len Greski 1](https://github.com/lgreski/datasciencectacontent/blob/master/markdown/capstone-simplifiedApproach.md)
[Len Greski 2](https://github.com/lgreski/datasciencectacontent/blob/master/markdown/capstone-ngramComputerCapacity.md)
[Wayne Heller](https://github.com/wayneheller/DataScienceSpecializationCapstone/blob/gh-pages/README.md)
[Paul Ringsted](https://www.coursera.org/learn/data-science-project/discussions/weeks/1/threads/IxYRCkNkEemk-w7wPhl4Og)