This report explains my exploratory analysis for the Capstone text prediction project. It describes the data clean-up done to date and sets out my goals for the eventual prediction algorithm and application.

Exploratory data analysis

Raw text files

The analysis uses the three US English text files:

• en_US.twitter.txt: a 163 189 KB file. In R the character vector takes up 319 MB and holds 2 360 148 text lines.

• en_US.news.txt: a 200 989 KB file. In R the character vector takes up 257 MB and holds 1 010 242 text lines.

• en_US.blogs.txt: a 200 235 KB file. In R the character vector takes up 255.4 MB and holds 899 288 text lines.
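Figures like these could be gathered with a short base-R helper. The sketch below is only an illustration: the function name file_summary and the small temporary demo file are placeholders, not part of the project code, and the real files would be passed in by path instead.

```r
# Hypothetical helper: summarise one text file
# (size on disk, size of the character vector in memory, number of lines).
file_summary <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  data.frame(
    file       = basename(path),
    disk_kb    = round(file.size(path) / 1024),
    memory_mb  = round(as.numeric(object.size(lines)) / 1024^2, 1),
    line_count = length(lines)
  )
}

# Demo on a tiny temporary file standing in for one of the en_US files.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
demo_summary <- file_summary(demo_path)
print(demo_summary)
```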

Data clean-up

I explored and cleaned the three text files by removing profane words, all punctuation marks and special characters, all capitalisation, all double spacing and all non-English words.

To compensate for limited available memory, I cleaned each of the 3 files and wrote each cleaned file to a new text file with a clean prefix, releasing the used memory. The cleaned files can then be imported as needed and used for the token generation. For each file I ran a custom-built clean.text() function, which in turn calls 5 functions:

  1. removeBad() Custom-built profanity filter.

  2. str_replace_all() Removes all punctuation marks and special characters using a regular expression.

  3. to_lower() Changes all characters to lower case.

  4. str_squish() Removes all unnecessary spaces.

  5. is.word() Custom-built function that checks each word against the GradyAugmented dictionary and removes any word not in the English dictionary.

Each file lost a few lines: one-word lines whose only word was not part of the English language. Although much of the cleaning could have been done in quanteda, this approach gave me greater control of the use of memory at each step.
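The five clean-up steps can be sketched roughly as follows. This is a base-R approximation, not the actual clean.text() function: gsub() and trimws() stand in for the stringr calls, the two word lists are toy placeholders for the custom profanity list and the GradyAugmented dictionary, and the name clean_text_sketch is made up for the example.

```r
# Toy placeholder word lists (assumptions for this sketch only).
bad_words  <- c("darn", "heck")
dictionary <- c("the", "cat", "sat", "on", "a", "mat", "what", "day")

clean_text_sketch <- function(lines, bad_words, dictionary) {
  # 1. Profanity filter: drop listed words, ignoring case.
  bad_pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
  lines <- gsub(bad_pattern, "", lines, ignore.case = TRUE, perl = TRUE)
  # 2. Replace punctuation, digits and special characters with a space.
  lines <- gsub("[^A-Za-z ]", " ", lines)
  # 3. Lower-case everything.
  lines <- tolower(lines)
  # 4. Collapse repeated spaces and trim the ends.
  lines <- trimws(gsub(" +", " ", lines))
  # 5. Keep only words found in the dictionary.
  keep_known <- function(line) {
    words <- strsplit(line, " ", fixed = TRUE)[[1]]
    paste(words[words %in% dictionary], collapse = " ")
  }
  lines <- vapply(lines, keep_known, character(1), USE.NAMES = FALSE)
  # Lines left empty (for example a single unknown word) are dropped.
  lines[nzchar(lines)]
}

raw     <- c("The CAT sat on a mat!!", "What a darn day...", "Zzxq")
cleaned <- clean_text_sketch(raw, bad_words, dictionary)
print(cleaned)
```

The third demo line contains only a word that is not in the toy dictionary, so it disappears entirely, which is the same effect that cost each file a few lines.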

Cleaned Text files

Within quanteda I did additional cleaning, removing single characters and English stop-words while converting each of the three objects to a document feature matrix to explore each corpus.
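To make the token and type counts below concrete, here is a minimal base-R sketch of the same idea on a toy corpus. It is not the quanteda code used for the report: the function name count_features and the two-word stop list are assumptions for illustration only.

```r
# Hypothetical helper: count tokens, types and word frequencies
# after dropping single characters and stop-words.
count_features <- function(lines, stop_words) {
  tokens <- unlist(strsplit(lines, " ", fixed = TRUE))
  tokens <- tokens[nchar(tokens) > 1 & !(tokens %in% stop_words)]
  freq   <- sort(table(tokens), decreasing = TRUE)
  list(n_tokens = length(tokens), n_types = length(freq), freq = freq)
}

toy_lines  <- c("the cat is a cat", "one cat can run", "just one")
toy_stops  <- c("the", "is")   # stand-in for a full English stop-word list
toy_result <- count_features(toy_lines, toy_stops)

print(toy_result$n_tokens)        # number of remaining words
print(toy_result$n_types)         # number of distinct remaining words
print(head(toy_result$freq, 3))   # most frequent words first
```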

Blogs text

17 811 278 tokens (words) with 16 153 188 types (different words).

Wordcloud of Blogs tokens

The first six words dominate the Blogs vocabulary, including ‘can’, ‘one’, ‘just’, ‘like’ and ‘time’, after which the word counts even out.

Histogram of ten most frequent Blogs tokens

News text

17 934 852 tokens (words) with 16 872 177 types (different words).

Wordcloud of News tokens

The word ‘said’ dominates the News vocabulary. It also appears in the top 120 words for Blogs (36 712) and Twitter (18 156).

Histogram of ten most frequent News tokens

Twitter text

14 661 439 tokens (words) with 14 269 748 types (different words).

Wordcloud of Twitter tokens

Although the top six words have high counts, the Twitter words are much more evenly distributed.

Histogram of ten most frequent Twitter tokens

Two words appear in the top 10 of all three groups: ‘one’ (no. 1 in Blogs, no. 2 in News) and ‘can’ (no. 2 in Blogs and no. 2 in Twitter). ‘Just’ and ‘like’ are both top words for Twitter and Blogs. I expect these 4 words, along with ‘said’, to appear frequently in the n-grams.

Approach to build solution

I’ll use the quanteda package to build the n-grams, from 1-grams to 5-grams, storing the data in a data.table to reference in the Shiny app.
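As a reminder of what an n-gram is, the base-R sketch below slides a window of n words over one tokenized line. It only illustrates the idea and is not how quanteda builds them; the function name make_ngrams is a placeholder. In practice n-grams would be built line by line so that they do not span unrelated sentences.

```r
# Hypothetical helper: all n-grams (as space-separated strings) from one token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))   # line too short for this n
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

tokens   <- c("i", "can", "just", "say", "one", "thing")
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```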

I’ll use SQL with the sqldf package to extract the most frequently occurring predictions and save these into an output data.table for my Shiny app.
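The extraction step amounts to keeping, for each base, the predicted word with the highest count. The sketch below shows that logic in base R rather than SQL so that it runs without extra packages; the table layout (base, predicted, count), the toy counts and the function name top_prediction are assumptions for illustration.

```r
# Toy n-gram counts, already split into base and predicted word (made-up numbers).
ngram_counts <- data.frame(
  base      = c("i can", "i can", "one of", "one of", "one of"),
  predicted = c("not", "see", "the", "my", "those"),
  count     = c(12, 7, 30, 9, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the single most frequent prediction per base.
top_prediction <- function(counts) {
  ordered <- counts[order(counts$base, -counts$count), ]   # highest count first within each base
  best    <- ordered[!duplicated(ordered$base), ]          # first row per base wins
  rownames(best) <- NULL
  best
}

best <- top_prediction(ngram_counts)
print(best)
```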

Limitations to manage

Hardware memory and CPU

I am developing the solution in RStudio on a Windows PC with 4 gigabytes of memory and a 2.00 GHz CPU. With the limited memory and processing power (the advice on the discussion forum is to have 16 gigabytes of memory available), I plan to follow an iterative approach to my high-level steps:

  1. Clean up the incoming documents and break them up into units

  2. For each unit: build a corpus, tokenize the corpus, generate the n-grams

  3. Assemble the subcomponents by n-gram size and then break the n-grams into base and predicted words

  4. Aggregate to summarise each n-gram file into frequencies by base
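The four steps above can be sketched on a toy example, restricted to bigrams to keep it short. This is a base-R illustration under stated assumptions, not the planned quanteda pipeline: the function name count_bigrams_in_unit, the unit size of two lines and the sample text are all made up, and in a real run each unit's result would be written to disk rather than held in a list.

```r
# Step 2 for one unit: split each line into (base, predicted) bigram pairs and count them.
# Assumes at least one line in the unit has two or more words.
count_bigrams_in_unit <- function(lines) {
  pairs <- lapply(strsplit(lines, " ", fixed = TRUE), function(words) {
    if (length(words) < 2) return(NULL)
    data.frame(base = head(words, -1), predicted = tail(words, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  pairs$count <- 1
  aggregate(count ~ base + predicted, data = pairs, FUN = sum)
}

# Step 1: cleaned lines broken up into units of two lines each.
lines <- c("one can just say", "one can like it",
           "just like one can", "one can just go")
unit_size <- 2
units <- split(lines, ceiling(seq_along(lines) / unit_size))

# Step 2: process one unit at a time.
per_unit <- lapply(units, count_bigrams_in_unit)

# Step 3: assemble the per-unit results.
combined <- do.call(rbind, per_unit)

# Step 4: aggregate into total frequencies per base and predicted word.
totals <- aggregate(count ~ base + predicted, data = combined, FUN = sum)
print(totals[order(-totals$count), ])
```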

Memory use in R

I’ll measure and manage the use of memory – guided by references like: http://adv-r.had.co.nz/memory.html
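A minimal sketch of the kind of measurement I have in mind, using only base R (the linked reference also covers helper packages, which are not used here). The object big_vector is just a stand-in for one of the large text vectors.

```r
# Stand-in for a large character vector of text lines.
big_vector <- rep("some example text line", 1e5)

# Measure how much memory the object takes up.
size_mb <- as.numeric(object.size(big_vector)) / 1024^2
print(format(object.size(big_vector), units = "MB"))

# Release the memory once the object is no longer needed.
rm(big_vector)
invisible(gc())   # ask R to run garbage collection
```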

shinyapps.io account limitations

The free shinyapps.io account I’ll be using will limit the app to 1 GB. If I run out of space, I will most likely remove the lower-frequency n-grams. When building the application, I’ll keep in mind the limits of the shinyapps.io account (load time, memory, response time) and trade them off against accuracy.
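One possible way to remove lower-frequency n-grams is to raise a minimum-count threshold until the table fits a size budget. The sketch below is only an assumption about how that could look: the function name prune_to_budget, the demo table and the budget of half the original size are all illustrative.

```r
# Hypothetical helper: raise the minimum count until the table fits the byte budget.
prune_to_budget <- function(counts, max_bytes) {
  threshold <- 1
  while (nrow(counts) > 0 && as.numeric(object.size(counts)) > max_bytes) {
    threshold <- threshold + 1
    counts <- counts[counts$count >= threshold, , drop = FALSE]
  }
  counts
}

# Demo table: 1000 made-up n-grams with counts from 1 to 10.
demo_counts <- data.frame(
  ngram = paste("w", seq_len(1000)),
  count = rep(1:10, each = 100),
  stringsAsFactors = FALSE
)

budget <- as.numeric(object.size(demo_counts)) / 2   # illustrative budget: half the original size
pruned <- prune_to_budget(demo_counts, budget)
print(nrow(pruned))
print(min(pruned$count))
```

The trade-off is the one noted above: every increase in the threshold saves space but removes rare n-grams that might have produced a correct prediction.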

Next steps