A1. Background

“In this capstone we will be applying data science in the area of natural language processing. As a first step toward working on this project, you should familiarize yourself with Natural Language Processing, Text Mining, and the associated tools in R. Here are some resources that may be helpful to you.”

Week 1 - Task 0

Tasks to accomplish:

1. Obtaining the data - Can you download the data and load/manipulate it in R?
2. Familiarizing yourself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process you have learned in the Data Science Specialization.

Week 1 - Task 1

Tasks to accomplish:

1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
2. Profanity filtering - removing profanity and other words you do not want to predict.

Week 2 - Task 2 “Exploratory Data Analysis”

“The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.”

Tasks to accomplish:

1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Week 2 - Task 3 “Modeling”

1. Build basic n-gram model - using the exploratory analysis you performed, build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
2. Build a model to handle unseen n-grams - in some cases people will want to type a combination of words that does not appear in the corpora. Build a model to handle cases where a particular n-gram isn’t observed.

A2. Resources used:

————————————————————————————-

B. Basic Environment Setup

B1. Loading libraries
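The library chunk is not echoed in this rendering. A minimal sketch of the packages implied by the functions used later in this report; the exact set is an assumption and may differ from the original chunk:

# Assumed package set, inferred from the functions called below
library(tm)           # Corpus(), tm_map(), TermDocumentMatrix(), removeWords()
library(SnowballC)    # stemming backend for stemDocument()
library(ngram)        # ngram(), get.phrasetable(), concatenate(), babble()
library(wordcloud)    # word cloud in Step E4
library(RColorBrewer) # color palette for the word cloud
library(ggplot2)      # frequency bar plots and the multi-panel figure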

B2. Load the cookbook-r.com ggplot2 helper function for multi-panel plots

————————————————————————————-

C. Data Mining

Step C1: File and Data Setup

1.1 Set working directory, file paths, and download the data.

1.2 Read raw training and profane table data.
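The setup chunk is not echoed in this rendering. A minimal sketch of steps 1.1 and 1.2, assuming the course zip archive and a local profanity word list; the URL, file names, paths, and object names below are illustrative placeholders:

# 1.1 Working directory, file paths, and download (URL and paths are placeholders)
setwd("~/capstone")
dataUrl <- "https://example.com/Coursera-SwiftKey.zip"        # assumed download location
if (!file.exists("Coursera-SwiftKey.zip")) {
        download.file(dataUrl, destfile = "Coursera-SwiftKey.zip")
        unzip("Coursera-SwiftKey.zip")
}

# 1.2 Read the three raw training files and the profanity table
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8")
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
profane <- read.csv("profanity.csv", header = FALSE, stringsAsFactors = FALSE)   # assumed file name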

Step C2: Raw Data Exploration

2.1 Evaluate object.size in MB, obtain file summary (length, class, mode), and inspect the first line in the text files:
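The summary chunk is not echoed here. A minimal sketch of how these figures might be produced from the objects read in Step C1 (names are illustrative):

# Object sizes in MB
sizesMB <- c(NewsMB    = as.numeric(object.size(news)),
             BlogsMB   = as.numeric(object.size(blogs)),
             TwitterMB = as.numeric(object.size(twitter))) / 1024^2
c(sizesMB, TotalMB = sum(sizesMB))

# Length, class, and mode of each character vector
list(NewsSummary = summary(news), BlogsSummary = summary(blogs), TwitterSummary = summary(twitter))

# First line of each file
list(NewsSample = news[1], BlogsSample = blogs[1], TwitterSample = twitter[1])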

##    NewsMB   BlogsMB TwitterMB   TotalMB 
##  19.17972 248.49350 301.39670 569.06992
## $NewsSummary
##    Length     Class      Mode 
##     77259 character character 
## 
## $BlogsSummary
##    Length     Class      Mode 
##    899288 character character 
## 
## $TwitterSummary
##    Length     Class      Mode 
##   2360148 character character
## $NewsSample
## [1] "He wasn't home alone, apparently."
## 
## $BlogsSample
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."
## 
## $TwitterSample
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."

2.2 Number of lines per file:

##       news      blogs    twitter totalLines 
##      77259     899288    2360148    3336695

2.3 Number of words per file:

(note: this step takes a few minutes)
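The counting chunk is not echoed here. A hedged sketch, assuming the stringi package's stri_count_words(); summing the lengths of strsplit() on whitespace would be an alternative:

library(stringi)
wordCounts <- c(newsWords    = sum(stri_count_words(news)),
                blogWords    = sum(stri_count_words(blogs)),
                twitterWords = sum(stri_count_words(twitter)))
c(wordCounts, totalWords = sum(wordCounts))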

##    newsWords    blogWords twitterWords   totalWords 
##      2643969     37334131     30373543     70351643

Step C3: Create a single reduced raw data file and remove non-ASCII characters

Reduce the raw data to 3000 lines in total, i.e. 1000 lines each from news, blogs, and twitter, and then confirm the line count of 1000 lines per object.
By removing the non-ASCII characters we can reasonably assume that the remaining words are largely English. However, since ASCII primarily represents English letters and numbers while other European languages use the same Latin letters, this cleanup is not a comprehensive foreign-language filter.
See attached references:
https://docs.oracle.com/cd/B19306_01/server.102/b14225/ch2charset.htm
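The truncation chunk is not echoed here. A minimal sketch, assuming news, blogs, and twitter hold the raw lines read in Step C1; whether the first 1000 lines or a random sample was taken is not shown, so the first 1000 lines are used for illustration:

# Keep 1000 lines from each source
news_trunc    <- head(news,    1000)
blogs_trunc   <- head(blogs,   1000)
twitter_trunc <- head(twitter, 1000)

# Confirm the reduced line counts
reduced <- c(newsReduced    = length(news_trunc),
             blogsReduced   = length(blogs_trunc),
             twitterReduced = length(twitter_trunc))
c(reduced, totalReduced = sum(reduced))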
##    newsReduced   blogsReduced twitterReduced   totalReduced 
##           1000           1000           1000           3000

Create a single combined truncated data object and remove any non-ASCII characters before creating the corpus.

# Combine the three truncated sources and strip any remaining non-ASCII characters
comb_trunc <- c(news_trunc, blogs_trunc, twitter_trunc)
comb_trunc <- iconv(comb_trunc, "latin1", "ASCII", sub = "")

Step C4: Create Corpus

Load the data as a corpus and inspect lines 4 to 6, then clean up the memory. Ask R to return memory to the operating system while releasing objects that are no longer in use (only the first “memory clearing” instance is shown below; instances further down in the code are hidden).
See interesting section on memory leaks: http://adv-r.had.co.nz/memory.html
combCorpus <- Corpus(VectorSource(comb_trunc))   # load the combined text as a corpus
inspect(combCorpus[4:6])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] When you look at the internal data on international tests you find a telling dichotomy regarding American scores. America has a disproportional number of students doing very, very well and a large portion doing poorly. The Organization for Economic Co-operation and Development, which administers the PISA test, explains this dichotomy: the number of children in poverty. Twenty-two percent of American children live in mind-numbing poverty, which is far greater than the next highest country. Children do not learn well when they are hungry.
## [2] Another strong month of hiring makes it less likely that the Federal Reserve will take additional steps to boost the economy at its meeting next week.                                                                                                                                                                                                                                                                                                                                                                                                        
## [3] "I doubt you're ever going to see me kicking dirt, throwing bases, that kind of stuff," Matheny said after the Cardinals' 2-1 loss to Washington. "I don't think it's going to happen, but I don't know. I've lost it a couple times (as a player). ... Mostly, in spring training you don't see that."

Object memory cleared and memory released:

rm(news_trunc, blogs_trunc, twitter_trunc, comb_trunc)
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4983215 266.2   14442815  771.4  11669289  623.3
## Vcells 66068555 504.1  208068404 1587.5 173323665 1322.4

Step C5: Text Transformation

For this assignment, “cleaning up the text data” follows an uncondensed “Stepwise Text Transformation” so each step can be inspected.
The memory for each of the individually created data objects is cleared as soon as it is no longer needed.

5.1 Removing special characters:

# Replace "/" and "@" characters with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
combCorpusS <- tm_map(combCorpus, toSpace, "/")
combCorpusS <- tm_map(combCorpusS, toSpace, "@")

5.2 Convert the text to lower case:

combCorpusSL <- tm_map(combCorpusS, content_transformer(tolower))
#One can inspect the corpus objects
#inspect(combCorpusSL)

5.3 Remove numbers:

combCorpusSLN <- tm_map(combCorpusSL, content_transformer(removeNumbers))

5.4 Remove English common stop words:

combCorpusSLNS <- tm_map(combCorpusSLN, removeWords, stopwords("english"))

5.5 Remove profane words:

names(profane) <- "profane"                      # name the single column of the profanity table
combCorpusSLNSP <- tm_map(combCorpusSLNS, removeWords, profane[,1])

5.6 Remove punctuations:

combCorpusSLNSPP <- tm_map(combCorpusSLNSP, content_transformer(removePunctuation))

5.7 Eliminate extra white spaces:

combCorpusSLNSPPW <- tm_map(combCorpusSLNSPP, content_transformer(stripWhitespace))

5.8 Text stemming:

Stemming removes suffixes from words to reduce them to a common root.
It is a way to increase word coverage per unit of memory, i.e. it keeps memory use lower.
combCorpusSLNSPPWSt <- tm_map(combCorpusSLNSPPW, content_transformer(stemDocument))

5.9 Write CLEANED corpus to disk (still includes sparse terms)

Optional step.
#writeCorpus(combCorpusSLNSPPWSt)

————————————————————————————-

D. Tokenize using ngram package

Step D1. Create a string of the cleaned corpus

Perform tokenization to obtain the one-word (uni-), two-word (bi-), three-word (tri-), and four-word (tetra-gram) combinations that appear frequently.
strCorpus <- concatenate(lapply(combCorpusSLNSPPWSt, "[", 1))
        ng1 <- ngram(strCorpus, n=1)
        ng2 <- ngram(strCorpus, n=2)
        ng3 <- ngram(strCorpus, n=3)
        ng4 <- ngram(strCorpus, n=4)

Step D2. Inspect strCorpus data

Take a look at the first five entries in each of the n-grams generated.
Other helper functions that can be used to inspect the n-grams include:
(i) print(ng2, output="full") and get.phrasetable(ng2);
(ii) babble(ng=ng2, genlen=12, seed=101562) to generate new sentences.
head(get.phrasetable(ngram(strCorpus, n=1)), 5)
##   ngrams freq        prop
## 1  said   304 0.006414585
## 2  like   280 0.005908170
## 3   one   268 0.005654963
## 4  will   266 0.005612762
## 5  just   251 0.005296253
head(get.phrasetable(ngram(strCorpus, n=2)), 5)
##        ngrams freq         prop
## 1   new york    26 0.0005486274
## 2  last year    22 0.0004642232
## 3 last night    16 0.0003376168
## 4  right now    15 0.0003165158
## 5  feel like    14 0.0002954147
head(get.phrasetable(ngram(strCorpus, n=3)), 5)
##               ngrams freq         prop
## 1     cinco de mayo     5 1.055075e-04
## 2 cricket world cup     4 8.440599e-05
## 3   osama bin laden     4 8.440599e-05
## 4     new york citi     4 8.440599e-05
## 5    want make sure     3 6.330449e-05
head(get.phrasetable(ngram(strCorpus, n=4)), 5)
##                        ngrams freq         prop
## 1      cricket world cup dvd     3 6.330583e-05
## 2  overal hire remain strong     2 4.220389e-05
## 3 littl stage puppet theater     2 4.220389e-05
## 4    roman cathol code canon     2 4.220389e-05
## 5        done unto us believ     2 4.220389e-05

Step D3. Extract the top 20 uni-, bi-, tri-, and tetra-grams for visual evaluation and plot

Make a multi-panel plot using the helper function loaded from cookbook-r.com (see B2).
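The plotting chunk is not reproduced here. A minimal sketch of how the top 20 unigrams might be extracted and plotted; the same pattern applies to ng2, ng3, and ng4, and the cookbook-r.com helper is assumed to be the multiplot() function loaded in B2:

# Top 20 unigrams from the phrasetable (repeat for ng2, ng3, ng4)
top1 <- head(get.phrasetable(ng1), 20)
p1 <- ggplot(top1, aes(x = reorder(ngrams, freq), y = freq)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        labs(x = "Unigram", y = "Frequency", title = "Top 20 unigrams")

# Combine the four panels on one page with the cookbook-r.com helper
# multiplot(p1, p2, p3, p4, cols = 2)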

Step D4. Word Coverage

Determine the number of words needed in a frequency-sorted dictionary to cover 50% and 90% of all word instances in the dictionary.
Using the string corpus set up above, a function is developed to calculate the requested coverages in the unigram.
The number of words required grows steeply: roughly an order of magnitude more words are needed for 90% coverage than for 50% coverage (5092 vs. 598 words, see below).
Word coverage per available memory can be improved through stemming, removing redundant sparse terms, and potentially dropping two-letter words.
#Setup the function to help extract the word coverage information
getCoverage <- function(unigram, coverage) {
        #start with zero frequency
        frequency <- 0
        #determine the target coverage frequency from the unigram
        coverageFrequency <- coverage * sum(unigram$freq)
        #the unigram phrasetable already comes sorted by frequency
        for (i in 1:nrow(unigram)) {
                if (frequency >= coverageFrequency) {
                        return(i)
                }
                frequency <- frequency + unigram[i, "freq"]
        }
        return(nrow(unigram))
}

#Setup the unigram dictionary with column names from string corpus
unigram <- get.phrasetable(ngram(strCorpus, n=1))
#Request the coverages
getCoverage(unigram, coverage= 0.5)
## [1] 598
getCoverage(unigram, coverage= 0.9)
## [1] 5092

————————————————————————————-

E. Term Document Matrix (tdm)

Step E1. Build the matrix

Use TermDocumentMatrix (or DocumentTermMatrix) depending on whether you want terms as rows and documents as columns, or vice versa.

Step E2. Inspect the tdm

Inspect the tdm dimensions, evaluate the first ten terms of the tdm, and find the terms that appear at least 100 times.
There are several other ways to inspect the corpus and the term-document matrix:
(i) View(as.matrix(dtm[1:1000, 1:5])), inspect(dtm[1:2]) and meta(dtm[[2]], "id");
(ii) identical(dtm[[2]], dtm[["dtm.txt"]]), inspect(dtm[[2]]) and lapply(dtm[1:2], as.character)
dtm <- TermDocumentMatrix(combCorpusSLNSPPWSt)   # despite the object name, terms are rows and documents are columns
dim(dtm)
## [1] 9626 3000
inspect(dtm[1:10,])
## <<TermDocumentMatrix (terms: 10, documents: 3000)>>
## Non-/sparse entries: 144/29856
## Sparsity           : 100%
## Maximal term length: 8
## Weighting          : term frequency (tf)
## Sample             :
##           Docs
## Terms      1 1428 1940 310 339 398 543 751 767 781
##   aggress  1    0    0   0   0   0   0   0   0   0
##   bank     2    0    0   4   0   0   0   1   0   1
##   confer   1    0    0   0   0   0   0   0   0   0
##   debtthat 1    0    0   0   0   0   0   0   0   0
##   eas      1    0    0   0   0   0   0   0   0   0
##   ecb      1    0    0   0   0   0   0   0   0   0
##   european 1    0    0   0   0   0   0   0   1   0
##   expans   1    0    0   0   0   0   0   0   0   0
##   govern   1    0    0   0   3   0   2   1   1   1
##   help     1    3    3   0   0   2   0   0   2   2
findFreqTerms(dtm, 100)    #also use: findAssocs(dtm, "word to associate with", 0.8)
##  [1] "said"  "new"   "look"  "like"  "make"  "week"  "will"  "know" 
##  [9] "see"   "think" "time"  "good"  "just"  "can"   "get"   "year" 
## [17] "work"  "say"   "also"  "much"  "day"   "use"   "way"   "two"  
## [25] "need"  "come"  "peopl" "one"   "now"   "want"  "right" "first"
## [33] "last"  "love"

Step E3. Evaluate sparsity and remove sparse terms

It appears that manipulation of sparse terms may in some instances affect how the term document matrix responds to the word cloud analysis, so removal of sparse terms was not included for the moment.
#Remove terms that are more than 95% sparse (i.e. absent from more than 95% of the documents)
#dtm <- removeSparseTerms(dtm, 0.95)
#str(dtm)                                 #also use: inspect(removeSparseTerms(dtm, 0.95))

Step E4. Evaluate the term document matrix

4.1 Plotting Word Cloud

Create a suitable word-frequency matrix, inspect the data, and create a word cloud representing the 120 most frequently used words in the term document matrix.
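The word cloud chunk is not echoed here. A minimal sketch of the usual construction, assuming the wordcloud package; object names are illustrative:

# Word frequencies from the term document matrix, sorted in decreasing order
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)

# Word cloud of the 120 most frequent words
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, max.words = 120,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))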
##      word freq
## said said  304
## like like  280
## one   one  268
## will will  266
## just just  251
## time time  236
## get   get  230
## day   day  207
## year year  196
## can   can  195

4.2 Word Frequency analysis

Expand the evaluation to include a bar plot of all words appearing more than 100 times in the term document matrix. This plot in essence shows the same most frequent terms as were depicted above for the corpus unigrams.
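The bar plot chunk is not reproduced here. A hedged sketch using ggplot2 and the frequency data frame d built in the word cloud sketch above:

# Bar plot of all words with frequency greater than 100
dFreq <- subset(d, freq > 100)
ggplot(dFreq, aes(x = reorder(word, freq), y = freq)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        labs(x = "Word", y = "Frequency", title = "Words appearing more than 100 times")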

————————————————————————————-

F. Model Development - Markov chain algorithm

The plan going forward.

Basically, the plan is to extract the individual words from the bi-, tri-, and tetra-grams as individually accessible entities that can be stored, searched, and retrieved on request via a very basic Markov-chain-like algorithm.
For instance, when the user types the first word in the Shiny App, the word is first tested against the uni-gram to determine whether it is in the n-gram database. If it is absent, no predicted choice is offered.
If the word is present in the uni-gram database, it is then tested against the first-word entries in the tetra-gram. The search starts in the tetra-gram because a match there returns the greatest predicted word coverage, i.e. the three words following the first word of the tetra-gram.
If there is no corresponding first word in the tetra-gram, the search backs off to the first word in the tri-gram database, and if the word is not found there, to the bi-gram database.
Whichever n-gram the first word matches, the corresponding second, third, and fourth words are offered to the user as options to pick from. If the user picks a word, the search process described above is repeated on the chosen word; if the user instead types a new word, the same process is repeated on the new entry. A minimal sketch of this lookup is given below.
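A minimal sketch of the back-off lookup described above, assuming the phrasetables produced by get.phrasetable() in Step D (their ngrams column is a space-separated string, so the leading word can be matched with a regular expression); the function name, arguments, and defaults are illustrative, not the final Shiny implementation:

# Return the highest-frequency continuations for `word`, trying the
# tetra-gram table first, then the tri-gram, then the bi-gram table.
predictNextWords <- function(word, ng1tab, ng2tab, ng3tab, ng4tab, n = 3) {
        word <- tolower(word)
        # Step 1: is the word in the uni-gram database at all?
        if (!any(trimws(ng1tab$ngrams) == word)) {
                return(character(0))             # unknown word: no prediction offered
        }
        # Step 2: back off from the tetra-gram to the tri-gram to the bi-gram
        pattern <- paste0("^", word, " ")
        for (tab in list(ng4tab, ng3tab, ng2tab)) {
                hits <- tab[grepl(pattern, tab$ngrams), ]
                if (nrow(hits) > 0) {
                        hits <- head(hits[order(-hits$freq), ], n)
                        # strip the leading word, keep the predicted continuation(s)
                        return(trimws(sub(pattern, "", hits$ngrams)))
                }
        }
        character(0)                             # seen in the uni-gram but never as an n-gram start
}

# Example usage with the phrasetables built in Step D
# predictNextWords("new", get.phrasetable(ng1), get.phrasetable(ng2),
#                  get.phrasetable(ng3), get.phrasetable(ng4))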

————————————————————————————-

G. APPENDIX

Assignment check list and answers to some of the more specific assignment questions.

________________________________________________________________________________
Week 1 Task 0:

______________

1. What do the data look like?

[done]

2. Where do the data come from?

[done]

3. Can you think of any other data sources that might help you in this project?

[done - see comment below]
[topic-specific sources, e.g. technical publications]

4. What are the common steps in natural language processing?

[done, applied in sections of code above]

5. What are some common issues in the analysis of text data?

[done, awkward non-ASCII characters, multiple white spaces]

6. What is the relationship between NLP and the concepts you have learned in the Specialization?

[done, very much so: untidy data that needs processing before you can start working]

________________________________________________________________________________
Week 1 Task 1:

______________

1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Create function.

[done]

2. Profanity filtering - removing profanity and other words you do not want to predict.

[done]

________________________________________________________________________________
Week 2 Task 2:

______________

1. Some words are more frequent than others - what are the distributions of word frequencies?

[done]

2. What are the frequencies of 2-grams and 3-grams in the dataset?

[done]

3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

[done]

4. How do you evaluate how many of the words come from foreign languages?

[done]

5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

[done - stemming words, removing redundant sparse terms, and maybe dropping two-letter words may be ways of increasing the coverage per available memory]

________________________________________________________________________________
Week 2 Task 3:

______________

1. How can you efficiently store an n-gram model (think Markov Chains)?

[in process of development]

2. How can you use the knowledge about word frequencies to make your model smaller and more efficient?

[done. Using bi-, tri-, and tetra-grams one can extract the words and word combinations that have a higher appearance frequency.]

3. How many parameters do you need (i.e. how big is n in your n-gram model)?

[significantly reduced from the 3000 lines we started with, but not finalized yet.]

4. Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data)?

[have not given consideration yet]

5. How do you evaluate whether your model is any good?

[have not given consideration yet]

6. How can you use backoff models to estimate the probability of unobserved n-grams?

[have not given consideration yet]