This report documents the work carried out so far toward the prediction algorithm for the Data Science Capstone project.
It explains the exploratory analysis and the goals for the eventual app and algorithm.
The data sets are large, which has been more challenging than expected, but the data is now cleaned.
The remaining work is on prediction and will be based on an n-gram model with a backoff method.
All code used in the development of the project is contained in this report. The main body provides the essential discussion of the product development story by summarizing key aspects of the analysis and model building.
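As a preview of the planned approach, the following is a minimal sketch of a backoff lookup over n-gram frequency tables. The table names (tritoksort, bitoksort, singlesort) refer to the frequency data frames built later in this report, but the helper function predictNextWord and its simple highest-frequency scoring are illustrative assumptions, not the final algorithm.
# Illustrative sketch only: back off from trigrams to bigrams to unigrams.
# Assumes the frequency data frames built later in this report:
#   tritoksort: columns tritok ("w1 w2 w3") and Freq
#   bitoksort:  columns bitok  ("w1 w2")    and Freq
#   singlesort: columns singletok (one word) and Freq
predictNextWord <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  # 1. Trigram level: match the last two words of the phrase as a prefix
  if (n >= 2) {
    tri  <- as.character(tritoksort$tritok)
    hits <- which(startsWith(tri, paste0(words[n - 1], " ", words[n], " ")))
    if (length(hits) > 0) {
      best <- tri[hits[which.max(tritoksort$Freq[hits])]]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # 2. Bigram level: match the last word as a prefix
  if (n >= 1) {
    bi   <- as.character(bitoksort$bitok)
    hits <- which(startsWith(bi, paste0(words[n], " ")))
    if (length(hits) > 0) {
      best <- bi[hits[which.max(bitoksort$Freq[hits])]]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # 3. Unigram level: fall back to the single most frequent word
  as.character(singlesort$singletok[which.max(singlesort$Freq)])
}
# Example call (the phrase is hypothetical):
# predictNextWord("thanks for the")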
library(ggplot2)
library(magrittr)
library(stringi)
library(NLP)
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## 
## Attaching package: 'qdapRegex'
## The following object is masked from 'package:ggplot2':
## 
##     %+%
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following object is masked from 'package:NLP':
## 
##     ngrams
## The following object is masked from 'package:magrittr':
## 
##     %>%
## The following object is masked from 'package:base':
## 
##     Filter
library(tm)
## 
## Attaching package: 'tm'
## The following objects are masked from 'package:qdap':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
# Length of the longest line in the blogs file
longBlogs <- stri_length(lineBlogs)
max(longBlogs)
## [1] 40833
#Word "love"
loveTwitter<-grep("love",lineTwitter)
length(loveTwitter)
## [1] 90956
#Word "hate"
hateTwitter<-grep("hate",lineTwitter)
length(hateTwitter)
## [1] 22138
#Word "biostats"
biostatsTwitter<-grep("biostats",lineTwitter)
lineTwitter[biostatsTwitter]
## [1] "i know how you feel.. i have biostats on tuesday and i have yet to study =/"
# Tweets containing the exact chess/kickboxing sentence
sentenceTwitter <- grep("A computer once beat me at chess, but it was no match for me at kickboxing", lineTwitter)
length(sentenceTwitter)
## [1] 3
library(tm)
library(wordcloud)
# Convert to ASCII; characters that cannot be converted are replaced with byte codes
cleanedBlogs   <- iconv(lineBlogs,   'UTF-8', 'ASCII', "byte")
cleanedTwitter <- iconv(lineTwitter, 'UTF-8', 'ASCII', "byte")
cleanedNews    <- iconv(lineNews,    'UTF-8', 'ASCII', "byte")
# Sample 0.5% of each source to keep memory use manageable
twitterSample <- sample(cleanedTwitter, .005 * length(cleanedTwitter))
blogsSample   <- sample(cleanedBlogs,   .005 * length(cleanedBlogs))
newsSample    <- sample(cleanedNews,    .005 * length(cleanedNews))
#Create combined source
#Totalsample <- c(twitterSample,blogsSample,newsSample)
#rm(lineBlogs,lineTwitter,lineNews, lineBlogs,cleanedTwitter, cleanedNews,twitterSample,blogsSample,newsSample)
doc.corpus1 <- VCorpus(VectorSource(twitterSample))
doc.corpus2 <- VCorpus(VectorSource(blogsSample))
doc.corpus3 <- VCorpus(VectorSource(newsSample))
rm(lineBlogs,lineTwitter,lineNews,cleanedTwitter, cleanedNews,cleanedBlogs,twitterSample,blogsSample,newsSample)
#Convert to lower case
doc.corpus1<- tm_map(doc.corpus1, tolower)
doc.corpus2<- tm_map(doc.corpus2, tolower)
doc.corpus3<- tm_map(doc.corpus3, tolower)
#Remove all punctuation
doc.corpus1<- tm_map(doc.corpus1, removePunctuation)
doc.corpus2<- tm_map(doc.corpus2, removePunctuation)
doc.corpus3<- tm_map(doc.corpus3, removePunctuation)
#Remove all numbers
doc.corpus1<- tm_map(doc.corpus1, removeNumbers)
doc.corpus2<- tm_map(doc.corpus2, removeNumbers)
doc.corpus3<- tm_map(doc.corpus3, removeNumbers)
##Remove whitespace
doc.corpus1 <- tm_map(doc.corpus1, stripWhitespace)
doc.corpus2 <- tm_map(doc.corpus2, stripWhitespace)
doc.corpus3 <- tm_map(doc.corpus3, stripWhitespace)
##Force everything back to plaintext document
doc.corpus1 <- tm_map(doc.corpus1, PlainTextDocument)
doc.corpus2 <- tm_map(doc.corpus2, PlainTextDocument)
doc.corpus3 <- tm_map(doc.corpus3, PlainTextDocument)
doc.corpus <- c(doc.corpus1,doc.corpus2,doc.corpus3)
# Word cloud of the most frequent terms in the Twitter sample
wordcloud(doc.corpus1, max.words = 200, random.order = FALSE, rot.per = 0.35,
          use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
library(tm)
library(RWeka)
library(openNLP)
library(NLP)
library(qdap)
# Read the profanity list and remove those words from the corpus
badwords <- readLines("ProfaneWord.txt", warn = FALSE)
doc.corpus <- tm_map(doc.corpus, removeWords, badwords)
##
## Cleaning steps refined over time
## Accomplished in three loops due to RAM limitations
##
for (j in seq(doc.corpus)) {
  # The first two substitutions separate hyphenated and slashed words
  doc.corpus[[j]][[1]] <- gsub("-", " ", doc.corpus[[j]][[1]])
  doc.corpus[[j]][[1]] <- gsub("/", " ", doc.corpus[[j]][[1]])
  # Convert the symbol <> into an apostrophe
  doc.corpus[[j]][[1]] <- gsub("<>", "'", doc.corpus[[j]][[1]])
  # These three create end-of-sentence markers <EOS>
  doc.corpus[[j]][[1]] <- gsub("\\. |\\.$", " <EOS> ", doc.corpus[[j]][[1]])
  doc.corpus[[j]][[1]] <- gsub("\\? |\\?$", " <EOS> ", doc.corpus[[j]][[1]])
  doc.corpus[[j]][[1]] <- gsub("\\! |\\!$", " <EOS> ", doc.corpus[[j]][[1]])
}
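To make the effect of the loop concrete, here is the same end-of-sentence substitution applied to a short made-up string (the example text is hypothetical, not taken from the corpus):
# Hypothetical example of the <EOS> marking performed in the loop above
example <- "this is great. is it not? yes it is!"
example <- gsub("\\. |\\.$", " <EOS> ", example)
example <- gsub("\\? |\\?$", " <EOS> ", example)
example <- gsub("\\! |\\!$", " <EOS> ", example)
# Result: "this is great <EOS> is it not <EOS> yes it is <EOS> "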
cleanData<-data.frame(text=unlist(sapply(doc.corpus, `[`, "content")), stringsAsFactors=F)
# Tokenize the cleaned text into unigrams, bigrams, and trigrams
singletok <- NGramTokenizer(cleanData, Weka_control(min = 1, max = 1))
bitok <- NGramTokenizer(cleanData, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritok <- NGramTokenizer(cleanData, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
single <- data.frame(table(singletok))
bitoke <- data.frame(table(bitok))
tritoke <- data.frame(table(tritok))
singlesort <- single[order(single$Freq,decreasing = TRUE),]
bitoksort <- bitoke[order(bitoke$Freq,decreasing = TRUE),]
tritoksort <- tritoke[order(tritoke$Freq,decreasing = TRUE),]
singleFrec <- singlesort[1:20,]
colnames(singleFrec) <- c("Word","Frequency")
bitoksortFrec<- bitoksort[1:20,]
colnames(bitoksortFrec) <- c("Word","Frequency")
tritoksortFrec <- tritoksort[1:20,]
colnames(tritoksortFrec) <- c("Word","Frequency")
The 20 most frequent single words, with bars shown in alphabetical order.
library(ggplot2)
ggplot(singleFrec, aes(x = Word, y = Frequency)) +
  geom_bar(stat = "identity", fill = "red", colour = "pink") +
  geom_text(aes(label = Frequency), vjust = -0.1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("ngram.size = 1 \n")
The 20 most frequent bi-grams, with bars shown in alphabetical order.
ggplot(bitoksortFrec, aes(x = Word, y = Frequency)) +
  geom_bar(stat = "identity", fill = "blue", colour = "green") +
  geom_text(aes(label = Frequency), vjust = -0.1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("ngram.size = 2 \n")
The 20 most frequent tri-grams, with bars shown in alphabetical order.
ggplot(tritoksortFrec, aes(x = Word, y = Frequency)) +
  geom_bar(stat = "identity", fill = "yellow", colour = "black") +
  geom_text(aes(label = Frequency), vjust = -0.1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("ngram.size = 3 \n")
Additional Exploratory Analysis
Clean data is a necessary, but not sufficient, condition for developing a prediction algorithm. This step is about understanding: understanding the relationships between words and sentences, and other observable characteristics useful for setting expectations during model development. Highlights of the many hours of exploration include an understanding of the relationship between vocabulary size and unique words, the distributions of various N-grams, and information that helped reevaluate the original strategy after the literature review. Table 1 provides statistics on the full corpora. It shows the total number of documents in each genre of the corpus. These values also represent the total line counts, because the loading method treated each article as a single line regardless of length. For instance, the longest line is a blog entry of 40,833 characters.
*Table 1: Characterizing the Corpora by Word Count, Type, Ratios, Diversity*
Source | Documents | Vocabulary (V) | Word Types (T) | TTR (T/V) | Diversity
------ | --------- | -------------- | -------------- | --------- | ---------
Blog | 899,288 | 37,334,131 | 1,103,548 | 0.030 | 127.71
News | 77,259 | 2,643,969 | 197,858 | 0.075 | 86.04
Tweets | 2,360,148 | 30,373,543 | 1,290,170 | 0.042 | 165.53
Corpus | 3,336,695 | 70,351,643 | 2,123,809 | 0.030 | 179.04
Table 1 also shows the total vocabulary (V), which equals the total number of word tokens in each genre. Just over half of the complete corpus is composed of blog posts. Word types (T) are the number of unique words within the vocabulary. The Type/Token Ratio (TTR) is a well-known measure for comparing language samples; it is simply the number of word types divided by the vocabulary size (Richards, 1987). The TTR indicates complexity, with higher values indicating a more complex genre. Tweets are the most complex because they use more unique word types relative to the size of their vocabulary. This measure supported the decision to limit the model's data to blog entries only: news articles tend to be repetitive, non-conversational language, and tweets are a language unto themselves with many "created" words.
Diversity is also provided in Table 1. According to Richards (1987), it is a more useful measure because TTR tends to fall as a function of growing vocabulary alone. Diversity is defined as "A measure of vocabulary diversity that is approximately independent of sample size is the number of different words divided by the square root of twice the number of words in the sample" (Richards, p. 208). It is robust, positively correlated with the number of tokens, and flattens out at some point. Table 2 shows the effect on diversity as the size of the vocabulary (number of documents) increases. There is a relative flattening at 60 percent of the total documents. This is in line with the common technique of separating data into a 60 percent training set, a 20 percent test set, and a 20 percent validation set, and it helped validate the selection of a training set composed of 60 percent of all blogs in the dataset. Notice in Table 3 that the widely used Type/Token Ratio shows similar complexity for the blogs alone and for the entire corpora of blogs, news, and tweets.
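For concreteness, both measures can be computed directly from a vector of word tokens. The sketch below is illustrative only; it assumes tokens is a character vector of word tokens for one genre (for example, the unigrams produced by NGramTokenizer above), and the function names are not from the original analysis code.
# Illustrative computation of the measures reported in Tables 1-3.
# Assumes `tokens` is a character vector of word tokens for one genre.
vocabularySize <- function(tokens) length(tokens)          # V: total word tokens
wordTypes      <- function(tokens) length(unique(tokens))  # T: unique words
typeTokenRatio <- function(tokens) wordTypes(tokens) / vocabularySize(tokens)
# Richards (1987): diversity = T / sqrt(2 * V), roughly independent of sample size
diversity      <- function(tokens) wordTypes(tokens) / sqrt(2 * vocabularySize(tokens))
With the blog figures from Table 1 (V = 37,334,131 tokens and T = 1,103,548 types), these formulas reproduce the reported TTR of roughly 0.030 and diversity of roughly 127.7.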
*Table 2: Effect of vocabulary size on diversity measures*
Measure | Type | 50 | 500 | 5,000 | 50K | 60% | 80% | Entire
--------- | ------ | ----- | ----- | ----- | ------ | ------ | ------ | ------
Diversity | Blog | 16.38 | 34.17 | 57.31 | 83.07 | 118.55 | 123.72 | 127.71
Diversity | Corpus | 22.68 | 45.22 | 72.56 | 103.84 | 163.43 | 172.10 | 179.05
*Table 3: Effect of vocabulary size on Type/Token Ratios*
Measure | Type | 50 | 500 | 5,000 | 50K | 60% | 80% | Entire
--------- | ------ | ----- | ----- | ----- | ------ | ------ | ------ | ------
TTR | Blog | 0.51 | 0.31 | 0.18 | 0.08 | 0.04 | 0.03 | 0.03
TTR | Corpus | 0.49 | 0.29 | 0.15 | 0.07 | 0.04 | 0.03 | 0.03
Understanding the distribution of word tokens helps shape expectations for the linguistic model. An N-gram is a contiguous sequence of N words from a string of text. This project uses a 3-gram model, whose basic building blocks are unigrams (n = 1), bigrams (n = 2), and trigrams (n = 3). The code developed to build each of these three N-gram models is available in these appendices (a short illustrative sketch follows the list):
* <a href="#c-1-building-the-1-gram-unigram-model">C-1: Building the 1-Gram (unigram) Model</a>
* <a href="#c-2-building-the-2-gram-bigram-model">C-2: Building the 2-Gram (bigram) Model</a>
* <a href="#c-3-building-the-3-gram-trigram-model">C-3: Building the 3-Gram (trigram) Model</a>