Background

The motivation for this report is to:

  1. Demonstrate that the data has been downloaded and successfully loaded.

  2. Create a basic report of summary statistics about the data sets.

  3. Report any interesting findings on the data so far.

Data Import and Summary Statistics

The data for the project comes from corpora that were collected from publicly available sources by a web crawler. For this analysis the three files in English will be used. The data was downloaded to the working directory and analyzed from there.
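
For reproducibility, a sketch of the download step is shown below. The URL is an assumption (the commonly used Coursera-SwiftKey capstone dataset link) and is not taken from this report.

## Download and unpack the corpus if it is not already present
## (sketch; the URL is assumed, not confirmed by this report)
if (!dir.exists("./final/en_US")) {
        zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
        download.file(zipUrl, destfile = "Coursera-SwiftKey.zip", mode = "wb")
        unzip("Coursera-SwiftKey.zip")
}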

list.files(pattern = "^en_US.*txt$", path = "./final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Using readLines, the text files are read and saved as character vectors.

blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

What do these look like? Next we will take a look at the head of each of the three files.

head(blogs, n = 3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
head(news, n = 3)
## [1] "He wasn't home alone, apparently."                                                                                                                                                
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
head(twitter, n = 3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."

The next step in exploring the corpus is to find the size of each file and the number of lines, characters, and words in each.

library(stringr)  # str_count() is used below to count words
## Size of each file
size <- round(file.info(c("./final/en_US/en_US.blogs.txt", 
                          "./final/en_US/en_US.news.txt", 
                          "./final/en_US/en_US.twitter.txt"))$size/1024/1024, 2)
## Number of lines in each file
lines <- c(length(blogs), 
           length(news), 
           length(twitter))
## Number of characters in each file
char <- c(sum(nchar(blogs)), 
          sum(nchar(news)), 
          sum(nchar(twitter)))
## Number of words
words <- c(sum(str_count(blogs, "\\S+")), 
           sum(str_count(news, "\\S+")), 
           sum(str_count(twitter, "\\S+")))
## Knit results into a data table
stats <- cbind(size, lines, char, words)
colnames(stats) <- c("File Size (MB)", "Lines", "Characters", "Words")
rownames(stats) <- c("Blogs", "News", "Twitter")

Here we see a summary table of the file size, line count, character count, and word count for each document.

library(knitr)  # kable() renders the summary tables
kable(stats)
        File Size (MB)    Lines  Characters     Words
Blogs           200.42   899288   206824505  37334131
News            196.28  1010242   203223159  34372530
Twitter         159.36  2360148   162096241  30373583

These plots show the number of entries (lines) and the number of words per corpus (text source). Each corpus has at least 800,000 lines of text (entries, tweets, items) and at least 30 million words.

# plot prep
library(ggplot2)   # plotting
library(ggthemes)  # theme_few(), used for the n-gram plots further below
summaryStats <- as.data.frame(stats)

g.line.count <- ggplot(summaryStats, aes(x = factor(rownames(stats)), y = lines/1e+06))
g.line.count <- g.line.count + geom_bar(stat = "identity") +
  labs(y = "# of lines/million", x = "text source", title = "Count of lines per Corpus") 
g.word.count <- ggplot(summaryStats, aes(x = factor(rownames(stats)), y = words/1e+06))
g.word.count <- g.word.count + geom_bar(stat = "identity") + 
  labs(y = "# of words/million", x = "text source", title = "Count of words per Corpus")
g.line.count

g.word.count

Preparing the Corpus for analysis and developing a word prediction algorithm

To accommodate the limited processing power of the laptop being used, a 10% sample of each file is taken before proceeding to tokenize and create n-grams.

#Create a sample corpus to be processed
set.seed(432)
blogs_sample <- blogs[sample(length(blogs), 0.1*length(blogs))]
news_sample <- news[sample(length(news), 0.1*length(news))]
twitter_sample <- twitter[sample(length(twitter), 0.1*length(twitter))]

Compare the line and word counts before and after sampling.

## Number of lines in each file
lines_sample <- c(length(blogs_sample), 
                  length(news_sample), 
                  length(twitter_sample))
## Number of words
words_sample <- c(sum(str_count(blogs_sample, "\\S+")), 
                  sum(str_count(news_sample, "\\S+")), 
                  sum(str_count(twitter_sample, "\\S+")))
## Knit results into a data table
stats_sample <- cbind(lines, words, lines_sample, words_sample)
colnames(stats_sample) <- c("Lines", "Words", "Lines_Sample", "Words_Sample")
rownames(stats_sample) <- c("Blogs", "News", "Twitter")

This table compares the number of lines and words in each text source before and after sampling.

kable(stats_sample)
          Lines     Words  Lines_Sample  Words_Sample
Blogs    899288  37334131         89928       3718194
News    1010242  34372530        101024       3437775
Twitter 2360148  30373583        236014       3033842

Here we see the sample corpus reduced to 10% of the full corpus. This should still be sufficient for building the word prediction model while shrinking the data look-up tables, thereby speeding up the app and making it lighter on computational resources.
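
As a quick check of the memory savings from sampling, a minimal sketch using base R's object.size is shown below (the exact values depend on the machine and the random sample, so no output is reproduced here).

## Approximate in-memory size of the full vs. sampled blogs text (sketch)
format(object.size(blogs), units = "MB")
format(object.size(blogs_sample), units = "MB")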

Data Cleaning

Prior to building document feature matrices we create a combined corpus, convert the text from UTF-8 to ASCII, tokenize the corpus, remove numbers, punctuation, symbols, Twitter handles, and separators, and then create n-grams (unigrams through 4-grams).

library(quanteda)  # corpus(), tokens(), tokens_ngrams(), dfm()
cSC <- corpus(c(blogs_sample, news_sample, twitter_sample))
texts(cSC) <- iconv(texts(cSC), from = "UTF-8", to = "ASCII", sub = "")
corpusTokensNoTwitter <- tokens(cSC, remove_numbers = TRUE, remove_punct = TRUE,
                                remove_symbols = TRUE, remove_twitter = TRUE,
                                remove_separators = TRUE)
corpusUnigrams <- tokens_ngrams(corpusTokensNoTwitter, n = 1L)
corpusBigrams <- tokens_ngrams(corpusTokensNoTwitter, n = 2L)
corpusTrigrams <- tokens_ngrams(corpusTokensNoTwitter, n = 3L)
corpusQuartgrams <- tokens_ngrams(corpusTokensNoTwitter, n = 4L)

Additional text processing and creating document feature matrices

At this stage it is not uncommon to remove stopwords; since the purpose of this project is to create a predictive text model, they will be kept. However, we do filter for profane language: using a list of words available from LDNOOBW, the profane words are removed while building the quanteda dfm objects.

profanity <- readLines("profaneWords.txt", encoding = "UTF-8", skipNul = TRUE)
dfmUni <- dfm(corpusUnigrams, remove = profanity, stem = FALSE)
dfmBi <- dfm(corpusBigrams, remove = profanity, stem = FALSE)
dfmTri <- dfm(corpusTrigrams, remove = profanity, stem = FALSE)
dfmQuart <- dfm(corpusQuartgrams, remove = profanity, stem = FALSE)

Load the multiplot function (see the linked multiplot function source), which arranges several ggplot objects in a single figure; a minimal stand-in is sketched below.
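
In case that source is unavailable, here is a minimal stand-in with the same call signature, built on gridExtra::grid.arrange rather than the original grid-based implementation (a sketch, not the original function).

## Minimal stand-in for multiplot() (sketch): lays out ggplot objects in a grid.
## Requires the gridExtra package; the original linked implementation differs internally.
library(gridExtra)
multiplot <- function(..., cols = 1) {
        grid.arrange(grobs = list(...), ncol = cols)
}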

Plot of the top 20 n-grams from the corpus.

dfmNgram <- c("uni", "bi", "tri", "quart")
for (i in 1:4) {
        ## Prepare data frame for plotting
        graphData <- as.data.frame(
                topfeatures(dfm(cSC, remove = profanity, stem = FALSE,
                                remove_punct = TRUE, ngrams = i), 20))
        colnames(graphData) <- "frequency"
        graphData$ngram <- row.names(graphData)
        ## Generate plots 
        g <- ggplot(graphData, aes(y = frequency, 
                                   x = reorder(ngram, frequency)))
        g <- g + geom_bar(stat = "identity") + coord_flip()
        g <- g + ggtitle(paste(i, "-grams", sep = "")) 
        g <- g + ylab("") + xlab("")
        g <- g + theme_few()
        assign(paste("p", i, sep = ""), g)
}
## Combine plots
multiplot(p1, p2, p3, p4, cols=2)

Future directions and plans for the app

For the app we plan on using a 4-gram probabilistic language model (n-gram model) with Stupid Backoff to rank next-word candidates. According to D. Jurafsky et al., n-gram models allow probabilities to be assigned to sequences of words: since an n-gram is a sequence of words of length N, an n-gram model can estimate the probability of the last word of an n-gram given the previous words. This makes n-grams suitable for generating and ranking next-word predictions.

Interactively, the word prediction works as follows: a user enters the first three words of a phrase they are typing into the app UI. That phrase is checked for a match in an n-gram look-up table, which is split into the first words (the previous words, or prefix) and a last word. The look-up tables are document feature frequency tables for the 4-grams, trigrams, bigrams, and unigrams created from the sample corpus. The word to be predicted is determined by first searching for a suitable result in the highest-order n-gram look-up table available. If no n-gram meets that criterion, which is typically the case for misspellings or rare phrases, the algorithm backs off to the next lower-order n-gram table to return a result; the score of a result found there is discounted by a multiplicative back-off factor lambda of 0.4 per order backed off. Overall the goal is to return a word prediction. A minimal sketch of this scoring scheme is shown below.
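
To make the ranking concrete, the following is a minimal sketch of Stupid Backoff scoring under an assumed table layout: a list luts whose element k is a data frame of k-gram counts with columns prefix, lastword, and count (luts[[1]] holding plain unigram counts). The layout and the function name stupid_backoff are illustrative assumptions, not the final app code.

## Minimal Stupid Backoff sketch (illustrative only). `luts[[k]]` is assumed to be
## a data frame of k-gram counts with columns `prefix` (the first k-1 words joined
## by "_"), `lastword`, and `count`; `luts[[1]]` holds `lastword` and `count` only.
stupid_backoff <- function(phrase, luts, lambda = 0.4, top_n = 3) {
        words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 3)
        stopifnot(length(words) >= 1)
        scored <- list()
        discount <- 1
        for (k in seq(length(words), 1)) {
                prefix <- paste(tail(words, k), collapse = "_")
                hits <- luts[[k + 1]]                        # (k+1)-gram table
                hits <- hits[hits$prefix == prefix, ]
                if (nrow(hits) > 0) {
                        ## Score = discounted relative frequency of the continuation
                        hits$score <- discount * hits$count / sum(hits$count)
                        scored[[length(scored) + 1]] <- hits[, c("lastword", "score")]
                }
                discount <- discount * lambda                # 0.4 penalty per back-off
        }
        if (length(scored) == 0) {
                ## Nothing matched: fall back to the most frequent unigrams
                uni <- luts[[1]]
                uni$score <- discount * uni$count / sum(uni$count)
                scored[[1]] <- uni[, c("lastword", "score")]
        }
        ranked <- do.call(rbind, scored)
        ranked <- ranked[order(-ranked$score), ]
        head(ranked[!duplicated(ranked$lastword), ], top_n)
}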

Some considerations for speeding up the app are pruning the frequency tables to exclude low-frequency text completions, and pre-computing the scores and storing them in the look-up tables so that the most likely prediction can be returned without having to compute the likelihood of a result at query time. A sketch of this pruning and pre-scoring step follows.
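
As an illustration of that plan (not final app code), the following sketch prunes the 4-gram counts with an illustrative threshold of 3 and pre-computes a relative-frequency score per prefix; the object names quartFreq and quartLUT are assumptions introduced here.

## Sketch: prune rare 4-grams and pre-compute their scores (threshold is illustrative)
quartFreq <- colSums(dfmQuart)                 # total count of each 4-gram feature
quartFreq <- quartFreq[quartFreq >= 3]         # drop low-frequency completions
quartLUT <- data.frame(
        prefix   = sub("_[^_]+$", "", names(quartFreq)),  # first three words
        lastword = sub("^.*_", "", names(quartFreq)),     # word to predict
        count    = as.integer(quartFreq),
        stringsAsFactors = FALSE
)
## Pre-compute the relative-frequency score per prefix so the app only ranks rows
quartLUT$score <- quartLUT$count / ave(quartLUT$count, quartLUT$prefix, FUN = sum)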