Background

The motivation for this report is to:

  1. Demonstrate that the data has been downloaded and successfully loaded.

  2. Create a basic report of summary statistics about the data sets.

  3. Report any interesting findings on the data so far.

Data Import and Summary Statistics

The data for the project comes from corpora that were collected from publicly available sources by a web crawler. For this analysis the three files in English will be used. The data was downloaded to the working directory and analyzed from there.
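
For reproducibility, a sketch of the download step is shown below. The URL is an assumption (the commonly used Coursera-SwiftKey capstone dataset link) and is not taken from this report.

## Download and unpack the corpus if it is not already present
## (sketch; the URL is assumed, not confirmed by this report)
if (!dir.exists("./final/en_US")) {
        zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
        download.file(zipUrl, destfile = "Coursera-SwiftKey.zip", mode = "wb")
        unzip("Coursera-SwiftKey.zip")
}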

list.files(pattern = "^en_US.*txt$", path = "./final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Using readLines, the text files are read and saved as character vectors.

blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

What do these look like? Next we will take a look at the head of each of the three files.

head(blogs, n = 3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "We love you Mr. Brown."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
head(news, n = 3)
## [1] "He wasn't home alone, apparently."                                                                                                                                                
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
head(twitter, n = 3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."

The next step in exploring the corpus is to find the size of each file and the number of lines, characters, and words in each.

library(stringr)  # str_count() is used below to count words
## Size of each file
size <- round(file.info(c("./final/en_US/en_US.blogs.txt", 
                          "./final/en_US/en_US.news.txt", 
                          "./final/en_US/en_US.twitter.txt"))$size/1024/1024, 2)
## Number of lines in each file
lines <- c(length(blogs), 
           length(news), 
           length(twitter))
## Number of characters in each file
char <- c(sum(nchar(blogs)), 
          sum(nchar(news)), 
          sum(nchar(twitter)))
## Number of words
words <- c(sum(str_count(blogs, "\\S+")), 
           sum(str_count(news, "\\S+")), 
           sum(str_count(twitter, "\\S+")))
## Knit results into a data table
stats <- cbind(size, lines, char, words)
colnames(stats) <- c("File Size (MB)", "Lines", "Characters", "Words")
rownames(stats) <- c("Blogs", "News", "Twitter")

Here we see a summary table of the file size, line count, character count, and word count for each document.

library(knitr)  # kable() renders the summary tables
kable(stats)
        File Size (MB)    Lines  Characters     Words
Blogs           200.42   899288   206824505  37334131
News            196.28  1010242   203223159  34372530
Twitter         159.36  2360148   162096241  30373583

These plots show the number of entries (lines) and the number of words per corpus (text source). Each corpus has at least 800,000 lines of text (entries, tweets, items) and at least 30 million words.

# plot prep
library(ggplot2)   # plotting
library(ggthemes)  # theme_few(), used for the n-gram plots further below
summaryStats <- as.data.frame(stats)

g.line.count <- ggplot(summaryStats, aes(x = factor(rownames(stats)), y = lines/1e+06))
g.line.count <- g.line.count + geom_bar(stat = "identity") +
  labs(y = "# of lines/million", x = "text source", title = "Count of lines per Corpus") 
g.word.count <- ggplot(summaryStats, aes(x = factor(rownames(stats)), y = words/1e+06))
g.word.count <- g.word.count + geom_bar(stat = "identity") + 
  labs(y = "# of words/million", x = "text source", title = "Count of words per Corpus")
g.line.count

g.word.count

Preparing the Corpus for analysis and developing a word prediction algorithm

To accommodate the limited processing power of the laptop being used, a 10% sample of each file is taken before proceeding to tokenize and create n-grams.

#Create a sample corpus to be processed
set.seed(432)
blogs_sample <- blogs[sample(length(blogs), 0.1*length(blogs))]
news_sample <- news[sample(length(news), 0.1*length(news))]
twitter_sample <- twitter[sample(length(twitter), 0.1*length(twitter))]

Compare the line and word counts before and after sampling.

## Number of lines in each file
lines_sample <- c(length(blogs_sample), 
                  length(news_sample), 
                  length(twitter_sample))
## Number of words
words_sample <- c(sum(str_count(blogs_sample, "\\S+")), 
                  sum(str_count(news_sample, "\\S+")), 
                  sum(str_count(twitter_sample, "\\S+")))
## Knit results into a data table
stats_sample <- cbind(lines, words, lines_sample, words_sample)
colnames(stats_sample) <- c("Lines", "Words", "Lines_Sample", "Words_Sample")
rownames(stats_sample) <- c("Blogs", "News", "Twitter")

This table compares the number of lines and words in each text source before and after sampling.

kable(stats_sample)
          Lines     Words  Lines_Sample  Words_Sample
Blogs    899288  37334131         89928       3718194
News    1010242  34372530        101024       3437775
Twitter 2360148  30373583        236014       3033842

Here we see the sample corpus reduced to 10% of the full corpus. This should still be sufficient for building the word prediction model while shrinking the data look-up tables, thereby speeding up the app and making it lighter on computational resources.
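
As a quick check of the memory savings from sampling, a minimal sketch using base R's object.size is shown below (the exact values depend on the machine and the random sample, so no output is reproduced here).

## Approximate in-memory size of the full vs. sampled blogs text (sketch)
format(object.size(blogs), units = "MB")
format(object.size(blogs_sample), units = "MB")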

Data Cleaning

Prior to building document feature matrices we create a combined corpus, convert the text from UTF-8 to ASCII, tokenize the corpus, remove numbers, punctuation, symbols, Twitter handles, and separators, and then create n-grams (unigrams through 4-grams).

library(quanteda)  # corpus(), tokens(), tokens_ngrams(), dfm()
cSC <- corpus(c(blogs_sample, news_sample, twitter_sample))
texts(cSC) <- iconv(texts(cSC), from = "UTF-8", to = "ASCII", sub = "")
corpusTokensNoTwitter <- tokens(cSC, remove_numbers = TRUE, remove_punct = TRUE,
                                remove_symbols = TRUE, remove_twitter = TRUE,
                                remove_separators = TRUE)
corpusUnigrams <- tokens_ngrams(corpusTokensNoTwitter, n = 1L)
corpusBigrams <- tokens_ngrams(corpusTokensNoTwitter, n = 2L)
corpusTrigrams <- tokens_ngrams(corpusTokensNoTwitter, n = 3L)
corpusQuartgrams <- tokens_ngrams(corpusTokensNoTwitter, n = 4L)

Additional text processing and creating document feature matrices

At this stage it is not uncommon to remove stopwords; since the purpose of this project is to create a predictive text model, they will be kept. However, we do filter for profane language: using a list of words available from LDNOOBW, the profane words are removed while building the quanteda dfm objects.

profanity <- readLines("profaneWords.txt", encoding = "UTF-8", skipNul = TRUE)
dfmUni <- dfm(corpusUnigrams, remove = profanity, stem = FALSE)
dfmBi <- dfm(corpusBigrams, remove = profanity, stem = FALSE)
dfmTri <- dfm(corpusTrigrams, remove = profanity, stem = FALSE)
dfmQuart <- dfm(corpusQuartgrams, remove = profanity, stem = FALSE)

Load the multiplot function (see the linked multiplot function source), which arranges several ggplot objects in a single figure; a minimal stand-in is sketched below.
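
In case that source is unavailable, here is a minimal stand-in with the same call signature, built on gridExtra::grid.arrange rather than the original grid-based implementation (a sketch, not the original function).

## Minimal stand-in for multiplot() (sketch): lays out ggplot objects in a grid.
## Requires the gridExtra package; the original linked implementation differs internally.
library(gridExtra)
multiplot <- function(..., cols = 1) {
        grid.arrange(grobs = list(...), ncol = cols)
}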

Plot of the top 20 n-grams from the corpus.

dfmNgram <- c("uni", "bi", "tri", "quart")
for (i in 1:4) {
        ## Prepare data frame for plotting
        graphData <- as.data.frame(
                topfeatures(dfm(cSC, remove = profanity, stem = FALSE,
                                remove_punct = TRUE, ngrams = i), 20))
        colnames(graphData) <- "frequency"
        graphData$ngram <- row.names(graphData)
        ## Generate plots 
        g <- ggplot(graphData, aes(y = frequency, 
                                   x = reorder(ngram, frequency)))
        g <- g + geom_bar(stat = "identity") + coord_flip()
        g <- g + ggtitle(paste(i, "-grams", sep = "")) 
        g <- g + ylab("") + xlab("")
        g <- g + theme_few()
        assign(paste("p", i, sep = ""), g)
}
## Combine plots
multiplot(p1, p2, p3, p4, cols=2)

Future directions and plans for the app

For the app we plan on using a 4-gram probabilistic language model (n-gram model) with Stupid Backoff to rank next-word candidates. According to D. Jurafsky et al., n-gram models allow probabilities to be assigned to sequences of words: since an n-gram is a sequence of words of length N, an n-gram model can estimate the probability of the last word of an n-gram given the previous words. This makes n-grams suitable for generating and ranking next-word predictions.

Interactively, the word prediction works as follows: a user enters the first three words of a phrase they are typing into the app UI. That phrase is checked for a match in an n-gram look-up table, which is split into the first words (the previous words, or prefix) and a last word. The look-up tables are document feature frequency tables for the 4-grams, trigrams, bigrams, and unigrams created from the sample corpus. The word to be predicted is determined by first searching for a suitable result in the highest-order n-gram look-up table available. If no n-gram meets that criterion, which is typically the case for misspellings or rare phrases, the algorithm backs off to the next lower-order n-gram table to return a result; the score of a result found there is discounted by a multiplicative back-off factor lambda of 0.4 per order backed off. Overall the goal is to return a word prediction. A minimal sketch of this scoring scheme is shown below.
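
To make the ranking concrete, the following is a minimal sketch of Stupid Backoff scoring under an assumed table layout: a list luts whose element k is a data frame of k-gram counts with columns prefix, lastword, and count (luts[[1]] holding plain unigram counts). The layout and the function name stupid_backoff are illustrative assumptions, not the final app code.

## Minimal Stupid Backoff sketch (illustrative only). `luts[[k]]` is assumed to be
## a data frame of k-gram counts with columns `prefix` (the first k-1 words joined
## by "_"), `lastword`, and `count`; `luts[[1]]` holds `lastword` and `count` only.
stupid_backoff <- function(phrase, luts, lambda = 0.4, top_n = 3) {
        words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 3)
        stopifnot(length(words) >= 1)
        scored <- list()
        discount <- 1
        for (k in seq(length(words), 1)) {
                prefix <- paste(tail(words, k), collapse = "_")
                hits <- luts[[k + 1]]                        # (k+1)-gram table
                hits <- hits[hits$prefix == prefix, ]
                if (nrow(hits) > 0) {
                        ## Score = discounted relative frequency of the continuation
                        hits$score <- discount * hits$count / sum(hits$count)
                        scored[[length(scored) + 1]] <- hits[, c("lastword", "score")]
                }
                discount <- discount * lambda                # 0.4 penalty per back-off
        }
        if (length(scored) == 0) {
                ## Nothing matched: fall back to the most frequent unigrams
                uni <- luts[[1]]
                uni$score <- discount * uni$count / sum(uni$count)
                scored[[1]] <- uni[, c("lastword", "score")]
        }
        ranked <- do.call(rbind, scored)
        ranked <- ranked[order(-ranked$score), ]
        head(ranked[!duplicated(ranked$lastword), ], top_n)
}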

Some considerations for speeding up the app are pruning the frequency tables to exclude low-frequency text completions, and pre-computing the scores and storing them in the look-up tables so that the most likely prediction can be returned without having to compute the likelihood of a result at query time. A sketch of this pruning and pre-scoring step follows.
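
As an illustration of that plan (not final app code), the following sketch prunes the 4-gram counts with an illustrative threshold of 3 and pre-computes a relative-frequency score per prefix; the object names quartFreq and quartLUT are assumptions introduced here.

## Sketch: prune rare 4-grams and pre-compute their scores (threshold is illustrative)
quartFreq <- colSums(dfmQuart)                 # total count of each 4-gram feature
quartFreq <- quartFreq[quartFreq >= 3]         # drop low-frequency completions
quartLUT <- data.frame(
        prefix   = sub("_[^_]+$", "", names(quartFreq)),  # first three words
        lastword = sub("^.*_", "", names(quartFreq)),     # word to predict
        count    = as.integer(quartFreq),
        stringsAsFactors = FALSE
)
## Pre-compute the relative-frequency score per prefix so the app only ranks rows
quartLUT$score <- quartLUT$count / ave(quartLUT$count, quartLUT$prefix, FUN = sum)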