Capstone Milestone

Summary

This report demonstrates progress on a project to create a predictive text algorithm. It summarizes some exploratory investigations into the provided texts, which will serve as the basis for the algorithm. In particular, it gives an overview of the text files in use and explores the frequency of single words, combinations of two words, and combinations of three words (unigrams, bigrams, and trigrams). Eventually these frequencies will be used as a kind of dictionary for the predictive text app.

Loading in data

#required libraries
library(NLP)
library(tm)
library(RWeka)
#See if files are present in working directory, if not, download.
if(!file.exists("Coursera-SwiftKey.zip")){
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  "Coursera-SwiftKey.zip")
}
#check to see if unzipped directory exists, if not, unzip the English files.
if(!dir.exists("./Coursera-SwiftKey/final/en_US/")){
    US_Files <- grep('en_US..', unzip("Coursera-SwiftKey.zip", list=TRUE)$Name, 
                     ignore.case=TRUE, value=TRUE)
    unzip("Coursera-SwiftKey.zip",files = US_Files, exdir = "./Coursera-SwiftKey")
}

#paths for the three files
twFile <- "./Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
bFile <- "./Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
nFile <- "./Coursera-SwiftKey/final/en_US/en_US.news.txt"

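#function for reading in text
#note: on some platforms, text mode ("r") can stop reading early at an embedded
#end-of-file character; open = "rb" is a safer choice if a line count looks too low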
readText <- function(path){
    con <- file(path, open = "r")
    textVect <- readLines(con, warn = FALSE)
    close(con)
    textVect
}

#function for collecting information on texts
info <- function(path, textVect){
    #positions of non-word runs in each line; counting them approximates
    #the number of words per line
    wds <- gregexpr("\\W+",textVect)
    data.frame(
        #File size in MB; file.info(path)[1] is a one-column data frame, so
        #this column appears as "size" in the output
        FileSize = file.info(path)[1]/1024^2,
        FileLength = length(textVect), #Num Entries
        MaxWords = max(lengths(wds)), #Max Words/Line
        TotalWords = sum(lengths(wds)), #Word Count
        row.names= deparse(substitute(textVect))
    )
}

#call to functions to read and summarize information on texts
blogs <- readText(bFile)
news <- readText(nFile)
twitter <- readText(twFile)

infoTable <- rbind(
    info(bFile, blogs),
    info(nFile, news),
    info(twFile, twitter)
)
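The word counts above come from counting runs of non-word characters rather than the words themselves. The sketch below, which uses made-up toy lines rather than the actual data, illustrates how that count can differ from a count obtained by splitting on the same pattern. The names toyLines, sepCounts and wordCounts are assumptions introduced only for this illustration.

```r
# Toy lines (invented for illustration, not taken from the data set)
toyLines <- c("the quick brown fox", "hello, world!", "one")

# gregexpr() returns one vector of match positions per line;
# a line with no match yields a single -1, so its length is 1
sepCounts <- lengths(gregexpr("\\W+", toyLines))

# Splitting on the same pattern counts the words directly
wordCounts <- lengths(strsplit(toyLines, "\\W+"))

print(data.frame(line = toyLines, sepCounts, wordCounts))
```

The two counts agree for some lines and are off by one for others, so the totals in the summary table are best read as approximate.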

Summary of texts

infoTable

##             size FileLength MaxWords TotalWords
## blogs   200.4242     899288     6851   38487556
## news    196.2775      77259     1521    2760230
## twitter 159.3641    2360148       62   30513860

Sampling Data and creation of “corpus”

Because these three texts are too large to process as a whole, I draw a random sample from the pooled collection of all three, so that blogs, news articles, and tweets, whose writing styles might reasonably differ, are all represented. Note that pooling weights each source by its number of entries, so the source with the most lines contributes the most to the sample.

From this sample of 2000 entries, I make a corpus using the “tm” package and clean it by removing punctuation and numbers, collapsing extra whitespace, and converting all strings to lowercase.

allText <- c(blogs, news, twitter)
set.seed(1738) # for repeatable results
textSample <- sample(allText, 2000, replace = FALSE)
corpus <- VCorpus(VectorSource(textSample))

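#tolower is not a tm transformation, so wrap it in content_transformer()
#to keep the documents in the corpus as text documents
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)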
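To make the effect of the cleaning steps concrete, here is a minimal base R sketch that approximates them on a single made-up string. The function cleanText and the toy input are assumptions introduced for illustration; this is not the tm implementation, and the final trimming step is an extra convenience added here.

```r
# Hypothetical helper that mimics the cleaning steps with base R only
cleanText <- function(x) {
    x <- tolower(x)                   # convert to lowercase
    x <- gsub("[[:punct:]]+", "", x)  # drop punctuation
    x <- gsub("[[:digit:]]+", "", x)  # drop numbers
    x <- gsub("[[:space:]]+", " ", x) # collapse runs of whitespace
    trimws(x)                         # extra step: trim leading/trailing blanks
}

toy <- "Hello,   World! It's 2015."
cleaned <- cleanText(toy)
print(cleaned)
```

Note that removing punctuation also merges contractions ("It's" becomes "its"), which may matter later when the n-grams are used for prediction.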

Creation of n-gram tokens

Using RWeka, create tokenizers that split the corpus into 1-, 2-, and 3-word patterns. Pass each tokenizer to the “TermDocumentMatrix” function, which returns an object recording how often each pattern occurs in each entry of the corpus.

wCont <- function(n){
    Weka_control(min=n,max=n)
}

uniG <- function(x) NGramTokenizer(x, wCont(1))
biG  <- function(x) NGramTokenizer(x, wCont(2))
triG <- function(x) NGramTokenizer(x, wCont(3))

uniMatrix <-
    TermDocumentMatrix(corpus, control = list(tokenize = uniG))
biMatrix <-
    TermDocumentMatrix(corpus, control = list(tokenize = biG))
triMatrix <-
    TermDocumentMatrix(corpus, control = list(tokenize = triG))
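As a rough picture of what an n-gram tokenizer produces, the following base R sketch builds the word patterns for one short made-up sentence. The function makeNgrams is a hypothetical helper written for this illustration and is not part of RWeka or tm.

```r
# Hypothetical helper: slide a window of n words over a word vector
makeNgrams <- function(words, n) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
}

toyWords <- strsplit("the cat sat on the mat", " ", fixed = TRUE)[[1]]
bigrams  <- makeNgrams(toyWords, 2)
trigrams <- makeNgrams(toyWords, 3)

print(bigrams)
print(trigrams)
```

A sentence of six words yields five bigrams and four trigrams, which is one reason longer patterns each tend to occur less often in a sample of fixed size.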

Plotting top 25 patterns

Create a function that, given a TermDocumentMatrix (TDM), plots the 25 most common patterns in that TDM. Then plot the results for uni-, bi-, and trigrams.

top25plot <- function(tdm,title){
    sorted <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
    d <- data.frame(term = names(sorted), freq = sorted)
    barplot(d[1:25, ]$freq, names.arg = d[1:25, ]$term, las = 2,
        main = title, ylab = "Frequencies")
}

top25plot(uniMatrix,"Unigram Frequent Patterns")

top25plot(biMatrix, "Bigram Frequent Patterns")

top25plot(triMatrix, "Trigram Frequent Patterns")

Next Steps

By decreasing the sparsity of the TermDocumentMatrices, it may be possible to use more of the texts in the investigation. This might be particularly helpful for the trigram frequencies, as the frequencies of three-term patterns are lower.

For creating a predictive text application, this pattern of tokenization will be useful as a kind of “dictionary” to look up potential patterns.
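One way such a dictionary lookup might work is sketched below in base R. The bigram counts are invented numbers, and predictNext is a hypothetical function written for this illustration; it is only one possible design, not the method the final app is committed to.

```r
# Made-up bigram counts, for illustration only
bigramFreq <- c("i am" = 10, "i have" = 7, "i was" = 4,
                "you are" = 9, "you have" = 3)

# Hypothetical lookup: return up to k most frequent words following prevWord
predictNext <- function(prevWord, freq, k = 3) {
    parts  <- strsplit(names(freq), " ", fixed = TRUE)
    first  <- vapply(parts, `[`, character(1), 1)  # first word of each bigram
    second <- vapply(parts, `[`, character(1), 2)  # second word of each bigram
    keep <- first == prevWord
    if (!any(keep)) return(character(0))           # unseen word: no suggestion
    ord <- order(freq[keep], decreasing = TRUE)
    head(second[keep][ord], k)
}

print(predictNext("i", bigramFreq))
print(predictNext("you", bigramFreq, k = 1))
```

A word that never appears first in a bigram returns no suggestion here; a fuller approach would need some fallback, for example to the most frequent unigrams.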