This report demonstrates progress on a project to build a predictive text algorithm. It summarizes some exploratory investigations into the provided texts, which will serve as the basis for that algorithm. In particular, it gives an overview of the text files in use and explores the frequency of single words, two-word combinations, and three-word combinations (unigrams, bigrams, and trigrams). Eventually these frequencies will be used as a kind of dictionary for the predictive text app.
#required libraries
library(NLP)
library(tm)
library(RWeka)
#See if the zip file is present in the working directory; if not, download it.
if(!file.exists("Coursera-SwiftKey.zip")){
  # mode = "wb" requests a binary transfer so the archive is not altered on Windows
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                "Coursera-SwiftKey.zip", mode = "wb")
}
#check to see if the unzipped directory exists; if not, unzip the English files.
if(!dir.exists("./Coursera-SwiftKey/final/en_US/")){
  # list the archive contents and keep only the en_US entries
  US_Files <- grep('en_US..', unzip("Coursera-SwiftKey.zip", list=TRUE)$Name,
                   ignore.case=TRUE, value=TRUE)
  unzip("Coursera-SwiftKey.zip", files = US_Files, exdir = "./Coursera-SwiftKey")
}
#paths for the three files
twFile <- "./Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
bFile <- "./Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
nFile <- "./Coursera-SwiftKey/final/en_US/en_US.news.txt"
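#function for reading in text, one element per line
readText <- function(path){
  # Note: on some platforms a text-mode connection ("r") may stop early at an
  # embedded control character; opening with "rb" is a common way to avoid this.
  con <- file(path, open = "r")
  textVect <- readLines(con, warn = FALSE)
  close(con)
  textVect
}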
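#function for collecting information on texts
info <- function(path, textVect){
  # locate runs of non-word characters in each line; the number of runs
  # per line serves as an approximate word count
  wds <- gregexpr("\\W+", textVect)
  data.frame(
    # file.info(path)[1] is a one-column data frame, so this column prints as "size"
    FileSize = file.info(path)[1]/1024^2,  # file size in MiB
    FileLength = length(textVect),         # number of entries (lines)
    MaxWords = max(lengths(wds)),          # approximate max words per line
    TotalWords = length(unlist(wds)),      # approximate total word count
    row.names = deparse(substitute(textVect))
  )
}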
#call to functions to read and summarize information on texts
blogs <- readText(bFile)
news <- readText(nFile)
twitter <- readText(twFile)
infoTable <- rbind(
info(bFile, blogs),
info(nFile, news),
info(twFile, twitter)
)
infoTable
## size FileLength MaxWords TotalWords
## blogs 200.4242 899288 6851 38487556
## news 196.2775 77259 1521 2760230
## twitter 159.3641 2360148 62 30513860
Because these three texts are too large to process as a whole, I will work with a sample. I pool the lines of all three files and sample from the pool so that the sample mixes the writing styles of blogs, news articles and Twitter, which might reasonably differ. Note that a uniform sample from the pool draws from each file roughly in proportion to its number of lines.
From a sample of 2000 entries, I build a corpus using the “tm” package and clean it by converting all strings to lowercase, removing punctuation and numbers, and collapsing extra whitespace.
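To make the weighting question concrete, here is a minimal sketch that compares sampling from a pooled vector with sampling a fixed number of lines per source. The toy vectors, their sizes, and the names `sources`, `pooledCounts` and `perSource` are assumptions made for illustration; they are not the real data or part of the report's code.

```r
# Toy stand-ins for the three sources (hypothetical sizes, not the real files)
set.seed(1)
sources <- list(
  blogs   = paste("blog line", 1:900),
  news    = paste("news line", 1:200),
  twitter = paste("tweet", 1:2000)
)

# Pooled sampling: every line is equally likely, so larger sources
# contribute more lines to the sample
pooled <- unlist(sources, use.names = FALSE)
origin <- rep(names(sources), times = lengths(sources))
idx <- sample(length(pooled), 300)
pooledCounts <- table(factor(origin[idx], levels = names(sources)))
print(pooledCounts)

# Stratified sampling: draw the same number of lines from each source
perSource <- 100
stratified <- unlist(lapply(sources, sample, size = perSource),
                     use.names = FALSE)
length(stratified)
```

If equal representation of the three writing styles matters more than reflecting their relative volume, the stratified variant may be the better fit; the pooled variant is the one that matches a single `sample()` call over the combined vector.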
allText <- c(blogs, news, twitter)
set.seed(1738) # for repeatable results
textSample <- sample(allText, 2000, replace = FALSE)
corpus <- VCorpus(VectorSource(textSample))
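# content_transformer() lets a plain string function such as tolower run on a
# VCorpus while keeping each entry a tm text document
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)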
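To see what these cleaning steps do to a single entry, the following base R sketch approximates them on a plain character vector. The function name `cleanText` is hypothetical, the regular expressions are an approximation of the tm transformations rather than their exact implementation, and the final `trimws()` call is an extra step that is not part of the corpus pipeline.

```r
# Approximate the cleaning steps on plain strings (illustrative only)
cleanText <- function(x) {
  x <- tolower(x)                    # lowercase everything
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)   # drop digits
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # extra step: trim leading/trailing space
}

cleanText("Hello,  World! It's 2016...")
# [1] "hello world its"
```

Note that removing punctuation turns "it's" into "its", so contractions and possessives become indistinguishable from other words after cleaning.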
Using RWeka, create a tokenizer that splits the corpus into 1-, 2-, and 3-word patterns. Pass the tokenizer to the “TermDocumentMatrix” function, which returns an object recording how often each pattern occurs in each entry of the corpus.
# Weka control object requesting n-grams of exactly n words
wCont <- function(n){
  Weka_control(min=n,max=n)
}
# tokenizers for unigrams, bigrams and trigrams
uniG <- function(x) NGramTokenizer(x, wCont(1))
biG <- function(x) NGramTokenizer(x, wCont(2))
triG <- function(x) NGramTokenizer(x, wCont(3))
# one term-document matrix per n-gram size
uniMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = uniG))
biMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = biG))
triMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = triG))
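For readers without a Java setup for RWeka, the idea behind n-gram tokenization can be sketched in base R. The helper `makeNgrams` and the toy `docs` vector are hypothetical names introduced here for illustration; this is a simplified stand-in that splits on spaces only, not a reimplementation of `NGramTokenizer`.

```r
# Split a cleaned string into overlapping n-word patterns (illustrative only)
makeNgrams <- function(text, n) {
  words <- strsplit(text, " +")[[1]]
  words <- words[nzchar(words)]               # drop empty tokens
  if (length(words) < n) return(character(0)) # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

makeNgrams("the quick brown fox", 2)
# [1] "the quick"   "quick brown" "brown fox"

# Counting n-grams across several entries gives a frequency table
docs <- c("the cat sat", "the cat ran")
bigramCounts <- table(unlist(lapply(docs, makeNgrams, n = 2)))
bigramCounts[["the cat"]]
# [1] 2
```

A term-document matrix carries the same information, but keeps the counts separate for each entry instead of summing them.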
Create a function that, given a TermDocumentMatrix, plots the 25 most common patterns in that TDM. Then plot the uni-, bi- and trigrams.
top25plot <- function(tdm, title){
  # total count of each term across all entries, most frequent first
  sorted <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  # keep at most 25 terms (avoids NA rows when fewer are available)
  d <- head(data.frame(term = names(sorted), freq = sorted), 25)
  barplot(d$freq, names.arg = as.character(d$term), las = 2,
          main = title, ylab = "Frequencies")
}
top25plot(uniMatrix,"Unigram Frequent Patterns")
top25plot(biMatrix, "Bigram Frequent Patterns")
top25plot(triMatrix, "Trigram Frequent Patterns")
Decreasing the sparsity of the TermDocumentMatrix objects (for example by dropping terms that appear in very few entries, which tm supports through removeSparseTerms) would shrink them, which may make it possible to use more of the texts in the investigation. This could be particularly helpful for the trigrams, since the frequencies of three-term patterns are lower.
For creating a predictive text application, this pattern of tokenization will be useful as a kind of “dictionary” in which to look up potential patterns.
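As a rough illustration of what decreasing sparsity means, the sketch below measures, for each term, the share of entries in which it does not occur and drops terms above a threshold. The toy matrix and the helper `dropSparse` are hypothetical; the threshold rule is written in the spirit of sparse-term removal and is not claimed to match tm's exact cutoff behaviour.

```r
# Toy term-document matrix: rows are terms, columns are entries (made-up counts)
tdm <- matrix(
  c(1, 1, 1, 1,
    1, 0, 1, 0,
    0, 0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("the", "cat", "zebra"), paste0("doc", 1:4))
)

# Sparsity of a term: fraction of entries in which it never appears
termSparsity <- rowMeans(tdm == 0)
termSparsity
#   the   cat zebra
#  0.00  0.50  0.75

# Keep only terms whose sparsity does not exceed the threshold
dropSparse <- function(m, maxSparsity) {
  m[rowMeans(m == 0) <= maxSparsity, , drop = FALSE]
}

reduced <- dropSparse(tdm, 0.5)
rownames(reduced)
# [1] "the" "cat"
```

The trade-off is that rare patterns are discarded, so a stricter threshold saves memory at the cost of coverage for unusual phrases.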
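One way such a dictionary lookup might work is sketched below: given the previous word, return the most frequent second words among the bigrams that start with it. The counts in `bigramCounts` are made up for illustration, and `predictNext` is a hypothetical helper, not the planned implementation of the app.

```r
# Hypothetical bigram frequencies (not taken from the real corpus)
bigramCounts <- c("the cat" = 5, "the dog" = 3, "a cat" = 2, "the end" = 1)

# Suggest up to k next words for prevWord, ordered by bigram frequency
predictNext <- function(prevWord, counts, k = 3) {
  parts  <- strsplit(names(counts), " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)  # first word of each bigram
  second <- vapply(parts, `[`, character(1), 2)  # second word of each bigram
  keep <- first == prevWord
  hits <- counts[keep]
  names(hits) <- second[keep]
  names(head(sort(hits, decreasing = TRUE), k))
}

predictNext("the", bigramCounts, k = 2)
# [1] "cat" "dog"
```

The same lookup extends to trigrams by matching on the previous two words, falling back to the bigram table when no trigram matches.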