This report outlines exploratory data analysis for the Coursera Capstone project in the Data Science Specialization. The utilized data set can was downloaded from the link below.
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
In this article, the Data Import and Tokenization methods are explained and some key features of the text files are evaluated. Finally, the proposed prediction model methodology is explained and previewed. All the developed functions used in this study are listed at the end of this article as an appendix.
The “getImportDt” function below imports all three text files from the working directory into a data.table.
allLinesDt<- getImportDt(NULL, NULL)
| Text Doc | Number of Lines |
|---|---|
| en_US.blogs.txt | 899,288 |
| en_US.news.txt | 1,010,242 |
| en_US.twitter.txt | 2,360,148 |
Based on the table above we can see that the twitter text file has a lot more lines than the news file, followed by the blogs file. Considering the large size of these text files, a smaller subset of the data is required to further evaluate them, as follows.
linesnmax = 50000
nsamp = 5000
set.seed(34341)
dtSubset <- getImportDt(linesnmax = linesnmax, nsamp = nsamp)
docsOrig <- dtToQcorp(dtSubset)
For testing purposes and due to RAM limitations, only the first 50000 lines were read from each input file and then only a small subset (5,000 samples per input file) was selected from those lines, with equal number of samples per input file.
The “dtToQcorp” function converts the data.table into a corpus using the quanteda package. The tm package was initially utilized for text processing/analysis but the switch to quanteda was made due to it’s faster speed.
Below, a document feature matrix is created using the quanteda package and the number of total words per document is sumamrized.
dfmWrds <- dfm(docsOrig)
numwords <- ntoken(dfmWrds)
| Text Doc File | No. of words | No. of lines read |
|---|---|---|
| en_US.blogs.txt | 237,319 | 5,000 |
| en_US.news.txt | 200,046 | 5,000 |
| en_US.twitter.txt | 76,987 | 5,000 |
Based on the table above we can see that blogs have a lot more words per lines read, followed by news articles, followed by tweets. Typically, blogs tend to have longer texts compared to news. And tweets are limited in characters and so would be expected to have the least number of words per line.
To further analyze the text in each text file, n-grams were extracted from the document corpus. The “Get_qdfm” function receives a quanteda corpus and returns a n-gram tokenized document-feature matrix. Here are some assumptions made during tokenization of ngrams:
For further manipulation, the document-feature matrices are converted into ordered frequency data.tables using the “getOrderedFreqDt”. All three text documents are summarized into one document-feature data table using this function. A spell check option is built into this custom function such that ngrams with any misspelled words are removed from the data table.
It is possible some lines in the text corpora are misspelled or that they are not in English. For prediction model development it is not desirable to predict misspelled or non-English words for the user.
Based on the results above, going forward, all n-grams that contain words that are not in the English dictionary based on the hunspell package, will be removed from the document-term data tables.
# Generate ngrams for subset of data
freqWordsDt <- getOrderedFreqDt(Get_qdfm(docsOrig, n = 1, removeStpwrds = F), spellCheck = T)
freqBigramDt <- getOrderedFreqDt(Get_qdfm(docsOrig, n = 2, removeStpwrds = F), spellCheck = T)
freqTrigramDt <- getOrderedFreqDt(Get_qdfm(docsOrig, n = 3, removeStpwrds = F), spellCheck = T)
# Extract stems from words
stem_doc <- stemDocument(freqWordsDt$term)
stems <- unique(stem_doc)
# Determine unique words required for 50% and 90% of all words
freqWordsDt[,coverage:=cumsum(freq)/sum(freq)]
nwords.5 = nrow(freqWordsDt[coverage<.5])
nwords.9 = nrow(freqWordsDt[coverage<.9])
The frequencies of the top single, 2-grams, and 3-grams of the text of all three documents combined are summarized in the graphs above. Based on this analysis:
The term frequencey graphs below compare the words used between the three texts document. To evaluate the context of the words used here, stopwords are removed from the frequencies shown in the graphs below.
It’s also interesting to note that the top word for news articles is ‘said’ and the top 2 words for tweets are ‘just’ and ‘like’
The function “getProbMatrix” was developed to generate a list of ngram probability matrices by:
The function “predictNxtWrd” was developed to predict the next word provided a series of words by:
Going forward I intend to take the following steps to further evaluate and develop the prediction model:
# Function used to load text files in current directory as data.table
getImportDt <- function(linesnmax = 50000, nsamp = 2000, txtfilename = NULL){
# linesnmax and nsamp can take on NULL values which mean they are infinity in this function
# If txtfilename is not defined, all files in current directory are read
if (is.null(txtfilename)) {
filenames <- list.files(".") #read filenames from directory
} else {
filenames <- txtfilename
}
dtread <- NULL
dtread <- data.table(file = character(), lines = character())
# Only read the first linesnmax lines of each text doc
for (filename in filenames) {
if (is.null(linesnmax)){
lins <- readr::read_lines(filename)
} else {
lins <- readr::read_lines(filename, n_max = linesnmax)
}
newdt <- data.table(file = filename, lines = lins)
dtread <- rbind(dtread, newdt)
}
dtread$file<-as.factor(dtread$file)
# Take sample of nsamp size from the read data
if (!is.null(nsamp)) dtread <- dtread[, .SD[sample(.N, nsamp)], by = file]
return(dtread)
}
# This function converts the data table generated from getImportDt to a quanteda corpus
dtToQcorp <- function(dtinput){
docs <- NULL
for (filename in levels(dtinput$file)){
subdt <- dtinput[file == filename]
doc_str <- paste(subdt$lines, collapse = "\n")
tempdoc <- corpus(doc_str, docnames = filename)
if (is.null(docs)) {docs <- tempdoc
} else {docs <- c(docs, tempdoc)}
}
return(docs)
}
# This function generates a document-feature matrix based on a quanteda corpus
# To avoid tokenizing words from different sentences, first all lines are tokenized into sentences
Get_qdfm <- function(qcorp, n = 1, removeStpwrds = F){
sens <- unlist(lapply(tokenize_lines(as.String(qcorp)), tokenize_sentences),
use.names = F)
# Generate a list of explicit swear words
stpwrds <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt", warn = F)
if (removeStpwrds == T) stpwrds <- c(stpwrds, stopwords())
# The tokenize_ngrams function automatically removes punct and extra whitespace
ngrams <- unlist(tokenize_ngrams(sens, n=n, lowercase = T, stopwords = stpwrds))
ngrams <- ngrams[!is.na(ngrams)]
dfm(as.tokens(list(ngrams)))
}
# This function generates an ordered term-frequency data table based on a quanteda dfm
# The function combines all docs in dfm, info about individual docs is lost
getOrderedFreqDt <- function(dfminput, spellCheck = T){
dt <- data.table(convert(dfminput, "data.frame"))
dt <- dt[,-c(1)] # remove first "document" column
freqv <- colSums(dt)
freqdt <- data.table(term = names(freqv), freq = freqv)
if (spellCheck == T) {
freqdt[, wrongTerms := hunspell(term)]
freqdt[, correctSpell := identical(wrongTerms[[1]], character(0)), by= 1:nrow(freqdt)]
freqdt <- freqdt[correctSpell==T, c("term", "freq")]
}
freqdt[!grepl("[0-9]", term)] #removes all numbers from ngrams
setorder(freqdt, -freq)
return(freqdt)
}
# Function for plotting terms vs occurrences
plot_occurrences <- function(freqDt, nwords = 20){
wf <- data.frame(term = freqDt$term,
occurrences = freqDt$freq,
row.names = 1:nrow(freqDt))[1:nwords,]
ggplot(wf, aes(term, occurrences)) +
geom_bar(stat="identity") +
coord_flip() +
scale_x_discrete(limits=wf$term)
}
# This function splits sentence by their last n number of words and is used in following funcs
getSplitSent <- function(sen, nwrds){
wrds <- unlist(strsplit(sen, split = " "))
lastNwrds <- paste(tail(wrds,nwrds), collapse = " ")
remaining <- paste(head(wrds,length(wrds)-nwrds), collapse = " ")
c(remaining, lastNwrds)
}
# This function generates a list of freq/probabilitty matrices based on input quanteda corpus
getProbMatrix <- function(inputDocs, maxngram = 3, bareMatOnly = F, coverage = 1.0) {
freqList <- list()
freqList[[1]] <- maxngram
# Convert frequency vectors into data tables and split words in ngram cols
for (num in 2:maxngram){
dt <- getOrderedFreqDt(Get_qdfm(inputDocs, n=num), spellCheck = T)
if (coverage > 0.0 & coverage < 1.0) {
dt[,sumFreq:=cumsum(freq)]
dt <- dt[sumFreq<coverage*sum(freq)]
dt <- dt[,-c("sumFreq")]
}
dt[,remainingTerm := getSplitSent(term, 1)[1], by= 1:nrow(dt)]
dt[,lastWrd := getSplitSent(term, 1)[2], by= 1:nrow(dt)]
dt[,rTermFreq := sum(freq), by = .(remainingTerm)]
# Calculate probabilities and logs
dt[,p := freq / rTermFreq]
dt[,logp := log(p)]
if (bareMatOnly == T) dt[,c("term", "freq", "rTermFreq", "p"):=NULL]
freqList[[num]] <- dt
}
return(freqList)
}
# This function cleans and reformats an input sentences used to clean user input before predicting
Clean_Str <- function(inputstr, removeStpwrds = F){
corpus <- VCorpus(VectorSource(inputstr),
readerControl = list(reader=readPlain, language = "en"))
# Lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove explicitly profane words
profanity <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt", warn = F)
corpus <- tm_map(corpus, removeWords, profanity)
# Remove extra whitespace BUT maintain \n line breaks
whitespaceFUN <- content_transformer(function(x) gsub("[ ]+", " ",as.String(x)))
corpus <- tm_map(corpus, whitespaceFUN)
# Remove stop words if applicable
if (removeStpwrds == T) {
corpus <- tm_map(corpus, removeWords, words = stopwords("en"))
}
return(corpus[[1]]$content)
}
# This function suggest the most probable words to the user based on their input phrase
predictNxtWrd <- function(inputpmat, inputsent) {
# determine starting n based on length of input sentence and maxngram in prob matrix
numWrds <- length(strsplit(inputsent, split = " ")[[1]])
maxngram <- inputpmat[[1]]
n <- numWrds + 1
if (n > maxngram) n <- maxngram
# Use "Backoff" to determine probabilities
#TODO: develop an actual "stupid backoff" model uses 0.4 x probability of next ngram down to calculate probabilities
for (i in n:2){
lastNWrds_str <- getSplitSent(inputsent, i - 1)[2]
# format input sentence
lastNWrds_str <- Clean_Str(lastNWrds_str)
# select corresponding ngram matrix
pdt <- inputpmat[[i]]
subpdt <- pdt[remainingTerm == lastNWrds_str]
# if no match for last-n-words then use next ngram down, otherwise break loop
if (nrow(subpdt)!=0) break
}
#TODO: optimize so whole list doesn't have to be ordered just to get top words
setorderv(subpdt, c("logp"), c(-1))
predictTop <- subpdt$lastWrd[1:5]
}