This report is submitted as an assignment in the Coursera Data Science Capstone course. It is an intermediate stage in building an application that predicts the next word for a partial sentence typed by a user. The report briefly summarizes the source data (text files), some statistics such as word and line counts, highlighted findings, and the main idea for the prediction algorithm and the UI (Shiny app) that will form the final product.
The data, provided by http://www.corpora.heliohost.org/, is available here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The relevant files (English) are text files from three different sources: blogs, news, and Twitter.
Using datasets from different sources helps verify that the prediction model fits different text-typing environments and is not limited to predicting well in only one specific environment (overfitting).
There are several functions in R for reading text data. It appears that the blogs and Twitter files are easily read with readLines(), while the news file is better read with readChar().
All the text files contain characters whose encoding differs from the local environment; therefore, after reading the files into variables, the text was converted to a different encoding using the function iconv(), and tag characters were omitted.
The following code reads each of the three text files, converts the text encoding while omitting tags, counts the words and lines, and randomly samples 5% of the rows of each file (see Training dataset below).
fileNames <- list.files(path = "../final", full.names = TRUE, recursive =TRUE)
conT <- file(fileNames[4], "r") #Read blog file
keepT <- readLines(conT, 900000, warn = FALSE) ; close(conT)
keepT[1] ; iconv(keepT[1], from="UTF-8", to="ASCII", sub="") #ASCII conversion and omitting unknown characters
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods."
keepT <- iconv(keepT[1:899288], from="UTF-8", to="ASCII", sub="")
BlogWords <- length(unlist(strsplit(gsub(" +", " ", keepT), " "))) #word count 37274639
set.seed(100) ; sampT <- sample.int(length(keepT),length(keepT)/20) #random 5% sampling
sampledBlog <- keepT[sampT]
#The code is repeated for the twitter and the news files
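As noted above, the news file was read with readChar() rather than readLines(). The sketch below shows one way this could look; the fileNames index, the binary connection, and the split on newline characters are assumptions rather than the exact code used.
newsSize <- file.info(fileNames[5])$size #assumed index of the news file in fileNames
conN <- file(fileNames[5], "rb") #open a binary connection to read the whole file at once
newsText <- readChar(conN, newsSize, useBytes = TRUE) ; close(conN)
keepN <- unlist(strsplit(newsText, "\n", fixed = TRUE)) #split the single block back into lines
keepN <- iconv(keepN, from="UTF-8", to="ASCII", sub="") #same conversion as for the blog file
NewsWords <- length(unlist(strsplit(gsub(" +", " ", keepN), " "))) #word count
set.seed(100) ; sampN <- sample.int(length(keepN), length(keepN)/20) #random 5% sampling
sampledNews <- keepN[sampN]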
The files differ in size, number of lines, and number of words. The following table lists the original size (in bytes) and the line and word counts for each file:
fileInfo <- data.frame(fileName=c("blogs", "news", "twitter"), LineCounts=c(BlogLines, NewsLines, TwitterLines), WordCounts=c(BlogWords, NewsWords, TwitterWords), Size=fileSizes) ; fileInfo
## fileName LineCounts WordCounts Size
## 1 blogs 899288 37274639 210160014
## 2 news 1010615 34326286 205811889
## 3 twitter 2360148 30343688 167105338
For further exploration, the datasets were sampled “randomly” to create 3 smaller datasets that represent the 3 files. Since the source data is huge, each dataset was built by randomly selecting rows totaling 5% of each original file. The 3 datasets were integrated into one “training” dataset before further exploring, “cleaning” and transforming the data. Working with a representative training dataset is faster, and leaving out part of the data makes it possible to later test the accuracy of the prediction model with “new” data, i.e. to test how well the model predicts for data that was not included in the training dataset.
The following code integrates the 3 datasets into one training dataset with 213501 lines and 5087988 words, creates a word frequency table (top and bottom are shown), and plots a histogram of this table (split into 2 histograms). The most frequent word is “the”, and words that occur only once (188429 such words) appear to be foreign words, typos, etc. The histograms cover words with fewer than 20 occurrences and words with at least 20 occurrences, and focus on scales of up to 20 occurrences and up to 300 occurrences, respectively. The highest occurrence counts are not shown on these scales; for example, the word “the” occurs 235349 times.
#integrate datasets
trainingData <- c(sampledBlog, sampledNews, sampledTwitter) #combine the 3 sampled datasets
#create word frequency table and histogram
dataSepWords <- strsplit(gsub(" +", " ", trainingData), " ")
tableTrain <- sort(table(tolower(unlist(dataSepWords))), decreasing = TRUE)
singleOccurences <- length(tableTrain[which(tableTrain==1)]) #188429 words
head(tableTrain, 20) ; tail(tableTrain, 15)
##
## the to and a of in i for is that
## 235349 136663 117730 117542 99367 80389 79212 54418 52003 49378
## you on it with was my at be have this
## 41243 39347 37617 35146 31151 29905 27960 26410 26287 25124
##
## zwillinger zwischenzug zwow zx zydeco/cajun
## 1 1 1 1 1
## zygmunt, zynga's zynga-holders. zynga, zynga/facebook
## 1 1 1 1 1
## zyu zz=-] zzz's, zzz's; zzzzz's.
## 1 1 1 1 1
par(mfrow=c(1,2))
hist(tableTrain[which(tableTrain<20)], xlim=c(0,20), breaks=10000, xlab="Occurrences", ylab="Number of Words (frequency)", main="Histogram of Word Frequency \n (<20 occurrences)")
hist(tableTrain[which(tableTrain>=20)], xlim=c(0,300), breaks=10000, xlab="Occurrences", ylab="Number of Words (frequency)", main="Histogram of Word Frequency \n (>=20 to 300 occurrences)")
After cleaning, transforming and breaking the data into sentences, the cleaned data was saved to a file named “sentencesTraining.txt”. The data was broken into sentences by first replacing all relevant punctuation with a dot (for example, ! and ? define the end of a sentence), then replacing the dot with an end-of-line character, and re-reading the data using scan() and textConnection(), which breaks the data into new rows. See the code below.
#trainingData <- gsub("[!?]", ".", trainingData)  #! and ? also mark the end of a sentence
#trainingData <- gsub("[.]", "\n", trainingData)  #the dot becomes an end-of-line character
#conSen <- textConnection(trainingData)
#sentences <- scan(conSen, what=character(), sep="\n") ; close(conSen)
#writeLines(sentences, "./sentencesTraining.txt")  #saved for the n-gram exploration below
Basic n-gram exploration
The following code creates 3-gram and 4-gram tables and adds to each table a probability column, which is the number of occurrences of the full n-gram divided by the number of occurrences of its (n-1)-gram prefix.
Only a fifth of the training data (the first 100000 sentences) was used here for this exploration.
conRead <- file("./sentencesTraining.txt", open="r")
readCleanTrain <- readLines(conRead) ; close(conRead)
sentences <- readCleanTrain
library(plyr) ; library(NLP) #count() from plyr and ngrams() from NLP are used below
sentencesWords <- strsplit(sentences[1:100000], " ", fixed = TRUE)
gram3 <- lapply(1:100000, function(x) ngrams(sentencesWords[[x]], 3L))
gram3simple <- unlist(gram3, recursive=FALSE)
gram3frame <- as.data.frame(t(sapply(gram3simple, rbind)))
gram3freq3 <- count(gram3frame, vars=c("V1", "V2", "V3"))
gram3freq2 <- count(gram3frame, vars=c("V1", "V2"))
gram3all <- merge(gram3freq3, gram3freq2, by.x=c("V1","V2"), by.y=c("V1","V2"), all.x=TRUE)
gram3all$prob <- gram3all$freq.x / gram3all$freq.y
gram3all <- gram3all[order(-gram3all$prob, -gram3all$freq.x, gram3all$freq.y),]
gram3all[350000:350005,] ; gram3all[900000:900005,]
## V1 V2 V3 freq.x freq.y prob
## 891160 under administrative detention 1 1 1
## 891161 under again ourselves 1 1 1
## 891162 under age three 1 1 1
## 891168 under another branch 1 1 1
## 891172 under are free 1 1 1
## 891173 under arrangements made 1 1 1
## V1 V2 V3 freq.x freq.y prob
## 303519 for your consideration 1 170 0.005882353
## 303521 for your cinco 1 170 0.005882353
## 303523 for your kids 1 170 0.005882353
## 303524 for your collections 1 170 0.005882353
## 303525 for your animation 1 170 0.005882353
## 303526 for your friendships 1 170 0.005882353
#gram4all is created by repeating the code above with the relevant parameter values
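#A sketch of that repetition (assumed to mirror the 3-gram code above, now with 4-word windows):
gram4 <- lapply(1:100000, function(x) ngrams(sentencesWords[[x]], 4L))
gram4simple <- unlist(gram4, recursive=FALSE)
gram4frame <- as.data.frame(t(sapply(gram4simple, rbind)))
gram4freq4 <- count(gram4frame, vars=c("V1", "V2", "V3", "V4")) #full 4-gram counts
gram4freq3 <- count(gram4frame, vars=c("V1", "V2", "V3"))       #3-word prefix counts
gram4all <- merge(gram4freq4, gram4freq3, by=c("V1","V2","V3"), all.x=TRUE)
gram4all$prob <- gram4all$freq.x / gram4all$freq.y
gram4all <- gram4all[order(-gram4all$prob, -gram4all$freq.x, gram4all$freq.y),]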
gram4all[700000:700005,] ; gram4all[1100000:1100005,]
## V1 V2 V3 V4 freq.x freq.y prob
## 918706 the undefeated boxer prefers 1 1 1
## 918707 the under secretary of 1 1 1
## 918708 the underbrush with a 1 1 1
## 918709 the undercroft I had 1 1 1
## 918710 the undercut sidepod fif<num>s 1 1 1
## 918711 the underdog mahself - 1 1 1
## V1 V2 V3 V4 freq.x freq.y prob
## 213093 check out the literary 1 38 0.02631579
## 213094 check out the line-up 1 38 0.02631579
## 213095 check out the fab 1 38 0.02631579
## 213096 check out the shop 1 38 0.02631579
## 213097 check out the fashion 1 38 0.02631579
## 213098 check out the archives 1 38 0.02631579
The algorithm will be based on the statistics of sequences of 3 or 4 words (tri-grams or 4-grams, respectively). For example, in the case of tri-grams, predicting the next word will be based on the last two words. The accuracy of the two n-gram models will be compared. However, using both requires larger “dictionaries” (i.e. more data stored in the model), and therefore the decision to include both in the model will take into account the application's performance.
The prediction model will take as input a sequence of words, and clean and transform these words in the same way as the data in the model (for example, change them to lower-case). Then, the model will extract the last two or three words (as described above) and will find the next word based on a match of this word sequence.
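A minimal sketch of this lookup is shown below, assuming the gram3all and gram4all tables built above; the function name predictNextWord() and the back-off from the 4-gram to the 3-gram table are illustrative choices, not the final design.
#Hypothetical sketch of the prediction lookup (not the final implementation)
predictNextWord <- function(phrase, gram3all, gram4all) {
  #clean and transform the input like the training data (single spaces, lower-case)
  words <- tolower(unlist(strsplit(gsub(" +", " ", phrase), " ")))
  n <- length(words)
  if (n >= 3) { #try a 4-gram match on the last three words
    hits <- gram4all[gram4all$V1==words[n-2] & gram4all$V2==words[n-1] & gram4all$V3==words[n], ]
    if (nrow(hits) > 0) return(as.character(hits$V4[which.max(hits$prob)]))
  }
  if (n >= 2) { #fall back to a 3-gram match on the last two words
    hits <- gram3all[gram3all$V1==words[n-1] & gram3all$V2==words[n], ]
    if (nrow(hits) > 0) return(as.character(hits$V3[which.max(hits$prob)]))
  }
  NA_character_ #no match found in either table
}
#Example call: predictNextWord("thanks for your", gram3all, gram4all)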
The application UI will include a text box, where the user starts typing a sentence, and a “Submit” button that, when clicked, fills in the next word right after the last word typed by the user.
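A minimal Shiny sketch of this UI, assuming the predictNextWord() helper sketched above is available together with the n-gram tables (the widget IDs are placeholders):
library(shiny)
#Minimal sketch of the planned UI: a text box and a Submit button that appends the predicted word
ui <- fluidPage(
  textInput("sentence", "Type a partial sentence:"),
  actionButton("submit", "Submit")
)
server <- function(input, output, session) {
  observeEvent(input$submit, { #on click, fill in the next word after the typed text
    nextWord <- predictNextWord(input$sentence, gram3all, gram4all)
    if (!is.na(nextWord))
      updateTextInput(session, "sentence", value = paste(input$sentence, nextWord))
  })
}
#shinyApp(ui, server)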
Contraction words list: http://www.grammar-monster.com/glossary/contractions.htm
NLP package PDF: https://cran.r-project.org/web/packages/NLP/NLP.pdf
n-gram wiki: https://en.wikipedia.org/wiki/N-gram
Regular expressions: http://www.endmemo.com/program/R/grep.php
Grouping with plyr and dplyr: http://stackoverflow.com/questions/18201074/find-how-many-times-duplicated-rows-repeat-in-r-data-frame