This report is submitted as an assignment in the Coursera Data Science Capstone course. It is an intermediate stage in building an application that predicts the next word for a partial sentence typed by a user. The report briefly summarizes the source data (text files), some statistics such as word and line counts, highlighted findings, and the main idea for the prediction algorithm and the UI (Shiny app) that will form the final product.
The data, provided by http://www.corpora.heliohost.org/, is available here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The relevant files (English) are text files from three different sources: blogs, news, and Twitter.
Using datasets from different sources helps verify that the prediction model fits different text-typing environments and is not limited to predicting well in only one specific environment (overfitting).
There are several functions in R for reading text data. It appears that the blogs and Twitter files are easily read with readLines(), while the news file is better read with readChar().
All the text files contain characters whose encoding differs from the local environment; therefore, after reading the files into variables, the text was converted to a different encoding using the function iconv(), and tag characters were omitted.
The following code reads each of the three text files, converts the text encoding while omitting tags, counts the words and lines, and randomly samples 5% of the rows of each file (see Training dataset below).
fileNames <- list.files(path = "../final", full.names = TRUE, recursive =TRUE)
conT <- file(fileNames[4], "r") #Read blog file
keepT <- readLines(conT, 900000, warn = FALSE) ; close(conT)
keepT[1] ; iconv(keepT[1], from="UTF-8", to="ASCII", sub="") #ASCII conversion and omitting unknown characters
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â<U+0080><U+009C>godsâ<U+0080>."
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods."
keepT <- iconv(keepT[1:899288], from="UTF-8", to="ASCII", sub="")
BlogWords <- length(unlist(strsplit(gsub(" +", " ", keepT), " "))) #word count 37274639
set.seed(100) ; sampT <- sample.int(length(keepT),length(keepT)/20) #random 5% sampling
sampledBlog <- keepT[sampT]
#The code is repeated for the twitter and the news files
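As noted above, the news file was read with readChar() rather than readLines(). The sketch below shows one way this could look; the fileNames index, the binary connection, and the split on newline characters are assumptions rather than the exact code used.
newsSize <- file.info(fileNames[5])$size #assumed index of the news file in fileNames
conN <- file(fileNames[5], "rb") #open a binary connection to read the whole file at once
newsText <- readChar(conN, newsSize, useBytes = TRUE) ; close(conN)
keepN <- unlist(strsplit(newsText, "\n", fixed = TRUE)) #split the single block back into lines
keepN <- iconv(keepN, from="UTF-8", to="ASCII", sub="") #same conversion as for the blog file
NewsWords <- length(unlist(strsplit(gsub(" +", " ", keepN), " "))) #word count
set.seed(100) ; sampN <- sample.int(length(keepN), length(keepN)/20) #random 5% sampling
sampledNews <- keepN[sampN]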
The files differ in size, number of lines, and number of words. The following table lists the original size (in bytes) and the line and word counts for each file:
fileInfo <- data.frame(fileName=c("blogs", "news", "twitter"), LineCounts=c(BlogLines, NewsLines, TwitterLines), WordCounts=c(BlogWords, NewsWords, TwitterWords), Size=fileSizes) ; fileInfo
## fileName LineCounts WordCounts Size
## 1 blogs 899288 37274639 210160014
## 2 news 1010615 34326286 205811889
## 3 twitter 2360148 30343688 167105338
For further exploration, the datasets were sampled “randomly” to create 3 smaller datasets that represent the 3 files. Since the source data is huge, each dataset was built by randomly selecting rows totaling 5% of each original file. The 3 datasets were integrated into one “training” dataset before further exploring, “cleaning” and transforming the data. Working with a representative training dataset is faster, and leaving out part of the data makes it possible to later test the accuracy of the prediction model with “new” data, i.e. to test how well the model predicts for data that was not included in the training dataset.
The following code integrates the 3 datasets into one training dataset with 213501 lines and 5087988 words, creates a word frequency table (top and bottom are shown), and plots a histogram of this table (split into 2 histograms). The most frequent word is “the”, and words that occur only once (188429 such words) appear to be foreign words, typos, etc. The histograms cover words with fewer than 20 occurrences and words with at least 20 occurrences, and focus on scales of up to 20 occurrences and up to 300 occurrences, respectively. The highest occurrence counts are not shown on these scales; for example, the word “the” occurs 235349 times.
#integrate datasets
trainingData <- c(sampledBlog, sampledNews, sampledTwitter) #combine the 3 sampled datasets
#create word frequency table and histogram
dataSepWords <- strsplit(gsub(" +", " ", trainingData), " ")
tableTrain <- sort(table(tolower(unlist(dataSepWords))), decreasing = TRUE)
singleOccurences <- length(tableTrain[which(tableTrain==1)]) #188429 words
head(tableTrain, 20) ; tail(tableTrain, 15)
##
## the to and a of in i for is that
## 235349 136663 117730 117542 99367 80389 79212 54418 52003 49378
## you on it with was my at be have this
## 41243 39347 37617 35146 31151 29905 27960 26410 26287 25124
##
## zwillinger zwischenzug zwow zx zydeco/cajun
## 1 1 1 1 1
## zygmunt, zynga's zynga-holders. zynga, zynga/facebook
## 1 1 1 1 1
## zyu zz=-] zzz's, zzz's; zzzzz's.
## 1 1 1 1 1
par(mfrow=c(1,2))
hist(tableTrain[which(tableTrain<20)], xlim=c(0,20), breaks=10000, xlab="Occurrences", ylab="Number of Words (frequency)", main="Histogram of Word Frequency \n (<20 occurrences)")
hist(tableTrain[which(tableTrain>=20)], xlim=c(0,300), breaks=10000, xlab="Occurrences", ylab="Number of Words (frequency)", main="Histogram of Word Frequency \n (>=20 to 300 occurrences)")
After cleaning, transforming and breaking the data into sentences, the cleaned data was saved to a file named “sentencesTraining.txt”. The data was broken into sentences by first replacing all relevant punctuation with a dot (for example, ! and ? define the end of a sentence), then replacing the dot with an end-of-line character, and re-reading the data using scan() and textConnection(), which breaks the data into new rows. See the code below.
#trainingData <- gsub("[!?]", ".", trainingData)  #! and ? also mark the end of a sentence
#trainingData <- gsub("[.]", "\n", trainingData)  #the dot becomes an end-of-line character
#conSen <- textConnection(trainingData)
#sentences <- scan(conSen, what=character(), sep="\n") ; close(conSen)
#writeLines(sentences, "./sentencesTraining.txt")  #saved for the n-gram exploration below
Basic n-gram exploration
The following code creates 3-gram and 4-gram tables and adds to each table a probability column, which is the number of occurrences of the full n-gram divided by the number of occurrences of its (n-1)-gram prefix.
Only a fifth of the training data (the first 100000 sentences) was used here for this exploration.
conRead <- file("./sentencesTraining.txt", open="r")
readCleanTrain <- readLines(conRead) ; close(conRead)
sentences <- readCleanTrain
library(plyr) ; library(NLP) #count() from plyr and ngrams() from NLP are used below
sentencesWords <- strsplit(sentences[1:100000], " ", fixed = TRUE)
gram3 <- lapply(1:100000, function(x) ngrams(sentencesWords[[x]], 3L))
gram3simple <- unlist(gram3, recursive=FALSE)
gram3frame <- as.data.frame(t(sapply(gram3simple, rbind)))
gram3freq3 <- count(gram3frame, vars=c("V1", "V2", "V3"))
gram3freq2 <- count(gram3frame, vars=c("V1", "V2"))
gram3all <- merge(gram3freq3, gram3freq2, by.x=c("V1","V2"), by.y=c("V1","V2"), all.x=TRUE)
gram3all$prob <- gram3all$freq.x / gram3all$freq.y
gram3all <- gram3all[order(-gram3all$prob, -gram3all$freq.x, gram3all$freq.y),]
gram3all[350000:350005,] ; gram3all[900000:900005,]
## V1 V2 V3 freq.x freq.y prob
## 891160 under administrative detention 1 1 1
## 891161 under again ourselves 1 1 1
## 891162 under age three 1 1 1
## 891168 under another branch 1 1 1
## 891172 under are free 1 1 1
## 891173 under arrangements made 1 1 1
## V1 V2 V3 freq.x freq.y prob
## 303519 for your consideration 1 170 0.005882353
## 303521 for your cinco 1 170 0.005882353
## 303523 for your kids 1 170 0.005882353
## 303524 for your collections 1 170 0.005882353
## 303525 for your animation 1 170 0.005882353
## 303526 for your friendships 1 170 0.005882353
#gram4all is created by repeating the code above with the relevant parameter values
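#A sketch of that repetition (assumed to mirror the 3-gram code above, now with 4-word windows):
gram4 <- lapply(1:100000, function(x) ngrams(sentencesWords[[x]], 4L))
gram4simple <- unlist(gram4, recursive=FALSE)
gram4frame <- as.data.frame(t(sapply(gram4simple, rbind)))
gram4freq4 <- count(gram4frame, vars=c("V1", "V2", "V3", "V4")) #full 4-gram counts
gram4freq3 <- count(gram4frame, vars=c("V1", "V2", "V3"))       #3-word prefix counts
gram4all <- merge(gram4freq4, gram4freq3, by=c("V1","V2","V3"), all.x=TRUE)
gram4all$prob <- gram4all$freq.x / gram4all$freq.y
gram4all <- gram4all[order(-gram4all$prob, -gram4all$freq.x, gram4all$freq.y),]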
gram4all[700000:700005,] ; gram4all[1100000:1100005,]
## V1 V2 V3 V4 freq.x freq.y prob
## 918706 the undefeated boxer prefers 1 1 1
## 918707 the under secretary of 1 1 1
## 918708 the underbrush with a 1 1 1
## 918709 the undercroft I had 1 1 1
## 918710 the undercut sidepod fif<num>s 1 1 1
## 918711 the underdog mahself - 1 1 1
## V1 V2 V3 V4 freq.x freq.y prob
## 213093 check out the literary 1 38 0.02631579
## 213094 check out the line-up 1 38 0.02631579
## 213095 check out the fab 1 38 0.02631579
## 213096 check out the shop 1 38 0.02631579
## 213097 check out the fashion 1 38 0.02631579
## 213098 check out the archives 1 38 0.02631579
The algorithm will be based on the statistics of sequences of 3 or 4 words (tri-grams or 4-grams, respectively). For example, in the case of tri-grams, predicting the next word will be based on the last two words. The accuracy of the two n-gram models will be compared. However, using both requires larger “dictionaries” (i.e. more data stored in the model), and therefore the decision to include both in the model will take into account the application's performance.
The prediction model will take as input a sequence of words, and clean and transform these words in the same way as the data in the model (for example, change them to lower-case). Then, the model will extract the last two or three words (as described above) and will find the next word based on a match of this word sequence.
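A minimal sketch of this lookup is shown below, assuming the gram3all and gram4all tables built above; the function name predictNextWord() and the back-off from the 4-gram to the 3-gram table are illustrative choices, not the final design.
#Hypothetical sketch of the prediction lookup (not the final implementation)
predictNextWord <- function(phrase, gram3all, gram4all) {
  #clean and transform the input like the training data (single spaces, lower-case)
  words <- tolower(unlist(strsplit(gsub(" +", " ", phrase), " ")))
  n <- length(words)
  if (n >= 3) { #try a 4-gram match on the last three words
    hits <- gram4all[gram4all$V1==words[n-2] & gram4all$V2==words[n-1] & gram4all$V3==words[n], ]
    if (nrow(hits) > 0) return(as.character(hits$V4[which.max(hits$prob)]))
  }
  if (n >= 2) { #fall back to a 3-gram match on the last two words
    hits <- gram3all[gram3all$V1==words[n-1] & gram3all$V2==words[n], ]
    if (nrow(hits) > 0) return(as.character(hits$V3[which.max(hits$prob)]))
  }
  NA_character_ #no match found in either table
}
#Example call: predictNextWord("thanks for your", gram3all, gram4all)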
The application UI will include a text box, where the user starts typing a sentence, and a “Submit” button that, when clicked, fills in the next word right after the last word typed by the user.
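A minimal Shiny sketch of this UI, assuming the predictNextWord() helper sketched above is available together with the n-gram tables (the widget IDs are placeholders):
library(shiny)
#Minimal sketch of the planned UI: a text box and a Submit button that appends the predicted word
ui <- fluidPage(
  textInput("sentence", "Type a partial sentence:"),
  actionButton("submit", "Submit")
)
server <- function(input, output, session) {
  observeEvent(input$submit, { #on click, fill in the next word after the typed text
    nextWord <- predictNextWord(input$sentence, gram3all, gram4all)
    if (!is.na(nextWord))
      updateTextInput(session, "sentence", value = paste(input$sentence, nextWord))
  })
}
#shinyApp(ui, server)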
Contraction words list: http://www.grammar-monster.com/glossary/contractions.htm
NLP package PDF: https://cran.r-project.org/web/packages/NLP/NLP.pdf
n-gram wiki: https://en.wikipedia.org/wiki/N-gram
Regular expressions: http://www.endmemo.com/program/R/grep.php
Grouping with plyr and dplyr: http://stackoverflow.com/questions/18201074/find-how-many-times-duplicated-rows-repeat-in-r-data-frame