Introduction

This document is the milestone report for the Data Science Specialization Capstone Project. The goal of the project is to create a predictive text model that proposes what the next word could be by examining the previous words in the sentence.

To achieve this goal, and to understand the statistical properties of a living language, we will use a corpus (a large, structured set of texts used for statistical natural language processing within a specific language territory; Ref: Wikipedia) provided by JHU via this link.

To optimize the performance of the Shiny application, the required files will be processed in advance and the results will be saved as part of the application.

The code and an RStudio project are available through GitHub.

Getting Data

To repeat the data preparation for the application, the code will automatically download the zip file. If you would like to speed up the process with a download manager, download the zip file to your working directory and rename it corpus.zip.

# Download the corpus only if it is not already in the working directory
if (!file.exists("corpus.zip")) {
  library("downloader")
  download("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
           dest = "corpus.zip", mode = "wb")
}
unzip("corpus.zip")

Once the data is downloaded and extracted (be patient; the archive is about half a gigabyte), you can see that it contains three files for each of four languages, labelled blogs, news and twitter. These sources are among the most important parts of the living language on touch devices, which is the core of our sponsor SwiftKey’s business. The files for American English will be used in this project.

Exploratory Data Analysis

Let’s start with basic statistical information about these text files, such as word counts, line counts and file sizes.

The data is sampled down to 10% of its lines (sample_prc <- 0.1 below) to reduce the computational cost of this proof-of-concept demonstration.

To obtain line counts, let’s read each document into a character vector, one element per line:

fileName <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
filePath <- paste("./final/en_US/", fileName, sep = "")

# Fraction of lines kept from each file for the exploratory sample
sample_prc <- 0.1

# Read each file, count its lines and words, then keep a random sample of lines
linesBlogs    <- readLines(filePath[1], encoding = "latin1")
numLinesBlogs <- length(linesBlogs)
numWordsBlogs <- sum(sapply(gregexpr("\\S+", linesBlogs), length))
linesBlogs    <- sample(linesBlogs, numLinesBlogs * sample_prc)

linesNews    <- readLines(filePath[2], encoding = "latin1")
numLinesNews <- length(linesNews)
numWordsNews <- sum(sapply(gregexpr("\\S+", linesNews), length))
linesNews    <- sample(linesNews, numLinesNews * sample_prc)

linesTwitter    <- readLines(filePath[3], encoding = "latin1")
numLinesTwitter <- length(linesTwitter)
numWordsTwitter <- sum(sapply(gregexpr("\\S+", linesTwitter), length))
linesTwitter    <- sample(linesTwitter, numLinesTwitter * sample_prc)

# File sizes in megabytes
fileSizeMB <- c(file.size(filePath[1]),
                file.size(filePath[2]),
                file.size(filePath[3])) / 1000000

numLines <- c(numLinesBlogs, numLinesNews, numLinesTwitter)
numWords <- c(numWordsBlogs, numWordsNews, numWordsTwitter)

# Collapse all sampled lines into a single string for later processing
alllines <- c(linesBlogs, linesNews, linesTwitter)
alllines <- paste(alllines, collapse = " ")

fileMetaData <- data.frame(fileName = fileName, fileSizeMB = fileSizeMB,
                           numLines = numLines, numWords = numWords,
                           average = numWords / numLines)
colnames(fileMetaData) <- c("Name", "Size(MB)",
                            "#Lines", "#Words", "words per line")
save(fileMetaData, file="MetaDataTable.Rda")
save(alllines, file="Alllines.Rda")

fileMetaData
##                Name Size(MB)  #Lines   #Words words per line
## 1   en_US.blogs.txt 210.1600  899288 37334441       41.51556
## 2    en_US.news.txt 205.8119   77259  2643972       34.22219
## 3 en_US.twitter.txt 167.1053 2360148 30373792       12.86944

Cleaning Data

To simplify the model, we lower-case the text, remove punctuation (keeping apostrophes), drop tokens that contain digits, strip non-ASCII characters and collapse whitespace; the resulting corpus is then stemmed and filtered for profanity:

alllines <- tolower(alllines)                                    # lower-case everything
alllines <- gsub("[^'[:^punct:]]", " ", alllines, perl = TRUE)   # drop punctuation except apostrophes
alllines <- gsub("\\b[[:alnum:]]*[[:digit:]]+[[:alnum:]]*\\b", " ", alllines)  # drop tokens containing digits
alllines <- iconv(alllines, "latin1", "ASCII", sub = "")         # drop non-ASCII characters
alllines <- gsub("\\s+", " ", alllines)                          # collapse whitespace
save(alllines, file = "cleanedText.Rda")

library(tm)
## Loading required package: NLP
sampleCorpus <- Corpus(VectorSource(alllines))
sampleCorpus <- tm_map(sampleCorpus, stemDocument)               # stem words
sampleCorpus <- tm_map(sampleCorpus, removeWords,
                       readLines("modSwearWords.txt"))           # filter profanity
sampleCorpus <- tm_map(sampleCorpus, stripWhitespace)
sampleCorpus.df <-
  data.frame(text = unlist(sapply(sampleCorpus, `[`, "content")),
             stringsAsFactors = FALSE)
save(sampleCorpus, file = "sampleCorpus.Rda")

N-Grams

An n-gram is “a contiguous sequence of n items from a given … corpus” (Wikipedia). We will use the stylo package to extract 1-grams, 2-grams and 3-grams.

library(stylo)
## stylo version: 0.6.3
corpusText<- txt.to.words(sampleCorpus)

UniGrm<-data.frame(table(make.ngrams(corpusText[[1]], ngram.size = 1)))
BiGrm<-data.frame(table(make.ngrams(corpusText[[1]], ngram.size = 2)))
TriGrm<-data.frame(table(make.ngrams(corpusText[[1]], ngram.size = 3)))

We will order the n-grams by their frequencies and also calculate each n-gram’s percentage of occurrence among all n-grams of the same order. The n-grams are sorted in decreasing order of frequency.

UniGrm<-UniGrm[order(UniGrm$Freq, decreasing = TRUE),]
BiGrm<-BiGrm[order(BiGrm$Freq, decreasing = TRUE),]
TriGrm<-TriGrm[order(TriGrm$Freq, decreasing = TRUE),]

UniGrm<-cbind(UniGrm,UniGrm$Freq/sum(UniGrm$Freq))
BiGrm<-cbind(BiGrm,BiGrm$Freq/sum(BiGrm$Freq))
TriGrm<-cbind(TriGrm,TriGrm$Freq/sum(TriGrm$Freq))

#provide better column names
colnames(UniGrm)<-c("Word","Frequency","Prc")
colnames(BiGrm)<-c("Word","Frequency","Prc")
colnames(TriGrm)<-c("Word","Frequency","Prc")

save(UniGrm,BiGrm,TriGrm, file="123grams.Rda")

We are going to demonstrate some properties of the n-grams by using the 50 most frequent ones:

topNumber<-50

top1g<-UniGrm[1:topNumber,]
top2g<-BiGrm[1:topNumber,]
top3g<-TriGrm[1:topNumber,]

save(top1g,top2g,top3g, file="123tops.Rda")

Let’s see the percentages of the top n-grams using bar plots:

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
plot1g <- ggplot(top1g, aes(x = reorder(Word, Prc), y = Prc)) +
  geom_bar(stat = "identity", fill = "red") +
  xlab("1-Gram List") + ylab("Percentage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot2g <- ggplot(top2g, aes(x = reorder(Word, Prc), y = Prc)) +
  geom_bar(stat = "identity", fill = "green") +
  xlab("2-Gram List") + ylab("Percentage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

plot3g <- ggplot(top3g, aes(x = reorder(Word, Prc), y = Prc)) +
  geom_bar(stat = "identity", fill = "blue") +
  xlab("3-Gram List") + ylab("Percentage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

save(plot1g,plot2g,plot3g, file="plots.Rda")

plot1g

plot2g

plot3g

The following figure and data frame show the number of unique words required to cover a given percentage of the corpus.

# Number of distinct words (most frequent first) needed to cover a given
# percentage of all word occurrences in the sample
numWordsForCoverage <- Vectorize(function(prc){
  prc <- prc / 100.0
  run <- 0
  counter <- 0
  while (run < prc && counter < length(UniGrm$Prc)) {
    counter <- counter + 1
    run <- run + UniGrm$Prc[counter]
  }
  return(counter)
}, vectorize.args = c("prc"))

coveragePrc<-0:100
numWordsForCoveragePrc<-numWordsForCoverage(coveragePrc)
coveragePlotFrame<-data.frame(prc=coveragePrc, numWords=numWordsForCoveragePrc)

coveragePlot<-ggplot(data=coveragePlotFrame, aes(x=prc, y=numWords)) +
    geom_line(stat="identity")+
    ylab( "Number of words required for coverage" ) +
    xlab( "Percentage" )
save(coveragePlot,coveragePlotFrame, file= "coveragePlot.Rda")

coveragePlotFrame$numWords[c(25,50,75,85,90,95,99,100)+1]
## [1]    14   101   703  1703  2972  6833 37267 98735
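
As a side note, the same coverage table could also be computed without the explicit loop by taking a cumulative sum over the sorted unigram percentages. A minimal sketch (the name numWordsForCoverageAlt is only illustrative, not part of the application):

cumCoverage <- cumsum(UniGrm$Prc)                 # cumulative share of word occurrences
numWordsForCoverageAlt <- sapply(coveragePrc, function(prc){
  p <- prc / 100.0
  if (p <= 0) return(0)                           # no words are needed for 0% coverage
  idx <- which(cumCoverage >= p)                  # first word index reaching the target share
  if (length(idx) == 0) length(cumCoverage) else idx[1]
})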

Findings about Data

During the process of generating this milestone report, we found the following highlights. They are beyond the scope of this project, but should be considered while designing a professional product.

  1. By inspecting the average number of words per line and the total number of words, you can see that blogs, news and twitter entries have different characteristics. While this difference will be ignored in this project, SwiftKey might consider taking advantage of it by creating application-aware algorithms.

  2. Application-aware algorithms might also suggest structured data, such as popular hashtags for Twitter.

  3. Even though it is beyond the scope of the project, once a prototype is delivered it would be worthwhile to implement the algorithm in another language (preferably compiled instead of interpreted) to speed up the whole process and handle a bigger corpus.

  4. A mid-range laptop is not well suited to crunching all the data. Done professionally, it would be better to use more powerful machines (or even rent a cluster from a cloud service provider); in that case parallelization is required.

  5. Focusing on the final product, a proof of concept for a mobile app, and looking at it from a computational perspective, a balance is required between computational power, disk storage and speed.

  6. Some n-grams are seasonal, such as ‘happy mothers day’ or ‘merry christmas’, and some are geo-spatial, such as ‘new york city’ or ‘los angeles california’. On a GPS-capable device the app could therefore be made location-aware, especially for tweeting.

The following findings will be considered (but are not guaranteed to be implemented) while designing the Shiny application.

  1. Stopwords, while they dominate the corpus, are also an important part of the suggestions, since the application should not suggest grammatically incorrect structures. Some form of normalization is required to include stopwords.

  2. Stemming is required to obtain more useful n-grams. On the other hand, a stem is an incomplete word, so a way is needed to propose the right full word from a predicted stem (a rough sketch of one possible approach follows this list).

  3. To find grammatically correct suggestions from stems, WordNet from Princeton University can be used. Please refer to: Princeton University, “About WordNet”, WordNet, Princeton University, 2010, <http://wordnet.princeton.edu>.
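
As an illustration of the second point above, one possible (untested) approach is to build a lookup table that maps each stem to the most frequent unstemmed word observed in the sample. The sketch below is purely hypothetical; it assumes the cleaned but unstemmed text is still available in alllines (as saved in cleanedText.Rda) and uses the SnowballC stemmer:

library(SnowballC)

# Hypothetical sketch: map each stem to its most frequent full word in the sample
words <- unlist(strsplit(alllines, "\\s+"))
words <- words[words != ""]                       # drop empty tokens
wordFreq <- sort(table(words), decreasing = TRUE)
stemTable <- data.frame(word = names(wordFreq),
                        stem = wordStem(names(wordFreq), language = "english"),
                        stringsAsFactors = FALSE)
# Keep, for every stem, its most frequent full word (rows are already sorted by frequency)
stemToWord <- stemTable[!duplicated(stemTable$stem), ]

# Complete a predicted stem with the most frequent matching full word
completeStem <- function(stem) {
  hit <- stemToWord$word[stemToWord$stem == stem]
  if (length(hit) == 0) stem else hit[1]
}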

Design Decisions

For every sentence, I will look up 3-grams (falling back to 2-grams and 1-grams if there are not enough words) to guess the next word to be typed.
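
To make this concrete, the sketch below shows one way such a back-off lookup could work against the n-gram tables built above. The function name predictNextWord and the simple regex matching (which assumes the cleaned tokens contain no regex metacharacters) are illustrative assumptions, not the final implementation:

# Illustrative back-off: try 3-grams, then 2-grams, then the top 1-gram
predictNextWord <- function(sentence) {
  tokens <- unlist(strsplit(tolower(sentence), "\\s+"))
  n <- length(tokens)
  if (n >= 2) {                                  # look for "w1 w2 ?" in the 3-gram table
    prefix <- paste(tokens[n - 1], tokens[n])
    hits <- TriGrm[grepl(paste0("^", prefix, " "), TriGrm$Word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$Word[1]))
  }
  if (n >= 1) {                                  # fall back to "w ?" in the 2-gram table
    hits <- BiGrm[grepl(paste0("^", tokens[n], " "), BiGrm$Word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$Word[1]))
  }
  as.character(UniGrm$Word[1])                   # last resort: the most frequent unigram
}

Since the n-gram tables are already sorted by decreasing frequency, the first match is also the most frequent continuation.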

The following design aspects will be evaluated in the next three weeks. The decision whether or not to employ them will be discussed in a presentation created with RStudio and published on RPubs. The Shiny application will be distributed through shinyapps.io by RStudio.

  1. I will look for opportunities not only to suggest the next word, but also to suggest words while they are being typed. The n-gram suggestions might be altered accordingly.

  2. I will look for opportunities to use two sets of n-grams, one with stopwords and one without, to harness the power of both corpus versions. I will then mix the candidate suggestions (probably one from the version with stopwords and two from the version without) to cover more options.

  3. With more computational power available, I would like to use WordNet from Princeton University to remove words or correct them to their proper English form; for example, “aaaalright” would become “alright” or be removed.

  4. I will look for options to remove structured data such as hashtags, e-mail addresses, usernames and URLs from the corpus; a possible approach is sketched below.
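
For the last item, a possible (untested) way to strip such structured tokens is a set of regular expressions applied before the rest of the cleaning. The patterns below are rough assumptions and would need tuning against real data:

# Rough, illustrative patterns for structured tokens
removeStructured <- function(text) {
  text <- gsub("(https?://|www\\.)\\S+", " ", text)   # URLs
  text <- gsub("\\S+@\\S+\\.\\S+", " ", text)         # e-mail addresses
  text <- gsub("(^|\\s)#\\S+", " ", text)             # hashtags
  text <- gsub("(^|\\s)@\\S+", " ", text)             # usernames / mentions
  gsub("\\s+", " ", text)                             # collapse leftover whitespace
}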