Important Reminder

This document is the UPDATED milestone report for the Data Science Specialization Capstone Project.

Introduction

The goal of the project is to create a predictive text model that proposes the next word by examining the previous words in the sentence.

Getting Data

To achieve the goal of the project, and to understand the statistical properties of a living language, we will use a corpus (a large, structured set of texts used for statistical natural language processing within a specific language territory, ref. Wikipedia) provided by JHU via the link below.

wget "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip";  
unzip Coursera-SwiftKey.zip;

Exploratory Data Analysis

Let’s start with basic statistical information about these text files, such as word counts, line counts, and sizes.

The Linux wc utility writes the number of lines, words, and bytes contained in each input file (or standard input if no file is specified) to standard output. The output is in the format <lines> <words> <bytes> <filename>. For details, use man wc to consult the utility manual.

wc final/en_US/en_US.blogs.txt
wc final/en_US/en_US.news.txt
wc final/en_US/en_US.twitter.txt
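
The same statistics can also be computed in R as a cross-check; a minimal sketch (counts may differ slightly from wc because of whitespace and encoding handling, and reading the full files needs a few GB of RAM):

# Count lines and whitespace-separated words for each English file.
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")
for (f in files) {
  lines <- readLines(f, encoding = "latin1", skipNul = TRUE)
  words <- sum(lengths(strsplit(lines, "\\s+")))
  cat(f, ":", length(lines), "lines,", words, "words,", file.size(f), "bytes\n")
}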

Sampling Data

Once the data is downloaded and extracted, the dataset is sampled to form a practical corpus. Downsizing is required so that the n-gram models can be stored on a mobile device with limited memory, and to keep response times low.

The Linux shuf utility is used to write a random permutation of its input lines to standard output. For repeatability, the random sample used in the application is provided in the GitHub repository. For details, use man shuf to consult the utility manual.

shuf -n 1000 final/en_US/en_US.blogs.txt > data/sample/en_US.blogs.input.txt;
shuf -n 1000 final/en_US/en_US.news.txt > data/sample/en_US.news.input.txt;
shuf -n 1000 final/en_US/en_US.twitter.txt > data/sample/en_US.twitter.input.txt;
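
An equivalent sample can also be drawn directly in R, with a fixed seed for repeatability; a minimal sketch (the helper name and seed below are chosen for illustration, and this is not the exact permutation shipped in the GitHub repository):

set.seed(20151115)  # fixed seed so the sample can be reproduced
dir.create("data/sample", recursive = TRUE, showWarnings = FALSE)
sample_file <- function(infile, outfile, n = 1000) {
  # Draw n random lines from infile and write them to outfile.
  lines <- readLines(infile, encoding = "latin1", skipNul = TRUE)
  writeLines(sample(lines, min(n, length(lines))), outfile)
}
sample_file("final/en_US/en_US.blogs.txt",   "data/sample/en_US.blogs.input.txt")
sample_file("final/en_US/en_US.news.txt",    "data/sample/en_US.news.input.txt")
sample_file("final/en_US/en_US.twitter.txt", "data/sample/en_US.twitter.input.txt")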

Cleaning Data

To keep the model simple, the following transformations were executed on the corpus.

# Read the three sampled files (created in the sampling step above) and merge them.
alltext<-paste (c(
  readLines("data/sample/en_US.blogs.input.txt", encoding="latin1"),
  readLines("data/sample/en_US.news.input.txt", encoding="latin1"),
  readLines("data/sample/en_US.twitter.input.txt", encoding="latin1")
), collapse = " ")
alllines<-tolower(alltext)                                       # lower-case everything
alllines<-gsub("\\b#\\S+\\b", " ", alllines)                     # remove hashtags
alllines<-gsub("\\b?(f|ht)(tp)(s?)(://)(.*)\\b", " ", alllines)  # remove URLs
# Replace non-ASCII characters with a digit so the next substitution drops the whole word.
alllines<-iconv(alllines, "latin1", "ASCII", sub="1")
alllines<-gsub("\\b[[:alnum:]]*[[:digit:]]+[[:alnum:]]*\\b", " ", alllines)  # drop words containing digits
alllines<-gsub("[^'[:^punct:]]", " ", alllines, perl=TRUE)       # drop punctuation except apostrophes
# Expand common contractions.
alllines<-gsub("\\bwon't\\b", "will not", alllines)
alllines<-gsub("n't\\b", " not", alllines)
alllines<-gsub("'ve\\b", " have", alllines)
alllines<-gsub("'re\\b", " are", alllines)
alllines<-gsub("'s\\b", " is", alllines)
alllines<-gsub("'m\\b", " am", alllines)
alllines<-gsub("'d\\b", " would", alllines)
alllines<-gsub("'ll\\b", " will", alllines)
# Remove leftover apostrophes and any words still containing one.
alllines<-gsub("\\b'", " ", alllines)
alllines<-gsub("'\\b", " ", alllines)
alllines<-gsub("\\b\\S+'\\S+\\b", " ", alllines)
alllines<-gsub("\\b[b-hj-z]\\b", " ", alllines)  # drop single letters other than 'a' and 'i'
# Remove words containing profanity (word list shipped with the Shiny app).
swearWords<-readLines("data/for_shiny_app/modSwearWords.txt")
for (sword in swearWords) 
  alllines<-gsub( paste("\\b\\S*",sword, "\\S*\\b", sep = "") , " ", alllines)
alllines<-gsub("\\b\\s+\\b", " ", alllines)      # collapse whitespace between words

N-Grams

An n-gram is “a contiguous sequence of n items from a given … corpus” (Wikipedia). We will extract n-grams and save them to be used later by the Shiny application.
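
As a quick illustration of the concept (a toy example, not part of the pipeline), the bigrams and trigrams of a short sentence can be listed with base R; the ngrams helper below is defined just for this demonstration:

# List the n-grams of one cleaned sentence.
tokens <- unlist(strsplit("thanks for the follow and have a great day", " "))
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
ngrams(tokens, 2)  # "thanks for" "for the" "the follow" ...
ngrams(tokens, 3)  # "thanks for the" "for the follow" ...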

library("NLP")
library("tm")
library("ngramrr")
library("stringr")

cat(alllines,file="data/tmp/outfile.txt",sep="\n")  # write the cleaned text for the corpus reader
corpus<-VCorpus(DirSource("data/tmp"),readerControl = list(reader = readPlain))

Quad-grams are created so that the last three words of the user input can be used as the look-up context.

tdm_quadgram <- tdm2(corpus, ngmin = 4, ngmax = 4)
saveRDS(tdm_quadgram, file = "data/for_shiny_app/quadgram.rda")
tdm_trigram <- tdm2(corpus, ngmin = 3, ngmax = 3)
saveRDS(tdm_trigram, file = "data/for_shiny_app/trigram.rda")
tdm_bigram <- tdm2(corpus, ngmin = 2, ngmax = 2)
saveRDS(tdm_bigram, file = "data/for_shiny_app/bigram.rda")
tdm_unigram <- TermDocumentMatrix(corpus, control=list(wordLengths=c(1,Inf)))
saveRDS(tdm_unigram, file = "data/for_shiny_app/unigram.rda")

We will order the n-grams by frequency, in decreasing order, and calculate each n-gram’s percentage of occurrence among all n-grams of the same level.

We are going to demonstrate some properties of the n-grams by plotting the 50 most frequent ones:

library(ggplot2)

tdm_unigram<-readRDS("data/for_shiny_app/unigram.rda")
tdm_bigram<-readRDS("data/for_shiny_app/bigram.rda")
tdm_trigram<-readRDS("data/for_shiny_app/trigram.rda")
tdm_quadgram<-readRDS("data/for_shiny_app/quadgram.rda")

# Plot the topNumber most frequent n-grams as percentages of all n-grams of that level.
# With the single-document corpus used here, tdm_gram$v[k] is treated as the
# frequency of the term tdm_gram$dimnames$Terms[k].
draw_gram<-function(tdm_gram,topNumber=50){
  topNg<-data.frame(
    tdm_gram$dimnames$Terms[order(-tdm_gram$v)][1:topNumber],
    tdm_gram$v[order(-tdm_gram$v)][1:topNumber]/sum(tdm_gram$v))
  colnames(topNg)<-c("Word","Prc")
  
  plotNg <- ggplot(topNg, aes(x = reorder(Word, Prc), y = Prc)) + 
    geom_bar(stat = "identity", fill = "red") +  
    xlab(paste(
      # infer n from the number of words in the first term
      toString(length(unlist(strsplit(tdm_gram$dimnames$Terms[1], split=" ")))),
      "-Gram List", sep = "")) + ylab("Percentage") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  plotNg
}

draw_gram(tdm_unigram)

draw_gram(tdm_bigram)

draw_gram(tdm_trigram)

draw_gram(tdm_quadgram)

Findings about Data

During the process of generating this milestone report, we found the following highlights. They are beyond the scope of this project, but should be considered while designing a professional product.

  1. By inspecting the average number of words per line and the total number of words, you can see that blog, news, and Twitter entries have different characteristics (see the word-count sketch after this list). While this difference will be ignored for this project, SwiftKey might consider taking advantage of it by creating application-aware algorithms.

  2. Application-aware algorithms might also suggest structured data, such as popular hashtags for Twitter.

  3. Even though it is beyond the scope of the project, once the prototype is delivered it would be worthwhile to implement the algorithm in other languages (especially compiled rather than interpreted ones) to speed up the whole process and handle a bigger corpus.

  4. A mid-range laptop is not very suitable for crunching all the data. When done professionally, it would be better to use more powerful machines, or even to rent a cluster from a cloud service provider; in that case, parallelization is required.

  5. Focusing on the final product, a proof of concept for a mobile app, and looking at it from a computational perspective, a balance is required between computational power, disk storage, and speed.

  6. Some n-grams are seasonal, such as ‘happy mothers day’ and ‘merry christmas’. Others are geo-spatial, such as ‘new york city’ and ‘los angeles california’. So an app running on a GPS-capable device might be programmed to be location aware, especially for tweeting.

  7. To find grammatically correct suggestions from stems, WordNet from Princeton University can be used. Please refer to: Princeton University. “About WordNet.” WordNet. Princeton University. 2010. <http://wordnet.princeton.edu>
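
To support the first point, the average number of words per line can be computed per source; a minimal sketch over the full files (the helper name is chosen for illustration):

# Average words per line for each source.
avg_words_per_line <- function(path) {
  lines <- readLines(path, encoding = "latin1", skipNul = TRUE)
  mean(lengths(strsplit(lines, "\\s+")))
}
avg_words_per_line("final/en_US/en_US.blogs.txt")
avg_words_per_line("final/en_US/en_US.news.txt")
avg_words_per_line("final/en_US/en_US.twitter.txt")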

Design Decisions

For every sentence, I will use the last three words (or, if fewer words are available, the last two or the last one) as the history to guess the next word to be typed.

The following design aspects will be evaluated in the coming weeks. The decision whether or not to employ them will be discussed in a presentation created with RStudio and published on RPubs. The Shiny application will be distributed through shinyapps.io by RStudio.

  1. I will look for opportunities to suggest not only the next word, but also completions of the word currently being typed. The n-gram suggestions might be altered accordingly.

  2. With more computational power available, I would like to use WordNet from Princeton University to remove words or correct them to their proper English form; for example, “aaaalright” would become “allright” or be removed.

Annex

Katz Back-off Implementation details

Katz back-off uses consecutive levels of n-grams to estimate the conditional probability of a word, conditioning on as many preceding words as possible. More details can be found on Wikipedia.

Collect the input, then clean and standardize the phrase with exactly the same transformations that were applied to the corpus above. The snippets below form the body of the app’s prediction function (its closing brace appears at the end of this annex). Find phraseLen by:

  if (phrase == " ") return(NULL);
  phrase <- unlist(strsplit(phrase, split=" "));
  phrase<-phrase[(phrase!="")];
  phraseLen <- length(phrase);
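
The cleaning mentioned above can be factored into a helper that applies the same substitutions used on the corpus. A partial sketch (the helper name is chosen here for illustration, and only a few of the transformations listed earlier are shown):

  # Hypothetical helper: reuse (a subset of) the corpus-cleaning substitutions.
  clean_phrase <- function(phrase) {
    phrase <- tolower(phrase)
    phrase <- gsub("\\b#\\S+\\b", " ", phrase)                  # drop hashtags
    phrase <- gsub("[^'[:^punct:]]", " ", phrase, perl = TRUE)  # keep only apostrophes as punctuation
    phrase <- gsub("n't\\b", " not", phrase)                    # expand one contraction as an example
    gsub("\\s+", " ", phrase)                                   # collapse whitespace
  }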
  1. The prediction algorithm starts from the highest-level n-gram, using the last n-1 words of the input as the historical condition to find the most frequent n-grams that start with that n-1 word segment.
  2. In our implementation, the highest number of words used as the historical condition is N-1=3 (so N=4), and the number of candidates returned is C=3.
  return_words<-c()
  if (length(return_words)<3 && phraseLen>=3){
    gram4<- paste(phrase[(phraseLen-2):phraseLen], collapse=" ");
    gram4<-paste("\\b",gram4,"\\S*\\b",sep="");
    filter_quad<-regexpr(gram4, tdm_quadgram$dimnames$Terms)==1
    sum_quad<-sum(filter_quad);
    if (sum_quad>0) {
      selected_term<-tdm_quadgram$dimnames$Terms[filter_quad];
      selected_freq<-tdm_quadgram$v[filter_quad]
      quad_words<-word(selected_term[order (- selected_freq)][1:sum_quad],-1)
      return_words<-c(return_words,quad_words)
      return_words<-unique(return_words)
    }
  }
  3. If there are not enough next-word candidates, n is decreased by 1 and step 1 is repeated.
  if (length(return_words)<3 && phraseLen>=2){
    gram3<- paste(phrase[(phraseLen-1):phraseLen], collapse=" ");
    gram3<-paste("\\b",gram3,"\\S*\\b",sep="");
    filter_tri<-regexpr(gram3, tdm_trigram$dimnames$Terms)==1
    sum_tri<-sum(filter_tri);
    if (sum_tri>0) {
      selected_term<-tdm_trigram$dimnames$Terms[filter_tri];
      selected_freq<-tdm_trigram$v[filter_tri]
      tri_words<-word(selected_term[order (- selected_freq)][1:sum_tri],-1)
      return_words<-c(return_words,tri_words)
      return_words<-unique(return_words)
    }
  }
  
  if (length(return_words)<3 && phraseLen>=1){
    gram2<-paste("\\b",phrase[phraseLen],"\\S*\\b",sep="");
    filter_bi<-regexpr(gram2, tdm_bigram$dimnames$Terms)==1
    sum_bi<-sum(filter_bi);
    if (sum_bi>0) {
      selected_term<-tdm_bigram$dimnames$Terms[filter_bi];
      selected_freq<-tdm_bigram$v[filter_bi]
      bi_words<-word(selected_term[order (- selected_freq)][1:sum_bi],-1)
      return_words<-c(return_words,bi_words)
      return_words<-unique(return_words)
    }
  }
  4. When n=1, no historical condition is required; in that case, the most frequent words in the corpus are selected as candidates.
  if (length(return_words)<3 && phraseLen>=0){
    unilength<-min(length(tdm_unigram$dimnames$Terms),3)
    default_words<-tdm_unigram$dimnames$Terms[order (- tdm_unigram$v)][1:unilength]
    return_words<-c(return_words,default_words)
    return_words<-unique(return_words)
  }
  
  if (length(return_words)>3) return_words<-return_words[1:3]
  return(return_words)
}