Data Science Capstone Project

Aixa Rodriguez Salan
14/November/2017

Swiftkey Next Word Prediction

Overview

The goal of this project is to create an application that predicts the most probable next word to be typed, using a Natural Language Processing (NLP) model built from a large corpus of raw text sourced from blogs, news and Twitter. The project was carried out in 4 stages:

  • Clean and Explore Data
  • N-Gram Modeling
  • Prediction Algorithm
  • Shiny Application

Clean and Explore Data

The corpus for the NLP model consists of 3 text files collected by a web crawler from publicly available sources (blogs, news and Twitter), approximately 250 MB each and 102.4 million words in total. Processing the full corpus requires more resources than my computer can provide, and the data is also too large to be read on the shinyapps.io server. For these reasons a 1% sample was drawn, cleaned using the “gsub” function, and used as input to the next stage of the project to build the N-gram models.

con <- file("final/en_US/en_US.blogs.txt", open="rb")
blog <- readLines(con, encoding="UTF-8")
close(con)
rm(con)

c(sample(blog,length(blog)*0.01),sample(twitter,length(twitter)*0.01),sample(news,length(news)*0.01))
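The cleaning step itself is not shown above. As a rough sketch of what a “gsub”-based cleaner might look like, the helper below lowercases the text, drops URLs, and keeps only letters, apostrophes and spaces. The function name `clean_text` and the exact patterns are assumptions made for illustration, not the ones used in the project.

```r
# Hypothetical helper: the actual patterns used in the project are not shown.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- gsub("[^a-z' ]", " ", x)   # keep only letters, apostrophes and spaces
  x <- gsub("\\s+", " ", x)       # collapse runs of whitespace
  trimws(x)
}

clean_text("Check THIS out: http://example.com  it's great!!! #wow")
# "check this out it's great wow"
```

Applying the cleaner only to the 1% sample, rather than to the full files, keeps the memory footprint small.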

N-Gram Modeling

The “quanteda” library was used to tokenize the dataset. The frequency of each phrase was then counted, and “sapply” was used to split each phrase into its words, giving one data frame per n-gram order.

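# Build a frequency table of n-grams of order ng from the character vector dt_set.
# Requires quanteda (tokenize) and dplyr (arrange). tokenize(..., ngrams =) is the
# quanteda API of the time; later releases replaced it with tokens() and tokens_ngrams().
ngram_set <- function(dt_set, ng = 1) {
  df_ngram <- tokenize(dt_set, ngrams = ng)
  df_ngram <- unlist(df_ngram)
  df_ngram <- table(df_ngram)                 # count each distinct n-gram
  df_ngram <- as.data.frame(df_ngram, stringsAsFactors = FALSE)
  colnames(df_ngram) <- c("term", "freq")
  df_ngram <- arrange(df_ngram, -freq)        # most frequent first
  if (ng > 1) {
    # quanteda joins the words with "_"; split them into columns X1..Xn
    df_ngram$term <- gsub("_", " ", df_ngram$term)
    df_ngram <- cbind(df_ngram, data.frame(t(sapply(df_ngram$term, function(x) strsplit(x, " ")[[1]]))))
    rownames(df_ngram) <- NULL
  } else {
    df_ngram$X1 <- df_ngram$term
  }
  return(df_ngram)
}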
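To make the shape of the resulting data frames concrete, here is a small base-R sketch that produces the same layout (term, freq, X1..Xn) without quanteda. It is only an illustration: `count_ngrams` is a hypothetical name, and it assumes the text is already cleaned and that words are separated by single spaces.

```r
# Hypothetical base-R stand-in for ngram_set(); splits on spaces only.
count_ngrams <- function(lines, n = 2) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))   # line too short for an n-gram
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
  }))
  stopifnot(length(grams) > 0)                # the sketch needs at least one n-gram
  counts <- as.data.frame(table(term = grams), responseName = "freq",
                          stringsAsFactors = FALSE)
  counts <- counts[order(-counts$freq, counts$term), ]   # most frequent first
  rownames(counts) <- NULL
  # One column per word (X1..Xn), mirroring the data frames described above
  words <- do.call(rbind, strsplit(counts$term, " ", fixed = TRUE))
  colnames(words) <- paste0("X", seq_len(n))
  cbind(counts, as.data.frame(words, stringsAsFactors = FALSE))
}

count_ngrams(c("the cat sat", "the cat ran"), n = 2)
```

In this toy input “the cat” appears twice, so it comes first with freq 2, with X1 = “the” and X2 = “cat”.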

1-, 3- and 5-Gram Models

[Figures: 1-grams, 3-grams, 5-grams]

Prediction Algorithm

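For each of the n-gram tables (1 to 5), the prediction algorithm keeps the rows whose leading words match the typed words exactly and in the same order. The last X column is left out of the filter, since it holds the candidate word. For the rows obtained, it then calculates the percentage of the matched frequency that each candidate represents.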

# Excerpt from the prediction function, 1-gram case: candidates are words that
# start with the typed prefix txt[1], excluding the prefix itself and English stopwords
st_word <- filter(df_1gram,
                  grepl(txt[1], X1) &
                    txt[1] != X1 &
                    txt[1] == substr(X1, 1, stri_length(txt[1])) &
                    !(X1 %in% stopwords("en")))
if (nrow(st_word) != 0) {
  sm_freq <- sum(st_word$freq)
  st_word$prctn <- round(st_word$freq * 100 / sm_freq)   # share of matched frequency, in %
  # keep at most sz candidates, most frequent first
  st_word <- head(st_word[order(-st_word$freq), c("X1", "prctn")], sz)
  names(st_word) <- c("term", "prctn")
  return(st_word)
}
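The excerpt above covers the 1-gram case. For the higher-order tables, the exact-match filtering described earlier could look roughly like the sketch below. The helper name `next_word` and the toy 3-gram table are assumptions for illustration. The only thing taken from the report is the layout of the tables (columns X1..Xn plus freq).

```r
# Hypothetical helper; df must have columns X1..Xn and freq.
next_word <- function(df, typed, sz = 3) {
  n <- length(typed) + 1                       # order of the n-gram table
  keep <- rep(TRUE, nrow(df))
  for (i in seq_along(typed)) {
    # exact match on each leading word, in the same order
    keep <- keep & df[[paste0("X", i)]] == typed[i]
  }
  hits <- df[keep, , drop = FALSE]
  if (nrow(hits) == 0) return(NULL)            # nothing found at this order
  hits$prctn <- round(hits$freq * 100 / sum(hits$freq))
  hits <- hits[order(-hits$freq), ]
  # the last X column is the candidate next word
  head(data.frame(term = as.character(hits[[paste0("X", n)]]),
                  prctn = hits$prctn, stringsAsFactors = FALSE), sz)
}

# Toy 3-gram table with made-up counts
df_3gram <- data.frame(X1 = c("i", "i", "i", "you"),
                       X2 = c("love", "love", "love", "love"),
                       X3 = c("you", "it", "him", "me"),
                       freq = c(6, 3, 1, 5), stringsAsFactors = FALSE)
next_word(df_3gram, c("i", "love"))
```

With these toy counts the candidates after “i love” are “you”, “it” and “him”, at 60%, 30% and 10% of the matched frequency.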

The 5 results are then merged, a weighted score is calculated according to the n-gram order in which each term was found, and the top 3 predictions are selected.

n_merge$scr<-n_merge$N5*0.4^0+n_merge$N4*0.4^1+n_merge$N3*0.4^2+n_merge$N2*0.4^3+n_merge$N1*0.4^4
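As a worked example of the weighting, the sketch below applies the same formula to a toy merged table. It assumes that the N1..N5 columns hold the percentage each term obtained in the corresponding n-gram table, with 0 where the term was not found. That reading, and all the numbers, are made up for illustration.

```r
# Toy merged table; the values are invented for illustration.
n_merge <- data.frame(
  term = c("you", "it", "the", "him"),
  N5 = c(50, 0, 0, 0),
  N4 = c(40, 20, 0, 0),
  N3 = c(30, 30, 0, 10),
  N2 = c(20, 25, 10, 5),
  N1 = c(5, 8, 40, 1),
  stringsAsFactors = FALSE
)

# Weights 0.4^0 .. 0.4^4 for N5 .. N1: longer matches count more
weights <- 0.4^(0:4)
n_merge$scr <- as.vector(as.matrix(n_merge[, c("N5", "N4", "N3", "N2", "N1")]) %*% weights)

# Top 3 predictions by score
top3 <- head(n_merge[order(-n_merge$scr), c("term", "scr")], 3)
top3
```

Here “you” scores 50 + 16 + 4.8 + 1.28 + 0.128 = 72.208, so a term found in the 5-gram table easily outranks “the”, which is frequent only as a 1-gram and scores 1.664.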

Shiny Application

The app is hosted on shinyapps.io at https://aixarodriguez.shinyapps.io/shynapp/.

The frontend is minimalistic, with only a textbox: just type in it and the app does the rest. The first load takes approximately 30 seconds, while the n-gram tables and the prediction functions are set up.
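For reference, a single-textbox Shiny app can be sketched in a few lines. This is not the deployed app: `predict_stub` is a placeholder that returns a fixed table instead of calling the real prediction functions, and the names `make_app`, `txt` and `pred` are assumptions.

```r
# Placeholder predictor: returns a fixed table, not real predictions.
predict_stub <- function(text) {
  if (!nzchar(trimws(text))) return(NULL)    # nothing typed yet
  data.frame(term = c("the", "to", "and"), prctn = c(50, 30, 20))
}

# Build the app object; calling this requires the shiny package to be installed.
make_app <- function() {
  ui <- shiny::fluidPage(
    shiny::textInput("txt", "Type a phrase"),
    shiny::tableOutput("pred")
  )
  server <- function(input, output, session) {
    # Re-runs automatically whenever the textbox content changes
    output$pred <- shiny::renderTable(predict_stub(input$txt))
  }
  shiny::shinyApp(ui = ui, server = server)
}

# To launch locally: shiny::runApp(make_app())
```

Loading the n-gram tables once at startup, outside the server function, is one way to pay the loading cost only on the first load.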

[Figure: input]

[Figure: predict]