Overview

Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken or written. Developing NLP applications is challenging because human language is not always precise: it is often ambiguous, and its linguistic structure can depend on many complex variables, including regional dialects and social context.

Our approach to NLP will be to develop a word-prediction Shiny app, using data from a corpus called HC Corpora.

First I will try to understand what the real data look like, identifying appropriate tokens such as words or punctuation and removing profanity and other words that we do not want to predict. I will also build figures and tables to understand variation in the word frequencies in the data.

Using the exploratory analysis, I will build an n-gram model for predicting the next word based on the previous 1, 2, or 3 words, along with models to handle cases where a particular n-gram has not been observed.

Then, I will build a predictive model based on the previous data-modeling steps. The goal for this prediction model is to minimize both its size and its runtime, evaluating it for efficiency and accuracy.

Preparing the Environment

First, I will prepare the working environment. When I load a package, its set of required functions becomes available.

gc()
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 297677 15.9     407500 21.8   350000 18.7
## Vcells 521344  4.0     905753  7.0   786411  6.0
library(ggplot2)
library(tm)
library(SnowballC)
library(RWeka)
library(Matrix)
library(gridExtra)
library(knitr)
opts_chunk$set(echo = TRUE, results = 'hold', message = FALSE)

Loading the data

Now I will download the data. I am going to work with data from HC Corpora, saving the zip file into a new npl directory. Then, I will unzip the file into our working directory.

if (!file.exists("./npl")) {
        dir.create("./npl")
        url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
        download.file(url, destfile = "./npl/Coursera-SwiftKey.zip", method = "curl")
}
unzip("./npl/Coursera-SwiftKey.zip")

Preprocessing and Cleaning the Data

As we can see, unzipping has created a directory containing files for four different languages. I will work with those in English. Then I list the files in this directory, try to get familiar with the data sets, and do the necessary cleaning.

Size of Files in MB

First, I will calculate the size of the three English files.

en_files <- list.files("final/en_US")
blog_size <- file.info("./final/en_US/en_US.blogs.txt")$size/1024/1024
news_size <- file.info("./final/en_US/en_US.news.txt")$size/1024/1024
twitt_size <- file.info("./final/en_US/en_US.twitter.txt")$size/1024/1024

SIZE <- c(blog_size, news_size, twitt_size)

Number of Lines

Now, I find out the number of lines in each of the three files.

blog <- file("./final/en_US/en_US.blogs.txt")
rblog <- readLines(blog)
close(blog)
lrblog <- length(rblog)

news <- file("./final/en_US/en_US.news.txt")
rnews <- readLines(news)
close(news)
lrnews <- length(rnews)

twitt <- file("./final/en_US/en_US.twitter.txt")
rtwitt <- readLines(twitt)
close(twitt)
lrtwitt <- length(rtwitt) 

LINES <- as.numeric(c(lrblog, lrnews, lrtwitt))

Number of Words

And the number of words in each file.

nwblog <- sum(sapply(gregexpr("\\s+", rblog), length) + 1)
nwnews <- sum(sapply(gregexpr("\\s+", rnews), length) + 1)
nwtwitt <- sum(sapply(gregexpr("\\s+", rtwitt), length) + 1)

WORDS <- c(nwblog, nwnews, nwtwitt)

Longest Line

And finally, I calculate the longest line in each file.

# Length of the longest line in each file, in characters
words_b <- nchar(rblog)
word_max_b <- max(words_b)

words_n <- nchar(rnews)
word_max_n <- max(words_n)

words_t <- nchar(rtwitt)
word_max_t <- max(words_t)

MAX_LINE <- c(word_max_b, word_max_n, word_max_t)
DT <- data.frame(round(SIZE, 2), LINES, WORDS, MAX_LINE)
row.names(DT) <- c("BLOG", "NEWS", "TWITTER")
colnames(DT) <- c("SIZE OF FILE (MB)", "NUMBER OF LINES", "NUMBER OF WORDS", "LONGEST LINE (CHARS)")

Result Table

I show all the results in the following table.

grid.draw(tableGrob(DT, cols = colnames(DT), show.box = FALSE, name = "test",
                    separator = "blue", padding.v = unit(20, "mm"),
                    padding.h = unit(4, "mm"),
                    gpar.coretext = gpar(col = "orange", cex = 1),
                    gpar.corefill = gpar(fill = "white", col = "green")))
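
Since knitr is already loaded, a simpler alternative sketch for rendering the same summary is knitr::kable, which prints DT as a plain markdown-style table:

# Alternative rendering of the summary table
kable(DT)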

Plot Results

These are the plots that illustrate features of the data.

Sample the files

This data set is fairly large, and I do not need to load the entire data set to build the algorithms. I set a seed and then use the sample function to randomly select approximately 10% of the lines from the blog file, 8% from the news file and 4% from the Twitter file.

set.seed(4321)

blogsamp <- rblog[sample(lrblog, round(lrblog*0.10))]
write.csv(blogsamp, file = "./npl/blog.csv", row.names = FALSE)

newssamp <- rnews[sample(lrnews, round(lrnews*0.08))]
write.csv(newssamp, file = "./npl/news.csv", row.names = FALSE)

twittsamp <- rtwitt[sample(lrtwitt, round(lrtwitt*0.04))]
write.csv(twittsamp, file = "./npl/twitt.csv", row.names = FALSE)

Preprocessing Corpus

Now I am going to construct a Corpus object, a collection of text documents that can be treated as a database of texts. I create the Corpus from the blog, news and Twitter sample documents.

corp <- Corpus(DirSource("npl", pattern = "\\.csv$"), readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
summary(corp)
##           Length Class             Mode
## blog.csv  2      PlainTextDocument list
## news.csv  2      PlainTextDocument list
## twitt.csv 2      PlainTextDocument list

Cleaning Corpus

I am going to apply methods for cleaning up and structuring the input text for further analysis.

### Remove "/", "@" and "|"
fspace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
cpr <- tm_map(corp, fspace, "/|@|\\|")

### Eliminate extra whitespace
cpr <- tm_map(cpr, stripWhitespace)

### Convert to lower case
cpr <- tm_map(cpr, content_transformer(tolower))

### Remove stopwords
cpr <- tm_map(cpr, removeWords, stopwords("english"))

### Remove punctuation
cpr <- tm_map(cpr, removePunctuation)

### Remove numbers
cpr <- tm_map(cpr, removeNumbers)

### Stemming
cprst <- tm_map(cpr, stemDocument, language = "english")

### Convert to PlainTextDocument
cprstp <- tm_map(cprst, PlainTextDocument)

Finally, I want to construct a Term-Document Matrix, the most common way of representing texts for further computation. This approach results in a matrix with document IDs and terms, whose elements are term frequencies.

tdmst <- TermDocumentMatrix(cprstp)
tdf <- as.matrix(tdmst)
colnames(tdf) <- c("blog", "news", "twitt")
head(tdf)
##         Docs
## Terms    blog news twitt
##   aaa       0    3     0
##   aaaah     0   46     0
##   aaaand    6    0     0
##   aacc     41    0     0
##   aaja      5    0     0
##   aajae     5    0     0

I will show the twenty most frequent words in each file with a plot. I also use a plot to show the top overall word frequencies in the Term-Document Matrix.

Plot1 - Blog File

tdf_b <- sort(tdf[ ,1], decreasing=TRUE)
head(tdf_b, 10)
bxlabs <- names(tdf_b[1:20])
##   one  will   can  just  like   get  time   new   now   day 
## 11597 10539  9721  9130  9065  8235  7430  6914  6467  6130
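
The actual figure is built from these top-20 vectors. A minimal sketch of the blog plot with ggplot2 (already loaded); the same pattern, using nxlabs, txlabs, txlabst and txlabsBi, produces the news, Twitter, total and bi-gram plots, so I do not repeat it below. The exact geometry and colours of the original figures may differ.

# Bar plot of the 20 most frequent blog terms (sketch only)
blog_top <- data.frame(term = factor(bxlabs, levels = bxlabs),
                       freq = as.numeric(tdf_b[1:20]))
ggplot(blog_top, aes(x = term, y = freq)) +
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        labs(title = "Top 20 Words - Blog", x = "Word", y = "Frequency")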

Plot2 - News File

tdf_n <- sort(tdf[ ,2], decreasing=TRUE)
head(tdf_n, 10)
nxlabs <- names(tdf_n[1:20])
##  said  will   one   new  also  year  just  last   two state 
## 20115  9683  5854  5754  5208  4695  4579  4372  4323  4232

Plot3 - Twitter File

tdf_t <- sort(tdf[ ,3], decreasing=TRUE)
head(tdf_t, 10)
txlabs <- names(tdf_t[1:20])
##   just   like    get   will thanks   love    now  today    day   good 
##   5905   4914   4384   4173   3952   3844   3715   3677   3631   3605

Plot4 - Total

tdf_tot <- sort(rowSums(tdf), decreasing=TRUE)
txlabst <- names(tdf_tot[1:20])

I can use an n-gram tokenizer from RWeka to tokenize the text into phrases instead of single words. I will try 1-, 2- and 3-gram tokenizers.

options( java.parameters = "-Xmx6g" ) 
UniG <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
BiG <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
TriG <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
options(mc.cores=1)
tdmUni <- TermDocumentMatrix(cprstp, control = list(tokenize = UniG))
gc()
##             used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   9931135 530.4   18683826  997.9  18683826  997.9
## Vcells 101720182 776.1  206274084 1573.8 195143137 1488.9
options(mc.cores=1)
tdmBi  <- TermDocumentMatrix(cprstp, control = list(tokenize = BiG))
gc()
##             used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  10062917 537.5   18683826  997.9  18683826  997.9
## Vcells 102440368 781.6  206274084 1573.8 195143137 1488.9
options(mc.cores=1)
tdmTri <- TermDocumentMatrix(cprstp, control = list(tokenize = TriG))

Preparing the bi-gram frequencies for plotting.

tdmBim  <- as.matrix(tdmBi)
tdmBitot <- sort(rowSums(tdmBim), decreasing=TRUE)
txlabsBi <- names(tdmBitot[1:20])

The twenty most frequent 3-gram tokenizer elements.

tdmTrim <- as.matrix(tdmTri)
tdmTritot <- sort(rowSums(tdmTrim), decreasing=TRUE)
tdmdttot <- data.frame(tdmTritot)
head(tdmdttot, 20)
##                         tdmTritot
## boy big sword                 468
## little boy big                468
## new york city                 426
## pu bef th                     258
## love toast mom                224
## m sure can                    200
## ec love ed                    196
## creative kuts scrapping       192
## kuts scrapping bug            192
## scrapping bug designs         192
## buy time fell                 172
## canet buy time                172
## king johns castle             172
## risk accessor ur              172
## spot thedifference th         172
## th clan royal                 172
## th king johns                 172
## th spot thedifference         172
## time fell th                  172
## gaston south carolina         170

Interesting Things

I have had some issues with computer memory. That forced me to reduce the sample sizes to 10% of the blog file, 8% of the news file and 4% of the Twitter file.

In order to obtain the n-gram Term-Document Matrices, I have had to call the gc() function around each calculation to avoid the frequent out-of-memory errors that I was getting.

In this first analysis, I have not removed profanity, but I consider that I must do so in order to avoid offensive content when the predictive model is built.
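
A minimal sketch of how that filtering could be added to the cleaning pipeline, assuming a profanity list is available as a plain-text file with one word per line (the file name here is hypothetical):

# Hypothetical profanity list; any public word list could be used instead
badwords <- readLines("./npl/profanity.txt", warn = FALSE)
cpr <- tm_map(cpr, removeWords, badwords)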

Also, I have realized that there are too many Twitter acronyms and abbreviations, so I think this file needs a second cleaning operation.
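
One possible sketch of that second pass, expanding a few common abbreviations with a content_transformer before the rest of the cleaning (the abbreviation map below is only illustrative):

# Illustrative abbreviation map; a real pass would need a much larger list
abbrev <- c("\\bu\\b" = "you", "\\br\\b" = "are", "\\bidk\\b" = "i do not know")
expand_abbrev <- content_transformer(function(x) {
        for (p in names(abbrev)) x <- gsub(p, abbrev[[p]], x)
        x
})
cpr <- tm_map(cpr, expand_abbrev)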

I am thinking about reducing the number of sparse terms. I also plan to try removing my own list of stop words and to develop a strategy for handling misspelled words. All of this would allow me to increase the size of the sample files, so that I can have bigger samples and build more n-gram models.
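
Sparse terms could be dropped directly from the term-document matrices with removeSparseTerms from the tm package; a minimal sketch (the threshold is only illustrative):

# With three documents, a sparse threshold of 0.4 keeps terms that
# appear in at least two of them
tdmst_small <- removeSparseTerms(tdmst, sparse = 0.4)
dim(tdmst_small)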

Next Steps

So far, the operations that I have performed allow me to identify the frequencies of the most common words in the files and to fit a simple model that predicts the single most likely word. However, in order to get more reliable results, it will be necessary to use a Markov chain model. I will also try backoff or smoothing methods to estimate the probability of unobserved n-grams when building the prediction algorithm, and I will try to build new 4- and 5-gram models. As I mentioned previously, the goal for this prediction model is to minimize both its size and its runtime, evaluating the model for efficiency and accuracy.
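
As a first step in that direction, here is a minimal sketch of a backoff-style lookup over the frequency tables already built (tdmTritot, tdmBitot and tdf_tot); the function name and the plain frequency ranking are my own assumptions, not a finished prediction model:

# Rough backoff lookup: try the tri-gram table first, then the bi-gram table,
# then fall back to the most frequent single word (sketch only)
predict_next <- function(w1, w2) {
        tri <- tdmTritot[grep(paste0("^", w1, " ", w2, " "), names(tdmTritot))]
        if (length(tri) > 0) return(sub(".* ", "", names(tri)[1]))
        bi <- tdmBitot[grep(paste0("^", w2, " "), names(tdmBitot))]
        if (length(bi) > 0) return(sub(".* ", "", names(bi)[1]))
        names(tdf_tot)[1]
}
# e.g. predict_next("new", "york") would be expected to return "city"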