Natural language processing (NLP) is the ability of a computer program to understand human speech as it is spoken. Developing NLP applications is challenging because human speech is not always precise: it is often ambiguous, and its linguistic structure can depend on many complex variables, including regional dialects and social context.
Our approach to NLP will be to develop a word-prediction Shiny app, using data from a corpus called HC Corpora.
First, I will explore what the real data look like, identifying appropriate tokens such as words or punctuation and removing profanity and other words that we do not want to predict. I will also build figures and tables to understand the variation in word frequencies in the data.
Using the exploratory analysis, I will build an n-gram model for predicting the next word based on the previous one, two, or three words, together with methods for handling n-grams that were not observed in the training data (a toy illustration of the counting idea follows below).
Then, I will build a predictive model based on the previous data modeling steps. The goal for this prediction model is to minimize both its size and its runtime, evaluating the model for efficiency and accuracy.
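As a toy illustration of the counting idea behind an n-gram model (the numbers below are made up for illustration and are not taken from the corpus), the maximum-likelihood estimate of the probability of a word given the two previous words is the trigram count divided by the corresponding bigram count:
# Hypothetical counts, for illustration only
tri_counts <- c("i love you" = 50, "i love it" = 30)
bi_counts  <- c("i love" = 100)
# MLE estimate: P(you | i love) = count("i love you") / count("i love")
unname(tri_counts["i love you"] / bi_counts["i love"])
# gives 0.5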
First, I prepare the working environment. When a package is loaded, its set of functions becomes available for use.
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 297677 15.9 407500 21.8 350000 18.7
## Vcells 521344 4.0 905753 7.0 786411 6.0
library(ggplot2)
library(tm)
library(SnowballC)
library(RWeka)
library(Matrix)
library(gridExtra)
library(knitr)
opts_chunk$set(echo = TRUE, results = 'hold', message = FALSE)
Now I will download the data. I will work with data from HC Corpora, saving the data set in a new npl directory. Then, I will unzip the file into the working directory.
if (!file.exists("./npl")){
dir.create("./npl")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile="./npl/Coursera-SwiftKey.zip", method="curl")
}
unzip("./npl/Coursera-SwiftKey.zip")
As we can see, unzipping creates a directory containing files for four different languages; I will work with those in English. Next, I list the files in this directory, get familiar with the databases, and do the necessary cleaning.
First, I will calculate the size of the three English files.
en_files <- list.files("final/en_US")
blog_size <- file.info("./final/en_US/en_US.blogs.txt")$size / 1024^2
news_size <- file.info("./final/en_US/en_US.news.txt")$size / 1024^2
twitt_size <- file.info("./final/en_US/en_US.twitter.txt")$size / 1024^2
SIZE <- c(blog_size, news_size, twitt_size)
Now, I find the number of lines in each of the three files.
blog <- file("./final/en_US/en_US.blogs.txt")
rblog <- readLines(blog)
close(blog)
lrblog <- length(rblog)
news <- file("./final/en_US/en_US.news.txt")
rnews <- readLines(news)
close(news)
lrnews <- length(rnews)
twitt <- file("./final/en_US/en_US.twitter.txt")
rtwitt <- readLines(twitt)
close(twitt)
lrtwitt <- length(rtwitt)
LINES <- as.numeric(c(lrblog, lrnews, lrtwitt))
And the number of words in the files
nwblog <- sum(sapply(gregexpr("\\s+", rblog), length) + 1)
nwnews <- sum(sapply(gregexpr("\\s+", rnews), length) + 1)
nwtwitt <- sum(sapply(gregexpr("\\s+", rtwitt), length) + 1)
WORDS <- c(nwblog, nwnews, nwtwitt)
And finally, I calculate the longest line in each file.
# Length, in characters, of the longest line in each file
word_max_b <- max(nchar(rblog))
word_max_n <- max(nchar(rnews))
word_max_t <- max(nchar(rtwitt))
MAX_LINE <- c(word_max_b, word_max_n, word_max_t)
DT <- data.frame(round(SIZE, 2), LINES, WORDS, MAX_LINE)
row.names(DT) <- c("BLOG", "NEWS", "TWITTER")
colnames(DT) <- c("SIZE OF FILE (MB)", "NUMBER OF LINES", "NUMBER OF WORDS", "LONGEST LINE (CHARACTERS)")
I show all the results in the following table:
library(grid)
grid.draw(tableGrob(DT))
Plots illustrating features of the data are shown in the following sections.
This data set is fairly large, and I do not need to load the entire data set to build the algorithms. I set a seed and then use the rbinom function to randomly select about 10% of the rows from the blog file, 8% from the news file, and 4% from the Twitter file.
set.seed(4321)
# Keep each line with the stated probability (Bernoulli sampling via rbinom)
blogsamp <- rblog[as.logical(rbinom(lrblog, 1, 0.10))]
write.csv(blogsamp, file = "./npl/blog.csv", row.names = FALSE)
newssamp <- rnews[as.logical(rbinom(lrnews, 1, 0.08))]
write.csv(newssamp, file = "./npl/news.csv", row.names = FALSE)
twittsamp <- rtwitt[as.logical(rbinom(lrtwitt, 1, 0.04))]
write.csv(twittsamp, file = "./npl/twitt.csv", row.names = FALSE)
Now I am going to construct a Corpus object, a collection of text documents that can be treated as a database of texts. I create the corpus from the blog, news, and Twitter text documents.
# Read only the three sampled .csv files from the npl directory
corp <- Corpus(DirSource("npl", pattern = "\\.csv$"), readerControl = list(reader = readPlain, language = "en_US", load = TRUE))
summary(corp)
## Length Class Mode
## blog.csv 2 PlainTextDocument list
## news.csv 2 PlainTextDocument list
## twitt.csv 2 PlainTextDocument list
I’m going to apply methods for cleaning up and structuring the input text for further analysis.
### Remove "/", @ and |
fspace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
cpr <- tm_map(corp, fspace, "/|@|\\|")
### Eliminating Extra Whitespace
cpr <- tm_map(cpr, stripWhitespace)
### Convert to Lower Case
cpr <- tm_map(cpr, tolower)
### Remove Stopwords
cpr <- tm_map(cpr, removeWords, stopwords("english"))
### Remove Punctuation
cpr <- tm_map(cpr, removePunctuation)
### Remove Numbers
cpr <- tm_map(cpr, removeNumbers)
### Stemming
cprst <- tm_map(cpr, stemDocument, language = "english")
### Converting to PlainTextDocument
cprstp <- tm_map(cprst, PlainTextDocument)
Finally, I construct a Term-Document Matrix, the most common way of representing texts for further computation. This results in a matrix whose columns are document IDs and whose rows are terms; the matrix elements are term frequencies.
tdmst <- TermDocumentMatrix(cprstp)
tdf <- as.matrix(tdmst)
colnames(tdf) <- c("blog", "news", "twitt")
head(tdf)
## Docs
## Terms blog news twitt
## aaa 0 3 0
## aaaah 0 46 0
## aaaand 6 0 0
## aacc 41 0 0
## aaja 5 0 0
## aajae 5 0 0
I will show the twenty most frequent words in each file with a plot, and I also use a plot to show the overall top word frequencies in the Term-Document Matrix (a sketch of the plotting code is given after the frequency tables).
tdf_b <- sort(tdf[ ,1], decreasing=TRUE)
head(tdf_b, 10)
bxlabs <- names(tdf_b[1:20])
## one will can just like get time new now day
## 11597 10539 9721 9130 9065 8235 7430 6914 6467 6130
tdf_n <- sort(tdf[ ,2], decreasing=TRUE)
head(tdf_n, 10)
nxlabs <- names(tdf_n[1:20])
## said will one new also year just last two state
## 20115 9683 5854 5754 5208 4695 4579 4372 4323 4232
tdf_t <- sort(tdf[ ,3], decreasing=TRUE)
head(tdf_t, 10)
txlabs <- names(tdf_t[1:20])
## just like get will thanks love now today day good
## 5905 4914 4384 4173 3952 3844 3715 3677 3631 3605
#### Plot4 - Total
tdf_tot <- sort(rowSums(tdf), decreasing=TRUE)
txlabst <- names(tdf_tot[1:20])
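These frequency tables feed the bar plots shown in the report; as a minimal sketch of how one such plot can be drawn with ggplot2 (the styling here is my own and may differ from the original figures):
# Top 20 words in the blog sample (tdf_b was computed above)
freq_b <- data.frame(word = names(tdf_b)[1:20], freq = as.numeric(tdf_b)[1:20])
ggplot(freq_b, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  labs(x = "word", y = "frequency", title = "Top 20 words - blog sample") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# The same pattern applies to tdf_n, tdf_t and tdf_tot for the other plots.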
I can use an n-gram tokenizer from RWeka to tokenize the text into phrases instead of single words. I will use 1-, 2-, and 3-gram tokenizers.
# Note: java.parameters only takes effect if it is set before RWeka/rJava starts the JVM,
# i.e. before library(RWeka) above
options(java.parameters = "-Xmx6g")
UniG <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
BiG <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
TriG <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
options(mc.cores=1)
tdmUni <- TermDocumentMatrix(cprstp, control = list(tokenize = UniG))
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 9931135 530.4 18683826 997.9 18683826 997.9
## Vcells 101720182 776.1 206274084 1573.8 195143137 1488.9
options(mc.cores=1)
tdmBi <- TermDocumentMatrix(cprstp, control = list(tokenize = BiG))
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 10062917 537.5 18683826 997.9 18683826 997.9
## Vcells 102440368 781.6 206274084 1573.8 195143137 1488.9
options(mc.cores=1)
tdmTri <- TermDocumentMatrix(cprstp, control = list(tokenize = TriG))
Now I prepare the bigram frequencies for plotting, and then inspect the most frequent trigrams.
tdmBim <- as.matrix(tdmBi)
tdmBitot <- sort(rowSums(tdmBim), decreasing=TRUE)
txlabsBi <- names(tdmBitot[1:20])
tdmTrim <- as.matrix(tdmTri)
tdmTritot <- sort(rowSums(tdmTrim), decreasing=TRUE)
tdmdttot <- data.frame(tdmTritot)
head(tdmdttot, 20)
## tdmTritot
## boy big sword 468
## little boy big 468
## new york city 426
## pu bef th 258
## love toast mom 224
## m sure can 200
## ec love ed 196
## creative kuts scrapping 192
## kuts scrapping bug 192
## scrapping bug designs 192
## buy time fell 172
## canet buy time 172
## king johns castle 172
## risk accessor ur 172
## spot thedifference th 172
## th clan royal 172
## th king johns 172
## th spot thedifference 172
## time fell th 172
## gaston south carolina 170
I have had some issues with computer memory. This forced me to reduce the sample sizes to 10% for the blog file, 8% for news, and 4% for Twitter.
In order to obtain the n-gram TDMs I have had to call the gc() function after each calculation to avoid the frequent out-of-memory errors I was getting.
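Beyond calling gc(), another option that I have not used above is to remove the largest intermediate objects once they are no longer needed and to cache the n-gram matrices on disk (the .rds file names below are my own choice):
# Free the raw text vectors once the samples have been written to disk
rm(rblog, rnews, rtwitt)
gc()
# Cache the n-gram matrices so they do not have to be rebuilt on every run
saveRDS(tdmBi, "./npl/tdmBi.rds")
saveRDS(tdmTri, "./npl/tdmTri.rds")
# tdmBi <- readRDS("./npl/tdmBi.rds")   # reload when needed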
In this first analysis I have not removed profanity, but I consider that I must do so in order to avoid offensive content when the predictive model is built; a possible approach is sketched below.
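A minimal sketch using the tm tools already loaded, assuming a plain-text word list is available (the file ./npl/profanity.txt used here is a hypothetical placeholder, not part of this project):
# Hypothetical profanity list: one word per line
profanity <- readLines("./npl/profanity.txt")
# Remove these words from the cleaned corpus before building the term-document matrices
cpr <- tm_map(cpr, removeWords, profanity)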
I have also realized that there are many Twitter acronyms and abbreviations, so I think this file needs a second cleaning pass.
I am thinking about reducing the number of sparse terms (see the sketch below). I am also considering removing my own list of stop words and developing a strategy for handling misspelled words. All of this would allow me to work with larger samples and build more n-gram models.
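For the sparse-term reduction, the tm package provides removeSparseTerms; a minimal sketch on the unigram matrix built earlier (the 0.6 threshold is my own choice, not a tuned value):
# With only three documents, sparse = 0.6 keeps terms that appear in at least two of the three files
tdmst_small <- removeSparseTerms(tdmst, sparse = 0.6)
dim(tdmst)
dim(tdmst_small)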
So far, the operations I have performed allow me to identify the frequencies of the most common words in the files and to fit the simplest possible model, which predicts the single most likely word. However, in order to get more reliable results it will be necessary to use a Markov chain model, and I will try backoff or smoothing methods to estimate the probability of unobserved n-grams when building the prediction algorithm (a rough sketch of the backoff idea follows). I will also try to build 4- and 5-gram models. As I mentioned previously, the goal for this prediction model is to minimize both its size and its runtime, evaluating the model for efficiency and accuracy.
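As a very rough sketch of the backoff idea (not the final algorithm, with no smoothing, and assuming the input words have been cleaned and stemmed the same way as the corpus), a predictor can first look for a trigram starting with the last two words typed, then fall back to a bigram, and finally to the most frequent word overall. It reuses the sorted frequency vectors tdmTritot, tdmBitot and tdf_tot computed above:
# Naive backoff prediction: trigram -> bigram -> most frequent unigram
predict_next <- function(last_two, last_one) {
  # trigrams whose first two words match the last two words typed
  hits3 <- tdmTritot[grepl(paste0("^", last_two, " "), names(tdmTritot))]
  if (length(hits3) > 0) {
    return(tail(strsplit(names(hits3)[1], " ")[[1]], 1))  # last word of the top trigram
  }
  # back off to bigrams whose first word matches the last word typed
  hits2 <- tdmBitot[grepl(paste0("^", last_one, " "), names(tdmBitot))]
  if (length(hits2) > 0) {
    return(tail(strsplit(names(hits2)[1], " ")[[1]], 1))
  }
  # final fallback: the single most frequent word in the corpus
  names(tdf_tot)[1]
}
predict_next("new york", "york")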