Introduction

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.

This research develops an NLP method to predict what people will say next, based on the sentences in historical text data.

Load the data and packages

Load the data

In this section I load the data and the packages required for the research. The data is downloaded from Coursera. This research uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU, and fi_FI; the data used in my research is the en_US data. The data comes from a corpus called HC Corpora. See the readme file for details on the corpora available.

The en_US data consists of three files:

  1. en_US.news.txt
  2. en_US.blogs.txt
  3. en_US.twitter.txt
readLines("en_US.news.txt") -> en_US_news
## Warning: incomplete final line found on 'en_US.news.txt'
readLines("en_US.blogs.txt") -> en_US_blogs
readLines("en_US.twitter.txt") -> en_US_twitter
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
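
These warnings are mostly harmless here, but they can be avoided: in recent versions of R, readLines() accepts warn and skipNul arguments, the latter of which drops embedded nul characters. A minimal sketch of the quieter read:

# Re-read the files, suppressing the incomplete-final-line warning and
# skipping embedded nul characters
readLines("en_US.news.txt", warn = FALSE, skipNul = TRUE) -> en_US_news
readLines("en_US.blogs.txt", warn = FALSE, skipNul = TRUE) -> en_US_blogs
readLines("en_US.twitter.txt", warn = FALSE, skipNul = TRUE) -> en_US_twitter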

Load the packages

library(tm)
library(ggplot2)
library(wordcloud)

Summarise the data

library(plyr)
library(knitr)
## Create the list for summary
list_en<-list(en_US_news, en_US_blogs,en_US_twitter)
names(list_en)<- c("en_US_news", "en_US_blogs","en_US_twitter")

## Define the function for Word-counting
WordCounting<-function(x) length(unlist(strsplit(x,split = " ")))
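# e.g. WordCounting(c("to be or", "not to be")) returns 6 (illustrative check)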

## Output the summary: object size, line counts, word counts
ldply(list_en, c("object.size", "length", "WordCounting")) -> output
names(output) <- c("Object", "Size in bytes", "Line Counts", "Word Counts")
kable(output, format = "html")
Object          Size in bytes   Line Counts   Word Counts
en_US_news           20111392         77259       2643969
en_US_blogs         260564320        899288      37334131
en_US_twitter       316037344       2360148      30373543
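
The byte counts above are hard to read at a glance; base R can format object sizes in megabytes instead. A small sketch using the list created above:

# Report object sizes in megabytes rather than bytes
sapply(list_en, function(x) format(object.size(x), units = "Mb"))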

Reload with tm

The tm package also provides data structures for character data that make NLP work easier; I will use them via the Corpus function. To keep the corpus manageable, only a 1% sample of the lines from each file is loaded into it.

# Sample 1% of the lines from a file and collapse them into a single
# string; collapsing with a space keeps words at line boundaries from
# running together
trans <- function(x){
  n <- length(x)
  paste(x[sample(n, round(n/100))], collapse = " ")
}
a<-c(trans(en_US_news),trans(en_US_blogs),trans(en_US_twitter))
corpus <- Corpus(VectorSource(a))
corpus
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
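
To peek at what was stored, tm's inspect() can be applied to a corpus subset, for example:

# Show the metadata and content of the first document
inspect(corpus[1])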

Clean the data

Before modeling the data, we need to clean it with the tm package. The cleaning consists of the following steps:

# Convert to Lower case
corpus <- tm_map(corpus,content_transformer(tolower))
# Remove Punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove Numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove Stopwords from Corpus
corpus <- tm_map(corpus, removeWords, stopwords("english"))
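
One further step that is often added to this pipeline (not applied here) is collapsing the extra whitespace left behind by the removals; tm provides stripWhitespace for this. A sketch:

# Collapse runs of whitespace into single spaces
corpus <- tm_map(corpus, stripWhitespace)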

Model the data

A term-document matrix represents the relationship between terms and documents: each row stands for a term, each column stands for a document, and each entry is the number of occurrences of the term in the document.
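
As a toy illustration of this definition (the two documents here are made up):

# Two tiny documents: the resulting matrix has one row per term,
# one column per document, and occurrence counts as entries
toy <- Corpus(VectorSource(c("the cat sat", "the cat")))
inspect(TermDocumentMatrix(toy))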

doc <- TermDocumentMatrix(corpus)
M1 <- as.matrix(doc)
names <- rownames(M1)
freq <- rowSums(M1)
# Order terms by overall frequency, most frequent first
ord <- order(freq, decreasing = TRUE)
data1 <- data.frame(id = names,
                    news = M1[,1],
                    blogs = M1[,2],
                    twitter = M1[,3],
                    freq = freq)
data1 <- data1[ord,]
data1[1:10,]
##        id news blogs twitter freq
## just just   47   980    1198 2225
## like like   39   946    1173 2158
## will will   97  1097     918 2112
## one   one   60  1167     691 1918
## get   get   24   709    1101 1834
## can   can   39   981     803 1823
## time time   42   846     561 1449
## know know   16   555     735 1306
## love love    9   372     911 1292
## good good   16   463     788 1267

Data visualization

library(reshape2)
# Melt the top 20 terms into long format; the first 60 rows hold the
# news/blogs/twitter counts, which drops the 20 rows for the freq column
melt(data1[1:20,], id = "id") -> plotdata
p <- ggplot(data = plotdata[1:60,], aes(id, value, fill = variable))
p <- p + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)

[Figure: stacked bar chart of the 20 most frequent terms, split by news, blogs, and twitter]

Wordcloud

set.seed(123)
# Plot the 100 most frequent terms with a sequential palette,
# dropping the lightest shades so the words stay readable
plotwordcloud <- data1[1:100,]
pal <- brewer.pal(9, "BuGn")
pal <- pal[-(1:4)]
wordcloud(plotwordcloud$id, freq = plotwordcloud$freq, random.order = FALSE, colors = pal)

[Figure: word cloud of the 100 most frequent terms]

Summary

For the final algorithm, I plan to build a Shiny app that predicts what people will say next, using an n-gram model and a Markov model.

  1. First, create the n-gram models, starting with 2-gram, 3-gram, and 4-gram models. After these models are built, we can create frequency tables, which can be converted into a frequency matrix (the Markov model); a minimal sketch of this step follows the list.

  2. Some phrases may not be seen in the training data. The probabilities of n-grams that are seen in the corpus can be used to estimate those that are not, for example by smoothing the maximum-likelihood estimates, which would otherwise assign unseen n-grams zero probability.

  3. Finally, a Shiny app will be created that examines a text string entered by a user. After comparing that string to the frequency matrix, the app will display the predicted next word.
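
As a minimal sketch of step 1, a 2-gram frequency table can be built in base R from the sampled text (a is the character vector created earlier; the variable names here are illustrative):

# Split the sampled text into words, pair each word with its successor,
# and count the resulting bigrams
words <- unlist(strsplit(tolower(a), "\\s+"))
bigrams <- paste(head(words, -1), tail(words, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 10)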