Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
This research develops an NLP method to predict what people will say next, based on the sentences in historical data.
In this section I load the data and packages which are required for the research. The data is downloaded from Coursera. This research uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. Of course, the data used in my research is the en_US data. The data is from a corpus called HC Corpora. You can also see the readme file for details on the corpora available.
The en_US data used in this research consists of three files:
- en_US.news.txt
- en_US.blogs.txt
- en_US.twitter.txt

readLines("en_US.news.txt") -> en_US_news
## Warning: incomplete final line found on 'en_US.news.txt'
readLines("en_US.blogs.txt") -> en_US_blogs
readLines("en_US.twitter.txt") -> en_US_twitter
## Warning: line 167155 appears to contain an embedded nul
## Warning: line 268547 appears to contain an embedded nul
## Warning: line 1274086 appears to contain an embedded nul
## Warning: line 1759032 appears to contain an embedded nul
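If the embedded-nul warnings are a concern, the file can be re-read with the skipNul option of readLines (a base-R argument); a minimal sketch:
# Re-read the Twitter file, skipping embedded nul characters
# to silence the warnings above
en_US_twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)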
library(tm)
library(ggplot2)
library(wordcloud)
- plyr to do the summary
- kable in knitr to show the results (I think the table in HTML is better; if you want to use it, don't forget to set results="asis" in the chunk)

library(plyr)
library(knitr)
## Create the list for summary
list_en<-list(en_US_news, en_US_blogs,en_US_twitter)
names(list_en)<- c("en_US_news", "en_US_blogs","en_US_twitter")
## Define the function for Word-counting
WordCounting<-function(x) length(unlist(strsplit(x,split = " ")))
## Output the summary: object size, line counts, word counts
ldply(list_en,c("object.size","length","WordCounting"))->output
names(output) <- c("Object","Size in bytes", "Line Counts", "Word Counts")
kable(output, format = "html")
| Object | Size in bytes | Line Counts | Word Counts |
|---|---|---|---|
| en_US_news | 20111392 | 77259 | 2643969 |
| en_US_blogs | 260564320 | 899288 | 37334131 |
| en_US_twitter | 316037344 | 2360148 | 30373543 |
The tm package also provides methods for character data, which makes the NLP work easier; I will use them via the Corpus function.
# An earlier version that pasted all of the lines together:
# trans <- function(x){
#   do.call(paste0, as.list(x))
# }
## Sample about 1% of the lines and paste them into a single string
trans <- function(x){
  n <- length(x)
  do.call(paste0, as.list(x[sample(n, round(n/100))]))
}
a<-c(trans(en_US_news),trans(en_US_blogs),trans(en_US_twitter))
corpus <- Corpus(VectorSource(a))
corpus
## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
Before modeling the data we need to clean it with the tm package, including the following steps:
# Convert to Lower case
corpus <- tm_map(corpus,content_transformer(tolower))
# Remove Punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove Numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove Stopwords from Corpus
corpus <- tm_map(corpus, removeWords, stopwords("english"))
A term-document matrix represents the relationship between terms and documents: each row stands for a term, each column for a document, and each entry is the number of occurrences of that term in that document.
doc <- TermDocumentMatrix(corpus)
M1 <- as.matrix(doc)
names <- rownames(M1)  # the terms
freq <- rowSums(M1)    # total frequency across the three documents
order <- order(freq, decreasing = TRUE)
data1 <- data.frame(id = names,
news = M1[,1],
blogs = M1[,2],
twitter = M1[,3],
freq = freq)
data1 <- data1[order,]
data1[1:10,]
## id news blogs twitter freq
## just just 47 980 1198 2225
## like like 39 946 1173 2158
## will will 97 1097 918 2112
## one one 60 1167 691 1918
## get get 24 709 1101 1834
## can can 39 981 803 1823
## time time 42 846 561 1449
## know know 16 555 735 1306
## love love 9 372 911 1292
## good good 16 463 788 1267
library(reshape2)
## Melt the top 20 terms; rows 1-60 are the news/blogs/twitter counts
## (rows 61-80 are the totals, which are excluded from the plot)
melt(data1[1:20,], id = "id") -> plotdata
p <- ggplot(data = plotdata[1:60,], aes(id, value, fill = variable))
p <- p + geom_bar(stat="identity")+theme(axis.text.x=element_text(angle=45, hjust=1))
print(p)
set.seed(123)
data1[1:100,]->plotwordcloud
pal <- brewer.pal(9,"BuGn")
pal <- pal[-(1:4)]
wordcloud(plotwordcloud$id, freq = plotwordcloud$freq, random.order = FALSE, colors = pal)
For the final algorithm, I plan to design a Shiny app that predicts what people will say next using an n-gram and Markov model.
First, create the n-gram models, starting with the 2-gram, 3-gram, and 4-gram models. After these models are finished, we can create frequency tables, which can be converted into a frequency matrix (the Markov model); the first step is sketched below.
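To make this concrete, here is a minimal base-R sketch of building a bigram frequency table; the tokens vector is a toy example, not the project data:
tokens <- unlist(strsplit("this is a test this is only a test", " "))
# Pair each word with the word that follows it
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
# Count each pair; this table is the 2-gram model
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq)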
Some phrases may not be seen in the training data. The probabilities of n-grams that are seen in the corpus can be used to estimate the n-grams that are not seen; one starting point is maximum likelihood estimation, adjusted with smoothing so that unseen n-grams receive non-zero probability.
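As an illustration (not necessarily the final method), add-one (Laplace) smoothing adjusts the maximum likelihood estimates so every unseen bigram gets a small non-zero probability; this sketch reuses tokens and bigram_freq from above:
V <- length(unique(tokens))   # vocabulary size
unigram_freq <- table(tokens)
# P(w2 | w1) with add-one smoothing; assumes w1 was observed in the corpus
prob_bigram <- function(w1, w2){
  cnt <- bigram_freq[paste(w1, w2)]
  if (is.na(cnt)) cnt <- 0
  as.numeric((cnt + 1) / (unigram_freq[w1] + V))
}
prob_bigram("this", "is")    # a seen bigram
prob_bigram("this", "test")  # an unseen bigram still gets non-zero probability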
Finally, a Shiny app will be created that examines a text string entered by the user. After comparing that string to the frequency matrix, the Shiny app will show the predicted next word.
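The lookup inside the app could work roughly like this hypothetical sketch (using the toy bigram_freq from above; the real app would back off through the 4-, 3-, and 2-gram tables):
# Return the most frequent word that followed the user's last word
predict_next <- function(text){
  last <- tail(unlist(strsplit(tolower(text), "\\s+")), 1)
  cand <- bigram_freq[startsWith(names(bigram_freq), paste0(last, " "))]
  if (length(cand) == 0) return(NA_character_)
  sub(".* ", "", names(which.max(cand)))
}
predict_next("I believe this")  # returns "is" with the toy data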