The main goal of the capstone project is an application built on a predictive text model. This report covers the exploratory data analysis and the first steps toward building the algorithm. Briefly, the application takes a word and then tries to predict the next word. The model will be trained on a collection of English text (a corpus) compiled from 3 sources: news, blogs, and tweets. The main parts are loading and cleaning the data and using NLP (Natural Language Processing) packages in R.
Step 1: Load the required libraries.
Sys.setenv(JAVA_HOME="c:\\Program Files\\Java\\jdk1.8.0_161\\jre")
library(rJava)
library(knitr)
## Warning: package 'knitr' was built under R version 3.3.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
#install.packages("RWekajars")
library(RWekajars)
#install.packages("RWeka")
library(RWeka)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(stringi)
## Warning: package 'stringi' was built under R version 3.3.3
library(NLP)
library(RColorBrewer)
library(wordcloud)
#install.packages("ngram")
library(ngram)
library(slam)
#install.packages("htmlTable")
library(htmlTable)
library(xtable)
Step 2: Load the required files and set up the work environment.
fileName_blog="final/en_US/en_US.blogs.txt"
con=file(fileName_blog,open="r")
lineBlogs=readLines(con, encoding = "UTF-8", skipNul = TRUE)
longBlogs=length(lineBlogs)
close(con)
fileName_news="final/en_US/en_US.news.txt"
con=file(fileName_news,open="r")
lineNews=readLines(con, encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'final/en_US/en_US.news.txt'
longNews=length(lineNews)
close(con)
fileName_twitter="final/en_US/en_US.twitter.txt"
con=file(fileName_twitter,open="r")
lineTwitter=readLines(con, encoding = "UTF-8", skipNul = TRUE)
longTwitter=length(lineTwitter)
close(con)
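The news file is about 196 MB on disk, yet only 77259 lines are read from it and a warning is raised. One possible explanation, which is an assumption and not something verified in this report, is that a text-mode connection stops early at an embedded control character on some platforms. A minimal sketch of a reader that opens the connection in binary mode instead; `read_corpus_file` is a hypothetical helper name, and a temporary file stands in for the real data.

```r
# Hypothetical helper: read a text file through a binary-mode connection.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")   # "rb" = read, binary mode
  on.exit(close(con))              # close the connection even if reading fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Usage on a small temporary file (a stand-in for the corpus files)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus_file(tmp)
length(demo_lines)
```

If the line count of the news file changes with this reader, the statistics in the following steps would need to be recomputed.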
Step 3: Summary statistics
To get a sense of what the data looks like, the main information from each of the 3 datasets (Blog, News and Twitter) is summarized.
The summary covers the size of each file in MB, the number of lines, characters, and words in each file, and the maximum character count per line. Figures such as the average word count per line can be derived from these.
Overview <- data.frame(
FileName=c("lineBlogs","lineNews","lineTwitter"),
"MaxCharacters" = sapply(list(lineBlogs, lineNews, lineTwitter), function(x){max(unlist(lapply(x, function(y) nchar(y))))}),
"File.Size" = sapply(list(lineBlogs, lineNews, lineTwitter), function(x){format(object.size(x),"MB")}),
FileSizeinMB=c(file.info(fileName_blog)$size/1024^2,
file.info(fileName_news)$size/1024^2,
file.info(fileName_twitter)$size/1024^2),
t(rbind(sapply(list(lineBlogs,lineNews,lineTwitter),stri_stats_general),
WordCount=sapply(list(lineBlogs,lineNews,lineTwitter),stri_stats_latex)[4,])
)
)
Visualise the data in the table
kable(Overview,caption = "The main datasets")
| FileName | MaxCharacters | File.Size | FileSizeinMB | Lines | LinesNEmpty | Chars | CharsNWhite | WordCount |
|---|---|---|---|---|---|---|---|---|
| lineBlogs | 40833 | 248.5 Mb | 200.4242 | 899288 | 899288 | 206824382 | 170389539 | 37570839 |
| lineNews | 5760 | 19.2 Mb | 196.2775 | 77259 | 77259 | 15639408 | 13072698 | 2651432 |
| lineTwitter | 140 | 301.4 Mb | 159.3641 | 2360148 | 2360148 | 162096241 | 134082806 | 30451170 |
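The average word count per line is not a column in the table above. As a rough sketch, it can be computed with base R on a toy vector. `line_stats` is a hypothetical helper, and it counts words by splitting on whitespace, which is an assumption and will not match the `stri_stats_latex` word counts exactly.

```r
# Hypothetical helper: basic statistics for a character vector of lines.
line_stats <- function(x) {
  # Split each trimmed line on runs of whitespace; empty lines yield 0 words
  words_per_line <- lengths(strsplit(trimws(x), "\\s+"))
  c(Lines = length(x),
    Chars = sum(nchar(x)),
    MaxChars = max(nchar(x)),
    Words = sum(words_per_line),
    MeanWordsPerLine = mean(words_per_line))
}

toy_lines <- c("the quick brown fox", "jumps over", "the lazy dog")
toy_stats <- line_stats(toy_lines)
toy_stats
```

The same function could be applied to `lineBlogs`, `lineNews`, and `lineTwitter`, at the cost of some run time on the full files.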
Step 4: Statistics to compare all the datasets
To summarize the information gathered so far, we select a small subset (0.2%) of each dataset and compare it with the main files.
Blogs_subset <- sample(lineBlogs, length(lineBlogs) * 0.002)
News_subset <- sample(lineNews, length(lineNews) * 0.002)
twitter_subset <- sample(lineTwitter, length(lineTwitter) * 0.002)
subset_blog_news_twitter<-c(sample(lineBlogs, length(lineBlogs) * 0.002),
sample(lineNews, length(lineNews) * 0.002),
sample(lineTwitter, length(lineTwitter) * 0.002))
Overview.after.subset <- data.frame('File' = c("lineBlogs","lineNews","lineTwitter","Blogs_subset","News_subset","twitter_subset","subset_blog_news_twitter"),
"File Size" = sapply(list(lineBlogs,lineNews,lineTwitter,Blogs_subset,News_subset,twitter_subset,subset_blog_news_twitter), function(x){format(object.size(x),"MB")}),
'Nentries' = sapply(list(lineBlogs,lineNews,lineTwitter,Blogs_subset,News_subset,twitter_subset,subset_blog_news_twitter), function(x){length(x)}),
'TotalCharacters' = sapply(list(lineBlogs,lineNews,lineTwitter,Blogs_subset,News_subset,twitter_subset,subset_blog_news_twitter), function(x){sum(nchar(x))}),
'MaxCharacters' = sapply(list(lineBlogs,lineNews,lineTwitter,Blogs_subset,News_subset,twitter_subset,subset_blog_news_twitter), function(x){max(unlist(lapply(x, function(y) nchar(y))))})
)
Visualise the data of the subsets in the table
kable(Overview.after.subset,caption = "7 datasets")
| File | File.Size | Nentries | TotalCharacters | MaxCharacters |
|---|---|---|---|---|
| lineBlogs | 248.5 Mb | 899288 | 206824505 | 40833 |
| lineNews | 19.2 Mb | 77259 | 15639408 | 5760 |
| lineTwitter | 301.4 Mb | 2360148 | 162096241 | 140 |
| Blogs_subset | 0.5 Mb | 1798 | 420411 | 2442 |
| News_subset | 0 Mb | 154 | 32193 | 1098 |
| twitter_subset | 0.6 Mb | 4720 | 325881 | 140 |
| subset_blog_news_twitter | 1.2 Mb | 6672 | 783546 | 9810 |
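Because `sample()` is random, the subset figures above change from run to run, and `subset_blog_news_twitter` is drawn independently of the three individual subsets. A minimal sketch of making the draw repeatable with `set.seed()`; the seed value and the helper name `sample_fraction` are assumptions for illustration.

```r
# Hypothetical helper: draw a fixed fraction of the lines without replacement.
sample_fraction <- function(x, fraction = 0.002) {
  sample(x, size = floor(length(x) * fraction))
}

toy_lines <- paste("line", 1:5000)

set.seed(1234)                        # assumed seed; any fixed integer works
first_draw <- sample_fraction(toy_lines)

set.seed(1234)                        # resetting the seed repeats the draw
second_draw <- sample_fraction(toy_lines)

identical(first_draw, second_draw)
```

With a fixed seed, the combined subset could also be built as `c(Blogs_subset, News_subset, twitter_subset)` so that every table describes the same lines.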
Step 5: First step to clean the data
After reducing the size of each loaded dataset, the sampled data is used to create a corpus, and the following clean-up steps are planned. Steps 5 and 6 are part of the plan but are not yet applied in the code that follows.
1) Convert all words to lowercase using tolower
2) Eliminate punctuation using removePunctuation
3) Eliminate numbers using removeNumbers
4) Strip whitespace using stripWhitespace
5) Eliminate banned words
6) Stemming using Porter's stemming algorithm
7) Create plain text format using PlainTextDocument
Blogs_subset <- iconv(Blogs_subset, "UTF-8", "ASCII", sub="")
News_subset <- iconv(News_subset, "UTF-8", "ASCII", sub="")
twitter_subset <- iconv(twitter_subset, "UTF-8", "ASCII", sub="")
Data_subset <- c(Blogs_subset,News_subset,twitter_subset)
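building.corpus <- function (x = Data_subset) {
# Build the corpus from the argument x rather than the global Data_subset
corpus <- VCorpus(VectorSource(x))
# Wrap base functions in content_transformer() so documents keep their class
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
# Return the cleaned corpus explicitly
corpus
}
corpues <- building.corpus(Data_subset)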
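To make the effect of each clean-up step visible, the same ideas can be sketched on plain character vectors with base R. This is only an illustration and not the tm pipeline used above: `clean_text` and the `banned_words` list are hypothetical, and stemming is left out because it needs an extra package.

```r
# Hypothetical banned-word list, for illustration only.
banned_words <- c("darn", "heck")

clean_text <- function(x, banned = banned_words) {
  x <- tolower(x)                           # 1) lowercase
  x <- gsub("[[:punct:]]+", "", x)          # 2) drop punctuation
  x <- gsub("[[:digit:]]+", "", x)          # 3) drop numbers
  if (length(banned) > 0) {                 # 5) drop banned words
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x <- gsub("\\s+", " ", x)                 # 4) collapse whitespace last
  trimws(x)
}

cleaned <- clean_text(c("Hello, World! 123", "  What the HECK is   this?  "))
cleaned
```

Whitespace is collapsed last here so that gaps left by removed punctuation, numbers, or banned words do not survive.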
Step 6: Breaking a stream of text up into words or short phrases
Now that we have a clean dataset, we need to convert it to a format that is more useful for Natural Language Processing (NLP). We use the RWeka and tm packages to build functions that tokenize the sample and construct matrices of unigrams, bigrams, and trigrams.
#Unigrams
uni_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
#Bigrams
bi_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
#Trigrams
tri_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Make Term Document Matrix
corpus.uni.matrix <- TermDocumentMatrix(corpues, control = list(tokenize = uni_tokenizer))
corpus.bi.matrix<- TermDocumentMatrix(corpues, control = list(tokenize = bi_tokenizer))
corpus.tri.matrix <- TermDocumentMatrix(corpues, control = list(tokenize = tri_tokenizer))
corpus.uni <- findFreqTerms(corpus.uni.matrix,lowfreq = 10)
corpus.bi <- findFreqTerms(corpus.bi.matrix,lowfreq=10)
corpus.tri <- findFreqTerms(corpus.tri.matrix,lowfreq=10)
corpus.uni.f <- rowSums(as.matrix(corpus.uni.matrix[corpus.uni,]))
corpus.uni.f <- data.frame(word=names(corpus.uni.f), frequency=corpus.uni.f)
corpus.bi.f <- rowSums(as.matrix(corpus.bi.matrix[corpus.bi,]))
corpus.bi.f <- data.frame(word=names(corpus.bi.f), frequency=corpus.bi.f)
corpus.tri.f <- rowSums(as.matrix(corpus.tri.matrix[corpus.tri,]))
corpus.tri.f <- data.frame(word=names(corpus.tri.f), frequency=corpus.tri.f)
kable(head(corpus.uni.f),caption = "Only one word")
| | word | frequency |
|---|---|---|
| able | able | 32 |
| about | about | 436 |
| above | above | 19 |
| absolutely | absolutely | 20 |
| access | access | 11 |
| according | according | 20 |
kable(head(corpus.bi.f),caption = "Two words")
| | word | frequency |
|---|---|---|
| a bad | a bad | 10 |
| a beautiful | a beautiful | 11 |
| a better | a better | 13 |
| a big | a big | 22 |
| a bit | a bit | 44 |
| a book | a book | 10 |
kable(head(corpus.tri.f),caption = "Three words")
| | word | frequency |
|---|---|---|
| a couple of | a couple of | 28 |
| a long time | a long time | 10 |
| a lot of | a lot of | 37 |
| all of the | all of the | 13 |
| all the time | all the time | 10 |
| and i am | and i am | 11 |
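To show what the tokenizers above produce, here is a small base R sketch that builds n-grams from already-cleaned text and counts them. It does not use RWeka; `make_ngrams` is a hypothetical helper and the two input sentences are toy data.

```r
# Hypothetical helper: all n-grams of one already-cleaned string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))   # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_text <- c("a lot of people", "a lot of time")
toy_bigrams <- unlist(lapply(toy_text, make_ngrams, n = 2))

# Count how often each bigram occurs, most frequent first
bigram_counts <- sort(table(toy_bigrams), decreasing = TRUE)
bigram_counts
```

N-grams are built per line here, so no n-gram spans two different documents.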
Step 7: Frequency of words or short phrases
In this section, we find the most frequently occurring words and phrases in the data and plot the most common unigrams, bigrams, and trigrams. The N-gram representation of a text lists all sequences of N consecutive words that appear in it.
plot.n.grams <- function(data, title, num) {
# Keep the num most frequent n-grams, in descending order of frequency
df2 <- data[order(-data$frequency),][1:num,]
ggplot(df2, aes(x = seq_len(num), y = frequency)) +
# A width below 1 keeps neighbouring bars from overlapping
geom_bar(stat = "identity", fill = "darkgreen", colour = "black", width = 0.9) +
coord_cartesian(xlim = c(0, num+1)) +
labs(title = title) +
xlab("Words") +
ylab("Count") +
# x is numeric, so the n-grams are attached as labels on a continuous scale
scale_x_continuous(breaks = seq_len(num), labels = as.character(df2$word)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
U<-plot.n.grams(corpus.uni.f,"Unigrams",20)
B<-plot.n.grams(corpus.bi.f,"Bigrams",20)
Tr<-plot.n.grams(corpus.tri.f,"Trigrams",20)
gridExtra::grid.arrange(U, B, Tr, ncol = 3)
For a better visualisation, we build a word cloud for each set of N-grams, based on their frequencies.
corpus.cloud<-list(corpus.tri.f,corpus.bi.f,corpus.uni.f)
par(mfrow=c(1, 3))
for (i in 1:3) {
wordcloud(corpus.cloud[[i]]$word, corpus.cloud[[i]]$frequency, scale = c(3,1), max.words=100, random.order=FALSE, rot.per=0, fixed.asp = TRUE, use.r.layout = FALSE, colors=brewer.pal(8, "Dark2"))
}
The next steps are to:
1) Remove the common words to limit the skew that they may introduce.
2) Add sentiment analysis.
3) Implement the word prediction algorithm.
4) Implement the Shiny app.
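As a first idea for step 3, the next word could be looked up in a bigram frequency table. The sketch below is only one simple possibility and not the final algorithm: `predict_next` is a hypothetical helper, the table mimics the shape of `corpus.bi.f`, and its values are made up for illustration.

```r
# Toy bigram table shaped like corpus.bi.f (frequencies are made up).
toy_table <- data.frame(
  word = c("a lot", "a bit", "a big", "lot of", "bit of"),
  frequency = c(37, 44, 22, 40, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent continuation of the last typed word.
predict_next <- function(input, bigrams) {
  last_word <- tail(strsplit(tolower(trimws(input)), "\\s+")[[1]], 1)
  if (length(last_word) == 0) return(NA_character_)   # empty input
  prefix <- paste0(last_word, " ")
  hits <- bigrams[startsWith(bigrams$word, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)          # unseen word
  best <- hits$word[which.max(hits$frequency)]
  sub(prefix, "", best, fixed = TRUE)                 # keep only the next word
}

predict_next("I need a", toy_table)
predict_next("quite a lot", toy_table)
```

A fuller model would probably fall back from trigrams to bigrams to unigrams when the longer context has not been seen.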