The objective of this report is to present an exploratory analysis of the data and to outline the plan for developing the prediction algorithm and the accompanying application.
As a first approximation, the documents are considerably large (roughly 150-200 MB each), with 30-37 million words to be processed per document.
         size (MB)    lines     words      chars
blogs     200.4242   899288  37334131  206824505
news      196.2775  1010242  34372530  203223159
twitter   159.3641  2360148  30373583  162096241
According to the boxplot, the three documents have a similar distribution of characters per word. There is also a large number of words outside the whisker limits, which we explore next.
To see what these outliers involve, we list the first 10 words longer than 20 characters for each document (blogs, news and twitter respectively).
[1] "democratically-minded"
[2] "memorycomparatively."
[3] "HAHAHAHAHAHAHAHAHAHAHAHAHAHHAHAHAHAHAHAHAHAHAHAHHAHAHAHAHAHAHAHAHAHAAHHAHA"
[4] "something-for-nothing"
[5] "DunLaoghaire-Rathdown"
[6] "location--Minneapolis,"
[7] "http://www.yogajournal.com/for_teachers/697?utm_source=DailyInsight&utm_medium=newsletter&utm_campaign=DailyInsight)."
[8] "-http://deckboss.blogspot.ca/2012/05/legislature-lavishes-aquaculture.html"
[9] "hacker/cyberterrorist"
[10] "(Esshaych@hotmail.co.uk),"
[1] "theCareerBuilder.comad" "tetrahydrocannabinol,"
[3] "therndon@stonepointcc.org." "National-Bedminsters"
[5] "chandleraz.gov/cinco." "(healthoregon.org/radon)."
[7] "portlandbicycletours.com." "greyhoundwelfare.org."
[9] "http://www.nikkics.com." "http://www.mattdennys.com."
[1] "djsosnekspqnslanskam." "foundations/charities,"
[3] "evening/afternoon/whatever" "#OneThingYouShouldntDo"
[5] "Sark-oh-no-he-didn'tzy," "Liberals/Progressives/DemoRates"
[7] "after-work.Introducing" "#problemchildontheloose"
[9] "#WordsYouWillNeverHearMeSay" "www.historyglobe.com/jamestown/"
As this output shows, the long “words” correspond to website addresses, email addresses, strings that are not real words, and Twitter hashtags.
On the other hand, some words appear that are separated not by spaces but by slashes or hyphens; these should be treated as valid words during cleaning.
Because of the size of the documents, the corpus is built from a 5% sample of each document.
Transformations are then applied to structure the data: punctuation and numbers are removed, repeated consecutive words are collapsed, abbreviations and contractions are replaced, the text is converted to lowercase, profane words are removed, and extra whitespace is stripped.
To remove profane words, a word list from Robert J Gabriel's GitHub account was used.
Next, the resulting term-document matrix is displayed.
<<TermDocumentMatrix (terms: 124333, documents: 3)>>
Non-/sparse entries: 168191/204808
Sparsity : 55%
Maximal term length: 261
Weighting : term frequency (tf)
Sample :
         Docs
Terms    en_US.blogs.txt  en_US.news.txt  en_US.twitter.txt
  and              54596            3436              21801
  are               9676             559               7974
  for              18139            1434              19345
  have             11046             564               8287
  that             23000            1303              11609
  the              91604            7718              46658
  this             13076             465               8061
  was              13750             862               5882
  with             14192            1073               8819
  you              14852             348              27092
To examine the behavior of term frequencies in the term-document matrix, four n-gram matrices (1-gram through 4-gram) are created.
Next, a summary is displayed with the number of terms for each n-gram.
        1-gram     2-gram      3-gram      4-gram
Terms   "124333"   "1233860"   "2611491"   "3233189"
size    "10.8 Mb"  "108.8 Mb"  "241.8 Mb"  "318 Mb"
A large number of terms is observed for each n-gram, which may slow down the execution of the model.
Below are the 20 most frequent terms for each n-gram.
The following table shows how many unique terms, ordered by frequency, are needed to cover 50% and 90% of all term instances in each n-gram.
         1-gram   2-gram    3-gram    4-gram
Cov 50%     262    36390    896745   1518444
Cov 90%   10057   890911   2268542   2890240
To make the number of terms more manageable, sparse terms are removed, allowing a maximum sparsity of 20%.
Below is a summary with the number of terms for each n-gram.
        1-gram    2-gram    3-gram    4-gram
Terms   "11835"   "25342"   "10244"   "1679"
size    "1.3 Mb"  "2.9 Mb"  "1.2 Mb"  "0.2 Mb"
As noted, the terms for each n-gram have decreased considerably.
         1-gram   2-gram   3-gram   4-gram
Cov 50%     157     1095     1078      231
Cov 90%    3033    10269     5845     1106
Performing the same coverage analysis on the reduced n-grams also shows a large decrease in the number of terms. This reduction must be weighed against the predictive capacity of the model to be built.
The words in the three documents have a similar average number of characters, and similarly close standard deviations. Words outside this range are mainly web page references, email addresses, Twitter hashtags and misspelled words. In general these terms are not useful for the analysis, and most of them are removed during cleaning.
To create the corpus it was necessary to sample 5% of each document, because a corpus of the full size is inefficient to handle on a personal computer. This can cause problems for the model, since relevant terms may be left out.
To address this, I plan to build several sub-corpora (by random sampling with replacement) that are computationally manageable, create a term-document matrix for each one, and merge them. This should keep relevant terms with intermediate frequencies from being lost, although at the expense of processing time. These merged sub-corpora will form the training set, representing 60% of the corpus; the test set will be 20% and the validation set the remaining 20%. A minimal sketch of the idea is shown below.
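A minimal sketch of this plan, under the assumption that the raw line vectors (blogs, news and twitter) are still in memory; buildFreq is a hypothetical helper and the sub-corpus sizes are placeholders:
# Sketch only: 60/20/20 split plus merged term frequencies from several sub-corpora
allLines <- c(blogs, news, twitter)
set.seed(50)
idx   <- sample(seq_along(allLines))
nTot  <- length(allLines)
train <- allLines[idx[1:floor(0.6 * nTot)]]
test  <- allLines[idx[(floor(0.6 * nTot) + 1):floor(0.8 * nTot)]]
valid <- allLines[idx[(floor(0.8 * nTot) + 1):nTot]]
# Build a named term-frequency vector from one manageable sub-corpus
buildFreq <- function(lines, size) {
    sub <- sample(lines, size, replace = TRUE)   # random sample with replacement
    tdm <- TermDocumentMatrix(VCorpus(VectorSource(sub)))
    rowSums(as.matrix(tdm))
}
# Merge several sub-corpus frequency vectors by summing counts per term
freqList   <- lapply(1:5, function(i) buildFreq(train, 20000))
mergedFreq <- tapply(unlist(freqList), unlist(lapply(freqList, names)), sum)
head(sort(mergedFreq, decreasing = TRUE), 10)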
The term-document matrix shows that the most frequent terms are stopwords, which is expected, since they are used heavily to articulate sentences. They were deliberately not removed during cleaning: the goal is to predict words, so stopwords must be available as predictions.
When creating the n-grams I had problems with the qdap library: it took a very long time to execute (I did not investigate the reasons), so I chose the NLP library, which gave better results. The point is that the choice of algorithms matters for this application and that they will need further optimization, both by improving what is already done and by exploring other R packages that may be useful, such as quanteda and tidytext; a brief quanteda sketch follows.
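For reference, a rough sketch of how the same n-grams might be produced with quanteda (not used in this report); sampleLines is a hypothetical character vector of cleaned lines:
suppressMessages(library(quanteda))
# Tokenize, then build bigrams joined with spaces to match the tm/NLP format above
toks <- tokens(sampleLines, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
bi   <- tokens_ngrams(toks, n = 2, concatenator = " ")
dfm2 <- dfm(bi)
topfeatures(dfm2, 20)   # 20 most frequent bigrams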
Another important point is the size of the n-grams. Their memory usage is considerable, which can cause problems on devices with little available memory. This will be verified during model development and the subsequent implementation of the Shiny application. If problems arise, I plan to address them by removing sparse terms, as shown in this document, adjusting the sparsity factor according to model performance; there is a trade-off between computational performance and prediction quality.
To detect words in other languages I tried the hunspell package with its default dictionary. The main problem is that the terms must be adjusted to the structure the package expects before filtering, and I could not find a clear description of that structure. In addition, the default dictionary is deficient: common terms are flagged as belonging to another language. For these reasons language filtering was not applied; I still need to understand the package's expected input and find a better dictionary.
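For reference, a minimal sketch of the intended filtering with hunspell; the example terms are made up and the default en_US dictionary is assumed:
suppressMessages(library(hunspell))
# Hypothetical terms; in practice these would be the rows of the 1-gram matrix
terms <- c("house", "djsosnekspqnslanskam", "running", "ciudad")
ok <- hunspell_check(terms)   # TRUE when the term is found in the default dictionary
terms[!ok]                    # candidates to drop as foreign or misspelled words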
I will use the Katz back-off model over 3- or 4-grams to predict the next word, handling both observed and unobserved terms. A simplified sketch of the back-off idea follows.
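A simplified sketch of the back-off lookup (frequency-based only, without the Good-Turing discounting that full Katz back-off requires), assuming the frequency tables G2, G3 and G4 built in the appendix (columns word and freq, each n-gram stored as a space-separated string):
predictNext <- function(phrase, G2, G3, G4) {
    w <- tolower(unlist(strsplit(phrase, "\\s+")))
    for (n in 3:1) {                             # try 4-grams, then 3-grams, then 2-grams
        if (length(w) < n) next
        prefix <- paste(tail(w, n), collapse = " ")
        G      <- list(G2, G3, G4)[[n]]
        terms  <- as.character(G$word)
        hits   <- G[startsWith(terms, paste0(prefix, " ")), ]
        if (nrow(hits) > 0) {
            best <- as.character(hits$word[which.max(hits$freq)])
            return(tail(unlist(strsplit(best, " ")), 1))
        }
    }
    "the"                                        # fall back to the most frequent unigram
}
# Hypothetical usage: predictNext("thanks for the", G2, G3, G4)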
To evaluate the model, I plan to feed each test line in sequentially, word by word. At each step the algorithm must predict the next word, which yields an error measure. I still have to assess what this implies (and whether it is a good measurement), so the evaluation method may be modified; a rough sketch is given below.
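A rough sketch of this evaluation, using a hypothetical held-out vector testLines and the predictNext() sketch above; the metric here is simple next-word accuracy:
evaluateModel <- function(testLines, G2, G3, G4) {
    hits <- 0; total <- 0
    for (line in testLines) {
        w <- tolower(unlist(strsplit(line, "\\s+")))
        if (length(w) < 2) next
        for (i in 2:length(w)) {
            guess <- predictNext(paste(w[1:(i - 1)], collapse = " "), G2, G3, G4)
            hits  <- hits + (guess == w[i])      # exact next-word match
            total <- total + 1
        }
    }
    hits / total                                 # proportion of words predicted correctly
}
# Hypothetical usage: evaluateModel(testLines, G2, G3, G4)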
Once this model is developed, I will try other models with the same evaluation method and test set, then select the best model and assess it on the validation set.
Later I will develop a Shiny application around this model (I have not planned this in detail yet) and prepare a presentation and report covering the main points of the development.
Load the libraries
suppressMessages(library(knitr))
suppressMessages(library(tm))
suppressMessages(library(ggplot2))
suppressMessages(library(NLP))
suppressMessages(library(qdap))
suppressMessages(library(tidyr))
suppressMessages(library(hunspell))
Load the data
dir <- "./SwiftKey/"
# Load the data
con <- file(paste0(dir,"en_US.blogs.txt"), "rb")
blogs <- readLines(con, encoding = "UTF-8", skipNul = T, warn = F)
close(con)
con <- file(paste0(dir,"en_US.news.txt"), "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = T, warn = F)
close(con)
con <- file(paste0(dir,"en_US.twitter.txt"), "rb")
twitter <- readLines(con, encoding = "UTF-8", skipNul = T, warn = F)
close(con)
Calculations on the data for the summary
# Read the size of the data
sizeData <- c(file.info(paste0(dir,"en_US.blogs.txt"))$size,
file.info(paste0(dir,"en_US.news.txt"))$size,
file.info(paste0(dir,"en_US.twitter.txt"))$size)
# Count the number of words per data set
wordsBlogs <- words(blogs)
wordsNews <- words(news)
wordsTwitter <- words(twitter)
wordsData <- c(length(wordsBlogs), length(wordsNews),
length(wordsTwitter))
# Count the number of chars per data set
charsData <- c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter)))
# Count the number of lines per data set
linesData <- c(length(blogs), length(news), length(twitter))
Calculate number of characters per word
sumData <- rbind(sizeData/1048576, linesData, wordsData, charsData)
colnames(sumData) <- c("blogs", "news", "twitter")
rownames(sumData) <- c("size (MB)", "lines", "words", "chars")
# Summary of the data
t(sumData)
# Number of chars per word
charBlogs <- nchar(wordsBlogs)
charNews <- nchar(wordsNews)
charTwitter <- nchar(wordsTwitter)
Boxplot of characters per word for each document
boxplot(charBlogs, charNews, charTwitter,
log = "y", names = c("blogs", "news", "twitter"),
ylab = "log(Number of Characters)", xlab = "File Name")
title("Comparing Distributions of Chracters per Word")Displays the first 10 words with more than 20 characters
wordsBlogs[charBlogs>20][1:10]
wordsNews[charNews>20][1:10]
wordsTwitter[charTwitter>20][1:10]
Create a corpus with the three documents
remove(blogs, news, twitter, wordsBlogs, wordsNews, wordsTwitter)
corpus <- VCorpus(DirSource(dir),
readerControl = list(language = "en"))
n=.05 # 5% of the size of each set
set.seed(50)
corpus[[1]]$content <- sample(corpus[[1]]$content,
length(corpus[[1]]$content)*n)
corpus[[2]]$content <- sample(corpus[[2]]$content,
length(corpus[[2]]$content)*n)
corpus[[3]]$content <- sample(corpus[[3]]$content,
length(corpus[[3]]$content)*n)
Make transformations and create a term-document matrix
# Create function to modify a text pattern
f <- content_transformer(function(x, patt1, patt2) gsub(patt1, patt2, x))
# Download profane words from Robert J Gabriel's github
# "https://github.com/RobertJGabriel/Google-profanity-words/blob/master/list.txt"
con <- file("profaneList.txt", "rb")
profaneWords <- readLines(con, encoding = "UTF-8", skipNul = T, warn = F)
close(con)
# Remove punctuation and junk
corpus <- tm_map(corpus, f, "[[:punct:]]", "")
# Replace Unicode apostrophe with ASCII apostrophe
corpus <- tm_map(corpus, f, "['']", "'")
# Remove multiple repeating consecutive words
corpus <- tm_map(corpus, f, "\\b(\\w+)(?:\\s+\\1\\b)+", "\\1")
# Remove multiple repeating consecutive pairs of words
corpus <- tm_map(corpus, f, "\\b(\\w+\\s\\w+)(\\s\\1)+", "\\1")
# Replace abbreviation
corpus[[1]]$content <- replace_abbreviation(corpus[[1]]$content)
corpus[[2]]$content <- replace_abbreviation(corpus[[2]]$content)
corpus[[3]]$content <- replace_abbreviation(corpus[[3]]$content)
# Replace contraction
corpus[[1]]$content <- replace_contraction(corpus[[1]]$content)
corpus[[2]]$content <- replace_contraction(corpus[[2]]$content)
corpus[[3]]$content <- replace_contraction(corpus[[3]]$content)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Transform to tolower
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove profane words
corpus <- tm_map(corpus, removeWords, profaneWords)
# Remove strip extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Create a term-document matrix from single words found in all documents
tdmCorpus <- TermDocumentMatrix(corpus)
Show the term-document matrix
inspect(tdmCorpus)
Create n-grams
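# Tokenizer functions: NLP::ngrams() splits each document's word vector into
# n-grams of length 1 to 4, which are pasted back into space-separated strings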
token1gram <- function(x) {
unlist(lapply(NLP::ngrams(words(x), 1), paste, collapse = " "),
use.names = FALSE)}
token2gram <- function(x) {
unlist(lapply(NLP::ngrams(words(x), 2), paste, collapse = " "),
use.names = FALSE)}
token3gram <- function(x) {
unlist(lapply(NLP::ngrams(words(x), 3), paste, collapse = " "),
use.names = FALSE)}
token4gram <- function(x) {
unlist(lapply(NLP::ngrams(words(x), 4), paste, collapse = " "),
use.names = FALSE)}
gram1 <- TermDocumentMatrix(corpus, control = list(tokenize = token1gram))
gram2 <- TermDocumentMatrix(corpus, control = list(tokenize = token2gram))
gram3 <- TermDocumentMatrix(corpus, control = list(tokenize = token3gram))
gram4 <- TermDocumentMatrix(corpus, control = list(tokenize = token4gram))
Show the four n-grams
ngramData <- cbind(dim(gram1)[1], dim(gram2)[1],
dim(gram3)[1], dim(gram4)[1])
object <- cbind(format(object.size(gram1), units = "MB"),
format(object.size(gram2), units = "MB"),
format(object.size(gram3), units = "MB"),
format(object.size(gram4), units = "MB"))
ngramData <- rbind(ngramData, object)
colnames(ngramData) <- c("1-gram", "2-gram", "3-gram", "4-gram")
rownames(ngramData) <- c("Terms", "size")
ngramData
Save the n-grams to disk (to use later)
saveRDS(gram1, "one_words.rds")
saveRDS(gram2, "two_words.rds")
saveRDS(gram3, "three_words.rds")
saveRDS(gram4, "four_words.rds")Show the 20 most frequent terms of each n-gram
nn=20 # Number of bars to display
# 1-gram
freqTerms <- findFreqTerms(gram1)
termFreq <- rowSums(as.matrix(gram1[freqTerms,]))
termFreq <- termFreq[order(termFreq, decreasing = TRUE)]
termFreq <- data.frame(unigram=names(head(termFreq,nn)),
frequency=head(termFreq,nn))
g1 <- ggplot(termFreq, aes(x=reorder(unigram, frequency), y=frequency)) +
geom_bar(stat = "identity", fill="red") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("1-gram") + ylab("Frequency") +
labs(title = "Top 1-grams by frequency")
print(g1)
# 2-gram
freqTerms <- findFreqTerms(gram2)
termFreq <- rowSums(as.matrix(gram2[freqTerms,]))
termFreq <- termFreq[order(termFreq, decreasing = TRUE)]
termFreq <- data.frame(unigram=names(head(termFreq,nn)),
frequency=head(termFreq,nn))
g2 <- ggplot(termFreq, aes(x=reorder(unigram, frequency), y=frequency)) +
geom_bar(stat = "identity", fill="red") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("2-gram") + ylab("Frequency") +
labs(title = "Top 2-grams by frequency")
print(g2)
# 3-gram
freqTerms <- findFreqTerms(gram3)
termFreq <- rowSums(as.matrix(gram3[freqTerms,]))
termFreq <- termFreq[order(termFreq, decreasing = TRUE)]
termFreq <- data.frame(unigram=names(head(termFreq,nn)),
frequency=head(termFreq,nn))
g3 <- ggplot(termFreq, aes(x=reorder(unigram, frequency), y=frequency)) +
geom_bar(stat = "identity", fill="red") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("3-gram") + ylab("Frequency") +
labs(title = "Top 3-grams by frequency")
print(g3)
# 4-gram
freqTerms <- findFreqTerms(gram4)
termFreq <- rowSums(as.matrix(gram4[freqTerms,]))
termFreq <- termFreq[order(termFreq, decreasing = TRUE)]
termFreq <- data.frame(unigram=names(head(termFreq,nn)),
frequency=head(termFreq,nn))
g4 <- ggplot(termFreq, aes(x=reorder(unigram, frequency), y=frequency)) +
geom_bar(stat = "identity", fill="red") + coord_flip() +
theme(legend.title=element_blank()) +
xlab("4-gram") + ylab("Frequency") +
labs(title = "Top 4-grams by frequency")
print(g4)
Calculate and show n-gram coverage for 50% and 90%
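# wordsum: given a term-frequency table sorted in decreasing order, returns how many
# of the most frequent terms are needed to reach the requested coverage of all term instances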
wordsum <- function(x, coverage){
totalfreq <- sum(x$freq)
wordfreq <- 0
for (i in 1:length(x$freq))
{
wordfreq <- wordfreq + as.numeric(x$freq[i])
if (wordfreq >= coverage * totalfreq)
{
return (i)
}
}
return (nrow(x))
}
freqTerms <- findFreqTerms(gram1)
termFreq <- rowSums(as.matrix(gram1[freqTerms,]))
G1 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G1) <- 1:dim(G1)[1]
G1 <- G1[with(G1, order(G1$freq, decreasing = TRUE)), ]
freqTerms <- findFreqTerms(gram2)
termFreq <- rowSums(as.matrix(gram2[freqTerms,]))
G2 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G2) <- 1:dim(G2)[1]
G2 <- G2[with(G2, order(G2$freq, decreasing = TRUE)), ]
freqTerms <- findFreqTerms(gram3)
termFreq <- rowSums(as.matrix(gram3[freqTerms,]))
G3 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G3) <- 1:dim(G3)[1]
G3 <- G3[with(G3, order(G3$freq, decreasing = TRUE)), ]
freqTerms <- findFreqTerms(gram4)
termFreq <- rowSums(as.matrix(gram4[freqTerms,]))
G4 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G4) <- 1:dim(G4)[1]
G4 <- G4[with(G4, order(G4$freq, decreasing = TRUE)), ]
ngramData <- rbind(cbind(wordsum(G1,0.5), wordsum(G2,0.5),
wordsum(G3,0.5),
wordsum(G4,0.5)),
cbind(wordsum(G1,0.9),
wordsum(G2,0.9),
wordsum(G3,0.9),
wordsum(G4,0.9)))
colnames(ngramData) <- c("1-gram", "2-gram", "3-gram", "4-gram")
rownames(ngramData) <- c("Cov 50%", "Cov 90%")
ngramData
Decrease the number of terms of each n-gram
# remove sparse words, leaving only 20% sparsity
gram1 <- removeSparseTerms(gram1, .2)
gram2 <- removeSparseTerms(gram2, .2)
gram3 <- removeSparseTerms(gram3, .2)
gram4 <- removeSparseTerms(gram4, .2)
ngramData <- cbind(dim(gram1)[1], dim(gram2)[1],
dim(gram3)[1], dim(gram4)[1])
object <- cbind(format(object.size(gram1), units = "MB"),
format(object.size(gram2), units = "MB"),
format(object.size(gram3), units = "MB"),
format(object.size(gram4), units = "MB"))
ngramData <- rbind(ngramData, object)
colnames(ngramData) <- c("1-gram", "2-gram", "3-gram", "4-gram")
rownames(ngramData) <- c("Terms", "size")
ngramData
Calculate and show the coverage of the new n-grams for 50% and 90%
wordsum <- function(x, coverage){
totalfreq <- sum(x$freq)
wordfreq <- 0
for (i in 1:length(x$freq))
{
wordfreq <- wordfreq + as.numeric(x$freq[i])
if (wordfreq >= coverage * totalfreq)
{
return (i)
}
}
return (nrow(x))
}
freqTerms <- findFreqTerms(gram1)
termFreq <- rowSums(as.matrix(gram1[freqTerms,]))
G1 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G1) <- 1:dim(G1)[1]
G1 <- G1[with(G1, order(G1$freq, decreasing = TRUE)), ]
freqTerms <- findFreqTerms(gram2)
termFreq <- rowSums(as.matrix(gram2[freqTerms,]))
G2 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G2) <- 1:dim(G2)[1]
G2 <- G2[with(G2, order(G2$freq, decreasing = TRUE)), ]
freqTerms <- findFreqTerms(gram3)
termFreq <- rowSums(as.matrix(gram3[freqTerms,]))
G3 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G3) <- 1:dim(G3)[1]
G3 <- G3[with(G3, order(G3$freq, decreasing = TRUE)), ]
freqTerms <- findFreqTerms(gram4)
termFreq <- rowSums(as.matrix(gram4[freqTerms,]))
G4 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G4) <- 1:dim(G4)[1]
G4 <- G4[with(G4, order(G4$freq, decreasing = TRUE)), ]
ngramData <- rbind(cbind(wordsum(G1,0.5), wordsum(G2,0.5),
wordsum(G3,0.5),
wordsum(G4,0.5)),
cbind(wordsum(G1,0.9),
wordsum(G2,0.9),
wordsum(G3,0.9),
wordsum(G4,0.9)))
colnames(ngramData) <- c("1-gram", "2-gram", "3-gram", "4-gram")
rownames(ngramData) <- c("Cov 50%", "Cov 90%")
ngramData
Save new n-grams to disk (to use later)
saveRDS(gram1, "one_words1.rds")
saveRDS(gram2, "two_words1.rds")
saveRDS(gram3, "three_words1.rds")
saveRDS(gram4, "four_words1.rds")sessionInfo()R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Chile.1252 LC_CTYPE=Spanish_Chile.1252
[3] LC_MONETARY=Spanish_Chile.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Chile.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] hunspell_3.0 tidyr_0.8.2 qdap_2.3.2
[4] RColorBrewer_1.1-2 qdapTools_1.3.3 qdapRegex_0.7.2
[7] qdapDictionaries_1.0.7 ggplot2_3.1.0 tm_0.7-6
[10] NLP_0.2-0 knitr_1.21
loaded via a namespace (and not attached):
[1] gtools_3.8.1 wordcloud_2.6 venneuler_1.1-0
[4] tidyselect_0.2.5 xfun_0.4 slam_0.1-44
[7] reshape2_1.4.3 purrr_0.2.5 rJava_0.9-10
[10] reports_0.1.4 colorspace_1.3-2 htmltools_0.3.6
[13] yaml_2.2.0 chron_2.3-53 XML_3.98-1.16
[16] rlang_0.3.1 pillar_1.3.1 glue_1.3.0
[19] withr_2.1.2 bindrcpp_0.2.2 bindr_0.1.1
[22] plyr_1.8.4 stringr_1.3.1 munsell_0.5.0
[25] gtable_0.2.0 evaluate_0.12 gender_0.5.2
[28] parallel_3.5.2 xlsxjars_0.6.1 Rcpp_1.0.0
[31] scales_1.0.0 gdata_2.18.0 plotrix_3.7-4
[34] xlsx_0.6.1 openNLPdata_1.5.3-4 gridExtra_2.3
[37] digest_0.6.18 stringi_1.2.4 dplyr_0.7.8
[40] grid_3.5.2 tools_3.5.2 bitops_1.0-6
[43] magrittr_1.5 lazyeval_0.2.1 RCurl_1.95-4.11
[46] tibble_2.0.0 crayon_1.3.4 pkgconfig_2.0.2
[49] data.table_1.11.8 xml2_1.2.0 assertthat_0.2.0
[52] rmarkdown_1.11 openNLP_0.2-6 R6_2.3.0
[55] igraph_1.2.2 compiler_3.5.2