Task description

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.

  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Loading Data

As a first approximation, we observe that the documents are considerably large (roughly 150 - 200 MB each), with around 30 - 37 million words to process per document.

        size (MB)   lines    words     chars
blogs    200.4242  899288 37334131 206824505
news     196.2775 1010242 34372530 203223159
twitter  159.3641 2360148 30373583 162096241

According to the boxplot, the three documents have a similar distribution of characters per word. There is also a large number of words beyond the upper whisker; we explore these outliers next.

To see what is involved, we consider the first 10 words longer than 20 characters in each document (blogs, news and twitter, respectively).

 [1] "democratically-minded"                                                                                                
 [2] "memory—comparatively."                                                                                                
 [3] "HAHAHAHAHAHAHAHAHAHAHAHAHAHHAHAHAHAHAHAHAHAHAHAHHAHAHAHAHAHAHAHAHAHAAHHAHA"                                           
 [4] "something-for-nothing"                                                                                                
 [5] "DunLaoghaire-Rathdown"                                                                                                
 [6] "location--Minneapolis,"                                                                                               
 [7] "http://www.yogajournal.com/for_teachers/697?utm_source=DailyInsight&utm_medium=newsletter&utm_campaign=DailyInsight)."
 [8] "-http://deckboss.blogspot.ca/2012/05/legislature-lavishes-aquaculture.html"                                           
 [9] "hacker/cyberterrorist"                                                                                                
[10] "(Esshaych@hotmail.co.uk),"                                                                                            
 [1] "theCareerBuilder.comad"     "tetrahydrocannabinol,"     
 [3] "therndon@stonepointcc.org." "National-Bedminster’s"     
 [5] "chandleraz.gov/cinco."      "(healthoregon.org/radon)." 
 [7] "portlandbicycletours.com."  "greyhoundwelfare.org."     
 [9] "http://www.nikkics.com."    "http://www.mattdennys.com."
 [1] "djsosnekspqnslanskam."           "foundations/charities,"         
 [3] "evening/afternoon/whatever"      "#OneThingYouShouldntDo"         
 [5] "Sark-oh-no-he-didn'tzy,"         "Liberals/Progressives/DemoRates"
 [7] "after-work.Introducing"          "#problemchildontheloose"        
 [9] "#WordsYouWillNeverHearMeSay"     "www.historyglobe.com/jamestown/"

According to this sample, most of these "words" are URLs, email addresses, strings that do not represent real words, and Twitter hashtags. (More of these may surface during cleaning.)

On the other hand, some valid words appear joined by a slash or a hyphen rather than separated by a space; these should still be treated as valid words during cleaning (see the sketch below).
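A minimal sketch (not part of the pipeline used below): replace slashes and hyphens with spaces before the general punctuation removal, so that the joined parts survive as separate words. The two example tokens are taken from the listings above.

# Sketch: split slash- and hyphen-joined tokens into separate words
x <- c("evening/afternoon/whatever", "something-for-nothing")
gsub("[/-]+", " ", x)
# expected: "evening afternoon whatever" "something for nothing"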

Text transformation and cleaning

Due to the size of the documents, the corpus is built from a 5% sample of each document.

The following transformations are then applied to structure the data: remove punctuation and numbers, convert to lower case, remove profane words, collapse repeated consecutive words and word pairs, and strip extra whitespace.

To remove profane words, a word list from Robert J Gabriel's GitHub account was used.
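The list is read from a local file, profaneList.txt, in the Code section. A minimal sketch for fetching it first (the raw-file URL is an assumption based on the repository linked in the code):

# Sketch: download the profanity list once, then read it locally as in the Code section
# (raw URL assumed from https://github.com/RobertJGabriel/Google-profanity-words)
profaneURL <- "https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt"
download.file(profaneURL, destfile = "profaneList.txt", mode = "wb")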

Next, the resulting term-document matrix is displayed.

<<TermDocumentMatrix (terms: 124333, documents: 3)>>
Non-/sparse entries: 168191/204808
Sparsity           : 55%
Maximal term length: 261
Weighting          : term frequency (tf)
Sample             :
      Docs
Terms  en_US.blogs.txt en_US.news.txt en_US.twitter.txt
  and            54596           3436             21801
  are             9676            559              7974
  for            18139           1434             19345
  have           11046            564              8287
  that           23000           1303             11609
  the            91604           7718             46658
  this           13076            465              8061
  was            13750            862              5882
  with           14192           1073              8819
  you            14852            348             27092

Build n-Grams

To examine the behavior of term frequencies, four n-grams (1-gram through 4-gram) are created, each with its own term-document matrix.
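As a quick illustration of the tokenizers defined in the Code section, NLP::ngrams() slides a window of n consecutive words over each line and the pieces are pasted back into single terms; for example, for the 2-gram case:

# Sketch: what the 2-gram tokenizer produces for a single line of text
library(NLP)
toks <- strsplit("thanks for the follow", " ")[[1]]
unlist(lapply(ngrams(toks, 2L), paste, collapse = " "), use.names = FALSE)
# expected: "thanks for" "for the" "the follow"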

Next, the term-document matrix of each n-gram is shown, followed by a summary of the number of terms per n-gram.

<<TermDocumentMatrix (terms: 124333, documents: 3)>>
Non-/sparse entries: 168191/204808
Sparsity           : 55%
Maximal term length: 261
Weighting          : term frequency (tf)
Sample             :
      Docs
Terms  en_US.blogs.txt en_US.news.txt en_US.twitter.txt
  and            54596           3436             21801
  are             9676            559              7974
  for            18139           1434             19345
  have           11046            564              8287
  that           23000           1303             11609
  the            91604           7718             46658
  this           13076            465              8061
  was            13750            862              5882
  with           14192           1073              8819
  you            14852            348             27092
<<TermDocumentMatrix (terms: 1233860, documents: 3)>>
Non-/sparse entries: 1412130/2289450
Sparsity           : 62%
Maximal term length: 356
Weighting          : term frequency (tf)
Sample             :
         Docs
Terms     en_US.blogs.txt en_US.news.txt en_US.twitter.txt
  and the            3013            236               732
  at the             2340            211              1845
  for the            2900            269              3701
  i have             2386             23              1514
  in a               2320            230              1144
  in the             7630            718              3908
  of the             9283            739              2966
  on the             3541            298              2482
  to be              3448            160              2352
  to the             4283            309              2102
<<TermDocumentMatrix (terms: 2611491, documents: 3)>>
Non-/sparse entries: 2742032/5092441
Sparsity           : 65%
Maximal term length: 373
Weighting          : term frequency (tf)
Sample             :
                    Docs
Terms                en_US.blogs.txt en_US.news.txt en_US.twitter.txt
  a lot of                       661             50               325
  going to be                    266             19               374
  i have a                       283              2               271
  i have to                      284              7               220
  i want to                      274             12               370
  it was a                       346             16               173
  looking forward to              74              1               430
  one of the                     715             55               291
  thanks for the                  11              0              1157
  to be a                        346             19               327
<<TermDocumentMatrix (terms: 3233189, documents: 3)>>
Non-/sparse entries: 3273677/6425890
Sparsity           : 66%
Maximal term length: 458
Weighting          : term frequency (tf)
Sample             :
                       Docs
Terms                   en_US.blogs.txt en_US.news.txt en_US.twitter.txt
  at the end of                     158             11                49
  cant wait to see                   16              0               148
  for the first time                 84              6                77
  going to be a                      42              8               111
  is going to be                     70              9               117
  thank you for the                   8              0               151
  thanks for the follow               0              0               295
  thanks for the rt                   0              0               175
  the end of the                    162             15                73
  the rest of the                   135              2                64
      1-gram  2-gram  3-gram  4-gram
Terms 124333 1233860 2611491 3233189

A large number of terms is observed for each n-gram, which may slow down the execution of the model.

Below are the 20 most frequent terms for each n-gram.

Coverage

The following shows how many unique terms, taken in order of decreasing frequency, are needed to cover 50% and 90% of all term instances in each n-gram.

        1-gram 2-gram  3-gram  4-gram
Cov 50%    262  36390  896745 1518444
Cov 90%  10057 890911 2268542 2890240
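These figures follow from sorting the term frequencies in decreasing order and finding the smallest number of terms whose cumulative frequency reaches the target fraction. A vectorised sketch, equivalent to the wordsum() helper in the Code section:

# Sketch: number of top-frequency terms needed to reach a coverage fraction
coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  which(cumsum(freq) >= target * sum(freq))[1]
}
# e.g. coverage(rowSums(as.matrix(gram1)), 0.5) should reproduce the 1-gram value above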

Language review

Using the hunspell package, we can evaluate how many terms come from foreign languages (i.e., are not found in an English dictionary).

Below is a summary of how many word occurrences are recognized as English, and a graph of the 20 most frequent non-English terms.

             terms
total      2676426
english    2415673
no english  260753

As can be seen, many of the terms flagged as non-English are actually English; this is due to the dictionary used (the default one that ships with the package).

Because this dictionary performs poorly for this analysis, foreign-language words will not be filtered out.

With a better dictionary, words in other languages could actually be filtered.
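For example (a sketch; extraWords.txt is a hypothetical file of additional valid words, and this assumes the add_words argument of hunspell::dictionary() is available in the installed version), the default dictionary could be extended before checking:

# Sketch: extend the default en_US dictionary with extra known-good words
extra <- readLines("extraWords.txt", encoding = "UTF-8", warn = FALSE)  # hypothetical file
enPlus <- dictionary("en_US", add_words = extra)
termEng <- hunspell_check(freqTerms, dict = enPlus)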

Remove sparse terms

To make the number of terms more manageable, sparse terms are removed using a sparsity threshold of 20% (with only three documents, this keeps only terms that appear in all three).

Below is the result of removing sparse terms from each n-gram, together with a summary of the number of terms remaining for each n-gram.

<<TermDocumentMatrix (terms: 11835, documents: 3)>>
Non-/sparse entries: 35505/0
Sparsity           : 0%
Maximal term length: 18
Weighting          : term frequency (tf)
Sample             :
      Docs
Terms  en_US.blogs.txt en_US.news.txt en_US.twitter.txt
  and            54596           3436             21801
  are             9676            559              7974
  for            18139           1434             19345
  have           11046            564              8287
  that           23000           1303             11609
  the            91604           7718             46658
  this           13076            465              8061
  was            13750            862              5882
  with           14192           1073              8819
  you            14852            348             27092
<<TermDocumentMatrix (terms: 25342, documents: 3)>>
Non-/sparse entries: 76026/0
Sparsity           : 0%
Maximal term length: 23
Weighting          : term frequency (tf)
Sample             :
         Docs
Terms     en_US.blogs.txt en_US.news.txt en_US.twitter.txt
  and the            3013            236               732
  at the             2340            211              1845
  for the            2900            269              3701
  i have             2386             23              1514
  in a               2320            230              1144
  in the             7630            718              3908
  of the             9283            739              2966
  on the             3541            298              2482
  to be              3448            160              2352
  to the             4283            309              2102
<<TermDocumentMatrix (terms: 10244, documents: 3)>>
Non-/sparse entries: 30732/0
Sparsity           : 0%
Maximal term length: 33
Weighting          : term frequency (tf)
Sample             :
                    Docs
Terms                en_US.blogs.txt en_US.news.txt en_US.twitter.txt
  a lot of                       661             50               325
  be able to                     325             15               150
  going to be                    266             19               374
  i have a                       283              2               271
  i have to                      284              7               220
  i want to                      274             12               370
  it was a                       346             16               173
  looking forward to              74              1               430
  one of the                     715             55               291
  to be a                        346             19               327
<<TermDocumentMatrix (terms: 1679, documents: 3)>>
Non-/sparse entries: 5037/0
Sparsity           : 0%
Maximal term length: 29
Weighting          : term frequency (tf)
Sample             :
                    Docs
Terms                en_US.blogs.txt en_US.news.txt en_US.twitter.txt
  at the end of                  158             11                49
  at the same time                93              9                53
  for the first time              84              6                77
  going to be a                   42              8               111
  if you want to                  73              5                70
  is going to be                  70              9               117
  is one of the                   87              6                54
  the end of the                 162             15                73
  the rest of the                135              2                64
  when it comes to               122              4                29
      1-gram 2-gram 3-gram 4-gram
Terms  11835  25342  10244   1679

As shown, the number of terms for each n-gram has decreased considerably.

        1-gram 2-gram 3-gram 4-gram
Cov 50%    157   1095   1078    231
Cov 90%   3033  10269   5845   1106

Performing the same coverage analysis on the reduced n-grams (table above), a large decrease in the number of terms needed is also observed.


Code

Load the libraries

suppressMessages(library(knitr))
suppressMessages(library(tm))
suppressMessages(library(ggplot2))
suppressMessages(library(NLP))
suppressMessages(library(tidyr))
suppressMessages(library(hunspell))

Load the data

dir <- "./SwiftKey/"
# Load the data
con <- file(paste0(dir,"en_US.blogs.txt"), "rb")
blogs <- readLines(con, encoding = "UTF-8", skipNul = T, warn = F)
close(con)
con <- file(paste0(dir,"en_US.news.txt"), "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = T, warn = F)
close(con)
con <- file(paste0(dir,"en_US.twitter.txt"), "rb")
twitter <- readLines(con, encoding = "UTF-8", skipNul = T, warn = F)
close(con)

Calculate summary statistics for the data

# Read the size of the data
sizeData <- c(file.info(paste0(dir,"en_US.blogs.txt"))$size, 
              file.info(paste0(dir,"en_US.news.txt"))$size, 
              file.info(paste0(dir,"en_US.twitter.txt"))$size)
# Count the number of words per data set
wordsBlogs <- words(blogs)
wordsNews <- words(news)
wordsTwitter <- words(twitter)
wordsData <- c(length(wordsBlogs), length(wordsNews),
               length(wordsTwitter))
# Count the number of chars per data set
charsData <- c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter)))
# Count the number of lines per data set
linesData <- c(length(blogs), length(news), length(twitter))

Show the data summary and calculate the number of characters per word

sumData <- rbind(sizeData/1048576, linesData, wordsData, charsData)
colnames(sumData) <- c("blogs", "news", "twitter")
rownames(sumData) <- c("size (MB)", "lines", "words", "chars")
# Summary of the data
t(sumData)
# Number of chars per word
charBlogs <- nchar(wordsBlogs)
charNews <- nchar(wordsNews)
charTwitter <- nchar(wordsTwitter)

Boxplot of characters per word for each document

boxplot(charBlogs, charNews, charTwitter,
        log = "y", names = c("blogs", "news", "twitter"),
        ylab = "Number of Characters (log scale)", xlab = "File Name")
title("Comparing Distributions of Characters per Word")

Display the first 10 words with more than 20 characters

wordsBlogs[charBlogs>20][1:10]
wordsNews[charNews>20][1:10]
wordsTwitter[charTwitter>20][1:10]

Create a corpus with the three documents

remove(blogs, news, twitter, wordsBlogs, wordsNews, wordsTwitter)
corpus <- VCorpus(DirSource(dir), 
                readerControl = list(language = "en"))
n=.05 # 5% of the size of each set
set.seed(50)
corpus[[1]]$content <- sample(corpus[[1]]$content, 
                              length(corpus[[1]]$content)*n)
corpus[[2]]$content <- sample(corpus[[2]]$content, 
                              length(corpus[[2]]$content)*n)
corpus[[3]]$content <- sample(corpus[[3]]$content, 
                              length(corpus[[3]]$content)*n)

Make transformations and create a document-term matrix

# Create function to modify a text pattern (perl = TRUE so the Perl-style
# non-capturing groups used below are supported)
f <- content_transformer(function(x, patt1, patt2) gsub(patt1, patt2, x, perl = TRUE))
# Download profane words from Robert J Gabriel's github
# "https://github.com/RobertJGabriel/Google-profanity-words/blob/master/list.txt"
con <- file("profaneList.txt", "rb")
profaneWords <- readLines(con, encoding = "UTF-8", skipNul = T, warn = F)
close(con)
# Remove punctuation and junk
corpus <- tm_map(corpus, f, "[[:punct:]]", "")
# Replace Unicode (curly) apostrophes with ASCII apostrophes
corpus <- tm_map(corpus, f, "[\u2018\u2019]", "'")
# Remove multiple repeating consecutive words
corpus <- tm_map(corpus, f, "\\b(\\w+)(?:\\s+\\1\\b)+", "\\1")
# Remove multiple repeating consecutive pairs of words
corpus <- tm_map(corpus, f, "\\b(\\w+\\s\\w+)(\\s\\1)+", "\\1")
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Transform to tolower
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove profane words
corpus <- tm_map(corpus, removeWords, profaneWords)
# Strip extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Create a document-term matrix from single words found in all documents
tdmCorpus <- TermDocumentMatrix(corpus)

Show the document-term matrix

inspect(tdmCorpus)

Create the n-grams

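# Tokenizers based on NLP::ngrams(): split each document into words and
# paste every run of n consecutive words back into a single term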
token1gram <- function(x) {
      unlist(lapply(ngrams(words(x), 1), paste, collapse = " "),
              use.names = FALSE)}
token2gram <- function(x) {
      unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
              use.names = FALSE)}
token3gram <- function(x) {
      unlist(lapply(ngrams(words(x), 3), paste, collapse = " "),
             use.names = FALSE)}
token4gram <- function(x) {
      unlist(lapply(ngrams(words(x), 4), paste, collapse = " "),
             use.names = FALSE)}
gram1 <- TermDocumentMatrix(corpus, control = list(tokenize = token1gram))
gram2 <- TermDocumentMatrix(corpus, control = list(tokenize = token2gram))
gram3 <- TermDocumentMatrix(corpus, control = list(tokenize = token3gram))
gram4 <- TermDocumentMatrix(corpus, control = list(tokenize = token4gram))

Show the four n-grams

inspect(gram1)
inspect(gram2)
inspect(gram3)
inspect(gram4)
ngramData <- cbind(dim(gram1)[1], dim(gram2)[1],
                   dim(gram3)[1], dim(gram4)[1])
colnames(ngramData) <- c("1-gram", "2-gram", "3-gram", "4-gram")
rownames(ngramData) <- "Terms"
ngramData

Save the n-grams to disk

saveRDS(gram1, "one_words_0.rds")
saveRDS(gram2, "two_words_0.rds")
saveRDS(gram3, "three_words_0.rds")
saveRDS(gram4, "four_words_0.rds")

Show the 20 most frequent terms of each n-gram

nn <- 20 # Number of bars to plot
# 1-gram
freqTerms <- findFreqTerms(gram1)
termFreq <- rowSums(as.matrix(gram1[freqTerms,]))
termFreq <- termFreq[order(termFreq, decreasing = TRUE)]
termFreq <- data.frame(unigram=names(head(termFreq,nn)),
                       frequency=head(termFreq,nn))
g1 <- ggplot(termFreq, aes(x=reorder(unigram, frequency), y=frequency)) +
    geom_bar(stat = "identity", fill="red") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("1-gram") + ylab("Frequency") +
    labs(title = "Top 1-grams by frequency")
print(g1)
# 2-gram
freqTerms <- findFreqTerms(gram2)
termFreq <- rowSums(as.matrix(gram2[freqTerms,]))
termFreq <- termFreq[order(termFreq, decreasing = TRUE)]
termFreq <- data.frame(bigram=names(head(termFreq,nn)),
                       frequency=head(termFreq,nn))
g2 <- ggplot(termFreq, aes(x=reorder(bigram, frequency), y=frequency)) +
    geom_bar(stat = "identity", fill="red") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("2-gram") + ylab("Frequency") +
    labs(title = "Top 2-grams by frequency")
print(g2)
# 3-gram
freqTerms <- findFreqTerms(gram3)
termFreq <- rowSums(as.matrix(gram3[freqTerms,]))
termFreq <- termFreq[order(termFreq, decreasing = TRUE)]
termFreq <- data.frame(trigram=names(head(termFreq,nn)),
                       frequency=head(termFreq,nn))
g3 <- ggplot(termFreq, aes(x=reorder(trigram, frequency), y=frequency)) +
    geom_bar(stat = "identity", fill="red") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("3-gram") + ylab("Frequency") +
    labs(title = "Top 3-grams by frequency")
print(g3)
# 4-gram
freqTerms <- findFreqTerms(gram4)
termFreq <- rowSums(as.matrix(gram4[freqTerms,]))
termFreq <- termFreq[order(termFreq, decreasing = TRUE)]
termFreq <- data.frame(fourgram=names(head(termFreq,nn)),
                       frequency=head(termFreq,nn))
g4 <- ggplot(termFreq, aes(x=reorder(fourgram, frequency), y=frequency)) +
    geom_bar(stat = "identity", fill="red") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("4-gram") + ylab("Frequency") +
    labs(title = "Top 4-grams by frequency")
print(g4)

Calculate and show n-gram coverage for 50% and 90%

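# wordsum(): number of highest-frequency terms needed to cover the given
# fraction of all term occurrences (assumes x is sorted by decreasing freq)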
wordsum <- function(x, coverage){
  totalfreq <- sum(x$freq)
  wordfreq <- 0
  for (i in 1:length(x$freq))
  {
    wordfreq <- wordfreq + as.numeric(x$freq[i])
    if (wordfreq >= coverage * totalfreq)
    {
        return (i)
    }
  }
  return (nrow(x))
}


freqTerms <- findFreqTerms(gram1)
termFreq <- rowSums(as.matrix(gram1[freqTerms,]))
G1 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G1) <- 1:dim(G1)[1]
G1 <- G1[with(G1, order(G1$freq, decreasing = TRUE)), ]

freqTerms <- findFreqTerms(gram2)
termFreq <- rowSums(as.matrix(gram2[freqTerms,]))
G2 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G2) <- 1:dim(G2)[1]
G2 <- G2[with(G2, order(G2$freq, decreasing = TRUE)), ]

freqTerms <- findFreqTerms(gram3)
termFreq <- rowSums(as.matrix(gram3[freqTerms,]))
G3 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G3) <- 1:dim(G3)[1]
G3 <- G3[with(G3, order(G3$freq, decreasing = TRUE)), ]

freqTerms <- findFreqTerms(gram4)
termFreq <- rowSums(as.matrix(gram4[freqTerms,]))
G4 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G4) <- 1:dim(G4)[1]
G4 <- G4[with(G4, order(G4$freq, decreasing = TRUE)), ]

ngramData <- rbind(cbind(wordsum(G1,0.5), wordsum(G2,0.5),
                         wordsum(G3,0.5),
                         wordsum(G4,0.5)), 
                   cbind(wordsum(G1,0.9),
                         wordsum(G2,0.9),
                         wordsum(G3,0.9),
                         wordsum(G4,0.9)))
colnames(ngramData) <- c("1-gram", "2-gram", "3-gram", "4-gram")
rownames(ngramData) <- c("Cov 50%", "Cov 90%")
ngramData

Evaluate how many terms are in English

en <- dictionary("en_US")

freqTerms <- findFreqTerms(gram1)
termFreq <- rowSums(as.matrix(gram1[freqTerms,]))
termEng <- hunspell_check(freqTerms, dict = en)
language <- rbind(sum(termFreq),
                   sum(termFreq[termEng]),
                   sum(termFreq[!termEng]))
colnames(language) <- "terms"
rownames(language) <- c("total", "english",  "no english")
language

freqTerms <- freqTerms[!termEng]
termFreq <- termFreq[!termEng]
termFreq <- termFreq[order(termFreq, decreasing = TRUE)]
termFreq <- data.frame(unigram=names(head(termFreq,20)),
                       frequency=head(termFreq,20))
g1 <- ggplot(termFreq, aes(x=reorder(unigram, frequency), y=frequency)) +
    geom_bar(stat = "identity", fill="red") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("1-gram") + ylab("Frequency") +
    labs(title = "Top 1-grams by frequency")
print(g1)

Decrease the number of terms of each n-gram

# Remove sparse terms: keep only terms with at most 20% sparsity
# (here, terms present in all three documents)
gram1 <- removeSparseTerms(gram1, .2)
gram2 <- removeSparseTerms(gram2, .2)
gram3 <- removeSparseTerms(gram3, .2)
gram4 <- removeSparseTerms(gram4, .2)

inspect(gram1)
inspect(gram2)
inspect(gram3)
inspect(gram4)

ngramData <- cbind(dim(gram1)[1], dim(gram2)[1],
                   dim(gram3)[1], dim(gram4)[1])
colnames(ngramData) <- c("1-gram", "2-gram", "3-gram", "4-gram")
rownames(ngramData) <- "Terms"
ngramData

Calculate and show the coverage of the new n-grams for 50% and 90%

# Reuse the wordsum() helper defined in the coverage section above


freqTerms <- findFreqTerms(gram1)
termFreq <- rowSums(as.matrix(gram1[freqTerms,]))
G1 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G1) <- 1:dim(G1)[1]
G1 <- G1[with(G1, order(G1$freq, decreasing = TRUE)), ]

freqTerms <- findFreqTerms(gram2)
termFreq <- rowSums(as.matrix(gram2[freqTerms,]))
G2 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G2) <- 1:dim(G2)[1]
G2 <- G2[with(G2, order(G2$freq, decreasing = TRUE)), ]

freqTerms <- findFreqTerms(gram3)
termFreq <- rowSums(as.matrix(gram3[freqTerms,]))
G3 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G3) <- 1:dim(G3)[1]
G3 <- G3[with(G3, order(G3$freq, decreasing = TRUE)), ]

freqTerms <- findFreqTerms(gram4)
termFreq <- rowSums(as.matrix(gram4[freqTerms,]))
G4 <- data.frame(word = freqTerms, freq = termFreq)
row.names(G4) <- 1:dim(G4)[1]
G4 <- G4[with(G4, order(G4$freq, decreasing = TRUE)), ]

ngramData <- rbind(cbind(wordsum(G1,0.5), wordsum(G2,0.5),
                         wordsum(G3,0.5),
                         wordsum(G4,0.5)), 
                   cbind(wordsum(G1,0.9),
                         wordsum(G2,0.9),
                         wordsum(G3,0.9),
                         wordsum(G4,0.9)))
colnames(ngramData) <- c("1-gram", "2-gram", "3-gram", "4-gram")
rownames(ngramData) <- c("Cov 50%", "Cov 90%")
ngramData

Save new n-grams to disk

saveRDS(gram1, "one_words_1.rds")
saveRDS(gram2, "two_words_1.rds")
saveRDS(gram3, "three_words_1.rds")
saveRDS(gram4, "four_words_1.rds")

Session info

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Chile.1252  LC_CTYPE=Spanish_Chile.1252   
[3] LC_MONETARY=Spanish_Chile.1252 LC_NUMERIC=C                  
[5] LC_TIME=Spanish_Chile.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] hunspell_3.0  tidyr_0.8.2   ggplot2_3.1.0 tm_0.7-6      NLP_0.2-0    
[6] knitr_1.21   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       compiler_3.5.2   pillar_1.3.1     plyr_1.8.4      
 [5] bindr_0.1.1      tools_3.5.2      digest_0.6.18    evaluate_0.12   
 [9] tibble_2.0.0     gtable_0.2.0     pkgconfig_2.0.2  rlang_0.3.1     
[13] yaml_2.2.0       parallel_3.5.2   xfun_0.4         bindrcpp_0.2.2  
[17] withr_2.1.2      stringr_1.3.1    dplyr_0.7.8      xml2_1.2.0      
[21] grid_3.5.2       tidyselect_0.2.5 glue_1.3.0       R6_2.3.0        
[25] rmarkdown_1.11   purrr_0.2.5      magrittr_1.5     scales_1.0.0    
[29] htmltools_0.3.6  assertthat_0.2.0 colorspace_1.3-2 stringi_1.2.4   
[33] lazyeval_0.2.1   munsell_0.5.0    slam_0.1-44      crayon_1.3.4