Goals of the analysis

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data before building the first linguistic models.

Motivation

  • Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  • Create a basic report of summary statistics about the data sets.
  • Report any interesting findings that you have amassed so far.
  • Get feedback on your plans for creating a prediction algorithm and Shiny app.

Tasks

Exploratory analysis - perform a thorough exploratory analysis of the data to understand the distribution of words and the relationships between words in the corpora. Understand the frequencies of words and word pairs, and build figures and tables to show the variation in these frequencies across the data.

Questions

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How do you evaluate how many of the words come from foreign languages?
  5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Data Loading

For the exploratory analysis we will first try to load all the data; in later phases, due to capacity limitations, we will work with a sampled version of the corpus (one possible sampling approach is sketched below).
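
A sampled version can be drawn directly from disk with LaF::sample_lines; the sketch below (not part of the analysis in this report) assumes the same ~1% fraction used for the exploratory sampling later on:

# Possible approach for later phases: sample ~1% of lines directly from disk
# without loading the whole file into memory.
library(LaF)
n_total <- determine_nlines("R/Data/final/en_US/en_US.blogs.txt")
sampled <- sample_lines("R/Data/final/en_US/en_US.blogs.txt",
                        n = round(0.01 * n_total),
                        nlines = n_total)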

Exploratory analysis

Basic summaries of the three files: Word counts, line counts and basic data tables.

Files <- dir(
  path = file.path("R/Data/final/en_US/"),
  pattern = ".*\\.txt",
  all.files = FALSE,
  full.names = TRUE,
  recursive = FALSE,
  ignore.case = FALSE,
  include.dirs = FALSE,
  no.. = FALSE
)

Lines <- unlist(lapply(Files,  LaF::determine_nlines))

tableSizes <- data.frame(File = Files,
                         Lines = Lines )

formattable(tableSizes, align = c("l","c"))
File                                    Lines
R/Data/final/en_US/en_US.blogs.txt     899288
R/Data/final/en_US/en_US.news.txt     1010242
R/Data/final/en_US/en_US.twitter.txt  2360148

en_US.blogs.txt

Check content sample:

index = 1
full_txt <- LaF::sample_lines(tableSizes$File[index], tableSizes$Lines[index], tableSizes$Lines[index])
num.lines <- tableSizes$Lines[index]
head(full_txt)
## [1] "Mouse Pad Where the mouse takes the grain it does not eat\r"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
## [2] "c. A knife\r"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
## [3] "As I like to put it the book has a side of romance. The main story doesn't fully focus on Mia and Jeremy's interest in each other. For me it was the perfect about, just enough for me to become invested in their happiness together.\r"                                                                                                                                                                                                                                                                                                                                                                                                                             
## [4] "LOVE!!! I chose a cognac color, yet another great neutral. I love the design and can see myself rocking them with tons of looks. The Vince Camuto Baron sandal also comes in black but I typically don't wear black shoes in Spring or Summer.\r"                                                                                                                                                                                                                                                                                                                                                                                                                     
## [5] "Bart was good and stayed home unless he caught sent of a female in heat and then he was off to have him some fun. Not long after Jennifer and i settled in a friend got divorced and decided to move down to Corpus Christi. Somehow I got possession of his Chesapeake Bay Retriever Rosey in the deal and man did she love the water. Especially the warm water of my closest neighbor's hot tub. Bart didn't care for the water much but he got excited to see Rosey swim around. And when he got excited he drooled even more.\r"                                                                                                                                 
## [6] "What emanates now from important Christian leaders, not to speak of such bona fide faithful as the Johnson, Carter, Kennedy, Bush, Blair, Brown and Atlee names adduced in a previous chapter, is something entirely different: it is secular solipsism and utopian eschatology masquerading as enlightened Christianity. It harks back to 2nd century Gnostics and not to the great Christian thinkers of the Renaissance or Enlightenment. Itâ\200\231s the lamb lying down with the lion thatâ\200\231s still a lion. Itâ\200\231s redistributive socialism and minorities-exalting Progressive doctrine that has nothing to do with the recognition of Godâ\200\231s presence on Earth.\r"

en_US.blogs.txt contains 899288 lines.

If we take a ~1% random sample of the lines and tokenize each sampled line into individual words (1-grams), we can summarize the number of words per line:

set.seed(123)
sample_corpus_content <- full_txt[rbinom(n = tableSizes$Lines[index],
                                   size = 1,
                                   prob = 0.01) > 0]
list.words <- tokenizers::tokenize_ngrams(
  sample_corpus_content,
  lowercase = TRUE,
  n = 1L,
  n_min = 1L,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

summary(unlist(lapply(list.words, length)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       9      28      42      59    2369
num.words <- sum(unlist(lapply(list.words, length)))

df.words.unique <- data.frame(grams=unlist(list.words)) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count)) 

num.words.unique <- length(df.words.unique$grams)

The total number of words in this sample is 373995, of which only 30030 are unique.

If we look at the number of characters per line, we get:

list.char.all <-
    tokenizers::tokenize_characters( sample_corpus_content,
                                     strip_non_alphanum = FALSE )

summary(unlist(lapply(list.char.all, length)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2      47     150     227     322   11380
df.char.all.unique <- data.frame(chars=unlist(list.char.all)) %>%
  group_by(chars) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count))

num.char.all.unique <- length(df.char.all.unique$chars)

The total number of characters in the sample is 2043773, but only 142 distinct characters appear.

We can also analyze the number of characters per word:

summary(unlist(lapply(list.words, count_char_list))) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       3       4       4       6      39
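
count_char_list is a small helper defined outside this report; a plausible minimal version (an assumption, not the original definition) is:

# Assumed helper: number of characters of each word in a tokenized line
count_char_list <- function(words) nchar(words)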

Let’s identify non-alphanumeric characters. If we recount the characters per line, excluding non-alphanumeric characters, we get:

list.char.strip <-
  tokenizers::tokenize_characters(sample_corpus_content,
                                  strip_non_alphanum = TRUE)

summary(unlist(lapply(list.char.strip, length)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      36     118     180     256    9253

Punctuation characters:

df.char.strip.unique <- data.frame(chars=unlist(list.char.strip)) %>%
  group_by(chars) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count))

num.char.strip.unique <- length(df.char.strip.unique$chars)

punctuation <- df.char.all.unique$chars[!df.char.all.unique$chars %in% df.char.strip.unique$chars]

punctuation <- punctuation[!punctuation %in% c(" ", "'") ]

punctuation
##  [1] "."  ","  "\r" "-"  "!"  ")"  "("  "\"" ":"  "?"  "/"  "“"  ";"  "&"  "*" 
## [16] "”"  "‚"  "%"  "«"  "§"  "—"  "#"  "_"  "‹"  "„"  "@"  "–"  "»"  "‰"  "¡" 
## [31] "·"  "¿"  "¶"  " "  "›"  "†"  "…"  "’"  "‘"  "‡"  "•"  "["  "]"  "{"  "}"

Other symbols:

other_symbols <- df.char.all.unique$chars[!grepl("[A-Za-z]",df.char.all.unique$chars,perl = T)]

other_symbols <- unique(other_symbols[!other_symbols %in% c(" ", "'") ])

other_symbols
##   [1] "."      ","      "\r"     "â"      "\200"      "\231"      "-"      "0"     
##   [9] "1"      "!"      ")"      "("      "2"      "\""     ":"      "?"     
##  [17] "œ"      "\235"      "3"      "9"      "5"      "¦"      "ã"      "4"     
##  [25] "/"      "8"      "“"      "6"      ";"      "7"      "\201"      "\230"     
##  [33] "&"      "*"      "”"      "‚"      "$"      "%"      "ƒ"      "å"     
##  [41] "="      "+"      "æ"      "ä"      "«"      "§"      "©"      "¢"     
##  [49] "£"      "³"      "—"      "#"      "_"      "‹"      "º"      "ÿ"     
##  [57] "„"      "¯"      "~"      "°"      "š"      "è"      "¨"      "®"     
##  [65] "¹"      "@"      "\210"      "¸"      "ç"      "¬"      "–"      "»"     
##  [73] "ª"      "¼"      "¾"      "×"      "‰"      "¡"      "´"      "¥"     
##  [81] "·"      "\215"      "½"      "á"      "­"      "\217"      "¿"      "¤"     
##  [89] "¶"      " "      "›"      "†"      "ï"      "é"      "…"      "ž"     
##  [97] "’"      "²"      "‘"      "µ"      "\220"      "±"      "`"      "^"     
## [105] "‡"      "•"      "["      "ë"      "]"      "|"      "{"      "}"     
## [113] "í"      "ì"

Data cleaning. In order to increase coverage and reduce the size of the dictionary, we are going to:

  1. Eliminate punctuation and other symbols.
# 1. Remove symbols from words
other_symbols_pattern <- paste(quotemeta(c(punctuation, other_symbols)), collapse = "|")

df.clean.words.unique <- data.frame(grams=unlist(lapply(list.words,remove_symbols_ext))) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count))

We reduce the number of unique words from 30030 to 26983.
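
quotemeta and remove_symbols_ext are helper functions defined outside this report; a plausible minimal version of each (an assumption, not the original code) is:

# Assumed helpers (the originals are not shown in this report):
# quotemeta escapes regex metacharacters so raw symbols can be joined into a
# pattern, and remove_symbols_ext strips that pattern from each word.
quotemeta <- function(x) gsub("(\\W)", "\\\\\\1", x, perl = TRUE)
remove_symbols_ext <- function(words) {
  cleaned <- gsub(other_symbols_pattern, "", words)
  cleaned[nchar(cleaned) > 0]
}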

  2. Check for valid English words:
# 2. Check dictionary...
df.valid.words.unique <- df.clean.words.unique[hunspell_check(df.clean.words.unique$grams),]

We reduce further the number of unique words from 26983 to 19503.

  3. Remove profanity terms.
# 3. Remove Profanity (if any left)
df.final.words.unique <- df.valid.words.unique[profanity(df.valid.words.unique$grams)$profanity_count==0,]

We reduce further the number of unique words from 19503 to 19455.

  4. Final cleaning: ensure that no NA values or invalid one-character words are left
# 4. Final cleaning
df.final.words.unique <- df.final.words.unique[!is.na(nchar(df.final.words.unique$grams)) & 
                                                 ( nchar(df.final.words.unique$grams)>1 | 
                                                     df.final.words.unique$grams %in% c("a","i","o")),]

After all the cleaning performed above, the total word count went from 373995 to 346930 (92.76% retained).

Looking at unique words, we went from 30030 to 19431 (64.71% retained).

The cleaned word set has a different distribution of the number of characters per unique word:

summary(unlist(lapply(df.final.words.unique$grams,nchar)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     6.0     7.0     7.4     9.0    19.0

The distribution of characters across all word instances has also changed:

lw.vector <- unlist(list.words)
summary(unlist(lapply(lw.vector[lw.vector%in%df.final.words.unique$grams],nchar)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0     4.0     4.3     6.0    19.0
df.final.words.unique <- df.final.words.unique %>%
  mutate(
    cum.perc = 100 * cumsum(count) / sum(count)
  )

frequencies[["words"]][["en_US.blogs.txt"]] <- df.final.words.unique
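# Note: the `frequencies` list is assumed to have been initialized earlier in
# the source document (not shown in this report), for example:
# frequencies <- list()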

toberm <- ls()

rm(list = toberm[!toberm %in% c(
  "tableSizes",
  "frequencies",
  "quotemeta",
  "remove_symbols_ext",
  "count_char_list",
  "bigram_filter",
  "trigram_filter",
  "sample_corpus_content"
)])

gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  994986   53    3211030  172  3211030  172
## Vcells 3040147   23   41115382  314 42562709  325
list.bigram <- tokenizers::tokenize_ngrams(
  sample_corpus_content,
  lowercase = TRUE,
  n = 2L,
  n_min = 2L,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

df.clean.bigram.unique <-
  data.frame(grams = unlist(lapply(list.bigram, remove_symbols_ext))) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(
    cum.perc = 100 * cumsum(count) / sum(count)
  ) %>%
  filter(cum.perc<95 & count > 4) 

frequencies[["bigram"]][["en_US.blogs.txt"]] <-
  df.clean.bigram.unique[unlist(lapply(df.clean.bigram.unique$grams, bigram_filter)),]
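
bigram_filter (and trigram_filter, used below) are helpers defined outside this report; a plausible sketch of their intent, assuming they keep only n-grams whose tokens are all valid dictionary words, is:

# Assumed helpers (originals not shown): keep only n-grams whose individual
# tokens all pass the English dictionary check.
bigram_filter  <- function(gram) all(hunspell::hunspell_check(strsplit(gram, " ")[[1]]))
trigram_filter <- bigram_filter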

toberm <- ls()

rm(list = toberm[!toberm %in% c(
  "tableSizes",
  "frequencies",
  "quotemeta",
  "remove_symbols_ext",
  "count_char_list",
  "bigram_filter",
  "trigram_filter",
  "sample_corpus_content"
)])

gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1010442   54    3211032  172  3211032  172
## Vcells 5244642   40   26313845  201 42562709  325
list.trigram <- tokenizers::tokenize_ngrams(
  sample_corpus_content,
  lowercase = TRUE,
  n = 3L,
  n_min = 3L,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

df.clean.trigram.unique <-
  data.frame(grams = unlist(lapply(list.trigram, remove_symbols_ext))) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(
    cum.perc = 100 * cumsum(count) / sum(count)
  ) %>%
  filter(cum.perc<80 & count > 2) 

frequencies[["trigram"]][["en_US.blogs.txt"]] <-
  df.clean.trigram.unique[unlist(lapply(df.clean.trigram.unique$grams, trigram_filter)),]

toberm <- ls()

rm(list = toberm[!toberm %in% c(
  "tableSizes",
  "frequencies",
  "quotemeta",
  "remove_symbols_ext",
  "count_char_list",
  "bigram_filter",
  "trigram_filter"
)])

gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1004670   54    3211032  172  3211032  172
## Vcells 3417534   26   21051076  161 42562709  325

en_US.news.txt

Check content sample:

index = 2
full_txt <- LaF::sample_lines(tableSizes$File[index], tableSizes$Lines[index], tableSizes$Lines[index])
num.lines <- tableSizes$Lines[index]
head(full_txt)
## [1] "This offer is open to all New Jersey homeowners and the public is welcome to this event. This is a great time to invite friends and neighbors to join us to learn about Christ Church Newton, while also learning if their home might qualify for a no-cost solar installation.\r"                                                                                     
## [2] "The posted wait time was 60 minutes, and we maneuvered through the slick walkway (it winds beneath the ride, and spillover falls onto the queue). The line didn't look that long, but we failed to factor in the hordes of Universal Express users who enter from another direction. The standstill produced an unexpected and unpleasant damp experience â\200” sweat.\r"
## [3] "The lawyer, Richard Sparaco, said he could not comment on plea discussions. But he said Tatar's letter was neither surprising nor troubling.\r"                                                                                                                                                                                                                        
## [4] "Today, we are announcing another 300 jobs, building on our $163-million investment in the Detroit area and adding to what is now the largest concentration of GE IT experts anywhere in the world. With this announcement, GE will be bringing its employment there to 1,400. GE employment will reach about 3,800 in Michigan in the next few years.\r"               
## [5] "This disorder affects mostly men older than 50 and causes them to thrash about wildly, almost violently. They can hurt their bed partners. During the agitated period, the person is still experiencing vivid, frightening dreams that center on fighting or fleeing a threatening encounter. D.S.'s husband can relate his dreams to her the following morning.\r"    
## [6] "The second came late in the second quarter, after another Osweiler and Willie 34-yard completion. ASU marched down to the Illini 15, but Alex Garoutte went wide left on a 32-yard field goal.\r"

en_US.news.txt contains 1010242 lines.

If we take a ~1% random sample of the lines and tokenize each sampled line into individual words (1-grams), we can summarize the number of words per line:

set.seed(124) 
sample_corpus_content <- full_txt[rbinom(n = tableSizes$Lines[index],
                                   size = 1,
                                   prob = 0.01) > 0]
list.words <- tokenizers::tokenize_ngrams(
  sample_corpus_content,
  lowercase = TRUE,
  n = 1L,
  n_min = 1L,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

summary(unlist(lapply(list.words, length)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      19      32      35      46     250
num.words <- sum(unlist(lapply(list.words, length)))

df.words.unique <- data.frame(grams=unlist(list.words)) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count)) 

num.words.unique <- length(df.words.unique$grams)

The total number of words in this sample is 351011, of which only 31847 are unique.

If we look at the number of characters per line, we get:

list.char.all <-
  tokenizers::tokenize_characters( sample_corpus_content,
                                   strip_non_alphanum = FALSE )

summary(unlist(lapply(list.char.all, length)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3     112     187     203     269    1364
df.char.all.unique <- data.frame(chars=unlist(list.char.all)) %>%
  group_by(chars) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count))

num.char.all.unique <- length(df.char.all.unique$chars)

The total number of characters in the sample is 2053207, but only 104 distinct characters appear.

We can also analyze the number of characters per word:

summary(unlist(lapply(list.words, count_char_list))) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       3       4       5       6      35

Let’s identify non-alphanumeric characters. If we recount the characters per line, excluding non-alphanumeric characters, we get:

list.char.strip <-
  tokenizers::tokenize_characters(sample_corpus_content,
                                  strip_non_alphanum = TRUE)

summary(unlist(lapply(list.char.strip, length)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2      88     150     163     215    1048

Punctuation characters:

df.char.strip.unique <- data.frame(chars=unlist(list.char.strip)) %>%
  group_by(chars) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count))

num.char.strip.unique <- length(df.char.strip.unique$chars)

punctuation <- df.char.all.unique$chars[!df.char.all.unique$chars %in% df.char.strip.unique$chars]

punctuation <- punctuation[!punctuation %in% c(" ", "'") ]

punctuation
##  [1] "."  ","  "\r" "\"" "-"  ":"  ")"  "("  "”"  ";"  "?"  "/"  "“"  "&"  "!" 
## [16] "’"  "%"  "*"  "–"  "—"  "@"  "•"  "_"  "#"  "¿"  "…"  "¡"  "‘"  "»"  "‹" 
## [31] "‰"  "‚"  "¶"  "["  "]"  "«"  "§"

Other symbols:

other_symbols <- df.char.all.unique$chars[!grepl("[A-Za-z]",df.char.all.unique$chars,perl = T)]

other_symbols <- unique(other_symbols[!other_symbols %in% c(" ", "'") ])

other_symbols
##  [1] "."      ","      "\r"     "\""     "-"      "0"      "1"      "2"     
##  [9] "â"      "\200"      "5"      "3"      "9"      "4"      ":"      "6"     
## [17] ")"      "8"      "("      "7"      "$"      "\231"      "”"      ";"     
## [25] "œ"      "?"      "\235"      "="      "/"      "“"      "ã"      "&"     
## [33] "!"      "’"      "%"      "¸"      "\230"      "*"      "©"      "¦"     
## [41] "–"      "—"      "@"      "•"      "½"      "_"      "`"      "ï"     
## [49] "#"      "¿"      "+"      "¢"      "…"      "\210"      "¡"      "¨"     
## [57] "‘"      "±"      "»"      "‹"      "‰"      "š"      "­"      "‚"     
## [65] "¬"      "®"      "°"      "¶"      "¼"      "["      "]"      "^"     
## [73] "´"      "«"      "§"      "³"

Data cleaning. In order to increase coverage and reduce the size of the dictionary, we are going to:

  1. Eliminate punctuation and other symbols.
# 1. Remove symbols from words
other_symbols_pattern <- paste(quotemeta(c(punctuation, other_symbols)), collapse = "|")

df.clean.words.unique <- data.frame(grams=unlist(lapply(list.words,remove_symbols_ext))) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count))

We reduce the number of unique words from 31847 to 29163.

  2. Check for valid English words:
# 2. Check dictionary...
df.valid.words.unique <- df.clean.words.unique[hunspell_check(df.clean.words.unique$grams),]

We reduce further the number of unique words from 29163 to 19283.

  3. Remove profanity terms.
# 3. Remove Profanity (if any left)
df.final.words.unique <- df.valid.words.unique[profanity(df.valid.words.unique$grams)$profanity_count==0,]

We reduce further the number of unique words from 19283 to 19261.

  4. Final cleaning: ensure that no NA values or invalid one-character words are left
# 4. Final cleaning
df.final.words.unique <- df.final.words.unique[!is.na(nchar(df.final.words.unique$grams)) & 
                                                 ( nchar(df.final.words.unique$grams)>1 | 
                                                     df.final.words.unique$grams %in% c("a","i","o")),]

After all the cleaning performed above, the total word count went from 351011 to 313408 (89.29% retained).

Looking at unique words, we went from 31847 to 19237 (60.4% retained).

The cleaned word set has a different distribution of the number of characters per unique word:

summary(unlist(lapply(df.final.words.unique$grams,nchar)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     6.0     7.0     7.5     9.0    20.0

The distribution of characters across all word instances has also changed:

lw.vector <- unlist(list.words)
summary(unlist(lapply(lw.vector[lw.vector%in%df.final.words.unique$grams],nchar)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0     4.0     4.6     6.0    20.0
df.final.words.unique <- df.final.words.unique %>%
  mutate(
    cum.perc = 100 * cumsum(count) / sum(count)
  )

frequencies[["words"]][["en_US.news.txt"]] <- df.final.words.unique

toberm <- ls()

rm(list = toberm[!toberm %in% c(
  "tableSizes",
  "frequencies",
  "quotemeta",
  "remove_symbols_ext",
  "count_char_list",
  "bigram_filter",
  "trigram_filter",
  "sample_corpus_content"
)])

gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1019805   54    3114591  166  3893238  208
## Vcells 3173494   24   40347962  308 46052927  351
list.bigram <- tokenizers::tokenize_ngrams(
  sample_corpus_content,
  lowercase = TRUE,
  n = 2L,
  n_min = 2L,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

df.clean.bigram.unique <-
  data.frame(grams = unlist(lapply(list.bigram, remove_symbols_ext))) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(
    cum.perc = 100 * cumsum(count) / sum(count)
  ) %>%
  filter(cum.perc<95 & count > 4) 

frequencies[["bigram"]][["en_US.news.txt"]] <-
  df.clean.bigram.unique[unlist(lapply(df.clean.bigram.unique$grams, bigram_filter)),]

toberm <- ls()

rm(list = toberm[!toberm %in% c(
  "tableSizes",
  "frequencies",
  "quotemeta",
  "remove_symbols_ext",
  "count_char_list",
  "bigram_filter",
  "trigram_filter",
  "sample_corpus_content"
)])

gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1023540   55    3114591  166  3893238  208
## Vcells 3484079   27   20658157  158 46052927  351
list.trigram <- tokenizers::tokenize_ngrams(
  sample_corpus_content,
  lowercase = TRUE,
  n = 3L,
  n_min = 3L,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

df.clean.trigram.unique <-
  data.frame(grams = unlist(lapply(list.trigram, remove_symbols_ext))) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(
    cum.perc = 100 * cumsum(count) / sum(count)
  ) %>%
  filter(cum.perc<80 & count > 2) 

frequencies[["trigram"]][["en_US.news.txt"]] <-
  df.clean.trigram.unique[unlist(lapply(df.clean.trigram.unique$grams, trigram_filter)),]

toberm <- ls()

rm(list = toberm[!toberm %in% c(
  "tableSizes",
  "frequencies",
  "quotemeta",
  "remove_symbols_ext",
  "count_char_list",
  "bigram_filter",
  "trigram_filter"
)])

gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1021210   55    3114591  166  3893238  208
## Vcells 5182731   40   20658157  158 46052927  351

en_US.twitter.txt

Check content sample:

index = 3
full_txt <- LaF::sample_lines(tableSizes$File[index], tableSizes$Lines[index], tableSizes$Lines[index])
num.lines <- tableSizes$Lines[index]
head(full_txt)
## [1] "I already want a new one but I wanna dye my hair first\r"                                                                                 
## [2] "My HP mug is magic. Before I pour coffee in, it says \"I solemnly swear that I'm up to no good\" and after it says \"mischief managed\"\r"
## [3] "i even made a resume & it still didnt work...\r"                                                                                          
## [4] "congrats on passing 3000 followers\r"                                                                                                     
## [5] "And im mean. Just called peoples moms ugly!\r"                                                                                            
## [6] "So after a day of sitting on it, that show last night was weird as hell\r"

en_US.twitter.txt contains 2360148 lines.

If we take a ~1% random sample of the lines and tokenize each sampled line into individual words (1-grams), we can summarize the number of words per line:

set.seed(124) 
sample_corpus_content <- full_txt[rbinom(n = tableSizes$Lines[index],
                                   size = 1,
                                   prob = 0.01) > 0]
list.words <- tokenizers::tokenize_ngrams(
  sample_corpus_content,
  lowercase = TRUE,
  n = 1L,
  n_min = 1L,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

summary(unlist(lapply(list.words, length)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       7      12      13      18      37
num.words <- sum(unlist(lapply(list.words, length)))

df.words.unique <- data.frame(grams=unlist(list.words)) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count)) 

num.words.unique <- length(df.words.unique$grams)

The total number of words in this sample is 303107, of which only 26074 are unique.

If we look at the number of characters per line, we get:

list.char.all <-
  tokenizers::tokenize_characters( sample_corpus_content,
                                   strip_non_alphanum = FALSE )

summary(unlist(lapply(list.char.all, length)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3      38      65      70     101     152
df.char.all.unique <- data.frame(chars=unlist(list.char.all)) %>%
  group_by(chars) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count))

num.char.all.unique <- length(df.char.all.unique$chars)

The total number of characters in the sample is 1651368, but only 143 distinct characters appear.

We can also analyze the number of characters per word:

summary(unlist(lapply(list.words, count_char_list))) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       3       4       4       5      41

Let’s identify non-alphanumeric characters. If we recount the characters per line, excluding non-alphanumeric characters, we get:

list.char.strip <-
  tokenizers::tokenize_characters(sample_corpus_content,
                                  strip_non_alphanum = TRUE)

summary(unlist(lapply(list.char.strip, length)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      28      50      54      78     127

Punctuation characters:

df.char.strip.unique <- data.frame(chars=unlist(list.char.strip)) %>%
  group_by(chars) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count))

num.char.strip.unique <- length(df.char.strip.unique$chars)

punctuation <- df.char.all.unique$chars[!df.char.all.unique$chars %in% df.char.strip.unique$chars]

punctuation <- punctuation[!punctuation %in% c(" ", "'") ]

punctuation
##  [1] "."  "\r" "!"  ","  "?"  ":"  "-"  "\"" "#"  ")"  "&"  "("  "/"  ";"  "_" 
## [16] "*"  "@"  "%"  "’"  "“"  "]"  "”"  "["  "‘"  "‰"  "«"  "»"  "‚"  "¡"  "•" 
## [31] "—"  "„"  "…"  "¿"  "\\" "{"  "}"  "†"  "–"  "¶"  " "  "‹"  "·"  "‡"  "§" 
## [46] "›"

Other symbols:

other_symbols <- df.char.all.unique$chars[!grepl("[A-Za-z]",df.char.all.unique$chars,perl = T)]

other_symbols <- unique(other_symbols[!other_symbols %in% c(" ", "'") ])

other_symbols
##   [1] "."      "\r"     "!"      ","      "?"      ":"      "-"      "\""    
##   [9] "#"      "1"      ")"      "0"      "2"      "3"      "&"      "("     
##  [17] "â"      "/"      "\200"      "5"      "4"      ";"      "<"      "9"     
##  [25] "8"      "6"      "7"      "\235"      "œ"      "_"      "$"      "*"     
##  [33] "ð"      "ÿ"      ">"      "\231"      "\230"      "="      "@"      "%"     
##  [41] "~"      "+"      "¥"      "ã"      "¦"      "’"      "î"      "^"     
##  [49] "“"      "]"      "¤"      "\215"      "\201"      "\220"      "”"      "|"     
##  [57] "["      "‘"      "š"      "ž"      "‰"      "«"      "»"      "º"     
##  [65] "‚"      "©"      "¡"      "¢"      "•"      "—"      "„"      "…"     
##  [73] "±"      "\217"      "¿"      "£"      "½"      "ï"      "­"      "®"     
##  [81] "`"      "\\"     "{"      "}"      "°"      "†"      "³"      "–"     
##  [89] "¶"      "¾"      "æ"      "ƒ"      " "      "¨"      "¸"      "\210"     
##  [97] "‹"      "·"      "è"      "¯"      "´"      "‡"      "¼"      "²"     
## [105] "ª"      "ä"      "ñ"      "§"      "¬"      "à"      "›"      "µ"     
## [113] "¹"      "å"      "é"

Data cleaning. In order to increase coverage and reduce the size of the dictionary, we are going to:

  1. Eliminate punctuation and other symbols.
# 1. Remove symbols from words
other_symbols_pattern <- paste(quotemeta(c(punctuation, other_symbols)), collapse = "|")

df.clean.words.unique <- data.frame(grams=unlist(lapply(list.words,remove_symbols_ext))) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count))

We reduce the number of unique words from 26074 to 24516.

  2. Check for valid English words:
# 2. Check dictionary...
df.valid.words.unique <- df.clean.words.unique[hunspell_check(df.clean.words.unique$grams),]

We reduce further the number of unique words from 24516 to 14334.

  3. Remove profanity terms.
# 3. Remove Profanity (if any left)
df.final.words.unique <- df.valid.words.unique[profanity(df.valid.words.unique$grams)$profanity_count==0,]

We reduce further the number of unique words from 14334 to 14281.

  4. Final cleaning: ensure that no NA values or invalid one-character words are left
# 4. Final cleaning
df.final.words.unique <- df.final.words.unique[!is.na(nchar(df.final.words.unique$grams)) & 
                                                 ( nchar(df.final.words.unique$grams)>1 | 
                                                     df.final.words.unique$grams %in% c("a","i","o")),]

After all the cleaning performed above, the total word count went from 303107 to 269558 (88.93% retained).

Looking at unique words, we went from 26074 to 14257 (54.68% retained).

The cleaned word set has a different distribution of the number of characters per unique word:

summary(unlist(lapply(df.final.words.unique$grams,nchar)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       5       7       7       8      18

The distribution of characters across all word instances has also changed:

lw.vector <- unlist(list.words)
summary(unlist(lapply(lw.vector[lw.vector%in%df.final.words.unique$grams],nchar)))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     3.0     4.0     4.1     5.0    18.0
df.final.words.unique <- df.final.words.unique %>%
  mutate(
    cum.perc = 100 * cumsum(count) / sum(count)
  )

frequencies[["words"]][["en_US.twitter.txt"]] <- df.final.words.unique

toberm <- ls()

rm(list = toberm[!toberm %in% c(
  "tableSizes",
  "frequencies",
  "quotemeta",
  "remove_symbols_ext",
  "count_char_list",
  "bigram_filter",
  "trigram_filter",
  "sample_corpus_content"
)])

gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1041427   56    4808444  257  6010554  321
## Vcells 4303016   33   51459728  393 46052927  351
list.bigram <- tokenizers::tokenize_ngrams(
  sample_corpus_content,
  lowercase = TRUE,
  n = 2L,
  n_min = 2L,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

df.clean.bigram.unique <-
  data.frame(grams = unlist(lapply(list.bigram, remove_symbols_ext))) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(
    cum.perc = 100 * cumsum(count) / sum(count)
  ) %>%
  filter(cum.perc<95 & count > 4) 

frequencies[["bigram"]][["en_US.twitter.txt"]] <-
  df.clean.bigram.unique[unlist(lapply(df.clean.bigram.unique$grams, bigram_filter)),]

toberm <- ls()

rm(list = toberm[!toberm %in% c(
  "tableSizes",
  "frequencies",
  "quotemeta",
  "remove_symbols_ext",
  "count_char_list",
  "bigram_filter",
  "trigram_filter",
  "sample_corpus_content"
)])

gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1050276   56    3846756  206  6010554  321
## Vcells 7032980   54   41167783  314 46052927  351
list.trigram <- tokenizers::tokenize_ngrams(
  sample_corpus_content,
  lowercase = TRUE,
  n = 3L,
  n_min = 3L,
  stopwords = character(),
  ngram_delim = " ",
  simplify = FALSE
)

df.clean.trigram.unique <-
  data.frame(grams = unlist(lapply(list.trigram, remove_symbols_ext))) %>%
  group_by(grams) %>%
  dplyr::summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(
    cum.perc = 100 * cumsum(count) / sum(count)
  ) %>%
  filter(cum.perc<80 & count > 2) 

frequencies[["trigram"]][["en_US.twitter.txt"]] <-
  df.clean.trigram.unique[unlist(lapply(df.clean.trigram.unique$grams, trigram_filter)),]

toberm <- ls()

rm(list = toberm[!toberm %in% c(
  "tableSizes",
  "frequencies",
  "quotemeta",
  "remove_symbols_ext",
  "count_char_list",
  "bigram_filter",
  "trigram_filter"
)])

gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1031177   55    3077406  164  6010554  321
## Vcells 7436979   57   26347382  201 46052927  351

Understanding frequencies

Understand the frequencies of words and word pairs: build figures and tables to show the variation in word, bigram and trigram frequencies across the three sources. Because the bigram and trigram tables were truncated above (cum.perc < 95% and < 80%, respectively), the frequencies computed below are rescaled by the retained coverage (total.freq) so that they remain comparable to the untruncated counts.

for(tt in names(frequencies)) {
  for (ss in names(frequencies[[tt]])) {
    total.freq = max( frequencies[[tt]][[ss]]$cum.perc,na.rm = T ) / 100
    frequencies[[tt]][[ss]] <- frequencies[[tt]][[ss]] %>%
      mutate( freq = count / sum(count) * total.freq,
              cum.freq = cumsum(count) / sum(count) * total.freq )
  }
}

if( !file.exists("backups.Rdata") ) {
  save(tableSizes, frequencies, file = "backups.Rdata")
}

1. What are the distributions of word frequencies?

plot.frequencies <- function(type, source, n) {
  g <-
    ggplot( frequencies[[type]][[source]][seq(1,n),], 
            aes(x = reorder(grams, freq), y = freq)) + 
    geom_bar(stat = "identity", fill = "blue") +
    theme(axis.text.y = element_text(size = 11)) +
    labs(title = source, y = "frequency", x = type) +
    theme(axis.title.y = element_text(size = 11)) +
    theme(axis.title.x = element_text(size = 11)) +
    guides(color = "none" ) +
    coord_flip()
  return(g)
  
}

en_US.b.words <- plot.frequencies("words", "en_US.blogs.txt", 22)
en_US.n.words <- plot.frequencies("words", "en_US.news.txt", 22)
en_US.t.words <- plot.frequencies("words", "en_US.twitter.txt", 22)

grid.arrange( en_US.b.words,
              en_US.n.words,
              en_US.t.words,
              ncol=3 )

2. What are the frequencies of 2-grams and 3-grams?

en_US.b.bigram <- plot.frequencies("bigram", "en_US.blogs.txt", 22)
en_US.n.bigram <- plot.frequencies("bigram", "en_US.news.txt", 22)
en_US.t.bigram <- plot.frequencies("bigram", "en_US.twitter.txt", 22)

grid.arrange( en_US.b.bigram,
              en_US.n.bigram,
              en_US.t.bigram,
              ncol=3 )

en_US.b.trigram <-
  plot.frequencies("trigram", "en_US.blogs.txt", 22) + theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  ))
en_US.n.trigram <-
  plot.frequencies("trigram", "en_US.news.txt", 22) + theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  ))
en_US.t.trigram <-
  plot.frequencies("trigram", "en_US.twitter.txt", 22) + theme(axis.text.x = element_text(
    angle = 90,
    vjust = 0.5,
    hjust = 1
  ))

grid.arrange(en_US.b.trigram,
             en_US.n.trigram,
             en_US.t.trigram,
             ncol = 3)

3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

Coverage values for en_US.blogs.txt:

  • 84 unique words (0.43%) to cover 50% of all word instances in the document.
  • 3847 unique words (19.8%) to cover 90% of all word instances in the document.
  • 7454 unique words (38.36%) to cover 95% of all word instances in the document.
  • 15961 unique words (82.14%) to cover 99% of all word instances in the document.

Coverage values for en_US.news.txt:

  • 129 unique words (0.67%) to cover 50% of all word instances in the document.
  • 4399 unique words (22.87%) to cover 90% of all word instances in the document.
  • 7951 unique words (41.33%) to cover 95% of all word instances in the document.
  • 16102 unique words (83.7%) to cover 99% of all word instances in the document.

Coverage values for en_US.twitter.txt:

  • 90 unique words (0.63%) to cover 50% of all word instances in the document.
  • 2550 unique words (17.89%) to cover 90% of all word instances in the document.
  • 5040 unique words (35.35%) to cover 95% of all word instances in the document.
  • 11561 unique words (81.09%) to cover 99% of all word instances in the document.
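
The coverage figures above can be read directly from the cum.perc column of the cleaned word tables; a minimal sketch (assuming the frequencies list built above is still in memory):

# Number of frequency-sorted unique words needed to reach a given coverage
coverage_size <- function(df, target) min(which(df$cum.perc >= target))

for (src in names(frequencies[["words"]])) {
  df <- frequencies[["words"]][[src]]
  cat(src, ":", coverage_size(df, 50), "words for 50% coverage,",
      coverage_size(df, 90), "words for 90% coverage\n")
}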

4. How do you evaluate how many of the words come from foreign languages?

In our filtering we used hunspell_check to keep only valid English words. This removes foreign-language words, but it also removes misspellings, slang and proper nouns, so the drop in unique words at the dictionary-check step (for example from 26983 to 19503 for en_US.blogs.txt) only gives an upper bound on the share of foreign words.
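
Using the counts reported in the cleaning steps above, that upper bound per source works out to:

# Share of cleaned unique words rejected by hunspell_check, per source,
# computed from the counts reported above (an upper bound on foreign words).
c(blogs   = (26983 - 19503) / 26983,
  news    = (29163 - 19283) / 29163,
  twitter = (24516 - 14334) / 24516)
## roughly 0.28, 0.34 and 0.42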

5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

We have already increased coverage by eliminating symbols and keeping only valid, cleaned words. Other strategies could be, for example, mapping rare words to more frequent synonyms, or collapsing inflected forms of the same word, so that a smaller dictionary covers the same number of phrases (see the sketch below).
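
As an illustration of the second idea (not used in the analysis above), the sketch below collapses the cleaned blog dictionary onto word stems with SnowballC::wordStem and compares how many entries are needed to reach 90% coverage before and after stemming; it assumes the frequencies list built above is still in memory.

# Sketch only: collapse the cleaned blog dictionary onto word stems and
# compare dictionary sizes needed for 90% coverage.
library(SnowballC)
library(dplyr)

words.blogs <- frequencies[["words"]][["en_US.blogs.txt"]]

stemmed <- words.blogs %>%
  mutate(stem = wordStem(grams, language = "english")) %>%
  group_by(stem) %>%
  summarise(count = sum(count)) %>%
  arrange(desc(count)) %>%
  mutate(cum.perc = 100 * cumsum(count) / sum(count))

c(words = min(which(words.blogs$cum.perc >= 90)),
  stems = min(which(stemmed$cum.perc >= 90)))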