The first step in building a predictive text model is understanding the distribution of, and the relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data before building the first linguistic models.
Exploratory analysis: perform a thorough exploratory analysis of the data to understand the distribution of words and the relationships between words in the corpora. Understand the frequencies of words and word pairs, and build figures and tables that show how these frequencies vary across the data.
For the exploratory analysis we first load the full data sets; in later phases, due to capacity limitations, we will work with a sampled version of the corpus.
Basic summaries of the three files: word counts, line counts, and basic data tables.
# Packages assumed to be loaded for this analysis (setup chunk not shown):
library(dplyr)        # group_by(), summarise(), arrange(), mutate(), filter()
library(formattable)  # formattable()
library(hunspell)     # hunspell_check()
library(sentimentr)   # profanity() (assumed source of this helper)
library(ggplot2)      # ggplot()
library(gridExtra)    # grid.arrange()

Files <- dir(
path = file.path("R/Data/final/en_US/"),
pattern = ".*\\.txt",
all.files = FALSE,
full.names = TRUE,
recursive = FALSE,
ignore.case = FALSE,
include.dirs = FALSE,
no.. = FALSE
)
Lines <- unlist(lapply(Files, LaF::determine_nlines))
tableSizes <- data.frame(File = Files,
Lines = Lines )
formattable(tableSizes, align = c("l","c"))
| File | Lines |
|:---|:---:|
| R/Data/final/en_US/en_US.blogs.txt | 899288 |
| R/Data/final/en_US/en_US.news.txt | 1010242 |
| R/Data/final/en_US/en_US.twitter.txt | 2360148 |
Check content sample:
index = 1
full_txt <- LaF::sample_lines(tableSizes$File[index], tableSizes$Lines[index], tableSizes$Lines[index])
num.lines <- tableSizes$Lines[index]
head(full_txt)
## [1] "Mouse Pad Where the mouse takes the grain it does not eat\r"
## [2] "c. A knife\r"
## [3] "As I like to put it the book has a side of romance. The main story doesn't fully focus on Mia and Jeremy's interest in each other. For me it was the perfect about, just enough for me to become invested in their happiness together.\r"
## [4] "LOVE!!! I chose a cognac color, yet another great neutral. I love the design and can see myself rocking them with tons of looks. The Vince Camuto Baron sandal also comes in black but I typically don't wear black shoes in Spring or Summer.\r"
## [5] "Bart was good and stayed home unless he caught sent of a female in heat and then he was off to have him some fun. Not long after Jennifer and i settled in a friend got divorced and decided to move down to Corpus Christi. Somehow I got possession of his Chesapeake Bay Retriever Rosey in the deal and man did she love the water. Especially the warm water of my closest neighbor's hot tub. Bart didn't care for the water much but he got excited to see Rosey swim around. And when he got excited he drooled even more.\r"
## [6] "What emanates now from important Christian leaders, not to speak of such bona fide faithful as the Johnson, Carter, Kennedy, Bush, Blair, Brown and Atlee names adduced in a previous chapter, is something entirely different: it is secular solipsism and utopian eschatology masquerading as enlightened Christianity. It harks back to 2nd century Gnostics and not to the great Christian thinkers of the Renaissance or Enlightenment. Itâ\200\231s the lamb lying down with the lion thatâ\200\231s still a lion. Itâ\200\231s redistributive socialism and minorities-exalting Progressive doctrine that has nothing to do with the recognition of Godâ\200\231s presence on Earth.\r"
en_US.blogs.txt contains 899288 lines.
If we tokenize each line into individual words (1-grams), we can summarize the number of words per line:
set.seed(123)
sample_corpus_content <- full_txt[rbinom(n = tableSizes$Lines[index],
size = 1,
prob = 0.01) > 0]
list.words <- tokenizers::tokenize_ngrams(
sample_corpus_content,
lowercase = TRUE,
n = 1L,
n_min = 1L,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
summary(unlist(lapply(list.words, length)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 9 28 42 59 2369
num.words <- sum(unlist(lapply(list.words, length)))
df.words.unique <- data.frame(grams=unlist(list.words)) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
num.words.unique <- length(df.words.unique$grams)
The total number of words in this sample is 373995, but only 30030 of them are unique.
If we explore the number of characters per line, we get:
list.char.all <-
tokenizers::tokenize_characters( sample_corpus_content,
strip_non_alphanum = FALSE )
summary(unlist(lapply(list.char.all, length)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 47 150 227 322 11380
df.char.all.unique <- data.frame(chars=unlist(list.char.all)) %>%
group_by(chars) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
num.char.all.unique <- length(df.char.all.unique$chars)
The total number of characters is 2043773, but only 142 of them are unique.
We can also analyze the number of characters per word:
summary(unlist(lapply(list.words, count_char_list)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 3 4 4 6 39
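The count_char_list helper used above is defined outside this excerpt. A minimal sketch, assuming it simply returns the character count of every token in a line:
# Hypothetical sketch of count_char_list() (assumption, not the original definition):
# given the tokens of one line, return the number of characters in each token.
count_char_list <- function(tokens) {
  nchar(tokens)
}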
Let’s identify the non-alphanumeric characters. If we explore the number of characters per line excluding non-alphanumeric characters, we get:
list.char.strip <-
tokenizers::tokenize_characters(sample_corpus_content,
strip_non_alphanum = TRUE)
summary(unlist(lapply(list.char.strip, length)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 36 118 180 256 9253
Punctuation characters:
df.char.strip.unique <- data.frame(chars=unlist(list.char.strip)) %>%
group_by(chars) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
num.char.strip.unique <- length(df.char.strip.unique$chars)
punctuation <- df.char.all.unique$chars[!df.char.all.unique$chars %in% df.char.strip.unique$chars]
punctuation <- punctuation[!punctuation %in% c(" ", "'") ]
punctuation
## [1] "." "," "\r" "-" "!" ")" "(" "\"" ":" "?" "/" "“" ";" "&" "*"
## [16] "”" "‚" "%" "«" "§" "—" "#" "_" "‹" "„" "@" "–" "»" "‰" "¡"
## [31] "·" "¿" "¶" " " "›" "†" "…" "’" "‘" "‡" "•" "[" "]" "{" "}"
Other symbols:
other_symbols <- df.char.all.unique$chars[!grepl("[A-Za-z]",df.char.all.unique$chars,perl = T)]
other_symbols <- unique(other_symbols[!other_symbols %in% c(" ", "'") ])
other_symbols
## [1] "." "," "\r" "â" "\200" "\231" "-" "0"
## [9] "1" "!" ")" "(" "2" "\"" ":" "?"
## [17] "œ" "\235" "3" "9" "5" "¦" "ã" "4"
## [25] "/" "8" "“" "6" ";" "7" "\201" "\230"
## [33] "&" "*" "”" "‚" "$" "%" "ƒ" "å"
## [41] "=" "+" "æ" "ä" "«" "§" "©" "¢"
## [49] "£" "³" "—" "#" "_" "‹" "º" "ÿ"
## [57] "„" "¯" "~" "°" "š" "è" "¨" "®"
## [65] "¹" "@" "\210" "¸" "ç" "¬" "–" "»"
## [73] "ª" "¼" "¾" "×" "‰" "¡" "´" "¥"
## [81] "·" "\215" "½" "á" "" "\217" "¿" "¤"
## [89] "¶" " " "›" "†" "ï" "é" "…" "ž"
## [97] "’" "²" "‘" "µ" "\220" "±" "`" "^"
## [105] "‡" "•" "[" "ë" "]" "|" "{" "}"
## [113] "í" "ì"
Data cleaning. In order to increase coverage and reduce the required space, we are going to:
# 1. Remove symbols from words
other_symbols_pattern <- paste(quotemeta(c(punctuation, other_symbols)), collapse = "|")
df.clean.words.unique <- data.frame(grams=unlist(lapply(list.words,remove_symbols_ext))) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
We reduce the number of unique words from 30030 to 26983.
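The quotemeta and remove_symbols_ext helpers used above are also defined outside this excerpt. A minimal sketch, assuming quotemeta escapes regex metacharacters and remove_symbols_ext strips every character matched by other_symbols_pattern:
# Hypothetical sketches (assumptions, not the original definitions).
# quotemeta(): escape regex metacharacters so literal symbols can be joined
# into a single alternation pattern.
quotemeta <- function(x) {
  gsub("(\\W)", "\\\\\\1", x, perl = TRUE)
}
# remove_symbols_ext(): delete every character matched by the global
# other_symbols_pattern from a vector of tokens and drop empty results.
remove_symbols_ext <- function(tokens) {
  cleaned <- gsub(other_symbols_pattern, "", tokens, perl = TRUE)
  cleaned[nchar(cleaned) > 0]
}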
# 2. Check dictionary...
df.valid.words.unique <- df.clean.words.unique[hunspell_check(df.clean.words.unique$grams),]
We further reduce the number of unique words from 26983 to 19503.
# 3. Remove Profanity (if any left)
df.final.words.unique <- df.valid.words.unique[profanity(df.valid.words.unique$grams)$profanity_count==0,]
We further reduce the number of unique words from 19503 to 19455.
# 4. Final cleaning
df.final.words.unique <- df.final.words.unique[!is.na(nchar(df.final.words.unique$grams)) &
( nchar(df.final.words.unique$grams)>1 |
df.final.words.unique$grams %in% c("a","i","o")),]
After all the cleaning steps performed above, the total number of words went from 373995 to 346930 (92.76% retained).
Looking at unique words, we went from 30030 to 19431 (64.71% retained).
The cleaned word data set has a different distribution of the number of characters per unique word:
summary(unlist(lapply(df.final.words.unique$grams,nchar)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 6.0 7.0 7.4 9.0 19.0
The distribution of the number of characters per word, taken over all word instances, has also changed:
lw.vector <- unlist(list.words)
summary(unlist(lapply(lw.vector[lw.vector%in%df.final.words.unique$grams],nchar)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 4.0 4.3 6.0 19.0
df.final.words.unique <- df.final.words.unique %>%
mutate(
cum.perc = 100 * cumsum(count) / sum(count)
)
frequencies[["words"]][["en_US.blogs.txt"]] <- df.final.words.unique
toberm <- ls()
rm(list = toberm[!toberm %in% c(
"tableSizes",
"frequencies",
"quotemeta",
"remove_symbols_ext",
"count_char_list",
"bigram_filter",
"trigram_filter",
"sample_corpus_content"
)])
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 994986 53 3211030 172 3211030 172
## Vcells 3040147 23 41115382 314 42562709 325
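The next chunks build the bigram and trigram frequency tables. They rely on the helpers bigram_filter and trigram_filter, which are defined outside this excerpt; a minimal sketch, assuming they keep only n-grams whose individual tokens all pass the spell check:
# Hypothetical sketches (assumptions, not the original definitions).
# Keep a bigram only if it has exactly two tokens and both are valid words.
bigram_filter <- function(gram) {
  tokens <- strsplit(gram, " ", fixed = TRUE)[[1]]
  length(tokens) == 2 && all(hunspell_check(tokens))
}
# Keep a trigram only if it has exactly three tokens and all are valid words.
trigram_filter <- function(gram) {
  tokens <- strsplit(gram, " ", fixed = TRUE)[[1]]
  length(tokens) == 3 && all(hunspell_check(tokens))
}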
list.bigram <- tokenizers::tokenize_ngrams(
sample_corpus_content,
lowercase = TRUE,
n = 2L,
n_min = 2L,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
df.clean.bigram.unique <-
data.frame(grams = unlist(lapply(list.bigram, remove_symbols_ext))) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(
cum.perc = 100 * cumsum(count) / sum(count)
) %>%
filter(cum.perc<95 & count > 4)
frequencies[["bigram"]][["en_US.blogs.txt"]] <-
df.clean.bigram.unique[unlist(lapply(df.clean.bigram.unique$grams, bigram_filter)),]
toberm <- ls()
rm(list = toberm[!toberm %in% c(
"tableSizes",
"frequencies",
"quotemeta",
"remove_symbols_ext",
"count_char_list",
"bigram_filter",
"trigram_filter",
"sample_corpus_content"
)])
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1010442 54 3211032 172 3211032 172
## Vcells 5244642 40 26313845 201 42562709 325
list.trigram <- tokenizers::tokenize_ngrams(
sample_corpus_content,
lowercase = TRUE,
n = 3L,
n_min = 3L,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
df.clean.trigram.unique <-
data.frame(grams = unlist(lapply(list.trigram, remove_symbols_ext))) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(
cum.perc = 100 * cumsum(count) / sum(count)
) %>%
filter(cum.perc<80 & count > 2)
frequencies[["trigram"]][["en_US.blogs.txt"]] <-
df.clean.trigram.unique[unlist(lapply(df.clean.trigram.unique$grams, trigram_filter)),]
toberm <- ls()
rm(list = toberm[!toberm %in% c(
"tableSizes",
"frequencies",
"quotemeta",
"remove_symbols_ext",
"count_char_list",
"bigram_filter",
"trigram_filter"
)])
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1004670 54 3211032 172 3211032 172
## Vcells 3417534 26 21051076 161 42562709 325
Check content sample:
index = 2
full_txt <- LaF::sample_lines(tableSizes$File[index], tableSizes$Lines[index], tableSizes$Lines[index])
num.lines <- tableSizes$Lines[index]
head(full_txt)
## [1] "This offer is open to all New Jersey homeowners and the public is welcome to this event. This is a great time to invite friends and neighbors to join us to learn about Christ Church Newton, while also learning if their home might qualify for a no-cost solar installation.\r"
## [2] "The posted wait time was 60 minutes, and we maneuvered through the slick walkway (it winds beneath the ride, and spillover falls onto the queue). The line didn't look that long, but we failed to factor in the hordes of Universal Express users who enter from another direction. The standstill produced an unexpected and unpleasant damp experience â\200” sweat.\r"
## [3] "The lawyer, Richard Sparaco, said he could not comment on plea discussions. But he said Tatar's letter was neither surprising nor troubling.\r"
## [4] "Today, we are announcing another 300 jobs, building on our $163-million investment in the Detroit area and adding to what is now the largest concentration of GE IT experts anywhere in the world. With this announcement, GE will be bringing its employment there to 1,400. GE employment will reach about 3,800 in Michigan in the next few years.\r"
## [5] "This disorder affects mostly men older than 50 and causes them to thrash about wildly, almost violently. They can hurt their bed partners. During the agitated period, the person is still experiencing vivid, frightening dreams that center on fighting or fleeing a threatening encounter. D.S.'s husband can relate his dreams to her the following morning.\r"
## [6] "The second came late in the second quarter, after another Osweiler and Willie 34-yard completion. ASU marched down to the Illini 15, but Alex Garoutte went wide left on a 32-yard field goal.\r"
en_US.news.txt contains 1010242 lines.
If we tokenize each line into individual words (1-grams), we can summarize the number of words per line:
set.seed(124)
sample_corpus_content <- full_txt[rbinom(n = tableSizes$Lines[index],
size = 1,
prob = 0.01) > 0]
list.words <- tokenizers::tokenize_ngrams(
sample_corpus_content,
lowercase = TRUE,
n = 1L,
n_min = 1L,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
summary(unlist(lapply(list.words, length)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 19 32 35 46 250
num.words <- sum(unlist(lapply(list.words, length)))
df.words.unique <- data.frame(grams=unlist(list.words)) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
num.words.unique <- length(df.words.unique$grams)
The total number of words in this sample is 351011, but only 31847 of them are unique.
If we explore the number of characters per line, we get:
list.char.all <-
tokenizers::tokenize_characters( sample_corpus_content,
strip_non_alphanum = FALSE )
summary(unlist(lapply(list.char.all, length)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 112 187 203 269 1364
df.char.all.unique <- data.frame(chars=unlist(list.char.all)) %>%
group_by(chars) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
num.char.all.unique <- length(df.char.all.unique$chars)
The total number of characters is 2053207, but only 104 of them are unique.
We can also analyze the number of characters per word:
summary(unlist(lapply(list.words, count_char_list)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 3 4 5 6 35
Let’s identify the non-alphanumeric characters. If we explore the number of characters per line excluding non-alphanumeric characters, we get:
list.char.strip <-
tokenizers::tokenize_characters(sample_corpus_content,
strip_non_alphanum = TRUE)
summary(unlist(lapply(list.char.strip, length)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 88 150 163 215 1048
Punctuation characters:
df.char.strip.unique <- data.frame(chars=unlist(list.char.strip)) %>%
group_by(chars) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
num.char.strip.unique <- length(df.char.strip.unique$chars)
punctuation <- df.char.all.unique$chars[!df.char.all.unique$chars %in% df.char.strip.unique$chars]
punctuation <- punctuation[!punctuation %in% c(" ", "'") ]
punctuation
## [1] "." "," "\r" "\"" "-" ":" ")" "(" "”" ";" "?" "/" "“" "&" "!"
## [16] "’" "%" "*" "–" "—" "@" "•" "_" "#" "¿" "…" "¡" "‘" "»" "‹"
## [31] "‰" "‚" "¶" "[" "]" "«" "§"
Other symbols:
other_symbols <- df.char.all.unique$chars[!grepl("[A-Za-z]",df.char.all.unique$chars,perl = T)]
other_symbols <- unique(other_symbols[!other_symbols %in% c(" ", "'") ])
other_symbols
## [1] "." "," "\r" "\"" "-" "0" "1" "2"
## [9] "â" "\200" "5" "3" "9" "4" ":" "6"
## [17] ")" "8" "(" "7" "$" "\231" "”" ";"
## [25] "œ" "?" "\235" "=" "/" "“" "ã" "&"
## [33] "!" "’" "%" "¸" "\230" "*" "©" "¦"
## [41] "–" "—" "@" "•" "½" "_" "`" "ï"
## [49] "#" "¿" "+" "¢" "…" "\210" "¡" "¨"
## [57] "‘" "±" "»" "‹" "‰" "š" "" "‚"
## [65] "¬" "®" "°" "¶" "¼" "[" "]" "^"
## [73] "´" "«" "§" "³"
Data cleaning. In order to increase coverage and reduce the required space, we are going to:
# 1. Remove symbols from words
other_symbols_pattern <- paste(quotemeta(c(punctuation, other_symbols)), collapse = "|")
df.clean.words.unique <- data.frame(grams=unlist(lapply(list.words,remove_symbols_ext))) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
We reduce the number of unique words from 31847 to 29163.
# 2. Check dictionary...
df.valid.words.unique <- df.clean.words.unique[hunspell_check(df.clean.words.unique$grams),]
We further reduce the number of unique words from 29163 to 19283.
# 3. Remove Profanity (if any left)
df.final.words.unique <- df.valid.words.unique[profanity(df.valid.words.unique$grams)$profanity_count==0,]
We further reduce the number of unique words from 19283 to 19261.
# 4. Final cleaning
df.final.words.unique <- df.final.words.unique[!is.na(nchar(df.final.words.unique$grams)) &
( nchar(df.final.words.unique$grams)>1 |
df.final.words.unique$grams %in% c("a","i","o")),]
After all the cleaning steps performed above, the total number of words went from 351011 to 313408 (89.29% retained).
Looking at unique words, we went from 31847 to 19237 (60.4% retained).
The cleaned word data set has a different distribution of the number of characters per unique word:
summary(unlist(lapply(df.final.words.unique$grams,nchar)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 6.0 7.0 7.5 9.0 20.0
The distribution of the number of characters per word, taken over all word instances, has also changed:
lw.vector <- unlist(list.words)
summary(unlist(lapply(lw.vector[lw.vector%in%df.final.words.unique$grams],nchar)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 4.0 4.6 6.0 20.0
df.final.words.unique <- df.final.words.unique %>%
mutate(
cum.perc = 100 * cumsum(count) / sum(count)
)
frequencies[["words"]][["en_US.news.txt"]] <- df.final.words.unique
toberm <- ls()
rm(list = toberm[!toberm %in% c(
"tableSizes",
"frequencies",
"quotemeta",
"remove_symbols_ext",
"count_char_list",
"bigram_filter",
"trigram_filter",
"sample_corpus_content"
)])
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1019805 54 3114591 166 3893238 208
## Vcells 3173494 24 40347962 308 46052927 351
list.bigram <- tokenizers::tokenize_ngrams(
sample_corpus_content,
lowercase = TRUE,
n = 2L,
n_min = 2L,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
df.clean.bigram.unique <-
data.frame(grams = unlist(lapply(list.bigram, remove_symbols_ext))) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(
cum.perc = 100 * cumsum(count) / sum(count)
) %>%
filter(cum.perc<95 & count > 4)
frequencies[["bigram"]][["en_US.news.txt"]] <-
df.clean.bigram.unique[unlist(lapply(df.clean.bigram.unique$grams, bigram_filter)),]
toberm <- ls()
rm(list = toberm[!toberm %in% c(
"tableSizes",
"frequencies",
"quotemeta",
"remove_symbols_ext",
"count_char_list",
"bigram_filter",
"trigram_filter",
"sample_corpus_content"
)])
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1023540 55 3114591 166 3893238 208
## Vcells 3484079 27 20658157 158 46052927 351
list.trigram <- tokenizers::tokenize_ngrams(
sample_corpus_content,
lowercase = TRUE,
n = 3L,
n_min = 3L,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
df.clean.trigram.unique <-
data.frame(grams = unlist(lapply(list.trigram, remove_symbols_ext))) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(
cum.perc = 100 * cumsum(count) / sum(count)
) %>%
filter(cum.perc<80 & count > 2)
frequencies[["trigram"]][["en_US.news.txt"]] <-
df.clean.trigram.unique[unlist(lapply(df.clean.trigram.unique$grams, trigram_filter)),]
toberm <- ls()
rm(list = toberm[!toberm %in% c(
"tableSizes",
"frequencies",
"quotemeta",
"remove_symbols_ext",
"count_char_list",
"bigram_filter",
"trigram_filter"
)])
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1021210 55 3114591 166 3893238 208
## Vcells 5182731 40 20658157 158 46052927 351
Check content sample:
index = 3
full_txt <- LaF::sample_lines(tableSizes$File[index], tableSizes$Lines[index], tableSizes$Lines[index])
num.lines <- tableSizes$Lines[index]
head(full_txt)
## [1] "I already want a new one but I wanna dye my hair first\r"
## [2] "My HP mug is magic. Before I pour coffee in, it says \"I solemnly swear that I'm up to no good\" and after it says \"mischief managed\"\r"
## [3] "i even made a resume & it still didnt work...\r"
## [4] "congrats on passing 3000 followers\r"
## [5] "And im mean. Just called peoples moms ugly!\r"
## [6] "So after a day of sitting on it, that show last night was weird as hell\r"
en_US.twitter.txt contains 2360148 lines.
If we tokenize each line into individual words (1-grams), we can summarize the number of words per line:
set.seed(124)
sample_corpus_content <- full_txt[rbinom(n = tableSizes$Lines[index],
size = 1,
prob = 0.01) > 0]
list.words <- tokenizers::tokenize_ngrams(
sample_corpus_content,
lowercase = TRUE,
n = 1L,
n_min = 1L,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
summary(unlist(lapply(list.words, length)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 7 12 13 18 37
num.words <- sum(unlist(lapply(list.words, length)))
df.words.unique <- data.frame(grams=unlist(list.words)) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
num.words.unique <- length(df.words.unique$grams)
The total number of words in this sample is 303107, but only 26074 of them are unique.
If we explore the number of characters per line, we get:
list.char.all <-
tokenizers::tokenize_characters( sample_corpus_content,
strip_non_alphanum = FALSE )
summary(unlist(lapply(list.char.all, length)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 38 65 70 101 152
df.char.all.unique <- data.frame(chars=unlist(list.char.all)) %>%
group_by(chars) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
num.char.all.unique <- length(df.char.all.unique$chars)
The total number of characters is 1651368, but only 143 of them are unique.
We can also analyze the number of characters per word:
summary(unlist(lapply(list.words, count_char_list)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 3 4 4 5 41
Let’s identify the non-alphanumeric characters. If we explore the number of characters per line excluding non-alphanumeric characters, we get:
list.char.strip <-
tokenizers::tokenize_characters(sample_corpus_content,
strip_non_alphanum = TRUE)
summary(unlist(lapply(list.char.strip, length)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 28 50 54 78 127
Punctuation characters:
df.char.strip.unique <- data.frame(chars=unlist(list.char.strip)) %>%
group_by(chars) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
num.char.strip.unique <- length(df.char.strip.unique$chars)
punctuation <- df.char.all.unique$chars[!df.char.all.unique$chars %in% df.char.strip.unique$chars]
punctuation <- punctuation[!punctuation %in% c(" ", "'") ]
punctuation
## [1] "." "\r" "!" "," "?" ":" "-" "\"" "#" ")" "&" "(" "/" ";" "_"
## [16] "*" "@" "%" "’" "“" "]" "”" "[" "‘" "‰" "«" "»" "‚" "¡" "•"
## [31] "—" "„" "…" "¿" "\\" "{" "}" "†" "–" "¶" " " "‹" "·" "‡" "§"
## [46] "›"
Other symbols:
other_symbols <- df.char.all.unique$chars[!grepl("[A-Za-z]",df.char.all.unique$chars,perl = T)]
other_symbols <- unique(other_symbols[!other_symbols %in% c(" ", "'") ])
other_symbols
## [1] "." "\r" "!" "," "?" ":" "-" "\""
## [9] "#" "1" ")" "0" "2" "3" "&" "("
## [17] "â" "/" "\200" "5" "4" ";" "<" "9"
## [25] "8" "6" "7" "\235" "œ" "_" "$" "*"
## [33] "ð" "ÿ" ">" "\231" "\230" "=" "@" "%"
## [41] "~" "+" "¥" "ã" "¦" "’" "î" "^"
## [49] "“" "]" "¤" "\215" "\201" "\220" "”" "|"
## [57] "[" "‘" "š" "ž" "‰" "«" "»" "º"
## [65] "‚" "©" "¡" "¢" "•" "—" "„" "…"
## [73] "±" "\217" "¿" "£" "½" "ï" "" "®"
## [81] "`" "\\" "{" "}" "°" "†" "³" "–"
## [89] "¶" "¾" "æ" "ƒ" " " "¨" "¸" "\210"
## [97] "‹" "·" "è" "¯" "´" "‡" "¼" "²"
## [105] "ª" "ä" "ñ" "§" "¬" "à" "›" "µ"
## [113] "¹" "å" "é"
Data cleaning. In order to increase coverage and reduce the required space, we are going to:
# 1. Remove symbols from words
other_symbols_pattern <- paste(quotemeta(c(punctuation, other_symbols)), collapse = "|")
df.clean.words.unique <- data.frame(grams=unlist(lapply(list.words,remove_symbols_ext))) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count))
We reduce the number of unique words from 26074 to 24516.
# 2. Check dictionary...
df.valid.words.unique <- df.clean.words.unique[hunspell_check(df.clean.words.unique$grams),]
We further reduce the number of unique words from 24516 to 14334.
# 3. Remove Profanity (if any left)
df.final.words.unique <- df.valid.words.unique[profanity(df.valid.words.unique$grams)$profanity_count==0,]
We further reduce the number of unique words from 14334 to 14281.
# 4. Final cleaning
df.final.words.unique <- df.final.words.unique[!is.na(nchar(df.final.words.unique$grams)) &
( nchar(df.final.words.unique$grams)>1 |
df.final.words.unique$grams %in% c("a","i","o")),]
After all the cleaning steps performed above, the total number of words went from 303107 to 269558 (88.93% retained).
Looking at unique words, we went from 26074 to 14257 (54.68% retained).
The cleaned word data set has a different distribution of the number of characters per unique word:
summary(unlist(lapply(df.final.words.unique$grams,nchar)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 5 7 7 8 18
The distribution of the number of characters per word, taken over all word instances, has also changed:
lw.vector <- unlist(list.words)
summary(unlist(lapply(lw.vector[lw.vector%in%df.final.words.unique$grams],nchar)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 3.0 4.0 4.1 5.0 18.0
df.final.words.unique <- df.final.words.unique %>%
mutate(
cum.perc = 100 * cumsum(count) / sum(count)
)
frequencies[["words"]][["en_US.twitter.txt"]] <- df.final.words.unique
toberm <- ls()
rm(list = toberm[!toberm %in% c(
"tableSizes",
"frequencies",
"quotemeta",
"remove_symbols_ext",
"count_char_list",
"bigram_filter",
"trigram_filter",
"sample_corpus_content"
)])
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1041427 56 4808444 257 6010554 321
## Vcells 4303016 33 51459728 393 46052927 351
list.bigram <- tokenizers::tokenize_ngrams(
sample_corpus_content,
lowercase = TRUE,
n = 2L,
n_min = 2L,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
df.clean.bigram.unique <-
data.frame(grams = unlist(lapply(list.bigram, remove_symbols_ext))) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(
cum.perc = 100 * cumsum(count) / sum(count)
) %>%
filter(cum.perc<95 & count > 4)
frequencies[["bigram"]][["en_US.twitter.txt"]] <-
df.clean.bigram.unique[unlist(lapply(df.clean.bigram.unique$grams, bigram_filter)),]
toberm <- ls()
rm(list = toberm[!toberm %in% c(
"tableSizes",
"frequencies",
"quotemeta",
"remove_symbols_ext",
"count_char_list",
"bigram_filter",
"trigram_filter",
"sample_corpus_content"
)])
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1050276 56 3846756 206 6010554 321
## Vcells 7032980 54 41167783 314 46052927 351
list.trigram <- tokenizers::tokenize_ngrams(
sample_corpus_content,
lowercase = TRUE,
n = 3L,
n_min = 3L,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
df.clean.trigram.unique <-
data.frame(grams = unlist(lapply(list.trigram, remove_symbols_ext))) %>%
group_by(grams) %>%
dplyr::summarise(count = n()) %>%
arrange(desc(count)) %>%
mutate(
cum.perc = 100 * cumsum(count) / sum(count)
) %>%
filter(cum.perc<80 & count > 2)
frequencies[["trigram"]][["en_US.twitter.txt"]] <-
df.clean.trigram.unique[unlist(lapply(df.clean.trigram.unique$grams, trigram_filter)),]
toberm <- ls()
rm(list = toberm[!toberm %in% c(
"tableSizes",
"frequencies",
"quotemeta",
"remove_symbols_ext",
"count_char_list",
"bigram_filter",
"trigram_filter"
)])
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1031177 55 3077406 164 6010554 321
## Vcells 7436979 57 26347382 201 46052927 351
Understand frequencies of words and word pairs: we now build figures and tables to show how the frequencies of words and word pairs vary across the data. For each source, the counts are first converted into relative frequencies (freq) and cumulative frequencies (cum.freq), scaled by total.freq, the fraction of occurrences covered by the n-grams retained after filtering.
for(tt in names(frequencies)) {
for (ss in names(frequencies[[tt]])) {
total.freq = max( frequencies[[tt]][[ss]]$cum.perc,na.rm = T ) / 100
frequencies[[tt]][[ss]] <- frequencies[[tt]][[ss]] %>%
mutate( freq = count / sum(count) * total.freq,
cum.freq = cumsum(count) / sum(count) * total.freq )
}
}
if( !file.exists("backups.Rdata") ) {
save(tableSizes, frequencies, file = "backups.Rdata")
}
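With the cumulative frequencies in place, coverage questions can be answered directly from the cum.freq column. For example, an illustrative helper (an assumption, not part of the original code) that returns how many of the most frequent unique words are needed to reach a target coverage:
# Illustrative helper (assumption): number of most frequent unique words
# needed to cover a given fraction of all word instances in a source.
words_for_coverage <- function(source, target) {
  cum.freq <- frequencies[["words"]][[source]]$cum.freq
  which(cum.freq >= target)[1]
}
words_for_coverage("en_US.blogs.txt", 0.50)  # words needed for 50% coverage
words_for_coverage("en_US.blogs.txt", 0.90)  # words needed for 90% coverage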
plot.frequencies <- function(type, source, n) {
g <-
ggplot( frequencies[[type]][[source]][seq(1,n),],
aes(x = reorder(grams, freq), y = freq)) +
geom_bar(stat ="Identity", fill = "blue") +
theme(axis.text.y = element_text(size = 11)) +
labs(title = source, y = "frequency", x = type) +
theme(axis.title.y = element_text(size = 11)) +
theme(axis.title.x = element_text(size = 11)) +
guides(color = "none" ) +
coord_flip()
return(g)
}
en_US.b.words <- plot.frequencies("words", "en_US.blogs.txt", 22)
en_US.n.words <- plot.frequencies("words", "en_US.news.txt", 22)
en_US.t.words <- plot.frequencies("words", "en_US.twitter.txt", 22)
grid.arrange( en_US.b.words,
en_US.n.words,
en_US.t.words,
ncol=3 )
en_US.b.bigram <- plot.frequencies("bigram", "en_US.blogs.txt", 22)
en_US.n.bigram <- plot.frequencies("bigram", "en_US.news.txt", 22)
en_US.t.bigram <- plot.frequencies("bigram", "en_US.twitter.txt", 22)
grid.arrange( en_US.b.bigram,
en_US.n.bigram,
en_US.t.bigram,
ncol=3 )
en_US.b.trigram <-
plot.frequencies("trigram", "en_US.blogs.txt", 22) + theme(axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1
))
en_US.n.trigram <-
plot.frequencies("trigram", "en_US.news.txt", 22) + theme(axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1
))
en_US.t.trigram <-
plot.frequencies("trigram", "en_US.twitter.txt", 22) + theme(axis.text.x = element_text(
angle = 90,
vjust = 0.5,
hjust = 1
))
grid.arrange(en_US.b.trigram,
en_US.n.trigram,
en_US.t.trigram,
ncol = 3)
Coverage values for en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt (coverage figures omitted).
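A minimal sketch of how such a coverage curve could be drawn from the cum.freq column computed above (the plotting choices here are assumptions, not the original figure code):
# Illustrative coverage curve (assumption): share of all occurrences covered
# by the n most frequent unique n-grams of a source.
plot.coverage <- function(type, source) {
  df <- frequencies[[type]][[source]]
  df$rank <- seq_len(nrow(df))
  ggplot(df, aes(x = rank, y = 100 * cum.freq)) +
    geom_line(colour = "blue") +
    labs(title = source,
         x = paste("number of unique", type, "(most frequent first)"),
         y = "coverage (%)")
}
plot.coverage("words", "en_US.blogs.txt")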
In our filtering we used hunspell_check to keep only valid English words. However, this approach also eliminates foreign-language words as well as misspelled ones.
We have increased coverage by eliminating symbols and keeping only cleaned, valid words. Other strategies could be, for example, using synonyms to identify words that may not be in the corpora, or using a smaller number of words to cover the same number of phrases.