Overview

In this work I use the en_US locale of the dataset.

##                      Blog       News     Twitter       Total
## Word counts   37334131.00 2643969.00 30373543.00 70351643.00
## Line counts     899288.00   77259.00  2360148.00  3336695.00
## File size, MB      200.42     196.28      159.36      556.06

70 million words is definitely too much for a model. I will keep only about 35,000 words, roughly 0.05% of the data, by selecting the longest lines (those above the 99.95th percentile of line length).
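The quantile computations below rely on a length column for each source. A minimal sketch of how the data might have been loaded and that column computed (file names, skipNul, and column names are assumptions, not shown in the original code):

# Read the three en_US source files and record each line's length in characters
blogs   <- tibble::tibble(Text = readLines('en_US.blogs.txt', skipNul = TRUE))
news    <- tibble::tibble(Text = readLines('en_US.news.txt', skipNul = TRUE))
twitter <- tibble::tibble(Text = readLines('en_US.twitter.txt', skipNul = TRUE))
blogs$length   <- nchar(blogs$Text)
news$length    <- nchar(news$Text)
twitter$length <- nchar(twitter$Text)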

quantile(blogs$length, c(.9995))
##   99.95% 
## 2388.356
quantile(news$length, c(.9995))
##   99.95% 
## 1155.113
quantile(twitter$length, c(.9995))
## 99.95% 
##    142

Task 1 - Getting and cleaning the data

I filter the source data according to these percentiles, keeping only the lines longer than (approximately) the thresholds computed above.

blogs <- blogs %>% filter(length > 2400)
news <- news %>% filter(length > 1200)
twitter <- twitter %>% filter(length > 142)
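A quick sanity check of how many lines survive the filter (a sketch, not in the original code):

# Number of lines kept from each source after the percentile filter
c(blogs = nrow(blogs), news = nrow(news), twitter = nrow(twitter))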

Next, a few transformations prepare the data for tokenization.

# Combine the three filtered sources into one sample
sample <- bind_rows(blogs, news, twitter)
# Keep only the text column as a plain character vector
sample <- paste0(sample$Text)
# Prevent future errors with non-UTF-8 characters
sample <- stringr::str_conv(sample, "UTF-8")

Tokenization

All numbers, punctuation marks, parentheses, and special symbols such as £ or $ are removed; the text is also converted to lower case and English stop words are dropped.

corpus <- Corpus(VectorSource(sample)) %>% 
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords()) %>%
  tm_map(stripWhitespace)

Profanity filtering

I use a ready-made list of profanity words; it can be downloaded from the School of Computer Science at Carnegie Mellon University:

download.file('https://www.cs.cmu.edu/~biglou/resources/bad-words.txt', 'profanity.txt')
prof <- readLines('profanity.txt')
corpus <- tm_map(corpus, removeWords, prof)
# Build the document-term matrix
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
## <<DocumentTermMatrix (documents: 1498, terms: 25956)>>
## Non-/sparse entries: 106353/38775735
## Sparsity           : 100%
## Maximal term length: 128
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  ’s “ can get just like one people time will
##   138  0 0   1   1    3    0   3      1    3    1
##   148  0 0   3   1    0    5   3      0    1   11
##   222  8 3  10   1    3    5  13      0   10   21
##   225  0 0   7   1    1    1   6      3    7   19
##   231  5 0   1   4    2    2   4      0    3    2
##   236  0 0   0   1    0    0   5      5    2    0
##   245  0 0  31   3    1    1  15      2    5   37
##   291  0 0   8   1    3    0   3      0    2    6
##   338  0 0   2   1    3    0   8      1    3    9
##   57   0 0   5   2    0    1   6      3    1   22

Now I purge tokens that can be interpreted as words but are not actually words (’s, for example); the contracted “n’t” endings are replaced with “ not”.

# Replace a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, ' ', x))
# Replace a pattern with ' not' (to expand the "n't" contraction)
toNot <- content_transformer(function(x, pattern) gsub(pattern, ' not', x))
corpus1 <- corpus %>%
  tm_map(toSpace, "’s") %>%
  tm_map(toSpace, "“") %>%
  tm_map(toSpace, "–") %>%
  tm_map(toSpace, "”") %>%
  tm_map(toSpace, "’m") %>%
  tm_map(toSpace, "’re") %>%
  tm_map(toSpace, "’ve") %>%
  tm_map(toSpace, "’d") %>%
  tm_map(toSpace, "’ll") %>%
  tm_map(toSpace, "…") %>%
  tm_map(toSpace, "—") %>%
  tm_map(toNot, "n’t") %>%
  tm_map(stripWhitespace)
dtm <- DocumentTermMatrix(corpus1)
inspect(dtm)
## <<DocumentTermMatrix (documents: 1498, terms: 23966)>>
## Non-/sparse entries: 104174/35796894
## Sparsity           : 100%
## Maximal term length: 128
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  can even get just like not one people time will
##   138   1    3   1    3    0   0   3      1    3    1
##   148   3    0   1    0    5   0   3      0    1   11
##   222  11    3   1    5    6  15  18      0   15   21
##   225   7    0   1    1    1   0   6      3    7   19
##   231   1    3   4    2    2   4   4      0    3    2
##   236   0    0   1    0    0   0   5      5    2    0
##   245  31    4   3    1    1   0  15      2    5   37
##   291   8    3   1    3    0   0   3      0    2    6
##   338   2    2   1    3    0   0   8      1    3    9
##   57    5    2   2    0    1   0   6      3    1   22

Word cloud

# Convert the DTM to a dense matrix and compute overall word frequencies
matrix <- as.matrix(dtm) 
words <- sort(colSums(matrix), decreasing = TRUE) 
df <- data.frame(word = names(words), freq = words)
set.seed(42) # for reproducibility 
wordcloud(words = df$word, freq = df$freq, 
          min.freq = 1, 
          max.words = 300, 
          random.order = FALSE, 
          rot.per = 0.35,
          colors = brewer.pal(8, 'Dark2'))
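As a numeric complement to the cloud, the top of the frequency table can be printed directly (a sketch using the df data frame built above):

# Ten most frequent words in the sample
head(df, 10)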

Task 2 - Exploratory Data Analysis

Question 1

Some words are more frequent than others - what are the distributions of word frequencies?
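One way to look at this distribution is to plot the frequencies from the df table built for the word cloud (a sketch in base R, not the original code):

# Log-scaled histogram: a handful of very frequent words and a long tail of rare ones
hist(log10(df$freq),
     main = 'Distribution of word frequencies',
     xlab = 'log10(frequency)')
# Bar chart of the twenty most frequent words
barplot(df$freq[1:20], names.arg = df$word[1:20], las = 2,
        main = 'Top 20 words')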

And “one” is the winner.

Question 2

What are the frequencies of 2-grams and 3-grams in the dataset?
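The tm pipeline above tokenizes single words only; one way to count 2-grams and 3-grams is with the tidytext package (a sketch under that assumption, applied to the prepared sample vector rather than the tm corpus):

library(dplyr)
library(tidytext)

text_df <- tibble::tibble(text = sample)

# Most frequent 2-grams
bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = 'ngrams', n = 2) %>%
  count(bigram, sort = TRUE)

# Most frequent 3-grams
trigrams <- text_df %>%
  unnest_tokens(trigram, text, token = 'ngrams', n = 3) %>%
  count(trigram, sort = TRUE)

head(bigrams)
head(trigrams)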

Question 3

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

This can only be estimated from the corpus that is actually used.

length(words)
## [1] 23966
sum(words)
## [1] 145633

So, I have 23966 unique words and 145633 word occurrences in total.

Replacing the frequencies with their cumulative sum and counting how many words are needed to reach 50% and 90% of the total gives the answer:

# Cumulative sum of the frequency-sorted counts
words <- cumsum(words)
print(paste0(sum(words < ceiling(145633 * 0.5)), ' words cover 50% of text'))
## [1] "931 words cover 50% of text"
print(paste0(sum(words < ceiling(145633 * 0.9)), ' words cover 90% of text'))
## [1] "10438 words cover 90% of text"

Question 4

How do you evaluate how many of the words come from foreign languages?

Answer 4

This could be done by filtering out all tokens containing symbols outside [a-zA-Z]. But it should be done carefully: some foreign words would still pass this regular expression (for example, not every Spanish word contains an Ñ).
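A rough count along these lines (a sketch reusing the df frequency table; it flags any token with characters outside the basic Latin alphabet, so it also catches leftover symbols, not only foreign words):

# Tokens containing characters outside [a-zA-Z]
foreign_candidates <- df[grepl('[^a-zA-Z]', df$word), ]
nrow(foreign_candidates)        # how many such tokens
sum(foreign_candidates$freq)    # how many word instances they account for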

Question 5

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Answer 5

This task calls for stemming before training the model and then applying grammar rules when predicting, so that one dictionary entry covers several surface forms.
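A sketch of the stemming step with tm and SnowballC, comparing dictionary sizes before and after (an illustration, not the original code):

library(SnowballC)
# Stem the cleaned corpus and rebuild the document-term matrix
corpus_stemmed <- tm_map(corpus1, stemDocument)
dtm_stemmed <- DocumentTermMatrix(corpus_stemmed)
dim(dtm)          # documents x terms before stemming
dim(dtm_stemmed)  # the term count should drop noticeably after stemming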