In my work I use the en_US locale of the dataset.
##                        Blog       News     Twitter       Total
## Word counts     37334131.00 2643969.00 30373543.00 70351643.00
## Line counts       899288.00   77259.00  2360148.00  3336695.00
## File size, MB        200.42     196.28      159.36      556.06
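For reference, a minimal sketch of how these summary statistics could have been computed (the file names below are the standard en_US ones and an assumption, as is the use of stringi for the word counts):

library(stringi)

files <- c(Blog    = 'en_US.blogs.txt',
           News    = 'en_US.news.txt',
           Twitter = 'en_US.twitter.txt')

# Word count, line count and size on disk for each source file
sapply(files, function(f) {
    lines <- readLines(f, encoding = 'UTF-8', skipNul = TRUE)
    c('Word counts'   = sum(stri_count_words(lines)),
      'Line counts'   = length(lines),
      'File size, MB' = round(file.size(f) / 1024^2, 2))
})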
Seventy million words is definitely too much for the model, so I will keep only about 35,000 of them, which is 0.05% of the data.
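The length column used in the quantile calls below is not constructed in this excerpt; presumably it is the number of characters per line (a minimal sketch, assuming blogs_lines, news_lines and twitter_lines are the raw character vectors returned by readLines()):

library(dplyr)

# Hypothetical names: *_lines are the raw vectors read from the en_US files
blogs   <- tibble(Text = blogs_lines)   %>% mutate(length = nchar(Text))
news    <- tibble(Text = news_lines)    %>% mutate(length = nchar(Text))
twitter <- tibble(Text = twitter_lines) %>% mutate(length = nchar(Text))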
quantile(blogs$length, c(.9995))
## 99.95%
## 2388.356
quantile(news$length, c(.9995))
## 99.95%
## 1155.113
quantile(twitter$length, c(.9995))
## 99.95%
## 142
I filter the source data according to these percentiles (the blog and news cut-offs are rounded up slightly).
blogs <- blogs %>% filter(length > 2400)
news <- news %>% filter(length > 1200)
twitter <- twitter %>% filter(length > 142)
Next, some transformations - the data is prepared for tokenization.
sample <- bind_rows(blogs, news, twitter)
sample <- paste0(sample$Text)
#Prevent future errors with non-UTF-8 characters
sample <- stringr::str_conv(sample, "UTF-8")
All numbers, punctuation marks, parentheses and specific symbols such as £ or $ are removed; English stop words are dropped and extra whitespace is collapsed.
corpus <- Corpus(VectorSource(sample)) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeWords, stopwords()) %>%
    tm_map(stripWhitespace)
I’ve used a ready-made list of profanity words; it can be downloaded from the School of Computer Science at Carnegie Mellon University:
download.file('https://www.cs.cmu.edu/~biglou/resources/bad-words.txt', 'profanity.txt')
prof <- readLines('profanity.txt')
corpus <- tm_map(corpus, removeWords, prof)
# Build the document-term matrix
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
## <<DocumentTermMatrix (documents: 1498, terms: 25956)>>
## Non-/sparse entries: 106353/38775735
## Sparsity : 100%
## Maximal term length: 128
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs ’s “ can get just like one people time will
## 138 0 0 1 1 3 0 3 1 3 1
## 148 0 0 3 1 0 5 3 0 1 11
## 222 8 3 10 1 3 5 13 0 10 21
## 225 0 0 7 1 1 1 6 3 7 19
## 231 5 0 1 4 2 2 4 0 3 2
## 236 0 0 0 1 0 0 5 5 2 0
## 245 0 0 31 3 1 1 15 2 5 37
## 291 0 0 8 1 3 0 3 0 2 6
## 338 0 0 2 1 3 0 8 1 3 9
## 57 0 0 5 2 0 1 6 3 1 22
Now I purge all tokens that can be interpreted as words but are not actually words (’s, for example).
# Replace a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, ' ', x))
# Expand the contraction "n’t" into " not"
toNot <- content_transformer(function(x, pattern) gsub(pattern, ' not', x))
corpus1 <- corpus %>%
tm_map(toSpace, "’s") %>%
tm_map(toSpace, "“") %>%
tm_map(toSpace, "–") %>%
tm_map(toSpace, "”") %>%
tm_map(toSpace, "’m") %>%
tm_map(toSpace, "’re") %>%
tm_map(toSpace, "’ve") %>%
tm_map(toSpace, "’d") %>%
tm_map(toSpace, "’ll") %>%
tm_map(toSpace, "…") %>%
tm_map(toSpace, "—") %>%
tm_map(toNot, "n’t") %>%
tm_map(stripWhitespace)
dtm <- DocumentTermMatrix(corpus1)
inspect(dtm)
## <<DocumentTermMatrix (documents: 1498, terms: 23966)>>
## Non-/sparse entries: 104174/35796894
## Sparsity : 100%
## Maximal term length: 128
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can even get just like not one people time will
## 138 1 3 1 3 0 0 3 1 3 1
## 148 3 0 1 0 5 0 3 0 1 11
## 222 11 3 1 5 6 15 18 0 15 21
## 225 7 0 1 1 1 0 6 3 7 19
## 231 1 3 4 2 2 4 4 0 3 2
## 236 0 0 1 0 0 0 5 5 2 0
## 245 31 4 3 1 1 0 15 2 5 37
## 291 8 3 1 3 0 0 3 0 2 6
## 338 2 2 1 3 0 0 8 1 3 9
## 57 5 2 2 0 1 0 6 3 1 22
matrix <- as.matrix(dtm)
# Total frequency of each term across all documents, sorted in decreasing order
words <- sort(colSums(matrix), decreasing = TRUE)
df <- data.frame(word = names(words), freq = words)
set.seed(42) # for reproducibility
wordcloud(words = df$word, freq = df$freq,
min.freq = 1,
max.words = 300,
random.order = FALSE,
rot.per = 0.35,
colors = brewer.pal(8, 'Dark2'))
Some words are much more frequent than others - what does the distribution of word frequencies look like?
The word cloud shows one clear winner.
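A quick way to look at that distribution is to reuse the df frequency table built for the word cloud (a sketch; ggplot2 is assumed to be available):

library(ggplot2)

# Top 20 terms by raw frequency
ggplot(head(df, 20), aes(x = reorder(word, freq), y = freq)) +
    geom_col() +
    coord_flip() +
    labs(x = NULL, y = 'Frequency')

# The distribution is heavily skewed: most words occur only a handful of times
ggplot(df, aes(x = freq)) +
    geom_histogram(bins = 50) +
    scale_x_log10() +
    labs(x = 'Word frequency (log scale)', y = 'Number of unique words')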
What are the frequencies of 2-grams and 3-grams in the dataset?
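One way to get these counts is tidytext's unnest_tokens(), applied here to the sample vector from above (a sketch under that assumption; note that sample still contains stop words and profanity, since it predates the tm cleaning):

library(tidytext)
library(dplyr)

ngram_df <- tibble(text = sample)

# Bigram frequencies
bigrams <- ngram_df %>%
    unnest_tokens(ngram, text, token = 'ngrams', n = 2) %>%
    count(ngram, sort = TRUE)

# Trigram frequencies
trigrams <- ngram_df %>%
    unnest_tokens(ngram, text, token = 'ngrams', n = 3) %>%
    count(ngram, sort = TRUE)

head(bigrams)
head(trigrams)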
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
This can only be estimated from the corpus that is actually used.
length(words)
## [1] 23966
sum(words)
## [1] 145633
So, I have 23966 unique words and 145633 word occurrences in total.
Converting the frequency vector to a cumulative sum and counting how many words are needed to cover 50% and 90% of all occurrences gives the answer:
words <- cumsum(words)
print(paste0(sum(words < ceiling(145633 * 0.5)), ' words cover 50% of text'))
## [1] "931 words cover 50% of text"
print(paste0(sum(words < ceiling(145633 * 0.9)), ' words cover 90% of text'))
## [1] "10438 words cover 90% of text"
How do you evaluate how many of the words come from foreign languages?
This could be done by filtering out all tokens that contain symbols other than [a-zA-Z]. But it should be done carefully: some foreign words still pass this regular expression (for example, not all Spanish words contain Ñ).
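A rough sketch of that check, run over the unique terms from the df table above (any term containing a character outside a-z/A-Z is flagged as a candidate):

# Candidate foreign / non-ASCII tokens among the unique terms
terms <- as.character(df$word)
foreign_candidates <- terms[grepl('[^a-zA-Z]', terms)]
length(foreign_candidates)
head(foreign_candidates, 20)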
Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
This could be addressed by stemming the words before training the model and then applying grammar rules at prediction time.
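A stemming step could be slotted into the existing tm pipeline roughly like this (a sketch; stemDocument() relies on the SnowballC package):

library(SnowballC)

# Stem the cleaned corpus before building the document-term matrix
corpus_stemmed <- tm_map(corpus1, stemDocument)
dtm_stemmed <- DocumentTermMatrix(corpus_stemmed)
nTerms(dtm_stemmed)  # fewer unique terms, so each dictionary word covers more text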