The aim of this report is to load and clean the text data, build a corpus and do some exploratory analysis.

Loading and sampling

The text data to be analyzed are contained in three text files with a combined size of almost 600 MB. I will work with a 1% sample to keep the analysis fast.

library(quanteda)
library(dplyr)
library(stringr)
library(hunspell)
library(knitr)
library(plotly)
library(gridExtra)
#library(lemon)
#knit_print.data.frame <- lemon_print
setwd("~/Documents/Coursera/CAPSTONE/final/en_US")

blogs <- readLines("en_US.blogs.txt")
tweets <- readLines("en_US.twitter.txt")
news <- readLines("en_US.news.txt")

set.seed(456)
blogs <- sample(blogs, length(blogs)/100)
tweets <- sample(tweets, length(tweets)/100)
news <- sample(news, length(news)/100)

setwd("~/Documents/Coursera/CAPSTONE")

#save the samples for future needs: 
write.csv(blogs, "blogs.txt")
write.csv(tweets, "tweets.txt")
write.csv(news, "news.txt")

Preprocessing and Corpus Building

We can expect the texts to be “dirty”, i.e. to contain spelling mistakes, foreign words, foreign letters etc. These will be cleaned only after tokenization, since removing single words from the text body would bias the ngrams. For now I only expand contractions like “I’m” and build a corpus using the quanteda package:

all <- c(tweets, blogs, news)
all <- tolower(all)

replace1 <- c("can't" = "cannot")
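# note: "can't" is replaced first so that the generic "n't" rule below does not turn it into "ca not"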
replace <- c("'m" = " am", "'re" = " are", "'s" = " is", "n't" = " not", "'em" = " them", "'d" = " would", "wanna" = "want to", "gotta" = "got to", "'ve" = " have")

all <- str_replace_all(all, replace1)
all <- str_replace_all(all, replace)

corp <- corpus(all)

Tokenization and cleaning

The tokens function from the quanteda package constructs the ngrams and also does part of the cleaning: it removes punctuation, special symbols etc. The rest of the cleaning is done via a spell check. The idea is that an English spell check catches not only misspelled words, but also things like e-mail addresses, rare slang and, most importantly, foreign words and languages. To save computation time, the cleaning is performed on the frequency tables, where each ngram to be removed is represented by a single entry (one line).

Unigrams:

#tokenization
tokens1 <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE)

#frequency table:
dfm1 <- dfm(tokens1)
unigrams <- textstat_frequency(dfm1)
unigrams <- as.data.frame(unigrams[,1:2])

#cleaning:
us <- hunspell::dictionary("/home/sal/R/x86_64-pc-linux-gnu-library/3.4/hunspell/dict/en_US.dic")
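# hunspell() returns the words it cannot find in the dictionary; an empty result
# (character(0)) means every word in the feature is recognised, so the feature is kept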
a <- hunspell(unigrams$feature, dict = us)
b <- sapply(a, function(x) identical(x, character(0)))
unigrams <- unigrams[b,]

Bigrams:

#tokenization
tokens2 <- tokens(corp, ngrams = 2, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE)

#frequency table:
dfm2 <- dfm(tokens2)
bigrams <- textstat_frequency(dfm2)
bigrams <- as.data.frame(bigrams[,1:2])

#cleaning:
a <- hunspell(bigrams$feature, dict = us)
b <- sapply(a, function(x) identical(x, character(0)))
bigrams <- bigrams[b,]

Trigrams:

#tokenization
tokens3 <- tokens(corp, ngrams = 3, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE)

#frequency table:
dfm3 <- dfm(tokens3)
trigrams <- textstat_frequency(dfm3)
trigrams <- as.data.frame(trigrams[,1:2])

#cleaning:
a <- hunspell(trigrams$feature, dict = us)
b <- sapply(a, function(x) identical(x, character(0)))
trigrams <- trigrams[b,]

Exploratory analysis

Basic counts:

##           terms tokens
## Unigrams  29047 960592
## Bigrams  436813 980655
## Trigrams 773237 938237
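
A summary like the one above can be assembled directly from the cleaned frequency tables; a minimal sketch, assuming the feature/frequency columns returned by textstat_frequency:

counts <- data.frame(
  terms  = c(nrow(unigrams), nrow(bigrams), nrow(trigrams)),
  tokens = c(sum(unigrams$frequency), sum(bigrams$frequency), sum(trigrams$frequency)),
  row.names = c("Unigrams", "Bigrams", "Trigrams")
)
counts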

Most frequent ngrams
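
The frequency tables returned by textstat_frequency are already sorted by decreasing frequency, so the top entries can be listed directly (a quick sketch; the figures themselves are presumably built with the plotting packages loaded above):

head(unigrams, 10)
head(bigrams, 10)
head(trigrams, 10)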

Coverage

Quantiles of coverage:
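
The output below could be produced along these lines (a sketch: total token counts followed by deciles of the ngram frequencies):

sum(unigrams$frequency)
sum(bigrams$frequency)
sum(trigrams$frequency)
quantile(unigrams$frequency, probs = seq(0, 1, 0.1))
quantile(bigrams$frequency, probs = seq(0, 1, 0.1))
quantile(trigrams$frequency, probs = seq(0, 1, 0.1))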

## [1] 960592
## [1] 980655
## [1] 938237
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##     1     1     1     1     2     3     4     6    12    31 47927
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    1    1    1    1    1    1    1    1    2    3 4402
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    1    1    1    1    1    1    1    1    1    1  512

We can see that most ngrams occur only once or a handful of times, so the bulk of the token coverage comes from the top 10% most frequent ngrams.
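
This can be checked directly, e.g. for the unigrams (a quick sketch):

freq_sorted <- sort(unigrams$frequency, decreasing = TRUE)
top10 <- head(freq_sorted, ceiling(0.1 * length(freq_sorted)))
sum(top10) / sum(freq_sorted)   # share of all unigram tokens covered by the top 10% of terms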

Plans for the model:

The basic method for predicting a word from one, two or more preceding words is Maximum Likelihood Estimation (MLE). In this method, we simply look up the n preceding words in the (n+1)-grams and select the most probable continuation. In its simplest form, the probability is estimated solely from the frequency of the (n+1)-gram in the corpus. However, such an approach does not account for ngrams that never appear in the corpus, so the probabilities need to be smoothed to handle these missing ngrams. The model could be stored as a probability table together with a “predict” function that searches for the input words and returns the tails of the matching ngrams (see the sketch below). The major disadvantage of this method is the size of the probability table: the frequency table of the trigrams alone takes about 100 MB, even though only 1% of the provided text data was used. Even if the probability table only needs to be constructed once and then loaded into memory when the final application is launched, this still seems highly inefficient. It is therefore desirable to explore other text prediction methods in the following phases of the Capstone Project.
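
As an illustration only, a minimal sketch of such a lookup on the trigram frequency table built above; predict_next is a hypothetical helper and assumes the "_"-joined features produced by tokens(..., ngrams = 3):

# return the most frequent third words of trigrams that start with "w1_w2_"
predict_next <- function(w1, w2, tab = trigrams, n = 3) {
  prefix <- paste0(w1, "_", w2, "_")
  hits <- tab[startsWith(tab$feature, prefix), ]
  hits <- head(hits[order(-hits$frequency), ], n)
  sub(".*_", "", hits$feature)   # keep only the last word of each matching trigram
}

predict_next("one", "of")   # e.g. candidate completions of "one of ..."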