Introduction

This short report is a modest attempt to evaluate the text datasets. It includes some basic statistics, a summary, the algorithms and principles used for extracting features from the documents and, finally, the features themselves.
As a non-native English speaker I should admit that I will probably never be great at Natural Language Processing (NLP) in English, but I have to do my best, as the course project requires.
Unfortunately, I have not yet had enough time to try NLP on a Russian dataset, but I am going to do that a bit later.

Data Ingestion

We have three files to ingest for further analysis. Two of them are related to social networks (tweets and blogs); the last one contains news. Predictions based on this limited content could be skewed and are certainly not complete for every possible situation in a word prediction task. I suggest fully loading all three files into document representations in R and converting them into a corpus using the 'quanteda' package (the loading code is listed in the Appendix).

Text Data Summary

After loading the text files we summarize the number of lines and words in each.
documents       lines       words
news_doc      1010242    34372597
blogs_doc      899284    37333760
twitter_doc   2360149    30373791

Cleaning the Text Data

Before building the text corpus we need to clean the documents. First, we merge the three files into one character vector and, using 'gsub', remove the useless parts; then we save the cleaned documents to a text file. The cleaning steps are:
* consolidate different apostrophe variants.
* remove hashtags
* remove URLs
* remove advertising blogs
* remove irrelevant chars (like @$%&, surrounding quotation marks, etc)
* remove surrounding apostrophes
* remove surrounding white space
* condense multiple spaces to one
At the end of cleaning we save the result to a text file so it can be re-ingested quickly.
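
To illustrate the effect of these rules, here is a quick check of the hashtag, URL and space-condensing patterns from the Appendix on a made-up string (the string is hypothetical, not taken from the corpus):

x <- "Check this out https://example.com #awesome it's great"
x <- gsub("[[:blank:]]#[^[:blank:]]*", " ", x, perl = TRUE)                      # remove hashtags
x <- gsub("(https?)?://[^[:blank:]]*", " ", x, ignore.case = TRUE, perl = TRUE)  # remove URLs
x <- gsub("[[:blank:]]{2,}", " ", x, perl = TRUE)                                # condense multiple spaces
x
# [1] "Check this out it's great"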

Document Feature Matrix

A document-feature matrix (DFM) has documents in rows and 'features' in columns. The features (sometimes called terms) hold the values of certain features for each document. Features can be entirely general, such as n-grams or syntactic dependencies. To get a DFM we have to build a text corpus and tokenize it.
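
As a small illustration (a toy example only, not part of the actual pipeline; the real code is in the Appendix), the corpus -> tokens -> DFM flow in quanteda looks roughly like this:

library(quanteda)
# a hypothetical two-document corpus
toy_corpus <- corpus(c(doc1 = "The cat sat on the mat.",
                       doc2 = "The dog sat on the log."))
toy_tokens <- tokens(toy_corpus, remove_punct = TRUE)
toy_tokens <- tokens_tolower(toy_tokens)
toy_dfm <- dfm(toy_tokens)
toy_dfm   # documents in rows, features (terms) in columns, counts in the cells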

Pre-processing Corpus

The raw corpus consists of the words and punctuation from the cleaned data file. The cleaned data is not yet ready for word prediction; it needs a bit of pre-processing before analysis and modeling. Processing should not affect the meaning of the words or the words themselves, but it is needed to simplify the corpus and make it more uniform.

  1. to lower - make all words lowercase. This is important to reduce dissimilarities between words.
  2. remove punctuation - punctuation should be removed to simplify the corpus and eliminate unnecessary tokens.
  3. remove numbers - removing all numbers reduces corpus complexity.
  4. remove stopwords - remove very common and frequent words to 'kill' the majority of usual words and give some space to the others, a bit like the 'Random forest' algorithm.

The most important and complicated task is to remove the very heavily used words that are useless for prediction (the 'stopwords'). For example, pronouns are useless for prediction despite being very frequent: there is no way, at least statistically, to predict which pronoun you will use as the next word.
Articles ('the', 'a', 'an') are also frequently used and should probably be excluded from prediction. Statistically, articles are used differently with the same nouns, so they cannot be predicted with high accuracy. Moreover, articles may be used ungrammatically, which aggravates prediction further.
Modal verbs ('must', 'should', 'could', 'will', etc.) should be excluded too; they are frequent and, in my opinion, unpredictable.
All forms of 'to be' ('am', 'is', 'are', 'were', etc.) are unnecessary as well and should be excluded.
Quanteda, like most NLP packages, ships stopword lists for different languages, and the simplest way to clean a corpus is to use such a list. But the list contains many frequent words that would be good to predict ('during', 'under' and so on), so we build our own stopwords list based on the standard one in Quanteda.
By default, the Quanteda package lowers case and removes numbers and punctuation, but I set these parameters explicitly in the tokenization step anyway.
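
For illustration, the same idea can be expressed by dropping the kept-back words from the default list by name rather than by position (the keep_words selection below is hypothetical; the index-based list actually used is in the Appendix):

library(quanteda)
# hypothetical examples of default stopwords we still want to be able to predict
keep_words <- c("during", "under", "between", "after", "before")
my_stopwords_by_name <- setdiff(stopwords("english"), keep_words)
head(my_stopwords_by_name)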

Single-word DFM (1-gram, unigram)

The top 100 features of the single-word DFM are:

or as a histogram of the top 30:
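
The plotting code is not included in the report; a minimal ggplot2 sketch, assuming the df_topfeatures_1gram data frame built in the Appendix, could look like this:

library(ggplot2)
top30 <- head(df_topfeatures_1gram, 30)
ggplot(top30, aes(x = reorder(words, -count), y = count)) +
    geom_col() +
    labs(x = "feature", y = "count", title = "Top 30 unigram features") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))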

Stopwords List Improvements

Let’s take a look at the first 1000 most frequent words. Some of them cannot be part of the prediction process because they are too specific, abbreviations or interjections. We will remove the following from the DFM:

  • abbreviations - rt, lol, im, st, u.s, p.m, a.m, mr, dr, ll, ur, omg, co
  • interjections - oh, ha, haha, la
  • single letters - a-z

Double-word DFM (2-gram, bigram)

After improving the stopwords list we rebuild the DFM for two-word sequences. The top 50 features of the 2-gram DFM are:

or as a histogram of the top 30:

‘More than’ is more than popular :)

3-, 4- and 5-gram DFMs

Let’s look at the histograms of the top 30 features of the 3-, 4- and 5-gram DFMs:

The top features of the 3-gram DFM look as expected: very popular phrases deserve the leading places.

The top features of the 4-gram DFM look good from the fourth position onward. The first and second positions are occupied by repeated phrases from a single blog post (easily found with grep in blogs.txt); the third position is formed by unexplained symbols that were not filtered out in the cleaning phase.

The top features of the 5-gram DFM look ‘so-so’ from the fourth position onward, just like the 4-gram DFM: the same repeated phrases from a single blog post and the same unexplained symbols. Positions 1 through 3 should be removed before building the prediction model. Most of the remaining positions are occupied by word repetitions, which are not interesting for further word-prediction modeling, and some unexplained symbols repeated 5 times are still present.

Coverage of All Words in the Dataset

To get the percentage of covered words, let’s go back to the unigram DFM. Using a cumulative sum we count the contribution of each word to the coverage. The 50%, 75% and 90% word counts are also marked on the diagram. We intentionally limit the number of words so that the plot can be drawn in a few seconds, even though more than 90% of the words are already covered within that limit.
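
The diagram is rendered in the original report; a minimal base-R sketch of it, assuming the df_features_1gram data frame and the coverage50/75/90 indices computed in the Appendix, might be:

# cumulative unigram coverage with the 50/75/90% thresholds marked
# (the 100000-word cutoff is an arbitrary illustration, not the report's actual value)
df_cov <- head(df_features_1gram, 100000)
plot(seq_len(nrow(df_cov)), df_cov$ratio, type = "l",
     xlab = "number of most frequent words", ylab = "coverage, %")
abline(v = c(coverage50, coverage75, coverage90), lty = 2)
abline(h = c(50, 75, 90), lty = 3)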

To increase coverage with the same set of features I would suggest limiting the area of application. If we explore only the Twitter dataset, for example, we decrease the number of phrases while keeping coverage at the same level. But such data would only be suitable for predicting words in a Twitter app or similar ‘chat-style’ apps.

Data modelling approaches

Building a model is a trade-off between speed and accuracy. The steps, in summary, are as follows:
1. Limit coverage for each n-gram DFM (for example, 50%).
2. Trim the DFM for each n-gram to the defined coverage.
3. Split the sample into train and test datasets.
4. Using the train dataset, build a back-off model over the n-grams to predict the n-th word based on the preceding n-1 words (a minimal sketch follows this list).
5. Validate the results on the test dataset.
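
The sketch below shows the back-off idea at two levels only (bigram to unigram), far from the full 5-gram model, and it relies on the df_topfeatures_* data frames built in the Appendix; the predict_next helper is purely illustrative:

# toy "back-off" lookup: try the bigrams first, fall back to the most frequent unigram
predict_next <- function(prev_words,
                         bigrams  = df_topfeatures_2gram,
                         unigrams = df_topfeatures_1gram) {
    last_word <- tolower(tail(prev_words, 1))
    # candidate bigrams starting with the last observed word ("word1_word2" form)
    hits <- bigrams[startsWith(bigrams$words, paste0(last_word, "_")), ]
    if (nrow(hits) > 0) {
        # return the continuation of the most frequent matching bigram
        sub(".*_", "", hits$words[which.max(hits$count)])
    } else {
        # back off to the most frequent unigram
        unigrams$words[which.max(unigrams$count)]
    }
}
predict_next(c("more"))   # likely returns "than", given the bigram counts above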

RAM size and Processing speed

All the computing below takes no more than 10-12 minutes on a Core i7-4800MQ (2.7 GHz) and allocates 20-22 GB of RAM. At this step CPU time and RAM allocation do not play an important role, because all of this is the ‘training’ phase and could be done on powerful servers. The actual prediction will only use the summarized and aggregated results of this ‘training’.

Appendix

Load Text Documents

con <- file("en_US/en_US.news.txt", "r", blocking = FALSE)
news_doc <- readLines(con)
close(con)
con <- file("en_US/en_US.blogs.txt", "r", blocking = FALSE)
blogs_doc <- readLines(con)
close(con)
con <- file("en_US/en_US.twitter.txt", "r", blocking = FALSE)
twitter_doc <- readLines(con)
close(con)
df_text_summary <- data.frame(
    documents = c("news_doc", "blogs_doc", "twitter_doc"),
    lines = c(length(news_doc), length(blogs_doc), length(twitter_doc)),
    words = c(sum(str_count(news_doc, "\\S+")),
              sum(str_count(blogs_doc, "\\S+")),
              sum(str_count(twitter_doc, "\\S+"))))

Document cleaning

sngl_quot_rx = "[ʻʼʽ٬‘’‚‛՚︐]"
dbl_quot_rx = "[«»““”„‟≪≫《》〝〞〟\"″‶]"
all_doc <- c(news_doc, blogs_doc, twitter_doc)
rm(blogs_doc); rm(news_doc); rm(twitter_doc)
# consolidate different apostrophe variants.
all_doc <- gsub(dbl_quot_rx, "\"", all_doc)
all_doc <- gsub(sngl_quot_rx, "'", all_doc)
# remove hashtags
all_doc <- gsub("[[:blank:]]#[^[:blank:]]*", " ", all_doc, perl = TRUE)
# remove URLs
all_doc <- gsub("(https?)?://[^[:blank:]]*", " ", all_doc, ignore.case = TRUE, perl = TRUE)
# remove advertising blogs
all_doc <- all_doc[-grep("^[[:upper:]]{5}[[:space:]](?=.*Blog)", all_doc, perl = TRUE)]
# remove irrelevant chars (like @$%&*, surrounding quotation marks etc)
all_doc <- gsub("[^[:alnum:]'.?!]", " ", all_doc, perl = TRUE)
# remove surrounding apostrophes (keep the spaces so adjacent words are not glued together)
all_doc <- gsub("[[:blank:]]'([[:alnum:][:blank:]]+)'[[:blank:]]", " \\1 ", all_doc, perl = TRUE)
# remove surrounding whitespace
all_doc <- stri_trim_both(all_doc)
# condense multiple spaces to one
all_doc <- gsub("[[:blank:]]{2,}", " ", all_doc, perl = TRUE)
con <- file("en_US/all_doc_clean.txt")
writeLines(all_doc, con)
close(con)

Make a sample dataset

if(exists("all_doc")) {
    set.seed(123)
    sample_doc <- sample(all_doc, size = length(all_doc)/5, replace = FALSE)
} else {
    con <- file("en_US/all_doc_clean.txt", "r", blocking = FALSE)
    all_doc <- readLines(con)
    close(con)
    set.seed(123)
    sample_doc <- sample(all_doc, size = length(all_doc)/5, replace = FALSE)
}
rm(all_doc)
con <- file(paste("sample_doc.txt"))
writeLines(sample_doc, con)
close(con)

Making general corpus

if(exists("sample_doc")) {
    sample_corpus <- corpus(sample_doc)
} else {
    con <- file("sample_doc.txt", "r", blocking = FALSE)
    sample_doc <- readLines(con)
    close(con)
    sample_corpus <- corpus(sample_doc)
}
# ndoc()/ntoken() cover the whole corpus (summary() only reports the first documents by default)
df_corpus_summary <- data.frame(documents = ndoc(sample_corpus), tokens = sum(ntoken(sample_corpus)))

Making initial stopwords list

my_stopwords <- stopwords("english")[-c(30, 31, 32, 33, 116, 118:119, 126:134, 144:152, 155:166, 168:170, 172:174)]

Build 1-gram DFM

set.seed(123)
sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
sample_tokens <- tokens_tolower(sample_tokens)
sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
sample_tokens_nostopwords_1grams <- tokens_ngrams(sample_tokens, 1)
sample_dfm_1gram <- dfm(sample_tokens_nostopwords_1grams, verbose = FALSE)
rm(sample_tokens_nostopwords_1grams)
save(sample_dfm_1gram, file = "sample_dfm_1gram_file.RData")
df_topfeatures_1gram <- data.frame(topfeatures(sample_dfm_1gram, 100))
names(df_topfeatures_1gram) <- "count"
df_topfeatures_1gram$words <- rownames(df_topfeatures_1gram)
# inspect the 1000 most frequent words to spot abbreviations, interjections and single letters
topfeatures(sample_dfm_1gram, 1000)
# extend the stopword list with them and rebuild the unigram DFM
my_stopwords <- c(my_stopwords, "rt", "lol", "im", "st", "u.s", "p.m", "a.m", "mr", "dr", "ll", "ur", "omg", "co", "oh", "ha", "haha", "la", letters)
sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
sample_tokens_nostopwords_1grams <- tokens_ngrams(sample_tokens, 1)
sample_dfm_1gram <- dfm(sample_tokens_nostopwords_1grams, verbose = FALSE)
rm(sample_tokens_nostopwords_1grams)
save(sample_dfm_1gram, file = "sample_dfm_1gram_file.RData")

Build 2-gram DFM

set.seed(123)
if (!exists("sample_tokens")) {
    sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
    sample_tokens <- tokens_tolower(sample_tokens)
    sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
}
sample_tokens_nostopwords_2grams <- tokens_ngrams(sample_tokens, 2)
sample_dfm_2gram <- dfm(sample_tokens_nostopwords_2grams, verbose = FALSE)
rm(sample_tokens_nostopwords_2grams)
save(sample_dfm_2gram, file = "sample_dfm_2gram_file.RData")
df_topfeatures_2gram <- data.frame(topfeatures(sample_dfm_2gram, 100))
names(df_topfeatures_2gram) <- "count"
df_topfeatures_2gram$words <- rownames(df_topfeatures_2gram)

3-grams DFM

set.seed(123)
if (!exists("sample_tokens")) {
    sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
    sample_tokens <- tokens_tolower(sample_tokens)
    sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
}
sample_tokens_nostopwords_3grams <- tokens_ngrams(sample_tokens, 3)
sample_dfm_3gram <- dfm(sample_tokens_nostopwords_3grams, verbose = FALSE)
rm(sample_tokens_nostopwords_3grams)
save(sample_dfm_3gram, file = "sample_dfm_3gram_file.RData")
df_topfeatures_3gram <- data.frame(topfeatures(sample_dfm_3gram, 100))
names(df_topfeatures_3gram) <- "count"
df_topfeatures_3gram$words <- rownames(df_topfeatures_3gram)

4-grams DFM

set.seed(123)
if (!exists("sample_tokens")) {
    sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
    sample_tokens <- tokens_tolower(sample_tokens)
    sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
}
sample_tokens_nostopwords_4grams <- tokens_ngrams(sample_tokens, 4)
sample_dfm_4gram <- dfm(sample_tokens_nostopwords_4grams, verbose = FALSE)
rm(sample_tokens_nostopwords_4grams)
save(sample_dfm_4gram, file = "sample_dfm_4gram_file.RData")
df_topfeatures_4gram <- data.frame(topfeatures(sample_dfm_4gram, 100))
names(df_topfeatures_4gram) <- "count"
df_topfeatures_4gram$words <- rownames(df_topfeatures_4gram)

5-grams DFM

set.seed(123)
if (!exists("sample_tokens")) {
    sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
    sample_tokens <- tokens_tolower(sample_tokens)
    sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
}
sample_tokens_nostopwords_5grams <- tokens_ngrams(sample_tokens, 5)
sample_dfm_5gram <- dfm(sample_tokens_nostopwords_5grams, verbose = FALSE)
rm(sample_tokens_nostopwords_5grams)
save(sample_dfm_5gram, file = "sample_dfm_5gram_file.RData")
df_topfeatures_5gram <- data.frame(topfeatures(sample_dfm_5gram, 100))
names(df_topfeatures_5gram) <- "count"
df_topfeatures_5gram$words <- rownames(df_topfeatures_5gram)

Get unigram word coverage for a language

df_features_1gram <- data.frame(topfeatures(sample_dfm_1gram, nfeature(sample_dfm_1gram)))
names(df_features_1gram) <- "count"
df_features_1gram$words <- rownames(df_features_1gram)
df_features_1gram$ratio <- cumsum(df_features_1gram$count)/sum(df_features_1gram$count)*100
coverage50 <- min(which(round(df_features_1gram$ratio, 0) == 50))
coverage75 <- min(which(round(df_features_1gram$ratio, 0) == 75))
coverage90 <- min(which(round(df_features_1gram$ratio, 0) == 90))

Saving results

save(df_text_summary, df_features_1gram, sample_dfm_1gram, sample_dfm_2gram, sample_dfm_3gram, sample_dfm_4gram, sample_dfm_5gram, df_topfeatures_1gram, df_topfeatures_2gram, df_topfeatures_3gram, df_topfeatures_4gram, df_topfeatures_5gram, coverage50, coverage75, coverage90, file = "summary_results.Rdata")