This short report is a modest attempt to evaluate the text datasets. It includes basic statistics, a summary, the algorithms and principles used to extract features from the documents and, in fact, the extracted features themselves.
As a non-native English speaker I must admit I will probably never be great at Natural Language Processing (NLP) in English, but I have to do my best, as the course project requires.
Unfortunately, I have not yet had enough time to try NLP on a Russian dataset, but I am going to do that a bit later.
We have three files to ingest for further analysis. Two of them come from social networks (tweets and blog posts); the last one contains news. Predictions based on this limited content could be biased and are certainly not complete for all possible situations in a word-prediction task. I suggest fully loading all three files into document representations in R and converting them into a corpus using the ‘quanteda’ package.
| documents | lines | words |
|---|---|---|
| news_doc | 1010242 | 34372597 |
| blogs_doc | 899284 | 37333760 |
| twitter_doc | 2360149 | 30373791 |
Before building the text corpus we need to clean the text documents. First we merge the three files into one character vector; then, using ‘gsub’, we remove the useless parts; finally we save the cleaned documents to a text file. The cleaning steps are:
* consolidate different apostrophe variants.
* remove hashtags
* remove URLs
* remove advertising blogs
* remove irrelevant chars (like @$%&, surrounding quotation marks, etc)
* remove surrounding apostrophes
* remove surrounding white space
* condense multiple spaces to one
At the end of cleaning we save the result to a text file so it can be re-loaded quickly later.
A Document-Feature Matrix (DFM) holds documents in rows and ‘features’ in columns. Features (sometimes called terms) record the value of each feature within each document, typically a count. Features can be entirely general, such as n-grams or syntactic dependencies. To get a DFM we have to build a text corpus and tokenize it.
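As a toy illustration (not part of the project pipeline; the two sentences are invented for the example), a DFM built with quanteda looks like this:

```r
library(quanteda)

# two tiny documents -> tokens -> document-feature matrix of counts
toy_corpus <- corpus(c(doc1 = "the cat sat on the mat", doc2 = "the dog sat"))
toy_dfm <- dfm(tokens(toy_corpus))
toy_dfm
# rows are documents, columns are features (here single words), cells are counts
```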
The raw corpus consists of the words and punctuation from the cleaned data file. The cleaned data is not yet ready for word prediction; it needs some pre-processing before analysis and modeling. This processing should not affect the meaning of words, or the words themselves, but it is needed to simplify the corpus and make it more uniform.
The most important and complicated task is removing very frequent words that are useless for prediction (so-called ‘stopwords’). For example, pronouns are very frequent yet useless for prediction: there is no way, at least statistically, to predict which pronoun will be used as the next word.
Articles (‘the’, ‘a’, ‘an’) are also frequent and should probably be excluded from prediction. Statistically, articles are used differently with the same nouns, so they cannot be predicted with high accuracy. Moreover, articles are often used ungrammatically, which makes prediction even harder.
Modal verbs (‘must’, ‘should’, ‘could’, ‘will’, etc.) should be excluded too; in my opinion they are frequent and unpredictable.
All forms of ‘to be’ (‘am’, ‘is’, ‘are’, ‘were’, etc.) are unnecessary as well and should be excluded.
Quanteda, like most NLP packages, ships stopword lists for different languages, and the simplest way to clean a corpus is to use one of them. However, the English list contains many frequent words that would be good to predict (‘during’, ‘under’ and so on), so we build our own stopword list based on the standard Quanteda one.
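One way to do this, instead of the positional indexing used in the code listing at the end of the report, is to drop the kept words by name; the particular words below are only an illustration, not the final project list:

```r
library(quanteda)

# standard English stopword list minus the words we still want to predict
# (illustrative selection of kept words)
keep_for_prediction <- c("during", "under", "over", "between", "because")
my_stopwords <- setdiff(stopwords("english"), keep_for_prediction)
```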
By default the Quanteda DFM workflow lowercases text and removes numbers and punctuation, but I include these parameters explicitly in the tokenization step.
The top 100 features of the single-word (uni-gram) DFM are:
or, as a histogram of the top 30:
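The figures are based on the df_topfeatures_* data frames built in the code listing below; a minimal plotting sketch (ggplot2 assumed, not the exact figure code) for the uni-gram case, which applies equally to the n-gram DFMs:

```r
library(ggplot2)

# bar chart of the 30 most frequent uni-gram features
top30 <- head(df_topfeatures_1gram, 30)
ggplot(top30, aes(x = reorder(words, -count), y = count)) +
  geom_col() +
  labs(x = "feature", y = "count", title = "Top 30 uni-gram features") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```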
Let’s take a look at the first 1000 most frequent words. Some of them cannot be part of the prediction process because of their specificity, or because they are abbreviations or interjections. We will remove the following from the DFM (see the extended stopword list in the code listing below):
After improving the stopword list we rebuild the DFM for two-word sequences. The top 50 features of the 2-gram DFM are:
or, as a histogram of the top 30:
‘More than’ is more than popular :)
Let’s look at the histograms of the top 30 features of the 3-, 4- and 5-gram DFMs:
The top features of the 3-gram DFM look as expected: very popular phrases deservedly take the leading places.
The top features of the 4-gram DFM look good from the fourth position onward. The first and second positions are occupied by phrases repeated from a single blog post (easily found with grep in blogs.txt); the third position is formed by unexplained symbols that were not filtered out in the cleaning phase.
The top features of the 5-gram DFM look ‘so-so’ from the fourth position onward as well, with the same repeated phrases from one blog post and the same unexplained symbols. Positions 1 through 3 should be removed before building the prediction model. Most of the remaining positions are occupied by word repetitions, which are not interesting for further word-prediction modeling, and some unexplained symbols repeated five times are still present.
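One possible way to drop those leading positions, assuming we simply remove the first three features of the 4- and 5-gram DFMs (built in the code listing below) by name; a sketch, not necessarily the final filtering:

```r
# remove the repeated blog phrases and stray symbols occupying the top positions
bad_4gram <- names(topfeatures(sample_dfm_4gram, 3))
bad_5gram <- names(topfeatures(sample_dfm_5gram, 3))
sample_dfm_4gram <- dfm_remove(sample_dfm_4gram, bad_4gram)
sample_dfm_5gram <- dfm_remove(sample_dfm_5gram, bad_5gram)
```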
To get the percentage of words covered, let’s return to the uni-gram DFM. Using a cumulative sum we count the contribution of each word to the coverage. The diagram also marks the word counts needed for 50%, 75% and 90% coverage. We intentionally limit the number of words plotted so the figure can be drawn in seconds, even though more than 90% of the corpus is covered.
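A minimal sketch of that coverage plot, assuming the df_features_1gram table and the coverage50/75/90 indices computed in the code listing below (ggplot2 assumed; the 10,000-word cut-off is an arbitrary choice for illustration):

```r
library(ggplot2)

# cumulative coverage of the corpus by the most frequent words,
# limited to the first 10,000 features so the plot renders quickly
cov_df <- head(df_features_1gram, 10000)
cov_df$rank <- seq_len(nrow(cov_df))
ggplot(cov_df, aes(x = rank, y = ratio)) +
  geom_line() +
  geom_vline(xintercept = c(coverage50, coverage75, coverage90), linetype = "dashed") +
  labs(x = "number of most frequent words", y = "corpus coverage, %")
```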
To increase coverage with the same set of features, I suggest limiting the area of application. If we explored only the Twitter dataset, for example, we would decrease the number of phrases while keeping coverage at the same level. But such data would be suitable only for predicting words in a Twitter app or similar ‘chat-style’ apps.
Building a model is a trade-off between speed and accuracy. The steps are as follows:
1. Limit coverage for each n-gram DFM (for example, 50%)
2. Trim each n-gram DFM to the defined coverage
3. Split the sample into training and test datasets
4. Using the training dataset, build a back-off model over the n-grams to predict the n-th word from the preceding n-1 words (a minimal sketch follows this list).
5. Validate the results on the test dataset.
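A minimal sketch of step 4: a simple back-off lookup over n-gram frequency tables taken from the DFMs built in the code listing below. The table format and the predict_next() helper are hypothetical illustrations, not the final model:

```r
# hypothetical frequency tables: named counts of n-grams (words joined by "_"),
# taken directly from the 2- and 3-gram DFMs
ngram_counts <- list(
  `2` = topfeatures(sample_dfm_2gram, nfeat(sample_dfm_2gram)),
  `3` = topfeatures(sample_dfm_3gram, nfeat(sample_dfm_3gram))
)

# back off from the longest available context to shorter ones
predict_next <- function(context_words, tables = ngram_counts) {
  for (n in rev(sort(as.integer(names(tables))))) {
    ctx <- paste(tail(context_words, n - 1), collapse = "_")
    cand <- tables[[as.character(n)]]
    hits <- cand[startsWith(names(cand), paste0(ctx, "_"))]
    if (length(hits) > 0) {
      # return the continuation word of the most frequent matching n-gram
      parts <- strsplit(names(hits)[which.max(hits)], "_", fixed = TRUE)[[1]]
      return(tail(parts, 1))
    }
  }
  NA_character_  # no match at any order
}

# usage sketch: predict the word following "more than"
predict_next(c("more", "than"))
```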
All the computation below takes no more than 10-12 minutes on a Core i7-4800MQ (2.7 GHz) and allocates 20-22 GB of RAM. At this stage CPU time and RAM allocation do not play an important role, because this is the ‘training’ phase and could be run on powerful servers. The prediction itself will use only the summarized and aggregated results of this ‘training’.
con <- file("en_US/en_US.news.txt", "r", blocking = FALSE)
news_doc <- readLines(con)
close(con)
con <- file("en_US/en_US.blogs.txt", "r", blocking = FALSE)
blogs_doc <- readLines(con)
close(con)
con <- file("en_US/en_US.twitter.txt", "r", blocking = FALSE)
twitter_doc <- readLines(con)
close(con)
df_text_summary <- data.frame(documents = c("news_doc", "blogs_doc", "twitter_doc"),
                              lines = c(length(news_doc), length(blogs_doc), length(twitter_doc)),
                              words = c(sum(str_count(news_doc, "\\S+")),
                                        sum(str_count(blogs_doc, "\\S+")),
                                        sum(str_count(twitter_doc, "\\S+"))))

# character classes covering the different single- and double-quote variants
sngl_quot_rx <- "[ʻʼʽ٬‘’‚‛՚︐]"
dbl_quot_rx <- "[«»““”„‟≪≫《》〝〞〟\"″‶]"
all_doc <- c(news_doc, blogs_doc, twitter_doc)
rm(blogs_doc); rm(news_doc); rm(twitter_doc)
# consolidate different apostrophe and quotation-mark variants
all_doc <- gsub(dbl_quot_rx, "\"", all_doc)
all_doc <- gsub(sngl_quot_rx, "'", all_doc)
# remove hashtags
all_doc <- gsub("[[:blank:]]#[^[:blank:]]*", " ", all_doc, perl = TRUE)
# remove URLs
all_doc <- gsub("(https?)?://[^[:blank:]]*", " ", all_doc, ignore.case = TRUE, perl = TRUE)
# remove advertising blogs
ad_idx <- grep("^[[:upper:]]{5}[[:space:]](?=.*Blog)", all_doc, perl = TRUE)
if (length(ad_idx) > 0) all_doc <- all_doc[-ad_idx]  # guard: x[-integer(0)] would drop everything
# remove irrelevant chars (like @$%&*, surrounding quotation marks etc)
all_doc <- gsub("[^[:alnum:]'.?!]", " ", all_doc, perl = TRUE)
# remove surrounding apostrophes (keep the spaces so neighbouring words are not glued together)
all_doc <- gsub("[[:blank:]]'([[:alnum:][:blank:]]+)'[[:blank:]]", " \\1 ", all_doc, perl = TRUE)
# remove surrounding whitespace
all_doc <- stri_trim_both(all_doc)
# condense multiple spaces to one
all_doc <- gsub("[[:blank:]]{2,}", " ", all_doc, perl = TRUE)
con <- file("en_US/all_doc_clean.txt")
writeLines(all_doc, con)
close(con)

# take a 20% sample of the cleaned documents (re-read from disk if needed)
if (exists("all_doc")) {
set.seed(123)
sample_doc <- sample(all_doc, size = length(all_doc)/5, replace = FALSE)
} else {
con <- file("en_US/all_doc_clean.txt", "r", blocking = FALSE)
all_doc <- readLines(con)
close(con)
set.seed(123)
sample_doc <- sample(all_doc, size = length(all_doc)/5, replace = FALSE)
}
rm(all_doc)
con <- file("sample_doc.txt")
writeLines(sample_doc, con)
close(con)

# build the quanteda corpus from the sample (re-read from disk if needed)
if (exists("sample_doc")) {
sample_corpus <- corpus(sample_doc)
} else {
con <- file("sample_doc.txt", "r", blocking = FALSE)
sample_doc <- readLines(con)
close(con)
sample_corpus <- corpus(sample_doc)
}
df_corpus_summary <- data.frame(documents = ndoc(sample_corpus),
                                tokens = sum(ntoken(sample_corpus)))

# standard English stopword list minus the entries we still want to predict
# (e.g. 'during', 'under'); indices refer to quanteda's stopwords("english")
my_stopwords <- stopwords("english")[-c(30, 31, 32, 33, 116, 118:119, 126:134, 144:152, 155:166, 168:170, 172:174)]

set.seed(123)
sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
sample_tokens <- tokens_tolower(sample_tokens)
sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
sample_tokens_nostopwords_1grams <- tokens_ngrams(sample_tokens, 1)
sample_dfm_1gram <- dfm(sample_tokens_nostopwords_1grams, verbose = FALSE)
rm(sample_tokens_nostopwords_1grams)
save(sample_dfm_1gram, file = "sample_dfm_1gram_file.RData")
df_topfeatures_1gram <- data.frame(topfeatures(sample_dfm_1gram, 100))
names(df_topfeatures_1gram) <- "count"
df_topfeatures_1gram$words <- rownames(df_topfeatures_1gram)
topfeatures(sample_dfm_1gram, 1000)
# extend the stopword list with abbreviations, interjections and single letters
# found among the top 1000 features
my_stopwords <- c(my_stopwords, "rt", "lol", "im", "st", "u.s", "p.m", "a.m", "mr", "dr", "ll", "ur", "omg", "co", "oh", "ha", "haha", "la", letters)
sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
sample_tokens_nostopwords_1grams <- tokens_ngrams(sample_tokens, 1)
sample_dfm_1gram <- dfm(sample_tokens_nostopwords_1grams, verbose = FALSE)
rm(sample_tokens_nostopwords_1grams)
save(sample_dfm_1gram, file = "sample_dfm_1gram_file.RData")

set.seed(123)
if (!exists("sample_tokens")) {
sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
sample_tokens <- tokens_tolower(sample_tokens)
sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
}
sample_tokens_nostopwords_2grams <- tokens_ngrams(sample_tokens, 2)
sample_dfm_2gram <- dfm(sample_tokens_nostopwords_2grams, verbose = FALSE)
rm(sample_tokens_nostopwords_2grams)
save(sample_dfm_2gram, file = "sample_dfm_2gram_file.RData")
df_topfeatures_2gram <- data.frame(topfeatures(sample_dfm_2gram, 100))
names(df_topfeatures_2gram) <- "count"
df_topfeatures_2gram$words <- rownames(df_topfeatures_2gram)

set.seed(123)
if (!exists("sample_tokens")) {
sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
sample_tokens <- tokens_tolower(sample_tokens)
sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
}
sample_tokens_nostopwords_3grams <- tokens_ngrams(sample_tokens, 3)
sample_dfm_3gram <- dfm(sample_tokens_nostopwords_3grams, verbose = FALSE)
rm(sample_tokens_nostopwords_3grams)
save(sample_dfm_3gram, file = "sample_dfm_3gram_file.RData")
df_topfeatures_3gram <- data.frame(topfeatures(sample_dfm_3gram, 100))
names(df_topfeatures_3gram) <- "count"
df_topfeatures_3gram$words <- rownames(df_topfeatures_3gram)

set.seed(123)
if (!exists("sample_tokens")) {
sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
sample_tokens <- tokens_tolower(sample_tokens)
sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
}
sample_tokens_nostopwords_4grams <- tokens_ngrams(sample_tokens, 4)
sample_dfm_4gram <- dfm(sample_tokens_nostopwords_4grams, verbose = FALSE)
rm(sample_tokens_nostopwords_4grams)
save(sample_dfm_4gram, file = "sample_dfm_4gram_file.RData")
df_topfeatures_4gram <- data.frame(topfeatures(sample_dfm_4gram, 100))
names(df_topfeatures_4gram) <- "count"
df_topfeatures_4gram$words <- rownames(df_topfeatures_4gram)

set.seed(123)
if (!exists("sample_tokens")) {
sample_tokens <- tokens(sample_corpus, remove_punct = TRUE, remove_numbers = TRUE)
sample_tokens <- tokens_tolower(sample_tokens)
sample_tokens <- tokens_remove(sample_tokens, my_stopwords)
}
sample_tokens_nostopwords_5grams <- tokens_ngrams(sample_tokens, 5)
sample_dfm_5gram <- dfm(sample_tokens_nostopwords_5grams, verbose = FALSE)
rm(sample_tokens_nostopwords_5grams)
save(sample_dfm_5gram, file = "sample_dfm_5gram_file.RData")
df_topfeatures_5gram <- data.frame(topfeatures(sample_dfm_5gram, 100))
names(df_topfeatures_5gram) <- "count"
df_topfeatures_5gram$words <- rownames(df_topfeatures_5gram)

# full frequency table of the uni-gram DFM for the coverage analysis
df_features_1gram <- data.frame(topfeatures(sample_dfm_1gram, nfeat(sample_dfm_1gram)))
names(df_features_1gram) <- "count"
df_features_1gram$words <- rownames(df_features_1gram)
df_features_1gram$ratio <- cumsum(df_features_1gram$count)/sum(df_features_1gram$count)*100
# number of most frequent words needed to reach 50% / 75% / 90% coverage
coverage50 <- min(which(df_features_1gram$ratio >= 50))
coverage75 <- min(which(df_features_1gram$ratio >= 75))
coverage90 <- min(which(df_features_1gram$ratio >= 90))

save(df_text_summary, df_features_1gram, sample_dfm_1gram, sample_dfm_2gram, sample_dfm_3gram, sample_dfm_4gram, sample_dfm_5gram, df_topfeatures_1gram, df_topfeatures_2gram, df_topfeatures_3gram, df_topfeatures_4gram, df_topfeatures_5gram, coverage50, coverage75, coverage90, file = "summary_results.Rdata")