This is the Milestone Report of our project (a collaboration of Coursera and SwiftKey) to create a predictive text model using a large corpus of documents as training data. Based on NLP techniques, our language model built its understanding of the English language from 3 different corpora: a list of blog posts, a list of news articles, and a list of tweets. Below we briefly present the main aspects of our data manipulation, as well as some issues critical to the development of the algorithm that occupied our exploratory data analysis:
library(quanteda)
library(data.table)
library(dplyr)
library(ggplot2)
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
destfile = "Coursera-SwiftKey.zip", method = "curl")
unzip(zipfile = "Coursera-SwiftKey.zip", overwrite = TRUE)
list.files("./final/en_US")
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
twitter.v <- scan("./final/en_US/en_US.twitter.txt", what="character", sep="\n", skipNul = TRUE)
blogs.v <- scan("./final/en_US/en_US.blogs.txt", what="character", sep="\n", skipNul = TRUE)
news.v <- scan("./final/en_US/en_US.news.txt", what="character", sep="\n", skipNul = TRUE)
Having downloaded the basic dataset, loaded the three text corpora and attached the necessary libraries, we should note that for what follows we choose to use all the data rather than a sample. After all, however time consuming they are, the data cleaning, manipulation and exploratory analysis will not have to be repeated every time our prediction algorithm is run.
Some first basic summaries of our corpora
## Text.Corpus File.Size(MB) Num.of.Lines Num.of.Chars Chars/Line Max.Chars/Line
## 1 Twitter 159.3641 2360148 162,096,241 68.68054 140
## 2 Blogs 200.4242 899288 206,824,505 229.98695 40833
## 3 News 196.2775 1010242 203,223,159 201.16285 11384
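For reference, the figures in this table can be reproduced with a few base-R calls along the following lines (a sketch only; corpus_summary() is a hypothetical helper, not part of the original processing script):
# A sketch: per-corpus file size, line count and character statistics
corpus_summary <- function(file, text.v) {
  data.frame(File.Size.MB   = file.info(file)$size / 1024^2,
             Num.of.Lines   = length(text.v),
             Num.of.Chars   = sum(nchar(text.v)),
             Chars.per.Line = mean(nchar(text.v)),
             Max.Chars.Line = max(nchar(text.v)))
}
rbind(Twitter = corpus_summary("./final/en_US/en_US.twitter.txt", twitter.v),
      Blogs   = corpus_summary("./final/en_US/en_US.blogs.txt", blogs.v),
      News    = corpus_summary("./final/en_US/en_US.news.txt", news.v))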
The quanteda library will be the locomotive of our basic manipulations and analysis. Indicative of its capabilities is that in only a few lines we can:
use the tokens() function for tokenization (word segmentation), removing URLs, special characters, punctuation, numbers and excess whitespace, and converting the text to lower case, and
use dfm() and tokens_ngrams() to form and count sequences of N consecutive words, in other words N-grams.
# The datasets are really big and it is practically impossible to process them as wholes
# with ordinary RAM sizes, so we deal with them in chunks of 1000 lines and aggregate all the results accordingly.
twitter_ALL_1_grams.dt <- NULL
chunk_length <- 1000
for (i in 1:(length(twitter.v) %/% chunk_length + 1)) {
  # take the i-th chunk of (at most) chunk_length lines
  text_chunk <- unlist(split(twitter.v, ceiling(seq_along(twitter.v) / chunk_length))[i])
  # strip URLs, e-mail addresses, hashtags and mentions before tokenizing
  toks <- tokens(
    gsub("((([fF]|([hH][tT]))[tT][pP]([sS]?):?[/][/]:?)?([A-Za-z1-9]+(\\.|@))+[A-Za-z1-9]+)|#|@", "", text_chunk),
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_separators = TRUE,
    split_hyphens = FALSE,
    include_docvars = TRUE,
    padding = FALSE
  )
  lower_tokens <- tokens_tolower(toks)
  # count the 1-grams of this chunk ...
  df <- data.frame(ed = colSums(dfm(tokens_ngrams(lower_tokens, n = 1, concatenator = " "))))
  dt <- data.table(x = rownames(df), y = df)
  # ... and append them to the running table
  twitter_ALL_1_grams.dt <- rbind(twitter_ALL_1_grams.dt, dt)
}
# aggregate the per-chunk counts, sort by frequency and add a percentage column
twitter_ALL_1_grams.dt <- twitter_ALL_1_grams.dt[, list(y.ed = sum(y.ed)), by = x]
twitter_ALL_1_grams.dt <- twitter_ALL_1_grams.dt[order(y.ed, decreasing = TRUE), ]
twitter_ALL_1_grams.dt <- cbind(twitter_ALL_1_grams.dt, z = 100 * twitter_ALL_1_grams.dt$y.ed / sum(twitter_ALL_1_grams.dt$y.ed))
save(twitter_ALL_1_grams.dt, file = "twitter_ALL_1_grams.RData")
In the same way as with the Twitter dataset, we process the rest of our corpora (Blogs, News, and all three together).
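For convenience, the same chunked counting can be wrapped in a helper so it can be reused for every corpus and every n. The following is a sketch (count_ngrams() is a hypothetical helper; the actual processing above simply repeats the loop per corpus and per n):
count_ngrams <- function(text.v, n, chunk_length = 1000) {
  result.dt <- NULL
  for (i in 1:(length(text.v) %/% chunk_length + 1)) {
    start <- (i - 1) * chunk_length + 1
    if (start > length(text.v)) break
    text_chunk <- text.v[start:min(i * chunk_length, length(text.v))]
    # same cleaning as above: strip URLs, hashtags and mentions, then tokenize
    toks <- tokens(gsub("((([fF]|([hH][tT]))[tT][pP]([sS]?):?[/][/]:?)?([A-Za-z1-9]+(\\.|@))+[A-Za-z1-9]+)|#|@", "", text_chunk),
                   remove_punct = TRUE, remove_symbols = TRUE,
                   remove_numbers = TRUE, remove_separators = TRUE)
    toks <- tokens_tolower(toks)
    counts <- colSums(dfm(tokens_ngrams(toks, n = n, concatenator = " ")))
    result.dt <- rbind(result.dt, data.table(x = names(counts), y.ed = counts))
  }
  result.dt <- result.dt[, list(y.ed = sum(y.ed)), by = x]
  result.dt[order(y.ed, decreasing = TRUE), ]
}
# e.g. bigram counts for the Blogs corpus:
# blogs_ALL_2_grams.dt <- count_ngrams(blogs.v, n = 2)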
Thus, thanks to the wealth of data offered by these n-gram counts, we arrive at some more interesting insights and summaries:
## Text.Corpus Num.Words Unique.Words Words/Line Num.2.grams Unique.2.grams
## 1 Twitter 29,527,627 393,897 12.51092 25,899,222 5,036,969
## 2 Blogs 36,911,586 375,061 41.04534 36,013,290 6,336,358
## 3 News 33,479,150 336,451 33.13973 32,469,632 6,340,508
## 4 All 99,918,363 815,937 23.40185 94,382,144 14,194,691
Finally, here are the horizontal histograms of the 20 most common unigrams, bigrams, trigrams, four-grams and five-grams.
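As an indication of how such a figure can be produced, here is a minimal ggplot2 sketch for the Twitter unigrams (assuming the twitter_ALL_1_grams.dt table built above; the published figures may differ in styling):
top20.dt <- twitter_ALL_1_grams.dt[1:20, ]
ggplot(top20.dt, aes(x = reorder(x, y.ed), y = y.ed)) +
  geom_col() +
  coord_flip() +
  labs(x = "1-gram", y = "Occurrences", title = "20 most common 1-grams (Twitter)")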
One of the most interesting issues in our project is the management of profanity. At the core of our strategy is the swearWords.txt list from the site http://www.bannedwordlist.com, and the first step is to digest it into a pattern that we can use with grep. We also try to cover the list's plurals:
keywords_filter <- scan("swearWords.txt", what = "character", sep = "\n", skipNul = TRUE)
# build a single regex that matches any of the swear words (and their plurals) as whole words
if (nchar(gsub("[^a-zA-Z0-9]", "", paste(keywords_filter, collapse = ""))) > 0) {
  pattern <- paste("((^| )", paste(keywords_filter, collapse = "($|s | ))|((^| )"), "($|s | ))", sep = "")
}
grep_twitter.v <- grep(pattern, twitter.v, value = TRUE)
A first approach would be to remove each of these swear words from the corpora of all three sources. The problem in this case is that the remaining n-grams would have a semantic problem: the words immediately before and after a removed swear word would end up next to each other, producing meaningless n-grams that would hurt the reliability of our predictions.
An alternative approach would be to remove from our analysis every line that contains one of the swear words. In that case, we could be sure that we do not use “illegal” n-grams. However, we would lose a significant amount of training information, since together with the n-grams that contain swear words we would also reject all the other n-grams offered by the rejected lines. Moreover, the inhomogeneity of the sources and the different average lengths of their lines would lead to an unfair and dangerous underestimation of the information given by each of them (in our case by the blogs, as shown in the next table). This method could certainly be improved by splitting all the lines into sentences and rejecting only the sentences that contain swear word(s) instead of whole lines, as sketched after the table.
## Text.Corpus Lines Total.Words Unique.Words Total.5.grams Unique.5.grams
## 1 Twitter 2.7 % 3 % 9.2 % 3.1 % 3.3 %
## 2 Blogs 1.4 % 2.9 % 12.8 % 3.1 % 3.2 %
## 3 News 0.4 % 0.5 % 6.2 % 0.6 % 0.6 %
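A sketch of the sentence-level variant mentioned above, using quanteda's corpus_reshape() (an illustration only; it is not part of the approach we finally adopt):
# split each tweet into sentences and drop only the sentences matching the profanity pattern
twitter_sentences.v <- as.character(corpus_reshape(corpus(twitter.v), to = "sentences"))
twitter_clean.v <- twitter_sentences.v[!grepl(pattern, twitter_sentences.v)]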
A more practical and reliable approach for our predictions, however, is a third one. We will leave the swear words in the body of our information, i.e. in the n-grams that will train our algorithm, and simply ban them from its final predictions. If the 1st, 2nd or n-th suggested word (prediction) of our algorithm turns out to be a swear word, the algorithm will omit it and return the next suggestion.
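A minimal sketch of this last strategy (filter_profanity() is a hypothetical helper; it assumes a character vector of ranked candidate words and the keywords_filter list loaded above):
# drop any banned word from the ranked suggestions; the first remaining
# candidate is what the application will actually display
filter_profanity <- function(candidate_words, banned = keywords_filter) {
  candidate_words[!(tolower(candidate_words) %in% tolower(banned))]
}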
Besides the previous tables and figures regarding our different text corpora, the next figure is essential for a more thorough and global picture of their particularities. Here we can get a good impression of the distribution of the words in these corpora, and especially of its variance. We can see, for instance, that the “News” corpus consistently needs more unique n-grams from a frequency-sorted dictionary to cover 50% of all n-gram instances. For all three corpora, as well as for the combined corpus, it is evident that for the 1-grams (single words) about 2.5% of the unique words account for 90% of all word occurrences; in the case of 2-grams, about 5% account for 70%. We can also see that as we go from 1-grams to 5-grams, the initially angular curve becomes more and more linear.
In any case (even for 5-grams) a small number of n-grams are far more popular than the rest, and from some point onwards an increasingly huge number of n-grams have very few instances. At the same time, as n increases, we observe the expected explosion of the number of distinct unique n-grams.
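The coverage figures quoted above can be read directly off the frequency-sorted n-gram tables; a minimal sketch follows (coverage() is a hypothetical helper), shown here for the Twitter unigrams:
# share of the frequency-sorted unique n-grams needed to cover a target share of all occurrences
coverage <- function(ngrams.dt, target = 0.9) {
  cum_share <- cumsum(ngrams.dt$y.ed) / sum(ngrams.dt$y.ed)
  which(cum_share >= target)[1] / nrow(ngrams.dt)
}
coverage(twitter_ALL_1_grams.dt, target = 0.5)  # fraction of unique words covering 50% of occurrences
coverage(twitter_ALL_1_grams.dt, target = 0.9)  # fraction of unique words covering 90% of occurrences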
The development of our algorithm is mainly concerned with all the above, namely the best possible management of the available information. Under ideal computational conditions, these observations would be of purely linguistic and sociological interest, and our algorithm would be trained with all the available data. However, this would result in an extremely heavy algorithm and prediction application that, however accurate, might not manage to generate usable and useful predictions in time (the application must suggest next words in live texting time) given the usual RAM sizes (8-16 GB). Once again, there is a need for a trade-off between speed and accuracy.
This trade-off will be based on selecting only those n-grams whose number of instances/occurrences is above a certain threshold. In our case, as can be seen from the following figures, this threshold will be a minimum of 4 instances.
We will thus build a 5-gram probabilistic language model assigning probabilities to sequences of words and estimating the probability of the last word of our 5-gram given the previous 4 words.
The different n-grams (5-grams, 4-grams, 3-grams, 2-grams and 1-grams) we will use to “train” our model (allocate probabilities) will consist only of those encountered at least 4 times across all our corpora.
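The pruning itself is just a one-line filter on each frequency table; a sketch, shown here for the Twitter unigram table (the same filter is applied to the combined tables of every order):
min_count <- 4
twitter_pruned_1_grams.dt <- twitter_ALL_1_grams.dt[y.ed >= min_count]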
We will then use Stupid Backoff to rank next-word candidates. That is to say, whenever a prediction cannot be assigned from the 5-grams (either because no matching 5-gram exists or because two 5-grams have the same probability), the predictions will be based on the available 4-grams (conditioned only on the previous 3 words), and so on, until, when necessary, our model simply returns the most popular 1-grams (single words), practically regardless of the previous words.
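A sketch of this ranking scheme follows; alpha = 0.4 is the back-off penalty commonly used with Stupid Backoff, and predict_next() and ngrams.list are hypothetical, assuming one pruned data.table per n-gram order with columns x (the n-gram, words separated by spaces) and y.ed (its count). The scores here are relative frequencies among the observed continuations, which is an approximation of the original formulation once low-count n-grams have been pruned.
predict_next <- function(prev_words, ngrams.list, alpha = 0.4, top = 3) {
  n_max <- min(length(prev_words) + 1, 5)
  scores.dt <- data.table(word = character(), score = numeric())
  for (n in n_max:2) {
    context <- paste(tail(prev_words, n - 1), collapse = " ")
    # observed n-grams starting with the context
    cand.dt <- ngrams.list[[n]][startsWith(x, paste0(context, " "))]
    if (nrow(cand.dt) > 0) {
      cand.dt[, word := sub(".* ", "", x)]
      # relative frequency among the observed continuations, discounted by
      # alpha for every back-off step below the highest order
      cand.dt[, score := alpha^(n_max - n) * y.ed / sum(y.ed)]
      scores.dt <- rbind(scores.dt, cand.dt[, list(word, score)])
    }
  }
  if (nrow(scores.dt) == 0) {
    # nothing matched at any order: fall back to the most popular single words
    scores.dt <- ngrams.list[[1]][, list(word = x, score = alpha^(n_max - 1) * y.ed / sum(y.ed))]
  }
  scores.dt <- scores.dt[, list(score = max(score)), by = word]
  scores.dt <- scores.dt[order(score, decreasing = TRUE), ]
  # ban profanity from the final suggestions, as discussed above
  head(scores.dt[!(word %in% keywords_filter), word], top)
}
# e.g. predict_next(c("at", "the", "end", "of"), ngrams.list)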
Finally, in the case of profanity, the corresponding predictions will be ignored; in other words, the model will suggest the next most likely option.