This report is part of the Data Science Capstone Word Prediction Project. Its goal is the exploratory analysis of the Twitter, blogs, and news text corpora. The report explains how the data were obtained, sampled, and preprocessed, describes the steps of the analysis, and covers the construction of n-grams. The main features are illustrated in various plots and tables. The report ends with interesting findings and the plans for creating the prediction algorithm and the Shiny application that will predict the next word of an arbitrary text.
The text data sets are from HC Corpora (www.corpora.heliohost.org). They were collected by a web crawler from publicly available sources of three distinct types: news, blogs, and Twitter feeds. The following analysis focuses on the American English dataset.
The data for this report was downloaded and unzipped from the Coursera site: Capstone Dataset.
The following table lists the main features of the three text sources.
| Source | File Size [MB] | Lines (Documents) | Words | Characters | Avg. # Words per Line | Max. # Words per Line |
|---|---|---|---|---|---|---|
| blogs | 200.42 | 899,288 | 37,546,246 | 206,824,382 | 41.8 | 6,726 |
| news | 196.28 | 1,010,242 | 34,762,395 | 203,223,154 | 34.4 | 1,796 |
| twitter | 159.36 | 2,360,148 | 30,093,410 | 162,096,241 | 12.8 | 47 |
| total | 556.07 | 4,269,678 | 102,402,051 | 572,143,777 | 24.0 | 6,726 |
The goal of the project is next-word prediction for an arbitrary (English) text. Therefore, it is interesting to inspect the distribution of words in the documents.
Blog and news documents are on average longer (42 and 34 words, respectively) than Twitter documents (13 words). On the other hand, the blogs and news corpora contain some very long documents, with up to over 6,000 and 1,700 words, respectively. The Twitter corpus has a maximum document length of only 47 words, due to the 140-character limit of Twitter feeds.
The following figures show the density of the word-count distribution for the three text corpora (only documents with at most 100 words). Many blog documents contain only a few words, but there is a long 'tail' of longer blog posts. News documents do not have such a distinct tail, and Twitter documents have no tail at all.
Since the provided data files are quite large, a 20% random sample of the combined datasets was extracted and used for further preprocessing and n-gram creation.
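As an illustration, such a sample could be drawn line by line; the file names and the fixed seed below are assumptions, not necessarily the exact code used for this report:

```r
library(readr)

set.seed(1234)   # assumed seed for reproducible sampling
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# keep each line of the combined corpora with probability 0.20
sample_text <- unlist(lapply(files, function(f) {
  lines <- read_lines(f)
  lines[rbinom(length(lines), size = 1, prob = 0.20) == 1]
}))
```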
Before starting the creation of n-grams, it is necessary to do some preprocessing and cleaning steps:
Stopwords are not removed, because the model should be able to analyse and predict stopwords.
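As a minimal sketch (using base R string functions; the concrete cleaning steps applied for this report may differ), cleaning of this kind could look as follows:

```r
clean_text <- tolower(sample_text)                           # convert to lower case
clean_text <- iconv(clean_text, "UTF-8", "ASCII", sub = "")  # drop non-ASCII characters
clean_text <- gsub("http\\S+|www\\.\\S+", " ", clean_text)   # remove URLs
clean_text <- gsub("\\s+", " ", clean_text)                  # collapse repeated whitespace
```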
In spite of the preprocessing and cleaning, some unusual tokens were still found, e.g.:
N-grams are the basis of the word prediction application. An n-gram is a contiguous sequence of n items from a given sequence of text (Wikipedia: n-gram). N-grams are calculated from a text corpus; each n-gram is associated with its frequency, i.e. the number of times it occurred in that corpus.
According to the widely used Markov process model in Natural Language Processing (NLP), the probability of the next word in a sentence can be computed from its previous words. The combination of the previous (n-1) words plus the next word is modelled as an n-gram. To predict the next word of a sentence, the algorithm takes, e.g., the last 3 words and looks for all 4-grams whose first 3 words match them. The predicted next word is then the last word of the matching 4-gram with the highest frequency.
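A minimal sketch of this lookup, assuming the 4-gram frequency table created below is loaded as a data.table named `gramfreq4` with columns `ngram` and `freq` (the helper name is illustrative):

```r
library(data.table)

predict_from_4grams <- function(last3, gramfreq4) {
  prefix <- paste0("^", paste(last3, collapse = " "), " ")
  cand   <- gramfreq4[grepl(prefix, ngram)]      # 4-grams whose first 3 words match
  if (nrow(cand) == 0) return(NA_character_)
  best <- cand[which.max(freq), ngram]           # most frequent matching 4-gram
  tail(strsplit(best, " ")[[1]], 1)              # its last word is the predicted word
}

predict_from_4grams(c("thanks", "for", "the"), gramfreq4)
```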
The prediction can be improved by smoothing techniques, e.g. linear interpolation (taking a weighted mean of the unigram, bigram, and trigram predictions) and discounting methods.
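For example, simple linear interpolation of a trigram model takes a weighted mean of the maximum-likelihood estimates of the different n-gram orders (the weights $\lambda_i$ are not fixed here; they would have to be tuned, e.g. on held-out data):

$$P(w_i \mid w_{i-2}, w_{i-1}) \approx \lambda_3\, P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, P_{ML}(w_i \mid w_{i-1}) + \lambda_1\, P_{ML}(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1.$$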
The n-gram generation starts with the creation of a corpus from the sampled text. A corpus is a large and structured set of texts (Wikipedia: Text corpus). From the corpus, all relevant n-grams (1-, 2-, 3-, and 4-grams) are created using the document-feature matrix (dfm) method of the R quanteda package. The n-gram frequency information is then saved to file.
library(quanteda)
library(data.table)
library(dplyr)

# create a quanteda corpus from the sampled text
textCorpus <- corpus(sample_text)

# loop over 1-gram to 4-gram
kList <- seq(1, 4)
for (k in kList) {
  # document-feature matrix of k-grams, with cleaning options applied
  gramDfm <- dfm(textCorpus,
                 removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
                 removeTwitter = TRUE, ngrams = k, concatenator = " ")
  # calculate n-gram frequencies and sort n-grams by decreasing frequency
  gramfreq <- data.table(ngram = colnames(gramDfm), freq = colSums(gramDfm)) %>%
    arrange(desc(freq))
  # save the n-gram frequency table to file ('smpl' identifies the sample)
  save(gramfreq, file = paste0('data/ngram_', smpl, '_', k, '.RData'))
}
library(ggplot2)

# gramfreqList: loaded n-gram frequency tables; titleList/termList: plot titles and axis labels
plotList <- list()
for (k in kList) {
  gramfreq <- gramfreqList[[k]][1:10, ]   # 10 most frequent k-grams
  title    <- titleList[[k]]
  term     <- termList[[k]]
  plot <- ggplot(gramfreq, aes(x = reorder(ngram, freq), y = freq, width = 0.6)) +
    geom_bar(stat = "identity", fill = "black") +
    geom_text(aes(label = freq), vjust = +0.3, hjust = -0.2) + ggtitle(title) +
    ylab("Frequency") + xlab(term) +
    theme(plot.title = element_text(lineheight = .8, face = "bold"),
          axis.text = element_text(size = 11)) +
    ylim(0, 1.05 * gramfreq[1, ]$freq) + coord_flip()
  plotList[[length(plotList) + 1]] <- plot
}
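The four plots can then be arranged in a grid, e.g. with the gridExtra package (the 2×2 layout is an assumption):

```r
library(gridExtra)
grid.arrange(grobs = plotList, ncol = 2)   # display the four n-gram plots side by side
```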
The following figures show the 10 most frequent 1-, 2-, 3-, and 4-grams and their frequencies in the sampled text.
It can be recognized that:
The following table shows, for each n-gram order, the total number of n-grams, the coverage of the top 20 n-grams, the share of n-grams with frequency 1, and the number of n-grams necessary to cover 50% and 90% of all n-gram occurrences in the sampled text.
| | Total # n-grams | Top 20 n-gram coverage [%] | n-grams with freq 1 [%] | # n-grams for 50% coverage | # n-grams for 90% coverage |
|---|---|---|---|---|---|
| 1-grams (unigrams) | 314,288 | 27.67 | 57.12 | 142 | 7,998 |
| 2-grams (bigrams) | 4,459,378 | 3.13 | 73.63 | 39,759 | 2,544,441 |
| 3-grams (trigrams) | 11,431,454 | 0.34 | 86.86 | 2,280,991 | 9,601,362 |
| 4-grams | 15,337,190 | 0.10 | 94.65 | 6,599,133 | 13,589,579 |
While the top 20 unigrams cover about 28% of all word occurrences in the text, the top 20 bigrams cover only 3.1% of all bigram occurrences, and the top 20 4-grams cover only 0.1% of all 4-gram occurrences. 90% coverage is achieved by the top 7,998 unigrams. To achieve the same coverage for bigrams and trigrams, about 2,544,441 bigrams and 9,601,362 trigrams are necessary.
To achieve a good representation of the sampled text (90% coverage), only about 8,000 unigrams, but millions of bigrams and trigrams are necessary.
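These coverage figures can be derived directly from the sorted frequency tables; a minimal sketch, assuming `gramfreq` is sorted by decreasing `freq` as above:

```r
# cumulative share of all n-gram occurrences covered by the most frequent n-grams
coverage <- cumsum(gramfreq$freq) / sum(gramfreq$freq)

top20_coverage  <- 100 * coverage[20]              # coverage of the top 20 n-grams [%]
n_for_50        <- which(coverage >= 0.5)[1]       # number of n-grams for 50% coverage
n_for_90        <- which(coverage >= 0.9)[1]       # number of n-grams for 90% coverage
singleton_share <- 100 * mean(gramfreq$freq == 1)  # share of n-grams with frequency 1 [%]
```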
Given the limited storage space of the Shiny application environment:
The occurrence of unusual tokens in the n-gram data illustrates the importance of data cleaning and preprocessing. It still has to be explored what to do with those unusual tokens: should they be replaced by meta-tokens like TIME, DATE, NUMBER, EMAIL, or LINK? Is it necessary to split larger documents into sentences? Is it reasonable to remove profane words?
To guarantee good performance of the final word prediction application, the most important n-grams must be stored for efficient/fast access. Some aspects are:
The model to be developed should predict the next word of an arbitrary text. It should calculate the word prediction as a weighted combination of the predictions from the 4-gram, 3-gram, 2-gram, and 1-gram information. The weights have to be trained on a suitable training set obtained from the text corpus. Various algorithms should be explored to enhance the prediction accuracy.
Planned prediction algorithms:
Questions:
Skip-grams are a generalization of n-grams in which the words need not be consecutive, but may leave gaps that are skipped over. They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis (Wikipedia: n-gram).
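In recent versions of the quanteda package used above, skip-grams can be generated with tokens_ngrams() and its skip argument; a small illustration:

```r
library(quanteda)

toks <- tokens("the quick brown fox jumps")
# bigrams allowing 0 or 1 skipped word, e.g. "the_quick", "the_brown", "quick_brown", ...
tokens_ngrams(toks, n = 2, skip = 0:1)
```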
The final Shiny application should comprise a user interface in which the user can type words (part of a sentence), and the application should predict the next word.
The input text should be tokenized and preprocessed (toLower, remove punctuation, …), and the last (up to 3) words should be used to predict the most probable next word on the basis of the stored n-gram frequency tables and the word prediction model.
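A minimal sketch of this input handling (the helper name and the regular expression are illustrative):

```r
prepare_input <- function(text, n = 3) {
  text  <- tolower(text)                          # toLower
  text  <- gsub("[^a-z' ]", " ", text)            # keep only letters, apostrophes, spaces
  words <- strsplit(trimws(text), "\\s+")[[1]]    # tokenize on whitespace
  tail(words, n)                                  # last (up to n) words for the n-gram lookup
}

prepare_input("Thanks for the")   # -> "thanks" "for" "the"
```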