This report is part of the Data Science Capstone Word Prediction Project. Its goal is the exploratory analysis of the Twitter, blogs, and news text corpora. The report explains how the data were obtained, sampled, and preprocessed, describes the steps of the analysis, and covers the construction of n-grams. The main features are illustrated in various plots and tables. The report ends with interesting findings and the plans for creating the prediction algorithm and the Shiny application that will predict the next word of an arbitrary text.
The text data sets are from HC Corpora (www.corpora.heliohost.org). They were collected by a web crawler from publicly available sources of three distinct types: news, blogs, and Twitter feeds. The following analysis focuses on the American English dataset.
The data for this report was downloaded and unzipped from the Coursera site: Capstone Dataset.
The following table lists the main features of the three text sources.
| Source | File Size [MB] | Lines (Documents) | Words | Characters | Avg. # Words per Line | Max. # Words per Line |
|---|---|---|---|---|---|---|
| blogs | 200.42 | 899,288 | 37,546,246 | 206,824,382 | 41.8 | 6,726 |
| news | 196.28 | 1,010,242 | 34,762,395 | 203,223,154 | 34.4 | 1,796 |
| twitter | 159.36 | 2,360,148 | 30,093,410 | 162,096,241 | 12.8 | 47 |
| total | 556.07 | 4,269,678 | 102,402,051 | 572,143,777 | 24.0 | 6,726 |
The goal of the project is next-word prediction for an arbitrary (English) text. Therefore, it is interesting to inspect the distribution of words in the documents.
Blog and news documents are on average longer (42 and 34 words, respectively) than Twitter documents (13 words). On the other hand, the blogs and news corpora contain some very long documents, with up to over 6,000 and 1,700 words, respectively. The Twitter corpus has a maximum document length of only 47 words, due to the 140-character limit of Twitter feeds.
The following figures show the density of the word-count distribution for the three text corpora (only documents with at most 100 words). Many blog documents contain only a few words, but there is a long 'tail' of longer blog posts. News documents do not have such a distinct tail, and Twitter documents have no tail at all.
Since the provided data files are quite large, a 20% random sample of the combined datasets was extracted and used for further preprocessing and n-gram creation.
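As an illustration, such a sample could be drawn line by line; the file names and the fixed seed below are assumptions, not necessarily the exact code used for this report:

```r
library(readr)

set.seed(1234)   # assumed seed for reproducible sampling
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# keep each line of the combined corpora with probability 0.20
sample_text <- unlist(lapply(files, function(f) {
  lines <- read_lines(f)
  lines[rbinom(length(lines), size = 1, prob = 0.20) == 1]
}))
```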
Before starting the creation of n-grams, it is necessary to do some preprocessing and cleaning steps:
Stopwords are not removed, because the model should be able to analyse and predict stopwords.
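As a minimal sketch (using base R string functions; the concrete cleaning steps applied for this report may differ), cleaning of this kind could look as follows:

```r
clean_text <- tolower(sample_text)                           # convert to lower case
clean_text <- iconv(clean_text, "UTF-8", "ASCII", sub = "")  # drop non-ASCII characters
clean_text <- gsub("http\\S+|www\\.\\S+", " ", clean_text)   # remove URLs
clean_text <- gsub("\\s+", " ", clean_text)                  # collapse repeated whitespace
```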
In spite of the preprocessing and cleaning, some unusual tokens were still found, e.g.:
N-grams are the basis of the word prediction application. An n-gram is a contiguous sequence of n items from a given sequence of text (Wikipedia: n-gram). N-grams are calculated from a text corpus; each n-gram is associated with its frequency, i.e. the number of times it occurred in that corpus.
According to the widely used Markov process model in Natural Language Processing (NLP), the probability of the next word in a sentence can be computed from its previous words. The combination of the previous (n-1) words plus the next word is modelled as an n-gram. To predict the next word of a sentence, the algorithm takes, e.g., the last 3 words and looks for all 4-grams whose first 3 words match them. The predicted next word is then the last word of the matching 4-gram with the highest frequency.
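A minimal sketch of this lookup, assuming the 4-gram frequency table created below is loaded as a data.table named `gramfreq4` with columns `ngram` and `freq` (the helper name is illustrative):

```r
library(data.table)

predict_from_4grams <- function(last3, gramfreq4) {
  prefix <- paste0("^", paste(last3, collapse = " "), " ")
  cand   <- gramfreq4[grepl(prefix, ngram)]      # 4-grams whose first 3 words match
  if (nrow(cand) == 0) return(NA_character_)
  best <- cand[which.max(freq), ngram]           # most frequent matching 4-gram
  tail(strsplit(best, " ")[[1]], 1)              # its last word is the predicted word
}

predict_from_4grams(c("thanks", "for", "the"), gramfreq4)
```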
The prediction can be improved by smoothing techniques, e.g. linear interpolation (taking a weighted mean of the unigram, bigram, and trigram predictions) and discounting methods.
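For example, simple linear interpolation of a trigram model takes a weighted mean of the maximum-likelihood estimates of the different n-gram orders (the weights $\lambda_i$ are not fixed here; they would have to be tuned, e.g. on held-out data):

$$P(w_i \mid w_{i-2}, w_{i-1}) \approx \lambda_3\, P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, P_{ML}(w_i \mid w_{i-1}) + \lambda_1\, P_{ML}(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1.$$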
The n-gram generation starts with the creation of a corpus from the sampled text. A corpus is a large and structured set of texts (Wikipedia: Text corpus). From the corpus, all relevant n-grams (1-, 2-, 3-, and 4-grams) are created using the document-feature matrix (dfm) method of the R quanteda package. The n-gram frequency information is then saved to file.
library(quanteda)
library(data.table)
library(dplyr)

# create a quanteda corpus from the sampled text
textCorpus <- corpus(sample_text)

# loop over 1-gram to 4-gram
kList <- seq(1, 4)
for (k in kList) {
  # document-feature matrix of k-grams, with cleaning options applied
  gramDfm <- dfm(textCorpus,
                 removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
                 removeTwitter = TRUE, ngrams = k, concatenator = " ")
  # calculate n-gram frequencies and sort n-grams by decreasing frequency
  gramfreq <- data.table(ngram = colnames(gramDfm), freq = colSums(gramDfm)) %>%
    arrange(desc(freq))
  # save the n-gram frequency table to file ('smpl' identifies the sample)
  save(gramfreq, file = paste0('data/ngram_', smpl, '_', k, '.RData'))
}
library(ggplot2)

# gramfreqList: loaded n-gram frequency tables; titleList/termList: plot titles and axis labels
plotList <- list()
for (k in kList) {
  gramfreq <- gramfreqList[[k]][1:10, ]   # 10 most frequent k-grams
  title    <- titleList[[k]]
  term     <- termList[[k]]
  plot <- ggplot(gramfreq, aes(x = reorder(ngram, freq), y = freq, width = 0.6)) +
    geom_bar(stat = "identity", fill = "black") +
    geom_text(aes(label = freq), vjust = +0.3, hjust = -0.2) + ggtitle(title) +
    ylab("Frequency") + xlab(term) +
    theme(plot.title = element_text(lineheight = .8, face = "bold"),
          axis.text = element_text(size = 11)) +
    ylim(0, 1.05 * gramfreq[1, ]$freq) + coord_flip()
  plotList[[length(plotList) + 1]] <- plot
}
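The four plots can then be arranged in a grid, e.g. with the gridExtra package (the 2×2 layout is an assumption):

```r
library(gridExtra)
grid.arrange(grobs = plotList, ncol = 2)   # display the four n-gram plots side by side
```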
The following figures show the 10 most frequent 1-, 2-, 3-, and 4-grams and their frequencies in the sampled text.
It can be recognized that:
The following table shows, for each n-gram order, the total number of n-grams, the coverage of the top 20 n-grams, the share of n-grams with frequency 1, and the number of n-grams necessary to cover 50% and 90% of all n-gram occurrences in the sampled text.
| | Total # n-grams | Top 20 n-gram coverage [%] | n-grams with freq 1 [%] | # n-grams for 50% coverage | # n-grams for 90% coverage |
|---|---|---|---|---|---|
| 1-grams (unigrams) | 314,288 | 27.67 | 57.12 | 142 | 7,998 |
| 2-grams (bigrams) | 4,459,378 | 3.13 | 73.63 | 39,759 | 2,544,441 |
| 3-grams (trigrams) | 11,431,454 | 0.34 | 86.86 | 2,280,991 | 9,601,362 |
| 4-grams | 15,337,190 | 0.10 | 94.65 | 6,599,133 | 13,589,579 |
While the top 20 unigrams cover about 28% of all word occurrences in the text, the top 20 bigrams cover only 3.1% of all bigram occurrences, and the top 20 4-grams cover only 0.1% of all 4-gram occurrences. 90% coverage is achieved by the top 7,998 unigrams. To achieve the same coverage for bigrams and trigrams, about 2,544,441 bigrams and 9,601,362 trigrams are necessary.
To achieve a good representation of the sampled text (90% coverage), only about 8,000 unigrams, but millions of bigrams and trigrams are necessary.
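These coverage figures can be derived directly from the sorted frequency tables; a minimal sketch, assuming `gramfreq` is sorted by decreasing `freq` as above:

```r
# cumulative share of all n-gram occurrences covered by the most frequent n-grams
coverage <- cumsum(gramfreq$freq) / sum(gramfreq$freq)

top20_coverage  <- 100 * coverage[20]              # coverage of the top 20 n-grams [%]
n_for_50        <- which(coverage >= 0.5)[1]       # number of n-grams for 50% coverage
n_for_90        <- which(coverage >= 0.9)[1]       # number of n-grams for 90% coverage
singleton_share <- 100 * mean(gramfreq$freq == 1)  # share of n-grams with frequency 1 [%]
```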
Given the limited storage space of the Shiny application environment:
The occurrence of unusual tokens in the n-gram data illustrates the importance of data cleaning and preprocessing. It still has to be explored what to do with those unusual tokens: should they be replaced by meta-tokens like TIME, DATE, NUMBER, EMAIL, or LINK? Is it necessary to split larger documents into sentences? Is it reasonable to remove profane words?
To guarantee good performance of the final word prediction application, the most important n-grams must be stored for efficient/fast access. Some aspects are:
The model to be developed should predict the next word of an arbitrary text. It should calculate the word prediction as a weighted combination of the predictions from the 4-gram, 3-gram, 2-gram, and 1-gram information. The weights have to be trained on a suitable training set obtained from the text corpus. Various algorithms should be explored to enhance the prediction accuracy.
Planned prediction algorithms:
Questions:
Skip-grams are a generalization of n-grams in which the words need not be consecutive, but may leave gaps that are skipped over. They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis (Wikipedia: n-gram).
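In recent versions of the quanteda package used above, skip-grams can be generated with tokens_ngrams() and its skip argument; a small illustration:

```r
library(quanteda)

toks <- tokens("the quick brown fox jumps")
# bigrams allowing 0 or 1 skipped word, e.g. "the_quick", "the_brown", "quick_brown", ...
tokens_ngrams(toks, n = 2, skip = 0:1)
```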
The final Shiny application should comprise a user interface in which the user can type words (part of a sentence), and the application should predict the next word.
The input text should be tokenized and preprocessed (toLower, remove punctuation, …), and the last (up to 3) words should be used to predict the most probable next word on the basis of the stored n-gram frequency tables and the word prediction model.
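A minimal sketch of this input handling (the helper name and the regular expression are illustrative):

```r
prepare_input <- function(text, n = 3) {
  text  <- tolower(text)                          # toLower
  text  <- gsub("[^a-z' ]", " ", text)            # keep only letters, apostrophes, spaces
  words <- strsplit(trimws(text), "\\s+")[[1]]    # tokenize on whitespace
  tail(words, n)                                  # last (up to n) words for the n-gram lookup
}

prepare_input("Thanks for the")   # -> "thanks" "for" "the"
```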