1 Executive summary

This report is part of the Coursera data science capstone project. Given a large corpus of text documents in the form of blog entries, tweets and news articles, the goal of this week’s task is to perform a first exploratory analysis and to outline the approach for building a next-word prediction model.

Given the considerable size of the whole corpus, I analysed a subsample that covers approximately 10% of the English texts. As expected, the distribution of words in the data set is heavily skewed. Since the goal of this project is a next-word prediction model, I nevertheless decided to preprocess the corpus without removing stopwords. I also analysed the relative frequencies of bigrams and trigrams within the corpus. Since the most frequent bigrams have higher probabilities than the most frequent trigrams, I plan to use the bigram model for word prediction.

2 Exploratory data analysis

In my analysis, I focused on the English texts of the SwiftKey dataset, which are provided as three text files. Each line of a file holds a single contribution: a blog entry, a news article or a tweet.
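The summary mentions that only about 10% of the English texts are analysed. The report does not say how that subsample was drawn, so the following is only a minimal Python sketch of one plausible approach: keeping each line independently with a fixed probability. The function name `sample_lines`, the seed and the toy corpus are assumptions made for illustration.

```python
import random


def sample_lines(lines, fraction=0.1, seed=42):
    """Keep each line independently with probability `fraction`.

    Assumption: one line corresponds to one document (blog entry,
    news article or tweet), as in the three input files.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [line for line in lines if rng.random() < fraction]


# Hypothetical stand-in for the lines read from the three text files.
corpus = [f"document {i}" for i in range(10_000)]
subsample = sample_lines(corpus)
print(len(subsample) / len(corpus))  # roughly 0.1
```

Sampling line by line keeps memory use low, because the full corpus never has to be tokenised before it is reduced.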

In total, there are 2545 words that occur in at least 0.1% of the documents. The following graph visualizes the number of occurrences of terms that occur more than 15000 times in the corpus.

The following table lists the 10 least frequently occurring of these terms, together with their number of occurrences in the corpus:

lesson  raising  pointed  rick  supporting  numerous  determine  declined  regardless  upcoming
286     285      285      285   284         284       283        283       282         280
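The filter described above keeps only terms that occur in at least 0.1% of the documents. A minimal Python sketch of such a document-frequency filter is shown below. It assumes whitespace tokenisation and lower-casing, which may differ from the preprocessing actually used, and the name `frequent_terms` is hypothetical.

```python
from collections import Counter


def frequent_terms(documents, min_doc_share=0.001):
    """Return {term: total count} for terms that occur in at least
    `min_doc_share` of the documents."""
    doc_freq = Counter()   # number of documents containing the term
    term_freq = Counter()  # total number of occurrences of the term
    for doc in documents:
        tokens = doc.lower().split()  # assumption: simple whitespace tokens
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    min_docs = min_doc_share * len(documents)
    return {t: c for t, c in term_freq.items() if doc_freq[t] >= min_docs}


# Toy example with a 50% threshold instead of 0.1%.
docs = ["the cat sat", "the dog sat", "a bird flew", "the end"]
kept = frequent_terms(docs, min_doc_share=0.5)
print(kept)  # {'the': 3, 'sat': 2}
```

Sorting the returned counts in ascending order and taking the first ten entries would give a table like the one above.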

The following word cloud shows the most common words:

3 Analysis of N-grams

For a first evaluation of n-grams, I used the R package ngram because of its good performance on large data sets. Not surprisingly, I find that the probability of occurrence is higher for the most frequent bigrams than for the most frequent trigrams. For this reason, I plan to use the bigram model for my prediction algorithm.

Bigrams
Bigram        Frequency  Probability
of the        37138      0.0046946
in the        34211      0.0043246
to the        18060      0.0022830
on the        15893      0.0020090
for the       14849      0.0018771
to be         13005      0.0016440
at the        11412      0.0014426
and the       11127      0.0014066
in a          10014      0.0012659
with the      8895       0.0011244

Trigrams
Trigram       Frequency  Probability
one of the    2909       0.0003677
a lot of      2548       0.0003221
to be a       1360       0.0001719
the end of    1302       0.0001646
going to be   1273       0.0001609
as well as    1237       0.0001564
out of the    1222       0.0001545
some of the   1187       0.0001500
it was a      1176       0.0001487
be able to    1158       0.0001464
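The tables above can be reproduced in spirit with a few lines of code. The following Python sketch is not the ngram package used in the report. It assumes that the Probability column is an n-gram's count divided by the total number of n-grams of that order, that tokens are separated by whitespace, and that n-grams do not span line boundaries. The name `ngram_table` is hypothetical.

```python
from collections import Counter


def ngram_table(lines, n):
    """Return (ngram, frequency, probability) tuples, most frequent first.

    Assumption: probability = frequency / total number of n-grams.
    """
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        # Slide a window of n tokens over the line; lines shorter than n
        # tokens contribute nothing.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    total = sum(counts.values())
    return [(gram, c, c / total) for gram, c in counts.most_common()]


# Toy corpus for illustration only.
lines = ["one of the best", "one of the few", "end of the day"]
bigrams = ngram_table(lines, 2)
trigrams = ngram_table(lines, 3)
print(bigrams[0])   # most frequent bigram with its count and share
print(trigrams[0])  # most frequent trigram with its count and share
```

Because every trigram contains a bigram that is at least as frequent, the top bigram counts are always at least as large as the top trigram counts in the same corpus.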

4 Next steps

Next week I will focus on the prediction model. Before the model can be trained, an evaluation routine has to be defined.

To measure the quality of my prediction model, I am going to apply a train/test split: only the training subsample is used to fit the model, and its performance is then measured on the held-out test set. In general, I assume that the higher the probability the model assigns to the test set, the better it performs. As a metric, I am going to use perplexity, the inverse probability of the test set normalised by the number of words, since minimizing perplexity is equivalent to maximizing probability.
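As an illustration of this evaluation routine, the Python sketch below fits a bigram model on training lines and computes perplexity on test lines as exp(-(1/N) * sum of log P(word | previous word)), where N is the number of predicted tokens. Add-one smoothing, the `<s>` and `</s>` boundary markers and all function names are assumptions made for this sketch. The report does not specify a smoothing method.

```python
import math
from collections import Counter


def train_bigram(lines):
    """Count bigrams and their contexts on the training lines."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for line in lines:
        tokens = ["<s>"] + line.lower().split() + ["</s>"]
        vocab.update(tokens[1:])  # every token that can be predicted
        for prev, word in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, word)] += 1
    return contexts, bigrams, len(vocab) + 1  # +1 slot for unseen words


def perplexity(lines, contexts, bigrams, vocab_size):
    """Perplexity of the add-one smoothed bigram model on `lines`."""
    log_prob, n = 0.0, 0
    for line in lines:
        tokens = ["<s>"] + line.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            # Assumption: add-one (Laplace) smoothing for unseen bigrams.
            p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)


# Toy train/test split for illustration only.
train = ["the cat sat", "the dog sat"]
model = train_bigram(train)
pp_familiar = perplexity(["the cat sat"], *model)
pp_unfamiliar = perplexity(["sat the dog cat"], *model)
print(pp_familiar < pp_unfamiliar)  # True
```

Smoothing is needed here because a single unseen bigram would otherwise receive probability zero and make the perplexity infinite.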

Going forward, I plan to use a prediction model based on the n-grams extracted from the data.
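A bigram-based predictor of this kind can be sketched in Python as follows: for the last word typed, return the words that most often followed it in the training data. This is only a sketch under simple assumptions (whitespace tokens, no smoothing, no fallback for unknown words), and the names `build_bigram_index` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_bigram_index(lines):
    """Map each word to a Counter of the words that follow it."""
    index = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            index[prev][word] += 1
    return index


def predict_next(index, text, k=3):
    """Return up to k candidate next words for `text`, most likely first."""
    tokens = text.lower().split()
    if not tokens or tokens[-1] not in index:
        return []  # assumption: no fallback for unknown or empty input
    return [word for word, _ in index[tokens[-1]].most_common(k)]


# Toy training data for illustration only.
index = build_bigram_index(
    ["one of the best", "one of the few", "out of the box", "one of a kind"]
)
print(predict_next(index, "This is one of"))  # ['the', 'a']
```

Storing the followers per word keeps a lookup to a single dictionary access, which matters if the model is later queried on every keystroke.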
