Executive summary
This report is part of the Coursera Data Science Capstone project. Given a large corpus of text documents in the form of blog entries, tweets, and news articles, the goal of this week's task is to perform a first exploratory analysis and to describe the further approach for building a next-word prediction model.
Given the considerable size of the whole corpus, I analyse a subsample covering approximately 10% of the total English texts. As expected, I find that the distribution of word frequencies in the data set is heavily skewed. However, since the goal of this project is a next-word prediction model, I decided to preprocess the corpus without stopword removal: stopwords are among the words a user is most likely to type next, so the model must be able to predict them. Moreover, I analysed the probability distributions of bigrams and trigrams within the corpus. Not surprisingly, the bigram model assigns higher probabilities to its most frequent n-grams than the trigram model does, so I plan to use the former for the word prediction model.
Exploratory data analysis
In my analysis, I focused on the English texts of the SwiftKey dataset, which were provided as three text files, each containing one contribution per line (blog entries, news articles, and tweets).
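A minimal sketch of this loading and subsampling step (the file names are the standard SwiftKey ones; the exact sampling code used for this report may differ):

```r
# Read the three SwiftKey files and keep a random ~10% of the
# lines of each source, as described in the summary above.
set.seed(42)  # arbitrary seed, for reproducibility only
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
sample_lines <- unlist(lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = round(0.1 * length(lines)))
}))
```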
In total, there are 2,545 words that occur in at least 0.1% of the documents. The following graph visualizes the number of occurrences of terms that occur more than 15,000 times in the corpus.

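The document-frequency filter behind these counts can be sketched with the tm package; the sparsity value of 0.999 is an assumption derived from the 0.1% threshold above:

```r
library(tm)

# Build a corpus and term-document matrix from the sampled lines,
# deliberately keeping stopwords (see the summary above).
corpus <- VCorpus(VectorSource(sample_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
tdm <- TermDocumentMatrix(corpus)

# Drop terms appearing in fewer than 0.1% of the documents and
# sort the remaining term frequencies in decreasing order.
tdm  <- removeSparseTerms(tdm, sparse = 0.999)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
length(freq)  # number of retained terms (2545 in this analysis)
```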
The following table shows the 10 least frequent terms among those retained by this filter:
| Term | Frequency |
|------------|-----------|
| lesson | 286 |
| raising | 285 |
| pointed | 285 |
| rick | 285 |
| supporting | 284 |
| numerous | 284 |
| determine | 283 |
| declined | 283 |
| regardless | 282 |
| upcoming | 280 |
The following word cloud displays the most common words in the corpus:
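A sketch of the call that produces such a cloud with the wordcloud package, using the frequency vector from above (the cutoff of 2,500 occurrences matches this report's setup):

```r
library(wordcloud)
library(RColorBrewer)

# Plot all terms occurring at least 2,500 times,
# using a qualitative color palette.
wordcloud(names(freq), freq, min.freq = 2500,
          colors = brewer.pal(6, "Dark2"))
```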

Analysis of N-grams
For a first evaluation of n-grams, I used the R package ngram because of its good performance on large data sets. Not surprisingly, I find that the occurrence probabilities of the most frequent bigrams are higher than those of the most frequent trigrams. For this reason, I plan to base my prediction algorithm on bigrams.
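The extraction itself can be sketched as follows; concatenate() and get.phrasetable() are functions of the ngram package, while sample_lines refers to the subsample described above:

```r
library(ngram)

# The ngram package expects a single string rather than a vector.
text <- concatenate(sample_lines)

# Extract bigrams and trigrams; get.phrasetable() returns the
# columns ngrams, freq and prop, where prop is the relative
# frequency reported as 'Probability' in the tables below.
bg <- ngram(text, n = 2)
tg <- ngram(text, n = 3)
bg_phrases <- get.phrasetable(bg)
tg_phrases <- get.phrasetable(tg)
head(bg_phrases, n = 10)
```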
Bigrams
| Bigram | Frequency | Probability |
|----------|-----------|-------------|
| of the | 37138 | 0.0046946 |
| in the | 34211 | 0.0043246 |
| to the | 18060 | 0.0022830 |
| on the | 15893 | 0.0020090 |
| for the | 14849 | 0.0018771 |
| to be | 13005 | 0.0016440 |
| at the | 11412 | 0.0014426 |
| and the | 11127 | 0.0014066 |
| in a | 10014 | 0.0012659 |
| with the | 8895 | 0.0011244 |
Trigrams
| Trigram | Frequency | Probability |
|-------------|-----------|-------------|
| one of the | 2909 | 0.0003677 |
| a lot of | 2548 | 0.0003221 |
| to be a | 1360 | 0.0001719 |
| the end of | 1302 | 0.0001646 |
| going to be | 1273 | 0.0001609 |
| as well as | 1237 | 0.0001564 |
| out of the | 1222 | 0.0001545 |
| some of the | 1187 | 0.0001500 |
| it was a | 1176 | 0.0001487 |
| be able to | 1158 | 0.0001464 |
Next steps
Next week I will focus on the prediction model. Before the model can be trained, an evaluation routine has to be defined.
To measure the quality of my prediction model, I am going to apply a train/test split: only the training subsample will be used to fit the prediction model, and performance will then be measured on the held-out test set. In general, the higher the probability my model assigns to the test set, the better it performs. As a metric, I am going to use perplexity, since minimizing perplexity is equivalent to maximizing probability.
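For a test set $W = w_1 w_2 \ldots w_N$, perplexity is the inverse probability of the test set, normalized by the number of words; for a bigram model this reads

$$
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}},
$$

which makes the equivalence explicit: since $PP(W)$ decreases monotonically as $P(W)$ grows, minimizing perplexity maximizes the probability assigned to the test set.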
Going forward, I plan to use a prediction model based on the n-grams generated from the data.
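As a first illustration of how such a model could serve predictions, here is a hypothetical lookup on the bigram phrase table from above (the helper name predict_next is mine, not part of any package):

```r
# Hypothetical sketch: given the last typed word, return the most
# probable continuation according to the bigram phrase table.
predict_next <- function(word, phrases) {
  # split each bigram string "w1 w2 " into its two words
  parts  <- strsplit(trimws(phrases$ngrams), " ", fixed = TRUE)
  firsts <- vapply(parts, `[`, character(1), 1)
  candidates <- phrases[firsts == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)  # unseen word
  best <- candidates[which.max(candidates$prop), ]
  strsplit(trimws(best$ngrams), " ", fixed = TRUE)[[1]][2]
}

predict_next("of", bg_phrases)  # expected: "the"
```

In practice this lookup would need a backoff strategy (e.g. falling back to unigram frequencies) for words that never start an observed bigram.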