Summary:

This document is the week 2 milestone report for the Coursera Data Science Capstone. It outlines the initial roadmap for building a Shiny app that predicts the next word, using Natural Language Processing as the framework.


1. Data reading

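The data for model building consist of three large files written in English:

  • en_US.blogs.txt: text extracted from web blogs
  • en_US.news.txt: text extracted from news articles
  • en_US.twitter.txt: collection of random tweets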

The data were read with the read_lines function of the readr package, because the en_US.news.txt file contains some problematic end-of-line characters.

1.1 Data summary

The main features of the three data sources, before any cleaning, are the following.

File Size_MB Lines Words
en_US.blogs.txt 200.4 899288 37546239
en_US.news.txt 196.3 1010242 34762395
en_US.twitter.txt 159.4 2360148 30093372
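
A summary row of this kind can be computed along the following lines. This is a minimal sketch, not the report's own code: `summarise_text_file` is a hypothetical helper, a word is assumed to be any whitespace-separated token, 1 MB is assumed to mean 1024^2 bytes, and a binary-mode `readLines` stands in for `readr::read_lines` so that the example needs no extra packages.

```r
# Hypothetical helper: one summary row (size, lines, words) for a text file.
summarise_text_file <- function(path) {
  # Reading through a binary connection is a common base-R workaround for
  # files with unusual control or end-of-line characters; the report itself
  # used readr::read_lines(path).
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

  # Assumption: a "word" is any run of non-whitespace characters.
  words_per_line <- lengths(strsplit(trimws(lines), "[[:space:]]+"))

  data.frame(
    File    = basename(path),
    Size_MB = file.size(path) / 1024^2,  # assumption: 1 MB = 1024^2 bytes
    Lines   = length(lines),
    Words   = sum(words_per_line),
    stringsAsFactors = FALSE
  )
}

# Usage on a small temporary file (the real data files are not bundled here).
demo_path <- tempfile(fileext = ".txt")
writeLines(c("The quick brown fox", "jumps over", "the lazy dog"), demo_path)
demo_summary <- summarise_text_file(demo_path)
print(demo_summary)
```

Applying the helper to each of the three files and binding the rows with `rbind` would give a table with the same columns as the one above, although the word counts depend on the tokenization rule chosen.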

1.2 Data selection for exploratory analysis

Due to the large amount of data, some decisions were made to select the dataset for the exploratory analysis and to move forward with model construction:

  • The tweets dataset will be excluded, because it contains many uncommon words used as #hashtags and @usernames. In addition, the language of tweets is often too informal to be a good basis for selecting correctly spelled words.

  • Some references in the literature state that it is not necessary to include every record of the available data in model construction. For that reason, a dataset of 200,000 records will be used for the initial exploratory analysis, with half of the records coming from the blogs dataset and half from the news dataset.
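
The balanced selection described above could be drawn as follows. This is a minimal sketch under stated assumptions: `sample_corpus` is a hypothetical helper, simple random sampling without replacement is assumed (the report does not say how the records were chosen), and the seed value is arbitrary.

```r
# Hypothetical helper: draw an equal-sized random sample from each source.
sample_corpus <- function(blogs, news, n_per_source, seed = 2019) {
  stopifnot(n_per_source <= length(blogs), n_per_source <= length(news))
  set.seed(seed)  # fixed seed so the sample is reproducible
  c(sample(blogs, n_per_source), sample(news, n_per_source))
}

# Toy stand-ins for the character vectors of blog and news lines.
blogs_demo <- paste("blog line", 1:50)
news_demo  <- paste("news line", 1:80)

corpus_demo <- sample_corpus(blogs_demo, news_demo, n_per_source = 10)
length(corpus_demo)
```

With the real data, `n_per_source = 100000` would give the 200,000-record dataset used in the rest of the report.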


2. Data cleaning

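The dataset was cleaned with the stringr and tm packages, applying the following steps to the data:

  • URLs were removed
  • The symbol "&" was replaced by "and"
  • The symbol "-" was replaced by a space (" "), to separate compound words
  • Uppercase letters were converted to lowercase
  • Numbers were removed
  • Punctuation symbols were removed
  • Remaining non-character symbols were removed
  • Extra white spaces were removed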
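
A cleaning pipeline of this kind (URL removal, "&" to "and", hyphens to spaces, lowercasing, then stripping digits, punctuation, other non-letter symbols and extra spaces) can be sketched with base R regular expressions. This is an illustration only: the report used stringr and tm, `clean_text` is a hypothetical helper, and the exact patterns, in particular the URL pattern and the decision to keep only the letters a to z, are assumptions.

```r
# Hypothetical helper: apply the cleaning steps, in order, to a character vector.
clean_text <- function(x) {
  x <- gsub("(https?://|www\\.)[^[:space:]]+", " ", x)  # remove URLs (assumed pattern)
  x <- gsub("&", " and ", x, fixed = TRUE)              # "&" becomes "and"
  x <- gsub("-", " ", x, fixed = TRUE)                  # split compound words
  x <- tolower(x)                                       # lowercase everything
  x <- gsub("[[:digit:]]+", "", x)                      # remove numbers
  x <- gsub("[[:punct:]]+", "", x)                      # remove punctuation
  x <- gsub("[^a-z ]+", " ", x)                         # assumption: keep only a-z and spaces
  trimws(gsub(" +", " ", x))                            # collapse extra white space
}

cleaned_demo <- clean_text("Visit http://example.com NOW & buy 2 well-known items!!")
print(cleaned_demo)
```

The order of the steps matters: hyphens are replaced before punctuation is stripped, so that compound words are separated rather than fused.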


3. Exploratory analysis

The following table compares the summaries of the original and the cleaned corpus. The cleaning process reduced the number of unique words to 61% of the original count (135,112 of 221,535).

Corpora Size_MB Lines Words Unique_Words
Original 53.7 200000 7590460 221535
Cleaned 51.8 200000 7498601 135112
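
The 61% figure follows directly from the Unique_Words column. A short check, using the two values from the table (the variable names are illustrative):

```r
# Unique word counts taken from the table above.
unique_original <- 221535
unique_cleaned  <- 135112

# Share of the original vocabulary that remains after cleaning, in percent.
ratio_pct <- round(100 * unique_cleaned / unique_original, 1)
paste0(ratio_pct, "%")
```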

3.1 Most frequent words

The frequency of each word was calculated for both the original and the cleaned corpus. The ten most frequent words in the latter, which together cover 22.1% of all word occurrences, are the following.

Word n % %cum
the 401065 5.3 5.3
and 210716 2.8 8.2
to 208266 2.8 10.9
a 186774 2.5 13.4
of 172883 2.3 15.7
in 133200 1.8 17.5
i 108169 1.4 18.9
that 86784 1.2 20.1
is 75937 1.0 21.1
for 75584 1.0 22.1
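
A frequency table with the same three columns (count, percentage and cumulative percentage) can be built as follows. This is a minimal sketch, assuming the text has already been cleaned so that words are separated by single spaces; `word_frequencies` and its column names are hypothetical.

```r
# Hypothetical helper: word counts with percentage and cumulative percentage.
word_frequencies <- function(lines) {
  tokens <- unlist(strsplit(lines, " ", fixed = TRUE))
  tokens <- tokens[nzchar(tokens)]                  # drop empty tokens
  counts <- sort(table(tokens), decreasing = TRUE)  # most frequent first
  pct <- 100 * as.numeric(counts) / length(tokens)
  data.frame(
    word    = names(counts),
    n       = as.integer(counts),
    pct     = pct,
    pct_cum = cumsum(pct),
    stringsAsFactors = FALSE
  )
}

freq_demo <- word_frequencies(c("the cat and the dog", "the end of the day"))
head(freq_demo, 3)
```

In the toy input, "the" accounts for 4 of the 10 tokens, so its row shows 40% in both percentage columns.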

3.2 Wordcloud plot

This figure shows the most frequent words of the cleaned corpus, with frequency encoded by size and color.

3.3 Coverage comparison

The next figure plots coverage (%) against the number of words, for both the original and the cleaned corpus. The plot confirms what the literature suggests: good coverage can be achieved without including every word in the corpus.


3.4 Coverage summary

The following table compares the coverage achieved by the most frequent words in the original and the cleaned corpus. For the same number of words, coverage is always higher for the cleaned corpus. No value is reported for 200,000 words in the cleaned corpus because it contains only 135,112 unique words.

From this analysis, it can be concluded that a vocabulary of the 10,000 most common words from the cleaned dataset, which covers 92.2% of all word occurrences, is a sound basis for building the prediction model.

Unique_Words Original_Coverage_% Cleaned_Coverage_%
200 50.7 54.0
1000 66.9 70.4
3500 80.0 83.7
10000 88.9 92.2
50000 96.8 98.5
100000 98.4 99.5
200000 99.7 NA
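
Coverage here means the share of all word occurrences accounted for by the k most frequent words. A minimal sketch of that calculation, assuming a vector of per-word counts is available (`coverage_pct` and `words_needed` are hypothetical helpers):

```r
# Hypothetical helper: coverage (%) of the k most frequent words.
# Returns NA when k exceeds the number of unique words.
coverage_pct <- function(counts, k) {
  cum_pct <- 100 * cumsum(sort(counts, decreasing = TRUE)) / sum(counts)
  unname(cum_pct[k])
}

# Hypothetical helper: smallest number of words reaching a target coverage (%).
words_needed <- function(counts, target_pct) {
  cum_pct <- 100 * cumsum(sort(counts, decreasing = TRUE)) / sum(counts)
  unname(which(cum_pct >= target_pct)[1])
}

# Toy counts standing in for the word frequencies of a corpus.
counts_demo <- c(the = 50, and = 30, cat = 15, dog = 5)

coverage_demo <- coverage_pct(counts_demo, c(1, 2, 4, 10))
needed_demo   <- words_needed(counts_demo, 90)
print(coverage_demo)
print(needed_demo)
```

Asking for more words than the vocabulary contains yields `NA`, which matches the missing value in the last row of the table.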

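4. Next steps

  • Construct 2-, 3- and 4-grams from the cleaned corpus
  • Select the most common N-grams and record the frequency of each one
  • Build a simple selection algorithm to predict the next word and evaluate its performance
  • Start with a mock-up of the Shiny app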
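
One possible starting point for the N-gram work is sketched below: it builds bigrams from tokenized lines and predicts the next word as the most frequent follower of the previous one. This is an assumption about how the planned algorithm might begin, not the method the report commits to; `make_ngrams` and `predict_next` are hypothetical helpers.

```r
# Hypothetical helper: all n-grams of one tokenized line, as space-joined strings.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: most frequent word following `prev_word` in bigram counts.
predict_next <- function(bigram_counts, prev_word) {
  parts  <- strsplit(names(bigram_counts), " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits <- first == prev_word
  if (!any(hits)) return(NA_character_)  # unseen word: no prediction
  second[hits][which.max(bigram_counts[hits])]
}

# Toy corpus, assumed to be already cleaned.
lines_demo    <- c("i like green tea", "i like black coffee", "you like green tea")
tokens_demo   <- strsplit(lines_demo, " ", fixed = TRUE)
bigram_counts <- table(unlist(lapply(tokens_demo, make_ngrams, n = 2)))

predict_next(bigram_counts, "like")
```

The same `make_ngrams` helper covers 3- and 4-grams by changing `n`; a fuller model would also need a rule for falling back to shorter N-grams when the longer context has not been seen.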

