Summary:
This document is the milestone report for week 2 of the Coursera Data Science Capstone. It outlines the initial roadmap for building a Shiny app that predicts the next word, using Natural Language Processing as a framework.
1. Data reading
The data for model building consist of three large files/datasets written in English:
- en_US.blogs.txt: text extracted from web blogs
- en_US.news.txt: text extracted from news articles
- en_US.twitter.txt: collection of random tweets
The read_lines function of the readr package was used for reading the data, because the en_US.news.txt file has some tricky end-of-line characters.
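A minimal sketch of the reading step follows. To stay runnable on its own, it writes a small temporary file as a stand-in for en_US.news.txt. The helper name read_corpus is illustrative. The base R fallback through a binary-mode connection is an assumption for machines without readr, not part of the original report.

```r
# Toy stand-in for en_US.news.txt so the sketch runs anywhere.
path <- tempfile(fileext = ".txt")
writeLines(c("First news line.", "Second news line.", "Third news line."), path)

# read_corpus is a hypothetical helper name, not taken from the report.
read_corpus <- function(path) {
  if (requireNamespace("readr", quietly = TRUE)) {
    # readr::read_lines returns one character string per line of the file
    readr::read_lines(path)
  } else {
    # Assumed fallback: read through a binary-mode connection so that
    # unusual control characters are not treated as the end of the file
    con <- file(path, open = "rb")
    on.exit(close(con))
    readLines(con, encoding = "UTF-8", skipNul = TRUE)
  }
}

news <- read_corpus(path)
length(news)  # number of lines read
```

The same call would be repeated for the blogs and twitter files, with the path pointing at wherever the archive was unzipped.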
1.1 Data summary
The main features of the three data sources, before any cleaning, are the following.
| File | Size (MB) | Lines | Words |
| --- | --- | --- | --- |
| en_US.blogs.txt | 200.4 | 899288 | 37546239 |
| en_US.news.txt | 196.3 | 1010242 | 34762395 |
| en_US.twitter.txt | 159.4 | 2360148 | 30093372 |
1.2 Data selection for exploratory analysis
Due to the large amount of data, some decisions were made to select the dataset for exploratory analysis and to move forward with the model construction:
- The tweets dataset is going to be excluded, because it contains many uncommon words used for #hashtags and @usernames. Also, the language used in tweets is often too informal to be a good base for selecting correctly spelled words.
- Some sources in the literature state that it is not necessary to include every record of the available data in the model construction. For that reason, a dataset of 200,000 records is going to be used for the initial exploratory analysis, with half of the records coming from the blogs dataset and half from the news dataset.
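The selection described above can be sketched as below. The report does not say how the records were drawn, so simple random sampling without replacement is an assumption. The helper sample_lines, the seed and the toy vectors are illustrative only.

```r
set.seed(2019)  # arbitrary seed, only to make the sketch reproducible

# Toy stand-ins for the blogs and news character vectors
blogs <- paste("blog line", seq_len(1000))
news  <- paste("news line", seq_len(1500))

# sample_lines is a hypothetical helper: draw n lines without replacement
sample_lines <- function(lines, n) {
  lines[sample.int(length(lines), size = min(n, length(lines)))]
}

# The report takes 100,000 records per source (200,000 in total);
# 100 per source keeps the toy example small
n_per_source <- 100
corpus <- c(sample_lines(blogs, n_per_source), sample_lines(news, n_per_source))
length(corpus)
```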
2. Data cleaning
For cleaning the dataset, the stringr and tm packages were used, applying the following process to the data:
- URLs were removed
- The “&” symbol was replaced by “and”
- The “-” symbol was replaced by a space (" "), to separate compound words
- Uppercase letters were transformed to lowercase
- Numbers were removed
- Punctuation symbols were removed
- Remaining non-character symbols were removed
- Extra white spaces were removed
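The steps above can be sketched in base R as a single function. This is an approximation of the process, not the original code: the report used stringr and tm, and the regular expressions here (in particular the URL pattern and the reading of "non-character symbols" as anything other than a-z) are assumptions.

```r
# clean_text is a hypothetical helper mirroring the listed steps in order.
clean_text <- function(x) {
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # remove URLs
  x <- gsub("&", " and ", x, fixed = TRUE)                   # "&" -> "and"
  x <- gsub("-", " ", x, fixed = TRUE)                       # split compound words
  x <- tolower(x)                                            # lowercase
  x <- gsub("[0-9]+", "", x, perl = TRUE)                    # remove numbers
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)              # remove punctuation
  x <- gsub("[^a-z[:space:]]+", "", x, perl = TRUE)          # remove other symbols
  x <- gsub("\\s+", " ", x, perl = TRUE)                     # collapse whitespace
  trimws(x)
}

example <- clean_text("Visit https://example.com: Rock & Roll is a well-known genre since 1955!")
example
# "visit rock and roll is a well known genre since"
```

Order matters in this sketch: URLs are removed before punctuation, since stripping the dots and slashes first would leave URL fragments behind as words.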
3. Exploratory analysis
The following table compares the summaries of the original and the cleaned corpus. The cleaning process reduced the number of unique words to 61% of the original count.
| Corpus | Size (MB) | Lines | Words | Unique words |
| --- | --- | --- | --- | --- |
| Original | 53.7 | 200000 | 7590460 | 221535 |
| Cleaned | 51.8 | 200000 | 7498601 | 135112 |
3.1 Most frequent words
An interesting exercise was calculating the frequency of each word for both the original and the cleaned corpus. The ten most used words in the latter, which together cover 22.1% of all word occurrences, are the following.
| Word | Frequency | % | Cumulative % |
| --- | --- | --- | --- |
| the | 401065 | 5.3 | 5.3 |
| and | 210716 | 2.8 | 8.2 |
| to | 208266 | 2.8 | 10.9 |
| a | 186774 | 2.5 | 13.4 |
| of | 172883 | 2.3 | 15.7 |
| in | 133200 | 1.8 | 17.5 |
| i | 108169 | 1.4 | 18.9 |
| that | 86784 | 1.2 | 20.1 |
| is | 75937 | 1.0 | 21.1 |
| for | 75584 | 1.0 | 22.1 |
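A frequency table of this shape can be built with base R as sketched here. The helper word_freq and its column names are illustrative, and the input is assumed to be already cleaned text with words separated by whitespace.

```r
# word_freq is a hypothetical helper: word counts, share and cumulative share.
word_freq <- function(lines) {
  words <- unlist(strsplit(lines, "\\s+"))
  words <- words[nzchar(words)]                    # drop empty tokens
  counts <- sort(table(words), decreasing = TRUE)  # most frequent first
  pct <- 100 * as.integer(counts) / length(words)
  data.frame(
    Word = names(counts),
    Freq = as.integer(counts),
    Pct = pct,
    CumPct = cumsum(pct),
    stringsAsFactors = FALSE
  )
}

# Toy corpus of two cleaned lines
freq <- word_freq(c("the cat and the dog", "the end"))
head(freq, 3)
```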
3.2 Wordcloud plot
This figure shows the most frequent words of the cleaned corpus, with size and color indicating frequency.

3.3 Coverage comparison
The next figure plots coverage (%) against the number of words, for both the original and the cleaned corpus. The plot confirms what the literature suggests: good coverage can be achieved without including every word in the corpus.

3.4 Coverage summary
The following table compares the coverage of the most frequent words for the original and the cleaned corpus. It shows that, for the same number of words, the coverage is always higher for the cleaned corpus.
From this analysis, it can be concluded that the 10,000 most common words from the cleaned corpus, which cover 92.2% of all word occurrences, are a solid base for building the prediction model.
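| Number of words | Coverage % (original) | Coverage % (cleaned) |
| --- | --- | --- |
| 200 | 50.7 | 54.0 |
| 1000 | 66.9 | 70.4 |
| 3500 | 80.0 | 83.7 |
| 10000 | 88.9 | 92.2 |
| 50000 | 96.8 | 98.5 |
| 100000 | 98.4 | 99.5 |
| 200000 | 99.7 | NA |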
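Coverage here means the share of all word occurrences accounted for by the n most frequent words. A small sketch of that calculation follows, with hypothetical helper names. The first check reuses the top-ten frequencies and the cleaned word count reported earlier in this document.

```r
# coverage_pct: percentage of all word occurrences covered by the top n words.
coverage_pct <- function(freq, n, total = sum(freq)) {
  freq <- sort(freq, decreasing = TRUE)
  100 * sum(head(freq, n)) / total
}

# words_for_coverage: smallest number of top words reaching a target coverage.
words_for_coverage <- function(freq, target_pct) {
  freq <- sort(freq, decreasing = TRUE)
  which(100 * cumsum(freq) / sum(freq) >= target_pct)[1]
}

# Top-ten frequencies and total word count of the cleaned corpus, as reported
top10 <- c(401065, 210716, 208266, 186774, 172883,
           133200, 108169, 86784, 75937, 75584)
total_words <- 7498601
round(coverage_pct(top10, 10, total_words), 1)  # 22.1

# Toy frequency vector: three words are enough to reach 90% coverage
words_for_coverage(c(50, 30, 10, 5, 5), 90)     # 3
```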
4. Next steps
- Construct 2-, 3- and 4-grams from the cleaned corpus
- Select the most common n-grams, along with their frequencies
- Build a simple selection algorithm to predict the next word and evaluate its performance
- Start a mock-up of the Shiny app
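As a rough illustration of where these steps could lead, the sketch below builds bigrams from a toy corpus and predicts the next word as the most frequent follower. It is one possible starting point under simple assumptions (bigrams only, no smoothing or back-off), not the planned implementation. All function names are hypothetical.

```r
# make_ngrams: all n-grams of one cleaned line (words separated by spaces).
make_ngrams <- function(line, n) {
  w <- strsplit(line, " ", fixed = TRUE)[[1]]
  w <- w[nzchar(w)]
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy cleaned corpus
corpus <- c("the cat sat on the mat", "the cat ate the fish", "a dog sat on the rug")
bigrams <- unlist(lapply(corpus, make_ngrams, n = 2))

# Named integer vector of bigram counts
tab <- table(bigrams)
bigram_counts <- setNames(as.integer(tab), names(tab))

# predict_next: last word of the most frequent bigram starting with `word`.
predict_next <- function(word, counts) {
  hits <- counts[startsWith(names(counts), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^.* ", "", names(hits)[which.max(hits)])
}

predict_next("the", bigram_counts)    # "cat"
predict_next("zebra", bigram_counts)  # NA: word never seen as a bigram start
```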