The capstone project of the Data Science course focuses on text prediction. Based on an existing corpus of text, I will attempt to construct a model that predicts the most likely next word given the sequence of preceding words typed by the user. The goal of this report is to load, clean and summarize the text corpus that will be used to train the prediction model.
The data are available at the address below. The archive is quite large (over 500 MB zipped) and contains corpora in four languages: English, German, Russian and Finnish. I will focus on the English corpus for the remainder of the project. It consists of three files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. Let’s load them into the system.
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
destfile = "Coursera-Swiftkey.zip")
unzip("Coursera-Swiftkey.zip")
en_blog_lines <- readLines("./final/en_US/en_US.blogs.txt", encoding = "latin1")
en_news_lines <- readLines("./final/en_US/en_US.news.txt", encoding = "latin1")
en_twitter_lines <- readLines("./final/en_US/en_US.twitter.txt", encoding = "latin1", skipNul = TRUE)
length(en_blog_lines)
## [1] 899288
length(en_news_lines)
## [1] 1010242
length(en_twitter_lines)
## [1] 2360148
Read in directly, the raw files contain 899288, 1010242 and 2360148 lines, respectively. The next important step in the exploratory analysis is text cleaning.
Text pre-processing is an important step for downstream modeling: cleaned text is easier to interpret and exposes more consistent patterns. For the purpose of text cleaning I convert the text to lowercase, remove punctuation and numbers, and strip redundant whitespace.
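Below is a minimal sketch of these transformations using the tm package; it works on the blog lines loaded above, and the exact set of transformations used in the report may differ slightly.

library(tm)

# build a corpus from the blog lines (the same approach applies to news and twitter)
blog_corpus <- VCorpus(VectorSource(en_blog_lines))

# lowercase, drop punctuation and numbers, collapse repeated whitespace
blog_corpus <- tm_map(blog_corpus, content_transformer(tolower))
blog_corpus <- tm_map(blog_corpus, removePunctuation)
blog_corpus <- tm_map(blog_corpus, removeNumbers)
blog_corpus <- tm_map(blog_corpus, stripWhitespace)

# back to a plain character vector for the summaries below
cleaned_blog_lines <- sapply(blog_corpus, as.character)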
After the cleaning step I can summarize the distribution of words in the three parts of the corpus - blogs, news and twitter. The figure below shows word counts in descending order of frequency. In all three cases a small fraction of the distinct words accounts for the vast majority of word occurrences.
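The word counts behind the figure can be obtained with a simple helper like the one below; this is a sketch that assumes the cleaned_blog_lines vector from the previous snippet.

# count word occurrences in a character vector of cleaned lines
word_freq <- function(lines) {
  words <- unlist(strsplit(lines, "\\s+"))
  sort(table(words[words != ""]), decreasing = TRUE)
}

blog_word_freq <- word_freq(cleaned_blog_lines)
head(blog_word_freq, 20)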
Looking at the most frequent words, we notice that they are so-called “stopwords”. They may be problematic in the downstream modeling, as they are largely unspecific to the patterns appearing in the text.
| | blogs | num | news | num | twitter | num |
|---|---|---|---|---|---|---|
| 1 | the | 1860773 | the | 1974503 | the | 937948 |
| 2 | and | 1094900 | to | 906158 | to | 788906 |
| 3 | to | 1069553 | a | 893982 | i | 726831 |
| 4 | a | 904252 | and | 889535 | a | 616915 |
| 5 | of | 876893 | of | 774510 | you | 548482 |
| 6 | i | 777464 | in | 679104 | and | 438736 |
| 7 | in | 598741 | for | 353911 | for | 385485 |
| 8 | that | 460822 | that | 346835 | in | 380744 |
| 9 | is | 432858 | is | 284282 | of | 359753 |
| 10 | it | 404217 | on | 269849 | is | 358992 |
| 11 | for | 363965 | with | 254819 | it | 295457 |
| 12 | you | 298816 | said | 250432 | my | 292133 |
| 13 | with | 286781 | was | 228972 | on | 277973 |
| 14 | was | 278355 | he | 228687 | that | 234847 |
| 15 | on | 276447 | it | 219556 | me | 203448 |
| 16 | my | 270932 | at | 214199 | be | 188019 |
| 17 | this | 259183 | as | 188091 | at | 186839 |
| 18 | as | 224211 | i | 159110 | with | 173523 |
| 19 | have | 218949 | his | 157672 | your | 171344 |
| 20 | be | 209134 | be | 152872 | have | 168769 |
Next, I remove the stopwords (using the stopwords() function from the tm library). The composition of the most frequent words changes, and some corpus-specific entries appear - the abbreviation “rt” (retweet) is among the most prevalent tokens in the twitter dataset. The impact of stopword removal should be carefully evaluated to see whether it improves prediction accuracy.
| | blogs | num | news | num | twitter | num |
|---|---|---|---|---|---|---|
| 1 | one | 127345 | said | 250432 | just | 151217 |
| 2 | will | 112848 | will | 108238 | like | 122526 |
| 3 | just | 100814 | one | 88796 | get | 112646 |
| 4 | like | 100457 | year | 76735 | love | 106894 |
| 5 | can | 98407 | new | 70787 | good | 101164 |
| 6 | time | 90972 | two | 63868 | will | 94818 |
| 7 | get | 71101 | can | 58842 | day | 92989 |
| 8 | know | 60503 | also | 58786 | can | 89869 |
| 9 | now | 60408 | first | 57868 | thanks | 89817 |
| 10 | people | 59588 | time | 57067 | rt | 89775 |
| 11 | also | 55378 | just | 53356 | now | 84183 |
| 12 | new | 54856 | last | 52083 | one | 82948 |
| 13 | day | 52413 | years | 51702 | know | 80003 |
| 14 | even | 52186 | like | 50831 | time | 76951 |
| 15 | first | 51644 | state | 50145 | great | 76213 |
| 16 | back | 51317 | people | 47702 | go | 73195 |
| 17 | make | 51216 | get | 43785 | today | 73113 |
| 18 | well | 50846 | three | 39369 | new | 69857 |
| 19 | us | 50468 | city | 37882 | see | 67117 |
| 20 | see | 50222 | now | 36530 | back | 58583 |
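The stopword removal itself can be sketched as follows, reusing cleaned_blog_lines and word_freq from the snippets above; the same treatment applies to the news and twitter lines.

# drop English stopwords from the cleaned lines and re-tidy the whitespace
no_stop_blog_lines <- removeWords(cleaned_blog_lines, stopwords("en"))
no_stop_blog_lines <- stripWhitespace(no_stop_blog_lines)

head(word_freq(no_stop_blog_lines), 20)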
Constructing ngrams from the entire corpus is computationally infeasible. Instead, I take a random sample of lines from each of the three files and build digrams, trigrams and quadgrams from the sampled text, both with and without stopwords, as sketched below.
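Here is a minimal base-R sketch of the sampling and ngram construction; the 5% sampling rate and the simple whitespace tokenizer are assumptions made for illustration, not the exact settings used later in the project.

set.seed(1234)

# sample a fraction of the cleaned lines (5% is an assumed rate)
sampled_lines <- sample(cleaned_blog_lines, round(length(cleaned_blog_lines) * 0.05))

# build ngrams of order n by pasting together consecutive words within each line
make_ngrams <- function(lines, n) {
  unlist(lapply(strsplit(lines, "\\s+"), function(words) {
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    sapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "))
  }))
}

digrams   <- make_ngrams(sampled_lines, 2)
trigrams  <- make_ngrams(sampled_lines, 3)
quadgrams <- make_ngrams(sampled_lines, 4)

head(sort(table(digrams), decreasing = TRUE), 10)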
The figure below summarizes the frequencies of ngrams observed in the sampled datasets with and without stopwords. First of all, similarly to the word frequencies, we see a long tail of low-frequency ngrams: the higher the order of the ngram, the larger the share of ngrams that occur only rarely. A direct comparison between the corpora with and without stopwords also shows that removing the stopwords greatly reduces the number of available ngrams.
Let’s look at the most frequent ngrams in the “with stopwords” corpus, listed in the table below.
| | digrams | freq | trigrams | freq | quadgrams | freq |
|---|---|---|---|---|---|---|
| 1 | of the | 42953 | one of the | 3402 | the end of the | 738 |
| 2 | in the | 41107 | a lot of | 3082 | the rest of the | 680 |
| 3 | to the | 21228 | to be a | 1873 | for the first time | 641 |
| 4 | on the | 19579 | thanks for the | 1835 | at the end of | 637 |
| 5 | for the | 19470 | going to be | 1730 | at the same time | 528 |
| 6 | to be | 16391 | i want to | 1566 | thanks for the follow | 478 |
| 7 | at the | 14017 | out of the | 1490 | is going to be | 471 |
| 8 | and the | 12640 | the end of | 1481 | one of the most | 443 |
| 9 | in a | 12186 | it was a | 1421 | in the middle of | 406 |
| 10 | with the | 10419 | some of the | 1383 | is one of the | 393 |
| 11 | is a | 9943 | as well as | 1372 | to be able to | 389 |
| 12 | it was | 9677 | the u s | 1360 | going to be a | 385 |
| 13 | for a | 9273 | be able to | 1333 | when it comes to | 383 |
| 14 | i have | 8786 | i dont know | 1267 | i dont want to | 349 |
| 15 | from the | 8780 | part of the | 1197 | cant wait to see | 336 |
| 16 | i was | 8697 | i have a | 1194 | thank you for the | 325 |
| 17 | and i | 8376 | i have to | 1177 | if you want to | 318 |
| 18 | it is | 8296 | looking forward to | 1145 | in the u s | 310 |
| 19 | with a | 8292 | the rest of | 1084 | one of the best | 300 |
| 20 | will be | 8073 | the first time | 1064 | in the united states | 280 |
Let’s compare them with the most frequent ngrams in the “without stopwords” corpus, listed in the table below.
| | digrams | freq | trigrams | freq | quadgrams | freq |
|---|---|---|---|---|---|---|
| 1 | right now | 2282 | new york city | 269 | vested interests vested interests | 251 |
| 2 | new york | 1967 | interests vested interests | 251 | interests vested interests vested | 250 |
| 3 | year old | 1966 | vested interests vested | 251 | amazon services amazon eu | 42 |
| 4 | last year | 1854 | let us know | 238 | cake cake cake cake | 42 |
| 5 | last night | 1498 | happy mothers day | 217 | martin luther king jr | 40 |
| 6 | years ago | 1410 | two years ago | 161 | just finished mi run | 37 |
| 7 | high school | 1369 | happy new year | 146 | rock roll hall fame | 33 |
| 8 | first time | 1271 | president barack obama | 137 | new york new jersey | 29 |
| 9 | feel like | 1265 | cinco de mayo | 119 | amp amp amp gt | 28 |
| 10 | last week | 1216 | new york times | 118 | mg cholesterol mg sodium | 28 |
| 11 | make sure | 1067 | world war ii | 118 | happy cinco de mayo | 27 |
| 12 | looking forward | 1060 | will take place | 109 | calories protein carbohydrate fat | 26 |
| 13 | can get | 1052 | looking forward seeing | 93 | protein carbohydrate fat saturated | 26 |
| 14 | looks like | 927 | gov chris christie | 92 | cholesterol mg sodium fiber | 25 |
| 15 | even though | 921 | first time since | 86 | let us know think | 25 |
| 16 | new jersey | 842 | year old son | 82 | amp amp amp amp | 24 |
| 17 | just got | 802 | year old daughter | 81 | carbohydrate fat saturated mg | 23 |
| 18 | one day | 779 | four years ago | 76 | fat saturated mg cholesterol | 23 |
| 19 | next week | 773 | three years ago | 76 | get real rewards just | 23 |
| 20 | two years | 768 | new years eve | 75 | new york stock exchange | 23 |
As we can see, the ngrams in the second table seem more specific.
The available text corpus is large and needs to be filtered carefully before modeling. Given the long tails of low-frequency items, it may be useful to trim the number of collected words and ngrams by dropping terms that occur only a handful of times. An initial comparison of object.size() on the sample ngram corpus suggests that this reduction noticeably shrinks the objects kept in memory, which will help keep the downstream analysis lightweight. Introducing higher-order ngrams may still be necessary, but it will increase the size of the entire corpus.
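As an illustration of the effect, the snippet below trims the digrams built earlier using an assumed cut-off of five occurrences and compares the object sizes.

digram_freq <- sort(table(digrams), decreasing = TRUE)

# keep only digrams that occur at least 5 times (the cut-off is an assumption)
digram_trimmed <- digram_freq[digram_freq >= 5]

format(object.size(digram_freq), units = "MB")
format(object.size(digram_trimmed), units = "MB")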