Introduction

The capstone project of the Data Science course focuses on text prediction. Based on an existing corpus of text, I’ll attempt to construct a model that predicts the most likely next word from the sequence of preceding words written by the user. The goal of this report is to load, clean and summarize the text corpus that will be used to train the prediction model.

Data loading

The data files are available at the address used in the download code below. The zip file is quite large (over 500MB) and contains corpora in four different languages: English, German, Russian and Finnish. I will focus on the English text corpus for the remainder of the project. The selected corpus consists of three files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. Let’s load them into the system.

# Download and unpack the corpus; mode = "wb" keeps the zip intact on Windows
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
              destfile = "Coursera-Swiftkey.zip", mode = "wb")
unzip("Coursera-Swiftkey.zip")
# Read the three English files; skipNul = TRUE skips embedded nul characters in the twitter file
en_blog_lines <- readLines("./final/en_US/en_US.blogs.txt", encoding = "latin1")
en_news_lines <- readLines("./final/en_US/en_US.news.txt", encoding = "latin1")
en_twitter_lines <- readLines("./final/en_US/en_US.twitter.txt", encoding = "latin1", skipNul = TRUE)
length(en_blog_lines)
## [1] 899288
length(en_news_lines)
## [1] 1010242
length(en_twitter_lines)
## [1] 2360148

The raw files read in directly have 899288, 1010242 and 2360148 lines, respectively. The next important step in the exploratory analysis is text cleaning.

Cleaning

Text pre-processing is important for downstream modeling: cleaned text is easier to interpret and exposes more consistent patterns. The steps I’ve implemented for text cleaning are as follows (a code sketch follows the list):

  1. Turn all letters to lowercase
  2. Map special characters - in this step I replace custom-encoded characters with their uniform version. The mapping file is available in the GitHub repository with the capstone code.
  3. Clean the unmapped special characters - in latin1 encoding these are represented by sequences of hexadecimal byte codes in angle brackets (e.g. <c3><a2><e2><82><ac><c2><a6>)
  4. [optional] Remove stopwords - commonly used words, like “the”, “an”, “of” etc.
  5. Remove obscene words - the list of these words comes from three different sources: here, here and here.
  6. Clean punctuation and numbers - first I remove apostrophes, so words like “i’ve” become concatenated into “ive”; then I remove all punctuation and all numbers.
  7. [optional] Remove skipwords - in this step I remove a manually curated list of artefacts that may be left over from the previous steps
  8. Clean spaces - I compress multiple spaces into one, and remove leading or trailing spaces.
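
A minimal sketch of this pipeline in R, assuming special_char_map (a data frame with pattern and replacement columns), profanity_words and skip_words stand in for the lists kept in the repository; the actual implementation may differ:

# Sketch of the cleaning pipeline; the word lists are placeholders, not the exact ones used
clean_text <- function(lines, remove_stopwords = FALSE) {
  x <- tolower(lines)                                      # 1. lowercase
  for (i in seq_along(special_char_map$pattern)) {         # 2. map custom-encoded characters
    x <- gsub(special_char_map$pattern[i], special_char_map$replacement[i], x, fixed = TRUE)
  }
  x <- gsub("<[0-9a-f]{2}>", "", x)                        # 3. drop unmapped <xx> byte codes
  if (remove_stopwords) {
    x <- tm::removeWords(x, tm::stopwords("en"))           # 4. optional stopword removal
  }
  x <- tm::removeWords(x, profanity_words)                 # 5. obscene words
  x <- gsub("'", "", x)                                    # 6. apostrophes first ("i've" -> "ive") ...
  x <- gsub("[[:punct:][:digit:]]+", " ", x)               #    ... then punctuation and numbers
  x <- tm::removeWords(x, skip_words)                      # 7. optional skipwords
  x <- gsub("\\s+", " ", x)                                # 8. compress spaces
  trimws(x)                                                #    and trim leading/trailing ones
}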

Word count exploration

After the cleaning step I can summarize the distribution of words in the three parts of the corpus - blogs, news and twitter. The figure below shows word counts ordered by decreasing frequency. In all three cases a small fraction of the unique words accounts for the vast majority of all word occurrences.

When we look at the most frequent words, we notice that they are so-called “stopwords”. They may be problematic in downstream modeling, as they are largely unspecific to any patterns appearing in the text.
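
For reference, a frequency table like the one below can be computed with base R; clean_lines is a placeholder for the cleaned lines of one corpus:

# Count word occurrences in a cleaned character vector (one line per element)
words <- unlist(strsplit(clean_lines, " ", fixed = TRUE))
words <- words[words != ""]
word_freq <- sort(table(words), decreasing = TRUE)
head(word_freq, 20)   # the twenty most frequent words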

rank blogs count news count twitter count
1 the 1860773 the 1974503 the 937948
2 and 1094900 to 906158 to 788906
3 to 1069553 a 893982 i 726831
4 a 904252 and 889535 a 616915
5 of 876893 of 774510 you 548482
6 i 777464 in 679104 and 438736
7 in 598741 for 353911 for 385485
8 that 460822 that 346835 in 380744
9 is 432858 is 284282 of 359753
10 it 404217 on 269849 is 358992
11 for 363965 with 254819 it 295457
12 you 298816 said 250432 my 292133
13 with 286781 was 228972 on 277973
14 was 278355 he 228687 that 234847
15 on 276447 it 219556 me 203448
16 my 270932 at 214199 be 188019
17 this 259183 as 188091 at 186839
18 as 224211 i 159110 with 173523
19 have 218949 his 157672 your 171344
20 be 209134 be 152872 have 168769

After removing the stopwords (using the stopwords() function from the tm package), the composition of the most frequent words changes, and some corpus-specific entries appear: the abbreviation “rt” (retweet) is among the most prevalent keywords in the twitter dataset. The impact of stopword removal should be carefully evaluated to see whether it improves prediction accuracy.
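
One way to arrive at the table below without re-cleaning the text is to filter the stopwords out of the frequency table computed above; this is a sketch, not necessarily how the report tables were produced:

library(tm)
# Drop English stopwords from the word-frequency table built earlier
word_freq_nostop <- word_freq[!names(word_freq) %in% stopwords("en")]
head(word_freq_nostop, 20)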

rank blogs count news count twitter count
1 one 127345 said 250432 just 151217
2 will 112848 will 108238 like 122526
3 just 100814 one 88796 get 112646
4 like 100457 year 76735 love 106894
5 can 98407 new 70787 good 101164
6 time 90972 two 63868 will 94818
7 get 71101 can 58842 day 92989
8 know 60503 also 58786 can 89869
9 now 60408 first 57868 thanks 89817
10 people 59588 time 57067 rt 89775
11 also 55378 just 53356 now 84183
12 new 54856 last 52083 one 82948
13 day 52413 years 51702 know 80003
14 even 52186 like 50831 time 76951
15 first 51644 state 50145 great 76213
16 back 51317 people 47702 go 73195
17 make 51216 get 43785 today 73113
18 well 50846 three 39369 new 69857
19 us 50468 city 37882 see 67117
20 see 50222 now 36530 back 58583

Exploration of ngram data

Constructing ngrams from the entire corpus is infeasible. To construct the ngrams I will (a code sketch follows the list):

  1. filter out lines that have fewer than 5 words
  2. sample 10% of the lines of each corpus
  3. construct and compare the sets of digrams, trigrams and quadgrams for the sample set.
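
A minimal base-R sketch of these three steps, assuming clean_lines holds the cleaned lines of one corpus; the actual implementation may differ:

# 1.-2. keep lines with at least 5 words and sample 10% of them
sample_lines <- function(lines) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  tokens <- tokens[lengths(tokens) >= 5]
  sample(tokens, round(0.1 * length(tokens)))
}
# 3. build a frequency table of n-word sequences from the tokenized sample
build_ngrams <- function(tokens, n) {
  ngrams <- unlist(lapply(tokens, function(w) {
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(ngrams), decreasing = TRUE)
}
set.seed(1234)                       # arbitrary seed, for reproducibility
tokens_sample <- sample_lines(clean_lines)
digrams   <- build_ngrams(tokens_sample, 2)
trigrams  <- build_ngrams(tokens_sample, 3)
quadgrams <- build_ngrams(tokens_sample, 4)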

The figure below summarizes the frequencies of ngrams observed in the sample datasets with and without stopwords. First of all, similarly to the word frequencies, we can see a long tail of low-frequency ngrams. The higher the order of the ngram, the more ngrams appear with low frequency. Also, a direct comparison between the corpora with and without stopwords shows that removing the stopwords greatly reduces the number of available ngrams.

Let’s look at the most frequent ngrams in the “with stopwords” corpus, listed in the table below.

rank digrams freq trigrams freq quadgrams freq
1 of the 42953 one of the 3402 the end of the 738
2 in the 41107 a lot of 3082 the rest of the 680
3 to the 21228 to be a 1873 for the first time 641
4 on the 19579 thanks for the 1835 at the end of 637
5 for the 19470 going to be 1730 at the same time 528
6 to be 16391 i want to 1566 thanks for the follow 478
7 at the 14017 out of the 1490 is going to be 471
8 and the 12640 the end of 1481 one of the most 443
9 in a 12186 it was a 1421 in the middle of 406
10 with the 10419 some of the 1383 is one of the 393
11 is a 9943 as well as 1372 to be able to 389
12 it was 9677 the u s 1360 going to be a 385
13 for a 9273 be able to 1333 when it comes to 383
14 i have 8786 i dont know 1267 i dont want to 349
15 from the 8780 part of the 1197 cant wait to see 336
16 i was 8697 i have a 1194 thank you for the 325
17 and i 8376 i have to 1177 if you want to 318
18 it is 8296 looking forward to 1145 in the u s 310
19 with a 8292 the rest of 1084 one of the best 300
20 will be 8073 the first time 1064 in the united states 280

Let’s compare them with the most frequent ngrams in the “without stopwords” corpus, listed in the table below.

rank digrams freq trigrams freq quadgrams freq
1 right now 2282 new york city 269 vested interests vested interests 251
2 new york 1967 interests vested interests 251 interests vested interests vested 250
3 year old 1966 vested interests vested 251 amazon services amazon eu 42
4 last year 1854 let us know 238 cake cake cake cake 42
5 last night 1498 happy mothers day 217 martin luther king jr 40
6 years ago 1410 two years ago 161 just finished mi run 37
7 high school 1369 happy new year 146 rock roll hall fame 33
8 first time 1271 president barack obama 137 new york new jersey 29
9 feel like 1265 cinco de mayo 119 amp amp amp gt 28
10 last week 1216 new york times 118 mg cholesterol mg sodium 28
11 make sure 1067 world war ii 118 happy cinco de mayo 27
12 looking forward 1060 will take place 109 calories protein carbohydrate fat 26
13 can get 1052 looking forward seeing 93 protein carbohydrate fat saturated 26
14 looks like 927 gov chris christie 92 cholesterol mg sodium fiber 25
15 even though 921 first time since 86 let us know think 25
16 new jersey 842 year old son 82 amp amp amp amp 24
17 just got 802 year old daughter 81 carbohydrate fat saturated mg 23
18 one day 779 four years ago 76 fat saturated mg cholesterol 23
19 next week 773 three years ago 76 get real rewards just 23
20 two years 768 new years eve 75 new york stock exchange 23

As we can see, the ngrams in the second table seem more specific.

Conclusions for future modeling

The available text corpus is large and needs to be filtered carefully before modeling. Given the long tails of low-frequency items, it may be useful to trim down the number of collected words and ngrams. When trimming down the corpus I will do the following (sketched in the code after this list):

  1. keep single words that represent 95% of the total word count
  2. remove ngrams with a frequency of 1.
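
A sketch of both trimming rules, assuming word_freq and the ngram tables are frequency tables sorted in decreasing order, as in the earlier sketches:

# 1. keep the most frequent words that together cover 95% of all word occurrences
cum_share <- cumsum(word_freq) / sum(word_freq)
word_freq_trimmed <- word_freq[cum_share <= 0.95]
# 2. drop ngrams that occur only once
digrams_trimmed <- digrams[digrams > 1]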

An initial comparison of object.size() values for the sample ngram corpus (see the snippet after this list) shows that after this reduction:

  1. “with stopwords” corpus size is reduced from 1.2Gb to 122.7Mb
  2. “without stopwords” corpus size is reduced from 907.6Mb to 45.4Mb
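
The sizes were obtained with object.size(); for example, using the illustrative object names from the sketches above:

# Report object sizes before and after trimming
format(object.size(digrams), units = "Mb")
format(object.size(digrams_trimmed), units = "Mb")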

This step will be helpful in making the downstream analysis more lightweight. Introducing higher order ngrams may be necessary, but will increase the size of the entire corpus.