To begin with, I am using only the English version of the files. All three files, en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt, are read in using readLines. For example, the blogs file was read in as follows:
blogsFile <- "./Coursera-Swiftkey/final/en_US/en_US.blogs.txt"
# Open the file in binary mode so that embedded control characters do not
# truncate the read; skipNul drops any embedded nul characters.
con <- file(blogsFile, "rb")
blogsVec <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
Before cleaning the data, I wanted to get an idea of the size, line count and word count of each file. The following table summarises this basic information for the three original data files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. The counts were obtained by running the following shell commands in Git Bash against each file:
wc -c
wc -l
wc -w
| file name | size (MB) | #lines | #words |
|---|---|---|---|
| en_US.blogs.txt | 210 | 899288 | 37334131 |
| en_US.news.txt | 206 | 1010242 | 34372530 |
| en_US.twitter.txt | 167 | 2360148 | 30373583 |
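For reference, roughly equivalent figures can also be computed directly in R. The sketch below is illustrative only: the paths match the layout used above, the helper name fileStats is my own, and the word count simply splits on whitespace, so it may differ slightly from wc -w.

```r
# Sketch: file size, line count and whitespace-delimited word count in R.
files <- c("./Coursera-Swiftkey/final/en_US/en_US.blogs.txt",
           "./Coursera-Swiftkey/final/en_US/en_US.news.txt",
           "./Coursera-Swiftkey/final/en_US/en_US.twitter.txt")

fileStats <- function(path) {
  con   <- file(path, "rb")                        # binary mode, as above
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)
  data.frame(file   = basename(path),
             sizeMB = round(file.size(path) / 1024^2, 1),
             lines  = length(lines),
             words  = sum(lengths(strsplit(lines, "\\s+"))))
}

do.call(rbind, lapply(files, fileStats))
```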
In the first phase, I applied a conservative set of cleaning transformations to the raw text. This strategy is admittedly too conservative; however, my first goal, as per the advice given in the DS Capstone Survival Guide, is to get a working data product. Once that is done, I will make the prediction tool more robust by using a less conservative approach to cleaning.
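For concreteness, a conservative cleaning pass along these lines could be expressed with the tm package. This is a sketch only: the transformations shown (lower-casing and removal of punctuation, numbers and extra whitespace) are illustrative rather than an exact record of my pipeline, and the object name cleanCorpus is arbitrary.

```r
library(tm)

# Illustrative conservative cleaning pass on a character vector such as blogsVec.
cleanCorpus <- VCorpus(VectorSource(blogsVec))
cleanCorpus <- tm_map(cleanCorpus, content_transformer(tolower))
cleanCorpus <- tm_map(cleanCorpus, removePunctuation)
cleanCorpus <- tm_map(cleanCorpus, removeNumbers)
cleanCorpus <- tm_map(cleanCorpus, stripWhitespace)
```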
Next, we construct some basic plots to get more insight into the data sets, starting with the distribution of word lengths.
We plot histograms of word lengths in each file.
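A minimal sketch of how such a histogram can be produced is shown below, using nchar on whitespace-split tokens; blogsVec is the vector read in earlier, while blogWords and the axis labels are illustrative.

```r
# Word-length histogram for the blogs file (the same idea applies to news and twitter).
blogWords   <- unlist(strsplit(blogsVec, "\\s+"))
blogWords   <- blogWords[nzchar(blogWords)]        # drop empty tokens
wordLengths <- nchar(blogWords)

hist(wordLengths[wordLengths <= 20],               # trim a few extreme outliers
     breaks = 20,
     main   = "Word lengths in en_US.blogs.txt",
     xlab   = "word length (characters)")
```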
As can be seen from the plots, it may be sufficient to look at only words of length 5-10, together with the n words surrounding them, to compute the n-grams, where n = 2, 3, 4, 5, etc.
We compute the frequencies of words in each data file and sort them in decreasing order of frequency. The tables below give the number of unique words needed in a frequency-sorted dictionary to cover 50% (first table) and 90% (second table) of all word instances in each file; a sketch of this coverage computation follows the tables.
| file name | #unique words (50%) | list of unique words (50%) |
|---|---|---|
| en_US.blogs.txt | 2 | one, will |
| en_US.news.txt | 2 | said, will |
| en_US.twitter.txt | 1 | im |
| file name | #unique words (90%) | list of unique words (90%) |
|---|---|---|
| en_US.blogs.txt | 39 | one, will, just, like, can, time, get, im, now, know, day, new, well, also, back, make, little, people, first, really, see, love, much, good, us, even, dont, think, way, go, two, made, going, things, last, many, still, year, life |
| en_US.news.txt | 55 | said, will, one, year, new, two, also, can, first, time, last, years, just, state, like, people, get, m, s, three, city, percent, now, school, back, game, million, make, says, day, home, county, many, even, well, good, going, may, high, made, season, team, police, p, way, dont, work, u, much, still, st, four, go, take, old |
| en_US.twitter.txt | 19 | im, just, like, get, love, good, will, day, thanks, dont, can, rt, now, one, u, know, time, great, today |
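The coverage counts above can be obtained with a helper along the following lines. This is a sketch: the function name coverage is my own, it takes a plain word vector (e.g. tokens from the cleaned corpus), and the commented example calls assume the blogWords vector from the histogram sketch above.

```r
# How many of the most frequent words are needed to cover a given share of all
# word instances in a word vector?
coverage <- function(words, threshold = 0.9) {
  freq   <- sort(table(words), decreasing = TRUE)
  cumCov <- cumsum(freq) / sum(freq)
  n      <- which(cumCov >= threshold)[1]
  list(nWords = n, words = names(freq)[seq_len(n)])
}

# Example (using a cleaned word vector such as blogWords):
# coverage(blogWords, 0.5)
# coverage(blogWords, 0.9)
```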
To get a sense of the distribution of the unique words that cover 90% of the data in each of the three files, we plot word clouds.
Word cloud of unique words that cover 90% of en_US.blogs.txt
Word cloud of unique words that cover 90% of en_US.news.txt
Word cloud of unique words that cover 90% of en_US.twitter.txt
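The word clouds can be generated with the wordcloud package (which attaches RColorBrewer). The sketch below reuses the coverage helper and blogWords vector from the sketches above; the palette and layout options are illustrative.

```r
library(wordcloud)
library(RColorBrewer)

# Word cloud of the unique words covering 90% of a file (blogWords as above).
freq <- sort(table(blogWords), decreasing = TRUE)
cov  <- coverage(blogWords, 0.9)
wordcloud(words        = cov$words,
          freq         = as.numeric(freq[cov$words]),
          colors       = brewer.pal(8, "Dark2"),
          random.order = FALSE)
```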
Based on the distribution of the unique words, I gather that en_US.news.txt draws on the largest set of words, followed by en_US.blogs.txt, while en_US.twitter.txt uses the smallest set of unique words. So I should probably prepare a sample in which 30% of the text comes from en_US.blogs.txt, 50% from en_US.news.txt and 20% from en_US.twitter.txt.
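One way to draw such a sample is sketched below. It assumes newsVec and twitterVec have been read in the same way as blogsVec; the total sample size, seed, helper name sampleLines and output path are all illustrative.

```r
# Sketch: build a combined sample whose composition is roughly
# 30% blogs, 50% news and 20% twitter (proportions of the sample, not of each file).
set.seed(1234)
sampleSize <- 100000   # illustrative total number of lines in the sample

sampleLines <- function(lines, n) lines[sample(length(lines), n)]

combinedSample <- c(sampleLines(blogsVec,   0.3 * sampleSize),
                    sampleLines(newsVec,    0.5 * sampleSize),
                    sampleLines(twitterVec, 0.2 * sampleSize))

dir.create("./sample", showWarnings = FALSE)
writeLines(combinedSample, "./sample/en_US.sample.txt")
```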