The motivation for this report is to:
Review criteria:
The instructors of the course have indicated that we can download the data set from this URL. It is a zip file that, once downloaded and decompressed, produces a directory called “final” with three subdirectories:
According to the instructors, the lines have been obtained on the web from public domain sources. The middle part of each file name indicates this source: blogs, news or twitter. For a first look, I opened each file with a plain-text editor (Notepad++) and found that each contains hundreds of thousands of lines.
I have used the statistical software R, which is the basis of this specialization, to obtain basic information about the files. However, following the assignment instructions and the mentors’ advice in the course forum, I will not show the analysis code itself; instead, I will present the results in a way that does not require knowledge of data science or of programming in R.
In short, I have made four calculations for each file: its size on disk (in megabytes), the memory it occupies once loaded into R, its number of lines and its number of words.
These are the results:
##                file megabytes memory   lines    words
## 1   en_US.blogs.txt    200.42 248.47  899049 37334361
## 2    en_US.news.txt    196.28 249.51 1010228 34371817
## 3 en_US.twitter.txt    159.89 301.83 2360052 30374501
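Although the report itself avoids showing the analysis code, here is a minimal sketch of how statistics like these could be obtained in R. The file paths and the rough whitespace-based word count are assumptions for illustration, not the exact code used for the table above.

```r
# Illustrative sketch only: compute disk size, memory size, line count and
# a rough word count for each file (paths assumed to be under final/en_US/).
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file      = basename(path),
    megabytes = file.size(path) / 1024^2,                 # size on disk
    memory    = as.numeric(object.size(lines)) / 1024^2,  # size once loaded in R
    lines     = length(lines),
    words     = sum(lengths(strsplit(lines, "\\s+")))     # rough word count
  )
}

do.call(rbind, lapply(files, file_stats))
```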
We will comment on these results in the following section, after showing them in the form of graphs.
In order to represent the data on a single scale, and thus make them easier to compare, I have plotted the lines in tens of thousands and the words in hundreds of thousands. That is, in the “megabytes” and “memory” panels a bar of height “100” indicates 100 MB, but in the “lines” panel that same bar indicates 1,000,000 lines (one million) and in the “words” panel it indicates 10,000,000 words (ten million).
The first thing that catches my attention in the first two graphs is that, although the blogs and news files take up more disk space than the Twitter file, the latter takes up much more space once loaded into memory.
The explanation seems to be that the Twitter file has far more lines than either of the other two files (more, in fact, than both together), as the third graph shows.
Finally, the last panel shows that the larger number of lines in the Twitter file does not translate into a larger number of words; in fact, it contains fewer words than either of the other two files. The character limit imposed by this social network, which forces users to write short messages, is surely one reason, but so is the often casual style of its users.
After this first glance and the basic exploratory analysis described above, I took a random sample of 1% of each of the three files and used it to build a database of texts, or “corpus”.
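As an illustration of this step, here is a minimal sketch of 1% sampling and corpus creation using the tm package. The seed value, the choice to collapse each sample into a single document, and the reuse of the `files` vector from the earlier sketch are assumptions, not the exact code behind the report.

```r
# Illustrative sketch: sample 1% of the lines of each file and build a corpus
# with one document per source (blogs, news, twitter), using 'tm'.
library(tm)

set.seed(1234)  # reproducible sampling (seed value is arbitrary)

sample_file <- function(path, fraction = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, floor(length(lines) * fraction))
}

# 'files' is the vector of file paths defined in the earlier sketch
samples <- lapply(files, sample_file)

corpus <- VCorpus(VectorSource(sapply(samples, paste, collapse = " ")))
```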
On the corpus I performed five basic cleaning operations to prevent, for example, “hello, jack”, “hello, jack!” and “hello jack” from being treated as different expressions:
I am still studying how to remove offensive and distasteful words, as well as web-specific elements (email addresses, user nicknames, URLs, …).
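The exact five operations are not reproduced here, but a sketch of typical cleaning transformations with tm, which would make the three expressions above equivalent, might look like this (the particular set of transformations is an assumption):

```r
# Illustrative cleaning sketch with 'tm' (not necessarily the five operations
# used in the report).
corpus <- tm_map(corpus, content_transformer(tolower))  # "Hello" -> "hello"
corpus <- tm_map(corpus, removePunctuation)             # "jack!" -> "jack"
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
```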
With all this, I have obtained a matrix of 3 documents (the blogs, news and twitter files mentioned above) and 16171 terms:
## <<DocumentTermMatrix (documents: 3, terms: 16171)>>
## Sparsity           : 16%
## Maximal term length: 17
## Sample             :
## Docs can  get just like new one said  the time will
##    1 583  412  618  560 331 703  155 1090  509  665
##    2 466  380  429  374 550 688 2137 2209  411  875
##    3 875 1110 1433 1132 699 779  179  897  722  962
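A document-term matrix like the one shown above can be obtained directly from the cleaned corpus; the short sketch below assumes the tm package and the `corpus` object from the previous sketches.

```r
# Build the document-term matrix and inspect it (prints the dimensions,
# sparsity, maximal term length and a sample of frequent terms).
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```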
N-grams are sequences of words in a corpus. A unigram is simply a single word, a bigram is a sequence of two consecutive words, and a trigram is a sequence of three consecutive words.
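As an illustration of how n-gram counts can be obtained, here is a sketch using the RWeka tokenizer together with tm; this is one possible approach, not necessarily the one used to produce the tables below.

```r
# Illustrative bigram extraction; trigrams work the same way with min/max = 3.
library(RWeka)

bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_dtm <- DocumentTermMatrix(corpus,
                                 control = list(tokenize = bigram_tokenizer))

# Total frequency of each bigram across the three documents
bigram_freq <- sort(colSums(as.matrix(bigram_dtm)), decreasing = TRUE)
head(bigram_freq, 10)
```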
So far I have analyzed the unigrams, bigrams and trigrams in the corpus with the following results:
Unigrams (single words):

##    ngram frequency
## 1    the      4196
## 2   will      2502
## 3   just      2480
## 4   said      2471
## 5    one      2170
## 6   like      2066
## 7    can      1924
## 8    get      1902
## 9   time      1642
## 10   new      1580

Bigrams (pairs of consecutive words):

##        ngram frequency
## 1     i love       404
## 2     i dont       403
## 3    i think       396
## 4     i just       341
## 5      i can       321
## 6     i know       295
## 7     i will       246
## 8     i want       223
## 9     i cant       214
## 10 right now       205

Trigrams (triples of consecutive words):

##                ngram frequency
## 1        i dont know        65
## 2       i dont think        61
## 3          i think i        46
## 4           i know i        44
## 5        i dont want        37
## 6  happy mothers day        36
## 7        i feel like        36
## 8      cant wait see        34
## 9           i wish i        34
## 10       i cant wait        32
So far, my tests toward a predictive model have focused on studying the possibilities of the so-called Katz back-off model.
For example, if I start with an expression like:
I Really ...
and use my bigrams, trigrams and the Katz back-off model, I get a small table of the most likely words to continue the expression:
##                  ngram probability
## 1        i really like       0.083
## 2        i really dont       0.061
## 3        i really hope       0.061
## 4        i really love       0.061
## 5        i really need       0.054
## 6  i really appreciate       0.047
## 7        i really want       0.047
## 8        i really hate       0.025
## 9      i really really       0.025
## 10       i really cant       0.011
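To make the idea behind these probabilities concrete, here is a much-simplified back-off sketch that uses a fixed discount instead of the full Good-Turing estimation of Katz’s model. The names `trigram_freq` and `bigram_freq` (named count vectors like the one built in the n-gram sketch) are assumptions, and this is not the exact code of my tests.

```r
# Simplified back-off sketch: try trigrams that start with the two-word
# prefix; if none exist, back off to bigrams starting with its last word.
predict_next <- function(prefix, trigram_freq, bigram_freq, discount = 0.5) {
  hits <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
  if (length(hits) > 0) {
    # Discounted trigram probabilities conditioned on the observed prefix
    probs <- (hits - discount) / bigram_freq[prefix]
    return(sort(probs, decreasing = TRUE))
  }
  last <- tail(strsplit(prefix, " ")[[1]], 1)
  hits <- bigram_freq[startsWith(names(bigram_freq), paste0(last, " "))]
  sort(hits / sum(hits), decreasing = TRUE)
}

# Example: candidate continuations of "i really"
# predict_next("i really", trigram_freq, bigram_freq)
```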
So my idea is to estimate the accuracy I get with this model, improve it while maintaining acceptable speed, and build a Shiny app that lets the user enter text and applies the model to suggest the next word.
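A minimal sketch of such a Shiny app might look like the following; the layout, the input names and the `predict_next()` helper from the sketch above are assumptions about a possible design, not the final application.

```r
# Minimal Shiny sketch: a text box plus a reactive output with the suggestion.
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word suggestion"),
  textInput("phrase", "Type a phrase:", value = "I really"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    # predict_next(), trigram_freq and bigram_freq come from the earlier sketch
    preds <- predict_next(tolower(trimws(input$phrase)), trigram_freq, bigram_freq)
    if (length(preds) == 0) return("(no suggestion)")
    # Keep only the predicted last word of the best-scoring n-gram
    tail(strsplit(names(preds)[1], " ")[[1]], 1)
  })
}

shinyApp(ui = ui, server = server)
```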
It is only the second week and I hope to find material that offers new ideas in the coming weeks.
Among the topics to review, I should mention:
Any help, comment or advice will be greatly appreciated.