The motivation for this report is to:
Review criteria:
The instructors of the course have indicated that we can download the data set from this URL. It is a zip file that, once downloaded and decompressed, produces a directory called “final” with three subdirectories:
According to the instructors, the lines have been obtained on the web from public domain sources. The middle part of each file name indicates this source: blogs, news or twitter. For a first look, I opened each file with a plain-text editor (Notepad++) and found that each contains hundreds of thousands of lines.
I have used the statistical software R, which is the basis of this specialization, to obtain basic information about the files. However, following the assignment instructions and the mentors’ advice in the course forum, I will not show the analysis code itself; instead, I will present the results in a way that does not require knowledge of data science or of programming in R.
In short, I have made four calculations for each file: its size on disk (in megabytes), the memory it occupies once loaded into R, its number of lines and its number of words.
These are the results:
##                file megabytes memory   lines    words
## 1   en_US.blogs.txt    200.42 248.47  899049 37334361
## 2    en_US.news.txt    196.28 249.51 1010228 34371817
## 3 en_US.twitter.txt    159.89 301.83 2360052 30374501
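Although the report itself avoids showing the analysis code, here is a minimal sketch of how statistics like these could be obtained in R. The file paths and the rough whitespace-based word count are assumptions for illustration, not the exact code used for the table above.

```r
# Illustrative sketch only: compute disk size, memory size, line count and
# a rough word count for each file (paths assumed to be under final/en_US/).
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file      = basename(path),
    megabytes = file.size(path) / 1024^2,                 # size on disk
    memory    = as.numeric(object.size(lines)) / 1024^2,  # size once loaded in R
    lines     = length(lines),
    words     = sum(lengths(strsplit(lines, "\\s+")))     # rough word count
  )
}

do.call(rbind, lapply(files, file_stats))
```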
We will comment on these results in the following section, after showing them in the form of graphs.
In order to represent the data on a single scale, and thus make them easier to compare, I have plotted the lines in tens of thousands and the words in hundreds of thousands. That is, in the “megabytes” and “memory” panels a bar of height “100” indicates 100 MB, but in the “lines” panel that same bar indicates 1,000,000 lines (one million) and in the “words” panel it indicates 10,000,000 words (ten million).
The first thing that catches my attention in the first two graphs is that, although the blogs and news files take up more disk space than the Twitter file, the latter takes up much more space once loaded into memory.
The explanation seems to be that the Twitter file has far more lines than either of the other two files (more, in fact, than both together), as the third graph shows.
Finally, the last panel shows that the larger number of lines in the Twitter file does not translate into a larger number of words; in fact, it contains fewer words than either of the other two files. The character limit imposed by this social network, which forces users to write short messages, is surely one reason, but so is the often casual style of its users.
After this first glance and the basic exploratory analysis described above, I took a random sample of 1% of each of the three files and used it to build a database of texts, or “corpus”.
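As an illustration of this step, here is a minimal sketch of 1% sampling and corpus creation using the tm package. The seed value, the choice to collapse each sample into a single document, and the reuse of the `files` vector from the earlier sketch are assumptions, not the exact code behind the report.

```r
# Illustrative sketch: sample 1% of the lines of each file and build a corpus
# with one document per source (blogs, news, twitter), using 'tm'.
library(tm)

set.seed(1234)  # reproducible sampling (seed value is arbitrary)

sample_file <- function(path, fraction = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, floor(length(lines) * fraction))
}

# 'files' is the vector of file paths defined in the earlier sketch
samples <- lapply(files, sample_file)

corpus <- VCorpus(VectorSource(sapply(samples, paste, collapse = " ")))
```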
On the corpus I performed five basic cleaning operations to prevent, for example, “hello, jack”, “hello, jack!” and “hello jack” from being treated as different expressions:
I am still studying how to remove offensive and distasteful words, as well as web-specific elements (email addresses, user nicknames, URLs, …).
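The exact five operations are not reproduced here, but a sketch of typical cleaning transformations with tm, which would make the three expressions above equivalent, might look like this (the particular set of transformations is an assumption):

```r
# Illustrative cleaning sketch with 'tm' (not necessarily the five operations
# used in the report).
corpus <- tm_map(corpus, content_transformer(tolower))  # "Hello" -> "hello"
corpus <- tm_map(corpus, removePunctuation)             # "jack!" -> "jack"
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
```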
With all this, I have obtained a matrix of 3 documents (the blogs, news and twitter files mentioned above) and 16171 terms:
## <<DocumentTermMatrix (documents: 3, terms: 16171)>>
## Sparsity           : 16%
## Maximal term length: 17
## Sample             :
## Docs can  get just like new one said  the time will
##    1 583  412  618  560 331 703  155 1090  509  665
##    2 466  380  429  374 550 688 2137 2209  411  875
##    3 875 1110 1433 1132 699 779  179  897  722  962
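A document-term matrix like the one shown above can be obtained directly from the cleaned corpus; the short sketch below assumes the tm package and the `corpus` object from the previous sketches.

```r
# Build the document-term matrix and inspect it (prints the dimensions,
# sparsity, maximal term length and a sample of frequent terms).
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```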
N-grams are sequences of words in a corpus. A unigram is simply a single word, a bigram is a sequence of two consecutive words, and a trigram is a sequence of three consecutive words.
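As an illustration of how n-gram counts can be obtained, here is a sketch using the RWeka tokenizer together with tm; this is one possible approach, not necessarily the one used to produce the tables below.

```r
# Illustrative bigram extraction; trigrams work the same way with min/max = 3.
library(RWeka)

bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_dtm <- DocumentTermMatrix(corpus,
                                 control = list(tokenize = bigram_tokenizer))

# Total frequency of each bigram across the three documents
bigram_freq <- sort(colSums(as.matrix(bigram_dtm)), decreasing = TRUE)
head(bigram_freq, 10)
```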
So far I have analyzed the unigrams, bigrams and trigrams in the corpus with the following results:
Unigrams (single words):

##    ngram frequency
## 1    the      4196
## 2   will      2502
## 3   just      2480
## 4   said      2471
## 5    one      2170
## 6   like      2066
## 7    can      1924
## 8    get      1902
## 9   time      1642
## 10   new      1580

Bigrams (pairs of consecutive words):

##        ngram frequency
## 1     i love       404
## 2     i dont       403
## 3    i think       396
## 4     i just       341
## 5      i can       321
## 6     i know       295
## 7     i will       246
## 8     i want       223
## 9     i cant       214
## 10 right now       205

Trigrams (triples of consecutive words):

##                ngram frequency
## 1        i dont know        65
## 2       i dont think        61
## 3          i think i        46
## 4           i know i        44
## 5        i dont want        37
## 6  happy mothers day        36
## 7        i feel like        36
## 8      cant wait see        34
## 9           i wish i        34
## 10       i cant wait        32
So far, my tests toward a predictive model have focused on studying the possibilities of the so-called Katz back-off model.
For example, if I start with an expression like:
I Really ...
and use my bigrams, trigrams and the Katz back-off model, I get a small table of the most likely words to continue the expression:
##                  ngram probability
## 1        i really like       0.083
## 2        i really dont       0.061
## 3        i really hope       0.061
## 4        i really love       0.061
## 5        i really need       0.054
## 6  i really appreciate       0.047
## 7        i really want       0.047
## 8        i really hate       0.025
## 9      i really really       0.025
## 10       i really cant       0.011
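To make the idea behind these probabilities concrete, here is a much-simplified back-off sketch that uses a fixed discount instead of the full Good-Turing estimation of Katz’s model. The names `trigram_freq` and `bigram_freq` (named count vectors like the one built in the n-gram sketch) are assumptions, and this is not the exact code of my tests.

```r
# Simplified back-off sketch: try trigrams that start with the two-word
# prefix; if none exist, back off to bigrams starting with its last word.
predict_next <- function(prefix, trigram_freq, bigram_freq, discount = 0.5) {
  hits <- trigram_freq[startsWith(names(trigram_freq), paste0(prefix, " "))]
  if (length(hits) > 0) {
    # Discounted trigram probabilities conditioned on the observed prefix
    probs <- (hits - discount) / bigram_freq[prefix]
    return(sort(probs, decreasing = TRUE))
  }
  last <- tail(strsplit(prefix, " ")[[1]], 1)
  hits <- bigram_freq[startsWith(names(bigram_freq), paste0(last, " "))]
  sort(hits / sum(hits), decreasing = TRUE)
}

# Example: candidate continuations of "i really"
# predict_next("i really", trigram_freq, bigram_freq)
```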
So my idea is to estimate the accuracy I get with this model, improve it while maintaining acceptable speed, and build a Shiny app that lets the user enter text and applies the model to suggest the next word.
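A minimal sketch of such a Shiny app might look like the following; the layout, the input names and the `predict_next()` helper from the sketch above are assumptions about a possible design, not the final application.

```r
# Minimal Shiny sketch: a text box plus a reactive output with the suggestion.
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word suggestion"),
  textInput("phrase", "Type a phrase:", value = "I really"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    # predict_next(), trigram_freq and bigram_freq come from the earlier sketch
    preds <- predict_next(tolower(trimws(input$phrase)), trigram_freq, bigram_freq)
    if (length(preds) == 0) return("(no suggestion)")
    # Keep only the predicted last word of the best-scoring n-gram
    tail(strsplit(names(preds)[1], " ")[[1]], 1)
  })
}

shinyApp(ui = ui, server = server)
```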
It is only the second week and I hope to find material that offers new ideas in the coming weeks.
Among the topics to review, I should mention:
Any help, comment or advice will be greatly appreciated.