Milestone Report

The Capstone Project for the Data Science Specialisation at Johns Hopkins University is about text prediction. It relates to disciplines such as Machine Learning and prediction, applying modern computational technologies and models to a class of problems that has been studied since the first computers were developed.

Obtaining and loading the Data

The data is available from the Coursera servers through links provided in the course material. To speed up the process, the compressed file was downloaded and uncompressed to make the source files available. The source files contain text extracted from blog, news and Twitter feeds.
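
A minimal sketch of this step in R (the URL is assumed here to be the Coursera-SwiftKey link from the course material):

# Hedged sketch: download and uncompress the dataset
# (URL assumed from the course material)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip")   # extracts the ./final/ directory tree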

Cleaning

As with any data, and even more so with data of public origin from the Internet, security measures and some initial cleaning are necessary even before any science is applied.
Multiple encodings and some rough data were checked and repaired using basic Operating System commands such as tr.
One specific file contained some NUL characters, which many products interpret as end of data. Those had to be removed so the whole data set could be loaded and manipulated in R; a sketch of this repair is shown below.
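
As an illustration (the file name is assumed here), the repair can be done with tr on the command line or, equivalently, directly in R:

# Hedged sketch: strip NUL bytes so the whole file can be read in R.
# Shell equivalent: tr -d '\000' < en_US.twitter.txt > en_US.twitter.clean.txt
infile <- "./final/en_US/en_US.twitter.txt"          # file name assumed
bytes  <- readBin(infile, what = "raw", n = file.info(infile)$size)
writeBin(bytes[bytes != as.raw(0)], "./final/en_US/en_US.twitter.clean.txt")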

Loading

In essence, all three files contain only lines of text, which can be treated as a single variable. The texts themselves contain a good number of characters from different languages and/or encodings. Most of these characters are punctuation or symbols that help humans interpret the text, but that add very little value for a simple prediction algorithm.
The package quanteda was used for the initial load and basic manipulation, as it proved to be more efficient during the early stages of development. The package quanteda also relies on the package stringi for encoding and string manipulation, which offers a much broader set of functions.
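
For example, stringi can coerce the text into valid UTF-8 before any further processing (a minimal sketch; the file name is assumed):

library(stringi)
# Read the raw lines, skipping embedded NULs, then replace any
# invalid byte sequences with the Unicode replacement character
lines <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
lines <- stri_enc_toutf8(lines, validate = TRUE)
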
The data load itself can be done in a few simple steps:

library(quanteda)
# Load all English source files (textfile() is the loader in the
# quanteda version used here; newer releases moved it to readtext)
Data      <- textfile('./final/en_US/*.txt')
CorpusAll <- corpus(Data)
# Build a document-feature matrix of 2- to 4-grams
DFMS      <- dfm(x             = CorpusAll,
                 removeTwitter = TRUE,    # strip Twitter artifacts (@, #)
                 stem          = FALSE,   # keep full words for prediction
                 ngrams        = 2:4,     # bigrams, trigrams and 4-grams
                 concatenator  = " ")
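
The frequency counts reported below can then be extracted from the resulting matrix; for instance, quanteda's topfeatures() returns the most frequent features (a short sketch, using the DFMS object built above):

# Ten most frequent n-grams across the whole corpus
topfeatures(DFMS, n = 10)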

Data Summary

The summary data collected from the original text files is shown below; Documents is the number of text lines per source, and the nGrams columns give the number of distinct features of each order:

Source    Documents   nGrams_1   nGrams_2   nGrams_3
Blog         899288     303002    6187109   19043332
News        1010242     253499    6127832   17961995
Twitter     2360148     359461    5182920   13650858

Top Ten Features: Blogs

1-gram     Count   2-gram     Count   3-gram          Count
the      1852053   of the    187189   one of the      14369
and      1091446   in the    153952   a lot of        12203
to       1068566   to the     85946   i don t         11804
a         898380   on the     75224   i don ’t        10193
i         895022   to be      68305   as well as       6898
of        876575   and the    58511   to be a          6823
in        597310   for the    57981   it was a         6781
that      482712   and i      57270   some of the      6690
it        481822   i was      49286   out of the       6513
is        432040   i have     47695   the end of       6460

Top Ten Features: News

1-gram     Count   2-gram     Count   3-gram             Count
the      1974316   of the    187240   one of the         14605
to        906124   in the    179388   a lot of           11563
and       889475   to the     84440   as well as          6249
a         877966   on the     73039   part of the         5698
of        774494   for the    69082   the end of          5653
in        679044   at the     58253   out of the          5635
for       353885   and the    52248   according to the    5605
that      346771   in a       51289   some of the         5470
is        284217   to be      47219   to be a             5381
on        269854   with the   43523   in the first        5254

Top Ten Features: Twitter

1-gram    Count   2-gram        Count   3-gram               Count
the      937378   in the        78449   thanks for the       23620
to       788606   for the       73963   looking forward to    8832
i        723300   of the        56956   thank you for         8683
a        611409   on the        48532   i love you            8354
you      547770   to be         47092   for the follow        7927
and      438510   to the        43434   going to be           7416
for      385334   thanks for    43004   can’t wait to         7375
in       380352   at the        37243   i want to             7116
of       359623   i love        35908   a lot of              6250
is       358739   going to      34275   to be a               5995

Interesting findings

The challenges involving text, even at a basic level, are the variety of ramifications and the often conflicting rules. When cleaning or filtering, it is common to have situations where a particular text fragment is matched by more than one rule, and those rules disagree on whether to keep or remove a particular punctuation mark or delimiter.
When Machine Learning comes to the rescue, hardware resource consumption may add to the problem.

Creating a prediction algorithm and Shiny application

Following from the point above, although there are quite a few alternatives, the hard decisions come from conflicting factors. A better model that gives more precision is likely to require more CPU and memory. If one falls back to disk to accommodate memory consumption, then performance is affected. As professionals or students, we have to deal with deadlines as well.
For this particular project, simplicity is key, so the approach will be to summarise the raw data beforehand, creating frequency tables and using those to make predictions in the application; a sketch of such a lookup follows.
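
A minimal sketch of this idea (the function, table layout and column names are assumptions for illustration, not the final implementation): each precomputed table maps an (n)-word prefix to observed next words and their counts, and prediction backs off from the longest matching prefix.

# Hedged sketch: frequency-table lookup with simple back-off.
# freq_tables is assumed to be a list of data frames, one per prefix
# length, each with columns: prefix, next_word, count.
predict_next <- function(phrase, freq_tables) {
    words <- tolower(strsplit(phrase, "\\s+")[[1]])
    for (n in 3:1) {                  # longest prefix first: 3, 2, 1 words
        if (length(words) >= n) {
            prefix <- paste(tail(words, n), collapse = " ")
            hits   <- freq_tables[[n]][freq_tables[[n]]$prefix == prefix, ]
            if (nrow(hits) > 0) {
                return(hits$next_word[which.max(hits$count)])
            }
        }
    }
    "the"                             # fall back to the most frequent unigram
}

For example, predict_next("one of", freq_tables) would return the word most frequently observed after "one of" in the summarised data, falling back to shorter prefixes when the full phrase was never seen.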