The Capstone Project for the Data Science Specialisation at Johns Hopkins University is about text prediction. It relates to disciplines such as Machine Learning and prediction, applying modern computational technologies and models to a class of problems that has existed since the first computers were developed.
The data is available from the Coursera servers, with links provided in the course material. To speed up the process, the compressed file was downloaded and uncompressed to make the source files available. The source files contain text extracted from blog, news and Twitter feeds.
As with any data, and even more so with data of public origin from the Internet, security measures and some initial cleaning are necessary before any science is applied.
Multiple encodings and some rough data were checked and repaired using basic operating system commands such as `tr`. One specific file contained some `NULL` characters, which many products interpret as end of data. Those had to be repaired so that the whole data set could be loaded and manipulated in R.
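The same repair can also be done directly in R. The sketch below is illustrative rather than the exact command used: it reads the raw bytes of a file, drops the `NULL` bytes, and writes a clean copy (the file names are assumptions).

```r
# Illustrative repair in R (file names are assumptions): read the raw
# bytes, drop the NUL bytes that some readers treat as end-of-data,
# and write a clean copy of the file.
path  <- "./final/en_US/en_US.news.txt"
bytes <- readBin(path, what = "raw", n = file.size(path))
clean <- bytes[bytes != as.raw(0)]
writeBin(clean, "./final/en_US/en_US.news.clean.txt")
```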
In essence, all three files contain only lines of text, which can be treated as a single variable. The texts themselves contain a good number of characters in different languages and/or encodings. Most of these characters are punctuation or symbols that help humans interpret the text, but they add very little value for a simple prediction algorithm.
The package `quanteda` was used for the initial load and basic manipulation, as it proved to be more efficient during the early stages of development. `quanteda` also relies on the package `stringi`, which offers a much broader set of functions for encoding and string manipulation.
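As a small illustration of how `stringi` helps with encoding issues, the sketch below guesses a string's encoding and converts it to UTF-8; the sample string and the source encoding are assumptions for the example.

```r
library(stringi)

# Guess a string's encoding, then convert it to UTF-8 before
# further processing (sample string is an assumption).
some_line <- "caf\xe9"                          # a latin1-encoded byte string
stri_enc_detect(some_line)[[1]]$Encoding[1]     # best guess, e.g. "ISO-8859-1"
stri_conv(some_line, from = "ISO-8859-1", to = "UTF-8")
```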
The data load can be done in a few simple steps:

```r
library(quanteda)

# Read the source files and build a corpus
Data      <- textfile('./final/en_US/s*.txt')
CorpusAll <- corpus(Data)

# Build a document-feature matrix of 1- to 3-grams (matching the
# summaries below), without stemming and with Twitter-specific
# characters (@, #) removed
DFMS <- dfm(x = CorpusAll,
            removeTwitter = TRUE,
            stem = FALSE,
            ngrams = 1:3,
            concatenator = " ")
```
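The per-source frequency tables below can be derived from such a document-feature matrix; one possible way, assuming a dfm built per source and per n-gram order, is quanteda's `topfeatures()`:

```r
# Top 10 most frequent features of a dfm; running this once per
# source and per n-gram order would reproduce the tables below
topfeatures(DFMS, n = 10)
```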
The summary data collected from the original text files is shown below:
Source | Documents | nGrams_1 | nGrams_2 | nGrams_3
---|---|---|---|---
Blog | 899288 | 303002 | 6187109 | 19043332
News | 1010242 | 253499 | 6127832 | 17961995
Twitter | 2360148 | 359461 | 5182920 | 13650858
The ten most frequent 1-, 2- and 3-grams for each source are listed below:

Blog_1_Feature | Blog_1_Count | Blog_2_Feature | Blog_2_Count | Blog_3_Feature | Blog_3_Count
---|---|---|---|---|---
the | 1852053 | of the | 187189 | one of the | 14369 |
and | 1091446 | in the | 153952 | a lot of | 12203 |
to | 1068566 | to the | 85946 | i don t | 11804 |
a | 898380 | on the | 75224 | i don ’t | 10193 |
i | 895022 | to be | 68305 | as well as | 6898 |
of | 876575 | and the | 58511 | to be a | 6823 |
in | 597310 | for the | 57981 | it was a | 6781 |
that | 482712 | and i | 57270 | some of the | 6690 |
it | 481822 | i was | 49286 | out of the | 6513 |
is | 432040 | i have | 47695 | the end of | 6460 |
News_1_Feature | News_1_Count | News_2_Feature | News_2_Count | News_3_Feature | News_3_Count
---|---|---|---|---|---
the | 1974316 | of the | 187240 | one of the | 14605 |
to | 906124 | in the | 179388 | a lot of | 11563 |
and | 889475 | to the | 84440 | as well as | 6249 |
a | 877966 | on the | 73039 | part of the | 5698 |
of | 774494 | for the | 69082 | the end of | 5653 |
in | 679044 | at the | 58253 | out of the | 5635 |
for | 353885 | and the | 52248 | according to the | 5605 |
that | 346771 | in a | 51289 | some of the | 5470 |
is | 284217 | to be | 47219 | to be a | 5381 |
on | 269854 | with the | 43523 | in the first | 5254 |
Twitter_1_Feature | Twitter_1_Count | Twitter_2_Feature | Twitter_2_Count | Twitter_3_Feature | Twitter_3_Count
---|---|---|---|---|---
the | 937378 | in the | 78449 | thanks for the | 23620 |
to | 788606 | for the | 73963 | looking forward to | 8832 |
i | 723300 | of the | 56956 | thank you for | 8683 |
a | 611409 | on the | 48532 | i love you | 8354 |
you | 547770 | to be | 47092 | for the follow | 7927 |
and | 438510 | to the | 43434 | going to be | 7416 |
for | 385334 | thanks for | 43004 | can’t wait to | 7375 |
in | 380352 | at the | 37243 | i want to | 7116 |
of | 359623 | i love | 35908 | a lot of | 6250 |
is | 358739 | going to | 34275 | to be a | 5995 |
The challenge with text, even at a basic level, is the variety of ramifications and the often conflicting rules. When cleaning or filtering, it is common to have situations where a particular text fragment is matched by more than one rule, and those rules disagree on whether to keep or remove a particular punctuation mark or delimiter.
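As a hypothetical illustration of such a conflict, the two base-R substitutions below disagree on the apostrophe in a contraction: one rule strips all punctuation, while the other preserves intra-word apostrophes.

```r
# Two cleaning rules that disagree on the same fragment:
x <- "I don't know -- really!"
gsub("[[:punct:]]", "", x)     # "I dont know  really"  (apostrophe lost)
gsub("[^[:alnum:]' ]", "", x)  # "I don't know  really" (apostrophe kept)
```

The effect of such disagreements is visible in the tables above, where both `i don t` and `i don ’t` appear as distinct trigrams.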
When Machine Learning comes to the rescue, hardware consumption may add to the problem.
Following from the previous point, although there are quite a few alternatives, the hard decisions come from conflicting factors. A better, more precise model is likely to require more CPU and memory. If one falls back to disk to accommodate memory consumption, performance suffers. As professionals or students, we also have to deal with deadlines.
For this particular project, simplicity is key, so the approach will be to summarise the raw data beforehand, creating frequency tables and using them to make predictions in the application.
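A minimal sketch of that approach is shown below. It assumes pre-computed trigram counts; the `freq3` table, the `predict_next()` helper and the sample rows (taken from the frequency tables above) are all illustrative, not the final implementation.

```r
library(data.table)

# Assumed pre-computed trigram counts (sample rows taken from the
# tables above; a real table would hold all observed trigrams)
freq3 <- data.table(
  w1    = c("one", "a",   "thanks"),
  w2    = c("of",  "lot", "for"),
  w3    = c("the", "of",  "the"),
  count = c(14369, 12203, 23620)
)
setkey(freq3, w1, w2)

# Predict the most frequent continuation of a two-word prefix
predict_next <- function(first, second) {
  hits <- freq3[.(first, second)][order(-count)]
  hits$w3[1]
}

predict_next("a", "lot")  # "of"
```

Because the heavy counting is done once, ahead of time, the application itself only performs cheap keyed lookups, which keeps both memory use and response time low.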