The Capstone Project for the Data Science Specialisation at Johns Hopkins University is about text prediction. It relates to disciplines such as Machine Learning and prediction, applying modern computational technologies and models to a class of problems that has existed since the first computers were developed.
The data is available from the Coursera servers, with links provided in the course material. To speed up the process, the compressed file was downloaded and uncompressed to make the source files available. The source files contain text extracted from blog, news and Twitter feeds.
As with any data, and even more so with data of public origin from the Internet, security measures and some initial cleaning are necessary before any science is applied.
Multiple encodings and some rough data were checked and repaired using basic operating system commands such as `tr`. One specific file contained some `NULL` characters, which many products interpret as end of data. Those had to be repaired so that the whole data set could be loaded and manipulated in R.
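The same repair can also be done directly in R. The sketch below is illustrative rather than the exact command used: it reads the raw bytes of a file, drops the `NULL` bytes, and writes a clean copy (the file names are assumptions).

```r
# Illustrative repair in R (file names are assumptions): read the raw
# bytes, drop the NUL bytes that some readers treat as end-of-data,
# and write a clean copy of the file.
path  <- "./final/en_US/en_US.news.txt"
bytes <- readBin(path, what = "raw", n = file.size(path))
clean <- bytes[bytes != as.raw(0)]
writeBin(clean, "./final/en_US/en_US.news.clean.txt")
```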
In essence, all three files contain only lines of text, which can be treated as a single variable. The texts themselves contain a good number of characters in different languages and/or encodings. Most of these characters are punctuation or symbols that help humans interpret the text, but they add very little value for a simple prediction algorithm.
The package `quanteda` was used for the initial load and basic manipulation, as it proved to be more efficient during the early stages of development. `quanteda` also relies on the package `stringi`, which offers a much broader set of functions for encoding and string manipulation.
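As a small illustration of how `stringi` helps with encoding issues, the sketch below guesses a string's encoding and converts it to UTF-8; the sample string and the source encoding are assumptions for the example.

```r
library(stringi)

# Guess a string's encoding, then convert it to UTF-8 before
# further processing (sample string is an assumption).
some_line <- "caf\xe9"                          # a latin1-encoded byte string
stri_enc_detect(some_line)[[1]]$Encoding[1]     # best guess, e.g. "ISO-8859-1"
stri_conv(some_line, from = "ISO-8859-1", to = "UTF-8")
```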
The data load can be done in a few simple steps:

```r
library(quanteda)

# Read the source files and build a corpus
Data      <- textfile('./final/en_US/s*.txt')
CorpusAll <- corpus(Data)

# Build a document-feature matrix of 1- to 3-grams (matching the
# summaries below), without stemming and with Twitter-specific
# characters (@, #) removed
DFMS <- dfm(x = CorpusAll,
            removeTwitter = TRUE,
            stem = FALSE,
            ngrams = 1:3,
            concatenator = " ")
```
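The per-source frequency tables below can be derived from such a document-feature matrix; one possible way, assuming a dfm built per source and per n-gram order, is quanteda's `topfeatures()`:

```r
# Top 10 most frequent features of a dfm; running this once per
# source and per n-gram order would reproduce the tables below
topfeatures(DFMS, n = 10)
```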
The summary data collected from the original text files is shown below:
Source | Documents | nGrams_1 | nGrams_2 | nGrams_3
---|---|---|---|---
Blog | 899288 | 303002 | 6187109 | 19043332
News | 1010242 | 253499 | 6127832 | 17961995
Twitter | 2360148 | 359461 | 5182920 | 13650858
The ten most frequent 1-, 2- and 3-grams for each source are listed below:

Blog_1_Feature | Blog_1_Count | Blog_2_Feature | Blog_2_Count | Blog_3_Feature | Blog_3_Count
---|---|---|---|---|---
the | 1852053 | of the | 187189 | one of the | 14369 |
and | 1091446 | in the | 153952 | a lot of | 12203 |
to | 1068566 | to the | 85946 | i don t | 11804 |
a | 898380 | on the | 75224 | i don ’t | 10193 |
i | 895022 | to be | 68305 | as well as | 6898 |
of | 876575 | and the | 58511 | to be a | 6823 |
in | 597310 | for the | 57981 | it was a | 6781 |
that | 482712 | and i | 57270 | some of the | 6690 |
it | 481822 | i was | 49286 | out of the | 6513 |
is | 432040 | i have | 47695 | the end of | 6460 |
News_1_Feature | News_1_Count | News_2_Feature | News_2_Count | News_3_Feature | News_3_Count
---|---|---|---|---|---
the | 1974316 | of the | 187240 | one of the | 14605 |
to | 906124 | in the | 179388 | a lot of | 11563 |
and | 889475 | to the | 84440 | as well as | 6249 |
a | 877966 | on the | 73039 | part of the | 5698 |
of | 774494 | for the | 69082 | the end of | 5653 |
in | 679044 | at the | 58253 | out of the | 5635 |
for | 353885 | and the | 52248 | according to the | 5605 |
that | 346771 | in a | 51289 | some of the | 5470 |
is | 284217 | to be | 47219 | to be a | 5381 |
on | 269854 | with the | 43523 | in the first | 5254 |
Twitter_1_Feature | Twitter_1_Count | Twitter_2_Feature | Twitter_2_Count | Twitter_3_Feature | Twitter_3_Count
---|---|---|---|---|---
the | 937378 | in the | 78449 | thanks for the | 23620 |
to | 788606 | for the | 73963 | looking forward to | 8832 |
i | 723300 | of the | 56956 | thank you for | 8683 |
a | 611409 | on the | 48532 | i love you | 8354 |
you | 547770 | to be | 47092 | for the follow | 7927 |
and | 438510 | to the | 43434 | going to be | 7416 |
for | 385334 | thanks for | 43004 | can’t wait to | 7375 |
in | 380352 | at the | 37243 | i want to | 7116 |
of | 359623 | i love | 35908 | a lot of | 6250 |
is | 358739 | going to | 34275 | to be a | 5995 |
The challenge with text, even at a basic level, is the variety of ramifications and the often conflicting rules. When cleaning or filtering, it is common to have situations where a particular text fragment is matched by more than one rule, and those rules disagree on whether to keep or remove a particular punctuation mark or delimiter.
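As a hypothetical illustration of such a conflict, the two base-R substitutions below disagree on the apostrophe in a contraction: one rule strips all punctuation, while the other preserves intra-word apostrophes.

```r
# Two cleaning rules that disagree on the same fragment:
x <- "I don't know -- really!"
gsub("[[:punct:]]", "", x)     # "I dont know  really"  (apostrophe lost)
gsub("[^[:alnum:]' ]", "", x)  # "I don't know  really" (apostrophe kept)
```

The effect of such disagreements is visible in the tables above, where both `i don t` and `i don ’t` appear as distinct trigrams.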
When Machine Learning comes to the rescue, hardware consumption may add to the problem.
Following from the previous point, although there are quite a few alternatives, the hard decisions come from conflicting factors. A better, more precise model is likely to require more CPU and memory. If one falls back to disk to accommodate memory consumption, performance suffers. As professionals or students, we also have to deal with deadlines.
For this particular project, simplicity is key, so the approach will be to summarise the raw data beforehand, creating frequency tables and using them to make predictions in the application.
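A minimal sketch of that approach is shown below. It assumes pre-computed trigram counts; the `freq3` table, the `predict_next()` helper and the sample rows (taken from the frequency tables above) are all illustrative, not the final implementation.

```r
library(data.table)

# Assumed pre-computed trigram counts (sample rows taken from the
# tables above; a real table would hold all observed trigrams)
freq3 <- data.table(
  w1    = c("one", "a",   "thanks"),
  w2    = c("of",  "lot", "for"),
  w3    = c("the", "of",  "the"),
  count = c(14369, 12203, 23620)
)
setkey(freq3, w1, w2)

# Predict the most frequent continuation of a two-word prefix
predict_next <- function(first, second) {
  hits <- freq3[.(first, second)][order(-count)]
  hits$w3[1]
}

predict_next("a", "lot")  # "of"
```

Because the heavy counting is done once, ahead of time, the application itself only performs cheap keyed lookups, which keeps both memory use and response time low.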