This report explains the first steps in ingesting and understanding the three corpora of data, with the goal of creating a “next word” prediction system. The corpora analysed contain anonymised excerpts from Twitter, blogs and news articles.
The first step in the analysis was to read each of the files and count the number of lines, words and characters in each one, using the `wc` command. The result was:
| File | Lines | Words | Characters |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334434 | 210160014 |
| en_US.twitter.txt | 2360148 | 30373830 | 167105338 |
| en_US.news.txt | 1010242 | 34372596 | 205811889 |
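For reference, the same counts can be approximated in Python; this is only a rough sketch (the file names are assumed from the table above, and `wc` counts bytes rather than characters, so the last column may differ slightly):

```python
# Rough Python equivalent of the `wc` counts above.
files = ["en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"]

for name in files:
    n_lines = n_words = n_chars = 0
    with open(name, encoding="utf-8", errors="ignore") as f:
        for line in f:                      # stream the file to keep memory use low
            n_lines += 1
            n_words += len(line.split())
            n_chars += len(line)
    print(f"{name}: {n_lines} lines, {n_words} words, {n_chars} characters")
```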
Initially, I created a 60/40 split for train/test; unfortunately, my computer could not handle a dataset of that size, so I ended up using 5000 lines from each text file for the training set, totalling 15000 lines.
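A minimal sketch of that sampling step, together with the word-frequency count used for the tables below, might look like this in Python (the file names, the random seed and the naive lower-case/whitespace tokenisation are my assumptions, not the exact code used):

```python
import random
from collections import Counter

random.seed(1234)  # arbitrary seed, just for reproducibility
files = ["en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"]

# Draw 5000 random lines from each file (15000 lines in total).
sample = []
for name in files:
    with open(name, encoding="utf-8", errors="ignore") as f:
        lines = f.readlines()
    sample += random.sample(lines, 5000)

# Naive tokenisation: lower-case and split on whitespace.
tokens = [w for line in sample for w in line.lower().split()]
print(len(tokens), "words in the sample")

freq = Counter(tokens)
print(freq.most_common(30))      # most frequent words
print(freq.most_common()[-30:])  # tokens that appear only once
```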
From those 15000 lines, I obtained a total of 887302 words. The most frequent words in the sample were:

| Word | Frequency |
|---|---|
| the | 43970 |
| and | 22603 |
| for | 9083 |
| that | 8936 |
| with | 6403 |
| was | 5859 |
| you | 5832 |
| have | 4559 |
| this | 4432 |
| but | 4072 |
| are | 4006 |
| from | 3500 |
| not | 3399 |
| his | 2859 |
| they | 2829 |
| will | 2606 |
| has | 2518 |
| all | 2496 |
| about | 2446 |
| one | 2307 |
| just | 2256 |
| when | 2224 |
| what | 2201 |
| who | 2200 |
| had | 2136 |
| out | 2097 |
| your | 2048 |
| can | 2005 |
| their | 1960 |
| like | 1955 |

At the other end of the distribution, the rarest tokens appear only once each; many of them still carry stray quotation marks, brackets and punctuation from the raw text:

| Word | Frequency |
|---|---|
| "blessing | 1 | "blessing |
| “bibles” | 1 | “bibles” |
| "beta | 1 | "beta |
| “besties”, | 1 | “besties”, |
| “beloved”, | 1 | “beloved”, |
| “beeramids” | 1 | “beeramids” |
| “beaten” | 1 | “beaten” |
| "bag | 1 | "bag |
| "away | 1 | "away |
| "austin-based | 1 | "austin-based |
| "auntie | 1 | "auntie |
| "attempt | 1 | "attempt |
| “anti-semitism”. | 1 | “anti-semitism”. |
| "animal | 1 | "animal |
| “anger” | 1 | “anger” |
| “aha” | 1 | “aha” |
| "against | 1 | "against |
| "afternoon, | 1 | "afternoon, |
| "adjust | 1 | "adjust |
| “abstinence” | 1 | “abstinence” |
| "absolute | 1 | "absolute |
| "abolitionist | 1 | "abolitionist |
| “a” | 1 | “a” |
| "[arrest] | 1 | "[arrest] |
| “50s” | 1 | “50s” |
| “3.” | 1 | “3.” |
| "18 | 1 | "18 |
| “-ies”: | 1 | “-ies”: |
| "‘bloody’ | 1 | "‘bloody’ |
| "‘beauty,’ | 1 | "‘beauty,’ |
It seems that the most frequent words are the ones usually considered stop words. These are typically removed when the goal is to extract some “meaning” from the texts, as in sentiment analysis, but removing them won't help me in predicting the next word.
I plan to create a model that ‘learns’ the most common chains of four words: given the last three words typed, it predicts the fourth by choosing the most frequent continuation in its memory. If the sequence of three words is not in memory, it will back off and try only the last two words, and then only the last one. If it cannot find any matching chain at all, it will simply suggest ‘the’, since it is the most frequent word.
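A minimal sketch of this backoff idea is shown below, assuming `tokens` is the list of words from the training sample above (the function names and structure are mine; this is not a finished implementation):

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=4):
    """For each context of 1 to 3 preceding words, count how often each next word follows."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):               # chains of 2, 3 and 4 words
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            nxt = tokens[i + n - 1]
            model[context][nxt] += 1
    return model

def predict(model, last_words, fallback="the"):
    """Back off from a 3-word context to 2 words, then 1; otherwise return the fallback word."""
    for size in (3, 2, 1):
        context = tuple(last_words[-size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return fallback

# Example usage, assuming `tokens` comes from the sampled corpus above:
# model = build_model(tokens)
# print(predict(model, ["one", "of", "the"]))
```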