Introduction

This report explains the first steps in ingesting and understanding the three corpora of data, with the goal of creating a “next word” prediction system. The corpora analysed contain anonymised excerpts from Twitter, blogs and news articles.

Procedures

Basic Summaries

The first step in the analysis was to read each of the files and count the number of lines, words and characters in each one, using the wc command:
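A call along these lines produces the counts below (the exact flags are an assumption; wc -c reports bytes, which for mostly ASCII text approximates the character count):

wc -l -w -c en_US.blogs.txt en_US.twitter.txt en_US.news.txt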

The result was:

File Lines Words Characters
en_US.blogs.txt 899288 37334434 210160014
en_US.twitter.txt 2360148 30373830 167105338
en_US.news.txt 1010242 34372596 205811889

Train Test Split

Initially, I created a 60/40 train/test split; unfortunately, my computer could not handle a dataset of that size, so I ended up using 5000 lines from each text file for the training set, totalling 15000 lines.

Those 15000 lines contain a total of 887302 words.
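As a rough sketch in Python of how such a sample can be drawn (the file names are as above; the random seed and the whitespace word count are illustrative assumptions, not necessarily the exact procedure used):

import random

random.seed(42)  # illustrative seed, for reproducibility
files = ["en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"]

train_lines = []
for path in files:
    with open(path, encoding="utf-8", errors="ignore") as f:
        lines = f.read().splitlines()
    # draw 5000 random lines from each corpus for the training set
    train_lines.extend(random.sample(lines, 5000))

# count words by splitting on whitespace (tokenisation is an assumption)
total_words = sum(len(line.split()) for line in train_lines)
print(len(train_lines), "lines,", total_words, "words")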

Most Frequent Words

Word Frequency
the 43970
and 22603
for 9083
that 8936
with 6403
was 5859
you 5832
have 4559
this 4432
but 4072
are 4006
from 3500
not 3399
his 2859
they 2829
will 2606
has 2518
all 2496
about 2446
one 2307
just 2256
when 2224
what 2201
who 2200
had 2136
out 2097
your 2048
can 2005
their 1960
like 1955
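A frequency table like the one above can be built with a simple word count over the training sample. The sketch below continues from the train_lines list of the previous sketch; lower-casing and whitespace tokenisation are assumptions, so it will not reproduce the exact numbers shown:

from collections import Counter

# count every token in the 15000 sampled lines
freq = Counter()
for line in train_lines:
    freq.update(word.lower() for word in line.split())

# 30 most frequent words
for word, count in freq.most_common(30):
    print(word, count)

# words that appear only once (the "least frequent" tail)
singletons = [word for word, count in freq.items() if count == 1]
print(len(singletons), "words appear only once")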

Least Frequent Words

Word Frequency
"blessing 1
“bibles” 1
"beta 1
“besties”, 1
“beloved”, 1
“beeramids” 1
“beaten” 1
"bag 1
"away 1
"austin-based 1
"auntie 1
"attempt 1
“anti-semitism”. 1
"animal 1
“anger” 1
“aha” 1
"against 1
"afternoon, 1
"adjust 1
“abstinence” 1
"absolute 1
"abolitionist 1
“a” 1
"[arrest] 1
“50s” 1
“3.” 1
"18 1
“-ies”: 1
"‘bloody’ 1
"‘beauty,’ 1

Conclusion

It seems that the most frequent words are those usually considered stop words. They are often removed when we want to extract some “meaning” from the texts, as in sentiment analysis, but removing them won't help me in predicting the next word.

I plan to create a model that ‘learns’ the most common chains of four words and, given any three words, predicts the fourth by choosing the most frequent continuation in its memory. If the sequence of three words is not in memory, it will back off to the last two words, or to one word if that also fails. If no matching chain can be found at all, it will simply predict ‘the’, since it is the most frequent word.
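As a sketch of that back-off idea (an outline only, not the final model, and again assuming the train_lines sample from above), the chains can be stored as dictionaries that map 1-, 2- and 3-word prefixes to their most frequent continuation:

from collections import Counter, defaultdict

def build_model(lines):
    # for each 1/2/3-word prefix, count how often each next word follows it
    counts = {1: defaultdict(Counter), 2: defaultdict(Counter), 3: defaultdict(Counter)}
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - 1):
            for n in (1, 2, 3):
                if i - n + 1 >= 0:
                    prefix = tuple(words[i - n + 1:i + 1])
                    counts[n][prefix][words[i + 1]] += 1
    # keep only the most frequent continuation for every prefix
    return {n: {p: c.most_common(1)[0][0] for p, c in counts[n].items()} for n in counts}

def predict(model, text):
    # back off from the last three words to two, then one; fall back to "the"
    words = text.lower().split()
    for n in (3, 2, 1):
        prefix = tuple(words[-n:])
        if len(prefix) == n and prefix in model[n]:
            return model[n][prefix]
    return "the"

model = build_model(train_lines)
print(predict(model, "thanks for the"))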