This report explains the first steps in ingesting and understanding the three corpora of data, with the goal of creating a “next word” prediction system. The corpora analysed contain anonymised excerpts from Twitter, blogs and news articles.
The first step in the analysis was to read each of the files and count the number of lines, words and characters in each one, using the `wc` command. The result was:
| File | Lines | Words | Characters |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334434 | 210160014 |
| en_US.twitter.txt | 2360148 | 30373830 | 167105338 |
| en_US.news.txt | 1010242 | 34372596 | 205811889 |
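For reference, the same counts can be approximated in Python; this is only a rough sketch (the file names are assumed from the table above, and `wc` counts bytes rather than characters, so the last column may differ slightly):

```python
# Rough Python equivalent of the `wc` counts above.
files = ["en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"]

for name in files:
    n_lines = n_words = n_chars = 0
    with open(name, encoding="utf-8", errors="ignore") as f:
        for line in f:                      # stream the file to keep memory use low
            n_lines += 1
            n_words += len(line.split())
            n_chars += len(line)
    print(f"{name}: {n_lines} lines, {n_words} words, {n_chars} characters")
```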
Initially, I created a 60/40 split for train/test; unfortunately, my computer could not handle a dataset of that size, so I ended up using 5000 lines from each text file for the training set, totalling 15000 lines.
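A minimal sketch of that sampling step, together with the word-frequency count used for the tables below, might look like this in Python (the file names, the random seed and the naive lower-case/whitespace tokenisation are my assumptions, not the exact code used):

```python
import random
from collections import Counter

random.seed(1234)  # arbitrary seed, just for reproducibility
files = ["en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt"]

# Draw 5000 random lines from each file (15000 lines in total).
sample = []
for name in files:
    with open(name, encoding="utf-8", errors="ignore") as f:
        lines = f.readlines()
    sample += random.sample(lines, 5000)

# Naive tokenisation: lower-case and split on whitespace.
tokens = [w for line in sample for w in line.lower().split()]
print(len(tokens), "words in the sample")

freq = Counter(tokens)
print(freq.most_common(30))      # most frequent words
print(freq.most_common()[-30:])  # tokens that appear only once
```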
From those 15000 lines, I obtained a total of 887302 words. The most frequent words in the sample were:

| Word | Frequency |
|---|---|
| the | 43970 |
| and | 22603 |
| for | 9083 |
| that | 8936 |
| with | 6403 |
| was | 5859 |
| you | 5832 |
| have | 4559 |
| this | 4432 |
| but | 4072 |
| are | 4006 |
| from | 3500 |
| not | 3399 |
| his | 2859 |
| they | 2829 |
| will | 2606 |
| has | 2518 |
| all | 2496 |
| about | 2446 |
| one | 2307 |
| just | 2256 |
| when | 2224 |
| what | 2201 |
| who | 2200 |
| had | 2136 |
| out | 2097 |
| your | 2048 |
| can | 2005 |
| their | 1960 |
| like | 1955 |

At the other end of the distribution, the rarest tokens appear only once each; many of them still carry stray quotation marks, brackets and punctuation from the raw text:

| Word | Frequency |
|---|---|
| "blessing | 1 | "blessing |
| “bibles” | 1 | “bibles” |
| "beta | 1 | "beta |
| “besties”, | 1 | “besties”, |
| “beloved”, | 1 | “beloved”, |
| “beeramids” | 1 | “beeramids” |
| “beaten” | 1 | “beaten” |
| "bag | 1 | "bag |
| "away | 1 | "away |
| "austin-based | 1 | "austin-based |
| "auntie | 1 | "auntie |
| "attempt | 1 | "attempt |
| “anti-semitism”. | 1 | “anti-semitism”. |
| "animal | 1 | "animal |
| “anger” | 1 | “anger” |
| “aha” | 1 | “aha” |
| "against | 1 | "against |
| "afternoon, | 1 | "afternoon, |
| "adjust | 1 | "adjust |
| “abstinence” | 1 | “abstinence” |
| "absolute | 1 | "absolute |
| "abolitionist | 1 | "abolitionist |
| “a” | 1 | “a” |
| "[arrest] | 1 | "[arrest] |
| “50s” | 1 | “50s” |
| “3.” | 1 | “3.” |
| "18 | 1 | "18 |
| “-ies”: | 1 | “-ies”: |
| "‘bloody’ | 1 | "‘bloody’ |
| "‘beauty,’ | 1 | "‘beauty,’ |
It seems that the most frequent words are the ones usually considered stop words. These are typically removed when the goal is to extract some “meaning” from the texts, as in sentiment analysis, but removing them won't help me in predicting the next word.
I plan to create a model that ‘learns’ the most common chains of four words: given the last three words typed, it predicts the fourth by choosing the most frequent continuation in its memory. If the sequence of three words is not in memory, it will back off and try only the last two words, and then only the last one. If it cannot find any matching chain at all, it will simply suggest ‘the’, since it is the most frequent word.
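A minimal sketch of this backoff idea is shown below, assuming `tokens` is the list of words from the training sample above (the function names and structure are mine; this is not a finished implementation):

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=4):
    """For each context of 1 to 3 preceding words, count how often each next word follows."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):               # chains of 2, 3 and 4 words
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            nxt = tokens[i + n - 1]
            model[context][nxt] += 1
    return model

def predict(model, last_words, fallback="the"):
    """Back off from a 3-word context to 2 words, then 1; otherwise return the fallback word."""
    for size in (3, 2, 1):
        context = tuple(last_words[-size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return fallback

# Example usage, assuming `tokens` comes from the sampled corpus above:
# model = build_model(tokens)
# print(predict(model, ["one", "of", "the"]))
```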