In this project, only the English files retrieved from blogs, Twitter, and news were used, and a random sample of 1,500 lines from each file was considered due to computational cost constraints.
Three text files were sampled with the function LaF::sample_lines (see the sketch below):
en_US.blogs.txt: 1,500 random lines from 899,288 total lines
en_US.twitter.txt: 1,500 random lines from 2,360,148 total lines
en_US.news.txt: 1,500 random lines from 1,010,242 total lines
The three files together contain 4,269,678 lines, of which only the 4,500 randomly sampled lines were used. This corresponds to approximately 0.1% of the data.
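A minimal sketch of the sampling step, assuming the raw files sit in a local final/en_US/ directory (the path and the seed are assumptions, not taken from the original analysis):

```r
library(LaF)

set.seed(1234)  # assumed seed, only so the sample is reproducible

# Draw 1,500 random lines from each corpus file
blogs   <- sample_lines("final/en_US/en_US.blogs.txt",   n = 1500)
twitter <- sample_lines("final/en_US/en_US.twitter.txt", n = 1500)
news    <- sample_lines("final/en_US/en_US.news.txt",    n = 1500)

corpus_lines <- c(blogs, twitter, news)  # 4,500 lines in total
```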
In the data cleaning step, lines containing profanity were removed. The list of profanity words was retrieved from the lexicon package. After this filter, the number of lines was reduced to 4,323.
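The report does not state which of the lexicon package's profanity lists was used; the sketch below uses lexicon::profanity_alvarez as one plausible choice and only checks single-word matches:

```r
library(lexicon)

# corpus_lines: the sampled lines from the previous sketch
profanity <- unique(tolower(profanity_alvarez))

# Flag a line if any of its tokens appears in the profanity list
has_profanity <- vapply(
  strsplit(tolower(corpus_lines), "[^a-z']+"),
  function(tokens) any(tokens %in% profanity),
  logical(1)
)

clean_lines <- corpus_lines[!has_profanity]
length(clean_lines)  # 4,323 lines remained in the original analysis
```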
Also in the text cleaning steps, numbers and special characters were removed and white space was stripped. Stop words were not removed because, in this type of prediction problem, stop words can serve as features and labels in the model, i.e. as predictors or targets.
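A base R version of these cleaning steps (the original analysis may have used a text-mining package such as tm instead; this is only a sketch):

```r
# clean_lines: profanity-filtered lines from the previous sketch
clean_text <- gsub("[0-9]+", " ", clean_lines)      # remove numbers
clean_text <- gsub("[^A-Za-z' ]", " ", clean_text)  # remove special characters
clean_text <- gsub("\\s+", " ", clean_text)         # collapse runs of white space
clean_text <- trimws(clean_text)                    # strip leading/trailing spaces
```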
The word frequency distributions of the three files follow the pattern common to natural-language text (a small number of words accounts for most occurrences), and the problem suggests the use of an n-gram model to predict the next word as an option for the user who will be typing text in the application.
Some sentence endings are more likely than others, conditioned on the words that came before.
An n-gram model assigns a probability score to each candidate word based on the corpus text provided to the model as training data. From the many word combinations observed in that corpus, the model learns to correctly predict the most common next word given the previous words of a sentence.
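For a bigram model, for example, the maximum-likelihood estimate is P(w2 | w1) = count(w1 w2) / count(w1). Below is a toy illustration of that idea on the cleaned sample; the function name is illustrative, and a usable model would also need smoothing or back-off for unseen word pairs:

```r
# clean_text: cleaned lines from the previous sketch
words   <- unlist(strsplit(tolower(clean_text), "\\s+"))
bigrams <- paste(head(words, -1), tail(words, -1))

bigram_counts  <- table(bigrams)
unigram_counts <- table(words)

# Most likely next word after `previous`, by raw bigram counts
predict_next <- function(previous) {
  candidates <- bigram_counts[startsWith(names(bigram_counts),
                                         paste0(previous, " "))]
  if (length(candidates) == 0) return(NA_character_)
  best <- names(which.max(candidates))  # e.g. "of the"
  strsplit(best, " ")[[1]][2]           # return the second word
}

predict_next("of")  # typically "the" in English text

# MLE estimate of P("the" | "of"), assuming the pair occurs in the sample
bigram_counts[["of the"]] / unigram_counts[["of"]]
```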