Data Science Specialization

2/23/2025

Background

This is a capstone project for the Data Science Specialization path, by Johns Hopkins University.

The goal is to build a predictive text application that suggests the next word of a sentence, using a model trained on a data set of blogs, Twitter and news text.

The source data set contains text files in four languages, taken from Twitter, blogs and news. For this capstone we use only the English files (under the ‘/en_US/’ folder).

In a first analysis, these are the stats of the source data:

##        Source.files   Lines    Words Unique_words
## 1 en_US.twitter.txt 2360148 17111806       302505
## 2   en_US.blogs.txt  899288 19347162       252893
## 3    en_US.news.txt 1010242 19760894       212079

More details of the exploratory data analysis can be found on this page: https://rpubs.com/rmmoya/swiftkey_project_data_analysis

Building the model (1 of 2)

First, we pre-process each file to extract its words: we remove punctuation, numbers, empty strings, one-letter words and stopwords. Stopwords are words like “a”, “an” and “the” that carry little meaning on their own and can be removed from text data to improve the performance of machine learning models.
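The cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function name `tokenize` and the very short `STOPWORDS` set are assumptions made for the example, and a real run would use a fuller stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for this sketch.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is"}

def tokenize(line):
    """Lowercase a line and return the words that survive cleaning."""
    # Replace everything that is not a letter, apostrophe or whitespace
    # (punctuation, digits, symbols) with a space.
    text = re.sub(r"[^a-z'\s]", " ", line.lower())
    # Splitting on whitespace never yields empty strings; stripping stray
    # apostrophes can, and those are dropped by the length filter below.
    words = [w.strip("'") for w in text.split()]
    # Drop empty strings, one-letter words and stopwords.
    return [w for w in words if len(w) > 1 and w not in STOPWORDS]

print(tokenize("The 3 cats sat on a mat, I think!"))
```

With this small stopword set, the call keeps "cats", "sat", "on", "mat" and "think"; a fuller list would also remove "on".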

The predictive text application is built on a word n-gram language model. This model assumes that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.
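As a minimal sketch of this idea, the conditional probability of a word given the previous n − 1 words can be estimated by relative frequency: the count of the n-gram divided by the count of all n-grams sharing the same first n − 1 words. The helper names `ngrams` and `conditional_probs` are made up for this example and are not taken from the project.

```python
from collections import Counter

def ngrams(words, n):
    """Return all runs of n consecutive words as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def conditional_probs(words, n):
    """Estimate P(last word | previous n-1 words) by relative frequency."""
    counts = Counter(ngrams(words, n))
    # Total count of n-grams that share the same (n-1)-word context.
    context_totals = Counter()
    for gram, count in counts.items():
        context_totals[gram[:-1]] += count
    return {gram: count / context_totals[gram[:-1]]
            for gram, count in counts.items()}

words = "new york city new york times new year".split()
bigram_probs = conditional_probs(words, 2)    # context is one word
trigram_probs = conditional_probs(words, 3)   # context is two words
```

In this toy text, "new" is followed by "york" twice and by "year" once, so the bigram estimate for "york" after "new" is 2/3.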

Bigrams and trigrams are extracted from the source database and this is the representation of the most common ones:

Building the model (2 of 2)

We start with bigrams as the basis for our language model. To test the accuracy of its predictions, we take 100 long sentences (more than 20 words), crop each sentence at position 10, and use the model to predict the word in position 11.

Percentile	Total_sentences	Number_of_predictions	Number_of_successes	Accuracy_perc
1	100	5	1	20.0
2	100	8	1	12.5
5	100	25	2	8.0
10	100	37	4	10.8
15	100	54	4	7.4
20	100	71	5	7.1
30	100	78	5	6.4
40	100	88	6	6.8
50	100	94	6	6.3

Beyond the 1st percentile, the ratio of accurate predictions is very low. This indicates that a bigram-based model is not suitable for the task, except for those clear pairs of words that usually go together, i.e., those within the 1st percentile of our database.
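The test procedure in this section can be sketched as follows. This is an assumed reconstruction, not the original test code: `evaluate` and the `predict` callback are hypothetical names, and accuracy is computed as successes divided by predictions made, which matches the ratios in the table (for example, 1 success out of 5 predictions gives 20.0).

```python
def evaluate(sentences, predict, cut=10):
    """Crop each sentence after `cut` words and check the predicted next word.

    `sentences` is a list of word lists; `predict` takes the cropped words
    and returns one word, or None when the model has no suggestion.
    """
    total = predictions = successes = 0
    for words in sentences:
        if len(words) <= cut:
            continue  # too short to have a word at position cut + 1
        total += 1
        guess = predict(words[:cut])
        if guess is None:
            continue  # no matching n-gram, so nothing to score
        predictions += 1
        if guess == words[cut]:
            successes += 1
    accuracy = 100.0 * successes / predictions if predictions else 0.0
    return {
        "total_sentences": total,
        "predictions": predictions,
        "successes": successes,
        "accuracy_perc": round(accuracy, 1),
    }
```

Counting predictions separately from sentences explains why the table can show few predictions at low percentiles: with a small n-gram database, many cropped sentences end in a word the model has never seen.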

Model benchmarking

After some tests, we use a model that combines bigrams and trigrams as follows:

- Keep the 1st percentile of bigrams and weight the bigram probabilities by a very low multiplying factor, e.g., 0.02.
- If a bigram matches the previous word, its probability (multiplied by this factor) is added to the probability of the trigrams that contain the last two words.
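One possible reading of this combination rule is sketched below. The data layout is an assumption made for the example: `trigram_probs` maps a pair of words to the probabilities of the words that follow them, and `bigram_probs` maps a single word to the probabilities of its followers. The function name `predict_next` is hypothetical; only the 0.02 factor comes from the text.

```python
def predict_next(context, trigram_probs, bigram_probs, bigram_weight=0.02):
    """Return the most probable next word, or None if nothing matches.

    context       -- list of words typed so far
    trigram_probs -- {(w1, w2): {next_word: probability}}  (assumed layout)
    bigram_probs  -- {w1: {next_word: probability}}        (assumed layout)
    """
    scores = {}
    # Trigram candidates: words seen after the last two words.
    if len(context) >= 2:
        followers = trigram_probs.get(tuple(context[-2:]), {})
        for word, prob in followers.items():
            scores[word] = scores.get(word, 0.0) + prob
    # Bigram candidates: words seen after the last word, down-weighted
    # and added on top of any trigram score for the same word.
    if context:
        for word, prob in bigram_probs.get(context[-1], {}).items():
            scores[word] = scores.get(word, 0.0) + bigram_weight * prob
    return max(scores, key=scores.get) if scores else None

trigrams = {("happy", "new"): {"year": 0.6, "york": 0.4}}
bigrams = {"new": {"york": 0.9, "year": 0.1}}
best = predict_next(["happy", "new"], trigrams, bigrams)
```

Here "year" scores 0.6 + 0.02 × 0.1 and "york" scores 0.4 + 0.02 × 0.9, so the trigram evidence decides and `best` is "year". When no trigram matches, the weighted bigram scores alone pick the word.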

This is the result, when comparing with the real last word:

Ptile	Total_sentences	No_of_predictions	No_of_successes	Accuracy_perc	Avg_time_secs
10	100	28	6	21	0.24
15	100	37	7	19	0.58
20	100	44	9	20	1.53
30	100	53	16	30	2.35
40	100	59	22	37	3.75
50	100	66	28	42	4.99

We can see that the more trigrams we use from our database, the higher the number of predictions and successes, but there is a penalty in time. Computational performance can still be improved. With these numbers, we should not go beyond the 30th to 40th percentile of the trigrams if the application is to stay responsive.
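The percentile cut-off can be illustrated with a short sketch. It assumes that "percentile" here means keeping the most frequent share of n-grams (for example, the top 30 percent by count); that interpretation and the function name `keep_top_percent` are assumptions, not details confirmed by the project.

```python
import math
from collections import Counter

def keep_top_percent(counts, percent):
    """Keep the most frequent `percent` of n-grams from a count table."""
    # Round up so that a non-empty table always keeps at least one entry.
    keep = max(1, math.ceil(len(counts) * percent / 100))
    return dict(Counter(counts).most_common(keep))

counts = {("w%d" % i, "x", "y"): 100 - i for i in range(10)}
pruned = keep_top_percent(counts, 30)   # keeps the 3 most frequent trigrams
```

A smaller table means fewer entries to search for each prediction, which is the trade-off between coverage and response time shown in the table above.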

How to use the application

To use the application, enter a sentence in the ‘Enter text’ field and click the Predict button. The application looks in the trigram database, complemented by the bigram database, to come up with the most probable next word.

The application also shows a description of the probabilities found in the database, along with a table of the trigram and bigram probabilities, if any are found.
