Prahlad
July 10, 2020
| File | Lines | Words | Size (MB) |
|---|---|---|---|
| Blogs | 899288 | 37570839 | 200.4242 |
| News | 1010242 | 34494539 | 196.2775 |
| Twitter | 2360148 | 30451170 | 159.3641 |
| Total | 4269678 | 102516548 | 556.0658 |
The model uses n-grams to predict the next word from the previous four, three, two, or one words of the input.
As discussed above, we use only the back-off algorithm in its basic form:
Sort the n-grams of each order from highest to lowest frequency.
The system first looks for a 5-gram whose first four words equal the last four words of the input phrase entered by the user, and predicts the last word of that 5-gram.
If no 5-gram is found, the algorithm backs off to 4-grams: it searches for a 4-gram whose first three words equal the last three words of the input phrase.
If no 4-gram is found, the algorithm backs off to 3-grams: it searches for a 3-gram whose first two words equal the last two words of the input phrase.
If no 3-gram is found, the algorithm backs off to bigrams: it searches for a bigram whose first word equals the last word of the input phrase.
If no bigram is found, the algorithm backs off to unigrams and returns the word with the highest frequency.
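The back-off steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the report's actual implementation: the names `build_tables` and `predict_next`, the whitespace tokenization, and the toy corpus are all hypothetical. Storing a frequency counter per context and taking its most frequent entry is equivalent to scanning a frequency-sorted n-gram list for the first match.

```python
from collections import Counter


def build_tables(tokens, max_n=5):
    """Map each order n to {context of n-1 words: Counter of next words}.

    Hypothetical helper: the layout of the lookup tables is an assumption.
    """
    tables = {n: {} for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            context, word = gram[:-1], gram[-1]
            tables[n].setdefault(context, Counter())[word] += 1
    return tables


def predict_next(phrase, tables, max_n=5):
    """Basic back-off: try the 5-gram context first, then shorter ones."""
    words = phrase.lower().split()  # assumed tokenization: lowercase + split
    for n in range(max_n, 0, -1):
        if n - 1 > len(words):
            continue  # the phrase is too short for this order
        # Last n-1 words of the phrase; the empty tuple for unigrams.
        context = tuple(words[len(words) - (n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            # Most frequent continuation at this order.
            return candidates.most_common(1)[0][0]
    return None  # only reached when the tables are empty


# Toy corpus, for illustration only.
corpus = ("i want to go to the park and i want to go to the store "
          "and i want to go to the park").split()
tables = build_tables(corpus)
print(predict_next("i want to go to", tables))   # matched at the 5-gram level
print(predict_next("please go to the", tables))  # backs off to 4-grams
print(predict_next("xyz", tables))               # backs off to unigrams
```

Because the loop returns at the first order that has a match, a lower-order table is consulted only when every higher order fails, which is what "basic form" means here: no discounting or interpolation of probabilities across orders.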
The efficiency of the model depends on the size of the training sample. The model is built from five n-gram matrices, each derived from samples of the original dataset, so it can only be as good as those training samples.
Better accuracy can be achieved by increasing the sample size, but a larger sample requires more computational resources and increases the response time.
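To make the trade-off concrete, the sketch below shows one way the five frequency tables could be built from a random sample of lines. It is a minimal illustration, assuming a simple per-line random sample and whitespace tokenization; the function names `sample_lines` and `ngram_counts`, the `fraction` and `seed` parameters, and the example lines are hypothetical and not taken from the report.

```python
import random
from collections import Counter


def sample_lines(lines, fraction, seed=0):
    """Keep each line with probability `fraction` (assumed sampling scheme)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def ngram_counts(lines, n):
    """Count n-grams within each line; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # assumed tokenization
        # zip over n shifted copies yields every run of n consecutive tokens.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Illustrative lines only; a real run would read the three corpus files.
lines = ["a b c", "a b d", "a b c"]
sample = sample_lines(lines, fraction=1.0)
tables = {n: ngram_counts(sample, n) for n in range(1, 6)}
# most_common() returns the n-grams sorted from highest to lowest frequency.
print(tables[2].most_common())
```

Raising `fraction` grows every table, which is where the extra memory and lookup time come from.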