For this part of the project I will be using the dataset supplied by SwiftKey. It contains text from three sources (blogs, Twitter, and news) in four different languages; here I will only use the English files.
The overall goal of the project is to build a model that predicts the next word in a sentence.
The data files contain a text entry on each line.
| File | Lines | Words |
|---|---|---|
| en_US.twitter.txt | 2,360,148 | 30,359,804 |
| en_US.news.txt | 1,010,242 | 1,010,242 |
| en_US.blogs.txt | 899,288 | 37,334,114 |
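As a rough illustration, counts like these can be produced with a few lines of Python. The file names follow the table above and the files are assumed to sit in the working directory; note that word counts depend on how tokens are split, so the exact figures may differ slightly.

```python
# Hypothetical setup: files assumed to be in the working directory.
files = ["en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt"]

for name in files:
    lines = words = 0
    with open(name, encoding="utf-8", errors="replace") as f:
        for line in f:
            lines += 1
            words += len(line.split())  # whitespace tokenization
    print(f"{name}: {lines} lines, {words} words")
```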
After loading these files into memory, my first step was to find the most frequent words. For the purpose of this analysis I will only use a subset of the data.
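A minimal sketch of that subsetting and counting step, assuming a simple random sample of lines and whitespace tokenization (the sampling fraction and the choice of file are illustrative):

```python
import random
from collections import Counter

def sample_lines(path, fraction=0.05, seed=42):
    """Keep roughly `fraction` of the file's lines, chosen at random."""
    rng = random.Random(seed)
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line for line in f if rng.random() < fraction]

# Word frequencies over the sample: lowercased, whitespace-split.
counts = Counter()
for line in sample_lines("en_US.twitter.txt"):
    counts.update(line.lower().split())

print(counts.most_common(10))  # ten most frequent words in the sample
```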
Let's consider n-grams: sequences of n consecutive words. For example, the 2-grams of "to be or not" are "to be", "be or", and "or not".
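A small helper that extracts n-grams from a list of tokens might look like this (a sketch, not tied to any particular library):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```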
My current plan for a prediction algorithm is as follows: the input is a sentence and the output is one word, the predicted next word. The algorithm will work on a precomputed model of frequent n-grams (4-, 3-, and 2-grams). It will look for the most frequent n-gram whose leading words match the end of the sentence, backing off from the longer n-grams to the shorter ones. I still have to figure out how to weight the different n-gram orders and how to make the search fast.
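As a rough sketch of this back-off idea (the prefix-table layout, the toy corpus, and the simple "most frequent continuation" rule are my assumptions, not a final design):

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_model(tokens, max_n=4):
    """Map each (n-1)-word prefix to a Counter of observed next words,
    for n = 2, 3, 4."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for gram in ngrams(tokens, n):
            model[gram[:-1]][gram[-1]] += 1
    return model

def predict(model, sentence, max_n=4):
    """Back off from a 3-word prefix down to a 1-word prefix; return the
    most frequent continuation, or None if nothing matches."""
    words = sentence.lower().split()
    for k in range(max_n - 1, 0, -1):
        prefix = tuple(words[-k:])
        if len(prefix) == k and prefix in model:
            return model[prefix].most_common(1)[0][0]
    return None

# Toy usage with a tiny corpus:
corpus = "the cat sat on the mat the cat sat on the floor".split()
model = build_model(corpus)
print(predict(model, "the cat sat on the"))  # 'mat' or 'floor'
```

A weighted combination of the different n-gram orders (rather than strict back-off) is one of the things I still need to experiment with.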