Anna Huynh
10 April 2021
Introduction:
Using data provided by SwiftKey, we built the final dataset from the English corpus by taking a 1% sample from each of the news, blogs, and Twitter files and combining the three samples. Sampling every source at the same rate keeps the representation even and the computation manageable. A binomial distribution is used to decide which lines are kept, so that the selection is random and the sampling process does not introduce bias.
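The sampling step described above can be sketched as follows. This is a minimal illustration rather than the code used for the report: the function name `sample_lines`, the seeds, and the in-memory stand-ins for the three source files are all assumptions. Each line is kept with probability 0.01, which is a binomial draw with a single trial per line.

```python
import random


def sample_lines(lines, rate=0.01, seed=42):
    """Keep each line independently with probability `rate`.

    One Bernoulli draw per line is a binomial draw with a single trial.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]


# Hypothetical stand-ins for the English news, blogs, and Twitter files.
sources = {
    "news": [f"news line {i}" for i in range(10_000)],
    "blogs": [f"blog line {i}" for i in range(10_000)],
    "twitter": [f"tweet {i}" for i in range(10_000)],
}

# Sample every source at the same rate, then combine the samples.
combined = []
for offset, lines in enumerate(sources.values()):
    combined.extend(sample_lines(lines, rate=0.01, seed=42 + offset))

print(len(combined))  # close to 1% of 30,000 lines, varies with the seed
```

Because each line is drawn independently, the sample size is only approximately 1% of the input; a fixed seed makes the draw reproducible.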
The combined dataset was then split into training (80%), validation (10%), and test (10%) sets.
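A split in these proportions might look like the sketch below. The report does not say how lines were assigned to each set, so the shuffle before splitting, the seed, and the name `split_dataset` are assumptions made for illustration.

```python
import random


def split_dataset(lines, train=0.8, valid=0.1, seed=42):
    """Shuffle a copy of `lines` and cut it into train, validation, and test parts.

    The test set receives whatever remains after the first two cuts.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )


# Hypothetical corpus standing in for the combined 1% sample.
corpus = [f"line {i}" for i in range(1000)]
train_set, valid_set, test_set = split_dataset(corpus)
print(len(train_set), len(valid_set), len(test_set))  # 800 100 100
```

Shuffling first prevents the three sets from following the order in which the sources were combined, so each set contains a mix of news, blog, and Twitter lines.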