- Select the desired language and corpus from the dropdown menu on the left.
- Click the button below to proceed.
- Enter a string in the text box at the top middle of the screen.
- Watch the prediction results update in real time.
Each dataset contained 50k text records per locale and corpus. The trained cache is 146 MB, which is not huge.
The model is a derivative of a Markov model based on frequency analysis of 2-grams and 3-grams; candidate completions are then ranked by an ML algorithm built with the "partykit" package. The training data covered four languages and two corpora: news and Twitter.
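As a rough illustration of that pipeline, the sketch below counts 2-/3-gram frequencies and ranks next-word candidates by raw frequency. It assumes a character vector `corpus` of cleaned, lower-cased sentences; the names are illustrative, not the app's actual code, and the partykit re-ranking step is only noted in a comment.

```r
# Sketch of the 2-/3-gram frequency stage (illustrative, not the app's code).
count_ngrams <- function(corpus, n) {
  tokens <- strsplit(corpus, "\\s+")
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)  # frequency table, most common first
}

bigrams  <- count_ngrams(corpus, 2)
trigrams <- count_ngrams(corpus, 3)

# Markov-style lookup: rank continuations of the typed prefix by raw
# frequency. In the app, a partykit-based model then refines the choice;
# that step is omitted here.
predict_next <- function(prefix, grams, k = 3) {
  hits <- grams[startsWith(names(grams), paste0(prefix, " "))]
  head(substring(names(hits), nchar(prefix) + 2), k)
}

predict_next("of the", trigrams)  # top-3 candidates after "of the"
```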
The app relies on pre-trained transition-probability data uploaded alongside it, which eliminates the need for end users to train the system each time. The pre-trained datasets are produced by a separate script executed locally on the corpus.
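A hedged sketch of that offline/online split follows: a local script trains once and serializes the frequency tables, and the deployed app only deserializes them at start-up. The file name is illustrative, and `count_ngrams` is the helper sketched above.

```r
# --- offline script, run locally on the corpus ---
pretrained <- list(bigrams  = count_ngrams(corpus, 2),
                   trigrams = count_ngrams(corpus, 3))
saveRDS(pretrained, "pretrained_en_US_twitter.rds")  # illustrative file name

# --- inside the deployed app: load once, never retrain ---
pretrained <- readRDS("pretrained_en_US_twitter.rds")
```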
The accuracy attained is 10.5%, measured on 500 records held out as a validation set for the en_US locale on Twitter. This is higher than the plain 3-gram and 2-gram models, which achieved 8.6% and 6.2% respectively. Performance was also compared with the text-davinci-003 model via API on 100 records, which had a prediction success rate of 11%. Manually entering data into GPT-4 yielded 14-18% on different subsets. These results demonstrate the effectiveness of the model despite its extreme simplicity and the limits on its training data.
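For concreteness, a minimal version of such a validation loop might look like the sketch below, which is not the project's actual test harness: the last word of each held-out record is the target, the word before it is the prompt, and a hit means the target appears among the top predictions. `validation` is an assumed character vector of held-out records, and `predict_next` comes from the sketch above (bigram back-off only, for brevity, whereas the full model also uses trigrams and the partykit step).

```r
accuracy <- function(records, grams) {
  hits <- vapply(records, function(s) {
    w <- strsplit(tolower(s), "\\s+")[[1]]
    if (length(w) < 2) return(FALSE)   # too short to score
    target <- w[length(w)]             # word to be predicted
    prefix <- w[length(w) - 1]         # preceding word as the prompt
    target %in% predict_next(prefix, grams)
  }, logical(1), USE.NAMES = FALSE)
  mean(hits)                           # fraction of records predicted correctly
}

accuracy(validation, bigrams)  # the full model reported 10.5% on en_US / Twitter
```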