Instructions

  • Select the desired language and corpus from the drop-down menu on the left.
  • Click the button below to proceed.
  • Enter a string in the text box at the top center of the screen.
  • Watch the prediction results update in real time.

Training Data

Each dataset contained 50k text records per locale and corpus. The trained cache is 146 MB, which is not huge.
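
For illustration, assembling such per-corpus samples could look roughly like the sketch below; the file paths and the 50k cap reflect the description above, not the app's actual scripts:

    # Hypothetical sampling step; file paths are assumptions.
    set.seed(42)
    sample_corpus <- function(path, n = 50000) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, min(n, length(lines)))
    }
    news_sample    <- sample_corpus("data/en_US/en_US.news.txt")
    twitter_sample <- sample_corpus("data/en_US/en_US.twitter.txt")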

Model Used for Prediction

The model is a derivative of a Markov model based on frequency analysis of 2-grams and 3-grams. Candidate predictions are then ranked by an ML algorithm built with the “partykit” package. The training data covered four languages and two corpora: news and Twitter.
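
A minimal sketch of the frequency-analysis idea in R is shown below, with deliberately naive tokenization; it illustrates the 3-gram-to-2-gram back-off, not the app's actual code:

    # Build a frequency table of n-grams from raw texts (toy tokenizer).
    build_ngrams <- function(texts, n) {
      tokens <- unlist(strsplit(tolower(texts), "[^a-z']+"))
      tokens <- tokens[tokens != ""]
      if (length(tokens) < n) return(table(character(0)))
      grams <- vapply(seq_len(length(tokens) - n + 1),
                      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                      character(1))
      sort(table(grams), decreasing = TRUE)  # most frequent first
    }

    # Predict the next word: try the 3-gram table, back off to 2-grams.
    predict_next <- function(prefix, bigrams, trigrams) {
      words <- strsplit(tolower(prefix), "[^a-z']+")[[1]]
      words <- words[words != ""]
      for (n in c(3, 2)) {
        tab <- if (n == 3) trigrams else bigrams
        if (length(words) < n - 1) next
        ctx  <- paste(tail(words, n - 1), collapse = " ")
        hits <- tab[startsWith(names(tab), paste0(ctx, " "))]
        if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
      }
      NA_character_
    }

    # Example:
    texts <- c("the quick brown fox", "the quick brown dog")
    bi  <- build_ngrams(texts, 2)
    tri <- build_ngrams(texts, 3)
    predict_next("the quick", bi, tri)  # "brown"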

Functioning of the App

The app relies on pre-trained transition-probability data uploaded with it, so end users do not need to train the system each time. The pre-trained datasets are produced by a separate script that is run locally on the corpus.
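
In a Shiny app this pattern typically amounts to deserializing the tables once at startup, roughly as below; the .rds file name, table layout, and widget IDs are assumptions for illustration:

    # app.R (schematic). Pre-trained tables are loaded once, outside the server.
    library(shiny)

    ngram_tables <- readRDS("pretrained/en_US_twitter_ngrams.rds")

    ui <- fluidPage(
      textInput("user_text", "Enter text:"),
      textOutput("prediction")
    )

    server <- function(input, output, session) {
      output$prediction <- renderText({
        req(input$user_text)
        # predict_next() is the toy lookup sketched in the model section above.
        predict_next(input$user_text, ngram_tables$bigrams, ngram_tables$trigrams)
      })
    }

    shinyApp(ui, server)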

Accuracy

The accuracy attained is 10.5%, based on testing against 500 records reserved as a validation set for the en_US locale on the Twitter corpus. This compares favorably with the plain n-gram models, which achieved 8.6% and 6.2%, respectively. Performance was also compared with the text-davinci-003 model via its API on 100 records, which yielded a prediction success rate of 11%. Manually entering data into GPT-4 yielded 14-18% on different subsets. These results demonstrate the efficiency of the model despite its heavy simplifications and the limited training data.
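
The reported figure corresponds to a simple top-1 check over the held-out records, roughly like the sketch below; the evaluate_top1 helper and the validation vector are hypothetical, and predict_next() is the sketch from the model section:

    # Schematic top-1 accuracy over a held-out validation set.
    # `validation` is assumed to be a character vector of held-out texts.
    evaluate_top1 <- function(validation, bigrams, trigrams) {
      hits <- vapply(validation, function(txt) {
        words <- strsplit(tolower(txt), "[^a-z']+")[[1]]
        words <- words[words != ""]
        if (length(words) < 2) return(FALSE)
        target  <- tail(words, 1)
        context <- paste(head(words, -1), collapse = " ")
        identical(predict_next(context, bigrams, trigrams), target)
      }, logical(1))
      mean(hits)  # fraction of records whose final word was predicted correctly
    }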

Potential Directions for Development

  • Refine the model to improve prediction accuracy.
  • Enrich the training data by incorporating synonym substitutions.
  • Add weights to words based on their commonality to enhance the model’s understanding.
  • Explore other avenues for improvement and optimization.