Instructions

  • Select the desired language and corpus from the drop-down menu on the left.
  • Click the button below to proceed.
  • Enter a string in the text box at the top center of the screen.
  • Watch the prediction results update in real time.

Training Data

Each dataset contained 50k text records per locale and corpus. The trained cache is 146 MB, which is not huge.
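
For illustration, assembling such per-corpus samples could look roughly like the sketch below; the file paths and the 50k cap reflect the description above, not the app's actual scripts:

    # Hypothetical sampling step; file paths are assumptions.
    set.seed(42)
    sample_corpus <- function(path, n = 50000) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, min(n, length(lines)))
    }
    news_sample    <- sample_corpus("data/en_US/en_US.news.txt")
    twitter_sample <- sample_corpus("data/en_US/en_US.twitter.txt")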

Model Used for Prediction

The model is a derivative of a Markov model based on frequency analysis of 2-grams and 3-grams. Candidate predictions are then ranked by an ML algorithm built with the “partykit” package. The training data covered four languages and two corpora: news and Twitter.
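
A minimal sketch of the frequency-analysis idea in R is shown below, with deliberately naive tokenization; it illustrates the 3-gram-to-2-gram back-off, not the app's actual code:

    # Build a frequency table of n-grams from raw texts (toy tokenizer).
    build_ngrams <- function(texts, n) {
      tokens <- unlist(strsplit(tolower(texts), "[^a-z']+"))
      tokens <- tokens[tokens != ""]
      if (length(tokens) < n) return(table(character(0)))
      grams <- vapply(seq_len(length(tokens) - n + 1),
                      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                      character(1))
      sort(table(grams), decreasing = TRUE)  # most frequent first
    }

    # Predict the next word: try the 3-gram table, back off to 2-grams.
    predict_next <- function(prefix, bigrams, trigrams) {
      words <- strsplit(tolower(prefix), "[^a-z']+")[[1]]
      words <- words[words != ""]
      for (n in c(3, 2)) {
        tab <- if (n == 3) trigrams else bigrams
        if (length(words) < n - 1) next
        ctx  <- paste(tail(words, n - 1), collapse = " ")
        hits <- tab[startsWith(names(tab), paste0(ctx, " "))]
        if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
      }
      NA_character_
    }

    # Example:
    texts <- c("the quick brown fox", "the quick brown dog")
    bi  <- build_ngrams(texts, 2)
    tri <- build_ngrams(texts, 3)
    predict_next("the quick", bi, tri)  # "brown"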

Functioning of the App

The app relies on pre-trained transition-probability data uploaded with it, so end users do not need to train the system each time. The pre-trained datasets are produced by a separate script that is run locally on the corpus.
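
In a Shiny app this pattern typically amounts to deserializing the tables once at startup, roughly as below; the .rds file name, table layout, and widget IDs are assumptions for illustration:

    # app.R (schematic). Pre-trained tables are loaded once, outside the server.
    library(shiny)

    ngram_tables <- readRDS("pretrained/en_US_twitter_ngrams.rds")

    ui <- fluidPage(
      textInput("user_text", "Enter text:"),
      textOutput("prediction")
    )

    server <- function(input, output, session) {
      output$prediction <- renderText({
        req(input$user_text)
        # predict_next() is the toy lookup sketched in the model section above.
        predict_next(input$user_text, ngram_tables$bigrams, ngram_tables$trigrams)
      })
    }

    shinyApp(ui, server)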

Accuracy

The accuracy attained is 10.5%, based on testing against 500 records reserved as a validation set for the en_US locale on the Twitter corpus. This compares favorably with the plain n-gram models, which achieved 8.6% and 6.2%, respectively. Performance was also compared with the text-davinci-003 model via its API on 100 records, which yielded a prediction success rate of 11%. Manually entering data into GPT-4 yielded 14-18% on different subsets. These results demonstrate the efficiency of the model despite its heavy simplifications and the limited training data.
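
The reported figure corresponds to a simple top-1 check over the held-out records, roughly like the sketch below; the evaluate_top1 helper and the validation vector are hypothetical, and predict_next() is the sketch from the model section:

    # Schematic top-1 accuracy over a held-out validation set.
    # `validation` is assumed to be a character vector of held-out texts.
    evaluate_top1 <- function(validation, bigrams, trigrams) {
      hits <- vapply(validation, function(txt) {
        words <- strsplit(tolower(txt), "[^a-z']+")[[1]]
        words <- words[words != ""]
        if (length(words) < 2) return(FALSE)
        target  <- tail(words, 1)
        context <- paste(head(words, -1), collapse = " ")
        identical(predict_next(context, bigrams, trigrams), target)
      }, logical(1))
      mean(hits)  # fraction of records whose final word was predicted correctly
    }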

Potential Directions for Development

  • Refine the model to improve prediction accuracy.
  • Enrich the training data by incorporating synonym substitutions.
  • Add weights to words based on their commonality to enhance the model’s understanding.
  • Explore other avenues for improvement and optimization.