TextPredictionProject

24 May 2022

Project goal

This project focuses on building a model for next-word prediction. The basic goal is to help users type faster by offering them candidate words to fill in.

One potential drawback is that the model was trained on a relatively small corpus of Twitter text.

A potential improvement would be to enlarge the training corpus to obtain a better prediction model.

Text corpus used for the prediction model

The data used to train the prediction model was downloaded from <https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip>. This data was provided by SwiftKey.

The specific dataset used for training was the Twitter data. The zip file also contains text files for blogs and news articles, and covers four languages (English, German, Russian, and Finnish).

Package and model training

The sbo package was used to train the prediction model. The vignette for this package can be found at <https://cran.r-project.org/web/packages/sbo/vignettes/sbo.html>.

Models were trained for n-gram orders from 2 to 4. After the prediction accuracy of each was tested, the 4-gram model was chosen for the final prediction model.
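The comparison of n-gram orders can be illustrated with a small, self-contained sketch. It is written in Python with only the standard library and is not the sbo implementation: it is a simplified version of the Stupid Back-Off idea that sbo is named after. The toy sentences, the function names, and the back-off penalty of 0.4 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_ngram_counts(sentences, max_n):
    """Count next words for every context of length 0 .. max_n - 1."""
    counts = defaultdict(Counter)  # context tuple -> Counter of next words
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_n):
                if i - k < 0:
                    break
                counts[tuple(tokens[i - k:i])][word] += 1
    return counts

def predict_next(counts, text, max_n, lam=0.4):
    """Rank candidate next words with a simplified stupid back-off score."""
    tokens = text.lower().split()
    scores = {}
    penalty = 1.0
    for k in range(max_n - 1, -1, -1):  # longest context first
        if k > len(tokens):
            continue  # not enough typed words for this context length
        context = tuple(tokens[len(tokens) - k:])
        followers = counts.get(context)
        if followers:
            total = sum(followers.values())
            for word, c in followers.items():
                # Keep the score from the longest context that saw the word.
                scores.setdefault(word, penalty * c / total)
        penalty *= lam  # each back-off step is penalised by lam (assumed 0.4)
    return sorted(scores, key=lambda w: (-scores[w], w))

def top1_accuracy(counts, sentences, max_n):
    """Share of positions where the top-ranked word equals the actual word."""
    hits = total = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            ranked = predict_next(counts, " ".join(tokens[:i]), max_n)
            if ranked and ranked[0] == tokens[i]:
                hits += 1
            total += 1
    return hits / total if total else 0.0

# Toy data, invented for this sketch (not the Twitter corpus).
train = [
    "i love new york",
    "i love new york",
    "i love new shoes",
    "we love old books",
]
held_out = ["we love new york"]

results = {}
for n in (2, 3, 4):
    model = train_ngram_counts(train, n)
    results[n] = top1_accuracy(model, held_out, n)
    print(f"N = {n}: top-1 accuracy = {results[n]:.2f}")
```

On a corpus this small the ranking of the orders says nothing about real data; the point is only the procedure of training one model per order and scoring each on held-out text.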

The model was trained in a separate R session, and its prediction table was exported as an .rds file to be included in the Shiny app.

Shiny app

The prototype of the text prediction application was built as a Shiny app. The user enters a word or phrase in the text input, and predictions for the next word are generated.

Two predictions are generated from the 4-gram model: the most probable next word and the second-best result, both of which are displayed in the application. If the model predicts the end of a sentence, or if no prediction is possible, the application shows that result instead.
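The display logic described above might look roughly like the following sketch. It is plain Python rather than the app's actual Shiny code, and the prediction table, the `<EOS>` token name, the display labels, and the function names are hypothetical placeholders chosen for illustration.

```python
EOS = "<EOS>"  # assumed name for the end-of-sentence token

# Hypothetical prediction table: context (up to 3 words) -> words, best first.
PRED_TABLE = {
    ("thank", "you", "very"): ["much", "kindly"],
    ("you", "very"): ["much"],
    ("very",): ["much", "good"],
    ("good", "night"): [EOS, "everyone"],
}

def describe(word):
    """Turn a raw prediction into the text shown to the user."""
    return "(end of sentence)" if word == EOS else word

def top_two(table, text, max_context=3):
    """Return two display strings: the best and second-best prediction."""
    tokens = text.lower().split()
    found = []
    # Try the longest available context first, then shorter ones.
    for k in range(min(max_context, len(tokens)), 0, -1):
        for word in table.get(tuple(tokens[-k:]), []):
            if word not in found:
                found.append(word)
        if len(found) >= 2:
            break
    labels = [describe(w) for w in found[:2]]
    while len(labels) < 2:
        labels.append("(no prediction)")  # nothing left to offer
    return labels

print(top_two(PRED_TABLE, "Thank you very"))  # ['much', 'kindly']
print(top_two(PRED_TABLE, "good night"))      # ['(end of sentence)', 'everyone']
print(top_two(PRED_TABLE, "xyzzy"))           # ['(no prediction)', '(no prediction)']
```

Always returning exactly two strings keeps the interface layout fixed regardless of how many candidates the model offers.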

Way forward

To create a more precise prediction model, it would be useful to train on a larger text corpus.
It might also be useful to build similar models with other R packages and compare their accuracy.
A further option is to compare the model's predictions with implementations in other programming languages, such as Python or Java.