IrinaS
2023-03-19
The goal of this task: build a smart application that presents options for what the next word might be
Source data: a large, unstructured corpus of English text in txt format from SwiftKey
Raw data processing: RMarkdown
Source code in Git: Git Repository
Shiny application: ShinyApp
The raw data contains corpora (collected from publicly available sources by a web crawler) in 4 different languages; for this project only the en_US locale files were used.
Because the data is too large for my computer to handle during exploratory data analysis, I randomly sampled 1% of the total data (a vector of size 5.55 Gb). The remaining data is still sufficient for statistical analysis.
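The sampling step can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the author's R code; the source does not say how the 1% was drawn, so keeping each line independently with probability 0.01 is an assumption, and the function name `sample_lines` and the fixed seed are hypothetical.

```python
import random


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`.

    Assumption: per-line Bernoulli sampling; the original sampling
    method is not stated in the source.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


if __name__ == "__main__":
    # Stand-in for the lines of one en_US text file.
    corpus = [f"line {i}" for i in range(100_000)]
    sample = sample_lines(corpus)
    print(len(sample), "of", len(corpus), "lines kept")
```

Because each line is kept or dropped independently, the sample size is only approximately 1% of the input, but the input can be streamed without loading the whole file into memory.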
To perform the data analysis, the data was normalized:
1. If the user enters several words:
a. The app takes the last 2 words and looks for a trigram that starts with these words. The prediction is the last word of the matching trigram. If no such trigram exists, the app takes only the last word;
b. The app looks for a bigram that starts with this word. The prediction is the last word of the matching bigram. If no such bigram exists, the app looks for a trigram that starts with this word; the prediction is the second word of that trigram. If no such trigram exists, the app looks for a trigram that has this word in second position; the prediction is the last word of that trigram. If that also fails, the app returns the most popular words from the unigrams.
2. If the user enters 1 word: see step 1b.
3. If the user enters no words, the app returns the most popular word from the unigrams.
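The fallback order described above can be sketched as follows. This is a Python illustration under stated assumptions, not the app's actual R code: the n-gram tables are assumed to be frequency counts keyed by word tuples, "correspondent" n-gram is assumed to mean the most frequent match, and the names `build_ngrams`, `best_match`, and `predict` are hypothetical.

```python
from collections import Counter


def build_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def best_match(table, predicate):
    """Return the most frequent n-gram that satisfies predicate, or None."""
    candidates = [(count, gram) for gram, count in table.items() if predicate(gram)]
    return max(candidates)[1] if candidates else None


def predict(text, unigrams, bigrams, trigrams):
    """Predict the next word using the fallback order from the list above."""
    words = text.lower().split()
    if len(words) >= 2:
        # Step 1a: trigram that starts with the last two words.
        w1, w2 = words[-2], words[-1]
        gram = best_match(trigrams, lambda g: g[0] == w1 and g[1] == w2)
        if gram is not None:
            return gram[2]
    if words:
        w = words[-1]
        # Step 1b: bigram that starts with the last word.
        gram = best_match(bigrams, lambda g: g[0] == w)
        if gram is not None:
            return gram[1]
        # Trigram that starts with the last word: predict its second word.
        gram = best_match(trigrams, lambda g: g[0] == w)
        if gram is not None:
            return gram[1]
        # Trigram with the last word in second position: predict its last word.
        gram = best_match(trigrams, lambda g: g[1] == w)
        if gram is not None:
            return gram[2]
    # No input or no match: most popular unigram.
    gram = best_match(unigrams, lambda g: True)
    return gram[0] if gram is not None else None


if __name__ == "__main__":
    tokens = "i like green tea and i like green apples and i like green tea".split()
    tables = [build_ngrams(tokens, n) for n in (1, 2, 3)]
    print(predict("i like", *tables))  # prints "green"
```

When all three tables are built from the same text without pruning, any trigram containing the last word in first or second position implies a bigram starting with that word, so the two trigram fallbacks in step 1b only fire if the bigram table has been filtered (for example by a minimum frequency).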
Link to the ShinyApp
The application has the following options:
On the mainPanel, after entering all necessary parameters, you can see:
Thank you for your attention!