Keyboard typing prediction algorithm

Data Science Capstone

Andrey ABRAMOV

27/11/2018

Johns Hopkins University

Coursera Data Science Specialization

Summary overview

The keyboard typing prediction algorithm was developed as part of the Coursera Data Science Capstone project. The goal of the project is a model that predicts the next word a user will type on the keyboard. The algorithm is implemented as a Shiny application using NLP and text mining techniques. The application built on the proposed algorithm demonstrates its full functionality.

The algorithm is based on processed and cleaned text from Twitter, blogs and news. An exploratory analysis was carried out and a dictionary of term frequencies was created. The dataset used for the analysis is available on the course data page.

Model description

Twitter, blogs, and news data were processed with a natural language processing algorithm to build lists of 1-, 2- and 3-word sequences (n-grams) ranked by occurrence rate.
The entries were sequentially numbered, filtered against a list of forbidden words, and stripped of numeric and punctuation characters.
To reduce memory usage and speed up next-word prediction, the words and word combinations in the 2-gram and 3-gram tables were replaced with the row numbers of the corresponding entries in the lower-order tables. This reduced the memory occupied by all tables from 35 MB of text content to 26 MB, i.e. about 25% less.
The algorithm predicts the next word from the last 2 words entered by the user. The search starts with the 3-gram table, then looks up a word in the 2-gram table, then in the 1-gram table. If nothing is found, it falls back to the "default words" that were most commonly used. If words were found at each step, the result is the most common word at the intersection of all the words found or, if there is no intersection, the most common word by occurrence rate among the words found.
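
The table-building steps above can be sketched roughly as follows. This is a minimal Python illustration, not the project's own code: the names `FORBIDDEN`, `tokenize` and `build_tables` are hypothetical, the forbidden-word list is a one-word placeholder, and the exact cleaning rules are assumptions. It shows how 2-grams and 3-grams can be stored as integer references to lower-order entries instead of repeated text.

```python
import re
from collections import Counter

# Hypothetical stand-in for the project's forbidden-word list.
FORBIDDEN = {"darn"}

def tokenize(line):
    """Lowercase, drop digits and punctuation, then remove forbidden words."""
    cleaned = re.sub(r"[^a-z' ]+", " ", line.lower())
    return [w for w in cleaned.split() if w not in FORBIDDEN]

def build_tables(lines):
    """Count 1-, 2- and 3-grams, then store 2- and 3-grams as integer IDs."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for line in lines:
        words = tokenize(line)
        uni.update(words)
        bi.update(zip(words, words[1:]))
        tri.update(zip(words, words[1:], words[2:]))

    # Sequential numbering: each word gets the rank of its 1-gram entry.
    word_id = {w: i for i, (w, _) in enumerate(uni.most_common())}
    # Each 2-gram becomes (first word ID, next word ID) -> count.
    bigrams = {(word_id[a], word_id[b]): n for (a, b), n in bi.items()}
    # Each 2-gram gets its own row number so that a 3-gram can refer to it.
    bigram_id = {key: i for i, key in enumerate(bigrams)}
    # Each 3-gram becomes (ID of its leading 2-gram, next word ID) -> count.
    trigrams = {
        (bigram_id[(word_id[a], word_id[b])], word_id[c]): n
        for (a, b, c), n in tri.items()
    }
    return uni, word_id, bigrams, bigram_id, trigrams
```

With this layout each word is stored as text only once, in the 1-gram table; the higher-order tables hold small integers, which is one plausible way to obtain the kind of memory saving described above.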

Application

To use the application, enter your phrase in the input field and press the 'Predict' button. All the dictionaries are loaded automatically. To the right of the text input field, four prediction fields are shown:

prediction based on the 3-gram vocabulary. This search uses the 3-gram vocabulary and the 2 most recently typed words.
prediction based on the 2-gram vocabulary. This search uses the 2-gram vocabulary and only the most recently typed word.
prediction based on the 1-gram vocabulary. If nothing is found in the previous searches, the 1-gram vocabulary is used.
prediction based on cumulative search. If words were found at each step, the result is the most common word at the intersection of all the words found or, if there is no intersection, the most common word by occurrence rate among the words found.
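
The four fields above can be illustrated with a small Python sketch. It is an interpretation under stated assumptions, not the application's code: the counts in `TRIGRAMS`, `BIGRAMS` and `UNIGRAMS` are made up, `DEFAULT_WORDS`, `top` and `predict` are hypothetical names, and ranking the cumulative result by summed counts is one possible reading of "most common at the intersection".

```python
from collections import Counter

# Toy counts, made up for illustration only; not the app's dictionaries.
TRIGRAMS = {
    ("thank", "you"): Counter({"for": 5, "so": 3}),
    ("how", "are"): Counter({"you": 6}),
}
BIGRAMS = {
    "you": Counter({"are": 9, "so": 4, "for": 1}),
    "are": Counter({"the": 7, "not": 2}),
}
UNIGRAMS = Counter({"the": 50, "so": 20, "for": 12, "are": 10})
DEFAULT_WORDS = ["the", "to", "and"]  # hypothetical fallback list

def top(counter):
    """Most frequent word in a Counter, or None when it is empty."""
    return counter.most_common(1)[0][0] if counter else None

def predict(phrase):
    """Return the four prediction fields for the phrase typed so far."""
    words = phrase.lower().split()
    # 3-gram step: candidates that followed the last two words.
    tri = TRIGRAMS.get(tuple(words[-2:]), Counter()) if len(words) >= 2 else Counter()
    # 2-gram step: candidates that followed the last word.
    bi = BIGRAMS.get(words[-1], Counter()) if words else Counter()
    found = [c for c in (tri, bi) if c]
    # 1-gram step, then default words, used only when nothing else matched.
    fallback = top(UNIGRAMS) or DEFAULT_WORDS[0]
    if found:
        # Prefer words seen at every step; otherwise consider all of them.
        shared = set(found[0]).intersection(*found[1:])
        pool = shared or set().union(*found)
        # Assumption: rank by the summed counts across the steps.
        best = max(sorted(pool), key=lambda w: sum(c[w] for c in found))
    else:
        best = fallback
    return {
        "3-gram": top(tri),
        "2-gram": top(bi),
        "1-gram": None if found else fallback,
        "cumulative": best,
    }
```

With these toy counts, `predict("thank you")` gives "for" in the 3-gram field and "are" in the 2-gram field, while the cumulative field is "so", the word that scores highest among those found at both steps.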

App Details

Average response time is less than 2 seconds
Application memory usage is only 26 MB
Application is running at: https://andre701.shinyapps.io/KeybPrediction
GitHub link for the code files: https://github.com/ANDREY700/Data-Science-Capstone-/upload