ps7391
2019-02-26
This presentation is the final project submission for the Coursera Data Science Capstone course.
The application "Prediction of the next word in a sentence" was developed to demonstrate a natural language processing model in operation. It was created while working on the Data Science Capstone course project.
The baseline data for the project was generously provided by SwiftKey.
Natural language processing model
The next word is predicted from how often phrases of two, three, or four words occur.
The frequencies are calculated from text extracted from blogs, news, and tweets. The raw text for the project was provided by SwiftKey.
If there are no options to predict the next word, the word here is used by default.
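The lookup described above can be sketched roughly as follows. This is an illustrative Python sketch, not the application's actual code: the function name predict_next, the layout of the tables dictionary, and the placeholder default word are all assumptions. It tries the longest context first (three preceding words, i.e. a four-word phrase), falls back to shorter contexts, and returns the default word when nothing matches.

```python
from collections import Counter

def predict_next(text, tables, default_word, top=3):
    """Return up to `top` candidate next words, longest context first.

    `tables` maps a context length (1, 2 or 3 preceding words) to a dict
    of {context tuple: Counter of following words}. This layout is an
    assumption made for the sketch.
    """
    words = text.lower().split()
    # Contexts of 3, 2 and 1 words correspond to four-, three- and
    # two-word phrases in the frequency tables.
    for n in (3, 2, 1):
        if len(words) < n:
            continue
        context = tuple(words[-n:])
        counts = tables.get(n, {}).get(context)
        if counts:
            return [word for word, _ in counts.most_common(top)]
    # No table contains the context: fall back to the default word.
    return [default_word]

# Tiny hand-made tables, for illustration only.
tables = {
    1: {("the",): Counter({"end": 3, "cat": 2})},
    2: {("of", "the"): Counter({"day": 5, "year": 4, "world": 2, "week": 1})},
    3: {},
}

# "DEFAULT" is a placeholder, not the word the application uses.
print(predict_next("at the end of the", tables, "DEFAULT"))
print(predict_next("something unseen", tables, "DEFAULT"))
```

Trying the longest context first prefers the most specific evidence available, and shorter contexts are consulted only when the longer ones have never been seen.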
To prepare the data for the model, material from news, blogs, and tweets was used. The data set has 2.5 million lines and is about 550 megabytes in size. To reduce it to a manageable size, 50 thousand lines were selected at random. Frequency tables of two-, three-, and four-word combinations were built from this sample.
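As a rough illustration of this preparation step, the Python sketch below samples lines at random and counts word combinations. It is a sketch under assumptions, not the project's code: the function names, the simple regular-expression tokenizer, and the three-line toy corpus are hypothetical.

```python
import random
import re
from collections import Counter

def sample_lines(lines, k, seed=0):
    """Pick k lines at random (all of them if fewer than k exist)."""
    if k >= len(lines):
        return list(lines)
    return random.Random(seed).sample(lines, k)

def tokenize(line):
    """Lowercase a line and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())

def ngram_counts(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        # Zipping n shifted copies of the word list yields the n-grams.
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts

# Toy corpus standing in for the blog, news and tweet text.
corpus = [
    "The cat sat on the mat.",
    "The cat ate the fish.",
    "A dog sat on the mat.",
]
sample = sample_lines(corpus, 2)  # the project sampled 50 thousand lines
tables = {n: ngram_counts(corpus, n) for n in (2, 3, 4)}
print(tables[2].most_common(3))
```

Counting within each line separately means no combination spans two unrelated lines of text.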
Data preparation takes a long time.
Preparing and processing the initial data produced a working data set for the application, about 3 megabytes in total.
To reduce both the data preparation time and the application's running time, a parallel package was used. It lets several processes work in parallel, which shortens the time the application needs.
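The general idea of splitting the counting work across workers can be sketched in Python as below. This is only an analogy to the approach described, with hypothetical function names. It uses a thread pool from the standard library to keep the example portable; for CPU-bound counting in Python, a process pool would be the closer match to several processes working in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_bigrams(lines):
    """Count two-word combinations within each line."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def parallel_bigram_counts(lines, workers=4):
    """Split the lines into chunks, count each chunk, merge the results."""
    # Stride slicing gives every worker a roughly equal share of lines.
    chunks = [lines[i::workers] for i in range(workers)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_bigrams, chunks):
            total.update(partial)  # Counter.update adds the counts
    return total

lines = ["the cat sat", "the cat ran", "a dog sat", "the dog ran"] * 5
print(parallel_bigram_counts(lines).most_common(2))
```

Because combinations are counted per line, the chunks can be processed independently and the partial tables simply added together.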
The estimated time to search the word frequency tables and predict the next word is 0.3-0.4 seconds.
Shiny application and instructions to run it
The application Prediction of the next word in a sentence is available at: https://pr7391.shinyapps.io/Prediction_of_the_next_word_in_a_sentence/
Enter a sentence in the input box. The application will suggest three possible next words.
Thanks to all!