Jose Ignacio Gavara
31 January 2021
This presentation is part of the Final Project of the Coursera Data Science Capstone course. The final project consists of two parts:
The purpose of the app is to predict the next word of a text that the user has partially entered. The app works with data from the files contained in Swiftkey.zip, which is larger than 0.5 GB.
Because the files are very large and would slow down computation considerably, a 1% sample of them has been taken and combined into a single file (Corpus). This corpus is then divided into three files, which make up the database from which the app predicts the word that follows the words entered in the input box.
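The sampling step can be sketched as follows. This is a minimal illustration, not the code used in the app: the function name `sample_lines`, the fixed seed, and the line-by-line sampling strategy are all assumptions, and the generated lines stand in for the real contents of the source files.

```python
import random


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`.

    Hypothetical helper: the sampling method is an assumption,
    since the source only states that a 1% sample was taken.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [line for line in lines if rng.random() < fraction]


# Stand-in data; in practice these would be the lines read from the source files.
lines = [f"line {i}" for i in range(10_000)]
corpus = sample_lines(lines)
print(f"kept {len(corpus)} of {len(lines)} lines")
```

Sampling each line independently keeps roughly, not exactly, 1% of the input, which is usually acceptable for a corpus of this size.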
The app uses an algorithm based on n-grams. An n-gram is a sequence of n items, in this case words. To make the prediction more efficient, three files have been generated from the corpus, one for each n-gram size (two, three and four words). Depending on the number of words entered by the user, the algorithm starts searching in the corresponding file and, if it finds no result there, continues through the remaining files in descending order of n-gram size.
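As an illustration of how the three n-gram tables could be derived from a corpus, here is a small sketch. The helper names `tokenize` and `ngram_counts`, the whitespace tokenization, and the two-sentence toy corpus are assumptions made for the example; the app's actual preprocessing is not described here.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def ngram_counts(sentences, n):
    """Count every run of n consecutive words across the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy corpus standing in for the sampled text.
corpus = ["thank you very much", "thank you for the help"]

# One table per n-gram size, mirroring the three files described above.
tables = {n: ngram_counts(corpus, n) for n in (2, 3, 4)}

for n, table in tables.items():
    print(n, table.most_common(2))
```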
The application is available at this link: https://jigavara.shinyapps.io/wordpredictor/
The user enters one or more words.
If the user does not enter any words, the app returns the prediction “NULL”.
If the app does not find any combination of words in the files that matches the input, it returns “the”, as this is the most common word.
In all other cases, the app returns the word that follows the entered combination of words in the corpus.
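The three rules above, together with the descending search through the n-gram files, can be sketched like this. It is a hedged illustration only: `build_tables` and `predict_next` are hypothetical names, choosing the most frequent follower is an assumption (the slides do not say how ties or multiple candidates are resolved), and Python's `None` stands in for the app's “NULL”.

```python
from collections import Counter, defaultdict


def build_tables(sentences, orders=(2, 3, 4)):
    """For each n, map a prefix of n-1 words to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in orders}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in orders:
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                tables[n][prefix][tokens[i + n - 1]] += 1
    return tables


def predict_next(text, tables, default="the"):
    """Return a next-word guess, backing off from larger to smaller n-grams."""
    tokens = text.lower().split()
    if not tokens:
        return None  # empty input: the app displays "NULL"
    # Start with the largest n-gram the input can fill, then back off.
    for n in range(min(len(tokens) + 1, max(tables)), 1, -1):
        prefix = tuple(tokens[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            # Assumption: pick the most frequent follower.
            return followers.most_common(1)[0][0]
    return default  # no match in any table: fall back to "the"


# Toy corpus standing in for the real database.
tables = build_tables([
    "thank you very much",
    "thank you very much indeed",
    "thank you for the help",
])

print(predict_next("thank you", tables))
print(predict_next("I want to thank", tables))
print(predict_next("zebra", tables))
print(predict_next("", tables))
```

In the second call, the four-word and three-word tables contain no match for the end of the input, so the search falls back to the two-word table and matches on “thank” alone.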
The app is a “proof of concept” that can serve as the basis for more sophisticated applications that would use larger databases. Such applications would be very useful for users of Internet search engines on both PCs and smartphones, as well as for users of instant messaging or e-mail applications.