Coursera Data Science Course - Capstone Project
Marco Lunardi
The task assigned to the App is not an easy one:
Given a sequence of words organized into a sentence, the App has to return the word that makes sense as the final word of that sentence.
This is easy for the human brain, trained for years to fill in words in sentences, but it is far harder for an algorithm that has neither the reasoning power of a human brain nor the lifetime of training a human being accumulates.
The starting point was a collection of posts from blogs and Twitter, and sentences from news websites.
It was a large dataset to analyze: more than 550 megabytes and more than 4.2 million lines.
Two main hurdles had to be overcome: making the text readable by a computer, and keeping memory usage low.
So there is a significant trade-off to face: using as much of the available data as possible, while keeping the App light enough to run on any device.
Easier said than done.
These are the steps taken to develop the prediction algorithm for the App (with great patience and a lot of fine-tuning).
Once the text is transformed into a computer-readable format and each word combination has an assigned frequency, the algorithm can be trained and is then able to make its predictions.
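To make the idea concrete, here is a minimal sketch in Python of a frequency-based next-word predictor. It is not the App's actual code: the function names, the limit of three-word combinations, and the rule of falling back to shorter contexts when a longer one was never seen are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def train(lines, max_n=3):
    """Count, for every context of up to max_n - 1 words, which word follows it."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model


def predict(model, sentence, max_n=3):
    """Return the most frequent follower of the longest context seen in training."""
    tokens = tokenize(sentence)
    # Try the longest context first, then back off to shorter ones
    # (assumed strategy; the empty context means "most frequent word overall").
    for k in range(max_n - 1, -1, -1):
        context = tuple(tokens[len(tokens) - k:]) if k else ()
        if len(context) == k and context in model:
            return model[context].most_common(1)[0][0]
    return None


# Toy corpus, invented for this example only.
corpus = [
    "I love data science",
    "I love data analysis",
    "I love data science a lot",
    "you love coffee",
]
model = train(corpus)
print(predict(model, "I love data"))  # science
print(predict(model, "we love"))      # data
```

In the second call the pair "we love" never appears in the toy corpus, so the sketch falls back to the single word "love" and returns its most frequent follower. Storing only counts keeps the model simple, but the table grows quickly with the size of the corpus, which is where the memory trade-off described earlier comes in.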
Just type a sentence into the App (preferably two words or more), and it will return your sentence along with its most “probable” last word.
Please note that the App uses a “reduced”, lower-performing version of my original algorithm, so that it runs well on the Shinyapp website.
You can find my App at the link below: have fun!