Venkat Sri (vesr)
Feb 2018
The objective of the application is to implement a model that suggests the next word or words for a phrase entered by the user. The input consists of three datasets (Twitter, news and blogs) from HC Corpora. The data was cleaned, and a subset was loaded into R data frames as sample data. A back-off algorithm, combined with NLP techniques for building n-grams, drives the prediction. The UI layer was developed with the Shiny package and additional libraries (such as DT, JavaScript and HTML rendering) to enhance the user experience.
Here are the key steps in defining, designing and developing the application, based on the three data sources available through SwiftKey.
Multiple tasks have been performed:
Input: The data came from HC Corpora in three files (Blogs, News and Twitter). A sample was drawn from this large data set. The sample was converted to lower case, and punctuation, links, extra whitespace, numbers and profanity were removed.
Model: The sample text was tokenized into n-grams to construct the predictive models. (Tokenization is the process of breaking a stream of text up into words or phrases. An n-gram is a contiguous sequence of n items from a given sequence of text.) The final data (RDS) was created as described in the Milestone Report linked below.
Output: A Shiny app was created to accept the input text and use the model to predict the next word. The results are displayed in multiple tabs for better organization.
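The Input and Model steps above can be illustrated with a small sketch. It is written in Python purely for illustration; it is not the application's R code, and the cleaning rules are an approximation of the ones described. The function names `clean_text` and `ngrams` and the `PROFANITY` word list are hypothetical.

```python
import re

# Hypothetical stand-in for a profanity list; the list used by the app is not given here.
PROFANITY = {"darn", "heck"}

def clean_text(text):
    """Lower-case the text and strip links, numbers, punctuation, profanity and extra whitespace."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # links (removed before punctuation)
    text = re.sub(r"\d+", " ", text)                   # numbers
    text = re.sub(r"[^a-z'\s]", " ", text)             # anything that is not a-z, apostrophe or whitespace
    words = [w.strip("'") for w in text.split()]       # split() also collapses repeated whitespace
    return " ".join(w for w in words if w and w not in PROFANITY)

def ngrams(text, n):
    """Return every contiguous run of n tokens as a tuple."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

cleaned = clean_text("You MADE my day!!! See http://t.co/abc 123 darn")
bigrams = ngrams("you made my day", 2)
```

Removing links before punctuation matters in this sketch: stripping punctuation first would break a URL into ordinary-looking word fragments that the link pattern could no longer match.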
Sample Data (these phrases are taken from Quiz 2 and Quiz 3)
> Test Data
- You made (**my day**)
- and a case of (**the / beer**)
- make me the (**happiest**)
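To show how a back-off lookup handles phrases like these, here is a minimal sketch in Python (for illustration only, not the application's R code). It assumes a simplified back-off with raw counts and no discounting or weighting, which may differ from the variant the application uses. The toy corpus and the names `build_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=3):
    """Map each context (a tuple of 0 to max_n-1 words) to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model

def predict_next(model, phrase, max_n=3, k=3):
    """Try the longest context first; back off to shorter contexts until one has been seen."""
    tokens = phrase.lower().split()
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])
        candidates = model.get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []

# Hypothetical toy corpus, not taken from the HC Corpora data.
corpus = [
    "you made my day",
    "you made my morning better",
    "you made my day again",
    "a case of beer",
]
model = build_model(corpus)
seen = predict_next(model, "you made")        # trigram context ("you", "made") exists
backed_off = predict_next(model, "she made")  # unseen context, falls back to ("made",)
```

In this toy model both calls suggest "my": the first from the two-word context, the second only after backing off to the one-word context.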
LINKS
- Application: Shiny App ←← Core deliverable of this project
- GitHub repository: code for this application
- Exploratory Analysis: link to the Milestone Report
- Data Store: link to the data used for this project