Coursera Capstone

This presentation is a pitch for the Text Prediction App, a text-suggesting app in the style of SwiftKey, built for the Data Science Capstone.

Author: Natalija K. Agape
Date: 2017-03-12

Introduction

During the Capstone Project we were expected to become familiar with a new subject: text mining, a field of Natural Language Processing (NLP). The goal of the project was to develop a Shiny application that predicts the next word. The concept comes from SwiftKey, a smart prediction technology for easier mobile typing.

Preliminary Steps

The data come from the provided HC Corpora corpus, which contains Twitter, blog, and news documents. I used a subset of the English data. Cleaning consisted of converting the text to lowercase and removing punctuation, profanity, extra whitespace, URLs, times, numbers, and dates. The vocabulary was then reduced by stemming.
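The cleaning steps can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for the app (which was built in R). The function name `clean_line`, the regular expressions, and the tiny `PROFANITY` set are assumptions made for this example. Stemming is left out because it would need an external stemmer.

```python
import re
import string

# Hypothetical placeholder list; the profanity list actually used is not stated.
PROFANITY = {"darn", "heck"}

# Illustrative patterns (assumptions): URLs, clock times, numeric dates, numbers.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\s*(?:am|pm)?\b")
DATE_RE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")
NUMBER_RE = re.compile(r"\d+")

# Translation table that deletes ASCII punctuation only.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_line(text):
    """Lowercase a line and strip URLs, times, dates, numbers,
    punctuation, profanity, and extra whitespace."""
    text = text.lower()
    # Remove URLs, times, and dates before punctuation, since they rely on it.
    text = URL_RE.sub(" ", text)
    text = TIME_RE.sub(" ", text)
    text = DATE_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # split() collapses runs of whitespace; drop profane tokens on the way.
    tokens = [tok for tok in text.split() if tok not in PROFANITY]
    return " ".join(tokens)


print(clean_line("Visit http://example.com at 10:30 on 2017-03-12, darn it!  OK?"))
# prints: visit at on it ok
```

The order matters in this sketch: URLs, times, and dates are matched while their punctuation is still present, and only then is the remaining punctuation stripped.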

Algorithm: N-gram Language Models

N-gram language models are widely used in NLP. After the cleaning process, frequency matrices were built for bigrams, trigrams, and quadgrams. Each matrix was then sorted by frequency and converted into a dictionary that returns suggestions for a given phrase. The model keeps only n-grams with a frequency greater than 1 and considers only the predicted word with the highest score. Predicting the next word amounts to estimating the probability P of a word given the words that precede it. Under the Markov assumption, only the preceding n-1 words are considered when predicting the next word.
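The dictionary idea can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code behind the app (which was built in R). The names `build_ngram_dict` and `predict_next` are hypothetical, raw counts stand in for the score, and falling back from quadgrams to trigrams to bigrams when a context is unknown is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def build_ngram_dict(sentences, n, min_count=2):
    """Map each (n-1)-word context to its next words, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1

    table = defaultdict(list)
    # most_common() yields n-grams in descending order of frequency.
    for ngram, count in counts.most_common():
        if count >= min_count:  # keep only n-grams with frequency > 1
            table[ngram[:-1]].append((ngram[-1], count))
    return dict(table)


def predict_next(phrase, models):
    """Return the highest-count next word, trying the longest context first.

    Falling back to shorter contexts is an assumption of this sketch.
    """
    tokens = phrase.lower().split()
    for n in sorted(models, reverse=True):
        # Markov assumption: only the last n-1 words form the context.
        context = tuple(tokens[-(n - 1):])
        if len(context) == n - 1 and context in models[n]:
            return models[n][context][0][0]
    return None


# Toy corpus, invented for illustration only.
corpus = [
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the support",
]
models = {n: build_ngram_dict(corpus, n) for n in (2, 3, 4)}
print(predict_next("thanks for the", models))  # prints: follow
```

In the toy corpus, "thanks for the support" occurs only once, so it is dropped by the frequency filter and never offered as a suggestion.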

How to use the App - App Illustration and the Link to the Shiny Server

Type a word in the text box. The predicted next word appears in the tab above. Press “Enter” or click the tab to append the predicted word to the text box. A word cloud is also created, showing the suggestions with the highest probabilities. The presentation pitch is accessible from the app: simply click the header tab “Presentation App-Pitch”.

https://agape.shinyapps.io/capstone_app/

Disclaimer: The accuracy of the app suffers from the limited resources available on shinyapps.io. To keep performance acceptable, only the 2 top-ranked predictions were saved during model computation.

Acknowledgment

I would like to thank Coursera and Johns Hopkins University for this amazing data scientist specialisation journey, during which I've learned so much. My thanks also go to the amazing R community, which shares its knowledge through various sources:

https://www.r-bloggers.com/ - http://stackoverflow.com - https://shiny.rstudio.com/

Thanks to my manager A.G for giving me the opportunity to apply for the Coursera Data Scientist Specialization.

Last but not least, thanks to my precious family for the support during this journey.

Final thanks go to my fellow peers for reviewing the application, which was created with RStudio and Shiny.