Author: Natalija K. Agape
Date: 2017-03-12
During the Capstone Project we were expected to become familiar with a new subject: text mining, a field of Natural Language Processing (NLP). The goal of the capstone project was to develop a Shiny application that predicts the next word. The concept comes from SwiftKey, a smart prediction technology for easier mobile typing. Link to SwiftKey.
Cleaning Data: the provided HC Corpora corpus consists of Twitter, blog, and news documents. I used a subset of the English data. The cleaning process consisted of removing punctuation, profanity, and extra whitespace, converting the text to lowercase, and removing URLs, times, numbers, and dates. The vocabulary was then reduced by stemming.
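The cleaning steps listed above can be sketched roughly as follows. This is an illustration in Python, not the code behind the app (which was built with R and Shiny). The regular expressions and the `PROFANITY` set are placeholder assumptions, since the write-up does not say which patterns or word list were used. Stemming is left out because it would need a stemmer library.

```python
import re

# Hypothetical placeholder list; the write-up does not name the profanity list used.
PROFANITY = {"darn", "heck"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
DATE_RE = re.compile(r"\b\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}\b")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\s*(?:am|pm)?\b")
NUMBER_RE = re.compile(r"\d+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # punctuation and any other non-letter


def clean_text(text):
    """Lowercase a line and strip URLs, dates, times, numbers, punctuation,
    profanity, and surplus whitespace (assumed patterns, English text only)."""
    text = text.lower()
    # Order matters: remove URLs, dates, and times before bare numbers.
    for pattern in (URL_RE, DATE_RE, TIME_RE, NUMBER_RE, NON_LETTER_RE):
        text = pattern.sub(" ", text)
    tokens = [t for t in text.split() if t not in PROFANITY]
    return " ".join(tokens)  # split/join also collapses repeated whitespace
```

For example, `clean_text("Visit http://example.com at 10:30 on 2017-03-12, it costs 5 darn dollars!!")` returns `"visit at on it costs dollars"`.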
N-Gram Language Model: n-gram models are widely used in NLP. After the cleaning process, bi-, tri-, and quadgram frequency matrices were built. The matrices are then sorted by frequency. The result is converted into a dictionary (one for each n-gram matrix). The dictionary returns suggestions for the given phrase.
For the model, only n-grams with a frequency greater than 1 and only the predicted word with the highest score are considered.
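A minimal Python sketch of this pipeline is shown below: count n-grams, sort by frequency, drop those seen only once, and keep the top next word per context. It is not the app's R code. The function names are hypothetical, and trying the longest context first in `predict` is an assumption, since the write-up does not state how the bi-, tri-, and quadgram dictionaries are combined.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionary(sentences, n, min_freq=2):
    """Map each (n-1)-word context to its most frequent next word,
    ignoring n-grams seen fewer than min_freq times."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    best = {}
    # most_common() yields n-grams in descending frequency order,
    # so the first word stored for a context is its top-ranked one.
    for gram, freq in counts.most_common():
        if freq < min_freq:
            break
        best.setdefault(gram[:-1], gram[-1])
    return best


def predict(phrase, dictionaries):
    """Look up the phrase ending, longest context first (assumed order)."""
    tokens = phrase.lower().split()
    for n in sorted(dictionaries, reverse=True):
        context = tuple(tokens[-(n - 1):])
        if len(context) == n - 1 and context in dictionaries[n]:
            return dictionaries[n][context]
    return None


# Toy corpus of already cleaned sentences, for illustration only.
corpus = [
    "thank you very much",
    "thank you very much",
    "thank you for coming",
    "see you very soon",
]
dictionaries = {n: build_dictionary(corpus, n) for n in (2, 3, 4)}
print(predict("thank you", dictionaries))
print(predict("see you", dictionaries))
```

With this toy corpus, "thank you" is resolved by the trigram dictionary, while "see you" falls through to the bigram dictionary because no trigram starting with "see you" occurs more than once.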
Predicting the next word amounts to estimating the probability function P. Under the Markov assumption, only the preceding n-1 words are considered relevant for predicting the next word.
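Written out, the Markov assumption replaces the full history with the last n-1 words. The count ratio on the right is the standard maximum-likelihood estimate and is given here as an assumption; the write-up does not specify how the score is computed. Here c(·) denotes how often a word sequence occurs in the cleaned corpus.

```latex
P(w_i \mid w_1, \dots, w_{i-1})
  \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
  \approx \frac{c(w_{i-n+1}, \dots, w_{i-1}, w_i)}{c(w_{i-n+1}, \dots, w_{i-1})}
```

For a trigram model (n = 3), for instance, this reduces to P(w_i | w_{i-2}, w_{i-1}) ≈ c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1}).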
Type a word in the text box. The predicted next word is shown in the tab above.
Once you press "Enter" or click on the tab, the next word is appended to the text in the text box.
A word cloud is also created, showing the suggestions with the highest probabilities.
The presentation pitch is accessible through the app. Simply click on the header tab "Presentation App-Pitch".
https://agape.shinyapps.io/capstone_app/
Disclaimer: The accuracy of the app suffers from the limited resources available on shinyapps.io. To keep the model computation performant, only the 2 best-ranked predictions were saved.
I would like to thank Coursera and Johns Hopkins University for this amazing data science specialisation journey, during which I have learned so much. My thanks also go to the amazing R community, which shares its knowledge through various sources:
Final thanks goes to my fellow peers for reviewing the application created with RStudio and Shiny: