Satindra Kathania
12 August, 2020
Capstone project for Data Science Specialization offered by Johns Hopkins University
The goals of this Capstone Project are:
Develop a text prediction algorithm that takes a string or phrase (multiple words) and predicts the most likely next word as output. Deploy a next-word predictor Shiny app using this algorithm, similar to a SwiftKey-style smart keyboard on mobile devices.
The underlying theory of the predictive model is the n-gram language model.
An n-gram language model assigns a probability to a word sequence as a product of conditional word probabilities, each approximated under the Markov assumption, i.e. only the most recent n-1 tokens are treated as relevant when predicting the next word.
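Written out, assuming the standard formulation given in reference 1 (the span notation $w_i^j$ for tokens $w_i \dots w_j$ and the sequence length $L$ are borrowed from that paper, not from this summary), the probability of a word sequence is:

```latex
P(w_1^L) \;=\; \prod_{i=1}^{L} P\!\left(w_i \mid w_1^{i-1}\right)
        \;\approx\; \prod_{i=1}^{L} P\!\left(w_i \mid w_{i-n+1}^{i-1}\right)
```

The first equality is the chain rule; the approximation on the right is the Markov assumption, which truncates each history to its last n-1 tokens.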
The maximum-likelihood (ML) probability estimates for n-grams are given by their relative frequencies. These estimates suffer from the sparse-data problem (any n-gram not seen in training gets zero probability) and need further modification such as discounting, back-off weights, or smoothing.
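The relative-frequency estimate and the sparse-data problem can be illustrated with a minimal sketch. It is written in Python rather than the R used in the project, the helper names `ngrams` and `mle_probability` are hypothetical, and the toy sentence is made up for illustration (it is not the SwiftKey data).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_probability(word, context, counts_n, counts_context):
    """Relative frequency: count(context + word) / count(context)."""
    context = tuple(context)
    denom = counts_context[context]
    if denom == 0:
        return 0.0  # context never observed, so no estimate is available
    return counts_n[context + (word,)] / denom

# Toy corpus, assumed for illustration only
tokens = "i love data science and i love coffee".split()
unigrams = Counter(ngrams(tokens, 1))
bigrams = Counter(ngrams(tokens, 2))

# "love data" occurs once and "love" occurs twice
p_seen = mle_probability("data", ["love"], bigrams, unigrams)
# "love tea" never occurs, so the ML estimate is exactly zero
p_unseen = mle_probability("tea", ["love"], bigrams, unigrams)
print(p_seen, p_unseen)
```

The zero returned for the unseen bigram is the sparse-data problem in miniature: a perfectly plausible continuation is ruled out only because it was absent from the sample.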
Data Source for this project: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
R packages used: The main R packages used to accomplish this project are 'dplyr', 'tidytext', 'stringr', 'tokenizers', 'tm', 'ggplot2', and 'wordcloud'.
-Creating a large corpus by sampling (10%) and pooling the blogs, news, and Twitter data, with stop words retained.
-Cleaning the corpus and building n-gram frequency tables, up to 5-grams. These frequency tables serve as dictionaries in which the predictive model searches for a match.
-Modeling the probability of a term under the Markov chain assumption that the occurrence of a term depends on the preceding terms.
-Pruning: removing low-frequency n-grams (frequency < 2).
-Using a 'backoff strategy' to predict: if the probability of a penta-gram is very low, the quad-gram is used instead, and so on. This is a very simple and intuitive method that needs no discounting, only a fixed backoff factor (0.4) and relative frequencies to calculate the score.
The recursion ends at unigrams, with N being the size of the training corpus.
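The scoring rule described above corresponds to the "Stupid Backoff" scheme of reference 1. Assuming that is the recursion meant here, with $f(\cdot)$ denoting an n-gram count in the training data and $\alpha = 0.4$ the fixed backoff factor, it can be written as:

```latex
S\!\left(w_i \mid w_{i-k+1}^{i-1}\right) =
\begin{cases}
  \dfrac{f\!\left(w_{i-k+1}^{i}\right)}{f\!\left(w_{i-k+1}^{i-1}\right)}
    & \text{if } f\!\left(w_{i-k+1}^{i}\right) > 0 \\[2ex]
  \alpha \, S\!\left(w_i \mid w_{i-k+2}^{i-1}\right)
    & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

The values $S$ are scores rather than normalized probabilities, which is why no discounting step is required.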
Fig: N-gram linguistic models (J. Dumbali, N. Rao, 2019)
This prediction model uses the 1- to 5-gram tables as dictionaries and ranks the matches by a pre-calculated backoff score, from highest to lowest. The probability of the next word depends on this score, which is the count of the matched n-gram divided by the count of its (n-1)-gram prefix, multiplied by the backoff factor 0.4 for each n-gram order dropped. This is an inexpensive method with comparatively good accuracy.
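The lookup described above can be sketched as follows. This is an illustration in Python, not the project's R code: the function names `build_tables` and `predict`, the toy corpus, and the choice to prune every n-gram order (including unigrams) are assumptions; the backoff factor 0.4, the maximum order of 5, and the pruning threshold come from the text.

```python
from collections import Counter

ALPHA = 0.4     # fixed backoff factor (from the text)
MAX_N = 5       # highest n-gram order (from the text)
MIN_COUNT = 2   # pruning threshold: drop n-grams seen fewer than 2 times

def build_tables(tokens, max_n=MAX_N, min_count=MIN_COUNT):
    """Build pruned frequency tables {n: Counter of n-gram tuples}."""
    tables = {}
    for n in range(1, max_n + 1):
        counts = Counter(tuple(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
        # Assumption: all orders are pruned with the same threshold
        tables[n] = Counter({g: c for g, c in counts.items() if c >= min_count})
    return tables

def predict(context, tables, total_tokens, top_k=3, alpha=ALPHA):
    """Rank candidate next words by backoff score, highest order first."""
    keep = max(tables) - 1
    context = tuple(context)[-keep:] if keep else ()
    scores = {}
    weight = 1.0
    for start in range(len(context) + 1):
        prefix = context[start:]          # drop the leftmost word each round
        n = len(prefix) + 1
        # Denominator: count of the prefix, or corpus size N for unigrams
        denom = tables[n - 1].get(prefix, 0) if prefix else total_tokens
        if denom > 0:
            # Linear scan for clarity; a real table would be indexed by prefix
            for gram, count in tables[n].items():
                word = gram[-1]
                # A word keeps the score from the highest order where it matched
                if gram[:-1] == prefix and word not in scores:
                    scores[word] = weight * count / denom
        weight *= alpha                   # one factor of 0.4 per order dropped
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Toy corpus, assumed for illustration only
tokens = ("the cat sat on the mat the cat sat on the sofa "
          "the cat ran").split()
tables = build_tables(tokens)
predictions = predict("the cat".split(), tables, total_tokens=len(tokens))
print(predictions)
```

In this toy run the trigram "the cat sat" matches directly, so "sat" is scored by its relative frequency with no backoff penalty, while the remaining candidates come from the unigram table and carry a weight of 0.4 squared.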
Additional Information
1. Large Language Models in Machine Translation, https://www.aclweb.org/anthology/D07-1090.pdf
2. Real Time Word Prediction Using N-Grams Model, Jaysidh Dumbali, Nagaraja Rao A., IJITEE, ISSN: 2278-3075, Vol. 8, Issue 5, 2019
3. Tokenizer: Introduction to the tokenizers R Package, https://cran.r-project.org/web/packages/tokenizers/
4. Smoothing and Backoff, http://www.cs.cornell.edu/courses/cs4740/2014sp/lectures/smoothing+backoff.pdf
5. N-gram Language Models, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
6. Speech and Language Processing, https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
7. NLP: Language Models, https://www.csd.uwo.ca/courses/CS4442b/L9-NLP-LangModels.pdf
The web application is hosted at:
https://satindrakathania-2020.shinyapps.io/ShinyCapstone/
Capstone Data Science GitHub Repository link:
https://github.com/SatK-ds2020/DS-capstone-project-2020
To initialize the app, wait 5-10 seconds and follow these instructions:
-There is a text input box where you can enter your text.
-A bar plot of the top 10 predictions and a word cloud of the most probable words are displayed after you click the submit button.
-There is a Summary and References tab for more information about the app.
Conclusions and feedback
- Learned R-based data analytics, natural language processing, and text mining techniques from this DS specialization.
- Explored different tools and packages and their uses to manipulate, clean, analyze, and visualize data.
- Used GitHub to manage the different data science projects for this specialization.
- Learned the whole sequence from data acquisition to publication and presentation, along with the skills to build a web-based application.
- Learned to perform regression analysis and statistical inference using machine learning techniques for big data processing and prediction.
- Looking forward to demonstrating my skills and learning by solving real-world big data problems.