This project is the final part of the 10-course Data Science track by Johns Hopkins University on Coursera. The assignment was to clean and analyze a large corpus of unstructured text, build a word prediction model, and deploy it in a web application.
My Shiny app WordPred is available here.
The code used to prepare the model and create the app is available here.
I hope you enjoy my work.
30,000 lines from three files of natural language data were used.
A corpus, tokens, n-grams, and document-feature matrices were created.
Data tables holding the words of each n-gram together with the n-gram frequencies were created and saved as .rds files, as sketched below.
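A minimal sketch of this preprocessing pipeline, assuming the quanteda and data.table packages; the file names, the per-file sample size, and the context/word column names are illustrative, not taken from the project:

```r
library(quanteda)
library(data.table)

# Read a sample of lines from the raw text files (file names illustrative)
lines <- unlist(lapply(c("blogs.txt", "news.txt", "twitter.txt"),
                       function(f) readLines(f, n = 10000, skipNul = TRUE)))

# Build a corpus and clean, lowercased tokens
corp <- corpus(lines)
toks <- tokens_tolower(tokens(corp, remove_punct = TRUE,
                              remove_numbers = TRUE, remove_symbols = TRUE))

# For each order n: form n-grams, count them via a document-feature matrix,
# split each n-gram into context and final word, save the table as .rds
for (n in 2:4) {
  ngrams <- tokens_ngrams(toks, n = n, concatenator = " ")
  freqs  <- colSums(dfm(ngrams))
  dt <- data.table(ngram = names(freqs), count = as.integer(freqs))
  dt[, context := sub(" \\S+$", "", ngram)]  # all words except the last
  dt[, word    := sub("^.* ", "", ngram)]    # the last word only
  setorder(dt, -count)
  saveRDS(dt, sprintf("ngram_%d.rds", n))
}
```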
N-Gram Model: The model utilizes n-grams, capturing contextual information by considering sequences of consecutive words.
Back-Off Strategy: Implements a back-off algorithm to handle unseen n-grams, gracefully falling back to lower-order n-grams in case of sparse data or missing probabilities (see the sketch after this list).
Probabilistic Prediction: Predicts the probability of the next word in a sequence based on the context of the preceding words, enabling language prediction in natural language processing tasks.
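A minimal sketch of such a back-off lookup, assuming the ngram_*.rds frequency tables from the preprocessing sketch above (the function name and the default fallback word are my own):

```r
library(data.table)

# Load the frequency tables, highest order first, keyed by context
tables <- lapply(4:2, function(n) {
  dt <- readRDS(sprintf("ngram_%d.rds", n))
  setkey(dt, context)
  dt
})

# Try the highest-order table whose context matches the input; if the
# context is unseen, back off to the next lower order, then to a default
predict_word <- function(input, default = "the") {
  words <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  for (i in seq_along(tables)) {
    n <- 5 - i                 # tables[[i]] holds n-grams of order n
    ctx_len <- n - 1           # words of context this order needs
    if (length(words) < ctx_len) next
    ctx  <- paste(tail(words, ctx_len), collapse = " ")
    hits <- tables[[i]][.(ctx)]
    if (!is.na(hits$word[1])) {
      return(hits[which.max(count), word])  # most frequent continuation
    }
  }
  default
}

predict_word("thanks for the")  # returns the most frequent continuation seen
```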
Each box is the average result of 10 × 1,000 predictions, using one, two, or three words as input (a sketch of the benchmark follows).
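A rough sketch of that benchmark, assuming a held-out n-gram table with context/word columns (like those above) serves as the test set; reusing the training tables, as done here for brevity, would overstate accuracy:

```r
library(data.table)

# Sample `size` test n-grams, predict the final word from its context,
# and average the accuracy over `reps` repetitions
evaluate <- function(test_dt, reps = 10, size = 1000) {
  accs <- replicate(reps, {
    s <- test_dt[sample(.N, size)]
    mean(mapply(function(ctx, truth) identical(predict_word(ctx), truth),
                s$context, s$word))
  })
  mean(accs)
}

# e.g. accuracy with two words of input (a trigram test set)
evaluate(readRDS("ngram_3.rds"))
```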
I created a working word-prediction model using n-grams and a back-off algorithm.
The accuracy of this model is not great, since it is built on a very limited corpus and on simple n-gram prediction.
There are certainly other approaches that would make it more accurate, but that was not the goal of this project.
Thanks!
P