Data Science Capstone Project
Raya Matoorian
[January 2016]
This capstone aims to create a predictive text model and build an application that serves as its user interface. The picture shows the pipeline of steps implemented to achieve this goal.
The picture below shows a wireframe of the application. The user enters text in the textbox (1), the most probable next word is displayed (2), and other candidate words appear in the bar chart (3).
1- After loading and cleaning the text data from HC Corpora, a thorough exploratory analysis was performed to understand the distribution of words and the relationships between words in the corpora, and to prepare the data for building a linguistic model.
2- Using the n-grams (1-, 2-, 3- and 4-grams) extracted in the previous step, a basic n-gram model was built to predict the next word from the previous (n-1) words (a Markov chain assumption).
3- To estimate the probability of n-grams that were not observed in the training data, the Katz back-off algorithm was implemented. The model uses the training corpus to build multiple internal models with different values of n and falls back to a lower-order model when a higher-order n-gram is unseen.
4- To enrich the data and improve prediction accuracy, a collection of common everyday English sentences was gathered and added to the predictive model.
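As a rough illustration of step 2, the sketch below counts n-grams in a toy corpus and turns them into maximum-likelihood next-word probabilities. It is a minimal base R example, not the capstone's actual code: the three-sentence corpus and the helper names `tokenize`, `ngram_counts` and `next_word_mle` are assumptions made for illustration, and the denominator counts only histories that are followed by a word.

```r
# Toy corpus (assumption); the project itself used the HC Corpora texts.
corpus <- c("i love you", "i love the sea", "i love you too")

# Split each sentence into lowercase word tokens.
tokenize <- function(x) strsplit(tolower(x), "[^a-z']+")

# Count all n-grams of order n and return them as a named table.
ngram_counts <- function(sentences, n) {
  grams <- unlist(lapply(tokenize(sentences), function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Maximum-likelihood estimate of P(word | history):
# count(history + word) divided by the total count of n-grams starting with history.
next_word_mle <- function(sentences, history) {
  h <- tokenize(history)[[1]]
  h <- h[nzchar(h)]
  counts <- ngram_counts(sentences, length(h) + 1)
  prefix <- paste0(paste(h, collapse = " "), " ")
  hits <- counts[startsWith(names(counts), prefix)]
  probs <- sort(hits / sum(hits), decreasing = TRUE)
  data.frame(word = sub(prefix, "", names(probs), fixed = TRUE),
             prob = as.numeric(probs),
             stringsAsFactors = FALSE)
}

res <- next_word_mle(corpus, "I love")
print(res)
```

A model like this assigns zero probability to any word that never followed the given history in the training data, which is the gap that step 3 addresses.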
test <- "I love"
predict(model, test)
  word  prob ngram
1  you 18.37     4
2  the 10.54     4
3 that  5.78     4
4   my  5.10     4
5   to  4.76     4
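To show how a back-off predictor can produce a table with `word`, `prob` and `ngram` columns like the one above, here is a simplified sketch in base R. It keeps the Katz structure (discount the seen n-grams, hand the freed probability mass to the lower-order model for unseen words) but swaps Good-Turing discounting for a fixed absolute discount `d = 0.5`, stops at trigrams, and reports probabilities as fractions. The toy corpus and the names `count_ngrams`, `backoff` and `predict_next` are hypothetical and are not the `model` or `predict` used in the project.

```r
# Toy corpus and fixed discount (both assumptions for illustration).
corpus <- c("i love you", "i love the sea", "you love the sea", "i see you")
d <- 0.5

tokens <- lapply(strsplit(tolower(corpus), " +"), function(w) w[nzchar(w)])

# Named count vector of all n-grams of order n.
count_ngrams <- function(n) {
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.numeric(tab), names(tab))
}
counts <- lapply(1:3, count_ngrams)  # unigrams, bigrams, trigrams

# Distribution over next words given history h (a character vector).
# The ngram column records the order at which each word was observed.
backoff <- function(h) {
  if (length(h) == 0) {
    uni <- counts[[1]]
    return(data.frame(word = names(uni), prob = as.numeric(uni / sum(uni)),
                      ngram = 1, stringsAsFactors = FALSE))
  }
  lower <- backoff(h[-1])                     # model with a shorter history
  prefix <- paste0(paste(h, collapse = " "), " ")
  tab <- counts[[length(h) + 1]]
  seen <- tab[startsWith(names(tab), prefix)]
  if (length(seen) == 0) return(lower)        # history unseen: back off fully
  words <- substring(names(seen), nchar(prefix) + 1)
  p_seen <- (seen - d) / sum(seen)            # discounted estimates
  left <- 1 - sum(p_seen)                     # mass freed by discounting
  rest <- lower[!(lower$word %in% words), ]
  if (nrow(rest) == 0) {
    p_seen <- p_seen / sum(p_seen)            # nothing unseen: renormalise
  } else {
    rest$prob <- left * rest$prob / sum(rest$prob)
  }
  rbind(data.frame(word = words, prob = as.numeric(p_seen),
                   ngram = length(h) + 1, stringsAsFactors = FALSE),
        rest)
}

# Rank candidate next words for a piece of input text.
predict_next <- function(text, top = 5) {
  h <- strsplit(tolower(text), " +")[[1]]
  h <- tail(h[nzchar(h)], length(counts) - 1)  # keep at most n-1 words
  out <- backoff(h)
  out <- out[order(-out$prob), ]
  rownames(out) <- NULL
  head(out, top)
}

res <- predict_next("I love")
print(res)
```

Words that followed the full history keep the highest `ngram` value, while words reached only through back-off carry a lower order and share the leftover probability mass, so the probabilities over the whole vocabulary still sum to one.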