GB
6/5/2016
Natural Language Processing is a new topic for me, and the following material was used to develop the model:
“Text Mining Infrastructure in R” by Ingo Feinerer, Kurt Hornik and David Meyer, published in the Journal of Statistical Software, March 2008, Volume 25, Issue 5.
Data processing for this app is described here: https://rpubs.com/GintasBu/ExplorText.
The available text was used to build an n-gram model. The final model uses bi-grams and tri-grams. 4-grams were also constructed, but they were left out because of the app's response time and the limited computational resources available to run it.
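As an illustration of the idea, here is a minimal base R sketch of counting bi-grams and tri-grams and predicting the next word from them. It is not the app's code: the toy corpus and the helpers make_ngrams, next_word and predict_next are hypothetical, and falling back from tri-grams to bi-grams is an assumption about how the two tables might be combined.

```r
# Hypothetical helper: all n-grams of one tokenized sentence.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(s) paste(tokens[s:(s + n - 1)], collapse = " "),
         character(1))
}

# Toy corpus standing in for the pre-processed text (not the real data).
corpus <- c("i love new york", "i love new shoes", "we love new york")
tokens <- strsplit(corpus, " ", fixed = TRUE)

# Frequency tables, sorted so the most frequent n-gram comes first.
bigrams  <- sort(table(unlist(lapply(tokens, make_ngrams, n = 2))),
                 decreasing = TRUE)
trigrams <- sort(table(unlist(lapply(tokens, make_ngrams, n = 3))),
                 decreasing = TRUE)

# Hypothetical lookup: last word of the most frequent n-gram
# that starts with `prefix`, or NA if there is none.
next_word <- function(prefix, counts) {
  hits <- names(counts)[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  words <- strsplit(hits[1], " ", fixed = TRUE)[[1]]
  words[length(words)]
}

# Assumed strategy: try the tri-gram table first, then fall back to bi-grams.
predict_next <- function(words) {
  k <- length(words)
  guess <- NA_character_
  if (k >= 2) {
    guess <- next_word(paste(words[(k - 1):k], collapse = " "), trigrams)
  }
  if (is.na(guess) && k >= 1) guess <- next_word(words[k], bigrams)
  guess
}

predict_next(c("i", "love", "new"))   # "york" (tri-gram "love new york")
predict_next(c("unseen", "new"))      # "york" (falls back to bi-gram "new york")
```

Adding a 4-gram table would mean one more lookup per prediction and a much larger table to keep in memory, which is consistent with the response time and resource concerns mentioned above.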
To test the model's performance, 1000 sentences were chosen from the provided blogs text file: the first 1000 sentences that were not used in constructing the n-grams. This was ensured by calling set.seed with the same seed as when building the n-grams and then inverting the selection, so that only sentences left out of the model were kept. The selection for the test:
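set.seed(123456)                  # same seed as when building the n-grams
i<-rbinom(length(text), 1, 0.1)   # 0/1 indicator for each element of text
text3<-text[which(i==0)]          # keep the elements where the indicator is 0
text4<-text3[1:1000]              # the first 1000 of those form the test set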
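The reason this works can be checked on a toy example. The sketch below assumes that the n-grams were built from the elements where the indicator equals 1; that detail is not shown in this report, and the toy `text` vector is a stand-in for the blogs file.

```r
text <- paste("line", 1:50)   # toy stand-in for the blogs file

# Assumed build step: keep the elements where the indicator is 1.
set.seed(123456)
i_build <- rbinom(length(text), 1, 0.1)
train <- text[i_build == 1]

# Test step: same seed, opposite condition.
set.seed(123456)
i_test <- rbinom(length(text), 1, 0.1)
held_out <- text[i_test == 0]

identical(i_build, i_test)                         # TRUE: same seed, same draws
length(intersect(train, held_out))                 # 0: no overlap
length(train) + length(held_out) == length(text)   # TRUE: a full partition
```

Resetting the seed before each rbinom call is what makes the two indicator vectors identical, so the two subsets cannot share an element.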
The selected sentences were pre-processed the same way as for the model, including removal of bad words and stop words, and stemming. Sentences with fewer than 3 words left after pre-processing were removed, which left 928 sentences to test. In each test sentence the last word was removed, and the remaining part of the sentence was used to predict it. The predicted word was then compared to the removed word. The prediction was correct in 113 out of 928 sentences, which in percent is:
[1] 12.2
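The test procedure can be sketched in a few lines of base R. This is an illustration, not the code that produced the result above: predict_last is a hypothetical stand-in for the app's predictor, and the test sentences are made up. The last line reproduces the reported percentage from the two counts given in the text.

```r
# Hypothetical stand-in for the app's predictor: always guesses "york".
predict_last <- function(words) "york"

# Made-up, already pre-processed test sentences.
test_sentences <- c("i love new york", "we love new shoes",
                    "too short", "they moved to york")
tokens <- strsplit(test_sentences, " ", fixed = TRUE)
tokens <- tokens[lengths(tokens) >= 3]   # drop sentences with fewer than 3 words

# Remove the last word, predict it from the rest, and compare.
actual  <- vapply(tokens, function(w) w[length(w)], character(1))
guesses <- vapply(tokens, function(w) predict_last(w[-length(w)]), character(1))

correct <- sum(guesses == actual)
round(100 * correct / length(tokens), 1)   # accuracy on the toy data, in percent

# The figure reported above, from the counts in the text:
round(100 * 113 / 928, 1)                  # 12.2
```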
The model can be improved in the following steps: