Capstone Report

23/3/2021

Context

The Coursera “Data Science Specialization” provides a wide range of skills to those who take it.

The last part of the specialization gives the student the opportunity to explore a field not covered earlier in the course: data mining.

Using the tm package as a basis, we are encouraged to design a system that predicts the next word in a text.

In this presentation I describe the steps and the result of this work.

Analysis and design

I explored the following algorithms and techniques to do the work:

GloVe algorithm
Word distance
Markov Networks

After some study I concluded that two Markov networks would be useful, one for the 2-grams and another for the 3-grams. These networks help to predict a word from the one or two words that precede it.
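The idea of two n-gram networks can be sketched in base R as follows. This is only an illustration on a toy corpus, not the code behind the app: the names `ngram_counts` and `predict_next` are hypothetical, and the rule of trying the 3-gram table first and falling back to the 2-gram table is an assumption about how the two networks are combined.

```r
# Toy corpus (assumption); the real model was built from a much larger sample.
corpus <- c("how do you think", "how do you have time",
            "how do you think so", "do you know")
tokens <- strsplit(tolower(corpus), "\\s+")

# Count every n-gram and split it into its context (first n-1 words)
# and the word that follows that context.
ngram_counts <- function(tokens, n) {
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  parts <- strsplit(names(counts), " ")
  data.frame(
    context = vapply(parts, function(g) paste(g[-n], collapse = " "), character(1)),
    word    = vapply(parts, function(g) g[n], character(1)),
    count   = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

bigrams  <- ngram_counts(tokens, 2)
trigrams <- ngram_counts(tokens, 3)

# Predict the k most likely next words: look up the last two words in the
# 3-gram table, and fall back to the last word in the 2-gram table.
predict_next <- function(text, bigrams, trigrams, k = 1) {
  w <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  hits <- trigrams[trigrams$context == paste(utils::tail(w, 2), collapse = " "), ]
  if (nrow(hits) == 0) hits <- bigrams[bigrams$context == utils::tail(w, 1), ]
  if (nrow(hits) == 0) return(data.frame(word = character(0), prob = numeric(0)))
  hits$prob <- hits$count / sum(hits$count)
  hits <- hits[order(-hits$prob), c("word", "prob")]
  utils::head(hits, k)
}

print(predict_next("how do you ", bigrams, trigrams, k = 2))
```

The probability of each candidate is simply its count divided by the total count for the same context, which is the transition probability of the corresponding Markov chain.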

In addition, word distance helps to get a solid prediction when the input contains misspellings. For this purpose I also included a vocabulary of all the words found in 200,000 blog, news and Twitter texts.
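A minimal sketch of the misspelling step, assuming that "word distance" means an edit distance between the typed word and each vocabulary entry. It uses the generalized Levenshtein distance from `utils::adist`; the function name `correct_word`, the tiny vocabulary and the `max_dist` threshold are hypothetical and chosen only for illustration.

```r
# Tiny stand-in vocabulary (assumption); the real one comes from the corpus.
vocabulary <- c("think", "have", "time", "know", "you")

# Replace a word by its closest vocabulary entry when it is not in the
# vocabulary and the closest entry is at most max_dist edits away.
correct_word <- function(word, vocabulary, max_dist = 2) {
  if (word %in% vocabulary) return(word)
  d <- utils::adist(word, vocabulary)[1, ]  # edit distance to every entry
  best <- which.min(d)
  if (d[best] <= max_dist) vocabulary[best] else word
}

print(correct_word("thimk", vocabulary))
print(correct_word("have", vocabulary))
```

A word corrected this way can then be used as context for the n-gram lookup, so a typo in the input does not leave the prediction without a match.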

Development

To improve performance I also saved the networks and the vocabulary as RDS files, which I load in the app. Loading from .RDS files keeps the response time low, as this example shows:

ptm <- proc.time()
getPredictedWords("how do you ",2)

##     word      prob
## 1: think 0.1957082
## 2:  have 0.1630901

proc.time() - ptm

##    user  system elapsed 
##    0.09    0.00    0.11
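Saving and reloading the model objects can be sketched with `saveRDS` and `readRDS`. The object layout and the file name below are assumptions made for illustration; only the use of RDS files comes from the work described here.

```r
# Hypothetical model object: one n-gram table plus the vocabulary.
model <- list(
  bigrams = data.frame(context = "you", word = "think", count = 2L,
                       stringsAsFactors = FALSE),
  vocabulary = c("you", "think")
)

# Write the object once, after the (slow) training step.
path <- file.path(tempdir(), "model.rds")
saveRDS(model, path)

# In the app, read it back instead of rebuilding the tables.
restored <- readRDS(path)
print(all.equal(model, restored))
```

Because the expensive counting is done once and only the finished tables are read back, the time per prediction is dominated by the table lookup.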

How to use it

To use the app, write some words in the text box and then press the “Send text” button. You can use the slider to get more than one suggested word.

Conclusions

The algorithm favors speed at a small cost in precision, because relying on only two Markov networks can be seen as limiting. Even so, the accuracy is good and it covers a wide range of variants. To improve its usability, it would be a good idea to build a microservice around this algorithm, perhaps using rApache or OpenCPU. In any case, I think this algorithm is well suited for use on cell phones, provided there is a way to run R on the phone; given how closely R and Linux have developed, chances are that there is one.