The goal of this project was to build a predictive text model: given some input words, it should predict the next word the user will type. Key points:
- which data? –> English text data sets from three different media sources (blogs, news and tweets); some files have more than 800,000 lines (approx. 200 MB per file)
- which tools? –> the "tm" package turned out to be inefficient with data of this size, so we used the "quanteda" package instead
- which models? –> Model 1: 50,000-line sample; Model 2: 100,000-line sample
- evaluate models –> cross-entropy, benchmarks
- build an app –> Shiny
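
To make the prediction and evaluation steps concrete, here is a minimal sketch of next-word prediction and cross-entropy scoring. It is an illustration only, not the project's implementation: it is written in Python rather than with quanteda, and the bigram order, the add-one smoothing, the toy corpus and all function names (`train_bigrams`, `predict_next`, `cross_entropy`) are assumptions that the summary above does not specify.

```python
import math
from collections import Counter, defaultdict


def tokenize(line):
    # Lowercase and split on whitespace; a real pipeline would also clean punctuation.
    return line.lower().split()


def train_bigrams(lines):
    """Count single words and word-to-next-word continuations."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        unigrams.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[prev][nxt] += 1
    return unigrams, bigrams


def predict_next(unigrams, bigrams, text):
    """Most frequent continuation of the last input word.

    Falls back to the most frequent word overall when the last word was
    never seen with a continuation.
    """
    tokens = tokenize(text)
    last = tokens[-1] if tokens else None
    if last in bigrams:
        return bigrams[last].most_common(1)[0][0]
    return unigrams.most_common(1)[0][0]


def cross_entropy(unigrams, bigrams, lines):
    """Average negative log2 probability per predicted word (add-one smoothing)."""
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    total_bits, n = 0.0, 0
    for line in lines:
        tokens = tokenize(line)
        for prev, nxt in zip(tokens, tokens[1:]):
            seen = prev in bigrams
            count = bigrams[prev][nxt] if seen else 0
            context = sum(bigrams[prev].values()) if seen else 0
            total_bits -= math.log2((count + 1) / (context + vocab_size))
            n += 1
    return total_bits / n if n else float("nan")


# Toy corpus (hypothetical); the real models were built from 50,000- and 100,000-line samples.
train = ["i love new york", "i love new books", "new york is big"]
unigrams, bigrams = train_bigrams(train)

print(predict_next(unigrams, bigrams, "i love new"))  # -> york
print(cross_entropy(unigrams, bigrams, ["i love new york"]))
```

Lower cross-entropy on held-out text means the model assigns higher probability to the words that actually follow, which is one way two models built from different sample sizes could be compared.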