N-Gram Model

Sahil Sharma

Introduction

This presentation was created as part of the Data Science Specialisation capstone project. It briefly describes how the current n-gram-based next-word prediction model was developed.

The next slides will briefly describe how the model was trained.

The code used for the following steps has been shared on GitHub:

Data Split: The data was split into training, test and validation sets.
N-Gram Modelling:
- Due to limited hardware resources, the model was trained only on the first 200,000 chunks of text in the training dataset.
- The raw chunks were split into sentences, and the sentences were split into words (tokens), which were then cleaned.
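
The splitting and tokenising steps above can be sketched as follows. This is a minimal Python illustration, not the project's R code: the function names, the 80/10/10 split proportions, the random seed and the cleaning rule (lowercase, alphabetic tokens only) are all assumptions, since the slides do not state them.

```python
import random
import re

def split_data(chunks, train=0.8, test=0.1, seed=42):
    """Shuffle chunks and split them into training, test and validation lists.

    The proportions and seed are illustrative assumptions.
    """
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

def to_sentences(chunk):
    """Split a raw chunk into sentences at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", chunk.strip()) if s]

def to_tokens(sentence):
    """Lowercase a sentence and keep alphabetic tokens (inner apostrophes allowed)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", sentence.lower())
```

For example, `to_tokens("Don't stop, it's 5pm!")` keeps the contractions intact and drops the digit and punctuation.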

N-Gram Modelling
- The Remove_profanity function was applied to filter out profanity.
- N-grams were computed up to n = 5, and infrequent n-grams were removed.
- The Mutate probability function was applied to attach probability scores to the n-grams. The probabilities were saved as a new list file.
- The saved probabilities are used by the next-word prediction function and the Shiny app to predict the next word.
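
One way to picture the steps above is the following Python sketch. It is an illustration under stated assumptions, not the project's R implementation: the function names, the `banned` word set, the `min_count` pruning threshold, the relative-frequency scoring and the back-off lookup (longest context first) are all hypothetical choices, because the slides do not specify them.

```python
from collections import Counter, defaultdict

def remove_profanity(tokens, banned):
    """Drop tokens found in a (hypothetical) set of banned words."""
    return [t for t in tokens if t not in banned]

def count_ngrams(sentences, max_n=5):
    """Count every n-gram for n = 1..max_n over tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def to_probabilities(counts, min_count=2):
    """Prune rare n-grams, then score each next word by its relative
    frequency among the kept n-grams that share the same context."""
    kept = {g: c for g, c in counts.items() if c >= min_count}
    context_totals = Counter()
    for gram, c in kept.items():
        context_totals[gram[:-1]] += c
    table = defaultdict(dict)
    for gram, c in kept.items():
        table[gram[:-1]][gram[-1]] = c / context_totals[gram[:-1]]
    return dict(table)

def predict_next(table, tokens, max_n=5):
    """Try the longest available context first, then back off to shorter ones.

    The empty context () holds the unigram scores and acts as the last resort.
    """
    for k in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])
        if context in table:
            candidates = table[context]
            return max(candidates, key=candidates.get)
    return None
```

With a table built from a few tokenised sentences, `predict_next(table, ["i", "love"])` returns the most probable word seen after that context, falling back to shorter contexts when the full one was never observed or was pruned.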

Accuracy: The model's current accuracy is 0.14, which is low.
- Training is currently constrained by limited hardware resources.
- In future, state-of-the-art deep learning methods such as Transformers could be used.
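
The slides do not say how the accuracy figure was measured. One common approach, shown here as a hedged Python sketch, is top-1 next-word accuracy: for every position in held-out sentences, predict the next word from the preceding tokens and count exact matches. The function name `next_word_accuracy` and the `predict` callable are hypothetical.

```python
def next_word_accuracy(predict, sentences):
    """Fraction of positions where predict(preceding tokens) equals the actual next word.

    `predict` is any callable that takes a list of tokens and returns one word.
    """
    hits = 0
    total = 0
    for tokens in sentences:
        for i in range(1, len(tokens)):
            total += 1
            if predict(tokens[:i]) == tokens[i]:
                hits += 1
    return hits / total if total else 0.0
```

Any predictor can be plugged in, for instance a trivial baseline such as `lambda context: "the"`, which gives a useful lower bound to compare a trained model against.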
Shiny App: The Shiny app was created using the Shiny library in R. The app is hosted on shinyapps.io, and the associated code is shared through the GitHub repository.

Sahil Sharma
PhD Student

Data Science for Tourism Research
Email: sahilsharmahimalaya@gmail.com
LinkedIn
Twitter