2022-09-30

Overview

  • The main objective of this project is to build a model that predicts the next word conditioned on an input sentence. The final data product is served as an interactive Shiny application on the web.

  • Project Steps:

    • Loading and Cleaning Data
    • Exploratory Data Analysis
    • Data Modeling
    • Model Evaluation
    • Interactive application

Load Data

A random sample of 600,000 lines was loaded from the HC corpus: 200,000 from blogs, 200,000 from Twitter and 200,000 from news. This is only about 15% of the total data available for the project; the sample size was limited by computational cost constraints.
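The sampling step can be illustrated with a minimal Python sketch (the project itself was built in R, so this is not the author's code). It assumes reservoir sampling, which draws a fixed number of lines uniformly at random without holding a whole file in memory; the function name `sample_lines`, the seed and the toy sources are hypothetical.

```python
import random

def sample_lines(lines, k, seed=42):
    """Return up to k lines chosen uniformly at random (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines
            reservoir.append(line)
        else:
            # Replace an existing entry with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir

# Toy stand-ins for the three sources (hypothetical data, not the HC corpus)
sources = {
    "blogs": [f"blog line {i}" for i in range(1000)],
    "twitter": [f"tweet {i}" for i in range(1000)],
    "news": [f"news line {i}" for i in range(1000)],
}

sample = []
for name, lines in sources.items():
    sample.extend(sample_lines(lines, k=20))
```

With real data, `lines` could be an open file handle, so each source is read once and only the sampled lines are kept.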

Profane words were removed from the corpus, as well as numbers, punctuation, extra white space and special characters.
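A cleaning pass of this kind might look like the following Python sketch. It is an illustration only: the `PROFANITY` set is a hypothetical placeholder, and the exact rules used in the project (for example how apostrophes or non-ASCII letters were handled) are not stated, so the choices below are assumptions.

```python
import re

# Hypothetical placeholder list; a real run would load a full profanity list
PROFANITY = {"darn", "heck"}

def clean_line(text, banned=PROFANITY):
    """Lowercase a line and strip numbers, punctuation, special characters,
    extra white space and banned words."""
    text = text.lower()
    # Keep only ASCII letters, white space and apostrophes (assumption)
    text = re.sub(r"[^a-z\s']", " ", text)
    # split() with no arguments also collapses repeated white space
    tokens = [t.strip("'") for t in text.split()]
    tokens = [t for t in tokens if t and t not in banned]
    return " ".join(tokens)

cleaned = clean_line("Well, DARN it!!  I paid $20 for 3 tickets...")
```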

Transformations such as a document-term matrix were applied to generate token features.

Exploratory Data Analysis

  • The Zipf-like distribution of the data suggests an N-gram model that predicts the next word from its conditional probability given the preceding words

Data Modeling

  • The final model was developed with the sbo package with the following parameters:
    • Train/test split of 80%/20%
    • Five-gram model
    • Dictionary target of 0.75 (fraction of the training corpus covered by the dictionary words)
    • Default Preprocess Function
    • End-of-sentence characters “.?!:;”
    • Back-off penalization lambda of 0.4
    • Output of 3 predicted words
    • Remove unknown tokens
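The sbo package implements Stupid Back-Off N-gram models. The scoring idea behind the parameters above can be sketched in Python as follows. This is a simplified illustration, not the package's internals: the helper names `count_ngrams`, `sbo_score` and `predict` are hypothetical, and end-of-sentence and unknown-token handling are left out.

```python
from collections import Counter

def count_ngrams(tokens, max_n):
    """Count all n-grams of order 1..max_n, keyed by tuples of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def sbo_score(word, context, counts, total, lam=0.4):
    """Stupid back-off score of `word` given the preceding words."""
    penalty = 1.0
    context = tuple(context)
    while context:
        seen = counts[context + (word,)]
        if seen > 0:
            # Relative frequency at the highest order that was observed
            return penalty * seen / counts[context]
        # Drop the oldest context word and apply the lambda penalty
        context = context[1:]
        penalty *= lam
    # Fall back to the unigram relative frequency
    return penalty * counts[(word,)] / total

def predict(context, counts, vocab, total, max_n=5, top=3):
    """Return the `top` highest-scoring words; ties are broken alphabetically."""
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    ranked = sorted(vocab, key=lambda w: (-sbo_score(w, context, counts, total), w))
    return ranked[:top]

# Toy corpus (hypothetical data)
tokens = "the cat sat on the mat the cat ate the fish".split()
counts = count_ngrams(tokens, max_n=5)
vocab = sorted(set(tokens))
top3 = predict(["the"], counts, vocab, total=len(tokens))
```

Because each back-off step multiplies the score by lambda, a word seen after the full context always outranks a word that is equally frequent only after a shorter context.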

Model Evaluation

  • Using just 15% of the source data, the model's performance is good considering that a random guess has a probability of 1 in 1684 words. With an accuracy of 52% and a coverage of 77%, this is the best model the available computational resources could provide, given that each model took about an hour to train.
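How the accuracy figure was computed is not detailed here. One common way to score a three-word predictor is top-3 accuracy on held-out (context, next word) pairs, sketched below in Python under that assumption. The predictor `toy_predict` and the held-out pairs are hypothetical stand-ins, not project data.

```python
def top_k_accuracy(examples, predict_fn, k=3):
    """Fraction of (context, next_word) pairs whose true next word
    appears among the first k predictions."""
    if not examples:
        return 0.0
    hits = sum(1 for context, truth in examples
               if truth in predict_fn(context)[:k])
    return hits / len(examples)

def toy_predict(context):
    """Hypothetical stand-in predictor: always proposes the same three words."""
    return ["the", "to", "and"]

# Hypothetical held-out pairs
held_out = [
    (("going",), "to"),
    (("in",), "the"),
    (("a", "big"), "house"),
    (("salt",), "and"),
]

accuracy = top_k_accuracy(held_out, toy_predict)

# Chance level for a single random guess, using the 1684 figure quoted above
chance_single_guess = 1 / 1684
```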

Interactive Shiny application