18/3/2022

Introduction

EDA Text by Corpus

The first analysis compared the corpora (datasets) by source. This showed me how written expression changes with the platform each text was written for.

News text differed the most. It was written mostly in the third person and used more complex words (measured by word length), even though it did not have the longest paragraphs.
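The two measures mentioned above (word length as a proxy for complexity, and paragraph length) can be sketched as follows. This is an illustrative Python sketch, not the code used for the analysis: the function name `corpus_stats`, the regex tokenizer, and the two sample paragraphs (including the "blogs" label) are assumptions made up for the example.

```python
import re
from statistics import mean

def corpus_stats(paragraphs):
    """Return mean word length and mean words per paragraph for one source."""
    # Hypothetical tokenizer: runs of letters and apostrophes count as words.
    tokenized = [re.findall(r"[A-Za-z']+", p) for p in paragraphs]
    tokenized = [tokens for tokens in tokenized if tokens]  # drop empty paragraphs
    words = [w for tokens in tokenized for w in tokens]
    return {
        "mean_word_length": mean(len(w) for w in words),
        "mean_words_per_paragraph": mean(len(tokens) for tokens in tokenized),
    }

# Made-up sample text, only to show the shape of the comparison.
samples = {
    "news": ["The committee announced additional infrastructure spending."],
    "blogs": ["So I went out to the lake with my dog and we had a really good time there."],
}
stats = {source: corpus_stats(paras) for source, paras in samples.items()}
```

In this toy sample the "news" paragraph has longer words but fewer words per paragraph, which is the same pattern described above.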

EDA N-Grams

Most common bi-grams by data source. The bi-grams make it more obvious that each medium uses different types of expressions.

Methodology

Given the hardware constraint of 1 GB of RAM per app on the Shiny server, I had to optimize the size of the app. I decided to use an n-gram model covering 2-grams to 5-grams, built with the quanteda package, which was significantly faster than the tidytext and tm packages. A comparison is available here.
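To make the idea of a 2-gram to 5-gram model concrete, here is a minimal counting sketch. It is written in plain Python for illustration and is not the quanteda code used in the app; the names `tokenize` and `ngram_counts` and the two sample sentences are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(texts, n_min=2, n_max=5):
    """Count every n-gram for n in [n_min, n_max], one Counter per n."""
    counts = {n: Counter() for n in range(n_min, n_max + 1)}
    for text in texts:
        tokens = tokenize(text)
        for n in counts:
            # Slide a window of n tokens over the text.
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["I would like to go home", "I would like to stay"])
```

Here `counts[2][("i", "would")]` is 2, because the bi-gram appears in both sample sentences. Each extra value of n adds another table, which is why the range matters under a memory limit.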

A simple back-off model is then created using data.table to optimize the read and indexing speed.

We take the longest possible gram from the end of the input text (up to 4 words), then look up those words in the model to find the most likely next word, based on its frequency of appearance in the corpus. If the frequency is negligible, we back off to the next shorter gram, and we repeat this until a prediction is returned.
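The back-off steps described above can be sketched roughly as follows. This is a simplified Python illustration, not the data.table implementation used in the app. The names `build_model` and `predict_next`, the `min_count` parameter (standing in for the "negligible frequency" cut-off, whose actual value is not stated here), and the `None` fallback are all assumptions.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=5):
    """Map each prefix (1 to max_n - 1 words) to a Counter of next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                prefix = tuple(tokens[i:i + n - 1])
                model[prefix][tokens[i + n - 1]] += 1
    return model

def predict_next(model, text, max_prefix=4, min_count=1):
    """Try the longest prefix first, then back off to shorter ones."""
    tokens = text.lower().split()
    for k in range(min(max_prefix, len(tokens)), 0, -1):
        candidates = model.get(tuple(tokens[-k:]))
        if candidates:
            word, count = candidates.most_common(1)[0]
            # Assumed cut-off: counts below min_count are treated as negligible.
            if count >= min_count:
                return word
    return None  # assumed fallback when no prefix matches

model = build_model([
    "i would like to go home",
    "i would like to stay here",
    "we would like to go home",
])
```

With this toy model, `predict_next(model, "they would like to")` finds no match for the 4-word prefix, backs off to "would like to", and returns "go". Raising `min_count` forces the same back-off even when the longer prefix exists but is rare.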

Useful Links:

App-Usage

  • Go to this link
  • Input your initial text
  • Make sure the last word of your input is complete, since the app predicts the word that follows it. For example: “I woul” is not a valid input. Input should be “I would”.
  • Hit the Predict button
  • The first prediction might take some time to load, since the app could be sleeping.
  • The prediction should appear as in the image below