Final Presentation

- Fernandes Valdrich

Introduction

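| Dataset | Sentences (millions) | Words (millions) | Unique words (thousands) |
|---------|----------------------|------------------|--------------------------|
| Tweets  | 3.16                 | 29.61            | 442                      |
| Blogs   | 2.17                 | 36.89            | 403                      |
| News    | 1.78                 | 33.59            | 330                      |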

Method

The data was preprocessed to remove numbers and punctuation (preserving contractions). Links and email addresses were also removed.
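The cleaning steps above can be sketched as follows. This is a minimal illustration in Python, not the R code used for the project. The regular expressions, the lowercasing step, and the function name `clean_text` are all assumptions, since the source does not show the exact rules.

```python
import re

# All patterns below are assumed; the original cleaning rules are not shown.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NUMBER_RE = re.compile(r"\d+")
# Anything that is not a lowercase letter, whitespace, or an apostrophe.
PUNCT_RE = re.compile(r"[^a-z\s']")
# Apostrophes that are not flanked by letters on both sides (i.e. quotes).
STRAY_APOS_RE = re.compile(r"(?<![a-z])'|'(?![a-z])")


def clean_text(text):
    """Remove links, email addresses, numbers and punctuation, keeping contractions."""
    text = text.lower()                  # assumption: case is folded
    text = text.replace("\u2019", "'")   # normalize typographic apostrophes
    text = URL_RE.sub(" ", text)         # links first, before punctuation is stripped
    text = EMAIL_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    text = STRAY_APOS_RE.sub(" ", text)  # keeps "don't", drops quote marks
    return " ".join(text.split())        # collapse repeated whitespace
```

Links and email addresses are removed before punctuation so that their dots and slashes are still present when the patterns run.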

A “Stupid Backoff” model was trained on the data using n-grams containing 2 to 5 words.
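For readers unfamiliar with Stupid Backoff, the sketch below shows the usual scoring rule: use the relative frequency of the longest matching n-gram, and multiply by a fixed factor each time the context is shortened. It is written in Python for illustration and is not the project's R implementation. The backoff factor of 0.4, the unigram fallback, and the names `count_ngrams` and `stupid_backoff_score` are assumptions.

```python
from collections import Counter

ALPHA = 0.4  # commonly used backoff factor; assumed, not stated in the source


def count_ngrams(sentences, max_n=5):
    """Count every n-gram of length 1..max_n in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff_score(word, context, counts, total_tokens, alpha=ALPHA):
    """Score `word` after `context`, backing off to shorter contexts as needed."""
    context = tuple(context)
    penalty = 1.0
    while context:
        full = counts.get(context + (word,), 0)
        if full > 0:
            # Relative frequency of the n-gram given its context.
            return penalty * full / counts[context]
        context = context[1:]  # drop the oldest word
        penalty *= alpha       # pay the backoff penalty once per step
    # Assumed final fallback: unigram relative frequency.
    return penalty * counts.get((word,), 0) / total_tokens
```

The result is a score rather than a normalized probability, which is what makes the scheme cheap to compute on large n-gram tables.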

Only n-grams that occur more than once are considered when calculating the probability. However, only n-grams that appear more than 3 times are suggested.
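The two frequency cutoffs can be illustrated with the short Python sketch below. It is not the project's R code: the function names are hypothetical, "more than once" is read as a count of at least 2, and candidates are ranked by raw count purely to keep the example small.

```python
from collections import Counter

MIN_COUNT_FOR_SCORING = 2     # "more than once" (assumed to mean count >= 2)
MIN_COUNT_FOR_SUGGESTION = 4  # "more than 3 times"


def prune_for_scoring(counts, min_count=MIN_COUNT_FOR_SCORING):
    """Drop n-grams too rare to be used in the probability calculation."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})


def suggest(context, counts, min_count=MIN_COUNT_FOR_SUGGESTION, top_k=3):
    """Return up to top_k next words whose n-gram count meets the suggestion cutoff."""
    context = tuple(context)
    n = len(context)
    candidates = [
        (c, gram[-1])
        for gram, c in counts.items()
        if len(gram) == n + 1 and gram[:n] == context and c >= min_count
    ]
    # Highest count first; ties broken alphabetically for a stable order.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in candidates[:top_k]]
```

Using a stricter cutoff for suggestions than for scoring keeps rare n-grams in the statistics while preventing them from being shown to the user.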

R packages such as quanteda, data.table and stringr were used to preprocess the data and perform the required calculations.

Special features

Application

The application is hosted on shiny.io.

After entering your phrase in the provided box, hit the Enter key to generate the prediction off of