19/5/2020

Concept for the application

  • The application was created as the capstone project for Coursera’s JHU Data Science Specialization.

  • The idea was to take a data set provided by JHU, containing millions of lines of text from news sites, Twitter and blogs, and use it to build a text prediction application based on Natural Language Processing.

  • That is, since certain word combinations (ngrams) are used more often than others, we can make an educated guess about which word(s) will follow.

Data Analysis

  1. The first step consisted of sampling the data set: we kept only 10% of the observations.

  2. Then, using mainly the stringr and quanteda packages, as well as some tidyverse and tidytext, we separated each line into unigrams (individual words), bigrams (pairs of words that follow each other), trigrams and quadgrams.
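Steps 1 and 2 can be sketched as follows. This is a minimal Python illustration, not the R code used in the project (which relied on stringr, quanteda, tidyverse and tidytext); the function names `sample_lines`, `tokenize` and `ngrams`, the fixed seed and the tokenization rule are all assumptions made for the example.

```python
import random
import re

def sample_lines(lines, fraction=0.10, seed=42):
    """Keep roughly `fraction` of the lines, chosen at random."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

def tokenize(line):
    """Lower-case a line and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Thanks for the memories, everyone!")
unigrams = ngrams(tokens, 1)   # individual words
bigrams = ngrams(tokens, 2)    # pairs of consecutive words
trigrams = ngrams(tokens, 3)
quadgrams = ngrams(tokens, 4)
```

A line shorter than n words simply yields no ngrams of that order, so no special case is needed for short tweets.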

As expected, some ngrams are more common than others. For example, the quadgram “thanks for the memories” is far more common on Twitter than, say, “thanks for the ostrich”.

On the next slide, just as an example, we show the most common quadgrams in news sites. For the complete exploratory data analysis, you can click here
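To make the frequency idea concrete, here is a small Python sketch that counts quadgrams. The three input lines are made up for illustration and are not taken from the JHU data; the counting approach is an assumption, not the project's actual code.

```python
from collections import Counter

# Made-up lines standing in for the sampled text.
lines = [
    "thanks for the memories",
    "thanks for the memories everyone",
    "thanks for the ostrich",
]

quadgram_counts = Counter()
for line in lines:
    words = line.split()
    # Zipping four staggered copies of the word list yields every quadgram.
    quadgram_counts.update(zip(words, words[1:], words[2:], words[3:]))

# The single most frequent quadgram together with its count.
top = quadgram_counts.most_common(1)
```

In this toy input, “thanks for the memories” is counted twice and “thanks for the ostrich” once, which is the kind of gap the prediction step relies on.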

Most frequent quadgrams in news sites

Making the app

The app works in a very simple way (its user interface is shown on the following slide):

  1. First, it takes the phrase and cleans it of symbols, upper-case letters, etc.

  2. Then, depending on the length of the phrase, it takes up to the last 3 words as input, tries to match them against the first three words of a quadgram, and outputs the fourth word as a suggestion.

  3. If no match is found, it drops one word from the input and tries again with a lower-order ngram.

It is really simple and, when in use, takes less than 200 MB of memory, which could be cut to less than half with a smaller sample of the data. For more information about the app's development, look here
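The three steps described above (clean, match, back off) can be sketched in Python as follows. This is a hedged illustration only: the names `clean`, `build_table` and `predict`, the single lookup table and the toy training lines are assumptions, and the app itself may store and search its ngrams differently.

```python
import re
from collections import Counter, defaultdict

def clean(phrase):
    """Lower-case the phrase and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", phrase.lower())

def build_table(lines, max_n=4):
    """Map each context (the first n-1 words of an ngram) to counts of the next word."""
    table = defaultdict(Counter)
    for line in lines:
        words = clean(line)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                table[context][words[i + n - 1]] += 1
    return table

def predict(phrase, table, max_context=3):
    """Suggest the next word, backing off to shorter contexts when nothing matches."""
    words = clean(phrase)
    # Start from the last three words, then drop the oldest word on each miss.
    for size in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-size:])
        if context in table:
            return table[context].most_common(1)[0][0]
    return None  # no ngram matched at any level

# Toy training lines, invented for the example.
table = build_table([
    "thanks for the memories",
    "thanks for the memories",
    "thanks for the ostrich",
    "the memories remain",
])
suggestion = predict("Thanks for the", table)      # matched as a quadgram
fallback = predict("for the memories", table)      # falls back to a trigram
```

With the toy lines, "Thanks for the" matches a quadgram directly, while "for the memories" has no quadgram match and is resolved one level down, using only its last two words.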

App interface

The app can be accessed from this link