Capstone project

Csaba Farago
2021-05-24

This is the final project of the Data Science Specialization course series on Coursera.

Overview

The task was to predict the next word of a partially written sentence. More specifically:

  • Input: a text corpus of blog entries, news and tweets in 4 languages (English, German, Finnish, Russian).
  • Size (English): ~1-2 million lines and ~30-40 million words per source.
  • Key idea: the n-gram model, here built on word triplets.
  • Result: an online application which suggests the next word.

Issues

Main idea: split the corpus into word triplets, count the occurrences of each triplet in a hash, and use the counts to guess the next word.
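The triplet counting can be illustrated with a minimal sketch in base R. The helper name `make_triplets`, the tokenization rule and the two sample lines are assumptions made for illustration only; the project itself used the hash package and a far larger input.

```r
# Hypothetical helper: split one line of text into lower-case word triplets.
make_triplets <- function(line) {
  # Treat every run of characters other than letters and apostrophes as a separator.
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  if (n < 3) return(character(0))
  # Pair each word with its two successors.
  paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n])
}

# Toy input standing in for the real corpus.
lines <- c("I want to go home", "I want to sleep")
triplets <- unlist(lapply(lines, make_triplets))

# Count how often each triplet occurs, most frequent first.
counts <- sort(table(triplets), decreasing = TRUE)
print(counts)
```

On the real corpus a single in-memory table like this is likely to be too large, which is what motivates the issues listed next.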

There were several issues with this approach:

  • The corpus is huge, which caused memory and CPU problems.
  • R lists are not optimized for use as a hash, so the external R package hash was used instead.
  • The input was not clean. The news file contained an end-of-file character (the SUB character), at which the file read terminated. I removed it manually.
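The SUB character was removed by hand in this project. As a possible alternative, the sketch below reads a file through a connection opened in binary mode and strips the character programmatically. This rests on the assumption that the early termination comes from text-mode reading treating SUB (0x1A) as end of file, as can happen on Windows. The helper name `read_clean_lines` and the temporary file are hypothetical.

```r
# Hypothetical helper: read a text file in binary mode and drop SUB (0x1A) characters.
read_clean_lines <- function(path) {
  con <- file(path, open = "rb")   # binary mode: no special end-of-file handling
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE, skipNul = TRUE)
  gsub("\x1a", "", lines, fixed = TRUE)
}

# Small demonstration on a temporary file that contains a SUB character.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nbroken \x1a line\nlast line\n"), tmp)
cleaned <- read_clean_lines(tmp)
print(cleaned)
```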

Algorithm

The algorithm is split into 2 major parts.

I. Create the cache. This is divided into 4 major steps:

  • Create word triplets.
  • Merge the triplets and split them into files by starting letter.
  • Calculate the number of occurrences of each triplet.
  • Build the input for guessing the next word: the first 2 words are the key, the third word is the value.
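The last step above can be sketched as follows. This is a minimal illustration, not the project's code: it uses a base R environment as a stand-in for the hash package, keeps only the most frequent third word per key, and the name `build_lookup` and the sample counts are assumptions.

```r
# Hypothetical input: triplet counts as a named integer vector.
counts <- c("i want to" = 5L, "i want a" = 2L,
            "want to go" = 4L, "want to be" = 1L)

# Hypothetical helper: map each two-word key to its most frequent third word.
build_lookup <- function(counts) {
  lookup <- new.env(hash = TRUE)   # key -> suggested third word
  best   <- new.env(hash = TRUE)   # key -> highest count seen so far
  for (triplet in names(counts)) {
    words <- strsplit(triplet, " ", fixed = TRUE)[[1]]
    key <- paste(words[1], words[2])
    n <- counts[[triplet]]
    is_new <- !exists(key, envir = best, inherits = FALSE)
    if (is_new || n > get(key, envir = best, inherits = FALSE)) {
      assign(key, n, envir = best)
      assign(key, words[3], envir = lookup)
    }
  }
  lookup
}

lookup <- build_lookup(counts)
print(get("i want", envir = lookup))
```

Keeping a single value per key keeps the cache small. Storing the top few candidates per key would be an equally plausible design if the app should offer several suggestions.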

II. Use the cache in the app. The Shiny app reads the cached input at startup, which takes only a few seconds.
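A minimal sketch of this load-once pattern, assuming the cache is stored as an RDS file holding a named character vector. The file, the cache contents and the helper `suggest_next` are hypothetical. In a Shiny app, the `readRDS` call would typically sit outside the server function so that it runs only once at startup.

```r
# Hypothetical cache: maps "word1 word2" to a suggested third word.
cache <- c("i want" = "to", "want to" = "go")
cache_file <- tempfile(fileext = ".rds")
saveRDS(cache, cache_file)

# At app startup: load the cache once.
loaded <- readRDS(cache_file)

# Hypothetical helper: suggest the next word from the last two words typed.
suggest_next <- function(text, cache) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  if (n < 2) return(NA_character_)
  key <- paste(words[n - 1], words[n])
  if (key %in% names(cache)) cache[[key]] else NA_character_
}

print(suggest_next("Tonight I want", loaded))
```

Returning `NA` when the key is unknown leaves it to the user interface to decide what to display in that case.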

Result

The user types some text into the input field on the left-hand side, and the app suggests the next word.

Guess the next word

Ideas for further development: consider more words, support other languages, improve the GUI.