Capstone project

Csaba Farago
2021-05-24

This is the final project of the Data Science Specialization course series on Coursera.

Overview

The task was to predict the next word of a partially written sentence. More specifically:

Input: a text corpus containing blog entries, news and tweets in 4 languages (English, German, Finnish, Russian).
Size (English): roughly 1-2 million rows and 30-40 million words in each of the three sources.
Key idea: the n-gram model, here with n = 3 (word triplets).
Result: an online application which suggests the next word.

Issues

Main idea: divide the corpus into word triplets, count the occurrences of each triplet in a hash table, and use the counts to guess the next word.
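
The triplet counting can be sketched as follows. The project itself was written in R; this is a minimal illustration in Python, and the names tokenize and count_triplets, as well as the simple tokenization rule, are assumptions and not taken from the original code.

```python
import re
from collections import Counter

def tokenize(line):
    # Assumed simplification: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", line.lower())

def count_triplets(lines):
    # Count every run of three consecutive words within each line.
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        # Zipping three shifted views of the list yields consecutive triplets.
        for triplet in zip(words, words[1:], words[2:]):
            counts[triplet] += 1
    return counts

sample = ["I want to go home", "I want to sleep"]
counts = count_triplets(sample)
```

Here the triplet ("i", "want", "to") is counted twice, once per sample line, while every other triplet is counted once.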

There were several issues with this approach:

The corpus is huge, which caused memory and CPU problems.
The R list is not optimized for use as a hash table, so the external R package hash was used instead.
The input was not clean. The news file contained an end-of-file character (the SUB character, 0x1A), at which the file read terminated. I removed it manually.
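
The SUB character was removed by hand in this project. As a hedged alternative, the same cleanup could be done programmatically by working on the raw bytes, so that no text-mode reader gets a chance to stop at the character. The sketch below is in Python, and the helper name strip_sub and the sample bytes are hypothetical.

```python
SUB = b"\x1a"  # ASCII 26 (substitute); some text-mode readers treat it as end of file

def strip_sub(raw: bytes) -> bytes:
    # Drop every SUB byte and leave all other content untouched.
    return raw.replace(SUB, b"")

# Hypothetical sample standing in for the raw bytes of the news file.
raw = b"first line\nsecond \x1aline\n"
clean = strip_sub(raw)
text = clean.decode("utf-8")
```

For a real file, the bytes would come from opening it in binary mode, for example open(path, "rb").read(), before decoding.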

Algorithm

The algorithm is split into 2 major parts.

I. Create the cache. This is divided into 4 major steps:

Create word triplets.
Merge the triplets and split them into files by starting letter.
Calculate the number of occurrences.
Build the input for guessing the next word: the first two words are the key, the third word is the value.
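
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the original R code: triplets are grouped in memory rather than written to files, the starting letter is taken from the first word of the triplet, and only the single most frequent third word is kept per two-word key. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def bucket_by_first_letter(triplets):
    # Group triplets by the first letter of their first word
    # (assumed stand-in for the per-letter files; words must be non-empty).
    buckets = defaultdict(list)
    for t in triplets:
        buckets[t[0][0]].append(t)
    return buckets

def build_lookup(triplets):
    # Count occurrences, then keep the most frequent third word per two-word key.
    counts = Counter(triplets)
    best = {}
    for (w1, w2, w3), n in counts.items():
        key = (w1, w2)
        if key not in best or n > best[key][1]:
            best[key] = (w3, n)
    return {key: w3 for key, (w3, _) in best.items()}

triplets = [
    ("i", "want", "to"), ("i", "want", "to"),
    ("i", "want", "a"), ("we", "want", "to"),
]
buckets = bucket_by_first_letter(triplets)
lookup = {}
for letter, group in buckets.items():
    # Each bucket is small enough to process on its own.
    lookup.update(build_lookup(group))
```

Splitting by starting letter keeps each chunk small, which is one way to work around memory limits: a two-word key always lands in exactly one bucket, so the per-bucket results can be merged without conflicts.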

II. Use the cache in the app. The Shiny app reads the cached input at startup, which takes only a few seconds.
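
The lookup performed once the cache is loaded might look like the sketch below. The app itself is a Shiny (R) application; this Python snippet only illustrates the lookup logic, and the function name suggest_next_word, the whitespace tokenization and the sample cache contents are assumptions.

```python
def suggest_next_word(text, lookup, default=None):
    # Use the last two typed words as the key into the cached table.
    words = text.lower().split()
    if len(words) < 2:
        return default  # not enough context for a triplet model
    return lookup.get((words[-2], words[-1]), default)

# Hypothetical cache contents: two-word key -> suggested third word.
lookup = {("i", "want"): "to", ("want", "to"): "go"}

first = suggest_next_word("I want", lookup)
second = suggest_next_word("Maybe I want to", lookup)
third = suggest_next_word("Hello", lookup)
```

Because the expensive counting is done ahead of time, each suggestion at runtime is a single dictionary lookup.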

Result

The user types text into the input field on the left-hand side.

Guess the next word

Ideas for further development: consider more words, support other languages, build a better GUI.