Capstone Project

A.B.

2025-02-27

1. Project overview

This project covers the basics of analyzing a large corpus of text documents to discover structure in the data and how words are put together. It covers cleaning and analyzing text data, then building and sampling from a predictive text model, based on the following techniques and concepts: N-gram model, back-off, Markov chain.

Finally, a predictive text product is built.

2. The Data

The data is provided in three *.txt files containing texts in English (tweets, news, and blogs).

Content archived from heliohost.org on September 30, 2016, and retrieved via the Wayback Machine on April 24, 2017. https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html

The data is loaded into RStudio for exploratory analysis and further processing.
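As a rough sketch, the loading step could look as follows; the file paths below follow the standard HC Corpora en_US layout and are an assumption about the local setup:

    # Read the three source files; skipNul avoids embedded-NUL warnings.
    blogs  <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
    news   <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
    tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

    corpus <- c(blogs, news, tweets)
    length(corpus)   # overall count of samples (texts or phrases)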

3. Exploratory data analysis, processing, and train-test split

Skipping the technical details, the following conclusions can be drawn. The overall count of samples (texts or phrases) in the data is 3,336,695.

This set is split into train and test parts in the proportion 80% / 20%.
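A minimal sketch of the split (the seed is arbitrary; `corpus` comes from the loading sketch above):

    set.seed(42)
    n         <- length(corpus)
    train_idx <- sample.int(n, size = floor(0.8 * n))
    train     <- corpus[train_idx]    # 80%
    test      <- corpus[-train_idx]   # 20%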

  1. After sentence tokenization we find that:

    • count of sentences in the train data: 5,039,888
    • count of sentences in the test data: 1,261,421.
  2. To avoid errors in the ngram R package, which is responsible for extracting the 4-gram frequency table from text, we exclude sentences of 3 words or shorter from the data (see the preprocessing sketch after this list).

  3. After the manipulations above there are 4,113,533 sentences in the train data and 1,041,760 sentences in the test data.

  4. There are 400,797 unique words in the train data. Note that we ignore the case of letters (for example, the words ‘LoVe’, ‘LOve’ and ‘love’ are considered the same, while the words ‘love’ and ‘loved’ are treated as different).

  5. There are 33,053,461 unique 4-grams in the train data. Memory usage: 2.9 GB.

  6. The shape of the word-frequency (i.e., 1-gram) histogram and the shapes of the N-gram (N = 2, 3, 4, …) frequency histograms look similar; these graphs follow the same pattern.

  7. With the graph above we can evaluate how many words/N-grams we need to cover the whole train data corpus. A couple of examples for 4-grams (see the coverage sketch after this list):

    • 10% of the 4-grams (the 3,305,346 most frequent ones) cover 30.3% of all cases in the training data. Memory usage will be less than 300 MB.
    • If we take only the 4-grams with frequency greater than 1 (2,698,785 items), they cover ~28% of all cases in the training data. Memory usage will again be less than 300 MB. We proceed with model development using this option.
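A sketch of the preprocessing pipeline behind items 1–5 above. The tokenizers package here is an assumption (any sentence splitter would do); ngram is the package named in item 2, assuming a version that accepts a vector of strings:

    library(ngram)       # package used in the project for N-gram frequency tables
    library(tokenizers)  # assumption: one possible sentence tokenizer

    # Split the training texts into sentences and lower-case them
    # (so 'LoVe', 'LOve' and 'love' collapse into one word).
    sentences <- unlist(tokenize_sentences(tolower(train)))

    # Drop sentences of 3 words or fewer: ngram() needs at least n = 4 words.
    word_counts <- sapply(strsplit(sentences, "\\s+"), length)
    sentences   <- sentences[word_counts > 3]

    # Build the 4-gram frequency table.
    ng4   <- ngram(sentences, n = 4)
    freq4 <- get.phrasetable(ng4)   # data frame: ngrams, freq, prop
    head(freq4)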
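And a sketch of the coverage estimates from item 7, based on the cumulative frequencies of the sorted 4-gram table:

    # Sort 4-grams by frequency and compute the cumulative share of all
    # 4-gram occurrences covered by the most frequent items.
    freq4    <- freq4[order(-freq4$freq), ]
    coverage <- cumsum(freq4$freq) / sum(freq4$freq)

    top10 <- floor(0.10 * nrow(freq4))
    coverage[top10]                                     # coverage of the top 10% of 4-grams

    sum(freq4$freq[freq4$freq > 1]) / sum(freq4$freq)   # coverage of 4-grams seen more than once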

4. Building the model

The model is built on a combination of the following concepts (see the sketch below):

  • N-gram model,
  • back-off,
  • Markov chain (the next word is assumed to depend only on the preceding N − 1 words).
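A minimal sketch of how these concepts combine in prediction: by the Markov assumption only the last N − 1 words of the context matter, and if the 4-gram table has no match, the model backs off to shorter N-grams. The table names (freq4, freq3, freq2) and the unigram fallback are illustrative assumptions, not the exact implementation:

    # Back-off prediction sketch. freq4, freq3, freq2 are assumed to be
    # data frames with columns `ngrams` and `freq`, as returned by
    # ngram::get.phrasetable().
    predict_next <- function(context, tables = list(freq4, freq3, freq2)) {
      words <- unlist(strsplit(tolower(context), "\\s+"))
      for (k in seq_along(tables)) {
        n      <- 5 - k                                    # 4-gram, then 3-gram, then 2-gram
        prefix <- paste(tail(words, n - 1), collapse = " ")
        tab    <- tables[[k]]
        hits   <- tab[startsWith(tab$ngrams, paste0(prefix, " ")), ]
        if (nrow(hits) > 0) {
          best <- hits$ngrams[which.max(hits$freq)]        # most frequent match
          return(tail(unlist(strsplit(best, "\\s+")), 1))  # its last word
        }
      }
      "the"  # fallback: in practice, the most frequent unigram
    }

    predict_next("I would like to")   # e.g. might return "see"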

5. Evaluating the model

Two metrics are considered for evaluating the model:

6. Online App

In conclusion of the project, a data product is developed and provided. It showcases the prediction algorithm that has been built and provides an interface that can be accessed by others.

The product (an app prototype) is developed with the Shiny framework for the R programming language and can be found at: https://ecopsy-app.shinyapps.io/my_capstone_app/
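For illustration, a minimal Shiny app wiring such a predictor to a text input could look like this (a sketch only; it assumes the predict_next() function from the sketch above and is much simpler than the deployed app):

    library(shiny)

    ui <- fluidPage(
      titlePanel("Next-word prediction"),
      textInput("phrase", "Type a phrase:"),
      textOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        req(input$phrase)                 # wait for non-empty input
        predict_next(input$phrase)
      })
    }

    shinyApp(ui, server)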

Instructions: