A.B.
2025-02-27
1. Project overview
This project covers the basics of analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It covers cleaning and analyzing text data, then building and sampling from a predictive text model, based on the following techniques and concepts: N-gram model, back-off, Markov chain.
Finally, a predictive text product is built.
2. The Data
The data is provided in three *.txt files containing texts in English (tweets, news and blogs).
Content archived from heliohost.org on September 30, 2016 and retrieved via Wayback Machine on April 24, 2017. https://web-beta.archive.org/web/20160930083655/http://www.corpora.heliohost.org/aboutcorpus.html
The data is loaded into RStudio for exploratory analysis and further processing.
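For illustration, the raw files can be read into R roughly as follows (the file names and paths below are assumptions, not the actual names used in the project):

    # Read the three English corpus files line by line.
    # File names and paths are illustrative assumptions.
    blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
    news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
    twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

    # Combine into a single character vector of samples (texts/phrases)
    corpus <- c(blogs, news, twitter)
    length(corpus)   # overall count of samples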
3. Exploratory data analysis and train-test split
Skipping technical details, the following conclusions about the data can be made. The overall count of samples (texts or phrases) in the data is 3,336,695.
These samples are split into train and test parts in the proportion 80% / 20%, as sketched below.
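A minimal sketch of such a split, assuming the samples are stored in a character vector corpus (as in the loading sketch above):

    set.seed(42)                                    # arbitrary seed, for reproducibility
    n <- length(corpus)
    train_idx <- sample(n, size = round(0.8 * n))   # 80% of samples go to the train part
    train <- corpus[train_idx]
    test  <- corpus[-train_idx]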
The texts are then tokenized into sentences. To avoid errors in the ngram R package, which is responsible for extracting the 4-gram frequency table from a text, sentences of three or fewer words are excluded from the data.
After these manipulations there are 4,113,533 sentences in the train data and 1,041,760 sentences in the test data.
There are 400,797 unique words in the train data. Note that letter case is ignored (for example, the words ‘LoVe’, ‘LOve’ and ‘love’ are considered the same, while ‘love’ and ‘loved’ are treated as different words).
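A rough sketch of these steps, using simple regex-based sentence splitting and base R only (the actual processing in the project may differ):

    # Split the train samples into sentences on '.', '!' or '?' followed by whitespace
    sentences <- unlist(strsplit(train, "(?<=[.!?])\\s+", perl = TRUE))

    # Drop sentences of three or fewer words
    word_counts <- lengths(strsplit(sentences, "\\s+"))
    sentences   <- sentences[word_counts > 3]

    # Ignore letter case when counting unique words
    words <- unlist(strsplit(tolower(sentences), "[^a-z']+"))
    length(unique(words[words != ""]))   # number of unique words in the train data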
There are 33,053,461 unique 4-grams in the train data; memory usage is 2.9 GB.
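For example, the 4-gram frequency table can be extracted with the ngram package roughly as follows (a sketch operating on the sentences vector from the sketch above; processing millions of sentences efficiently requires batching, which is omitted here):

    library(ngram)

    # ngram() fails on strings with fewer than n words, hence the earlier
    # exclusion of sentences of three or fewer words.
    tables <- lapply(sentences, function(s) get.phrasetable(ngram(tolower(s), n = 4)))
    freq4  <- do.call(rbind, tables)
    freq4  <- aggregate(freq ~ ngrams, data = freq4, FUN = sum)   # merge duplicate 4-grams
    freq4  <- freq4[order(-freq4$freq), ]                         # most frequent first
    head(freq4)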
The shape of the word-frequency (i.e. 1-gram) histogram and the shapes of the N-gram (N = 2, 3, 4, …) frequency histograms look similar: they follow the same pattern. From these graphs we can evaluate how many words/N-grams are needed to cover the whole train-data corpus; a couple of examples for 4-grams are sketched below.
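A sketch of how such coverage figures can be computed from a frequency table (assuming a data frame freq4 with a freq column sorted in decreasing order, as in the sketch above):

    # Cumulative share of all 4-gram occurrences covered by the most frequent 4-grams
    coverage <- cumsum(freq4$freq) / sum(freq4$freq)

    # How many distinct 4-grams are needed to cover, e.g., 50% and 90% of the corpus
    which(coverage >= 0.5)[1]
    which(coverage >= 0.9)[1]

    # The coverage curve itself; 1-, 2-, 3- and 4-gram curves have a similar shape
    plot(coverage, type = "l",
         xlab = "Number of most frequent 4-grams",
         ylab = "Share of train corpus covered")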
4. Building the model
The model is built based on a combination of the following concepts:
N-gram (4-gram) model, used as the basis of a Markov chain algorithm. For our purposes, the term “Markov chain” is synonymous with “text generated from n-gram model probability tables”. The N-gram model is a purely statistical model of language: it is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words (see the formula after this list).
To prevent unseen words and N-grams from receiving a zero score, the stupid back-off algorithm is implemented: the score of an observed N-gram is its relative frequency in the train data, and for an unobserved N-gram the model backs off to the next shorter N-gram, multiplying the score by a fixed penalty at each back-off step (a sketch is given below).
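In formula form, the 4-gram Markov assumption and its maximum-likelihood estimate can be written as:

    P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})
                                    = \frac{\mathrm{count}(w_{i-3}\,w_{i-2}\,w_{i-1}\,w_i)}{\mathrm{count}(w_{i-3}\,w_{i-2}\,w_{i-1})}

A minimal sketch of stupid back-off (table and function names are illustrative assumptions, not the project’s actual code). Counts are looked up in the 4-gram table first; if the context is not found, the model backs off to shorter contexts, multiplying the score by a fixed penalty (0.4 in the original stupid back-off paper) at each step:

    # freq_tables: a list where freq_tables[[n]] is a named numeric vector mapping
    # an n-gram string "w1 w2 ... wn" to its count in the train data (n = 1..4).
    predict_backoff <- function(context, candidates, freq_tables, lambda = 0.4) {
      context <- tolower(strsplit(context, "\\s+")[[1]])
      best_word <- NA_character_; best_score <- -Inf
      for (w in candidates) {
        score <- 0
        for (k in 3:0) {                          # try 3-, 2-, 1- and 0-word contexts
          ctx        <- tail(context, k)
          ngram_key  <- paste(c(ctx, w), collapse = " ")
          prefix_key <- paste(ctx, collapse = " ")
          num <- freq_tables[[k + 1]][ngram_key]
          den <- if (k > 0) freq_tables[[k]][prefix_key] else sum(freq_tables[[1]])
          if (!is.na(num) && !is.na(den) && den > 0) {
            score <- lambda^(3 - k) * num / den   # penalise each back-off step
            break                                 # longest matching context wins
          }
        }
        if (score > best_score) { best_score <- score; best_word <- w }
      }
      best_word
    }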
5. Evaluating the model
Two metrics are considered for evaluating the model:
Accuracy - the percentage of correct predictions out of all predictions made. For the built 4-gram model, Accuracy = ~11%.
Perplexity - can be considered a way to capture the degree of ‘uncertainty’ a model has in predicting (i.e. assigning probabilities to) text. With different model parameters we obtain different values of perplexity. For our model, with the shortened “4-gram dictionary” and the preliminarily prepared test data (see 8.b in the section “Exploratory data analysis and train-test split”), Perplexity = 39,005.
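For reference, perplexity over a test set of N words is computed from the probabilities the model assigns to each word given its context; for the 4-gram model:

    \mathrm{PP} = P(w_1, \dots, w_N)^{-1/N}
                = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1}) \right)

Lower perplexity corresponds to lower uncertainty of the model on the test data.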
6. Online App
To conclude the project, a data product is developed and provided. It showcases the prediction algorithm that has been built and provides an interface that can be accessed by others.
The product (an app prototype) is developed with the Shiny framework for the R programming language and can be found at: https://ecopsy-app.shinyapps.io/my_capstone_app/
Instructions:
On first launch, please wait a few seconds while the online app loads (a prediction for the empty phrase - the word ‘the’ - appears on the screen).
The app can then be used to predict the following word for any phrase.
Just type your phrase in the input box and click the “Submit” button to get a word prediction.
Normally, the app predicts the following word for a phrase in less than 1 second.