
class: center, middle, inverse, title-slide

# TMPOTNW
## (The Magnificent Predictor Of The Next Word)
### HE
### Coursera
### 2020-12-27

---

## Steps in developing the app

In very broad terms, there were three steps in developing my app:

1. Most important: Reading and cleaning the text (blogs.txt, twitter.txt, news.txt):
  * The better the cleaning, the less noise there is in the data, and the less data you need.
  * Punctuation, numbers and non-ASCII characters were removed.
2. Calculating the frequencies of n-grams:
  * I calculated the n-grams line by line, treating line breaks as real boundaries. Thus, I only counted word combinations that actually occur within a line.
  * I built separate corpora from the raw text, one including and one excluding "stopwords" (the, it, to, ...).
  * A sample of 10% of the original data was the optimum from an accuracy/time trade-off point of view.
  * 4-grams were sufficient. Accuracy did not improve with longer n-grams.
3. Developing the app:
  * How to predict the most likely word?
  * What (input) options should the user have?
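
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration in base R, not the app's actual code: the helper names `clean_line`, `line_ngrams` and `count_ngrams` are hypothetical, and replacing removed characters with a space is an assumption.

```r
# Hypothetical helper: lower-case the text and keep only a-z and spaces.
# This drops punctuation, numbers and non-ASCII characters in one pass.
clean_line <- function(line) {
  line <- tolower(line)
  line <- gsub("[^a-z ]", " ", line)  # everything else becomes a space
  line <- gsub(" +", " ", line)       # collapse repeated spaces
  trimws(line)
}

# All n-grams of a single cleaned line; n-grams never cross a line break.
line_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of n-grams over many lines, most frequent first.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(clean_line(lines), line_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

# Made-up example lines, for illustration only.
lines <- c("I love you. I love you!", "You love 2 cats")
bigrams <- count_ngrams(lines, 2)
```

Because each line is processed on its own, the last word of the first line and the first word of the second line never form a bigram.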

---

## How does the algorithm work?

The algorithm is very simple:

1. The string entered is split into words using `str_split(string, " ")`
2. If you chose to exclude "stopwords" (to, the, it, ...), they are removed from the string. 
3. The number of words is used to select the best n-gram frequency table:
  * If the string consists of three (or more) words, the algorithm checks if there are 4-grams starting with the last three words of the string.
  * If there are, the next word predicted is the fourth word of the most frequent of those 4-grams.
  * If there aren't, the algorithm checks if there are 3-grams starting with the last two words of the string.
  * ...
  * If the string consists of two words, 3-grams are checked. 
  * ...
  * If no n-gram is found, the most frequent word is returned as the prediction ("the" when stopwords are included and "said" when they are excluded).
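
The back-off logic above could look roughly like this. It is a sketch under assumptions, not the app's implementation: `predict_next` is a hypothetical name, the frequency tables are assumed to be named numeric vectors, and the counts are invented toy values.

```r
# Toy frequency tables (made-up counts, for illustration only).
# Names hold the n-gram text, values hold the counts.
freq <- list(
  "1" = c("the" = 50, "said" = 20, "you" = 10),
  "2" = c("i love" = 5, "love you" = 4, "love cats" = 2),
  "3" = c("i love you" = 3, "i love cats" = 1),
  "4" = c("think i love you" = 1)
)

# Hypothetical predictor: returns up to k candidate next words.
predict_next <- function(string, freq, k = 1, max_n = 4,
                         stopwords = character(0)) {
  words <- strsplit(tolower(trimws(string)), " +")[[1]]
  words <- words[words != "" & !(words %in% stopwords)]

  # Start with the longest usable n-gram, then back off to shorter ones.
  n <- min(max_n, length(words) + 1)
  while (n >= 2) {
    prefix <- paste(tail(words, n - 1), collapse = " ")
    table_n <- freq[[as.character(n)]]
    hits <- table_n[startsWith(names(table_n), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(sort(hits, decreasing = TRUE))
      return(head(sub(".* ", "", best), k))  # last word of each n-gram
    }
    n <- n - 1
  }

  # No n-gram matched: fall back to the most frequent single words.
  head(names(sort(freq[["1"]], decreasing = TRUE)), k)
}

predict_next("I love", freq)          # uses the 3-gram table
predict_next("we really love", freq)  # backs off to the 2-gram table
predict_next("xyzzy", freq)           # falls back to the most frequent word
```

Passing `k` greater than 1 returns alternatives in descending frequency. If fewer than `k` n-grams match, fewer words come back.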
  
---

## Shiny App: User Input

There are three inputs you can make on the left side of the app:

1. Text input: Enter your search string. Separate words with a space character.
2. Slider input: How many alternative predictions should the app generate?
3. Checkbox input: Do you want to include or exclude stopwords?
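
In Shiny, three inputs like these might be declared as below. This is only a sketch: the input IDs, labels and slider range are assumptions, not taken from the app.

```r
library(shiny)

# Hypothetical UI definition; IDs, labels and ranges are illustrative only.
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("string", "Search string (separate words with a space)"),
      sliderInput("n_alt", "Number of alternative predictions",
                  min = 0, max = 10, value = 3),
      checkboxInput("exclude_stopwords", "Exclude stopwords", value = FALSE)
    ),
    mainPanel()  # the outputs would go here
  )
)
```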

---

## Shiny App: Output

On the right-hand side of the app, you see three outputs:

1. You'll see your search string repeated (black), followed by the predicted word (red).
2. You'll see how likely the predicted word is to be accurate. This estimate is based on tests that I ran on a test data set (10% of the original raw data) using the same settings as your input (number of words, including or excluding stopwords).
3. You'll see a set of alternative (and less likely) predictions. The maximum number of alternatives is the number you selected in the slider input. If there are not enough alternatives among the n-grams, fewer alternatives or none at all are returned.

---
class: center, middle

# Have fun and good luck!