class: center, middle, inverse, title-slide

# TMPOTNW
## (The Magnificent Predictor Of The Next Word)
### HE
### Coursera
### 2020-12-27

---

## Steps in developing the app

In very broad terms, there were three steps in developing my app:

1. Most important: Reading and cleaning the text (blogs.txt, twitter.txt, news.txt):
    * The better you clean, the less noise there is in your data, and the less data you need.
    * Punctuation, numbers and non-ASCII characters were excluded.
2. Calculating the frequencies of n-grams:
    * I calculated the n-grams line by line, treating line breaks as real boundaries. Thus, I only counted word combinations that actually occur within a line.
    * I calculated separate corpora for the raw text including and excluding "stopwords" (the, it, to, ...).
    * A 10% sample of the original data was the best trade-off between accuracy and computation time.
    * 4-grams were sufficient: accuracy did not improve with longer n-grams.
3. Developing the app:
    * How to predict the most likely word?
    * What (input) options should the user have?

---

## How does the algorithm work?

The algorithm is very simple:

1. The string entered is split into words using `str_split(string, " ")`.
2. If you choose to exclude "stopwords" (to, the, it, ...), they are deleted from the string.
3. The number of words is used to select the best n-gram frequency table:
    * If the string consists of three (or more) words, the algorithm checks whether there are 4-grams starting with the last three words of the string.
        * If there are, the predicted next word is the fourth word of the most frequent of those 4-grams.
        * If there aren't, the algorithm checks whether there are 3-grams starting with the last two words of the string.
        * ...
    * If the string consists of two words, 3-grams are checked.
        * ...
    * If no n-gram is found, the most frequent word is returned as the prediction ("the" when stopwords are included, "said" when they are excluded).
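
---

## How does the algorithm work? A sketch

The back-off steps on the previous slide can be sketched in base R roughly as follows. This is a minimal illustration under stated assumptions, not the app's actual code: the tiny `ngram_counts` tables, the function name `predict_next` and the argument `n_alt` are all hypothetical, base `strsplit()` stands in for `str_split()`, and stopword removal is left out.

```r
# Hypothetical toy frequency tables (names = n-grams, values = counts).
# The real tables are built from blogs.txt, twitter.txt and news.txt.
ngram_counts <- list(
  `1` = c("the" = 50, "said" = 20, "of" = 15),
  `2` = c("thanks for" = 9, "for the" = 12, "for a" = 4),
  `3` = c("thanks for the" = 7, "thanks for your" = 3),
  `4` = c("many thanks for the" = 2)
)

predict_next <- function(string, tables = ngram_counts, n_alt = 1) {
  # Split on single spaces and drop empty tokens.
  words <- strsplit(tolower(string), " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]

  # Start with the longest usable context (at most three words),
  # then back off to shorter contexts.
  k <- min(length(words), 3)
  while (k >= 1) {
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    freq <- tables[[as.character(k + 1)]]
    hits <- freq[startsWith(names(freq), prefix)]
    if (length(hits) > 0) {
      hits <- sort(hits, decreasing = TRUE)
      # The prediction is the last word of each matching n-gram.
      return(head(sub(".* ", "", names(hits)), n_alt))
    }
    k <- k - 1
  }

  # Nothing matched: fall back to the most frequent single words.
  head(names(sort(tables[["1"]], decreasing = TRUE)), n_alt)
}

predict_next("many thanks for")        # matched in the 4-gram table
predict_next("thanks for", n_alt = 2)  # two candidates from the 3-gram table
predict_next("xyzzy")                  # no match, unigram fallback
```

Storing one named count vector per n-gram order keeps the lookup to a single `startsWith()` filter per back-off step; a real app with large tables would probably want an indexed structure instead.
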
---

## Shiny App: User Input

There are three inputs you can make on the left side of the app:

1. Text input: Enter your search string. Separate words with a space character.
2. Slider input: How many alternative predictions should the app generate?
3. Checkbox input: Do you want to include or exclude stopwords?

---

## Shiny App: Output

On the right-hand side of the app, you see all three outputs:

1. Your search string repeated (black), followed by the predicted word (red).
2. How likely it is that the predicted word is correct. This estimate is based on tests I ran on a test data set (10% of the original raw data) with the same settings as your input (number of words, stopwords included or excluded).
3. A set of alternative (and less likely) predictions. The maximum number of alternatives is the number you selected with the slider. If the n-grams do not contain enough alternatives, fewer or none are returned.

---

class: center, middle

# Have fun and good luck!