Predictive Text Algorithm

2023-03-28

- This app contains a predictive text algorithm that predicts the next word from user input of up to 4 words.
- The data set is a combination of the News, Twitter, and Blogs datasets provided at this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Before sampling, I ran a profanity filter that I wrote over the entire dataset to remove profanity.
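
As a rough illustration of this kind of filter, here is a minimal Python sketch. It is not the filter used in the app (which was written in R); the `BLOCKLIST` contents and the choice to drop individual words rather than whole lines are assumptions made only for this example.

```python
import re

# Hypothetical placeholder list; the vocabulary of the original filter is not given.
BLOCKLIST = {"darn", "heck"}

def remove_profanity(lines, blocklist=BLOCKLIST):
    """Return the lines with blocklisted words removed (case-insensitive)."""
    cleaned = []
    for line in lines:
        kept = [
            word for word in line.split()
            # Compare on letters/digits only so "heck," still matches "heck".
            if re.sub(r"\W", "", word).lower() not in blocklist
        ]
        cleaned.append(" ".join(kept))
    return cleaned
```

Running the filter over the whole corpus before sampling, as described above, means every later sample is already clean.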

Methodology (Steps)

Steps:
1. Tokenize dataset (used quanteda package)
2. Clean data (remove stop words, non-English characters)
3. Create ngrams from the clean dataset
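
The three steps above can be sketched in plain Python as follows. This is only an illustration of the idea, not the quanteda code used in the app: the `STOP_WORDS` set is a tiny made-up list, and treating any token with characters outside `a-z` and the apostrophe as non-English is an assumption of this example.

```python
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}

def tokenize(text):
    """Step 1: lowercase, split on whitespace, strip surrounding punctuation."""
    return [t.strip(string.punctuation) for t in text.lower().split()]

def clean(tokens, stop_words=STOP_WORDS):
    """Step 2: drop stop words and tokens containing non-English characters."""
    return [
        t for t in tokens
        if re.fullmatch(r"[a-z']+", t) and t not in stop_words
    ]

def ngrams(tokens, n):
    """Step 3: list every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(docs, n):
    """Count how often each n-gram occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(clean(tokenize(doc)), n))
    return counts
```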

I created ngrams of 1 to 5 words and built an algorithm that looks at the preceding words to predict the final word. I kept separate tables for 1-, 2-, 3-, 4-, and 5-word ngrams, so that the 5-word table uses the previous 4 words to return the 5th word, the 4-word table uses the previous 3 words, and so on.
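
One way such a lookup could work is sketched below in Python. It assumes the algorithm tries the longest available context first and falls back to shorter ones when there is no match, ending with the most frequent single word; that fallback order, and the names `build_tables` and `predict`, are assumptions of this example rather than a description of the app's R code.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=5):
    """For each n, map every (n-1)-word context to a Counter of following words."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, last = tokens[i:i + n]
                tables[n][tuple(context)][last] += 1
    return tables

def predict(tables, text, max_n=5):
    """Return the most frequent next word, trying the longest context first."""
    words = text.lower().split()
    for n in range(max_n, 0, -1):
        if len(words) < n - 1:
            continue  # not enough input words for this table
        # Last n-1 words of the input; the empty tuple when n == 1.
        context = tuple(words[len(words) - (n - 1):])
        followers = tables[n].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # the tables are empty

# Usage with a toy corpus of already tokenized sentences.
corpus = [
    ["my", "big", "fat", "greek", "wedding"],
    ["big", "fat", "greek", "wedding"],
    ["big", "fat", "cat"],
]
tables = build_tables(corpus)
print(predict(tables, "big fat greek"))
```

With three input words the 5-word table is skipped and the 4-word table is consulted first, which mirrors the "previous 3 words" case described above.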
I took a random sample of 5% of the data.
The algorithm improves as the sample grows, but R cannot handle too large a dataset when building ngrams, so I opted for a smaller sample. You have to weigh the time the algorithm takes to run against how well it performs.
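
A 5% random sample can be drawn along these lines. This Python sketch assumes the sampling unit is a line of text and uses a fixed seed for reproducibility; both the unit and the seed value are assumptions, since neither is stated here.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Return a random subset containing roughly `fraction` of the lines."""
    rng = random.Random(seed)           # fixed seed so the sample is repeatable
    k = round(len(lines) * fraction)    # number of lines to keep
    return rng.sample(lines, k)         # sampling without replacement
```

Raising `fraction` is the knob for the trade-off described above: more data for the ngram tables at the cost of more memory and run time.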
To cut down on the time the app takes to return a word, I saved the data frames after creating the ngrams from the sample and loaded them into my Shiny app.

Here is an example of the working app. If you type in “big fat greek”, the output is “wedding”.