by Herchu
Sleek design. Shows or hides profanity.
Works in all modern (JavaScript-enabled) browsers. Responsive layout: tested on an iPad and an iPhone.
4-gram Model with Linear Interpolation Smoothing
\[ \begin{aligned} P(w_n|w_{n-1},\dotsc,w_{n-3}) ={} & \lambda_0 P(w_n) + \lambda_1 P(w_n|w_{n-1}) + \dotsb + {} \\ & \lambda_3 P(w_n|w_{n-1},\dotsc,w_{n-3}) \end{aligned} \]
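As a toy illustration of the formula (all numbers below are invented, not the app's actual estimates), the interpolated probability is just a weighted sum of the four maximum-likelihood estimates:

```r
lambda <- c(0.1, 0.2, 0.3, 0.4)       # (lambda_0, ..., lambda_3), summing to 1
p_mle  <- c(0.001, 0.05, 0.20, 0.35)  # P(w), P(w|w-1), P(w|w-1,w-2), P(w|w-1,...,w-3)
sum(lambda * p_mle)                   # interpolated probability: 0.2101
```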
30,000-word dictionary; unigram to tetragram tables. The 2-, 3-, and 4-gram tables have 1 million entries each (MLE; frequencies ≥ 3).
n-gram tuples include begin-of-sentence tokens.
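One way this padding can be pictured (a minimal sketch; `bos_pad` and the `<s>` token are illustrative names, not necessarily the app's own code):

```r
# Pad each sentence so its first words still form complete 4-grams.
bos_pad <- function(tokens, n = 4) c(rep("<s>", n - 1), tokens)

tokens <- bos_pad(c("the", "cat", "sat"))
embed(tokens, 4)[, 4:1]  # every 4-gram tuple, including the <s>-padded ones
#      [,1]  [,2]  [,3]  [,4]
# [1,] "<s>" "<s>" "<s>" "the"
# [2,] "<s>" "<s>" "the" "cat"
# [3,] "<s>" "the" "cat" "sat"
```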
Achieves 16% accuracy for the first predicted word and 26% within the best three, as scored independently by benchmark.R.
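A generic sketch of how such first-word and top-3 scores are computed (this is not benchmark.R itself; the predictions and truths below are made up):

```r
truth <- c("cat", "dog", "sat")
top3  <- list(c("cat", "the", "a"),   # hit at rank 1
              c("a", "dog", "the"),   # hit at rank 2
              c("a", "the", "on"))    # miss
mean(mapply(function(p, t) p[1] == t, top3, truth))  # first-word accuracy: ~0.33
mean(mapply(function(p, t) t %in% p, top3, truth))   # top-3 accuracy:      ~0.67
```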
Web app hosted on shinyapps.io (developed in R).
Words in the n-gram tables are integer-coded; storing integers instead of character strings results in a 50 MB total memory footprint.
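The coding step might look like this (a sketch; the tiny dictionary here stands in for the app's 30,000-word one):

```r
dict  <- c("the", "cat", "sat")              # in the app: ~30,000 words
words <- c("the", "cat", "sat", "the")
codes <- match(words, dict)                  # 1 2 3 1 -- one integer per word
as.numeric(object.size(codes)) < as.numeric(object.size(words))  # TRUE: integers are smaller
```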
One line in R (fast!) gets the most probable next word:
# weight each probability column by its lambda, sum per row, take the top-ranked row
head(order(rowSums(sweep(ngrams, 2, weights, `*`)), decreasing = TRUE), n = 1)
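A self-contained toy run of that one-liner (probabilities invented): `ngrams` holds one row per candidate next word and one column per model order, and `weights` holds the lambdas.

```r
ngrams  <- rbind(cat = c(0.002, 0.04, 0.15, 0.30),
                 dog = c(0.003, 0.02, 0.05, 0.10))
weights <- c(0.1, 0.2, 0.3, 0.4)
best <- head(order(rowSums(sweep(ngrams, 2, weights, `*`)), decreasing = TRUE), n = 1)
rownames(ngrams)[best]                       # "cat" (score 0.1732 vs 0.0593)
```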
Weights \( (\lambda_0,\lambda_1,\lambda_2,\lambda_3) \) were eyeballed; optimizing them with COBYLA did not yield better results.
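For reference, a COBYLA attempt along those lines could be set up with the nloptr package; everything below (the held-out probabilities, the likelihood objective) is an assumed stand-in, not the project's actual tuning code.

```r
library(nloptr)
# One row per held-out test case; columns are the 1- to 4-gram
# probabilities of the true next word (values invented).
probs <- rbind(c(0.001, 0.03, 0.10, 0.25),
               c(0.002, 0.01, 0.08, 0.00),
               c(0.001, 0.05, 0.00, 0.00))
neg_loglik <- function(l) {
  l <- l / sum(l)                 # renormalize so the lambdas sum to 1
  -sum(log(probs %*% l))          # minimize negative log-likelihood
}
fit <- nloptr(x0 = rep(0.25, 4), eval_f = neg_loglik,
              lb = rep(1e-6, 4), ub = rep(1, 4),
              opts = list(algorithm = "NLOPT_LN_COBYLA",
                          xtol_rel = 1e-6, maxeval = 1000))
round(fit$solution / sum(fit$solution), 3)   # tuned (lambda_0, ..., lambda_3)
```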
Includes two optional extra features: