Machines should work. People should think.
Phrase-based n-gram extraction, matrix-format storage, multiple predictions, and combinable probabilities allow this algorithm to choose the most likely prediction from a wide array of candidates, while keeping the lookup dataset relatively small and processing speed acceptable.
author: John Barnes
date: 24 Jan 2016
autosize: TRUE
After harvesting lines from the blogs, news, and twitter datasets, the lines were broken at the common full-stop punctuation marks {. ? ! {} :}.
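A minimal sketch of that splitting step in base R follows; the regular expression and example lines are illustrative, not the project's actual code.

```r
# Illustrative sketch of the line-breaking step (not the project's exact
# code): split each harvested line at full-stop punctuation and drop
# empty fragments.
lines <- c("Machines should work. People should think!",
           "One more example: does it split?")

phrases <- unlist(strsplit(lines, "[.?!:]+"))  # assumed punctuation set
phrases <- trimws(phrases)
phrases <- phrases[nzchar(phrases)]
phrases
```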
The reference dataset stores N-grams, regardless of length, as hexgrams plus a probability, in 7 columns of a table:
The last 5 words of the input phrase are matched against the Prior01-Prior05 columns to form 5 Elementary True/False Vectors.
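A sketch of the table layout and the matching step is below. The Prior01-Prior05 column names come from the text; Prediction and Probability are assumed names for the remaining two of the 7 columns, and reading an X in a model name as a wildcard position is inferred from the coefficient table further down.

```r
# Illustrative 7-column lookup table: five prior-word columns (named in
# the text) plus assumed Prediction and Probability columns.
ngrams <- data.frame(
  Prior05     = c("i",    "",     ""),
  Prior04     = c("want", "i",    ""),
  Prior03     = c("to",   "want", "want"),
  Prior02     = c("go",   "to",   "to"),
  Prior01     = c("to",   "go",   "go"),
  Prediction  = c("the",  "home", "out"),
  Probability = c(0.41,   0.30,   0.12)
)

# Last five words of the input phrase, from 5 words back to 1 word back.
input <- c("i", "want", "to", "go", "to")

# Five Elementary True/False Vectors, one per prior position.
m5 <- ngrams$Prior05 == input[1]
m4 <- ngrams$Prior04 == input[2]
m3 <- ngrams$Prior03 == input[3]
m2 <- ngrams$Prior02 == input[4]
m1 <- ngrams$Prior01 == input[5]

# Individual models combine these vectors; e.g. Exact54321 would require
# all five positions to match, ExactXXX21 only the two nearest (assumed
# reading of X as a wildcard position).
exact54321 <- m5 & m4 & m3 & m2 & m1
exactXXX21 <- m2 & m1
```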
Probabilities from each selected row are multiplied by a coefficient for the model that selected them; using group_by and summarize, aggregate weighted probabilities for each prediction are computed, and the highest aggregate probability “wins.”
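Below is a hedged sketch of that combination step using dplyr's group_by and summarize on an illustrative set of selected rows; the coefficient values are taken from the fitted table that follows, while the selected rows and column names are assumptions.

```r
library(dplyr)

# Illustrative rows selected by three of the models (data are made up;
# the coefficients match the fitted values listed in the table below).
selected <- data.frame(
  Prediction  = c("home", "home", "out",  "the"),
  Probability = c(0.30,   0.22,   0.12,   0.41),
  Model       = c("Exact54321", "ExactXXX21", "ExactXXX21", "SkipOne5432X")
)

coefs <- c(Exact54321 = 3.56, ExactXXX21 = 8.89, SkipOne5432X = 5.33)

# Weight each row's probability by its model's coefficient, aggregate per
# prediction, and take the highest aggregate as the winner.
ranked <- selected %>%
  mutate(Weighted = Probability * coefs[Model]) %>%
  group_by(Prediction) %>%
  summarize(Aggregate = sum(Weighted), .groups = "drop") %>%
  arrange(desc(Aggregate))

ranked$Prediction[1]   # prediction with the highest aggregate probability
```

In this toy example "home" wins because it is backed by two models whose coefficients outweigh the single high-probability row behind "the".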
The final version of the coefficients for the multiple model, which achieved a dismal 9.4% success rate on training data, was:
| Model | Coefficient |
|---|---|
| Exact54321 | 3.560 |
| ExactX4321 | 7.110 |
| ExactXX321 | 5.330 |
| ExactXXX21 | 8.890 |
| ExactXXXX1 | 1.780 |
| SkipOne5432X | 5.330 |
| SkipOneX432X | 2.670 |
| SkipOneXX32X | 4.000 |
| SkipOneXXX2X | 1.330 |
| SkipTwo543XX | 10.000 |
| SkipTwoX43XX | 6.670 |
| SkipTwoXX3XX | 3.330 |
| Unk543X1 | 1.590 |
| Unk54X21 | 9.520 |
| Unk54XX1 | 3.170 |
| Unk5X321 | 7.940 |
| Unk5X3X1 | 4.760 |
| Unk5XX21 | 6.350 |
| ContextModel | 6.670 |
| MadGuess | 0.005 |
With fitted coefficients, the combined model was only 9.4% correct on training data and 3.5% correct on test data. Isolated Unk models can be overtrained to 17.3% on training data but fall under 1% on test data.