Machines should work. People should think.
Phrase-based n-gram extraction, matrix-format storage, multiple predictions, and combinable probabilities allow this algorithm to choose the most likely prediction from a wide array of candidates, while keeping the lookup dataset relatively small and processing speed acceptable.
author: John Barnes
date: 24 Jan 2016
autosize: TRUE
After harvesting lines from the blogs, news, and twitter datasets, the lines were broken at the common full-stop punctuation marks {. ? ! {} :}.
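A minimal sketch of that splitting step in base R follows; the regular expression and example lines are illustrative, not the project's actual code.

```r
# Illustrative sketch of the line-breaking step (not the project's exact
# code): split each harvested line at full-stop punctuation and drop
# empty fragments.
lines <- c("Machines should work. People should think!",
           "One more example: does it split?")

phrases <- unlist(strsplit(lines, "[.?!:]+"))  # assumed punctuation set
phrases <- trimws(phrases)
phrases <- phrases[nzchar(phrases)]
phrases
```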
The reference dataset stores N-grams, regardless of length, as hexgrams plus a probability, in 7 columns of a table:
The last 5 words of the input phrase are matched against the Prior01-Prior05 columns to form 5 Elementary True/False Vectors.
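A sketch of the table layout and the matching step is below. The Prior01-Prior05 column names come from the text; Prediction and Probability are assumed names for the remaining two of the 7 columns, and reading an X in a model name as a wildcard position is inferred from the coefficient table further down.

```r
# Illustrative 7-column lookup table: five prior-word columns (named in
# the text) plus assumed Prediction and Probability columns.
ngrams <- data.frame(
  Prior05     = c("i",    "",     ""),
  Prior04     = c("want", "i",    ""),
  Prior03     = c("to",   "want", "want"),
  Prior02     = c("go",   "to",   "to"),
  Prior01     = c("to",   "go",   "go"),
  Prediction  = c("the",  "home", "out"),
  Probability = c(0.41,   0.30,   0.12)
)

# Last five words of the input phrase, from 5 words back to 1 word back.
input <- c("i", "want", "to", "go", "to")

# Five Elementary True/False Vectors, one per prior position.
m5 <- ngrams$Prior05 == input[1]
m4 <- ngrams$Prior04 == input[2]
m3 <- ngrams$Prior03 == input[3]
m2 <- ngrams$Prior02 == input[4]
m1 <- ngrams$Prior01 == input[5]

# Individual models combine these vectors; e.g. Exact54321 would require
# all five positions to match, ExactXXX21 only the two nearest (assumed
# reading of X as a wildcard position).
exact54321 <- m5 & m4 & m3 & m2 & m1
exactXXX21 <- m2 & m1
```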
Probabilities from each selected row are multiplied by a coefficient for the model that selected them; using group_by and summarize, aggregate weighted probabilities for each prediction are computed, and the highest aggregate probability “wins.”
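Below is a hedged sketch of that combination step using dplyr's group_by and summarize on an illustrative set of selected rows; the coefficient values are taken from the fitted table that follows, while the selected rows and column names are assumptions.

```r
library(dplyr)

# Illustrative rows selected by three of the models (data are made up;
# the coefficients match the fitted values listed in the table below).
selected <- data.frame(
  Prediction  = c("home", "home", "out",  "the"),
  Probability = c(0.30,   0.22,   0.12,   0.41),
  Model       = c("Exact54321", "ExactXXX21", "ExactXXX21", "SkipOne5432X")
)

coefs <- c(Exact54321 = 3.56, ExactXXX21 = 8.89, SkipOne5432X = 5.33)

# Weight each row's probability by its model's coefficient, aggregate per
# prediction, and take the highest aggregate as the winner.
ranked <- selected %>%
  mutate(Weighted = Probability * coefs[Model]) %>%
  group_by(Prediction) %>%
  summarize(Aggregate = sum(Weighted), .groups = "drop") %>%
  arrange(desc(Aggregate))

ranked$Prediction[1]   # prediction with the highest aggregate probability
```

In this toy example "home" wins because it is backed by two models whose coefficients outweigh the single high-probability row behind "the".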
The final version of the coefficients for the multiple model, which achieved a dismal 9.4% success rate on training data, was:
| Model | Coefficient |
|---|---|
| Exact54321 | 3.560 |
| ExactX4321 | 7.110 |
| ExactXX321 | 5.330 |
| ExactXXX21 | 8.890 |
| ExactXXXX1 | 1.780 |
| SkipOne5432X | 5.330 |
| SkipOneX432X | 2.670 |
| SkipOneXX32X | 4.000 |
| SkipOneXXX2X | 1.330 |
| SkipTwo543XX | 10.000 |
| SkipTwoX43XX | 6.670 |
| SkipTwoXX3XX | 3.330 |
| Unk543X1 | 1.590 |
| Unk54X21 | 9.520 |
| Unk54XX1 | 3.170 |
| Unk5X321 | 7.940 |
| Unk5X3X1 | 4.760 |
| Unk5XX21 | 6.350 |
| ContextModel | 6.670 |
| MadGuess | 0.005 |
With fitted coefficients, the combined model was only 9.4% correct on training data and 3.5% correct on test data. Isolated Unk models can be overtrained to 17.3% on training data but fall under 1% on test data.