A word prediction app

Coursera capstone project

Joyce Clemente

2024-10-17 (v.1); 2024-10-17 (last update)

Objective and method

Objective: Use 3 corpora (tweets, blogs, news) to build a word prediction model for use in a Shiny app.

Number of unique n-grams (n_size) or collocates (c_size) used in the models.
ngram	n_size	c	c_size
ugram	398317	-5	1506385
bgram	4689189	-6	1415277
tgram	2593472	-7	1367120
qgram	1868130

Approach
- Assume the predicted word (w1) depends on the previous words (Jurafsky & Martin 2024).
- Create 2-, 3-, and 4-grams using 40% of each corpus.
- Apply Simple Good-Turing smoothing to account for unobserved events (Gale & Sampson 1995).
- Trim 3- and 4-grams and collocates; remove entries with frequency = 1.
- Weight the n-gram probabilities so that the weights sum to 1.
- Four most promising models:
  - n-grams only (n), n-grams with a penalty for repeats and stopwords (np), n-grams and collocates (nc), and n-grams with collocates and a penalty for repeats and stopwords (ncp).
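
As a rough illustration of the n-gram creation and trimming steps in the approach, here is a minimal base-R sketch. The helper `make_ngrams` and the toy sentence are hypothetical and stand in for the project's own code and corpora, which are not shown here. Smoothing is not included.

```r
# Hypothetical helper (not from the project): build all n-grams of size n
# from a character vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy text standing in for a corpus sample (assumption, not project data).
txt <- "the cat sat on the mat and the cat sat on the sofa"
tokens <- strsplit(tolower(txt), "\\s+")[[1]]

# Count 3-grams, then drop those observed only once (frequency = 1).
tg <- table(make_ngrams(tokens, 3))
tg_trimmed <- tg[tg > 1]
```

With this toy input, only the 3-grams that occur twice survive the trim; on a real corpus the same filter is what shrinks the 3- and 4-gram tables.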

Sample calculation

# Compute weights (the user provides wh in field #2 of the app).
w1 <- ((1 - wh) / 3) * 2
w2 <- (1 - wh) / 3
w3 <- (1 - wh)
# Weight the n-gram probabilities (e.g. for 2-, 3-, and 4-grams; here wh + w1 + w2 = 1).
pw4 <- pw4 * wh
pw3 <- pw3 * w1
pw2 <- pw2 * w2
# The n-grams and weights involved change depending on the highest matched n-gram.
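
To make the weighting concrete, the sketch below plugs illustrative numbers into the formulas above for the 2-, 3-, and 4-gram case. The value of `wh`, the candidate words, and their probabilities are all made up, and summing the weighted probabilities into a single score is an assumption about how they are combined, not something stated in the source.

```r
# Hypothetical input: in the app, wh would come from field #2.
wh <- 0.6
w1 <- ((1 - wh) / 3) * 2   # weight applied to the 3-gram probability
w2 <- (1 - wh) / 3         # weight applied to the 2-gram probability

# Made-up probabilities for three candidate words at each n-gram order
# (0 means the candidate was not matched at that order).
cand <- data.frame(
  word = c("day", "time", "year"),
  pw4  = c(0.50, 0.30, 0.00),
  pw3  = c(0.40, 0.20, 0.10),
  pw2  = c(0.10, 0.05, 0.20),
  stringsAsFactors = FALSE
)

# Assumption: combine the weighted probabilities by summing across orders.
cand$score <- cand$pw4 * wh + cand$pw3 * w1 + cand$pw2 * w2

# Rank candidates from highest to lowest combined score.
cand <- cand[order(-cand$score), ]
```

Here the three weights sum to 1 (0.6 + 0.2667 + 0.1333), so a larger `wh` shifts the ranking toward candidates supported by the highest-order match.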

App performance

Number of matches per phrase (test set).
	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
no_letter	1	462	2725	9173	12396	44264
one_letter	0	23	122	510	642	4986
two_letters	0	4	20	96	92	2347

Tested on 3 x 4000 phrases and validated on 1 x 4000 phrases.
Large number of total predictions per phrase (in the thousands); the target word is among them in 81-82% of phrases.
Models n and nc gave the highest accuracy.
Accuracy is low for the top-1 match (m01 ~0.12-0.53 out of 1).
Accuracy improves with a clue (i.e. the user provides the first n letters of the word).
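
One way to see why a clue helps is that it shrinks the candidate list before ranking. The sketch below is a minimal illustration in base R. The prediction list and the helpers `filter_by_clue` and `rank_of` are hypothetical and are not taken from the app's code.

```r
# Hypothetical ranked predictions (highest probability first); not app output.
preds <- c("the", "to", "a", "today", "time", "tomorrow")

# Keep only the candidates that start with the letters typed so far.
filter_by_clue <- function(preds, clue) {
  if (nchar(clue) == 0) return(preds)
  preds[startsWith(preds, clue)]
}

# 1-based rank of the target word after filtering (NA if it is not a candidate).
rank_of <- function(target, preds, clue = "") {
  match(target, filter_by_clue(preds, clue))
}

rank_of("today", preds)        # no clue: rank 4
rank_of("today", preds, "t")   # one letter: rank 3
rank_of("today", preds, "to")  # two letters: rank 2
```

The same effect shows up in the match counts in the table above: each added letter cuts the number of candidates per phrase, so the target word has fewer competitors for the top positions.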

How to use the app

App: https://j1924cle.shinyapps.io/C10JoyClemDraft3/
The app has three tabs.
The ‘Predict a Word’ tab has 8 fields, three buttons, and two results (graph, table).
- Fields #3 and #4 are required by 2 of the 4 models (the app gives information on valid values).
- Field #8 is populated after pressing ‘predict’.
- Toggling a button in #8 and pressing ‘update text & table’ updates field #6, table, and resets the prediction list in field #8.

What information does the app provide?

I. Pressing ‘predict’
- Lists predicted words (up to 10) in field #8 + ‘WORD_NOT_LISTED’
- Plots graph (by probability, then alphabetical).
II. Selecting a choice in field #8, then pressing ‘update text & table’
- Creates/updates table with summary of parameters & word selections