Samer Alwan
April 16, 2019
Data Science Specialization
Capstone Project
www.coursera.org
Link: https://sal2040.shinyapps.io/word_prediction/
Discount method:
Transformation to integers
Unknown words
Pruning
| Highest n-gram order: | Discount factor: |
|---|---|
| 3-gram | Basic: |
| 4-gram | \[ D_1 = 0.5 \] |
| | \[ D_{\ge 2} = 0.75 \] |
| Pruning cut-off method: | Modified: |
| “Hide” = cut-off done after probability calculations | \[ Y = \frac{n_1}{n_1+2n_2} \] |
| “Remove” = cut-off done before probability calculations | \[ D_1 = 1-2Y\frac{n_2}{n_1} \] |
| No pruning | \[ D_2 = 2-3Y\frac{n_3}{n_2} \] |
| | \[ D_3 = 3-4Y\frac{n_4}{n_3} \] |
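
The modified discount factors above depend only on the counts-of-counts n_1 to n_4, that is, the number of distinct n-grams seen exactly 1, 2, 3 and 4 times. The following is a minimal Python sketch of that arithmetic, not the project's actual implementation. The function names and the toy counts are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_of_counts(ngram_counts):
    """Return a Counter mapping k to n_k, the number of distinct
    n-grams that occur exactly k times."""
    return Counter(ngram_counts.values())


def basic_discount(count):
    """Basic scheme from the table: D_1 = 0.5, D_{>=2} = 0.75."""
    return 0.5 if count == 1 else 0.75


def modified_discounts(n1, n2, n3, n4):
    """Modified scheme from the table: Y, D_1, D_2, D_3 computed
    from the counts-of-counts n_1..n_4 (all assumed non-zero)."""
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3 = 3 - 4 * y * n4 / n3
    return y, d1, d2, d3


# Toy n-gram counts (made up for illustration, not project data).
toy_counts = {"a b": 1, "b c": 1, "c d": 2, "d e": 4}
nk = count_of_counts(toy_counts)  # n_1 = 2, n_2 = 1, n_4 = 1

# Hypothetical counts-of-counts for a larger corpus.
y, d1, d2, d3 = modified_discounts(n1=100, n2=40, n3=20, n4=10)
print(round(y, 4), round(d1, 4), round(d2, 4), round(d3, 4))
```

With these toy values, Y = 100/180, so each discount lands a little below the count it applies to.
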
| Type | No. of texts in corpus |
|---|---|
| Train | 373,600 |
| Dev | 100 × 4,000 |
| Test | 400,000 |
| Model | Shapiro-Wilk P-value | Mean Perplexity | Delta | T-test P-value |
|---|---|---|---|---|
| UNPRUNED_BASIC_4G | 0.4934 | 110.6234 | -15.0483 | 1.75e-172 |
| UNPRUNED_MOD_4G | 0.5747 | 125.6716 | -4.9026 | 2.60e-119 |
| HIDE_BASIC_4G | 0.5207 | 130.5743 | -5.1068 | 3.21e-165 |
| HIDE_MOD_4G | 0.5798 | 135.6810 | -1.2578 | 3.04e-26 |
| REMOVE_BASIC_4G | 0.5021 | 136.9388 | -2.2620 | 1.46e-37 |
| UNPRUNED_BASIC_3G | 0.3521 | 139.2009 | -9.0935 | 1.44e-94 |
| REMOVE_BASIC_3G | 0.6195 | 148.2943 | -7.1126 | 4.09e-85 |
| UNPRUNED_MOD_3G | 0.4552 | 155.4069 | -6.5194 | 8.99e-124 |
| HIDE_BASIC_3G | 0.3624 | 161.9264 | -5.5459 | 7.20e-168 |
| HIDE_MOD_3G | 0.4129 | 167.4723 | NA | NA |
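
The table ranks models by mean perplexity and compares each model with the next one. As a rough illustration of the quantities involved, the sketch below computes perplexity from per-word log-probabilities and a t statistic on per-sample differences. It assumes both models were scored on the same dev samples, so a paired comparison is used; the source does not state which t-test variant was applied. All function names and numbers here are hypothetical. A p-value could then be obtained with, for example, `scipy.stats.ttest_rel`, and normality checked with `scipy.stats.shapiro`.

```python
import math
from statistics import mean, stdev


def perplexity(log_probs):
    """Perplexity from the natural-log probabilities the model
    assigned to each word: exp of the negative mean log-probability."""
    return math.exp(-sum(log_probs) / len(log_probs))


def paired_t_statistic(a, b):
    """t statistic for paired samples a and b (same dev samples,
    two models): mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# A model that gives every word probability 1/8 has perplexity
# of approximately 8.
uniform = [math.log(1 / 8)] * 5
pp = perplexity(uniform)

# Made-up per-sample perplexities for two models (not project data).
model_a = [110.0, 112.0, 111.0]
model_b = [125.0, 126.0, 127.0]
delta = mean(model_a) - mean(model_b)  # negative: model_a is better
t = paired_t_statistic(model_a, model_b)
print(pp, delta, t)
```

A negative delta, as in the table, means the model in that row has lower (better) mean perplexity than the one it is compared with.
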