D.Dakhno
06. Oct. 2016
JHU Data Science Specialization Capstone Project (in cooperation with SwitfKey)
Have fun! 頑張ってね~~
Hit Chart Elegy (音楽番付哀歌)
Just Like Great Britain’s–Unfortunately Bibi on Fox News Две ракеты были выпущены по южной территории #Израиль из #СекторГаза
★ 12月30日 Awoi Instore at Zeal Link Shibuya
…fermented red beancurd or hong fu ru 红腐乳…
…to order their debut EP, 플레이걸의 24時.
Xi Lu Han is the full name of Lu Han (루한 / 鹿 晗)
..LOL we may be getting an adlut♥♥ I hope !
He loves YOU!(((: #His<3isbettter
Regular English by itself offers fast endless plenty of grammatical, syntactical and lexical forms and combinations with extremely rich semantic background. In its function of lingua franca in the times of globalization and Internet, it multiplies the variability of the lexical forms an analyst has to deal with.
The most challenging step in the course of analysis of a given text corpus HC corpus for the purposes of prediction was therefore preprocessing of the text, trying to remove the “garbage”, reduce the variability but avoiding extreme bias as a result. The corpus includes patterns from different segments of Internet (news, blogs, Twitter).
The main steps of text preprocessing in this project were:
N-Gramms model has been chosen as a basis for prediction engine. Such models are widely used in statistical natural language processing. They are typically collected from a text or speech corpus. So did we as well (HC Corpus, available over Coursera/SwiftKey).
The following principles were implemented for building n-gram databases:
Using the data bases created, following predictive model has been implemented:
Benchmarking of statistical and technical performance of the Next Word Prediction Engine has been conducted using the n-grams derived from the Corpus of Contemporary American English (one million most frequent pentagrams, case insensitive).
| Mbyte | nr.mono | nr.bi | nr.tri | nr.four | nr.penta | accuracy | by_ranking | mean_time | median_time | sd_time | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 932.87 | 292005 | 1576301 | 4535312 | 3652975 | 5320886 | 0.5981 | 0.5548 | 0.0087 | 0.005 | 0.0317 |
| 2 | 4141.19 | 422469 | 3379408 | 16324126 | 21729539 | 20934044 | 0.6009 | 0.5319 | 0.0082 | 0.005 | 0.0414 |
| 3 | 9078.07 | 861463 | 5360592 | 32419383 | 48965451 | 49787933 | 0.6546 | 0.5281 | 0.0087 | 0.004 | 0.0799 |
| 4 | 8262.96 | 292005 | 1576301 | 16324126 | 48965451 | 49787933 | 0.6523 | 0.5247 | 0.0071 | 0.004 | 0.0321 |
| 5 | 1603.60 | 861463 | 5360592 | 16324126 | 3652975 | 5320886 | 0.6058 | 0.5572 | 0.0102 | 0.005 | 0.0789 |
Larger data bases show about 10% higher accuracy for the price of the ten-fold higher memory consumption (lines 1. to 3. in the table). The comparison of the the right-loaded setup (larger four- and pentagramm databases, smaller for shorter predictors - line 4.) versus the left-loaded reveals the prevalence of the first one (one more time, for the price of RAM consumption).
Some surprisingly, the average execution time of predictions were not affected by the size of the data bases (some higher times in the last setup are apparently due to more drilling down through the data bases needed for prediction). The relative short response times are due to the choice of indexed data tables as a storage mechanism.
For the real world scenarios like typing in a phrase letter by letter caching mechanisms have been implemented, reducing the load on the data bases and response times of fuzzy search (queries with %like%), especially of in-word prediction.
The results are noway the final truth, but pioneer the way to the better and parsimonious word prediction models and implementations.