Data Science Capstone: Predictive Typing

Yunfeng Xi
04/11/15

Features of data product

Word self-completion

As you type, the app shows the ten highest-frequency words that start with what you have typed, displayed as a word cloud. Font size is scaled to word frequency.

For example, if you type “trans”, the word cloud looks like this:

fig1
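
The lookup behind this feature can be sketched in a few lines of Python. This is only an illustration under assumptions, not the app's actual code: the function names complete and font_sizes, the toy counts, and the linear font scaling are all hypothetical.

```python
from collections import Counter


def complete(prefix, freq, k=10):
    """Return up to k (word, count) pairs that start with prefix, most frequent first."""
    matches = [(w, c) for w, c in freq.items() if w.startswith(prefix)]
    # Sort by descending count, then alphabetically so ties have a stable order.
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return matches[:k]


def font_sizes(matches, min_size=10, max_size=40):
    """Scale font size linearly with count (the linear scaling is an assumption)."""
    if not matches:
        return {}
    top = max(c for _, c in matches)
    return {w: min_size + (max_size - min_size) * c / top for w, c in matches}


# Toy frequency table with made-up counts, for illustration only.
toy_freq = Counter({"transfer": 50, "transport": 40, "translate": 30,
                    "train": 80, "transit": 30})
top_words = complete("trans", toy_freq, k=3)
sizes = font_sizes(top_words)
```

A linear scan over the vocabulary is enough for a sketch; a sorted word list or a trie would be a natural choice for a large corpus.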

Features of data product

Word self-correction

To keep things simple, only typos within an edit distance of one from the intended word are considered. There are four kinds of such typos:

  • one letter missing
  • one extra letter
  • two adjacent letters swapped
  • one letter replaced

If you type “mistkae”, you will see the plot on the right.

fig2
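
The four kinds of typos listed above can be undone by generating every string one edit away from the input and keeping those found in the vocabulary. The sketch below follows that well-known approach; it is an assumption about how such a corrector could work, and the names edits1 and correct as well as the toy counts are hypothetical.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return all strings that are exactly one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Drop one letter: fixes a typo that has one extra letter.
    deletes = [left + right[1:] for left, right in splits if right]
    # Add one letter: fixes a typo that is missing one letter.
    inserts = [left + c + right for left, right in splits for c in alphabet]
    # Swap two adjacent letters back.
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    # Replace one letter.
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    return set(deletes + inserts + swaps + replaces)


def correct(word, freq, k=10):
    """Return up to k known words one edit away from word, most frequent first."""
    candidates = [w for w in edits1(word) if w in freq and w != word]
    candidates.sort(key=lambda w: (-freq[w], w))
    return candidates[:k]


# Toy vocabulary with made-up counts, for illustration only.
toy_vocab = {"mistake": 120, "mistaken": 15, "mistook": 8}
suggestions = correct("mistkae", toy_vocab)
```

Here “mistkae” is matched to “mistake” by swapping the adjacent letters “k” and “a”.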

Features of data product

Next word prediction

To reduce the time needed to load the corpus, only n-grams up to 3-grams are considered. If fewer than ten candidates are found in the corpus, the algorithm backs off and searches the 2-grams; if there are still fewer than ten candidates, it searches the 1-grams until there are ten words in the plot. If you type “university of ”, you will see:

fig3
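
The back-off search described above can be sketched as follows. This is a minimal illustration, not the app's actual code: build_model and predict are hypothetical names, the toy token list is made up, and filling the remaining slots with lower-order candidates is an assumption about how the results are merged.

```python
from collections import Counter, defaultdict


def build_model(tokens, max_n=3):
    """Map each context of 0 to max_n - 1 words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for n in range(max_n):  # context lengths 0, 1, ..., max_n - 1
            if i - n >= 0:
                model[tuple(tokens[i - n:i])][word] += 1
    return model


def predict(text, model, k=10, max_n=3):
    """Collect up to k next-word candidates, backing off from 3-grams to 1-grams."""
    words = text.lower().split()
    result = []
    for n in range(max_n - 1, -1, -1):  # longest context first
        if n > len(words):
            continue
        context = tuple(words[len(words) - n:])
        for candidate, _ in model.get(context, Counter()).most_common():
            if candidate not in result:
                result.append(candidate)
                if len(result) == k:
                    return result
    return result


# Toy corpus, for illustration only.
toy_tokens = ("the university of california and the university of california "
              "and the university of texas").split()
toy_model = build_model(toy_tokens)
predictions = predict("university of ", toy_model, k=3)
```

In this toy example the 3-gram level yields only two candidates, so the third slot is filled from a lower-order level.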

Accuracy test

Below are ten phrases randomly drawn from Twitter. For 4 out of 10, the expected word appears in the plot, meaning it ranks among the top ten candidates. This small sample only illustrates how the test was done; on a larger sample, the accuracy is no more than 20%.

fig4
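
A top-ten hit rate of this kind could be computed as sketched below. The sketch assumes that the last word of each phrase is held out and counted as a hit when it appears among the first k predictions; the name top_k_accuracy, the stub predictor and the sample phrases are hypothetical and are not the test data shown above.

```python
def top_k_accuracy(phrases, predictor, k=10):
    """Fraction of phrases whose held-out last word is among the top k predictions."""
    hits = 0
    total = 0
    for phrase in phrases:
        words = phrase.lower().split()
        if len(words) < 2:
            continue  # need at least one context word plus a target word
        context, target = " ".join(words[:-1]), words[-1]
        total += 1
        if target in predictor(context)[:k]:
            hits += 1
    return hits / total if total else 0.0


def stub_predictor(context):
    """Stand-in for the real predictor: always returns the same two words."""
    return ["of", "the"]


# Made-up phrases, for illustration only.
sample_phrases = ["one of", "kind of", "see you"]
accuracy = top_k_accuracy(sample_phrases, stub_predictor)
```

With the stub predictor, two of the three sample phrases count as hits.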