In this report, we describe a plan for producing an application that recommends the most likely next word. Using a dataset from SwiftKey, we first conducted an exploratory analysis. We then constructed a prediction model by combining a trigram model with Katz’s backoff model, which addresses the zero-probability problem for unseen n-grams. Based on this model, we describe an application that presents the 3 most likely next words to users.
The dataset used in this report is from SwiftKey and contains corpora obtained from blogs, news, and Twitter in each of 4 languages (German, English, Finnish, and Russian). Among them, we used the English dataset. An example of an English news document is as follows.
‘The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.’
The total number of documents (lines) in each of the 3 files (blogs, news, Twitter) is as follows.
## [1] "blog : 899288"
## [1] "news : 77259"
## [1] "twitter : 2360148"
When we sample 1,000 lines from each file, the total number of unigram (word) tokens is
## [1] "blog : 32090"
## [1] "news : 24764"
## [1] "twitter : 9987"
## [1] "total : 66841"
And when we sample 1,000 lines from each of the 3 files, the vocabulary size in each file and in total is
## [1] "blog : 3053"
## [1] "news : 3192"
## [1] "twitter : 1657"
## [1] "total : 5216"
So, how many words in the vocabulary account for most of the total frequency?
## [1] 5216
## [1] "33 words among 5216 words make up 50 % of the total frequency"
## [1] "66 words among 5216 words make up 60 % of the total frequency"
## [1] "144 words among 5216 words make up 70 % of the total frequency"
## [1] "365 words among 5216 words make up 80 % of the total frequency"
## [1] "1078 words among 5216 words make up 90 % of the total frequency"
This result suggests an opportunity to improve runtime efficiency by reducing the size of the vocabulary.
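The coverage figures above can be computed from the cumulative distribution of word frequencies. A minimal sketch, reusing `tokens` from the sampling sketch above (the helper name `coverage` is ours):

```r
# number of most-frequent words needed to reach a share p of all token occurrences
coverage <- function(tokens, p) {
  freq <- sort(table(tokens), decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)
  which(cum_share >= p)[1]
}

words <- unlist(tokens)   # all sampled tokens pooled across the 3 files
for (pct in c(50, 60, 70, 80, 90)) {
  print(sprintf("%d words among %d words make up %d %% of the total frequency",
                coverage(words, pct / 100), length(unique(words)), pct))
}
```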
The distributions of the number of tokens per line in each of the 3 files are as follows.
And the maximum line length (in tokens) is
## [1] "blog : 421"
## [1] "news : 143"
## [1] "twitter : 27"
The most frequent unigrams in each of the 3 files are
We can see that most of the frequent words are stopwords (e.g., ‘the’, ‘a’). So, let’s check the frequencies again without stopwords!
And let’s check the frequency of bigrams!
Let’s also check trigrams.
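A sketch of these frequency checks, reusing `words` from the coverage sketch. The stopword list from the `tm` package and the `make_ngrams()` helper are assumptions; for simplicity, n-grams are built over the concatenated token stream, ignoring line boundaries.

```r
library(tm)

# slide a window of length n over the token vector and paste each window into one string
make_ngrams <- function(x, n) {
  if (length(x) < n) return(character(0))
  sapply(seq_len(length(x) - n + 1), function(i) paste(x[i:(i + n - 1)], collapse = " "))
}

words_nostop <- words[!words %in% stopwords("en")]

head(sort(table(words_nostop), decreasing = TRUE), 10)           # top unigrams without stopwords
head(sort(table(make_ngrams(words, 2)), decreasing = TRUE), 10)  # top bigrams
head(sort(table(make_ngrams(words, 3)), decreasing = TRUE), 10)  # top trigrams
```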
An ‘n-gram model’ assumes that the probability of a word depends not on all of the previous words, but only on the previous (n-1) words. For example, given 5 words ‘a’, ‘b’, ‘c’, ‘d’, ‘e’ and the document ‘abcde’, under a trigram (n = 3) model the probability that ‘e’ occurs after ‘abcd’ can be expressed as ‘P(e|abcd) = P(e|cd)’. Likewise, the probability that ‘d’ occurs after ‘abc’ can be expressed as ‘P(d|bc)’.
Among the various n-gram models (depending on the value of n), we will use the trigram (3-gram) model for the application. That is, we will calculate the probability that the next word is ‘w3’ using the previous 2 words (‘w1w2’).
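As a concrete illustration, the maximum-likelihood trigram estimate P(w3|w1w2) = c(w1w2w3)/c(w1w2) could be computed as follows; this sketch reuses `words` and `make_ngrams()` from above, and the counts it builds are reused in later sketches.

```r
bigram_freq  <- table(make_ngrams(words, 2))
trigram_freq <- table(make_ngrams(words, 3))

p_trigram <- function(w1, w2, w3) {
  bi  <- bigram_freq[paste(w1, w2)]
  tri <- trigram_freq[paste(w1, w2, w3)]
  if (is.na(bi)) return(NA_real_)   # 'w1 w2' never seen: MLE undefined
  if (is.na(tri)) tri <- 0          # 'w1 w2 w3' never seen: zero probability
  as.numeric(tri) / as.numeric(bi)
}

p_trigram("one", "of", "the")   # value depends on the sampled training data
```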
### Katz’s Backoff Model

Even though a massive variety of n-grams can occur in the real world, an n-gram model assigns zero probability to n-grams that have not been seen in the training set. To deal with this problem, we will use ‘Katz’s backoff model’ along with the trigram model. This model estimates the probability of unseen n-grams by redistributing probability mass taken from n-grams seen in the training set. The specific estimation method is as follows.
For a 3-gram ‘w1w2w3’, where ‘w1w2’ is given, we can divide the situation into 3 cases according to whether any ‘w1w2w3’ exists in the training set.
When ‘w1w2’ is given (e.g. ‘ab’):

1. Case where some ‘w1w2w3’ exists in the training set (in the form ‘abc’).
2. Case where no ‘w1w2w3’ exists, but some ‘w2w3’ exists (in the form ‘dbc’).
3. Case where even ‘w2w3’ does not exist (in the form ‘dec’).
In each case, the probability Ps(w3|w1w2) is calculated as follows, where Ps() denotes an estimated probability:

- Ps(A|B) = d x P(A|B)
- P(A|B) = c(A&B)/c(B), where c(k) is the frequency of k in the training set.
1. Case where some ‘w1w2w3’ exists in the training set (in the form ‘abc’).

1-1) For ‘w3’ such that ‘w1w2w3’ exists given ‘w1w2’ (any ‘c’ of ‘abc’):

Ps(w3|w1w2) = d x P(w3|w1w2)

Multiplying by the constant takes some probability mass away from the ‘w1w2w3’ seen in the training set. This is called ‘discounting’, and the constant ‘d’ is referred to as the ‘discounting rate’.
1-2) For ‘w3’ in the rest of the vocabulary such that ‘w1w2w3’ does not exist, but ‘w2w3’ exists with some other ‘w1’ (any ‘c’ of ‘dbc’):

Ps(w3|w1w2) = a x Ps(w3|w2)

This redistributes the probability mass taken in 1-1) to the ‘w3’ of case 1-2), in proportion to ‘Ps(w3|w2)’.
2. Case where no ‘w1w2w3’ exists, but some ‘w2w3’ exists (in the form ‘dbc’).

2-1) For ‘w3’ that have ‘w2w3’ (in the form ‘dbc’):

Ps(w3|w1w2) is estimated to be Ps(w3|w2).
2-2) For ‘w3’ in the rest of the vocabulary that do not have ‘w2w3’ (in the form ‘dec’):

Ps(w3|w1w2) = a x Ps(w3)

This likewise redistributes the probability mass taken in 2-1) to the ‘w3’ of case 2-2), in proportion to ‘Ps(w3)’.
3. Case where even ‘w2w3’ does not exist (in the form ‘dec’).

Ps(w3|w1w2) is estimated to be Ps(w3).
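To make the three cases concrete, here is a compact sketch in base R. It is a simplification of the scheme described above rather than the full Katz formulation: a single assumed discounting rate `d` is used, and the back-off weight (the constant ‘a’) is obtained by renormalising the leftover probability mass over all backed-off candidates at once. `words`, `bigram_freq`, and `trigram_freq` come from the earlier sketches.

```r
unigram_freq <- table(words)
vocab <- names(unigram_freq)

# Estimated distribution Ps(. | w1 w2) over the whole vocabulary.
predict_dist <- function(w1, w2, d = 0.5) {            # d: assumed discounting rate
  tri_counts <- trigram_freq[paste(w1, w2, vocab)]     # c(w1 w2 w3) for every candidate w3
  bi_counts  <- bigram_freq[paste(w2, vocab)]          # c(w2 w3)
  uni_p      <- as.numeric(unigram_freq[vocab]) / sum(unigram_freq)  # P(w3)

  seen_tri <- !is.na(tri_counts)             # case 1-1: 'w1 w2 w3' seen
  seen_bi  <- !seen_tri & !is.na(bi_counts)  # cases 1-2 / 2-1: back off to 'w2 w3'
  unseen   <- !seen_tri & !seen_bi           # cases 2-2 / 3: back off to the unigram

  p <- setNames(numeric(length(vocab)), vocab)
  c_w1w2 <- bigram_freq[paste(w1, w2)]
  c_w2   <- unigram_freq[w2]

  if (!is.na(c_w1w2)) p[seen_tri] <- d * as.numeric(tri_counts[seen_tri]) / as.numeric(c_w1w2)
  if (!is.na(c_w2))   p[seen_bi]  <- d * as.numeric(bi_counts[seen_bi])  / as.numeric(c_w2)
  p[unseen] <- uni_p[unseen]

  # redistribute the leftover mass (freed by discounting the seen trigrams)
  # over the backed-off candidates, in proportion to their backed-off estimates
  leftover <- 1 - sum(p[seen_tri])
  backoff  <- !seen_tri
  p[backoff] <- leftover * p[backoff] / sum(p[backoff])
  p
}
```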
In conclusion, by combining the trigram model and Katz’s backoff model, the next word is predicted by estimating the probabilities of both seen and unseen 3-grams given the previous 2 words.
Our application presents the most likely next words, based on the previous two words, while the user is typing. The specific operation process is as follows.
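As an illustration only, a hypothetical helper built on `predict_dist()` from the sketch above could return the suggestions shown to the user; the function name is an assumption, and the choice of three candidates matches the description at the start of the report.

```r
# Return the k most likely next words given the previous two typed words.
suggest_next <- function(w1, w2, k = 3) {
  p <- predict_dist(w1, w2)
  names(sort(p, decreasing = TRUE))[seq_len(k)]
}

suggest_next("one", "of")   # the actual suggestions depend on the sampled training data
```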