In this report, we describe a plan for producing an application that recommends the most likely next word. Using a dataset from SwiftKey, we first conducted an exploratory analysis. We then constructed a prediction model by combining a trigram model with Katz’s backoff model, which addresses the zero-probability problem for unseen n-grams. Based on this model, we describe an application that presents the 3 most likely next words to users.
The dataset used in this report is from SwiftKey and contains corpora obtained from blogs, news, and Twitter in each of 4 languages (German, English, Finnish, and Russian). Among them, we used the English dataset. An example of an English news document is as follows.
‘The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.’
The total number of documents (lines) in each of the 3 files (blogs, news, Twitter) is as follows.
## [1] "blog : 899288"
## [1] "news : 77259"
## [1] "twitter : 2360148"
When we sample 1,000 lines from each file, the total number of unigram (word) tokens is
## [1] "blog : 32090"
## [1] "news : 24764"
## [1] "twitter : 9987"
## [1] "total : 66841"
And when we sample 1,000 lines from each of the 3 files, the vocabulary size in each file and in total is
## [1] "blog : 3053"
## [1] "news : 3192"
## [1] "twitter : 1657"
## [1] "total : 5216"
So, how many words in the vocabulary account for most of the total frequency?
## [1] 5216
## [1] "33 words among 5216 words make up 50 % of the total frequency"
## [1] "66 words among 5216 words make up 60 % of the total frequency"
## [1] "144 words among 5216 words make up 70 % of the total frequency"
## [1] "365 words among 5216 words make up 80 % of the total frequency"
## [1] "1078 words among 5216 words make up 90 % of the total frequency"
This result suggests an opportunity to improve runtime efficiency by reducing the size of the vocabulary.
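The coverage figures above can be computed from the cumulative distribution of word frequencies. A minimal sketch, reusing `tokens` from the sampling sketch above (the helper name `coverage` is ours):

```r
# number of most-frequent words needed to reach a share p of all token occurrences
coverage <- function(tokens, p) {
  freq <- sort(table(tokens), decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)
  which(cum_share >= p)[1]
}

words <- unlist(tokens)   # all sampled tokens pooled across the 3 files
for (pct in c(50, 60, 70, 80, 90)) {
  print(sprintf("%d words among %d words make up %d %% of the total frequency",
                coverage(words, pct / 100), length(unique(words)), pct))
}
```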
The distributions of the number of tokens per line in each of the 3 files are as follows.
And the maximum line length (in tokens) is
## [1] "blog : 421"
## [1] "news : 143"
## [1] "twitter : 27"
The most frequent unigrams in each of the 3 files are
We can see that most of the frequent words are stopwords (e.g., ‘the’, ‘a’). So, let’s check the frequencies again without stopwords!
And let’s check the frequency of bigrams!
Let’s also check trigrams.
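A sketch of these frequency checks, reusing `words` from the coverage sketch. The stopword list from the `tm` package and the `make_ngrams()` helper are assumptions; for simplicity, n-grams are built over the concatenated token stream, ignoring line boundaries.

```r
library(tm)

# slide a window of length n over the token vector and paste each window into one string
make_ngrams <- function(x, n) {
  if (length(x) < n) return(character(0))
  sapply(seq_len(length(x) - n + 1), function(i) paste(x[i:(i + n - 1)], collapse = " "))
}

words_nostop <- words[!words %in% stopwords("en")]

head(sort(table(words_nostop), decreasing = TRUE), 10)           # top unigrams without stopwords
head(sort(table(make_ngrams(words, 2)), decreasing = TRUE), 10)  # top bigrams
head(sort(table(make_ngrams(words, 3)), decreasing = TRUE), 10)  # top trigrams
```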
An ‘n-gram model’ assumes that the probability of a word depends not on all of the previous words, but only on the previous (n-1) words. For example, given 5 words ‘a’, ‘b’, ‘c’, ‘d’, ‘e’ and the document ‘abcde’, under a trigram (n = 3) model the probability that ‘e’ occurs after ‘abcd’ can be expressed as ‘P(e|abcd) = P(e|cd)’. Likewise, the probability that ‘d’ occurs after ‘abc’ can be expressed as ‘P(d|bc)’.
Among the various n-gram models (depending on the value of n), we will use the trigram (3-gram) model for the application. That is, we will calculate the probability that the next word is ‘w3’ using the previous 2 words (‘w1w2’).
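As a concrete illustration, the maximum-likelihood trigram estimate P(w3|w1w2) = c(w1w2w3)/c(w1w2) could be computed as follows; this sketch reuses `words` and `make_ngrams()` from above, and the counts it builds are reused in later sketches.

```r
bigram_freq  <- table(make_ngrams(words, 2))
trigram_freq <- table(make_ngrams(words, 3))

p_trigram <- function(w1, w2, w3) {
  bi  <- bigram_freq[paste(w1, w2)]
  tri <- trigram_freq[paste(w1, w2, w3)]
  if (is.na(bi)) return(NA_real_)   # 'w1 w2' never seen: MLE undefined
  if (is.na(tri)) tri <- 0          # 'w1 w2 w3' never seen: zero probability
  as.numeric(tri) / as.numeric(bi)
}

p_trigram("one", "of", "the")   # value depends on the sampled training data
```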
### Katz’s Backoff Model

Even though a massive variety of n-grams can occur in the real world, an n-gram model assigns zero probability to n-grams that have not been seen in the training set. To deal with this problem, we will use ‘Katz’s backoff model’ along with the trigram model. This model estimates the probability of unseen n-grams by redistributing probability mass taken from n-grams seen in the training set. The specific estimation method is as follows.
For a 3-gram ‘w1w2w3’, where ‘w1w2’ is given, we can divide the situation into 3 cases according to whether any ‘w1w2w3’ exists in the training set.
When ‘w1w2’ is given (e.g. ‘ab’):

1. Case where some ‘w1w2w3’ exists in the training set (in the form ‘abc’).
2. Case where no ‘w1w2w3’ exists, but some ‘w2w3’ exists (in the form ‘dbc’).
3. Case where even ‘w2w3’ does not exist (in the form ‘dec’).
In each case, the probability Ps(w3|w1w2) is calculated as follows, where Ps() denotes an estimated probability:

- Ps(A|B) = d x P(A|B)
- P(A|B) = c(A&B)/c(B), where c(k) is the frequency of k in the training set.
1. Case where some ‘w1w2w3’ exists in the training set (in the form ‘abc’).

1-1) For ‘w3’ such that ‘w1w2w3’ exists given ‘w1w2’ (any ‘c’ of ‘abc’):

Ps(w3|w1w2) = d x P(w3|w1w2)

Multiplying by the constant takes some probability mass away from the ‘w1w2w3’ seen in the training set. This is called ‘discounting’, and the constant ‘d’ is referred to as the ‘discounting rate’.
1-2) For ‘w3’ in the rest of the vocabulary such that ‘w1w2w3’ does not exist, but ‘w2w3’ exists with some other ‘w1’ (any ‘c’ of ‘dbc’):

Ps(w3|w1w2) = a x Ps(w3|w2)

This redistributes the probability mass taken in 1-1) to the ‘w3’ of case 1-2), in proportion to ‘Ps(w3|w2)’.
2. Case where no ‘w1w2w3’ exists, but some ‘w2w3’ exists (in the form ‘dbc’).

2-1) For ‘w3’ that have ‘w2w3’ (in the form ‘dbc’):

Ps(w3|w1w2) is estimated to be Ps(w3|w2).
2-2) For ‘w3’ in the rest of the vocabulary that do not have ‘w2w3’ (in the form ‘dec’):

Ps(w3|w1w2) = a x Ps(w3)

This likewise redistributes the probability mass taken in 2-1) to the ‘w3’ of case 2-2), in proportion to ‘Ps(w3)’.
3. Case where even ‘w2w3’ does not exist (in the form ‘dec’).

Ps(w3|w1w2) is estimated to be Ps(w3).
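To make the three cases concrete, here is a compact sketch in base R. It is a simplification of the scheme described above rather than the full Katz formulation: a single assumed discounting rate `d` is used, and the back-off weight (the constant ‘a’) is obtained by renormalising the leftover probability mass over all backed-off candidates at once. `words`, `bigram_freq`, and `trigram_freq` come from the earlier sketches.

```r
unigram_freq <- table(words)
vocab <- names(unigram_freq)

# Estimated distribution Ps(. | w1 w2) over the whole vocabulary.
predict_dist <- function(w1, w2, d = 0.5) {            # d: assumed discounting rate
  tri_counts <- trigram_freq[paste(w1, w2, vocab)]     # c(w1 w2 w3) for every candidate w3
  bi_counts  <- bigram_freq[paste(w2, vocab)]          # c(w2 w3)
  uni_p      <- as.numeric(unigram_freq[vocab]) / sum(unigram_freq)  # P(w3)

  seen_tri <- !is.na(tri_counts)             # case 1-1: 'w1 w2 w3' seen
  seen_bi  <- !seen_tri & !is.na(bi_counts)  # cases 1-2 / 2-1: back off to 'w2 w3'
  unseen   <- !seen_tri & !seen_bi           # cases 2-2 / 3: back off to the unigram

  p <- setNames(numeric(length(vocab)), vocab)
  c_w1w2 <- bigram_freq[paste(w1, w2)]
  c_w2   <- unigram_freq[w2]

  if (!is.na(c_w1w2)) p[seen_tri] <- d * as.numeric(tri_counts[seen_tri]) / as.numeric(c_w1w2)
  if (!is.na(c_w2))   p[seen_bi]  <- d * as.numeric(bi_counts[seen_bi])  / as.numeric(c_w2)
  p[unseen] <- uni_p[unseen]

  # redistribute the leftover mass (freed by discounting the seen trigrams)
  # over the backed-off candidates, in proportion to their backed-off estimates
  leftover <- 1 - sum(p[seen_tri])
  backoff  <- !seen_tri
  p[backoff] <- leftover * p[backoff] / sum(p[backoff])
  p
}
```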
In conclusion, by combining the trigram model and Katz’s backoff model, the next word is predicted by estimating the probabilities of both seen and unseen 3-grams given the previous 2 words.
Our application presents the most likely next words, based on the previous two words, while the user is typing. The specific operation process is as follows.
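As an illustration only, a hypothetical helper built on `predict_dist()` from the sketch above could return the suggestions shown to the user; the function name is an assumption, and the choice of three candidates matches the description at the start of the report.

```r
# Return the k most likely next words given the previous two typed words.
suggest_next <- function(w1, w2, k = 3) {
  p <- predict_dist(w1, w2)
  names(sort(p, decreasing = TRUE))[seq_len(k)]
}

suggest_next("one", "of")   # the actual suggestions depend on the sampled training data
```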