By 2017, over a third of the world’s population is projected to own a smartphone, an estimated total of almost 2.6 billion smartphone users in the world. Many will use it to send messages to friends and family, but the text prediction feature doesn’t work as expected sometimes as you can see in this image.
This project will dive into text prediction algorithms to understand them and attempt to create a better algorithm so that billions of users can communicate clearly with their smartphones. However, I hope text prediction doesn’t get too accurate because everyone can use a good laugh now and then.
To predict text, we use a language model based on the probabilities a word would be used, given other words are used at the same time. We analyse a corpus or a large set of text and find all words and their frequencies or the number of times they appear in the corpus. We build tables with the counts of single words, pairs and triplets of words (also called N-grams), to be used to compute the probability of the next word.
We’ll use data from HC Corpora which has collected 2,580,000,000 words from 67 languages at last count. This site used a progam called a web crawler to download text and tag it with publication date, website, entry type, and subject. The text was categorized as news, blog entries, or Twitter posts and packaged in three files. We’ll combine the three files into one and choose a 5% sample size of the 4,269,678 lines of text.
| Size | Filename | Line Count | Word Count | |
|---|---|---|---|---|
| 210MB | en_US.blogs.txt | 899,288 | 37,334,690 | |
| 206MB | en_US.news.txt | 1,010,242 | 34,372,720 | |
| 167MB | en_US.twitter.txt | 2,360,148 | 30,374,206 | |
| Totals | 583MB | 4,269,678 | 102,081,616 |
Table 1 - Dataset summary statistics
Text were divided further into the following sub-categories.
Table 2 - Alphabetical list of the sub-categories
The next table shows the web sites which were used and the number of lines of text downloaded. The top three sites were cleveland.com, nj.com, and stltoday.com. There were 24,595 unique blogs from blogspot and 19,160 unique blogs from wordpress. 169,230 unique Twitter users were found in the Twitter data.
| Web Site | Count | |
|---|---|---|
| nj.com | 19,250 | |
| freep.com | 4,830 | |
| startribune.com | 4,208 | |
| ajc.com | 1,400 | |
| azcentral.com | 5,531 | |
| sfgate.com | 6,398 | |
| oregonlive.com | 15,120 | |
| cleveland.com | 19,390 | |
| accessatlanta.com | 82 | |
| tampabay.com | 268 | |
| indystar.com | 1,661 | |
| baltimoresun.com | 6,803 | |
| stltoday.com | 15,549 | |
| denverpost.com | 3,878 | |
| philly.com | 779 | |
| mercurynews.com | 3,400 | |
| usatoday.com | 972 | |
| kansascity.com | 1,448 | |
| bostonherald.com | 531 | |
| suntimes.com | 3,229 | |
| siliconvalley.com | 246 | |
| orlandosentinel.com | 1,913 | |
| wsj.com | 2,518 | |
| utsandiego.com | 2,563 | |
| chron.com | 457 | |
| nydailynews.com | 737 | |
| sacbee.com | 2,516 | |
| latimes.com | 9,446 | |
| chicagotribune.com | 1,809 | |
| ocregister.com | 3,307 | |
| nypostonline.com | 1,448 | |
| blogspot.com | 97,607 (24,595 unique blogs) | |
| wordpress.com | 54,133 (19,160 unique blogs) | |
| twitter.com | 2,360,148 (169,230 unique users) |
Table 3 - Lines of text counts by web site
With such a vast amount of text from Twitter, news and blogging sites, one can imagine that it’s full of all kinds of information such as stories, dates, addresses, bible verse numbers, and stock tickers. But it also contains messy data in the form of profanity, typos, and foreign languages. Additionally, punctuation marks, email addresses, and Twitter hashtags were discovered.
We’ll clean up the dataset by removing punctuation marks, characters in email addresses, numbers, profanity, and common word endings like “ing” and “es”. Typos can be corrected with a spell checking program.
My sample size contained 219,893 lines of text and after cleaning, I found 148,816 words or unigrams. The minimum and median word counts were 1 with a mean of 27.26, 3rd quartile of 4 and a maximum of 244,900. The standard deviation of the words was 818.093. There are 84,018 words that appear only one time in the corpus, with more than 86% of words appearing ten times or less. A histogram also shows the word frequency distribution is heavily right skewed. We create lists of the 20 least frequent and most frequent words from the corpus. Words that appear a significant number of times are plotted and a word cloud of the most frequently used words is also shown. The counts of word pairs (bigrams) and triplets (trigrams) are summarized in the same way.
## [1] 148816
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 27.26 4.00 244900.00
## [1] 818.093
## [1] 84018
## [1] 0.864571
## aaaa aaaaaaaaaaaaaaaaaaaaaack aaaaaahhhh
## 1 1 1
## aaaaaahhhhhh aaaaaahhhhhi aaaaagggghhhh
## 1 1 1
## aaaah aaaahhhhh aaaalll
## 1 1 1
## aaaallllrrreeeaaadddyyi aaaammm aaaand
## 1 1 1
## aaaannnd aaagh aaand
## 1 1 1
## aaaph aaarrgh aaarrrgggghhhh
## 1 1 1
## aaas aaawwww
## 1 1
## just his said will they all its from not but
## 15550 15683 15699 15979 16360 16810 18521 19349 21287 24604
## are this have was with you that for and the
## 25110 27415 27504 32318 36699 48294 53700 56925 123990 244882
I found 1,754,566 bigrams, with a minimum and median of 1, and a mean of 2.929, 3rd quartile of 1 and a maximum of 22,030. The standard deviation of the bigram distribution was 39.72995. There are 1,356,861 bigrams that appear only one time in the corpus, with more than 97% of bigrams appearing ten times or less.
## [1] 1754566
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 2.929 1.000 22030.000
## [1] 39.72995
## [1] 1356861
## [1] 0.9713867
## a aaa a aauw a ablum a absolut
## 1 1 1 1
## a abv a acc a accomplishments a account
## 1 1 1 1
## a acheat a ackerman a acrosstheboard a actually
## 1 1 1 1
## a ad a aday a addiction a adding
## 1 1 1 1
## a adkins a admin a admired a adult
## 1 1 1 1
## will be it is with a and i from the i was i have for a
## 4140 4214 4222 4258 4381 4474 4511 4907
## it was is a with the in a and the at the to be on the
## 4997 5165 5296 6248 6528 7226 8332 10209
## for the to the in the of the
## 10318 11088 21387 22031
I found 3,849,394 trigrams, with a minimum of 1, median of 1, and a mean of 1.335, 3rd quartile of 1 and a maximum of 1,819. The standard deviation of the trigrams count was 3.830097. There are 3,478,219 trigrams that appear only one time in the corpus, with more than 99% of trigrams appearing ten times or less.
## [1] 3849394
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.335 1.000 1819.000
## [1] 3.830097
## [1] 3478219
## [1] 0.9943334
## a a allstar a a and a a bit a a blast a a bowl
## 1 1 1 1 1
## a a commerci a a commodity a a days a a dream a a for
## 1 1 1 1 1
## a a good a a google a a i a a lot a a map
## 1 1 1 1 1
## a a pair a a part a a quick a a real a a report
## 1 1 1 1 1
## im going to the rest of thank you for
## 531 534 544
## i dont know looking forward to i have a
## 581 592 610
## i have to be able to some of the
## 611 652 660
## part of the as well as the end of
## 667 694 709
## i want to out of the it was a
## 739 768 770
## going to be to be a thanks for the
## 857 955 1183
## a lot of one of the
## 1603 1819
My algorithm will work the same way as most smartphone’s text messaging applications do.
Step 1. Show the top 3 words most likely to start a sentence.
Step 2. Process a word from the user and present the top 3 words most likely to follow all words entered so far.
Step 3. Continue Step 2 until the user hits send.
Example
Show I The and I’m
User chooses I, and app shows love, don’t and just
User chooses I love, and app shows you, it, and the
User chooses I love you and app shows so, so much, and too.
The algorithm will need to use the n-gram counts from the corpus and compute the probabilities of the next word, given each n-gram entered by the user and display the ones with the three highest probabilities. Because of the many n-gram combinations possible in the English language, we’ll need to borrow an assumption from a mathematician named Andrey Markov which says we can estimate the probabilities of these n-grams by using the last few n-grams entered instead of all the n-grams. This reduces the number of possibilities the algorithm needs to consider.
The corpus may not contain the n-grams entered by the user because it has a limited vocabulary due the memory size limitation of the smartphone. Another drawback is that the model knows the n-grams but hasn’t counted them which results in a zero probability when predicting a word following those n-grams.
To evaluate how well our model performs text prediction, we’ll compute an indicator called perplexity using a test set that was unseen when training our model. It is the probability of the test set normalized by the number of words. The lower the perplexity, the better the model is.
How the developer chooses to handle these issues will determine the user experience of the text prediction application.