Executive Summary

By 2017, over a third of the world’s population is projected to own a smartphone, an estimated total of almost 2.6 billion smartphone users in the world. Many will use it to send messages to friends and family, but the text prediction feature doesn’t work as expected sometimes as you can see in this image.

Apartment Hunt

This project will dive into text prediction algorithms to understand them and attempt to create a better algorithm so that billions of users can communicate clearly with their smartphones. However, I hope text prediction doesn’t get too accurate because everyone can use a good laugh now and then.

To predict text, we use a language model based on the probabilities a word would be used, given other words are used at the same time. We analyse a corpus or a large set of text and find all words and their frequencies or the number of times they appear in the corpus. We build tables with the counts of single words, pairs and triplets of words (also called N-grams), to be used to compute the probability of the next word.

The Dataset

We’ll use data from HC Corpora which has collected 2,580,000,000 words from 67 languages at last count. This site used a progam called a web crawler to download text and tag it with publication date, website, entry type, and subject. The text was categorized as news, blog entries, or Twitter posts and packaged in three files. We’ll combine the three files into one and choose a 5% sample size of the 4,269,678 lines of text.

Size Filename Line Count Word Count
210MB en_US.blogs.txt 899,288 37,334,690
206MB en_US.news.txt 1,010,242 34,372,720
167MB en_US.twitter.txt 2,360,148 30,374,206
Totals 583MB 4,269,678 102,081,616

Table 1 - Dataset summary statistics

Text were divided further into the following sub-categories.

  1. Armed Forces
  2. Arts & Culture
  3. Blogs
  4. Business & Economy
  5. Crime & Law
  6. Education
  7. Emergency & Disaster
  8. Entertainment
  9. Environment
  10. Family
  11. Food
  12. Health
  13. Holidays (Christmas, Hannukkah, Ramadan, etc.)
  14. Home & Garden
  15. International News (compared to the nation of the language)
  16. Leisure Time (hobbies, etc.)
  17. Lifestyle & Fashion
  18. Local news
  19. No subject
  20. Obituaries
  21. Politics
  22. Recipes
  23. Religion
  24. Science & Technology
  25. Sport
  26. Transport
  27. Travel
  28. Weather
  29. WWII

Table 2 - Alphabetical list of the sub-categories

The next table shows the web sites which were used and the number of lines of text downloaded. The top three sites were cleveland.com, nj.com, and stltoday.com. There were 24,595 unique blogs from blogspot and 19,160 unique blogs from wordpress. 169,230 unique Twitter users were found in the Twitter data.

Web Site Count
nj.com 19,250
freep.com 4,830
startribune.com 4,208
ajc.com 1,400
azcentral.com 5,531
sfgate.com 6,398
oregonlive.com 15,120
cleveland.com 19,390
accessatlanta.com 82
tampabay.com 268
indystar.com 1,661
baltimoresun.com 6,803
stltoday.com 15,549
denverpost.com 3,878
philly.com 779
mercurynews.com 3,400
usatoday.com 972
kansascity.com 1,448
bostonherald.com 531
suntimes.com 3,229
siliconvalley.com 246
orlandosentinel.com 1,913
wsj.com 2,518
utsandiego.com 2,563
chron.com 457
nydailynews.com 737
sacbee.com 2,516
latimes.com 9,446
chicagotribune.com 1,809
ocregister.com 3,307
nypostonline.com 1,448
blogspot.com 97,607 (24,595 unique blogs)
wordpress.com 54,133 (19,160 unique blogs)
twitter.com 2,360,148 (169,230 unique users)

Table 3 - Lines of text counts by web site

Features of the Dataset

With such a vast amount of text from Twitter, news and blogging sites, one can imagine that it’s full of all kinds of information such as stories, dates, addresses, bible verse numbers, and stock tickers. But it also contains messy data in the form of profanity, typos, and foreign languages. Additionally, punctuation marks, email addresses, and Twitter hashtags were discovered.

Preprocessing the Dataset

We’ll clean up the dataset by removing punctuation marks, characters in email addresses, numbers, profanity, and common word endings like “ing” and “es”. Typos can be corrected with a spell checking program.

Dataset Summary

My sample size contained 219,893 lines of text and after cleaning, I found 148,816 words or unigrams. The minimum and median word counts were 1 with a mean of 27.26, 3rd quartile of 4 and a maximum of 244,900. The standard deviation of the words was 818.093. There are 84,018 words that appear only one time in the corpus, with more than 86% of words appearing ten times or less. A histogram also shows the word frequency distribution is heavily right skewed. We create lists of the 20 least frequent and most frequent words from the corpus. Words that appear a significant number of times are plotted and a word cloud of the most frequently used words is also shown. The counts of word pairs (bigrams) and triplets (trigrams) are summarized in the same way.

## [1] 148816
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.00      1.00      1.00     27.26      4.00 244900.00
## [1] 818.093
## [1] 84018
## [1] 0.864571

##                     aaaa aaaaaaaaaaaaaaaaaaaaaack               aaaaaahhhh 
##                        1                        1                        1 
##             aaaaaahhhhhh             aaaaaahhhhhi            aaaaagggghhhh 
##                        1                        1                        1 
##                    aaaah                aaaahhhhh                  aaaalll 
##                        1                        1                        1 
##  aaaallllrrreeeaaadddyyi                  aaaammm                   aaaand 
##                        1                        1                        1 
##                 aaaannnd                    aaagh                    aaand 
##                        1                        1                        1 
##                    aaaph                  aaarrgh           aaarrrgggghhhh 
##                        1                        1                        1 
##                     aaas                  aaawwww 
##                        1                        1
##   just    his   said   will   they    all    its   from    not    but 
##  15550  15683  15699  15979  16360  16810  18521  19349  21287  24604 
##    are   this   have    was   with    you   that    for    and    the 
##  25110  27415  27504  32318  36699  48294  53700  56925 123990 244882

I found 1,754,566 bigrams, with a minimum and median of 1, and a mean of 2.929, 3rd quartile of 1 and a maximum of 22,030. The standard deviation of the bigram distribution was 39.72995. There are 1,356,861 bigrams that appear only one time in the corpus, with more than 97% of bigrams appearing ten times or less.

## [1] 1754566
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.000     1.000     1.000     2.929     1.000 22030.000
## [1] 39.72995
## [1] 1356861
## [1] 0.9713867

##             a aaa            a aauw           a ablum         a absolut 
##                 1                 1                 1                 1 
##             a abv             a acc a accomplishments         a account 
##                 1                 1                 1                 1 
##          a acheat        a ackerman  a acrosstheboard        a actually 
##                 1                 1                 1                 1 
##              a ad            a aday       a addiction          a adding 
##                 1                 1                 1                 1 
##          a adkins           a admin         a admired           a adult 
##                 1                 1                 1                 1
##  will be    it is   with a    and i from the    i was   i have    for a 
##     4140     4214     4222     4258     4381     4474     4511     4907 
##   it was     is a with the     in a  and the   at the    to be   on the 
##     4997     5165     5296     6248     6528     7226     8332    10209 
##  for the   to the   in the   of the 
##    10318    11088    21387    22031

I found 3,849,394 trigrams, with a minimum of 1, median of 1, and a mean of 1.335, 3rd quartile of 1 and a maximum of 1,819. The standard deviation of the trigrams count was 3.830097. There are 3,478,219 trigrams that appear only one time in the corpus, with more than 99% of trigrams appearing ten times or less.

## [1] 3849394
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    1.000    1.335    1.000 1819.000
## [1] 3.830097
## [1] 3478219
## [1] 0.9943334

##   a a allstar       a a and       a a bit     a a blast      a a bowl 
##             1             1             1             1             1 
##  a a commerci a a commodity      a a days     a a dream       a a for 
##             1             1             1             1             1 
##      a a good    a a google         a a i       a a lot       a a map 
##             1             1             1             1             1 
##      a a pair      a a part     a a quick      a a real    a a report 
##             1             1             1             1             1
##        im going to        the rest of      thank you for 
##                531                534                544 
##        i dont know looking forward to           i have a 
##                581                592                610 
##          i have to         be able to        some of the 
##                611                652                660 
##        part of the         as well as         the end of 
##                667                694                709 
##          i want to         out of the           it was a 
##                739                768                770 
##        going to be            to be a     thanks for the 
##                857                955               1183 
##           a lot of         one of the 
##               1603               1819

Prediction Algorithm

My algorithm will work the same way as most smartphone’s text messaging applications do.

Step 1. Show the top 3 words most likely to start a sentence.

Step 2. Process a word from the user and present the top 3 words most likely to follow all words entered so far.

Step 3. Continue Step 2 until the user hits send.

Example

The algorithm will need to use the n-gram counts from the corpus and compute the probabilities of the next word, given each n-gram entered by the user and display the ones with the three highest probabilities. Because of the many n-gram combinations possible in the English language, we’ll need to borrow an assumption from a mathematician named Andrey Markov which says we can estimate the probabilities of these n-grams by using the last few n-grams entered instead of all the n-grams. This reduces the number of possibilities the algorithm needs to consider.

The corpus may not contain the n-grams entered by the user because it has a limited vocabulary due the memory size limitation of the smartphone. Another drawback is that the model knows the n-grams but hasn’t counted them which results in a zero probability when predicting a word following those n-grams.

To evaluate how well our model performs text prediction, we’ll compute an indicator called perplexity using a test set that was unseen when training our model. It is the probability of the test set normalized by the number of words. The lower the perplexity, the better the model is.

How the developer chooses to handle these issues will determine the user experience of the text prediction application.