Executive Summary

By 2017, over a third of the world’s population is projected to own a smartphone, an estimated total of almost 2.6 billion smartphone users in the world. Many will use it to send messages to friends and family, but the text prediction feature doesn’t work as expected sometimes as you can see in this image.

Apartment Hunt

This project will dive into text prediction algorithms to understand them and attempt to create a better algorithm so that billions of users can communicate clearly with their smartphones. However, I hope text prediction doesn’t get too accurate because everyone can use a good laugh now and then.

To predict text, we use a language model based on the probabilities a word would be used, given other words are used at the same time. We analyse a corpus or a large set of text and find all words and their frequencies or the number of times they appear in the corpus. We build tables with the counts of single words, pairs and triplets of words (also called N-grams), to be used to compute the probability of the next word.

The Dataset

We’ll use data from HC Corpora which has collected 2,580,000,000 words from 67 languages at last count. This site used a progam called a web crawler to download text and tag it with publication date, website, entry type, and subject. The text was categorized as news, blog entries, or Twitter posts and packaged in three files. We’ll combine the three files into one and choose a 5% sample size of the 4,269,678 lines of text.

	Size	Filename	Line Count	Word Count
	210MB	en_US.blogs.txt	899,288	37,334,690
	206MB	en_US.news.txt	1,010,242	34,372,720
	167MB	en_US.twitter.txt	2,360,148	30,374,206
Totals	583MB		4,269,678	102,081,616

Table 1 - Dataset summary statistics

Text were divided further into the following sub-categories.

Armed Forces
Arts & Culture
Blogs
Business & Economy
Crime & Law
Education
Emergency & Disaster
Entertainment
Environment
Family
Food
Health
Holidays (Christmas, Hannukkah, Ramadan, etc.)
Home & Garden
International News (compared to the nation of the language)
Leisure Time (hobbies, etc.)
Lifestyle & Fashion
Local news
No subject
Obituaries
Politics
Recipes
Religion
Science & Technology
Sport
Transport
Travel
Weather
WWII

Table 2 - Alphabetical list of the sub-categories

The next table shows the web sites which were used and the number of lines of text downloaded. The top three sites were cleveland.com, nj.com, and stltoday.com. There were 24,595 unique blogs from blogspot and 19,160 unique blogs from wordpress. 169,230 unique Twitter users were found in the Twitter data.

	Web Site	Count
	nj.com	19,250
	freep.com	4,830
	startribune.com	4,208
	ajc.com	1,400
	azcentral.com	5,531
	sfgate.com	6,398
	oregonlive.com	15,120
	cleveland.com	19,390
	accessatlanta.com	82
	tampabay.com	268
	indystar.com	1,661
	baltimoresun.com	6,803
	stltoday.com	15,549
	denverpost.com	3,878
	philly.com	779
	mercurynews.com	3,400
	usatoday.com	972
	kansascity.com	1,448
	bostonherald.com	531
	suntimes.com	3,229
	siliconvalley.com	246
	orlandosentinel.com	1,913
	wsj.com	2,518
	utsandiego.com	2,563
	chron.com	457
	nydailynews.com	737
	sacbee.com	2,516
	latimes.com	9,446
	chicagotribune.com	1,809
	ocregister.com	3,307
	nypostonline.com	1,448
	blogspot.com	97,607 (24,595 unique blogs)
	wordpress.com	54,133 (19,160 unique blogs)
	twitter.com	2,360,148 (169,230 unique users)

Table 3 - Lines of text counts by web site

Features of the Dataset

With such a vast amount of text from Twitter, news and blogging sites, one can imagine that it’s full of all kinds of information such as stories, dates, addresses, bible verse numbers, and stock tickers. But it also contains messy data in the form of profanity, typos, and foreign languages. Additionally, punctuation marks, email addresses, and Twitter hashtags were discovered.

Preprocessing the Dataset

We’ll clean up the dataset by removing punctuation marks, characters in email addresses, numbers, profanity, and common word endings like “ing” and “es”. Typos can be corrected with a spell checking program.

Dataset Summary

My sample size contained 219,893 lines of text and after cleaning, I found 148,816 words or unigrams. The minimum and median word counts were 1 with a mean of 27.26, 3rd quartile of 4 and a maximum of 244,900. The standard deviation of the words was 818.093. There are 84,018 words that appear only one time in the corpus, with more than 86% of words appearing ten times or less. A histogram also shows the word frequency distribution is heavily right skewed. We create lists of the 20 least frequent and most frequent words from the corpus. Words that appear a significant number of times are plotted and a word cloud of the most frequently used words is also shown. The counts of word pairs (bigrams) and triplets (trigrams) are summarized in the same way.

## [1] 148816

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.00      1.00      1.00     27.26      4.00 244900.00

## [1] 818.093

## [1] 84018

## [1] 0.864571

##                     aaaa aaaaaaaaaaaaaaaaaaaaaack               aaaaaahhhh 
##                        1                        1                        1 
##             aaaaaahhhhhh             aaaaaahhhhhi            aaaaagggghhhh 
##                        1                        1                        1 
##                    aaaah                aaaahhhhh                  aaaalll 
##                        1                        1                        1 
##  aaaallllrrreeeaaadddyyi                  aaaammm                   aaaand 
##                        1                        1                        1 
##                 aaaannnd                    aaagh                    aaand 
##                        1                        1                        1 
##                    aaaph                  aaarrgh           aaarrrgggghhhh 
##                        1                        1                        1 
##                     aaas                  aaawwww 
##                        1                        1

##   just    his   said   will   they    all    its   from    not    but 
##  15550  15683  15699  15979  16360  16810  18521  19349  21287  24604 
##    are   this   have    was   with    you   that    for    and    the 
##  25110  27415  27504  32318  36699  48294  53700  56925 123990 244882

I found 1,754,566 bigrams, with a minimum and median of 1, and a mean of 2.929, 3rd quartile of 1 and a maximum of 22,030. The standard deviation of the bigram distribution was 39.72995. There are 1,356,861 bigrams that appear only one time in the corpus, with more than 97% of bigrams appearing ten times or less.

## [1] 1754566

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     1.000     1.000     1.000     2.929     1.000 22030.000

## [1] 39.72995

## [1] 1356861

## [1] 0.9713867

##             a aaa            a aauw           a ablum         a absolut 
##                 1                 1                 1                 1 
##             a abv             a acc a accomplishments         a account 
##                 1                 1                 1                 1 
##          a acheat        a ackerman  a acrosstheboard        a actually 
##                 1                 1                 1                 1 
##              a ad            a aday       a addiction          a adding 
##                 1                 1                 1                 1 
##          a adkins           a admin         a admired           a adult 
##                 1                 1                 1                 1

##  will be    it is   with a    and i from the    i was   i have    for a 
##     4140     4214     4222     4258     4381     4474     4511     4907 
##   it was     is a with the     in a  and the   at the    to be   on the 
##     4997     5165     5296     6248     6528     7226     8332    10209 
##  for the   to the   in the   of the 
##    10318    11088    21387    22031

I found 3,849,394 trigrams, with a minimum of 1, median of 1, and a mean of 1.335, 3rd quartile of 1 and a maximum of 1,819. The standard deviation of the trigrams count was 3.830097. There are 3,478,219 trigrams that appear only one time in the corpus, with more than 99% of trigrams appearing ten times or less.

## [1] 3849394

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    1.000    1.335    1.000 1819.000

## [1] 3.830097

## [1] 3478219

## [1] 0.9943334

##   a a allstar       a a and       a a bit     a a blast      a a bowl 
##             1             1             1             1             1 
##  a a commerci a a commodity      a a days     a a dream       a a for 
##             1             1             1             1             1 
##      a a good    a a google         a a i       a a lot       a a map 
##             1             1             1             1             1 
##      a a pair      a a part     a a quick      a a real    a a report 
##             1             1             1             1             1

##        im going to        the rest of      thank you for 
##                531                534                544 
##        i dont know looking forward to           i have a 
##                581                592                610 
##          i have to         be able to        some of the 
##                611                652                660 
##        part of the         as well as         the end of 
##                667                694                709 
##          i want to         out of the           it was a 
##                739                768                770 
##        going to be            to be a     thanks for the 
##                857                955               1183 
##           a lot of         one of the 
##               1603               1819

Prediction Algorithm

My algorithm will work the same way as most smartphone’s text messaging applications do.

Step 1. Show the top 3 words most likely to start a sentence.

Step 2. Process a word from the user and present the top 3 words most likely to follow all words entered so far.

Step 3. Continue Step 2 until the user hits send.

Example

Show I The and I’m
User chooses I, and app shows love, don’t and just
User chooses I love, and app shows you, it, and the
User chooses I love you and app shows so, so much, and too.

The algorithm will need to use the n-gram counts from the corpus and compute the probabilities of the next word, given each n-gram entered by the user and display the ones with the three highest probabilities. Because of the many n-gram combinations possible in the English language, we’ll need to borrow an assumption from a mathematician named Andrey Markov which says we can estimate the probabilities of these n-grams by using the last few n-grams entered instead of all the n-grams. This reduces the number of possibilities the algorithm needs to consider.

The corpus may not contain the n-grams entered by the user because it has a limited vocabulary due the memory size limitation of the smartphone. Another drawback is that the model knows the n-grams but hasn’t counted them which results in a zero probability when predicting a word following those n-grams.

To evaluate how well our model performs text prediction, we’ll compute an indicator called perplexity using a test set that was unseen when training our model. It is the probability of the test set normalized by the number of words. The lower the perplexity, the better the model is.

How the developer chooses to handle these issues will determine the user experience of the text prediction application.

Text Prediction Application

Calvin Seto

March 20, 2016