Background

With devices such as smartphones, people are increasingly using text-based interfaces to communicate with one another.

The website Openmarket.com states “Findings reveal millennials have an overwhelming affinity for texting. In fact, when given the choice between only being able to text or call on their mobile phones, a surprising 75 percent of millennials would rather lose the ability to talk versus text.

Survey respondents say texts are “more convenient” and on their own schedule (76 percent), texts are “less disruptive than a voice call” (63 percent), they “prefer to text vs. call” in general (53 percent) and because they “never check voicemails” (19 percent).”

Communicating via text (texting – see, it’s so popular it’s a verb now), however, is riddled with challenges: typing every letter of every word, using appropriate punctuation, and substituting non-verbal cues with text-appropriate words or symbols are just a few of the issues that are unique to text-based communication. As I type this Markdown document, I am keenly aware of the inherent inefficiencies of written communication.

Several innovations have been made that greatly improve both the efficiency and the efficacy of texting. Some of the first major efficiency improvements were spell checkers (which this Markdown editor does not have… ironic, maybe?), auto-correction, and automatic formatting. Emoticons, Bitmoji, etc. have been very effective at replacing the many words or sentences that would otherwise be needed to communicate context, intent, and emotion.

Most texting is performed on small, handheld devices. As a result, the “keyboards” used for text entry are extremely small, limiting the number of characters that are readily available, as well as exacerbating keyboard entry errors (typos). Therefore, messages can be exchanged with greater speed and accuracy if an individual’s interaction with the keyboard is reduced. One way to reduce the number of keyboard inputs is to use so-called “Predictive Text” (PT) algorithms.

“Predictive text is an input technology used where one key or button represents many letters, such as on the numeric keypads of mobile phones and in accessibility technologies. Each key press results in a prediction rather than repeatedly sequencing through the same group of “letters” it represents, in the same, invariable order. Predictive text could allow for an entire word to be input by a single keypress. Predictive text makes efficient use of fewer device keys to input writing into a text message, an e-mail, an address book, a calendar, and the like.” – Wikipedia

In other words, a PT algorithm attempts to predict what your next word will be and suggests it to you. If you agree with the suggestion, a single keyboard input then places the entire word into the message. In a perfect world with perfect prediction, only one keystroke per word would be required to compose a message.

A PT algorithm is, at its core, a statistical model. The next word in a string is never known with certainty, even given the previous word or any number of previous words. Therefore, one can only make predictions based on the likelihood of a word given some prior condition. If the preferences of the user are unknown, the user must be considered an individual in a population of texters. The parameters of the population are then assumed to apply to the individual as well. For example, if the population of all texters has a high probability of following the words “how are” with the word “you”, then the algorithm will assume that any individual from that population would be likely to do the same.
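
Stated slightly more formally (this is my own sketch of the idea, not a quoted definition), the suggested word is the one with the highest estimated conditional probability given the preceding words, where the probability is estimated from counts in a large body of text:

\[
\hat{P}(\text{you} \mid \text{how are}) = \frac{\text{count}(\text{how are you})}{\text{count}(\text{how are})}
\]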

This report describes my preliminary investigation in developing a PT algorithm. In addition, it describes my plan to develop an app that implements the algorithm.

Preliminary Investigation

In order to determine how an individual message might be constructed, it is important to know how messages are constructed in the population of all messages (big population). Large samples of text are required. Care must be taken, however, in choosing text to be used as representative samples. Authors of text messages do not construct messages the way that authors of novels do. It would be extremely unlikely to find “idk” in a novel, but easy to find it in a text message.

As a result, I used 3 large text files as representative samples. The first is a large file containing only lines from Twitter accounts (tweets). The second is a file containing blog posts, and the third is a file containing news stories.

Read in the text files and count the number of lines in each file.

numLinesBlog
## [1] 899288
numLinesNews
## [1] 1010242
numLinesTwitter
## [1] 2360148

Change all text to lower case; remove punctuation, numerals, and URLs. Split lines into words.
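
A rough sketch of this cleaning step, using base R and the Twitter file as an example (the exact patterns are assumptions; only the name TwitterDataClean matches the output used later in this report):

twitterLower <- tolower(twitterLines)
twitterLower <- gsub("http\\S+|www\\.\\S+", " ", twitterLower, perl = TRUE)  # remove URLs
twitterLower <- gsub("'|’", "", twitterLower)                                # drop apostrophes so "wasn't" becomes "wasnt"
twitterLower <- gsub("[[:punct:]]|[[:digit:]]", " ", twitterLower)           # remove remaining punctuation and numerals
TwitterDataClean <- unlist(strsplit(twitterLower, "[[:space:]]+"))           # split lines into words
TwitterDataClean <- TwitterDataClean[TwitterDataClean != ""]                 # drop empty strings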

Sample output of “cleaned” blog text.

##  [1] "in"         "the"        "years"      "thereafter" "most"      
##  [6] "of"         "the"        "oil"        "fields"     "and"

Sample output of “cleaned” news text.

##  [1] "he"         "wasnt"      "home"       "alone"      "apparently"
##  [6] "the"        "st"         "louis"      "plant"      "had"

Sample output of “cleaned” twitter text.

##  [1] "how"    "are"    "you"    "btw"    "thanks" "for"    "the"   
##  [8] "rt"     "you"    "gonna"

Number of Words

I tabulated the frequencies of each word in all three files. The total number of words, and the total number of UNIQUE words, in each file is listed below.

##      Type Unique    Total
## 1    Blog 369083 37210525
## 2    News  86935  2581769
## 3 Twitter 467495 29372088
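
A sketch of how these totals might be computed for one file (using TwitterDataClean from the cleaning step above; the same approach would be repeated for the blog and news words):

totalTwitterWords  <- length(TwitterDataClean)          # total words
uniqueTwitterWords <- length(unique(TwitterDataClean))  # unique words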

Most Common Words

I then ranked the words in each file by relative frequency (%).
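
One way the frequency tables below could be produced, shown for the Twitter words (a sketch only, assuming TwitterDataClean as above):

FreqTableTwitter <- as.data.frame(table(Word = TwitterDataClean))
names(FreqTableTwitter)[2] <- "Count"
FreqTableTwitter$RelFreq <- 100 * FreqTableTwitter$Count / sum(FreqTableTwitter$Count)
FreqTableTwitter <- FreqTableTwitter[order(FreqTableTwitter$Count, decreasing = TRUE), ]
head(FreqTableTwitter, 10)  # ten most common words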

10 Most Common Blog Words by Relative Frequency (%)
##        Word   Count  RelFreq
## 324338  the 1857050 4.990658
## 11131   and 1088448 2.925108
## 329675   to 1066100 2.865050
## 1         a  898485 2.414599
## 227837   of  875167 2.351934
## 151317    i  827376 2.223500
## 154280   in  594047 1.596449
## 324114 that  472282 1.269216
## 160850   it  441208 1.185708
## 160121   is  432199 1.161497
10 Most Common News Words by Relative Frequency (%)
##       Word  Count   RelFreq
## 77208  the 151552 5.8700837
## 78286   to  69355 2.6863364
## 2525   and  68258 2.6438461
## 1        a  67227 2.6039123
## 53711   of  59091 2.2887795
## 36830   in  51468 1.9935168
## 27902  for  27116 1.0502876
## 77189 that  26648 1.0321605
## 38332   is  21969 0.8509282
## 54143   on  20586 0.7973603
10 Most Common Twitter Words by Relative Frequency (%)
##        Word  Count  RelFreq
## 399679  the 934321 3.180983
## 408941   to 786676 2.678311
## 187652    i 715096 2.434611
## 1         a 609250 2.074248
## 463001  you 544807 1.854846
## 13439   and 433779 1.476841
## 142567  for 384581 1.309342
## 192651   in 377109 1.283903
## 284178   of 359003 1.222259
## 199346   is 357583 1.217425

The word “the” is the most common word in all three files, although it is used less on Twitter. The words “to”, “and”, “a”, “of”, “in”, and “is” also appear in the top ten of all three files; the remaining top-ten words differ from file to file.

Note that the platform or application can have a huge effect on word usage. For example, the word “rt” is more commonly used on Twitter than words we use all the time when not using Twitter. My daughter informs me that “rt” means re-tweet :).

40th-50th Most Common Twitter Words by Relative Frequency (%)
##          Word Count   RelFreq    name
## 159604   good 99731 0.3395434 Twitter
## 449042   will 94297 0.3210429 Twitter
## 1048    about 90961 0.3096852 Twitter
## 94613     day 90132 0.3068628 Twitter
## 59796     can 89600 0.3050515 Twitter
## 341826     rt 88778 0.3022529 Twitter
## 108773   dont 88721 0.3020589 Twitter
## 398560 thanks 88707 0.3020112 Twitter
## 147188   from 83715 0.2850155 Twitter
## 280191    now 81989 0.2791392 Twitter
## 446219   when 81805 0.2785127 Twitter

Word Lengths

Word lengths might be useful in predictive algorithms. For example, words that are very long are seldom used, and therefore should not appear often as a predicted value.

“Mom leans _________”

  1. left
  2. right
  3. antidisestablishmentarianism

I removed all words longer than 12 characters from the blog, news, and Twitter files.

Descriptive stats on word lengths.
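
A sketch of both steps, summarizing word lengths and filtering out long words, shown for the Twitter words (the same would be repeated for the blog and news words):

wordLengthsTwitter <- nchar(TwitterDataClean)  # distribution of word lengths
summary(wordLengthsTwitter)
TwitterDataClean <- TwitterDataClean[wordLengthsTwitter <= 12]  # drop words longer than 12 characters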

Profanity — Do not read if you are easily offended.

With the sample data largely cleaned, there is still work left to do; for example, removing profanity. The Twitter file contains the most profanity: a search for the “fword” returned almost 26,000 occurrences of the string “fuck”, spread across 860 different words containing “fuck”. It’s fairly entertaining perusing the list to see the creative ways in which people are able to use these 4 letters :).
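
A sketch of how such a table might be built (only the names TwitterDataClean and FreqTableFwordSort appear in the output below; the construction itself is an assumption):

# all cleaned Twitter words containing the string "fuck"
fwordHits <- TwitterDataClean[grepl("fuck", TwitterDataClean, fixed = TRUE)]
FreqTableFword <- as.data.frame(table(TwitterFword = fwordHits))
# sort the variants, most frequent first
FreqTableFwordSort <- FreqTableFword[order(FreqTableFword$Freq, decreasing = TRUE), ]
head(FreqTableFwordSort, 20)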

##      TwitterFword  Freq
## 129          fuck 12080
## 250       fucking  7218
## 242        fuckin  1819
## 173        fucked  1398
## 363         fucks   440
## 191        fucker   244
## 601  motherfucker   166
## 199       fuckers   165
## 189        fucken   145
## 336         fuckn   109
## 610 motherfucking    79
## 604 motherfuckers    76
## 643    muthafucka    56
## 206       fuckery    54
## 297         fuckk    52
## 653   muthafuckin    44
## 130         fucka    40
## 609  motherfuckin    37
## 579    mothafucka    29
## 300       fuckkkk    26
# total occurrences of all f-word variants (sum of the Freq column)
diffFwords <- sum(FreqTableFwordSort[,2])
diffFwords
## [1] 25719
# number of distinct f-word variants
numFwords <- length(FreqTableFwordSort[,1])
numFwords
## [1] 860
# distinct variants as a percentage of all cleaned Twitter words
freqFwords <- numFwords/length(TwitterDataClean)
freqFwords * 100
## [1] 0.00292795

All kidding aside, there is a fair amount of work involved in ensuring that the PT algorithm does not include profanity in its word suggestions. Defining profanity is no small task, and eliminating all variations of a profane word is a real challenge.

More generally, if an author commonly communicates using profanity, is it the responsibility of the algorithm to make it more difficult for the author to communicate as he or she sees fit … ?

Goals for a Shiny app implementing a predictive text algorithm.

Using the cleaned text files, I intend to generate tables of so-called n-grams. A 1-gram is a single word, a 2-gram is a pair of words, etc. The frequencies of the n-grams can be used to fit a model using the blog, Twitter, and news files as training data. At this point, I am not sure what type of model makes the most sense. Since the predicted word depends only on a single value (an n-gram), some type of decision tree may be useful.
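
A first sketch of an n-gram table, shown for 2-grams built from the cleaned Twitter words (an illustration only; the final implementation may differ):

# build a naive 2-gram frequency table by pairing each word with the word that follows it
# (this simple version pairs words across line boundaries, which a real implementation would avoid)
bigrams <- paste(head(TwitterDataClean, -1), tail(TwitterDataClean, -1))
BigramFreq <- as.data.frame(table(Bigram = bigrams))
BigramFreq <- BigramFreq[order(BigramFreq$Freq, decreasing = TRUE), ]
head(BigramFreq, 10)  # the most common word pairs

To predict the next word for a given input word, the app would then look up the 2-grams whose first word matches the input and suggest the most frequent completion.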

After I implement and verify the model, I will develop a Shiny app that will allow a user to type words and see what the algorithm suggests as the next word.
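
A bare-bones sketch of what such an app might look like (predictNextWord is a hypothetical placeholder for the eventual PT model):

library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type your message:"),
  textOutput("suggestion")        # the suggested next word
)

server <- function(input, output) {
  output$suggestion <- renderText({
    predictNextWord(input$phrase) # hypothetical prediction function backed by the n-gram tables
  })
}

shinyApp(ui, server)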