With devices such as smartphones, people increasingly communicate with one another through text-based interfaces.
The website Openmarket.com states “Findings reveal millennials have an overwhelming affinity for texting. In fact, when given the choice between only being able to text or call on their mobile phones, a surprising 75 percent of millennials would rather lose the ability to talk versus text.
Survey respondents say texts are “more convenient” and on their own schedule (76 percent), texts are “less disruptive than a voice call” (63 percent), they “prefer to text vs. call” in general (53 percent) and because they “never check voicemails” (19 percent).”
Communicating via text (texting: see, it’s so popular it’s a verb now), however, is riddled with challenges. Typing every letter of every word, using appropriate punctuation, and substituting nonverbal cues with text-appropriate words or symbols are just a few of the issues that are unique to text-based communication. As I type this Markdown document, I am keenly aware of the inherent inefficiencies of written communication.
Several innovations have been made that greatly improve both the efficiency and the efficacy of texting. Some of the first major efficiency improvements were spell checkers (which this Markdown editor does not have… ironic, maybe?), autocorrection, and automatic formatting. Emoticons, Bitmoji, etc. have been very effective at replacing the many words or sentences that would otherwise be needed to communicate context, intent, and emotion.
Most texting is performed on small, handheld devices. As a result, the “keyboards” used for text entry are extremely small, limiting the number of characters that are readily available as well as exacerbating keyboard entry errors (typos). Therefore, messages can be exchanged with greater speed and accuracy if an individual’s interaction with the keyboard is reduced. One way to reduce the number of keyboard inputs is to use so-called “Predictive Text” (PT) algorithms.
“Predictive text is an input technology used where one key or button represents many letters, such as on the numeric keypads of mobile phones and in accessibility technologies. Each key press results in a prediction rather than repeatedly sequencing through the same group of “letters” it represents, in the same, invariable order. Predictive text could allow for an entire word to be input by single keypress. Predictive text makes efficient use of fewer device keys to input writing into a text message, an e-mail, an address book, a calendar, and the like.” – Wikipedia
In other words, a PT algorithm attempts to predict what your next word will be and suggests it to you. If you agree with the suggestion, a single keyboard input then places the entire word into the message. In a perfect world with perfect prediction, only one keystroke per word would be required to compose a message.
A PT algorithm is, at its core, a statistical model. The next word in a string is never known for certain, regardless of how many previous words are taken into account. Therefore, one can only make predictions based on the likelihood of a word given some prior condition. If the preferences of the user are unknown, the user must be treated as an individual in a population of texters, and the parameters of the population are assumed to apply to the individual as well. For example, if the population of all texters has a high probability of following the words “how are” with the word “you”, then the algorithm will assume that any individual from that population is likely to do the same.
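As a toy illustration of that conditional probability (the counts here are invented, not taken from the sample files), the estimate could come directly from n-gram counts:

```r
# Toy example: estimate P("you" | "how are") from corpus counts.
# The counts below are hypothetical and only illustrate the idea.
count_how_are     <- 5000   # occurrences of "how are" in the population sample
count_how_are_you <- 4200   # occurrences of "how are you" in the population sample

p_you_given_how_are <- count_how_are_you / count_how_are
p_you_given_how_are
## [1] 0.84
```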
This report describes my preliminary investigation into developing a PT algorithm. In addition, it describes my plan to develop an app that implements the algorithm.
In order to determine how an individual message might be constructed, it is important to know how messages are constructed in the population of all messages (the big population). Large samples of text are required. Care must be taken, however, in choosing text to be used as representative samples. Authors of text messages do not construct messages the way that authors of novels do. It would be extremely unlikely to find “idk” in a novel, but easy to find it in a text message.
As a result, I used three large text files as representative samples. The first is a large file containing only lines from Twitter accounts (tweets), the second is a file containing blog posts, and the third contains news articles.
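The line counts reported below can be obtained with `readLines()`; this is a sketch, and the file names here are assumptions rather than necessarily the exact files I used.

```r
# Sketch: read the three sample files and count their lines.
blogLines    <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
newsLines    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitterLines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

numLinesBlog    <- length(blogLines)
numLinesNews    <- length(newsLines)
numLinesTwitter <- length(twitterLines)
```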
numLinesBlog
## [1] 899288
numLinesNews
## [1] 1010242
numLinesTwitter
## [1] 2360148
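The “cleaned” samples shown next could be produced with a simple cleaning function along these lines. This is only a sketch: my actual cleaning pipeline may differ, and the BlogDataClean / NewsDataClean names are illustrative (TwitterDataClean reappears in the profanity counts later in this report).

```r
# Sketch of a cleaning step: lowercase, keep only letters and spaces, split into words.
cleanWords <- function(lines) {
  txt   <- tolower(lines)
  txt   <- gsub("[^a-z ]", " ", txt)       # strip punctuation, digits, and other symbols
  words <- unlist(strsplit(txt, "\\s+"))   # split on whitespace
  words[words != ""]                       # drop empty tokens
}

BlogDataClean    <- cleanWords(blogLines)
NewsDataClean    <- cleanWords(newsLines)
TwitterDataClean <- cleanWords(twitterLines)
```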
Sample output of “cleaned” blog text.
## [1] "in" "the" "years" "thereafter" "most"
## [6] "of" "the" "oil" "fields" "and"
Sample output of “cleaned” news text.
## [1] "he" "wasnt" "home" "alone" "apparently"
## [6] "the" "st" "louis" "plant" "had"
Sample output of “cleaned” twitter text.
## [1] "how" "are" "you" "btw" "thanks" "for" "the"
## [8] "rt" "you" "gonna"
The total number of words and the total number of unique words in each file are listed below.
## Type Unique Total
## 1 Blog 369083 37210525
## 2 News 86935 2581769
## 3 Twitter 467495 29372088
I tabulated the frequencies of each word in all three files and ranked them by relative frequency (%). The ten most frequent words in the blog, news, and Twitter samples, respectively, are listed below.
## Word Count RelFreq
## 324338 the 1857050 4.990658
## 11131 and 1088448 2.925108
## 329675 to 1066100 2.865050
## 1 a 898485 2.414599
## 227837 of 875167 2.351934
## 151317 i 827376 2.223500
## 154280 in 594047 1.596449
## 324114 that 472282 1.269216
## 160850 it 441208 1.185708
## 160121 is 432199 1.161497
## Word Count RelFreq
## 77208 the 151552 5.8700837
## 78286 to 69355 2.6863364
## 2525 and 68258 2.6438461
## 1 a 67227 2.6039123
## 53711 of 59091 2.2887795
## 36830 in 51468 1.9935168
## 27902 for 27116 1.0502876
## 77189 that 26648 1.0321605
## 38332 is 21969 0.8509282
## 54143 on 20586 0.7973603
## Word Count RelFreq
## 399679 the 934321 3.180983
## 408941 to 786676 2.678311
## 187652 i 715096 2.434611
## 1 a 609250 2.074248
## 463001 you 544807 1.854846
## 13439 and 433779 1.476841
## 142567 for 384581 1.309342
## 192651 in 377109 1.283903
## 284178 of 359003 1.222259
## 199346 is 357583 1.217425
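A minimal sketch of the tabulation behind these tables (the code I actually ran may differ); the same step also yields the total and unique word counts reported above.

```r
# Sketch: rank words by relative frequency (%) for one cleaned word vector.
wordStats <- function(words) {
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(Word    = names(counts),
             Count   = as.integer(counts),
             RelFreq = 100 * as.integer(counts) / length(words))
}

FreqTableTwitter <- wordStats(TwitterDataClean)
length(TwitterDataClean)      # total words in the Twitter sample
nrow(FreqTableTwitter)        # unique words in the Twitter sample
head(FreqTableTwitter, 10)    # ten most frequent words with relative frequency (%)
```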
The word “the” is the most common word in all three files, although it is used relatively less on Twitter. The words “to”, “in”, “and”, and “a” are also among the most common in all three files, and they are among the few words that rank near the top of all three text samples.
Note that the platform or application can have a huge effect on word usage. For example, the word “rt” is used more often on Twitter than many words we use all the time when not using Twitter. My daughter informs me that “rt” means re-tweet :).
## Word Count RelFreq name
## 159604 good 99731 0.3395434 Twitter
## 449042 will 94297 0.3210429 Twitter
## 1048 about 90961 0.3096852 Twitter
## 94613 day 90132 0.3068628 Twitter
## 59796 can 89600 0.3050515 Twitter
## 341826 rt 88778 0.3022529 Twitter
## 108773 dont 88721 0.3020589 Twitter
## 398560 thanks 88707 0.3020112 Twitter
## 147188 from 83715 0.2850155 Twitter
## 280191 now 81989 0.2791392 Twitter
## 446219 when 81805 0.2785127 Twitter
Word lengths might also be useful in predictive algorithms. For example, very long words are seldom used and therefore should rarely appear as predicted values. In a prompt such as “Mom leans _________”, a short, common word is a far more likely completion than a long, rare one.
I removed all words longer than 12 characters from the blog, news, and Twitter files.
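A sketch of that filter, assuming the cleaned word vectors from the earlier sketch:

```r
# Sketch: drop words longer than 12 characters from each cleaned word vector.
BlogDataClean    <- BlogDataClean[nchar(BlogDataClean) <= 12]
NewsDataClean    <- NewsDataClean[nchar(NewsDataClean) <= 12]
TwitterDataClean <- TwitterDataClean[nchar(TwitterDataClean) <= 12]
```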
With the sample data largely cleaned, there is still work left to do; for example, removing profanity. The Twitter file contains the most profanity: a search for the “f-word” returned almost 26,000 occurrences of the string “fuck” and 860 different words containing it. It’s fairly entertaining to peruse the list and see the creative ways in which people are able to use these 4 letters :).
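The table below comes from searching the cleaned Twitter words for the string and tabulating each variant. Here is a sketch of that search; FreqTableFwordSort is the object used in the counts further down, while the intermediate steps are assumptions.

```r
# Sketch: find every cleaned Twitter word containing "fuck" and tabulate each variant.
TwitterFword       <- TwitterDataClean[grepl("fuck", TwitterDataClean, fixed = TRUE)]
FreqTableFword     <- as.data.frame(table(TwitterFword))
FreqTableFwordSort <- FreqTableFword[order(-FreqTableFword$Freq), ]
head(FreqTableFwordSort, 20)   # twenty most frequent variants
```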
## TwitterFword Freq
## 129 fuck 12080
## 250 fucking 7218
## 242 fuckin 1819
## 173 fucked 1398
## 363 fucks 440
## 191 fucker 244
## 601 motherfucker 166
## 199 fuckers 165
## 189 fucken 145
## 336 fuckn 109
## 610 motherfucking 79
## 604 motherfuckers 76
## 643 muthafucka 56
## 206 fuckery 54
## 297 fuckk 52
## 653 muthafuckin 44
## 130 fucka 40
## 609 motherfuckin 37
## 579 mothafucka 29
## 300 fuckkkk 26
diffFwords <- sum(FreqTableFwordSort[, 2])    # total occurrences across all variants
diffFwords
## [1] 25719
numFwords <- length(FreqTableFwordSort[, 1])  # number of distinct variants
numFwords
## [1] 860
freqFwords <- numFwords / length(TwitterDataClean)  # distinct variants as a share of all Twitter words
freqFwords * 100                                     # expressed as a percentage
## [1] 0.00292795
All kidding aside, there is a fair amount of work involved in ensuring that the PT algorithm does not offer profanity as word suggestions. Defining profanity is no small task, and eliminating all variations of a profane word is a real challenge.
More generally, if an author commonly communicates using profanity, is it the responsibility of the algorithm to make it more difficult for the author to communicate as he or she sees fit … ?
Using the cleaned text files, I intend to generate tables of so-called n-grams. A 1-gram is a single word, a 2-gram is a pair of words, and so on. The frequencies of the n-grams can be used to fit a model using the blog, Twitter, and news files as training data. At this point, I am not sure what type of model makes the most sense. Since the predicted word depends only on a single value (an n-gram), some type of decision tree may be useful.
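As a rough illustration of the idea (not the final model), a 2-gram frequency table built from the cleaned words can already act as a naive predictor:

```r
# Sketch: build a 2-gram count table and suggest the most frequent next words.
# Sentence and line boundaries are ignored here for simplicity.
words   <- TwitterDataClean
bigrams <- paste(head(words, -1), tail(words, -1))   # consecutive word pairs
bigramCounts <- sort(table(bigrams), decreasing = TRUE)

predictNext <- function(word, counts = bigramCounts, n = 3) {
  followers <- counts[startsWith(names(counts), paste0(word, " "))]
  sub("^\\S+ ", "", names(head(followers, n)))       # the words that most often follow
}

predictNext("how")   # top candidate next words after "how"
```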
After I implement and verify the model, I will develop a Shiny app that will allow a user to type words and see what the algorithm suggests as the next word.
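A bare-bones skeleton of what that app might look like, assuming the predictNext() helper sketched above (a sketch of the plan, not the final design):

```r
# Sketch: minimal Shiny app that shows suggested next words as the user types.
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type your message:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    words <- strsplit(tolower(input$phrase), "\\s+")[[1]]
    if (length(words) == 0) return("")
    paste(predictNext(tail(words, 1)), collapse = ", ")   # suggestions for the last word typed
  })
}

shinyApp(ui, server)
```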