## Background

We are going to produce R code that mimics the word-prediction algorithms used in mobile text messaging. A joint venture between Johns Hopkins University and SwiftKey provided training data composed of Twitter posts, blogs, and news feeds. The files come in four languages: American English, Finnish, German, and Russian. The algorithm presented below will emphasize speed while aiming to yield results similar to those of the current, memory-intensive methods; that is, it should still predict what the user wants to type next.

## Data Processing

Let us now look at the data. Each of the 12 given files is about 115 MB in size on average. While building and testing some early algorithms, we can work with smaller samples of the data. For instance, here we will look at the first 7 tweets in the English Twitter data set and load a sample of 250 tweets into a variable:

  temp_connection <- file("en_US.twitter.txt", "r")
  readLines(temp_connection, 7)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [5] "Words from a complete stranger! Made my birthday even better :)"
## [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [7] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
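  close(temp_connection)
  English_Twitter_sample <- read.table("en_US.twitter.txt", nrows = 250,
                                       quote = "",
                                       sep = "\n",
                                       stringsAsFactors = FALSE)
  English_Twitter <- read.table("en_US.twitter.txt",
                                quote = "",
                                sep = "\n",
                                stringsAsFactors = FALSE)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : embedded nul(s) found in input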
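
The embedded-nul warning above comes from the `scan` call underneath `read.table`. As a hedged alternative, a sample could be read with `readLines`, whose `skipNul` argument drops embedded nuls. The helper name `read_text_sample` and the temporary demo file are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper (not from the original report): read the first n lines
# of a text file while skipping embedded nul characters.
read_text_sample <- function(path, n = 250) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = n, warn = FALSE, skipNul = TRUE)
}

# Demo on a small temporary file standing in for en_US.twitter.txt.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first tweet", "second tweet", "third tweet"), demo_path)
demo_sample <- read_text_sample(demo_path, n = 2)
print(demo_sample)
```

Note that this returns a character vector rather than a data frame, so `length()` would take the place of `nrow()` when counting lines.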

We can do the same with the blogs,

  temp_connection <- file("en_US.blogs.txt", "r")
  readLines(temp_connection, 7)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [4] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
## [5] "With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!"
## [6] "If you have an alternative argument, let's hear it! :)"
## [7] "If I were a bear,"
  close(temp_connection)
  English_blogs_sample <- read.table("en_US.blogs.txt", nrows = 250,
                                     quote = "",
                                     sep = "\n",
                                     stringsAsFactors = FALSE)
  English_blogs <- read.table("en_US.blogs.txt",
                              quote = "",
                              sep = "\n",
                              stringsAsFactors = FALSE)

and the news feeds.

  temp_connection <- file("en_US.news.txt", "r")
  readLines(temp_connection, 7)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [4] "The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15."
## [5] "And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?"
## [6] "There was a certain amount of scoffing going around a few years ago when the NFL decided to move the draft from the weekend to prime time -- eventually splitting off the first round to a separate day."
## [7] "14915 Charlevoix, Detroit"
  close(temp_connection)
  English_news_sample <- read.table("en_US.news.txt", nrows = 250,
                                    quote = "",
                                    sep = "\n",
                                    stringsAsFactors = FALSE)
  English_news <- read.table("en_US.news.txt",
                             quote = "",
                             sep = "\n",
                             stringsAsFactors = FALSE)

From here, we can further separate the data samples into “words”.

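  # Split each sampled line on single spaces; V1 is the default column name from read.table
  words_in_Twitter <- unlist(strsplit(English_Twitter_sample$V1, " ", fixed = TRUE))
  words_in_blogs <- unlist(strsplit(English_blogs_sample$V1, " ", fixed = TRUE))
  words_in_news <- unlist(strsplit(English_news_sample$V1, " ", fixed = TRUE))
  English_words <- c(words_in_blogs, words_in_news, words_in_Twitter)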

From here forward, we will use the term “word” liberally for any group of characters separated by spaces. For instance, “dodger” and “Dodger” will be two different words, and “soon” and “soon?” will be considered different as well (adding a question mark in particular changes an expression from a statement to an interrogative, which should affect the prediction algorithm).
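
To make this definition concrete, here is a minimal sketch of how it behaves on a single line of text. The sentence is made up for illustration and does not come from the data sets.

```r
# Illustrative sentence (an assumption for this sketch, not taken from the data).
example_line <- "Is the dodger coming soon? The Dodger is coming soon"
example_words <- strsplit(example_line, " ", fixed = TRUE)[[1]]

# Case and punctuation are kept, so "dodger"/"Dodger" and "soon"/"soon?"
# are counted as separate words; only "coming" repeats.
word_counts <- table(example_words)
print(word_counts)
```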

## Exploratory Data Analysis

The nrow function will quickly find the number of lines of text in each data set.

  nrow(English_blogs)
## [1] 898384
  nrow(English_news)
## [1] 77258
  nrow(English_Twitter)
## [1] 2302307

More specifically, here is the number of words in each sample data set.

  length(words_in_blogs)
## [1] 11503
  length(words_in_news)
## [1] 8179
  length(words_in_Twitter)
## [1] 2925

Here we will look at the patterns in the words themselves. First, we can extract the first letter of each word and the length of each word.

  # substring() and nchar() are vectorized, so no apply-style wrapper is needed
  first_letters_of_words <- substring(English_words, 1, 1)
  lengths_of_words <- nchar(English_words)

To get a sense of how words start, here is a bar chart of their first letters. Note: the first letters were converted to lower case to keep the graph compact. That forced conversion will not be done anywhere else in this project.

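  library(ggplot2)
  # qplot() is deprecated in recent ggplot2 releases, so build the bar chart explicitly
  first_letter_df <- data.frame(letter = tolower(first_letters_of_words))
  ggplot(first_letter_df, aes(x = letter)) + geom_bar()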

We can see that “t” starts the most words, while the 5th-place “w” might be a surprise.

The following histogram then shows the distribution of word lengths in the data sets, restricted to words shorter than 11 characters.

  # Keep only words shorter than 11 characters (longer ones are dropped, not set to zero)
  lengths_of_practical_words <- lengths_of_words[lengths_of_words < 11]
  hist(lengths_of_practical_words)

We can see that 3-letter words are the most common, at about 19 percent of the words in the samples.

  sum(lengths_of_words == 3) / length(lengths_of_words)
## [1] 0.1856062
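
The same share can be computed for every word length at once. The sketch below uses a small made-up word vector in place of `English_words`, so the resulting numbers are illustrative only.

```r
# Toy word vector (an assumption for illustration; the report uses English_words).
toy_words <- c("the", "cat", "sat", "on", "a", "mat", "today")
toy_lengths <- nchar(toy_words)

# Share of words at each length, largest share first.
length_shares <- sort(prop.table(table(toy_lengths)), decreasing = TRUE)
print(round(length_shares, 3))
```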

## Going Forward

From here, I will turn this information about the first letter and the length of a typed word into a predictive algorithm that offers the user a choice of likely next words. In order to rank the words, I will compare their relative frequencies within those first-letter and length categories. This two-dimensional approach was adapted from the work done by Christian Rudder in his book Dataclysm.

For example, if the user types “Where”, the first letter is a capital “W” and the word length is 5. Within those categories, hopefully the algorithm predicts that the next word should be something like “is”, “are”, or “do”.
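
One possible reading of this idea is sketched below. It is an assumption about how the categories might be used, not the finished algorithm. Each word is reduced to a category key made from its first letter and its length, and candidate next words are ranked by how often they follow any word in that category. The function names `build_category_table` and `predict_next`, the key format, and the toy corpus are all hypothetical.

```r
# Hypothetical sketch: pair each word's category (first letter + length)
# with the word that follows it.
build_category_table <- function(words) {
  previous <- head(words, -1)
  following <- tail(words, -1)
  data.frame(
    category = paste0(substring(previous, 1, 1), "_", nchar(previous)),
    next_word = following,
    stringsAsFactors = FALSE
  )
}

# Rank the words seen after the typed word's category, most frequent first.
predict_next <- function(typed_word, category_table, k = 3) {
  key <- paste0(substring(typed_word, 1, 1), "_", nchar(typed_word))
  candidates <- category_table$next_word[category_table$category == key]
  if (length(candidates) == 0) return(character(0))
  ranked <- sort(table(candidates), decreasing = TRUE)
  head(names(ranked), k)
}

# Toy corpus, made up for this example.
toy_corpus <- c("Where", "is", "it", "Where", "is", "he",
                "Which", "is", "it", "Where", "are", "you")
toy_table <- build_category_table(toy_corpus)
toy_prediction <- predict_next("Where", toy_table)
print(toy_prediction)
```

In this toy corpus “Which” falls into the same category as “Where” (capital “W”, length 5), so the words following it are pooled into the same ranking. That pooling is what keeps the lookup table small, at the cost of treating distinct words alike.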

This early emphasis on merely the first letter and the length of the typed word will allow us to build a program that predicts the next word without having to build an extra-large database. This approach should work in real time while the user is typing.