The objective of this project is to create an app that predicts the next word while the user is typing. We have several datasets at hand, collected by a web crawler. In this document, we load the datasets, tidy them up and then perform some basic exploratory analysis on the cleaned data.
Download the datasets and store them in the working directory. We will currently work with the English (US) datasets only.
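For completeness, one way to fetch and unpack the data is sketched below. The download URL and the final/en_US folder layout inside the archive are assumptions here; adjust them if the course link or archive structure differs.
# Sketch only: the URL and archive layout below are assumed, not verified here.
if (!dir.exists("./Coursera-SwiftKey")) {
  url <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip", exdir = "./Coursera-SwiftKey")
}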
conblogs <- file("./Coursera-SwiftKey/final/en_US/en_US.blogs.txt")
connews <- file("./Coursera-SwiftKey/final/en_US/en_US.news.txt")
contwitter <- file("./Coursera-SwiftKey/final/en_US/en_US.twitter.txt")
suppressWarnings(blogs <- readLines(conblogs))
suppressWarnings(news <- readLines(connews))
suppressWarnings(twitter <- readLines(contwitter))
close(conblogs)
close(connews)
close(contwitter)
Let us have a brief look at the contents of each of the datasets.
blogs[4]
## [1] "so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home."
news[2]
## [1] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
twitter[1:4]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
## [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
Now, let us look at some properties of the raw data:

| Type | Total posts | Total words | Total characters | Words per post | Characters per post |
|---|---|---|---|---|---|
| Blogs | 899288 | 37334131 | 208361438 | 41.51521 | 231.69601 |
| News | 77259 | 2643969 | 15683765 | 34.22215 | 203.00243 |
| Twitter | 2360148 | 30373543 | 162384825 | 12.86934 | 68.80281 |
This makes sense: Twitter has the fewest words and characters per post, most likely due to its 140/280-character limit. News articles and blog posts have a similar average number of words and characters per post, although their total numbers of posts differ greatly.
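For reference, per-corpus summaries of this kind can be computed along the following lines. This is only a sketch: it assumes the stringi package is available, and stri_count_words() may count words slightly differently from the figures in the table above.
library(stringi)
summarise_corpus <- function(lines) {
  words <- stri_count_words(lines)   # words per post
  chars <- nchar(lines)              # characters per post
  c(posts = length(lines), words = sum(words), chars = sum(chars),
    words_per_post = mean(words), chars_per_post = mean(chars))
}
rbind(Blogs = summarise_corpus(blogs),
      News = summarise_corpus(news),
      Twitter = summarise_corpus(twitter))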
The datasets are massive, so we work with a sample instead. To account for the different numbers of posts and keep the contribution from each source comparable, we sample as follows:
set.seed(1234)
samplelines <- c(sample(blogs, length(blogs) * 0.1),
                 sample(news, length(news)),
                 sample(twitter, length(twitter) * 0.1))
samplelines <- gsub("[^a-zA-Z']", " ", samplelines)   # keep only letters and apostrophes
samplelines <- gsub(" {2,}", " ", samplelines)        # collapse repeated spaces
samplelines <- trimws(samplelines)                    # strip leading/trailing spaces
samplelines <- tolower(samplelines)
samplelines <- strsplit(samplelines, " ")             # one vector of words per post
totallength <- length(samplelines)
Let us now see how this data looks.
print(totallength)
## [1] 403201
head(samplelines, 2)
## [[1]]
## [1] "he" "looked" "back" "at" "me" "his" "eyes" "were"
## [9] "as" "dark" "as" "coal"
##
## [[2]]
## [1] "you've" "set" "up" "a" "problem" "without" "stakes"
## [8] "why" "does" "she" "care" "who" "the" "voice"
## [15] "on" "the" "phone" "is" "why" "would" "she"
## [22] "even" "listen" "to" "him" "past" "hello"
To build the model, we would ideally like a dictionary of counts for every n-gram. Here, we create dictionaries for n = 1, 2 and 3 in order to find the most common unigrams, bigrams and trigrams.
The following code snippet iterates over a slice of the sampled data (roughly 2% of the posts, to keep the run time manageable). If a word is not already present in the unigram list, its count is initialized to one; otherwise the existing count is incremented.
unigram <- list()
count <- totallength/50   # only process about 2% of the sampled posts
for (line in samplelines) {
  count <- count - 1
  if (count < 0) break
  for (word in line) {
    if (is.null(unigram[[word]]))
      unigram[[word]] <- 1
    else
      unigram[[word]] <- unigram[[word]] + 1
  }
}
unigram <- unigram[order(unlist(unigram), decreasing=TRUE)]
barplot(as.numeric(unigram[1:20]), names.arg=names(unigram[1:20]), las=2, col="blue", border="black", density=seq(100, 10, -4), main = "Unigrams", xlab = "Unigram", ylab = "Frequency")
The following code snippet iterates over the same slice of the sampled data, pairing each word with the word immediately before it. If a bigram is not already present in the bigram list, its count is initialized to one; otherwise the existing count is incremented.
bigram <- list()
count <- totallength/50
for (line in samplelines) {
  count <- count - 1
  if (count < 0) break
  if (length(line) < 2) next
  for (i in 2:length(line)) {
    create_bigram <- paste(line[i - 1], line[i])
    if (is.null(bigram[[create_bigram]]))
      bigram[[create_bigram]] <- 1
    else
      bigram[[create_bigram]] <- bigram[[create_bigram]] + 1
  }
}
bigram <- bigram[order(unlist(bigram), decreasing=TRUE)]
barplot(as.numeric(bigram[1:20]), names.arg=names(bigram[1:20]), las=2, col="magenta", border="black", density=seq(100, 10, -4), main = "Bigrams", xlab = "Bigram", ylab = "Frequency")
We repeat the same procedure for trigrams, combining each word with the two words immediately before it.
trigram <- list()
count <- totallength/50
for (line in samplelines) {
  count <- count - 1
  if (count < 0) break
  if (length(line) < 3) next
  for (i in 3:length(line)) {
    create_trigram <- paste(line[i - 2], line[i - 1], line[i])
    if (is.null(trigram[[create_trigram]]))
      trigram[[create_trigram]] <- 1
    else
      trigram[[create_trigram]] <- trigram[[create_trigram]] + 1
  }
}
trigram <- trigram[order(unlist(trigram), decreasing=TRUE)]
barplot(as.numeric(trigram[1:20]), names.arg=names(trigram[1:20]), las=2, col="green", border="black", density=seq(100, 10, -4), main = "Trigrams", xlab = "Trigram", ylab = "Frequency")
First of all, we would like to find out the number of unique words in the dictionary.
uniquewords <- length(unigram)
uniquewords
## [1] 26100
Now, we would like to find out how many unique words account for 90% of the total words occurring in the datasets.
totalwords <- 0
uniquewords <- 0
for (i in 1:length(unigram)) totalwords <- totalwords + unigram[[i]]
sum90 <- 0.9 * totalwords
for (i in 1:length(unigram)) {
  sum90 <- sum90 - unigram[[i]]
  uniquewords <- uniquewords + 1
  if (sum90 <= 0) break
}
print(c(uniquewords, uniquewords/length(unigram)*100))
## [1] 5535.0000 21.2069
Okay, so we see that about 20% of the unique words in the unigram list account for 90% of the total word occurrences in the sample. It looks like the Pareto principle in action. In quantitative linguistics this behaviour is described by Zipf's law, which states that a word's frequency is roughly inversely proportional to its frequency rank, so a handful of very common words dominate the text.
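A quick way to eyeball this is a rank-frequency plot of the unigram counts on log-log axes; under Zipf's law the points should fall roughly on a straight line. A minimal sketch, reusing the sorted unigram list from above:
unigramfreqs <- as.numeric(unlist(unigram))   # counts, already sorted in decreasing order
plot(seq_along(unigramfreqs), unigramfreqs, log = "xy",
     xlab = "Rank", ylab = "Frequency", main = "Unigram rank vs. frequency")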
How do we know whether what we're doing makes any sense? A good barometer is the Oxford English Corpus. It turns out that 18 of our top 20 unigrams appear in the OEC's top 20. Not bad for a crude first attempt!
We have now performed some basic exploratory analysis, and we have a rough idea of the structure of our datasets and of which words are most likely to come up while typing. The basic idea for the rest of the project will be to look up the most recently typed words in the n-gram dictionaries and return the word that most frequently follows them.
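As a rough illustration of that idea, a lookup of the following kind could serve as a starting point. This is only a sketch built on the bigram and trigram lists above; predict_next() is a hypothetical helper, and it ignores smoothing and any unseen histories beyond a simple fallback.
# Sketch: return the most frequent continuation of the last one or two typed words,
# preferring the trigram list and falling back to the bigram list.
predict_next <- function(words) {
  words <- tolower(words)
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- names(trigram)[startsWith(names(trigram), paste0(prefix, " "))]
    if (length(hits) > 0)
      return(tail(strsplit(hits[1], " ")[[1]], 1))   # lists are sorted, so hits[1] is the most frequent match
  }
  hits <- names(bigram)[startsWith(names(bigram), paste0(words[n], " "))]
  if (length(hits) > 0)
    return(tail(strsplit(hits[1], " ")[[1]], 1))
  NA_character_
}
predict_next(c("one", "of", "the"))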