Synopsis

This exploratory analysis examines some Twitter, blog and news corpora - retrieved via Coursera - as a first step towards creating a SwiftKey application. We want to get a first sense of how the word counts are distributed, and what the most frequent words (unigrams), pairs (bigrams), triples (trigrams) and quadruples (quadgrams) are. This will help us develop an application that suggests, i.e. predicts, the next word while typing.

There are no dependencies: only base R is used, without any additional packages.

Raw Data

The data link in the synopsis provides a directory ‘final’ that should be placed in R’s working directory. We then open a connection to each file, read its lines into the variables tweets, news and blogs, and finally close the connections.

con1 <- file("final/en_US/en_US.twitter.txt", "r")
con2 <- file("final/en_US/en_US.news.txt", "r")
con3 <- file("final/en_US/en_US.blogs.txt", "r")

tweets <- readLines(con1)
news <- readLines(con2)
blogs <- readLines(con3)

close(con1); close(con2); close(con3)

To see what we are working with, let us have a look at a few lines of data.

tweets[1:3]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
news[2:3]
## [1] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."                        
## [2] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
blogs[3]
## [1] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."

In the table below, I have summarized some facts about the data with inline code: the number of texts, characters and words per file, and the average number of characters and words per text.

File               Number of Texts     Total Chars   Characters/Text   Total Words   Words/Text
en_US.twitter.txt  2360148 tweets      162384825     69                30373543      13
en_US.news.txt     77259 news items    15683765      203               2643969       34
en_US.blogs.txt    899288 blog posts   208361438     232               37334131      42

This gives us a first sense of what the data looks like. The tweets are, as expected, generally shorter than the blog posts and news items, and we get an idea of how many characters and words the three kinds of texts contain.
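The inline code behind the table is not shown, so the following is only a minimal sketch of how such figures could be computed in base R. The helper name summarize_corpus and the choice to count words as whitespace-separated tokens are assumptions, not the method used for the table.

```r
# Hypothetical helper: summarise one corpus held as a character vector.
summarize_corpus <- function(texts) {
    chars <- sum(nchar(texts))
    # Assumption: a "word" is any run of non-whitespace characters.
    words <- sum(lengths(strsplit(trimws(texts), "\\s+")))
    c(texts = length(texts),
      chars = chars,
      chars_per_text = round(chars / length(texts)),
      words = words,
      words_per_text = round(words / length(texts)))
}

# Toy input instead of the real corpora.
toy <- c("How are you?", "Mission accomplished", "they've decided its more fun")
toy_stats <- summarize_corpus(toy)
toy_stats
```

On the real data this would be called as summarize_corpus(tweets), and likewise for news and blogs. A different definition of a word would give somewhat different totals.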

Tidy Data

Because there is a lot of data, we sample. We do so by concatenating a random 1% of the tweets, 20% of the news items and 2% of the blog posts, so that the three sources contribute somewhat comparable proportions (based on the total number of characters). The exact percentages are somewhat arbitrary; they mainly serve to reduce the dominance of the tweets. We set the seed for reproducibility.

set.seed(20)
samplelines <- c(sample(tweets, length(tweets) * 0.01),
                 sample(news, length(news) * 0.2),
                 sample(blogs, length(blogs) * 0.02))
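As a rough check on these percentages, the expected size of the sample can be worked out from the totals in the summary table. This is a back-of-the-envelope sketch: the variable names are illustrative, and the character figures assume the sampled lines have the average length of their source.

```r
# Totals copied from the summary table of the raw data.
n_texts  <- c(tweets = 2360148, news = 77259, blogs = 899288)
n_chars  <- c(tweets = 162384825, news = 15683765, blogs = 208361438)
fraction <- c(tweets = 0.01, news = 0.2, blogs = 0.02)

# sample() truncates a fractional size, hence floor().
expected_lines <- floor(n_texts * fraction)
expected_chars <- n_chars * fraction

# Approximate share of characters contributed by each source, in percent.
char_share <- round(100 * expected_chars / sum(expected_chars))

sum(expected_lines)
char_share
```

Under these assumptions the tweets contribute roughly 18% of the sampled characters, the news items 35% and the blog posts 47%, so the proportions are comparable rather than equal.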

Time to clean. We keep only the letters, plus the apostrophe to preserve words like ‘don’t’. We also remove any redundant whitespace, set everything to lowercase, and split each line into words. The result is a list of character vectors: each list element still represents one line as before, and each vector contains that line's separate words, as the examples below show.

samplelines <- gsub("[^a-zA-Z']", " ", samplelines)  # Extract text only
samplelines <- gsub(" {2,}", " ", samplelines)       # Remove double spaces
samplelines <- trimws(samplelines)                   # Remove outer whitespace
samplelines <- tolower(samplelines)                  # Set to lower case
samplelines <- strsplit(samplelines, " ")            # Split on spaces

totallength <- length(samplelines)
head(samplelines, 2)
## [[1]]
## [1] "mission"      "accomplished"
## 
## [[2]]
##  [1] "it's"      "good"      "to"        "encourage" "your"     
##  [6] "lil"       "sister"    "to"        "cut"       "class"    
## [11] "for"       "nyg"       "parade"    "but"       "even"     
## [16] "better"    "to"        "convince"  "your"      "mom"      
## [21] "to"        "skip"      "work"

As a result, we have 57037 lines of mixed tweets, blog posts and news items. The words are now nicely cleaned up and separated.
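To see the effect of each cleaning step on a single string, the same calls can be wrapped in a small function. The name clean_line and the example sentence are made up for illustration.

```r
# Hypothetical wrapper around the cleaning steps used above.
clean_line <- function(x) {
    x <- gsub("[^a-zA-Z']", " ", x)  # Keep letters and straight apostrophes
    x <- gsub(" {2,}", " ", x)       # Collapse repeated spaces
    x <- trimws(x)                   # Remove outer whitespace
    x <- tolower(x)                  # Set to lower case
    strsplit(x, " ")                 # Split on spaces
}

cleaned <- clean_line("It's 5pm -- DON'T be late!!")[[1]]
cleaned
```

Digits and punctuation are dropped, so "5pm" becomes "pm". Only the straight apostrophe is kept by this pattern; a typographic apostrophe would be replaced by a space, which splits such a word in two.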

N-Gram Dictionaries

Ultimately, we want a dictionary for every n-gram, containing the word(s) of the n-gram as the key and the number of occurrences as the value. First of all, I initialize four lists that will serve as the dictionaries.

unigrams = list()
bigrams = list()
trigrams = list()
quadgrams = list()

For every n-gram I will now fill the dictionary, sort it by frequency and plot the 30 most frequent entries.

Unigrams

For the unigrams, we loop through the words per line. For each word I check the dictionary: if the word doesn’t exist yet, its value is set to 1. If it does exist, its value is incremented by 1. To make the algorithm less computationally intensive, we stop adding new words to the dictionary after 10% of the lines have passed; words that are already in the dictionary keep being counted. By that time, common unigrams are very likely to be in the dictionary already. This is what the counter and flag variables are for.

counter = 1
for(line in samplelines) {
    flag = counter < totallength * 0.1
    for(word in line) {
        if(is.null(unigrams[[word]])) {
            if(flag) unigrams[[word]] = 1
        } else unigrams[[word]] = unigrams[[word]] + 1
    }
    counter = counter + 1
}
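For comparison, base R can also count all words in one pass with table(). Note that this is a different technique from the loop above: it counts every word in the sample and has no 10% cutoff for new words. The toy input below stands in for samplelines.

```r
# Toy stand-in for samplelines: a list of character vectors.
toy_lines <- list(c("to", "be", "or", "not", "to", "be"),
                  c("to", "do"))

# Flatten the lines, count each distinct word, sort by decreasing frequency.
toy_counts <- sort(table(unlist(toy_lines)), decreasing = TRUE)

toy_counts[["to"]]  # number of times "to" occurs
```

The result is a named table rather than a list, so the later steps would need small adjustments if it were used in place of the dictionary.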

Sorting and visualizing.

unigrams <- unigrams[order(unlist(unigrams), decreasing=TRUE)]

barplot(as.numeric(unigrams[1:30]), names.arg=names(unigrams[1:30]), las=2, col="darkred", border="black", density=seq(100, 10, -3), main = "Unigrams", ylab = "# of instances")

Bigrams

The remaining n-grams work analogously, with a few side notes. For the bigrams, the previous word is stored and pasted together with the current one to get the bigram name, and we can only start at the second word of each line.

counter = 1
for(line in samplelines) {
    flag = counter < totallength * 0.1
    for(i in seq_along(line)) {
        word <- line[i]
        # Check the position, not the value, so that a later repeat
        # of the line's first word is not skipped by mistake
        if(i > 1) {
            bigramname <- paste(last, word)
            if(is.null(bigrams[[bigramname]])) {
                if(flag) bigrams[[bigramname]] = 1
            } else bigrams[[bigramname]] = bigrams[[bigramname]] + 1
        }
        last <- word
    }
    counter = counter + 1
}

Sorting and visualizing.

bigrams <- bigrams[order(unlist(bigrams), decreasing=TRUE)]

barplot(as.numeric(bigrams[1:30]), names.arg=names(bigrams[1:30]), las=2, col="darkblue", border="black", density=seq(100, 10, -3), main = "Bigrams", ylab = "# of instances")

Trigrams

For the trigrams, the second-to-last word is stored as well and pasted in to get the name, and we can only start at the third word of each line.

counter = 1
for(line in samplelines) {
    flag = counter < totallength * 0.1
    for(i in seq_along(line)) {
        word <- line[i]
        # Check positions, not values, so that repeated words
        # neither skip a trigram nor corrupt the stored history
        if(i > 1) {
            if(i > 2) {
                trigramname <- paste(secondlast, last, word)
                if(is.null(trigrams[[trigramname]])) {
                    if(flag) trigrams[[trigramname]] = 1
                } else trigrams[[trigramname]] = trigrams[[trigramname]] + 1
            }
            secondlast <- last
        }
        last <- word
    }
    counter = counter + 1
}

Sorting and visualizing.

trigrams <- trigrams[order(unlist(trigrams), decreasing=TRUE)]

par(mar=c(8, 4, 4, 2)) # Create space for vertical labels
barplot(as.numeric(trigrams[1:30]), names.arg=names(trigrams[1:30]), las=2, col="darkgreen", border="black", density=seq(100, 10, -3), main = "Trigrams", ylab = "# of instances")

Quadgrams

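For the quadgrams, we paste together four words for a name and can only start at the fourth word of each line. Note that new quadgrams are admitted during the first 20% of the lines, rather than the first 10%.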

counter = 1
for(line in samplelines) {
    flag = counter < totallength * 0.2
    for(i in seq_along(line)) {
        word <- line[i]
        # Check positions, not values, so that repeated words
        # neither skip a quadgram nor corrupt the stored history
        if(i > 1) {
            if(i > 2) {
                if(i > 3) {
                    quadname <- paste(thirdlast, secondlast, last, word)
                    if(is.null(quadgrams[[quadname]])) {
                        if(flag) quadgrams[[quadname]] = 1
                    } else quadgrams[[quadname]] = quadgrams[[quadname]] + 1
                }
                thirdlast <- secondlast
            }
            secondlast <- last
        }
        last <- word
    }
    counter = counter + 1
}

Just to see how this actually works, here are the first four quadgrams with their frequencies, shown before sorting. Note how the four-word window shifts across the sentence one word at a time. The quadgrams do not span the boundary between text lines, which makes sense. These four happen to occur only once.

quadgrams[1:4]
## $`it's good to encourage`
## [1] 1
## 
## $`good to encourage your`
## [1] 1
## 
## $`to encourage your lil`
## [1] 1
## 
## $`encourage your lil sister`
## [1] 1
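The sliding window can also be written as a small function that works for any n. This is a sketch for illustration only; the name ngrams_of is made up and the function is not part of the analysis above.

```r
# Hypothetical helper: all n-grams of one tokenised line, built by position.
ngrams_of <- function(words, n) {
    # A line shorter than n words contains no n-gram.
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
}

line <- c("it's", "good", "to", "encourage", "your", "lil", "sister")
quads <- ngrams_of(line, 4)
quads
```

Because the window is built per line, no n-gram can span two lines, and a line that is too short, such as "mission accomplished", simply yields nothing for n = 4.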

Sorting and visualizing.

quadgrams <- quadgrams[order(unlist(quadgrams), decreasing=TRUE)]

par(mar=c(10, 4, 4, 2)) # Create space for vertical labels
barplot(as.numeric(quadgrams[1:30]), names.arg=names(quadgrams[1:30]), las=2, col="darkorange", border="black", density=seq(100, 10, -3), main = "Quadgrams", ylab = "# of instances")

Occurrences vs. Uniques

How many unique words account for 50% and 90% of word instances?

To answer this question, we first need the total number of word instances. To get this, we sum all the word counts in the unigram dictionary.

instances <- sum(as.numeric(unigrams))
c(length(unigrams), instances)
## [1]   10724 1381988

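Our unigram dictionary holds 10724 unique words, occurring 1381988 times in total. (Because new words were only admitted during the first 10% of the lines, rarer words from the rest of the sample are not counted.) Next, we loop over the unigrams, which are sorted by decreasing count, and stop adding word counts once the running total exceeds 50% of the total number of instances. Meanwhile, we count how many unique words have passed.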

uniques = 0
total = 0
fiftymark <- instances * 0.5
for(wordcount in unigrams) {
    total <- total + wordcount
    uniques <- uniques + 1
    if(total > fiftymark) break
}
print(c(uniques, uniques/length(unigrams)*100))
## [1] 80.0000000  0.7459903

We need 80 unique words - only 0.7% of the total unique words - to cover 50% of all word instances. The algorithm for the 90% part is almost identical.

uniques = 0
total = 0
for(wordcount in unigrams) {
    total <- total + wordcount
    uniques <- uniques + 1
    if(total > instances * 0.9) break
}
print(c(uniques, uniques/length(unigrams)*100))
## [1] 2105.00000   19.62887

To cover 90% of all word instances, we need 2105 unique words, or 19.6% of the total unique words. The takeaway is that a small number of words accounts for the majority of instances, while a long tail of other words occurs only rarely. This supports our choice not to admit new n-grams after 10-20% of the data has passed.
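The two loops can be condensed into one vectorised helper based on cumsum(). This is a sketch under the same assumption as the loops, namely that the counts are already sorted in decreasing order; the name words_needed and the toy counts are illustrative.

```r
# Hypothetical helper: how many of the top entries are needed before the
# running total exceeds the given share of all instances.
# `counts` must already be sorted in decreasing order.
words_needed <- function(counts, share) {
    unname(which(cumsum(counts) > sum(counts) * share)[1])
}

# Toy counts: 100 instances spread over five words.
toy_counts <- c(the = 50, to = 25, and = 10, cat = 9, dog = 6)
need_50 <- words_needed(toy_counts, 0.5)
need_90 <- words_needed(toy_counts, 0.9)
c(need_50, need_90)
```

On the real dictionary the call would be words_needed(unlist(unigrams), 0.5), which uses the same strict "greater than" test as the loops.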

We will use this idea to our advantage and scale the dictionaries down to get better performance. Here, I remove all entries with fewer than 5 occurrences from the dictionaries. This removes a large part of each dictionary, but preserves the majority of instances.

# Keep only the entries that occur at least 5 times
unigrams <- unigrams[unlist(unigrams) >= 5]
bigrams <- bigrams[unlist(bigrams) >= 5]
trigrams <- trigrams[unlist(trigrams) >= 5]
quadgrams <- quadgrams[unlist(quadgrams) >= 5]

Final Thoughts

We explored the data and got a sense of what kind of words should be suggested while typing. The four generated dictionaries should help us predict the next word while typing. One idea is to check whether the most recently typed words form the start of any n-gram and, if so, suggest the final word of that n-gram. We could for instance start with the quadgrams, and then ‘fall back’ on the trigrams and bigrams if the combination is not present.
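The fall-back idea could look roughly like the sketch below. It is not an implementation from this analysis: the helper names best_match and predict_next, the toy dictionaries and the rule of taking the most frequent matching n-gram are all assumptions made for illustration.

```r
# Toy dictionaries in the same shape as above: n-gram name -> count.
toy_bigrams  <- list("thank you" = 9, "thank god" = 4, "of the" = 20)
toy_trigrams <- list("i love you" = 7, "i love it" = 3)

# Hypothetical helper: last word of the most frequent n-gram starting with prefix.
best_match <- function(dict, prefix) {
    hits <- dict[startsWith(names(dict), paste0(prefix, " "))]
    if (length(hits) == 0) return(NULL)
    best <- names(hits)[which.max(unlist(hits))]
    sub(".* ", "", best)  # Keep only the last word
}

# Hypothetical back-off: try the longest context first, then shorter ones.
# `dicts` must be ordered from the highest n to the lowest.
predict_next <- function(typed, dicts) {
    words <- strsplit(tolower(trimws(typed)), " +")[[1]]
    for (dict in dicts) {
        if (length(dict) == 0) next
        # Context length is n - 1, read off the first key of the dictionary.
        n_context <- lengths(strsplit(names(dict)[1], " ")) - 1
        if (length(words) < n_context) next
        prefix <- paste(tail(words, n_context), collapse = " ")
        guess <- best_match(dict, prefix)
        if (!is.null(guess)) return(guess)
    }
    NULL  # Nothing matched in any dictionary
}

toy_dicts <- list(toy_trigrams, toy_bigrams)
predict_next("I love", toy_dicts)    # matched in the trigrams
predict_next("most of", toy_dicts)   # falls back to the bigrams
```

With the real dictionaries the call would be predict_next(typed, list(quadgrams, trigrams, bigrams)). Scanning all keys with startsWith() is simple but slow on large dictionaries, so a practical version would probably index the n-grams by prefix.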

As the dictionaries will be useful later on, we save them for reuse.

if(!dir.exists("n-grams")) dir.create("n-grams")
save(unigrams, file = "n-grams/unigrams.rda")   # load(file = "n-grams/unigrams.rda")
save(bigrams, file = "n-grams/bigrams.rda")     # load(file = "n-grams/bigrams.rda")
save(trigrams, file = "n-grams/trigrams.rda")   # load(file = "n-grams/trigrams.rda")
save(quadgrams, file = "n-grams/quadgrams.rda") # load(file = "n-grams/quadgrams.rda")