The SwiftKey corpus is an unstructured data set of freeform text collected from blogs, news articles, and Twitter. The data is available in four languages (English, German, Russian, and Finnish); this report focuses only on the English corpus. I do not detect or remove words from foreign languages, since I am not using any language dictionaries at this point. The following report conducts exploratory analysis on the data set, with the eventual goal of developing a model that predicts the next word from the words the user has already typed. The user will see suggestions for the most likely next words to save time while typing. Like this exploratory analysis, the model will likely only work for the English language.
files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")
for (filename in files) {
  print(paste('Filename:', filename))
  con <- file(filename, "r")
  lines <- suppressWarnings(readLines(con))
  print(paste('Line count:', length(lines)))
  word_count <- 0
  for (line in lines) {
    # Lower-case, keep only letters and spaces, then split on whitespace
    lower <- tolower(line)
    alpha_only <- gsub("[^a-z ]", "", lower)
    tokens <- strsplit(alpha_only, "\\s+")[[1]]
    word_count <- word_count + length(tokens)
  }
  print(paste('Word count:', word_count))
  close(con)
}
## [1] "Filename: en_US.twitter.txt"
## [1] "Line count: 2360148"
## [1] "Word count: 29417374"
## [1] "Filename: en_US.blogs.txt"
## [1] "Line count: 899288"
## [1] "Word count: 36869796"
## [1] "Filename: en_US.news.txt"
## [1] "Line count: 1010242"
## [1] "Word count: 33507944"
We now prepare bar charts of the most frequent words, bigrams, and trigrams, using a roughly 1% random sample of each file to keep processing time manageable.
suppressMessages(library(dplyr))
set.seed(123)
sample_percentage <- 0.01 # sample roughly 1% of the corpus to speed up processing
word_count <- data.frame()
bigram_count <- data.frame()
trigram_count <- data.frame()
for (filename in files) {
  con <- file(filename, "r")
  lines <- suppressWarnings(readLines(con))
  for (line in lines) {
    if (rbinom(1, 1, sample_percentage) == 0) {
      # Skip roughly 99% of lines, keeping an approximately 1% random sample
      next
    }
    lower <- tolower(line)
    alpha_only <- gsub("[^a-z ]", "", lower)
    tokens <- strsplit(alpha_only, "\\s+")[[1]]
    if (length(tokens) == 0) {
      next
    }
    bigrams <- c()
    if (length(tokens) >= 2) {
      for (i in 2:length(tokens)) {
        bigrams <- c(bigrams, paste(tokens[i-1], tokens[i]))
      }
    }
    trigrams <- c()
    if (length(tokens) >= 3) {
      for (i in 3:length(tokens)) {
        trigrams <- c(trigrams, paste(tokens[i-2], tokens[i-1], tokens[i]))
      }
    }
    word_count <- rbind(word_count, data.frame(token = tokens, count = 1, stringsAsFactors = FALSE))
    if (length(bigrams) > 0) {
      bigram_count <- rbind(bigram_count, data.frame(bigram = bigrams, count = 1, stringsAsFactors = FALSE))
    }
    if (length(trigrams) > 0) {
      trigram_count <- rbind(trigram_count, data.frame(trigram = trigrams, count = 1, stringsAsFactors = FALSE))
    }
  }
  close(con)
  # Aggregate once per file rather than once per line to keep the tables small
  word_count <- word_count %>%
    group_by(token) %>%
    summarise(count = sum(count))
  bigram_count <- bigram_count %>%
    group_by(bigram) %>%
    summarise(count = sum(count))
  trigram_count <- trigram_count %>%
    group_by(trigram) %>%
    summarise(count = sum(count))
}
word_count <- word_count %>%
  arrange(desc(count))
bigram_count <- bigram_count %>%
  arrange(desc(count))
trigram_count <- trigram_count %>%
  arrange(desc(count))
stopping_point <- 0.9
# Cumulative share of all occurrences covered by the most frequent items
word_count$cdf <- cumsum(word_count$count) / sum(word_count$count)
word_count <- word_count %>%
  filter(cdf <= stopping_point)
ninety_coverage <- nrow(word_count)
bigram_count$cdf <- cumsum(bigram_count$count) / sum(bigram_count$count)
bigram_count <- bigram_count %>%
  filter(cdf <= stopping_point)
bigram_ninety_coverage <- nrow(bigram_count)
trigram_count$cdf <- cumsum(trigram_count$count) / sum(trigram_count$count)
trigram_count <- trigram_count %>%
  filter(cdf <= stopping_point)
trigram_ninety_coverage <- nrow(trigram_count)
stopping_point <- 0.5
word_count <- word_count %>%
  filter(cdf <= stopping_point)
fifty_coverage <- nrow(word_count)
bigram_count <- bigram_count %>%
  filter(cdf <= stopping_point)
bigram_fifty_coverage <- nrow(bigram_count)
trigram_count <- trigram_count %>%
  filter(cdf <= stopping_point)
trigram_fifty_coverage <- nrow(trigram_count)
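To make the coverage calculation concrete, here is the same cumulative-share logic applied to a toy frequency table with made-up counts:
toy <- data.frame(token = c("the", "to", "and", "cat"), count = c(50, 30, 15, 5))
toy$cdf <- cumsum(toy$count) / sum(toy$count)   # 0.50 0.80 0.95 1.00
sum(toy$cdf <= 0.9)                             # 2: the two most frequent words cover 80%,
                                                # and the third would push coverage past 90%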
barplot(word_count$count[1:20], names.arg = word_count$token[1:20], las = 2,
        main = "Twenty most frequent words in the sampled corpus")
barplot(bigram_count$count[1:20], names.arg = bigram_count$bigram[1:20], las = 2,
        main = "Twenty most frequent bigrams in the sampled corpus")
par(mar = c(8, 4, 4, 2) + 0.1) # widen the bottom margin so the long trigram labels fit
barplot(trigram_count$count[1:20], names.arg = trigram_count$trigram[1:20], las = 2,
        main = "Twenty most frequent trigrams in the sampled corpus")
The number of unique words needed to cover 50% of all word occurrences in the sample is 49449; the number needed to cover 90% is 56739.
The number of unique bigrams needed to cover 50% of all bigram occurrences is 128234; the number needed to cover 90% is 440156.
The number of unique trigrams needed to cover 50% of all trigram occurrences is 403497; the number needed to cover 90% is 769918.
Since there are more possible trigram combinations than bigram combinations, more unique trigrams are needed to cover a majority of the corpus than unique bigrams. The distribution of trigrams is more uniform and less concentrated around a few popular phrases than that of bigrams. By the same logic, the distribution of bigrams is more spread out than that of individual words.
I plan to use a Markov chain model to predict the next word the user is going to type, conditioned on the previous two words. In simpler terms, I will identify the most likely candidates for the next word based on what people have typed on the Internet in the past and suggest them to the user. For simplicity, there will be no per-user customization and I will focus on the English language only. For the very first word, I will simply suggest the most common words; for the second word, I will use a bigram model; from the third word onward, I will use the trigram model. I want to build a simple Shiny app with a single text box and a drop-down menu of word suggestions. A rough sketch of this backoff logic is shown below.
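Below is a minimal sketch of the planned backoff lookup. It assumes the word_count, bigram_count, and trigram_count tables from above are kept in memory with character (not factor) columns and sorted by descending count; predict_next_word is a hypothetical helper name, not part of the analysis code, and the final model may differ.
# Minimal sketch of the planned backoff lookup (hypothetical helper, not final code)
predict_next_word <- function(prev_words, n = 3) {
  prev_words <- tolower(prev_words)
  k <- length(prev_words)
  if (k >= 2) {
    # Trigram model: condition on the previous two words
    prefix <- paste(prev_words[k - 1], prev_words[k])
    hits <- trigram_count[startsWith(trigram_count$trigram, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(head(sub(".* ", "", hits$trigram), n))
  }
  if (k >= 1) {
    # Bigram model: condition on the previous word only
    hits <- bigram_count[startsWith(bigram_count$bigram, paste0(prev_words[k], " ")), ]
    if (nrow(hits) > 0) return(head(sub(".* ", "", hits$bigram), n))
  }
  # Unigram fallback: the most common words overall
  head(word_count$token, n)
}
predict_next_word(c("thanks", "for"))   # suggests the words most often seen after "thanks for"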