Dependencies

Listed below are the packages required for this project, along with their versions and load-time messages. For more information on the quanteda Natural Language Processing (NLP) package, visit https://quanteda.io/

## Loading required package: quanteda
## Warning: package 'quanteda' was built under R version 4.0.3
## Package version: 2.1.2
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## Loading required package: dplyr
## Loading required package: ggplot2
## Loading required package: ggpubr
## Warning: package 'ggpubr' was built under R version 4.0.3
## Loading required package: readtext
## Warning: package 'readtext' was built under R version 4.0.3
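
For reference, the setup chunk behind these messages would look something like the following (the original chunk is not echoed in this report, so treat this as an equivalent sketch):

require(quanteda)   # NLP: tokenization, n-grams, document-feature matrices
require(dplyr)      # data manipulation and the %>% pipe
require(ggplot2)    # plotting
require(ggpubr)     # ggarrange() for combining plots
require(readtext)   # reading raw text data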

Getting and Cleaning Data

The data for this project come from SwiftKey and consist of unedited text from blogs, Twitter, and news articles in four languages (de_DE, en_US, fi_FI, and ru_RU).

# Download data:
if (!file.exists("final")) {
  dat_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  dest_file <- "Coursera-SwiftKey.zip"
  download.file(dat_url, dest_file, mode="wb")  # binary mode for the zip archive
  unzip(dest_file)
  file.remove(dest_file)
  
  # Clean up workspace:
  rm(dat_url, dest_file)
}

Creating Subsamples

The full dataset that SwiftKey has provided for this project is quite large (100+ MB per file), and including all of it slows processing to a crawl. To address this, and to aggregate our individual text sources in an unbiased way, we can create several random subsamples of our data and then combine them during tokenization to build our text corpora. The following code does exactly that, creating reduced-size .sample.txt files that sit alongside the original files in the SwiftKey data directory.

# Create random subsample:
build_samples <- function(prob=0.3, replace=FALSE) {
  for (f in list.files(file.path("final"), recursive=TRUE)) {
    # Skip files that are already subsamples:
    if (grepl(pattern = "(?<!\\.sample)\\.txt$", f, perl=TRUE)) {
      f <- file.path("final", f)
      output_name <- tools::file_path_sans_ext(basename(f))
      output_file <- file.path(dirname(f), paste0(output_name, ".sample.txt"))
      if (!file.exists(output_file) || replace) {
        set.seed(1234)  # reproducible sampling for each file
        text <- readLines(f, encoding="UTF-8", skipNul=TRUE)
        in_sample <- sample(text, floor(length(text) * prob))
        write_connection <- file(output_file)
        writeLines(in_sample, write_connection)
        close(write_connection)
      }
    }
  }
}

build_samples(prob=0.3, replace=FALSE)

Tokenization and Profanity Filtering

Tokenization is the process of splitting input text into individual “tokens” (words, numbers, punctuation, etc.) and is an integral part of nearly all NLP tasks. Ours is no exception: before we move on to any exploratory analysis, we need to read in the data and split it into tokens.

This step is also the best time to perform any desired token filtering (e.g. removing punctuation, non-word symbols, or profanity). Using a list of roughly 1,300 profane English words compiled by Carnegie Mellon University, we can do exactly that.

# Tokenization and filtering:
load_tokens <- function(language_code, filter=TRUE) {
  # Combine all subsampled files for the given language into one character vector:
  fulltext <- vector()
  for (f in list.files(file.path("final", language_code))) {
    if (grepl(pattern = "\\.sample\\.txt$", f, perl=TRUE)) {
      f <- file.path("final", language_code, f)
      text <- readLines(f, encoding="UTF-8", skipNul=TRUE)
      fulltext <- append(fulltext, text)
    }
  }
  if (filter) {
    # Fetch the profanity list once and cache it locally:
    if (!file.exists("profanity_list.txt")) {
      profanity_url <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
      download.file(profanity_url, "profanity_list.txt")
    }
    profanity_list <- readLines("profanity_list.txt")
    # Tokenize, drop punctuation/symbols, then remove profane tokens:
    tokens(fulltext, remove_punct=TRUE, remove_symbols=TRUE) %>%
      tokens_remove(pattern=profanity_list)
      #tokens_remove(pattern=stopwords("en"))
  } else {
    # Unfiltered tokens, kept for comparison in the exploratory analysis:
    tokens(fulltext)
  }
}

Exploratory Analysis

Before we begin, we should examine some summary statistics of our randomly sampled and merged dataset to make sure our exploratory analyses are representative of the full corpus.

tok <- load_tokens("en_US", filter=TRUE)
rawdfm <- tok %>%
  dfm(tolower=TRUE)

data.frame(line.count=length(tok),                # number of lines (documents)
           word.count=sum(sapply(tok, length)),   # total number of tokens
           unique.words=ncol(rawdfm))             # unique features in the DFM
##   line.count word.count unique.words
## 1    1001007   20985325       376496

The first step in building a predictive model for text is understanding the distribution of, and relationships between, words and phrases in the text. Luckily, quanteda includes functionality to easily create and manipulate Document-Feature Matrices (DFMs), which record how often each term (feature) occurs in each document. We will use these to see what the most common words are in our unfiltered dataset, and then compare with the filtered alternative to see what effect filtering has on the base data.
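
As a quick toy illustration of what a DFM records (documents in rows, features in columns, term counts in the cells; output omitted here):

# Two tiny documents produce a document-by-feature matrix of lowercased term counts:
toy <- tokens(c(doc1 = "The cat sat on the mat", doc2 = "The dog sat"))
dfm(toy, tolower=TRUE)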

onegram <- tok %>%
  dfm(tolower=TRUE) %>%
  textstat_frequency(n=10)

tok_nofilter <- load_tokens("en_US", filter=FALSE)
onegram_nofilter <- tok_nofilter %>%
  dfm(tolower=TRUE) %>%
  textstat_frequency(n=10)

g1 <- ggplot(onegram, aes(x=reorder(feature, -frequency), y=frequency)) + 
  geom_bar(stat="identity") + 
  labs(title = "",
       subtitle = "Filtered",
       x = "Feature",
       y = "Frequency")

g2 <- ggplot(onegram_nofilter, 
             aes(x=reorder(feature, -frequency), y=frequency)) + 
  geom_bar(stat="identity") + 
  labs(title = "Twitter: Top 10 1-grams",
       subtitle = "No filtering",
       x = "Feature",
       y = "Frequency")

ggarrange(g2, g1, nrow=1, ncol=2)
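
For a quick check of exactly which of the unfiltered top-10 features drop out of the filtered top 10, we can compare the two frequency tables directly with base R:

# Unfiltered top-10 features that are absent from the filtered top 10:
setdiff(onegram_nofilter$feature, onegram$feature)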

As the comparison shows, our unfiltered data contains a few punctuation tokens that we will want to exclude from our final prediction algorithm, so we will use only the filtered dataset from here on out. Now, let’s examine the most common 2- and 3-grams (sequences of two or three words, respectively).

twogram <- tok %>%
  tokens_ngrams(n=2) %>%
  dfm(tolower=TRUE) %>%
  textstat_frequency(n=10)

threegram <- tok %>%
  tokens_ngrams(n=3) %>%
  dfm(tolower=TRUE) %>%
  textstat_frequency(n=10)

g3 <- ggplot(twogram, aes(x=reorder(feature, -frequency), y=frequency)) + 
  geom_bar(stat="identity") + 
  labs(title = "Twitter: Top 10 2-grams",
       subtitle = "Filtered",
       x = "Feature",
       y = "Count") + 
  theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1))

g4 <- ggplot(threegram, aes(x=reorder(feature, -frequency), y=frequency)) + 
  geom_bar(stat="identity") + 
  labs(title = "Twitter: Top 10 3-grams",
       subtitle = "Filtered",
       x = "Feature",
       y = "Count") + 
  theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1))

ggarrange(g3, g4, nrow=1, ncol=2)

A degree of structure already emerges as we increase the size of our n-grams. This hints at the eventual shape of our prediction algorithm and is an encouraging sign for its accuracy.

As a point of interest, one might ask just how much of the text the most common words actually account for. As a first pass, we can look at how the counts concentrate among the ten most frequent terms, using their combined count from the table above as the denominator (a sketch extending this to the whole corpus follows below). We already have all the information needed to do so.

# Combined count of the ten most frequent terms (denominator for the proportions below):
total_count <- sum(onegram$frequency)
tok %>%
  dfm(tolower=TRUE) %>%
  textstat_frequency() %>%
  mutate(cumprop = cumsum(frequency / total_count)) %>%
  filter(cumprop <= 0.5)
##   feature frequency rank docfreq group   cumprop
## 1     the    881815    1  401159   all 0.2056950
## 2      to    577274    2  343092   all 0.3403518
## 3     and    478869    3  271299   all 0.4520543

It turns out that the three most frequent terms already account for roughly 45% of that total. How many does it take to approach 90%?

tok %>%
  dfm(tolower=TRUE) %>%
  textstat_frequency() %>%
  mutate(cumprop = cumsum(frequency / total_count)) %>%
  filter(cumprop <= 0.9)
##   feature frequency rank docfreq group   cumprop
## 1     the    881815    1  401159   all 0.2056950
## 2      to    577274    2  343092   all 0.3403518
## 3     and    478869    3  271299   all 0.4520543
## 4       a    471879    4  292380   all 0.5621263
## 5       i    450762    5  262758   all 0.6672725
## 6      of    388557    6  230022   all 0.7579085
## 7      in    307104    7  213689   all 0.8295446
## 8     you    254986    8  184409   all 0.8890234
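
To measure coverage of the corpus as a whole rather than concentration within the top ten, we can divide by the total token count instead and ask how many unique terms are needed to cover 50% or 90% of all word occurrences. A minimal sketch using the full frequency table (results not shown here):

# Sketch: corpus-wide coverage based on the full frequency table of rawdfm.
full_freq <- textstat_frequency(rawdfm)   # every feature, sorted by descending frequency
cumprop_all <- cumsum(full_freq$frequency) / sum(full_freq$frequency)
# First rank at which cumulative coverage reaches 50% and 90% of all tokens:
c(cover_50 = which(cumprop_all >= 0.5)[1],
  cover_90 = which(cumprop_all >= 0.9)[1])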

Basic Prediction Algorithm

We’ve already observed that as we increase n from 1 to 3 in our n-gram analysis, a degree of structure emerges that hints at predictive capacity. For n > 1, this can be exploited by constructing an (n-1)-gram and then predicting the final component according to its frequency in the larger n-gram. This task will end up becoming the beating heart of our prediction algorithm, and is a (relatively) straightforward application of prediction via machine learning. As such, our next tasks will be to build training and test datasets, train an accurate model, and verify its accuracy.