This notebook is an account of my very first success in a prediction challenge. Yes, it is a moderate success. Yes, I know that with a \(68.4\%\) performance I won’t be calling myself a Kaggle Grandmaster in the short term. Still, you gotta start from somewhere. Most people do not excel at what they do from the get go. It is through hard work and steady baby steps that we shall emerge victorious.

One thing that permeates my coding style is my adherence to this quote attributed to legendary English statistician Ronald Fisher: the statistician cannot evade the responsibility for understanding the process he applies or recommends. For me, a function whose inner workings I do not grasp and whose output I cannot reproduce is a function that might as well be locked inside a treasure chest. If I wish to unlock it I have to embark on a quest in which my worth as a user of that function is going to be put to test. Naturally, that quest consists in reproducing it by my own means and through my own ingenuity.

I’m not the kind of person who would gleefully import a package and trust that it does exactly what I require it to do while I cross my fingers in uplifted, naive expectation. And now, staying true to my principles, I have implemented Burrows’s Delta, Argamon’s Quadratic Delta and Aldridge & Smith’s Cosine Delta methods for authorship attribution from scratch, and avoiding reliance on packages other than plain old tidyverse as much as possible.

The problem that has motivated this notebook is the Spooky Author Identification Competition. This is a text classification task. As part of the competition two data sets were provided —training and test—. The training data set consists of three columns: id —a code that identifies the respective document—, text —the document itself—, and author —a three-letter character string used to identify the author of the document—. The test data set lacks the author column and, to my knowledge, the actual identity of the author of each test document was never disclosed, which renders us unable to measure the accuracy of our predictions.

No worries, though. We can create our own test data by randomly extracting —while also trying (through stratified sampling) to maintain the proportion of rows by author— from the training set the \(30\%\) of its rows.

# Provided that you have already downloaded the train.csv file from
# https://www.kaggle.com/c/spooky-author-identification/data 
# and then loaded it as a data frame called excerpts onto your current R session. 
# Make sure characters are not turned into factors!

# Also, you need to have installed the caret and tidyverse packages.
suppressWarnings(library(tidyverse))

## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Unless we execute the following line of code, whenever we use dplyr's summarize function on a 
# grouped tabular data set a message will be displayed, which can be frankly annoying.  
options(dplyr.summarise.inform = FALSE)

# The following line of code is not mandatory. I usually resort to the read.csv function to 
# import tabular data. The output of that function is a data frame. Now, I don't have anything 
# personal against data frames. Data frames are rad. It's just that what is displayed when we 
# print a data frame is not so nice. That's why I'm turning the excerpts data frame into a 
# tibble.
excerpts = as_tibble(excerpts)

# This is it. The following code splits through stratified sampling the excerpts into 
# 70% (corpus tibble) and 30% (test tibble).
set.seed(779)
indices = caret::createDataPartition(excerpts[["author"]], p = 0.7, list = F)

corpus = slice(excerpts, indices)
test = slice(excerpts, -indices)

Pardon me if I’m stating the obvious here, but there are palpable benefits to be enjoyed from the so-called divide and conquer strategy. From the point of view of execution, compartmentalizing a complex task into more manageable bits is a no-brainer. It requires us, though, to think in a systematic, methodical fashion. Getting used to that thought process demands training. From a standpoint of self-education, anyone who is serious about becoming a data scientist should, far from slipping away from such growth opportunities, actively seek tasks and projects which help develop a programmer’s mindset.

Reader, be warned: this notebook is an exercise in methodical thinking —the only admissible kind of thinking when writing code, actually—. Methodical approaches are not necessarily direct nor sequential. In order to make sure that all gears will fit in the desired manner we will have to engage both in foresighting and backtracking. I’m talking about stuff such as first declaring a general-purpose function that takes some other —rather specialized— function as input and only afterwards declaring those specialized functions that will be passed to the former one at a time —this is functional programming, a topic just slightly above beginner’s level—. I will also have to withhold a proper explanation on why we require to do certain steps right until the product of these steps becomes relevant in the grand scheme of things. At times it may seem like we are straying away from the main objective, so you will have to be patient.

These are the scheduled stops in today’s itinerary:

the lungs that breathe air into corpus-related quantitative inquiries: tokenizer functions
a brief detour into the general concepts that are part of everyday chatter in authorship attribution and stylometry
vital components of our system: the ability to create two-way tables and then compute statistics out of them
much-needed functional programming: declaring auxiliary functions
actual calculation of the Deltas (results-oriented readers in a hurry might want to skip right down to this section)

1 Tokenization

The first ingredient we need to collect in order to engage in almost any corpus-related quantitative analysis is a tokenizer function.

The process of tokenization consists in breaking up a text string into tokens. Regarding what entities are to be contemplated as tokens, we have freedom to decide. A text string can either be broken up into sentences, words, syllables or letters. Quite often the most useful tokens are words, and consequently we are going to break up all documents into the words that compose them. To that end the tidytext package makes available to us the unnest_tokens function.

Here is the second document in the corpus.

slice(corpus, 2)

## # A tibble: 1 x 3
##   id      text                                                            author
##   <chr>   <chr>                                                           <chr> 
## 1 id19322 I knew that you could not say to yourself 'stereotomy' without… EAP

And here is that very same document, but after having applied to it the unnest_tokens function.

# Even though its name suggests otherwise, the tidytext package does not belong to the 
# tidyverse. Therefore, you will have to install it before running the code of this chunk.
slice(corpus, 2) %>% tidytext::unnest_tokens(word, text)

## # A tibble: 88 x 3
##    id      author word      
##    <chr>   <chr>  <chr>     
##  1 id19322 EAP    i         
##  2 id19322 EAP    knew      
##  3 id19322 EAP    that      
##  4 id19322 EAP    you       
##  5 id19322 EAP    could     
##  6 id19322 EAP    not       
##  7 id19322 EAP    say       
##  8 id19322 EAP    to        
##  9 id19322 EAP    yourself  
## 10 id19322 EAP    stereotomy
## # … with 78 more rows

What the unnest_tokens function returns is a new tibble —or data frame; they are almost like synonyms, and I’m going to use these words almost indistinctively— where each row represents a word of the document¹ —that is, if the document is composed of \(150\) words then the new tibble is going to have \(150\) rows—. This type of representation of a document has been termed the tidy format by Julia Silge and David Robinson —creators of the tidytext package—.

Notice that the first argument being submitted to the unnest_tokens function is a tibble —slice(corpus, 2)—. The second argument is the name that is to be assigned to the column that will display the individual words in the new tibble, and the third argument is the name of the column of the original tibble which houses the texts that we wish to tokenize.

Now, what happens if rather than feeding the unnest_tokens function a single row of our corpus we feed it the whole corpus instead?

tidytext::unnest_tokens(corpus, word, text)

## # A tibble: 366,607 x 3
##    id      author word   
##    <chr>   <chr>  <chr>  
##  1 id27763 MWS    how    
##  2 id27763 MWS    lovely 
##  3 id27763 MWS    is     
##  4 id27763 MWS    spring 
##  5 id27763 MWS    as     
##  6 id27763 MWS    we     
##  7 id27763 MWS    looked 
##  8 id27763 MWS    from   
##  9 id27763 MWS    windsor
## 10 id27763 MWS    terrace
## # … with 366,597 more rows

We have generated a tibble consisting of \(366'607\) rows. This tibble is the corpus itself having been rendered into tidy format. Each of its rows stands for a word, not for a whole document anymore.

The ability to represent the corpus in tidy format enables us to break up otherwise complex tasks into a tractable series of steps that can be carried out through the use of simple dplyr functions. For instance, here’s an idea on how to obtain the \(d\) most frequent words:

obtain the absolute frequencies of the words across all documents in the corpus
then sort the words in descending order according to their frequencies —so that the most frequent ones are going to be on top—
then select only those rows that contain distinct words —so as to not choose the same word more than once in the next step—
finally, select the first \(d\) rows of this tibble

I cannot stress this enough: all of these steps are \(\boldsymbol{100\%}\) doable with nothing more than core dplyr functions. Neither rocket science, black boxes, nor imported extraterrestrial technology here. It’s just fair play done ingeniously. And, in fact, this is exactly what we are going to do right at the beginning of Section 3. You can go there, briefly check how we create a data frame called most_frequent, and then come back here.

2 Concepts, terminology, and setting

Patrick Juola —a renowned world-class stylometrician— is the author of a lengthy article titled Authorship Attribution. It is a review of the history and state of the art —up to the year \(2008\), when it was published— of the discipline of stylometry.

Juola defines authorship attribution as any attempt to infer the characteristics of the creator of a piece of linguistic data [1]. The word stylometry is often used as a synonym for authorship attribution despite some researchers’ suggestion to reserve that word to denote the efforts made in a broader set of inquiries.

Juola identifies three main types of problems that are of concern to this discipline:

given a particular sample of text known to be by one of a set of authors, determine which one
given a particular sample of text believed to be by one of a set of authors, determine which one —if any—
(the aforementioned broader set of inquiries) determine not (only) the identity label of the author but rather aspects of their profile such as their gender, the age they had when they wrote the document (or rather at which point in their career did they write it), whether the language in which the document was written is their native language, whether they suffer from a mental illness, etc.

What gathers us today is the easiest among these types of problems —the first one—. Our set of candidate authors comprises only Edgar Allan Poe, Howard Phillips Lovecraft and Mary Wollstonecraft Shelley —neé Godwin—. Pick a document at random from either of our data sets —be it the corpus or the test data set—. We know for sure that one of our candidate writers is the creator of that document. The problem stems from having to point the finger at the one who did actually write it.

That would be all in regards to terminology and setting. Now, what pertains to the conceptual aspect is the definition of word vector. While doing the required bibliographic survey for the elaboration of this notebook I came across that term a few times, but unfortunately not once was it explicitly defined. What ensues is my best attempt at defining and then explaining it. Admittedly, I may have incurred in the use of an unorthodox lexicon, but I believe I have managed to make myself be understood.

The word vector is a tuple associated to a document or an author. Each of its entries is some sort of measurement —tipically the relative frequency— of a word in the given document (or in the average document of the given author). It is up to us to decide the dimensionality of the word vectors —that is, the hyperparameter \(d\); the amount of entries they will have— and the meaning of each one of their entries —say, the first entry could be associated to the word ‘birthday’, and therefore the measurement featured in the first entry of each word vector would come to mean the relative frequency of the word ‘birthday’ in the corresponding document that the word vector is associated with—.

Hopefully the following example will make this clear. Let us choose as our words of interest the words ‘the’, ‘and’, ‘of’, ‘him’, in that order —mind you: making all word vectors stick to the same ordering arrangement of the words of interest is important—. Let us contemplate three documents —id22354, id16607 and id19936—. There’s nothing special about them. They’re just three documents —all of them belonging to the corpus— that we’re focusing on for illustrative purposes.

Given our set of words of interest, the word vectors of the three documents we are contemplating are these²:

## $id16607
## # A tibble: 4 x 2
##   word   freq
##   <chr> <dbl>
## 1 the     0.1
## 2 and     0.1
## 3 of      0  
## 4 him     0  
## 
## $id19936
## # A tibble: 4 x 2
##   word    freq
##   <chr>  <dbl>
## 1 the   0.0625
## 2 and   0.0625
## 3 of    0.0625
## 4 him   0     
## 
## $id22354
## # A tibble: 4 x 2
##   word    freq
##   <chr>  <dbl>
## 1 the   0.0294
## 2 and   0.0294
## 3 of    0.0588
## 4 him   0

Regarding document id16607, what this tells us is that the relative frequency of the word ‘the’ in that document is \(0.1\). The relative frequency of the word ‘and’ in that same document is \(0.1\). The words ‘of’ and ‘him’ exhibit a relative frequency of \(0\) in that document —implying that both of these words are absent from document id16607—.

I don’t fancy the idea of just giving out numbers without providing the source they were calculated from, so here is the actual content of document id16607:

filter(corpus, id == "id16607")$text

## [1] "Here we barricaded ourselves, and, for the present were secure."

This document consists of \(10\) words. The word ‘the’ appears once. That is why its relative frequency in this document turns out to be \(0.1\). The same applies to the word ‘and’. On the other hand, the words ‘of’ and ‘him’ do not appear in this document —hence, their relative frequencies in this document certainly are \(0\)—.

Quick question: what is the relative frequency of the word ‘of’ in document id22354? We just read the value displayed on the entry corresponding to that word on the word vector associated to that document. That value is \(\boldsymbol{0.0588}\).

# Let's see document id22354 for ourselves. 
filter(corpus, id == "id22354")$text

## [1] "Should I yield to your entreaties and, I may add, to the pleadings of my own bosom would I not be entitled to demand of you a very a very little boon in return?\""

Document id22354 consists —as can be seen above— of \(34\) words, and the word ‘of’ appears twice in it, confirming that the relative frequency of that word in that document indeed is \(\boldsymbol{0.0588}\).

A vital, already mentioned but not yet sufficiently emphasized aspect of the concept of the word vector is that setting the entries of all word vectors to follow a given ordering arrangement is important.

Earlier, I have referred to our words of interest as components of a set. From a strict perspective, though, the words of interest conform an ordered tuple. That is, changing the order of the elements of the tuple of words of interest signifies contemplating a different words of interest tuple altogether. Choosing one particular ordered tuple among all the possible permutations of the \(d\) —with \(d = 4\) in this example— words of interest is, borrowing terminology from linear algebra, akin to choosing a particular ordered basis of the \(d\)-dimensional space where the word vectors dwell. Once an ordered basis is decided upon, the coordinates of all the word vectors must be expressed with respect to that ordered basis.

The ordered tuple of words of interest that was considered for this example is \((the, and, of, him)\). Consequently, within each word vector the entries are arranged so that the relative frequency displayed on the first entry is that of the word ‘the’, the relative frequency displayed on the second one is that of the word ‘and’, the relative frequency displayed on the third one is that of the word ‘of’, and the relative frequency displayed on the fourth entry is the one corresponding to the word ‘him’.

The choice of the ordered tuple \((the, and, of, him)\) was arbitrary in the sense that it did not obey any sorting rule. The most natural rule for the choice of the words of interest ordered tuple is to stick to alphabetical order. Therefore, hereafter, given a set of \(d\) words of interest, the first coordinate of the word vectors will always represent the relative frequency of the word of interest that comes first in alphabetical order. The second coordinate will always represent the relative frequency of the word of interest that comes second in alphabetical order, and so on.

3 Two-way tables and matrix operations

Continuing from where Section 1 ended, we are now going to obtain the \(500\) most frequent words across the whole corpus.

most_frequent = tidytext::unnest_tokens(corpus, word, text) %>% add_count(word, sort = T) %>%
                distinct(word) %>% slice(1:500)

These are going to be our words of interest throughout the remainder of the notebook³.

3.1 Creating two-way tables

Recall the definition of word vector that was given in Section 2 —if you came from the link left at the end of Section 1 and skipped Section 2 altogether, now would be a good idea to return—. A word vector can either be associated to a document or to an author. In case we are dealing with the word vector associated to a given author, then what its components are going to display are the relative frequencies of the words of interest in that author’s typical document.

The typical document is, of course, an abstraction. We don’t know —nor do we care about— what its actual content is. Regarding an author’s typical —or average— document, what we do care about is its characterization in terms of the word vector that would presumably be associated to it. The \(\boldsymbol{j}\)-th component of the word vector of an author’s typical document is the mean of the relative frequency of the \(\boldsymbol{j}\)-th word of interest, calculated across all the documents belonging to the author’s subcorpus —excluding those documents that do not contain any of our words of interest—.

A prerequisite for obtaining the word vectors associated to each of our three authors is, in light of the ongoing discussion, to put the corpus in tidy format —i.e. tokenize it— and then split it by author.

subcorpora = corpus %>% group_by(author)

subcorpora = subcorpora %>% group_map(~ tidytext::unnest_tokens(.x, word, text), .keep = T) %>%
             set_names(subcorpora %>% group_keys() %>% unlist())

This new object we just created —subcorpora— is a list. It contains three separate elements, each of them being the tokenized subcorpus belonging to one distinct author.

Given an author’s subcorpus and given our alphabetically-ordered tuple of words of interest, we can —again, disregarding the documents that do not contain any of these words— create a two-way table where the rows represent our words of interest and the columns represent the documents belonging to that subcorpus. That table’s entry on row \(j\), column \(i\) will be the relative frequency of the \(j\)-th word of interest in the subcorpus’s \(i\)-th document. The purpose of the following function is the creation of such a table.

# Argument x is the tokenized corpus (or subcorpus). Argument y is a tibble which displays the 
# words of interest on one of its columns (a separate word per each row).  
two_way = function(x, y) {
  documents = (x %>% filter(word %in% y[["word"]]))$id %>% unique()

  lenghts = (x %>% group_by(id) %>% summarize(len = n()) %>% 
            ungroup() %>% filter(id %in% documents))$len  
  
  x %>% filter(word %in% y[["word"]]) %>%
  with(table(word, id)) %>% as.data.frame.matrix() %>% sweep(2, lenghts, "/") %>% 
  rownames_to_column("word")
}

We will apply this function to all three elements of the subcorpora list, one at a time.

sample_EAP = two_way(subcorpora[["EAP"]], most_frequent)
sample_HPL = two_way(subcorpora[["HPL"]], most_frequent)
sample_MWS = two_way(subcorpora[["MWS"]], most_frequent)

Let me illustrate what are the three objects we just created in the chunk above. As an example, we will focus on sample_EAP.

# Just so it gets printed nicely, I'm momentaneously turning it into a tibble.
sample_EAP %>% as_tibble() %>% print(n = 15, n_extra = 0)

## # A tibble: 496 x 5,522
##    word  id00003 id00006 id00007 id00012 id00021 id00027 id00030 id00032 id00034
##    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 a      0            0       0       0  0            0       0       0  0.0323
##  2 about  0            0       0       0  0            0       0       0  0.0323
##  3 above  0.0556       0       0       0  0            0       0       0  0     
##  4 acco…  0            0       0       0  0            0       0       0  0     
##  5 after  0            0       0       0  0            0       0       0  0     
##  6 again  0            0       0       0  0            0       0       0  0     
##  7 agai…  0            0       0       0  0            0       0       0  0     
##  8 age    0            0       0       0  0            0       0       0  0     
##  9 air    0            0       0       0  0            0       0       0  0     
## 10 all    0.0556       0       0       0  0.0312       0       0       0  0.0645
## 11 almo…  0            0       0       0  0            0       0       0  0     
## 12 alone  0            0       0       0  0            0       0       0  0     
## 13 along  0            0       0       0  0            0       0       0  0     
## 14 alre…  0            0       0       0  0            0       0       0  0     
## 15 also   0            0       0       0  0            0       0       0  0     
## # … with 481 more rows, and 5,512 more variables

Just like I had told you, this new object is a two-way table where each row represents a word and each column —except for the first one, which displays the words of interest themselves— represents a document. The entry where a certain row and a certain column coincide displays the relative frequency of the corresponding word in the corresponding document. A more useful way of thinking about this object is as the word vectors of Poe’s documents, all of them bound together column-wise into a single data frame. Yet another —statistics-savvy— equivalent way to think about this object is as a sample where the relative frequencies of the words of interest are random variables and each document constitutes an observation belonging to the sample. Hence the name —sample_EAP—.

Ok. So, the word vector is a multivariate random variable —or random vector—. We have a sample of this random vector. And whenever we have a sample at hand we should be able to compute statistics. In particular, we must compute the sample mean. Doing so would be the same as obtaining the author’s word vector. Let’s get on to it, then.

3.2 Calculating the sample mean and the vector of sample standard deviations through matrix multiplication

Let us denote by \(X_i\) the \(i\)-th observation of the word vector sample. The vectors \(X_1, X_2, \ldots, X_n\) are going to be bound together column-wise into a single (mathematical)⁴ matrix which we will denote by \(\mathbb{X}\).

\[\begin{equation} \mathbb{X} = \begin{bmatrix} X_1^{(1)} & X_2^{(1)} & \ldots & X_n^{(1)} \\ X_1^{(2)} & X_2^{(2)} & \ldots & X_n^{(2)} \\ \vdots & \vdots & & \vdots \\ X_1^{(d)} & X_2^{(d)} & \ldots & X_n^{(d)} \end{bmatrix} \end{equation}\]

Here, the superindex \((j)\) is used to denote the \(j\)-th coordinate.

Consider \(\normalsize \mathbb{1} \small = \begin{pmatrix}1& \ldots & 1\end{pmatrix}_{1 \times n }^T\) —the \(n \times 1\) matrix whose elements are the number \(1\) repeated \(n\) times—.

It is immediate that the following equation holds:

\[\begin{equation} \frac{1}{n}\mathbb{X{\normalsize 1}} = \begin{pmatrix} \large \frac{\sum_{i = 1}^nX_i^{(1)}}{n} \\ \large \frac{\sum_{i = 1}^nX_i^{(2)}}{n} \\ \large \vdots \\ \large \frac{\sum_{i = 1}^nX_i^{(d)}}{n} \end{pmatrix} \tag{3.1} \end{equation}\]

With that we have the affair of obtaining the word vector’s sample mean already covered.

In order to implement the Delta methods for authorship attribution we will also need to calculate the vector of sample standard deviations.

Let us denote \(\large \frac{\sum_{i = 1}^nX_i^{(j)}}{n}\) —i.e. the sample mean of the relative frequency of the \(j\)-th word of interest— by \(\overline{X^{(j)}}\). Let \(\text{I}\) denote —as is customary— the \(n \times n\) identity matrix.

Admittedly, the following equation might not be that evident at first glance, but I assure you it does hold⁵:

\[\begin{equation} \mathbb{X}\bigg({\normalsize \text{I}} - \frac{1}{n} {\normalsize \mathbb{11}}^T\bigg) = \begin{bmatrix} X_1^{(1)} - \overline{X^{(1)}} & X_2^{(1)} - \overline{X^{(1)}} & \ldots & X_n^{(1)} - \overline{X^{(1)}} \\ X_1^{(2)} - \overline{X^{(2)}} & X_2^{(2)} - \overline{X^{(2)}} & \ldots & X_n^{(2)} - \overline{X^{(2)}} \\ \vdots & \vdots & & \vdots \\ X_1^{(d)} - \overline{X^{(d)}} & X_2^{(d)} - \overline{X^{(d)}} & \ldots & X_n^{(d)} - \overline{X^{(d)}} \end{bmatrix} \end{equation}\]

So as to save ourselves the hassle of writing it over and over, let us denote the matrix \(\hspace{0.5mm} \Big(\text{I} - {\normalsize \frac{1}{n} \mathbb{11}}^T \Big) \hspace{0.5mm}\) by \(\text{H}\).

Up next, we have to raise all elements of our matrix \(\mathbb{X}\text{H}\) to the power of \(2\). In R this is trivial, as raising an entire matrix to the power of \(2\) is equivalent to raising each of its elements to the power of \(2\). However, keep in mind that this is not true in Mathematics. The certainly analogous but also —unlike the former— mathematically sound procedure would be to compute the Hadamard product —entry-wise multiplication, denoted by the symbol \({\large \circ}\)— of \(\mathbb{X}\text{H}\) with itself.

In reconciling the language of Mathematics with the R language let us define \(\hspace{0.5mm} (\mathbb{X}\text{H})^2 \hspace{1mm} {\normalsize \triangleq} \hspace{1.4mm} (\mathbb{X}\text{H}) \hspace{0.5mm} {\large \circ} \hspace{0.5mm} (\mathbb{X}\text{H})\). Thereupon, the following equality is immediate:

\[\begin{equation} \frac{1}{n-1}(\mathbb{X}\text{H})^2{\normalsize \mathbb{1}} = \begin{bmatrix} \large \frac{\sum_{i = 1}^n\big(X_i^{(1)} - \overline{X^{(1)}}\big)^2}{n-1} \\ \large \frac{\sum_{i = 1}^n\big(X_i^{(2)} - \overline{X^{(2)}}\big)^2}{n-1} \\ \large \vdots \\ \large \frac{\sum_{i = 1}^n\big(X_i^{(d)} - \overline{X^{(d)}}\big)^2}{n-1} \end{bmatrix} \tag{3.2} \end{equation}\]

The \(d \times 1\) matrix on the right-hand side of the equation above is composed of sample variances, not of sample standard deviations. This puts us only one step behind from calculating the standard deviations: we need to take entry-wise square roots. To my knowledge, in Mathematics there is no agreed-upon notation for representing that operation —nor is it formally defined, either—. R does perform entry-wise calculation of square roots, but since the language of Mathematics doesn’t support that operation we are not able to represent it in an equation. We will carry that operation out anyway, it’s just that we can’t write it down.

The following function incorporates Equations (3.1) and (3.2) in the computation of the sample mean and the vector of sample standard deviations.

statistics = function(x, stat, name = NULL) {
# The number of observations in the sample (n) is equal to the number of columns minus one 
# because the first column displays the words of interest, and therefore does not count.  
  n = ncol(x) - 1
  ones = rep(1, n)
  dim(ones) = c(n, 1)
  
  if (stat == "mean") {
    output = (x %>% select(-word) %>% data.matrix() %*% ones)/n
  }
  else if (stat == "st_dev") {
    H = diag(n) - (1/n)*(ones %*% t(ones))
    output = (((x %>% select(-word) %>% data.matrix() %*% H)^2 %*% ones)/(n-1)) %>% sqrt()
  }
  else {
    stop('Only two stats are supported: mean (string "mean") and standard deviation (string "st_dev")')
  }
  
  output = output %>% data.frame() %>% bind_cols(x["word"], .)
    
  if (is.character(name) & length(name) == 1) {
    colnames(output)[2] = name
  }  
    
  return(output) 
}

3.3 Application: each author’s typical document

Our brand-new statistics function, in conjunction with the two-way tables we obtained thanks to our two_way function, enables us to calculate the vector of average relative frequencies of the words of interest for each author.

averages_EAP = statistics(sample_EAP, stat = "mean", name = "EAP")
averages_HPL = statistics(sample_HPL, stat = "mean", name = "HPL")
averages_MWS = statistics(sample_MWS, stat = "mean", name = "MWS")

Here are the relative frequencies we would expect the words of interest to have in a typical document of Poe’s authorship:

averages_EAP %>% as_tibble() %>% print(n = 10)

## # A tibble: 496 x 2
##    word         EAP
##    <chr>      <dbl>
##  1 a       0.0224  
##  2 about   0.00191 
##  3 above   0.000380
##  4 account 0.000271
##  5 after   0.00115 
##  6 again   0.00102 
##  7 against 0.000292
##  8 age     0.000185
##  9 air     0.000479
## 10 all     0.00466 
## # … with 486 more rows

It would be very convenient to have all three word vectors —associated to the respective typical documents— bound together into a single table.

# The tidyverse comprises the purrr package, which is where we borrow the reduce function from. 
# Therefore, we are not breaking the (self-imposed) rule of limiting ourselves to the tools 
# provided by the tidyverse. 
main = list(averages_EAP, averages_HPL, averages_MWS) %>% reduce(full_join, by = "word") 

# Some of the 500 most frequent words are used either only by one or only by two out of the 
# three authors. Therefore, missing values are going to show up. We have to turn them into 
# zeros.
main[is.na(main)] = 0

# Lastly, as was discussed in Section 2, we have to arrange the rows in alphabetical order.
main = main %>% arrange(word)

By observing that we have assigned the name main to this data frame you could expect it to have a central role in the calculations that we will eventually have to do —and you would be absolutely right—. In order to make yourself an idea of how this data frame looks, here are its first six rows:

head(main)

##      word          EAP          HPL          MWS
## 1       a 0.0223503998 0.0205136970 0.0160924976
## 2   about 0.0019122759 0.0016246241 0.0007646336
## 3   above 0.0003804170 0.0004818120 0.0002463547
## 4 account 0.0002709037 0.0001754568 0.0003061284
## 5  adrian 0.0000000000 0.0000000000 0.0011444948
## 6   after 0.0011487072 0.0016391758 0.0011558627

Having already obtained this data frame, we have no further use for the two-way tables sample_EAP, sample_HPL, and sample_MWS. The three of them account for \(56.16\) megabytes in total, so let’s just eliminate them and free some space up.

rm(sample_EAP, sample_HPL, sample_MWS)

Now that we’re at it, we will not have much use for the subcorpora list, either, and it takes up \(11.7\) megabytes, so let’s remove it as well.

rm(subcorpora)

4 Auxiliary functions

Remember that warning I made in the introduction, concerning the necessity of engaging in foresighting and backtracking? What I had in mind when I wrote that was essentially this section. We haven’t even discussed what exactly are Burrows’s Delta and its variants, but right now we will assume ourselves on the situation that we have already computed them. What comes after that?

Given a certain methodology —either Burrows’s, Argamon’s or Aldridge & Smith’s— and a single document, three Delta measurements are going to be calculated —one per each author—. Upon observing these measurements we are going to decide who is the author of the document.

The underlying idea is that the word vector associated to a document is that document’s representation within the \(d\)-dimensional word vector space. Both Burrows’s and Argamon’s Deltas are distance measures. Thus, when doing either of these methodologies, a larger Delta value signifies a larger distance between the document and the author’s typical document. Conversely, a smaller Delta value signifies closeness between the document and the author’s typical document. Therefore, we are going to attribute the authorship of the document to the writer who minimizes the value of Delta.

Here’s something to watch out for: what happens in the case of an outlier document in which neither of the \(\boldsymbol{500}\) most frequent words appear? There are instances of that happening. Check the following, for example:

filter(test, id == "id23846")

## # A tibble: 1 x 3
##   id      text                     author
##   <chr>   <chr>                    <chr> 
## 1 id23846 L'histoire en est brève. EAP

Since the three writers under study wrote predominantly in English, the most frequent words across the entire corpus must obviously be English words. Therefore, a document written entirely in another language —just like the one above, written in French— is sure to constitute an outlier.

Another kind of outliers are straight up gibberish, just like this one:

filter(test, id == "id02209")

## # A tibble: 1 x 3
##   id      text                                       author
##   <chr>   <chr>                                      <chr> 
## 1 id02209 "\"Eh ya ya ya yahaah e'yayayayaaaa . . ." HPL

What all outliers have in common is that their coordinates are going to lie in the exact same spot: the origin of the word vector space. Hence, all outliers —irrespective of what makes them outliers (whether they are written in another language or whether they are jibber-jabber)— are going to be attributed to the author whose typical document’s representation on the word vector space is the least distant from the origin. This is unfortunate for a number of reasons. I will provide two. First: all three of the authors under study —not just that one whose typical document’s coordinates are the closest to the origin— are as capable as any other person of writing gibberish. Secondly: the language in which a document is written does give away clues on who the writer is likely to be. By inspecting the corpus a little, I have realized that it isn’t that unusual to come across bits written in French by Poe. I wouldn’t feel confident in asserting that Shelley and Lovecraft do not ever include bits of French here and there, but at the very least it is true that they are not as eager as Poe to write in French. Certainly, the word vector representation doesn’t allow us to discern between different types of outliers, and that is hands-down an inherent weakness of the methodologies exposed in this notebook. It is because of these reasons that I judge reasonable to leave outlier documents unattributed.

Another eventuality in which I would rather leave the corresponding document unattributed is when all three authors are tied as the minimizer of Delta. In such a case, the attribution decision we would arrive at through the Delta methodologies is as good as making a pick completely at random. Had we settled from the start for making uninformed, criterion-less picks this notebook wouldn’t have been elaborated at all. Better, then, for our algorithm to just return NAs in these cases.

And what about those other cases where two out of the three authors are tied? Well, it is better to pick one author at random from the two most likely suspects than from the pool of all three authors. Therefore, in those instances we are going to allow our attribution algorithm to choose either of the two most likely authors.

Aldridge & Smith’s version of Delta is not a distance measure. Rather, it is a measure of similarity —cosine similarity, to be precise—. In practical terms, what this translates into is that when Aldridge & Smith’s methodology is the one being used the attribution criterion is reversed: no longer is the author who minimizes the value of Delta the author the text should be attributed to. On the contrary, the chosen author will be that who maximizes Delta. Aside from that, our criteria for when to leave a document unattributed stay the same —when the document is an outlier or when there is a tie between all three authors as the (in this case) maximizer of Delta—.

The first of our auxiliary functions is one that takes as input a row of a not-yet tokenized tibble —namely, our test tibble— and returns the coordinates that represent it in the word vector space.

word_vector = function(x) {
  dimensions = main %>% select(word)

  output = x %>% tidytext::unnest_tokens(word, text) %>%
           add_count(word, name = "freq") %>% distinct(word, .keep_all = T) %>%
           mutate(freq = freq/sum(freq)) %>% 
           right_join(dimensions, by = "word") %>% 
           arrange(word) %>%
           select(freq)
  
  output[is.na(output)] = 0
  return(output)         
}

The second is one that implements all the ideas we discussed in this section —when to look for the author that minimizes Delta, when to look for the one that maximizes it, when to leave a document unattributed—.

algorithm = function(x, FUN) {
  coordinates = word_vector(x) 

  if (any(coordinates != 0)) {
    
      measures = vector(length = 3)
      names(measures) = c("EAP", "HPL", "MWS")
    
      for (author in names(measures)) {
      measures[author] = FUN(x, author)
      }
    
      if (!identical(FUN, Delta_Cos)) {
          least_distant = measures[measures == min(measures)]
          
          if (length(least_distant) == 1) {
            return(least_distant %>% names())
          }
          else if (length(least_distant) == 2) {
            paste("Draw among two authors encountered in document", x$id) %>% print()
            return(least_distant %>% names() %>% sample(1))
          }
          else {
            paste("Failed to attribute document", x$id, "due to draw among three authors.") %>% 
            print()
            return(NA)
          }
      }
      else {
          most_similar = measures[measures == max(measures)]
          
          if (length(most_similar) == 1) {
            return(most_similar %>% names())
          }
          else if (length(most_similar) == 2) {
            paste("Draw among two authors encountered in document", x$id) %>% print()
            return(most_similar %>% names() %>% sample(1))
          }
          else {
            paste("Failed to attribute document", x$id, "due to draw among three authors.") %>% 
            print()
            return(NA)
          }
      }
  }
  else {
      paste("Failed to attribute document", x$id, "due to it being an outlier.") %>% print()
      return(NA)
  }
}

The function above implements our algorithm for a single row. This other function builds upon the previous one and essentially vectorizes it —i.e. applies it to all the rows of the test tibble—. This output is then appended to the test tibble as a new column.

prediction = function(FUN = Delta_B) {
  output = vector(length = nrow(test))
  
  if(identical(FUN, Delta_B) & !("Burrows" %in% colnames(test))) {
      output = output %>% data.frame("Burrows" = .)
  }
  else if(identical(FUN, Delta_Q) & !("Argamon" %in% colnames(test))) {
      output = output %>% data.frame("Argamon" = .)
  }
  else if(identical(FUN, Delta_Cos) & !("Cosine" %in% colnames(test))) {
      output = output %>% data.frame("Cosine" = .)
  }
  else{
      stop("Invalid argument.")
  }
  
  for(i in 1:nrow(output)) {
      output[i,] = slice(test, i) %>% algorithm(FUN) 
  }

  test <<- bind_cols(test, output)
}

Also, when calculating Aldridge & Smith’s Delta we are going to need to perform inner products —also known as dot products— and norms. The following functions implement these operations.

dot = function(x, y) {
  sum(x*y)
}

norm = function(x) {
  sum(x^2) %>% sqrt()
}

5 The Delta methods

Our custom two_way function is akin to the pumping heart of all this procedure. The blood being pumped by it is its output: a sample of each author’s word vector. In the preparation of this output also intervenes tidytext’s unnest_tokens function —the lungs—, providing rich oxygen —the subcorpora list and the most_frequent tibble; arguments we feed to the two_way function—. This blood is then depurated by our very own statistics function —the kidneys of the system—, which crafts each author’s typical word vector and renders the samples —which by this point have become residual waste— ready for disposal. The bone and brawn are the auxiliary functions that were declared in the previous section. And now behold the skin that sits on top of it all: the Deltas (please, just suspend disbelief and pretend that our system can endure the lack of digestive organs and all other sorts of bodily machinery).

What I want to convey with this analogy is that, just like the skin conceals beneath it an intricate apparatus, the functions in this section are backed by concealed gears that we had to design thoroughly. This section deals with what the main concern of this notebook is —the computation of the Delta measurements for authorship attribution—, and it turns out that once we have secured all the underlying machinery this task becomes quite simple.

Perhaps this analogy has given you a broader sense of all the previous steps and now you may want to go back and take a stroll down the —hopefully— now clearer path that was traced. If that’s the case then here’s your ticket back to the first stop in our itinerary.

5.1 Burrows’s Delta

Australian Emeritus Professor John Frederick Burrows became in \(2001\) the second recipient of the Roberto Busa Award —an award issued by the Alliance of Digital Humanities Organisations in recognition of outstanding lifetime achievements in the application of information and communications technologies to humanistic research—. As part of the award ceremony, it is customary for the laureate to offer a lecture to the attendees. It was in Burrows’s Busa Award lecture that he first proposed his Delta measure.

As per Patrick Juola: performance of Burrows’s Delta has generally been considered to be very good among attribution specialists, and it has in many cases come to represent the baseline against which new methods are compared [1]. To be fair, Juola wrote this around \(2008\). I bear no doubt that nowadays in the niche of competitive authorship attribution it is not that big of a deal to outperform Burrows’s Delta —moreover, as early as \(2004\) there already existed more accurate techniques—, but it nonetheless deserves a special pedestal because no previous technique that I’m aware of exhibits consistently an accuracy better than \(50\%\). Attribution based on Burrows’s Delta is —again, to my knowledge— the first attribution technique ever created that you can actually expect to work any better than “eeny, meeny, miny, moe” —and if for some reason it doesn’t then you can always resort to its buffed variants—.

Without further ado, let \(\text{D}\) be some document. Let \(\mathcal{D}_{\text{Author}}\) be a given author’s typical document. Let \(f_j\) be a (mathematical) function that takes a document as input and returns the relative frequency of the \(j\)-th word of interest in the input document. Let \(\sigma_j\) be the standard deviation of the relative frequency of the \(j\)-th word of interest computed out of a sample consisting of all non-outlier documents in the whole corpus —that is, without segregating by author—.

Burrows’s Delta measure for a given pair \((\text{D}, \mathcal{D}_{\text{Author}})\) is, then, defined as follows:

\[\begin{equation} \large \Delta(\text{D}, \mathcal{D}_{\text{Author}}) = \sum_{j = 1}^d \Bigg \lvert \frac{f_j(D) - f_j(\mathcal{D}_{\text{Author}})}{\sigma_j} \Bigg \rvert \tag{5.1} \end{equation}\]

Before jumping into the implementation aspect, allow me to state an interpretation of what Burrows’s Delta stands for. Following our discussion in Sections 2 through 4, \(d\) can be thought of as the number of dimensions of the word vector space. The output of the \(f_j\) function we defined earlier can be thought of as the \(j\)-th component of the word vector representation . Notice that if \(\boldsymbol{\sigma_j = 1}\) for \(\boldsymbol{j = 1, 2, \ldots, d}\) then Burrows’s Delta becomes the Manhattan distance between the coordinates of \(\boldsymbol{\text{D}}\) and \(\boldsymbol{\mathcal{D}_{\text{Author}}}\). In most cases, though, the standard deviations will not be all equal to \(1\), and therefore Burrows’s Delta will not be exactly equal to the Manhattan distance between these coordinates. What is the rationale, then, for dividing by \(\sigma_j\)? If the standard deviation of a given word of interest is big —that is, if the relative frequency of that word varies a lot across documents— then, even though there might be a considerable difference between the relative frequency that word exhibits in the document \(\boldsymbol{\text{D}}\) and the relative frequency that word is expected to exhibit in the average document \(\boldsymbol{\mathcal{D}_{\text{Author}}}\), that difference will not contribute too much to the magnitude of the Manhattan distance between the corresponding points in the word vector space.

Burrows’s Delta is, in succinct words, a sort of weighted Manhattan distance, and the attribution algorithm based on it is an axis-weighted form of ‘nearest neighbor’ classification [2]. Again, as we said in Section 4, the author that minimizes this weighted Manhattan distance for a given document \(\text{D}\) is the author we are going to attribute \(\text{D}\) to.

Regarding implementation, the first thing in our to do list is the obtention of the standard deviations of the relative frequencies of our \(500\) words of interest. That’s what the stat = "st_dev" argument of our statistics function is good for. We will use it to add to our main data frame the column of standard deviations —which we are going to assign the befitting name of sigma—.

# Execution of this chunk will take some minutes.
main = tidytext::unnest_tokens(corpus, word, text) %>% two_way(most_frequent) %>% 
       statistics(stat = "st_dev", name = "sigma") %>% full_join(main, ., by = "word")

So far so good. Now we just have to declare a function that implements Equation (5.1). This is very straightforward stuff.

Delta_B = function(x, author) {
  ((word_vector(x) - main[author])/main["sigma"]) %>% abs() %>% sum()
}

5.2 Argamon’s Quadratic Delta

Shlomo Argamon —Professor of Computer Science at the Illinois Institute of Technology— is another heavyweight in authorship attribution and stylometry. In fact, to him we owe the current theoretical understanding surrounding Burrows’s Delta. For a handful of years the reason why Burrows’s Delta works was a complete mistery. John Burrows himself did not think of Delta as a weighted Manhattan distance within the space where the word vectors dwell. It was not until late \(2007\), when Argamon released an article titled Interpreting Burrows’s Delta: Geometric and Probabilistic Foundations that not one but two formal interpretations of Burrows’s Delta were offered —though we are focusing only on the geometric one—.

Having realized that Burrows’s Delta is a weighted Manhattan distance, it is only natural to ask oneself why don’t we calculate a weighted euclidean distance instead. That is exactly what Argamon’s Quadratic Delta is about.

\[\begin{equation} \large \Delta_{\text{Q}}(\text{D}, \mathcal{D}_{\text{Author}}) = \sum_{j = 1}^d \Bigg( \frac{f_j(D) - f_j(\mathcal{D}_{\text{Author}})}{\sigma_j} \Bigg)^2 \tag{5.2} \end{equation}\]

I figure some readers may have gotten upset by this point due to the expression on the right-hand side not being surrounded by a radical symbol. Certainly, the definition of the euclidean distance among two points in space implies that we should take the square root of that expression, but the thing is that for a given pair \((\text{D}, \mathcal{D}_{\text{Author}})\) we don’t very much care about the precise number that is returned. What we do care about is the order of that number when compared against the numbers corresponding to the two other candidate authors. The square root is a monotonic transformation —that is, a function that preserves ordering—. Therefore, taking the square root doesn’t contribute to our attribution algorithm, really.

Implementation is easy: the formula is exactly the same as Burrows’s Delta’s except that in lieu of taking the absolute value we are going to elevate to the power of \(2\).

Delta_Q = function(x, author) {
  ((word_vector(x) - main[author])/main["sigma"])^2 %>% sum()
}

5.3 Aldridge and Smith’s Cosine Delta

Aldridge and Smith released in \(2011\) an article titled Improving Authorship Attribution: Optimizing Burrows’s Delta Method. The title is pretty much self-explanatory: what Aldridge and Smith offer are insights on what can be done in order to increase the accuracy of Burrows’s Delta. Namely, in the abstract they promise to demonstrate a dramatic improvement in accuracy by adapting Burrows’s Delta to the cosine similarity measure [3]. At the end, they report a clear improvement in authorship attribution using cosine similarity rather than Manhattan distance, with the difference in accuracy being significant at the \(5\%\) significance level. They claim their findings suggest quite strongly that Burrows’s Delta should be adopted [sic] to use the cosine similarity measure [3]. Unfortunately, though, they do not explicitly show how to do that adaptation. No explicit formula is disclosed.

A formula is exposed in another article —released in \(2017\)—: Understanding and explaining Delta measures for authorship attribution, by Stefan Evert and several other authors —a collaboration among researchers from two German universities—. What ensues is consistent with their treatment of the method —we only diverge a little bit in terms of notation—.

The vector of z-scores of the word vector associated to an arbitrary document \(\text{D}\) will be denoted by \(z(\text{D})\) and is going to be defined as follows:

\[\begin{equation} z\big(\text{D}\big) = \bigg(\frac{f_1(\text{D}) - \mu_1}{\sigma_1}, \frac{f_2(\text{D}) - \mu_2}{\sigma_2}, \ldots, \frac{f_d(\text{D}) - \mu_d}{\sigma_d}\bigg) \end{equation}\]

In the equation above, \(\mu_j\) stands for the mean of the relative frequency of the \(j\)-th word of interest computed —just like \(\sigma_j\)— out of a sample consisting of all non-outlier documents in the whole corpus; without segregating by author.

In a similar manner to how we dealt with the standard deviations, we are now going to calculate these means and attach them to our main data frame as a new column.

# Execution of this chunk will take some minutes as well.
main = tidytext::unnest_tokens(corpus, word, text) %>% two_way(most_frequent) %>% 
       statistics(stat = "mean", name = "mu") %>% full_join(main, ., by = "word")

That issue being already addressed, Argamon & Smith’s Cosine Delta is defined as follows:

\[\begin{equation} \large \Delta_{\angle}(\text{D}, \mathcal{D}_{\text{Author}}) = \frac{\langle z(\text{D}), z(\mathcal{D}_{\text{Author}}) \rangle}{||z(\text{D})||\cdot ||z(\mathcal{D}_{\text{Author}})||} \tag{5.3} \end{equation}\]

The denominator’s role is to obtain unit vectors that preserve the direction in which each of our original vectors —i.e. \(z(\text{D})\) and \(z(\mathcal{D}_{\text{Author}})\)— point. It is pertinent to think of these unit vectors as radii of a \(d\)-dimensional sphere. Then, by taking the inner product of these vectors we are essentially calculating the cosine of the angle that lies between them. The cosine is well known to be a number lying on the interval \([-1, 1]\). It reaches its maximum value if and only if the corresponding angle is \(0\). The bigger the value of the cosine the lesser the measure of the angle between the vectors, and therefore the more similar those vectors are. That is why, as we established in Section 4, we are going to attribute the document \(\text{D}\) to the author that maximizes Aldridge & Smith’s Cosine Delta.

You may recall that at the very end of Section 4 we had declared two functions —dot and norm—. For the purpose of implementing Equation (5.3) these functions fit our hands like a glove.

Delta_Cos = function(x, author) {
  z_document = (word_vector(x) - main["mu"])/main["sigma"] 
  z_author = (main[author] - main["mu"])/main["sigma"]

  dot(z_document, z_author)/(norm(z_document)*norm(z_author))
}

5.4 Prediction and performance

The function that was assigned the name algorithm checks —once it has already checked whether the document being analyzed is an outlier or not— whether the type of Delta we would like to base the authorship attribution on is Aldridge & Smith’s Cosine Delta. It does this in order to determine whether the relevant author is the one that minimizes or the one that maximizes the chosen measurement. As a side effect, however, we can’t run that function —and neither the prediction function, which vectorizes the former— until the Delta_Cos function is declared.

Having declared all three Delta functions, we now proceed to predict the author labels for the test tibble. This is a lengthy process. In case you do execute it, I warn you that it will take around an hour.

# By default, if no argument is passed to it, the prediction function will try to run the 
# attribution algorithm based on Burrows's Delta. I say it will "try" to run it because in
# case the test tibble already has a column named Burrows it will return an error message and 
# interrupt the execution.
prediction()

## [1] "Failed to attribute document id02209 due to it being an outlier."
## [1] "Failed to attribute document id23846 due to it being an outlier."
## [1] "Failed to attribute document id04456 due to it being an outlier."
## [1] "Failed to attribute document id23640 due to it being an outlier."
## [1] "Failed to attribute document id09973 due to it being an outlier."
## [1] "Failed to attribute document id12085 due to it being an outlier."

prediction(Delta_Q)

## [1] "Failed to attribute document id02209 due to it being an outlier."
## [1] "Failed to attribute document id23846 due to it being an outlier."
## [1] "Failed to attribute document id04456 due to it being an outlier."
## [1] "Failed to attribute document id23640 due to it being an outlier."
## [1] "Failed to attribute document id09973 due to it being an outlier."
## [1] "Failed to attribute document id12085 due to it being an outlier."

prediction(Delta_Cos)

## [1] "Failed to attribute document id02209 due to it being an outlier."
## [1] "Failed to attribute document id23846 due to it being an outlier."
## [1] "Failed to attribute document id04456 due to it being an outlier."
## [1] "Failed to attribute document id23640 due to it being an outlier."
## [1] "Failed to attribute document id09973 due to it being an outlier."
## [1] "Failed to attribute document id12085 due to it being an outlier."

With the execution of the three previous lines of code three new columns are added to the test data set. Neither of the methods yield instances of draws among authors. Six documents could not be attributed to either of the candidate authors because they were outliers.

No prediction task should be considered to be over without calculating some measurement of how good the predictions turned out to be. The simplest measure anyone can come up with is the proportion of successful guesses made by the algorithm —disregarding outlier test documents—. The implementation of a function that calculates this isn’t hard, either.

performance = function(method) {
  if (method %in% c("Burrows", "Argamon", "Cosine")) {
    sum(test[method] == test["author"], na.rm = T)/sum(!is.na(test[method]))
  }
  else {
    stop('Legitimate arguments are "Burrows", "Argamon", and "Cosine".')
  }
}

The following is the accuracy attained through Burrows’s Delta.

performance("Burrows")

## [1] 0.5350264

The following is the accuracy attained through Argamon’s Quadratic Delta.

performance("Argamon")

## [1] 0.6838248

Lastly, the following is the accuracy attained through Aldridge & Smith’s Cosine Delta.

performance("Cosine")

## [1] 0.6799046

Though it is marginal, when working with these particular data that we are analyzing there is a gain from using Argamon’s Quadratic Delta rather than Aldridge & Smith’s Cosine Delta. It is with the former that I attain my current best mark of \(\boldsymbol{68.38\%}\) accuracy.

Afterword

In \(2004\) the Association for Literary and Linguistic Computing⁶ and the Association for Computers and the Humanities held the Ad-hoc Authorship Attribution Competition, a contest organized by none other than our recurrent quotee Patrick Juola. Up to at least the year \(2008\) this has been the largest-scale comparative evaluation of authorship attribution technology [1]. It consisted of \(13\) problems, some of them dealing with documents written in current-day English, others dealing with documents written in French, one of them dealing with documents written in \(15\)th century English, another one dealing with documents written in Slavonic-Serbian, and else. I’m not willing to go into detail regarding what each problem was about. You can read about that in Juola’s article. The thing is that these problems were much more difficult than the one we worked on throughout this humble notebook. In fact, they were so difficult that seasoned researchers on authorship attribution techniques opted to refuse to participate in the contest [1]. Those who did participate, though, demonstrated that the authorship attribution technology already available in the year \(2004\) was absolutely capable of attaining very high accuracy in such difficult problems. Namely, the winners of the contest managed to attain \(100\%\) accuracy on problem F —the one that deals with \(15\)th century English; one of the two hardest problems of the competition according to the participants—.

The five algorithms on top in terms of performance on the Ad-hoc Authorship Attribution Competition were

SVM with unstable words
common n-grams
linguistic profiling
linguistic cross-entropy (this technique is an invention of Juola himself)
contextual network graphs

Burrows’s Delta is conspicuous by its absence from this list. Whether its variants —which only came into being some years after the competition— would have managed to come out on top in that contest is something I cannot vouch for, as I haven’t applied them for that competition’s problems. In fact, to my knowledge, Argamon’s and Aldridge & Smith’s versions of Delta have only been compared between them and also against Burrows’s original incarnation of Delta, but never against other algorithms —in all fairness, though, my bibliographic survey wasn’t thorough in the search of such investigations—.

It is only natural for us to record in our to do agenda the pending implementation —and consequent application on Kaggle’s Spooky Author Identification Competition— of algorithms unrelated to Burrows’s Delta, so as to try to settle what technique suits this particular problem best.

Authorship attribution is a broad, far from exhausted field. Many issues inherent to whichever attribution problem we have at hand may factor in the performance that the algorithms end up achieving. Moreover, what works best for a certain type of attribution problems could nonetheless not be suited for other types. This is something Juola himself acknowledges [1]. Aspects such as whether the texts being analyzed are pieces of scientific or fictitious literature, whether all candidate authors write fiction within the same genre —and therefore all documents under scrutiny belong to the same genre—, or even the sheer length of the documents —i.e. how many words or how many sentences each document encompasses— could have an impact on the performance of our algorithms. Researchers have yet to develop recipes for practitioners to know what are the most adequate techniques for each particular situation.

There is work to be done. That is undeniable. Burrows’s Delta and its variants constitute, with the relative ease they exhibit both in concept and in implementation, a stepping stone from which we can continue traversing down the path of data science and authorship attribution. However, that path is long —and it never ceases to elongate, as new methodologies continue to arise—. More intricate and possibly more accurate algorithms lie ahead of us. And that is exciting. And I must say I’m very keen on exploring this path and beating my own mark of \(68.38\%\) accuracy on Kaggle’s Spooky Author Identification Competition.

Bibliography

[1] JUOLA P. Authorship attribution. Foundations and Trends in Information Retrieval 2008; 1: 233–334.

[2] ARGAMON S. Interpreting burrows’s delta: Geometric and probabilistic foundations. Literary and linguistic computing 2007; 23: 131–147.

[3] SMITH P, ALDRIDGE W. Improving authorship attribution: Optimizing burrows’ delta method. Journal of Quantitative Linguistics 2011; 18: 63–88.

[4] BURROWS JF. ’Delta’: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing 2002; 17: 267–287.

[5] EVERT S, others. Understanding and explaining delta measures for authorship attribution. Digital Scholarship in the Humanities 2017; 32: 4–16.

[6] SILGE J, ROBINSON D. Text mining with r: A tidy approach. 1st ed. O’Reilly Media, Inc., 2017.

Also, from behind the curtains, the unnest_tokens function is automatically taking care of turning all letters to lowercase, as well as stripping away all punctuation symbols.↩︎
I refrain from showing the code that generates these word vectors because in this section the focus is on developing a conceptual understanding regarding the word vectors, not on how can we craft them.↩︎
An interesting exercise would be to consider the \(d\) —with \(d \neq 500\)— most frequent words. Perhaps you may find a value of \(d\) that outperforms the accuracy we have attained.↩︎
The objects sample_EAP, sample_HPL, and sample_MWS do satisfy the format of \(\mathbb{X}\), as they are column-wise-bound word vectors. However, they cannot be treated as mathematical matrices, for they are data frames and R doesn’t allow data frames to be multiplied one by another in the same fashion as matrices are. Hence, we will have to turn them into matrices. Keep this in mind when reading the code with which we will implement the statistics function.↩︎
Do not just take my word for it. If you’re not convinced work it out by hand.↩︎
Nowadays it goes by the name of European Association for Digital Humanities.↩︎

Argamon’s Quadratic Delta or: How I just attained an accuracy of 68.4% in an Authorship Attribution task

Fernando Pazos Ruiz

December 10, 2020