NLP Models for Author Style Detection in Historical British Prose

Analysis of authors’ stylistic features and classification of texts by author, based on a corpus of 28 works of British prose from the late 18th and 19th centuries, using the tidymodels framework in R

Author: Karina Chadaeva

Published: 17.06.2025

The full and detailed version with code is available here (in Russian): https://rpubs.com/chadaevakarina/1317153

Introduction

In this project, I implement two models for author attribution, each using a different type of predictor. Both models are built on texts split into chunks of 1000 tokens, which keeps the data structure consistent for training and cross-validation (a chunking sketch follows the list below).

  1. Model based on linguistic features. This model uses aggregated quantitative characteristics of texts, such as average word length, the type-token ratio (TTR) as a measure of lexical diversity, relative frequencies of parts of speech (verbs, nouns, adjectives, etc.), average sentence length, and more.

  2. Model based on frequent n-grams and stopwords. The second model uses the 1000 most frequent bigrams and trigrams in the corpus and the relative frequency of stopwords in each chunk.
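A minimal sketch of this chunking step with tidytext could look like this (the texts data frame, with author, title, and text columns, is a hypothetical stand-in for the corpus):

```r
library(dplyr)
library(tidytext)

# Hypothetical input: `texts` has one row per work (author, title, text)
chunks <- texts %>%
  unnest_tokens(word, text) %>%                      # one token per row
  group_by(author, title) %>%
  mutate(chunk_id = (row_number() - 1) %/% 1000) %>% # number 1000-token chunks
  group_by(author, title, chunk_id) %>%
  summarise(text = paste(word, collapse = " "), .groups = "drop")
```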

The plot shows the distribution of texts across the authors in the corpus: the number of texts per author on the X-axis and the authors’ names on the Y-axis. Trollope, Thackeray, Eliot, and Dickens contribute the most texts, and EBronte the fewest.

Model Based on Linguistic Features

To train the model, we use 18 predictors: quantitative linguistic characteristics including average word length, TTR, relative frequencies of parts of speech, relative frequencies of verb forms (past tense, present tense, infinitive), and comparative and superlative forms of adjectives and adverbs.
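As a rough illustration, two of these predictors can be computed per chunk like this (a sketch; chunks is the hypothetical chunk table from above, and the POS-based ratios would additionally need a tagger such as udpipe):

```r
library(dplyr)
library(tidytext)

features <- chunks %>%
  unnest_tokens(word, text) %>%
  group_by(author, chunk_id) %>%
  summarise(
    avg_word_length = mean(nchar(word)),      # mean characters per token
    ttr             = n_distinct(word) / n(), # type-token ratio
    .groups = "drop"
  )
```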

Correlation Matrix

The correlation matrix reveals several notable relationships between features:

  • past_ratio and present_ratio (–0.78): Authors tend to use either past or present tense, but not both.
  • past_ratio and infinitive_ratio (–0.7): When past tense is used more, infinitives are used less.
  • part_freq and infinitive_ratio (0.65): Particles appear much more often in texts with many infinitives, likely because the infinitive marker to is tagged as a particle (PART).
  • det_freq and pron_freq (–0.78), and det_freq and noun_freq (0.7): Texts with more determiners (like the, a) contain fewer pronouns, and nouns tend to co-occur with determiners.

Positive correlations:

  • avg_word_length and TTR (0.70): The longer the words, the higher the lexical diversity; presumably rarer words tend to be longer.
  • noun_freq and TTR (0.64); noun_freq and avg_word_length (0.66): Authors who use more nouns usually have a richer vocabulary and prefer longer words.
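The matrix itself can be produced in one pipe, for example with the corrr package (a sketch; features is the hypothetical predictor table from above):

```r
library(dplyr)
library(corrr)

features %>%
  select(where(is.numeric)) %>%
  correlate() %>%           # pairwise Pearson correlations
  rearrange() %>%           # group strongly correlated variables together
  rplot(print_cor = TRUE)   # heatmap-style plot with coefficients
```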

Feature Distributions

Lexical and syntactic features:

  • avg_sentence_length - the spike-like peaks probably appear because average sentence length was calculated for the whole author and then assigned to each observation; I think it is better to exclude this predictor.
  • TTR - left-skewed distribution
  • avg_word_length - almost normal distribution

Part-of-speech frequencies:

  • noun_freq, verb_freq, adj_freq, adv_freq, pron_freq, det_freq, part_freq - almost normal distribution
  • num_freq, punct_freq - strong left skew

Grammatical features:

  • comparative_ratio, superlative_ratio - left-skewed
  • infinitive_ratio, present_ratio, past_ratio - rather wide distributions; past_ratio is dominant

PCA for Exploratory Analysis

We can see that the points overlap heavily: authors do not form clear clusters when projected onto the first two principal components. Since PCA is a linear method, this may indicate that the class structure is not linearly separable.
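The projection can be built with a small recipes pipeline (a sketch, assuming features holds the author label, a chunk_id column, and the numeric predictors):

```r
library(recipes)
library(ggplot2)

pca_rec <- recipe(author ~ ., data = features) %>%
  update_role(chunk_id, new_role = "id") %>%       # keep the id out of the predictors
  step_normalize(all_numeric_predictors()) %>%     # PCA needs comparable scales
  step_pca(all_numeric_predictors(), num_comp = 2)

pca_scores <- prep(pca_rec) %>% bake(new_data = NULL)

ggplot(pca_scores, aes(PC1, PC2, colour = author)) +
  geom_point(alpha = 0.6)
```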

UMAP for Exploratory Analysis

UMAP is a non-linear dimensionality reduction method that tries to preserve the local structure of the data when projecting it into two-dimensional space. Here it gives a better result than PCA, though still far from perfect: some clusters begin to emerge, but there is still a lot of overlap.
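The UMAP projection is built the same way, swapping in step_umap() from the embed package (again a sketch under the same assumptions):

```r
library(recipes)
library(embed)    # provides step_umap()
library(ggplot2)

umap_rec <- recipe(author ~ ., data = features) %>%
  update_role(chunk_id, new_role = "id") %>%
  step_normalize(all_numeric_predictors()) %>%
  step_umap(all_numeric_predictors(), num_comp = 2)

umap_scores <- prep(umap_rec) %>% bake(new_data = NULL)

ggplot(umap_scores, aes(UMAP1, UMAP2, colour = author)) +
  geom_point(alpha = 0.6)
```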

Model Building

Since we have only 18 features, and they are dense and numeric rather than sparse, we will use the following models: Support Vector Machine (SVM), Single-layer Neural Network (MLP), Bagging with Decision Trees, Logistic Regression, Extreme Gradient Boosting (XGBoost), and Random Forest.
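A condensed sketch of how these specifications could be gathered into a workflow_set (the recipe objects base_rec, pca_rec, and umap_rec are assumed to exist; parameters marked tune() are tuned during cross-validation):

```r
library(parsnip)
library(baguette)      # bag_tree()
library(workflowsets)

model_specs <- list(
  svm = svm_rbf(cost = tune()) %>%
    set_engine("kernlab") %>% set_mode("classification"),
  mlp = mlp(hidden_units = tune(), penalty = tune()) %>%
    set_engine("nnet") %>% set_mode("classification"),
  bag = bag_tree(cost_complexity = tune()) %>%
    set_engine("rpart") %>% set_mode("classification"),
  log = multinom_reg(penalty = tune()) %>% set_engine("glmnet"),
  xgb = boost_tree(trees = tune(), learn_rate = tune()) %>%
    set_engine("xgboost") %>% set_mode("classification"),
  rf  = rand_forest(mtry = tune(), trees = 500) %>%
    set_engine("ranger") %>% set_mode("classification")
)

# Cross every preprocessing recipe with every model specification
wf_set <- workflow_set(
  preproc = list(basic = base_rec, pca = pca_rec, umap = umap_rec),
  models  = model_specs
)
```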

Model Evaluation and Selection

The plot shows that the best results come from models trained on the basic recipe, without dimensionality reduction (PCA or UMAP). The top models by accuracy were Logistic Regression, XGBoost, and MLP.

We will focus on Logistic Regression. The model’s metric values:

  • f_meas - 0.714
  • accuracy - 0.717
  • roc_auc - 0.959

Let’s build the confusion matrix. The heatmap shows a clear diagonal, but the model still made quite a few mistakes.

All ROC curves lie well above the diagonal of random guessing, which indicates good classification quality. The curves overlap tightly, with no clearly standing-out lines: no single class dominates in performance, and the differences between authors are fairly even.
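Both diagnostics come straight from yardstick (a sketch; preds stands for the out-of-fold predictions with the true author, the predicted class, and per-class probability columns whose .pred_* names are assumed here):

```r
library(dplyr)
library(yardstick)
library(ggplot2)

# Confusion matrix as a heatmap
conf_mat(preds, truth = author, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

# One-vs-rest ROC curves from the per-class probability columns (names assumed)
preds %>%
  roc_curve(truth = author, .pred_ABronte:.pred_Trollope) %>%
  autoplot()
```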

Interpretation of Results

Most Important Features

On the plot, we can see the top 10 most important features for each author in the logistic regression model. Each panel corresponds to one author and shows which features (linguistic characteristics) had the biggest influence on the probability that a text belongs to that author (in a one-vs-rest scheme); a sketch of how these coefficients can be extracted follows the list below.

  • pron_freq (pronoun frequency) and cconj_freq (coordinating conjunctions) appear among the top features for almost all authors.
  • avg_word_length and TTR (lexical diversity) are also frequent - they reflect the overall complexity of the author’s vocabulary.
  • For Thackeray, Dickens, and Trollope, verb tense ratios and infinitive_ratio are especially important.
  • The highest lexical diversity is shown by EBronte and CBronte.
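These per-author importances can be pulled from the fitted model roughly like this (a sketch, assuming final_fit is the result of last_fit() on the finalized logistic regression workflow with a glmnet engine):

```r
library(tidymodels)

extract_fit_parsnip(final_fit) %>%
  tidy() %>%                         # one row per (class, term) coefficient
  filter(term != "(Intercept)") %>%
  group_by(class) %>%
  slice_max(abs(estimate), n = 10)   # top 10 features per author
```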

Conclusion

The Logistic Regression and XGBoost models showed the best performance among all tested models: Logistic Regression reached accuracy = 0.717 and f-measure = 0.714, while XGBoost had similar results (accuracy = 0.708, f-measure = 0.691). Both models had very high ROC AUC (> 0.95), meaning they distinguish well between classes.

The features based on linguistic characteristics - part-of-speech frequencies, word and sentence length, TTR, and grammatical form ratios - turned out to be informative and sufficient for successful author attribution.

Model Based on Frequent N-grams and Stopwords

In the second model, the corpus is split into chunks of 1000 tokens. From each chunk, we extract frequency-based features related to common word combinations (bigrams and trigrams) and stopwords. Unlike the first model, here we do not use any grammatical or syntactic characteristics.

Exploratory Analysis

We plan to train the model on 2702 predictors: stopword frequencies and n-gram frequencies. The data form a sparse matrix. Let’s try to apply dimensionality reduction methods: PCA and UMAP.
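A sketch of how the sparse n-gram features could be specified with textrecipes (chunks is the hypothetical chunk table; the stopword-frequency columns would be computed separately and joined on):

```r
library(recipes)
library(textrecipes)

ngram_rec <- recipe(author ~ text, data = chunks) %>%
  step_tokenize(text) %>%
  step_ngram(text, num_tokens = 3, min_num_tokens = 2) %>%  # bigrams and trigrams
  step_tokenfilter(text, max_tokens = 1000) %>%             # keep the 1000 most frequent
  step_tf(text, weight_scheme = "term frequency")           # relative frequencies
```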

PCA for Exploratory Analysis

On the plot we can see that Richardson is quite well separated along PC1. The other authors overlap heavily, which suggests that the classes are not linearly separable with the selected predictors.

UMAP for Exploratory Analysis

Unlike PCA, here we see a clearer separation between authors. Richardson, Fielding, and Austen are especially well separated. The other authors still overlap quite a lot.

Model Building

We create specifications for two models - Linear SVM and Logistic Regression with Lasso regularization, which are good choices for working with sparse high-dimensional text features. These models are combined with three preprocessing options: no dimensionality reduction, PCA, and UMAP. This allows us to compare the effect of linear vs non-linear dimensionality reduction. All combinations are gathered into one workflow_set for further tuning and evaluation.
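A sketch of the two specifications and their combination with the three preprocessing variants (the PCA and UMAP recipes, ngram_pca_rec and ngram_umap_rec, are assumed to extend the basic n-gram recipe):

```r
library(parsnip)
library(workflowsets)

lasso_spec <- multinom_reg(penalty = tune(), mixture = 1) %>%  # mixture = 1 -> pure Lasso
  set_engine("glmnet")

svm_spec <- svm_linear(cost = tune()) %>%
  set_engine("LiblineaR") %>%
  set_mode("classification")

sparse_wf_set <- workflow_set(
  preproc = list(basic = ngram_rec, pca = ngram_pca_rec, umap = ngram_umap_rec),
  models  = list(lasso = lasso_spec, svm = svm_spec)
)
```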

Model Evaluation and Selection

We visualize the accuracy of models from the workflow_set.

The best results were shown by models trained on the basic recipe without dimensionality reduction.

We choose Logistic Regression with Lasso regularization. The model’s metrics:

  • f_meas - 0.970
  • accuracy - 0.980
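Ranking the workflows and refitting the chosen one could look roughly like this (a sketch; sparse_wf_res stands for the tuned workflow_set results, and the workflow id "basic_lasso" is an assumed name):

```r
library(dplyr)
library(tune)
library(workflowsets)

# Compare all preprocessing/model combinations by accuracy
rank_results(sparse_wf_res, rank_metric = "accuracy")

# Pull the best hyperparameters for the winning workflow
best_params <- sparse_wf_res %>%
  extract_workflow_set_result("basic_lasso") %>%
  select_best(metric = "accuracy")

# Finalize and fit on the training set, evaluate on the test set
final_fit <- sparse_wf_res %>%
  extract_workflow("basic_lasso") %>%
  finalize_workflow(best_params) %>%
  last_fit(data_split)  # data_split is an assumed rsample split
```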

Confusion matrix. The model shows high accuracy across most classes, as can be seen from the clearly defined diagonal.

Most Important Features by Author

Each author has distinct stable word forms and constructions that the Lasso model considered most helpful for distinguishing their texts.

  • Austen stands out with frequent expressions like any.thing, every.thing, don.t, very, soon, could
  • CBronte, Eliot, and EBronte almost never use upon, unlike other authors such as Dickens, Richardson, and Sterne
  • Dickens is marked by vocative forms like mr, my.dear, which might reflect his dialog-rich narrative style
  • The word which, often used to start relative clauses, is a strong marker for Fielding, Sterne, and Thackeray, and an anti-feature for ABronte and EBronte
  • The conjunction and (and, as a result, coordinate structures) is more typical for ABronte and Thackeray, but is an anti-feature for Fielding and Trollope
  • The conjunction but is one of the most important and frequent features for ABronte, and also has some weight in Richardson’s style
  • Sterne is characterized by constructions like my.uncle, my.father

Conclusion

The Lasso model helped identify interpretable lexical-grammatical features that reflect individual stylistic traits of the authors. The visualization clearly shows that most authors have stable linguistic markers - both positive and negative. The chosen predictors - stopword and n-gram frequencies - provide reliable differentiation between authors and lead to high classification accuracy.

General Conclusions

Using the tidymodels framework and two types of linguistic features - quantitative-linguistic and frequency-based (n-grams and stopwords) - we were able to build interpretable models that successfully distinguish the unique styles of authors of classic British prose. The best-performing models were Logistic Regression on the linguistic features (accuracy 0.717, ROC AUC above 0.95) and Lasso on the n-gram and stopword frequencies (accuracy 0.980). Feature visualization confirmed that there are clear stylistic differences between the authors.