Text Analysis with the `stylo` R package

Conner Robertson

What is ‘stylo’?

stylo is a flexible R package for computational text analysis and stylometry:

Stylometry (computational stylistics) is concerned with the quantitative study of writing style, and is particularly useful in the exploratory statistical analysis of texts with respect to authorial writing style.

`stylo`'s Processing Procedures

Preprocessing → Feature extraction → Statistical Analysis → Visualization
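The four stages can be sketched in base R on a toy corpus. This is only an illustration of the idea, not what `stylo` runs internally: the three texts are invented, and plain Manhattan distance on relative word frequencies stands in for the distance measures `stylo` offers.

```r
# Toy corpus: three invented "documents" (illustration only)
texts <- c(
  a = "the cat sat on the mat and the dog sat on the rug",
  b = "the dog lay on the rug and the cat lay on the mat",
  c = "a storm is coming and a ship is leaving a harbour"
)

# 1. Preprocessing: lower-case each text and split it into word tokens
tokens <- lapply(tolower(texts), function(x) strsplit(x, "[^a-z]+")[[1]])
names(tokens) <- names(texts)

# 2. Feature extraction: relative frequencies of the 5 most frequent words
mfw <- names(sort(table(unlist(tokens)), decreasing = TRUE))[1:5]
freqs <- t(vapply(
  tokens,
  function(tok) as.numeric(table(factor(tok, levels = mfw))) / length(tok),
  numeric(length(mfw))
))
colnames(freqs) <- mfw

# 3. Statistical analysis: pairwise distances, then hierarchical clustering
d  <- dist(freqs, method = "manhattan")
hc <- hclust(d, method = "ward.D2")

# 4. Visualization: dendrogram of the three documents
plot(hc, main = "Toy cluster analysis")
```

Documents `a` and `b` share most of their frequent words, so they should join first in the dendrogram, with `c` attached last.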

This example is an unsupervised analysis of:

Gospels
Matthew	Mark	Luke	John

Preparing the Gospel Texts

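library(dplyr)

# gospels_clean: one row per passage, with columns `gospel` and `text`
gospels_for_stylo <- gospels_clean |>
  group_by(gospel) |>
  summarize(
    full_text = paste(text, collapse = " "),  # join all rows into one string
    .groups = "drop"
  )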

This collapses all rows for each Gospel into a single document.
stylo compares whole texts, so each Gospel must be one full text document before the analysis can run.
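To see what the collapse does, here is a minimal sketch on a toy table. The data frame `gospels_toy` and its placeholder text are hypothetical; the sketch only assumes the same two columns, `gospel` and `text`, that the code above relies on.

```r
library(dplyr)

# Hypothetical stand-in for gospels_clean: one row per verse
gospels_toy <- tibble(
  gospel = c("Mark", "Mark", "John", "John"),
  text   = c("first verse", "second verse", "first verse", "second verse")
)

# Collapse every Gospel's rows into one string
gospels_toy_docs <- gospels_toy |>
  group_by(gospel) |>
  summarize(full_text = paste(text, collapse = " "), .groups = "drop")

# One row per Gospel, each holding the full joined text
gospels_toy_docs
```

Four verse-level rows become two document-level rows, which is the shape needed before writing one file per Gospel.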

Creating a Text `Corpus`

First, create a folder named “corpus” where the text files can be stored and used in the analysis:

dir.create("corpus", showWarnings = FALSE)

stylo() then needs each Gospel saved as its own .txt file

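library(purrr)  # pwalk()
library(readr)  # write_lines()

gospels_for_stylo |>
  mutate(
    file_name = paste0("corpus/", gospel, "_KJV.txt")
  ) |>
  # pwalk() passes each row's columns to the function by name
  pwalk(function(gospel, full_text, file_name) {
    write_lines(full_text, file_name)
  })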

list.files("corpus")

This code creates one text file for each Gospel and saves it inside the corpus folder.

Output for this example:

Gospel	File created
Matthew	`corpus/Matthew_KJV.txt`
Mark	`corpus/Mark_KJV.txt`
Luke	`corpus/Luke_KJV.txt`
John	`corpus/John_KJV.txt`
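The `Gospel_KJV.txt` naming pattern matters beyond tidiness: as far as the `stylo` documentation describes, the part of a file name before the first underscore is treated as the group label when colors are assigned on plots. The sketch below reproduces the naming step in base R and extracts those labels. The temporary folder `corpus_demo` and the placeholder texts are assumptions made for illustration.

```r
# Write two placeholder documents into a temporary corpus folder
corpus_dir <- file.path(tempdir(), "corpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)

docs <- c(Mark = "placeholder text for Mark", John = "placeholder text for John")
for (g in names(docs)) {
  writeLines(docs[[g]], file.path(corpus_dir, paste0(g, "_KJV.txt")))
}

# List the files and keep the text before the first underscore as the label
files  <- list.files(corpus_dir, pattern = "\\.txt$")
labels <- sub("_.*$", "", files)

# Quick sanity check: every file exists and is non-empty
sizes <- file.size(file.path(corpus_dir, files))
data.frame(file = files, label = labels, bytes = sizes)
```

Running the same check on the real `corpus` folder is a cheap way to confirm that all four Gospels were written before calling `stylo()`.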

Primary `stylo` function

Function	What it does
`stylo()`	Primary function used to compare writing style across texts

“It is quite a long story what this function does. Basically, it is an all-in-one tool for a variety of experiments in computational stylistics”

`stylo()` Sub-Arguments

Sub-Argument	What it does
`analysis.type`	Decides what stylometric analysis we want to run
`mfw.min` / `mfw.max` / `mfw.number`	Set how many “most frequent words” are included in the analysis
`mfw.incr`	Controls the step size when testing a range of most frequent words
`corpus.dir`	Tells `stylo` where the text files are stored (typically in a corpus folder)
`gui`	Controls whether stylo() opens an interactive menu or runs directly from written code
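The three frequency arguments work together: `stylo()` repeats the analysis for each setting from `mfw.min` up to `mfw.max` in steps of `mfw.incr`. A rough base R sketch of the settings that would be visited follows; the variable names and the values 100 and 500 are made up for illustration.

```r
# Hypothetical values for mfw.min, mfw.max and mfw.incr
mfw_min  <- 100
mfw_max  <- 500
mfw_incr <- 100

# The sequence of "most frequent words" settings implied by these values
mfw_settings <- seq(mfw_min, mfw_max, by = mfw_incr)
mfw_settings

# Setting min and max to the same value leaves a single setting
single_setting <- seq(300, 300, by = mfw_incr)
single_setting
```

This is why setting `mfw.min` and `mfw.max` to the same number produces exactly one analysis.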

Inputting `stylo()` function

`selected_mfw <- 300` sets the number of features used for comparison (the 300 most frequent words)

selected_mfw <- 300

`run_stylo()` stores the `stylo()` setup in one reusable function for later analysis

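library(stylo)

run_stylo <- function(mfw_number, analysis_type, ...) {
  stylo(
    gui = FALSE,                    # run from code, no interactive menu
    corpus.dir = "corpus",          # folder holding the .txt files
    mfw.min = mfw_number,           # same min and max: one MFW setting
    mfw.max = mfw_number,
    analysis.type = analysis_type,  # e.g. "CA" for cluster analysis
    ...                             # any further stylo() arguments
  )
}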

Running Analysis

Base R's `par()` function can now be used to set up the plotting environment for visualizing the `stylo` results.

par(
  bg = "grey96",
  mar = c(3, 3, 2, 2)
)

# argument names must match the run_stylo() definition
stylo_ca <- run_stylo(
  mfw_number = selected_mfw,
  analysis_type = "CA"
)
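Changes made with `par()` stay in effect for later plots on the same device, so it can help to save the old settings and restore them afterwards. A minimal base R sketch of that pattern, using a placeholder plot in place of the `stylo` output:

```r
# Remember the current margins for comparison later
old_mar <- par("mar")

# par() returns the previous values of the settings it changes
op <- par(bg = "grey96", mar = c(3, 3, 2, 2))

# Placeholder plot drawn with the custom settings
plot(1:10, main = "Placeholder plot")

# Restore the previous settings
par(op)
```

After `par(op)`, the margins and background are back to what they were before the custom settings were applied.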

Other `stylo` plotting arguments

# run_stylo() must forward extra arguments (via `...`) to stylo()
stylo_ca <- run_stylo(
  mfw_number = selected_mfw,
  analysis_type = "CA",

  # removes stylo's automatic titles
  titles.on.graphs = FALSE,

  # using stylo's automatic colors for labels
  colors.on.graphs = "colors",

  # adjusts the plot size, font size, and line thickness
  plot.custom.width = 8,
  plot.custom.height = 5,
  plot.font.size = 11,
  plot.line.thickness = 2,

  # displays the dendrogram horizontally
  dendrogram.layout.horizontal = TRUE
)

title("CA Plot of the Gospels Using 300 Most Frequent Words")
