Text Analysis with the stylo R package

Conner Robertson

What is ‘stylo’?

stylo is a flexible R package for computational text analysis and stylometry:

Stylometry (computational stylistics) is concerned with the quantitative study of writing style, and is particularly useful in the exploratory statistical analysis of texts with respect to authorial writing style.

stylo's Processing Procedures

Preprocessing → Feature extraction → Statistical analysis → Visualization
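To make the first two stages concrete, here is a minimal base R sketch of preprocessing and feature extraction on two made-up texts. The texts, the variable names, and the choice of three top words are all assumptions for illustration; this is not stylo's internal implementation, only the general idea of counting the most frequent words.

```r
# Two tiny made-up texts (hypothetical, for illustration only)
texts <- c(
  text_a = "the cat sat on the mat and the dog sat on the rug",
  text_b = "a cat and a dog ran to the mat"
)

# Preprocessing: lowercase, then split on anything that is not a letter
tokens <- lapply(texts, function(x) {
  words <- strsplit(tolower(x), "[^a-z]+")[[1]]
  words[words != ""]
})

# Feature extraction: rank words by frequency across the whole corpus
overall_counts <- sort(table(unlist(tokens)), decreasing = TRUE)
top_words <- names(overall_counts)[1:3]

# Relative frequency (in percent) of each top word within each text
freq_table <- t(vapply(tokens, function(words) {
  counts <- table(factor(words, levels = top_words))
  100 * as.numeric(counts) / length(words)
}, numeric(length(top_words))))
colnames(freq_table) <- top_words
```

The resulting freq_table has one row per text and one column per frequent word, which is the kind of table the statistical analysis stage works on.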

This example is an unsupervised analysis of:

Gospels
Matthew Mark Luke John

Preparing the Gospel Texts

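library(dplyr)

# gospels_clean is assumed to hold one row per passage, with columns gospel and text
gospels_for_stylo <- gospels_clean |>
  group_by(gospel) |>
  summarize(
    full_text = paste(text, collapse = " "),
    .groups = "drop"
  )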
  • This combines all rows for each Gospel into one full document.
  • Each Gospel needs to become one full text document before stylo can compare them.
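The effect of this step is easiest to see on a small example. The sketch below applies the same pipeline to a made-up three-row table; toy_verses and its contents are hypothetical stand-ins, not the real gospels_clean data.

```r
library(dplyr)

# Hypothetical stand-in for gospels_clean: one row per passage
toy_verses <- data.frame(
  gospel = c("Book_A", "Book_A", "Book_B"),
  text = c("first verse", "second verse", "only verse")
)

# Same pipeline as above: one row, and one long string, per book
toy_docs <- toy_verses |>
  group_by(gospel) |>
  summarize(
    full_text = paste(text, collapse = " "),
    .groups = "drop"
  )
```

Here toy_docs has two rows, and the two Book_A passages are joined into a single string separated by a space.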

Creating a Text Corpus

Create a folder named “corpus” to hold the text files used in the analysis

dir.create("corpus", showWarnings = FALSE)

stylo() then needs each Gospel saved as its own .txt file

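library(dplyr)   # mutate()
library(purrr)   # pwalk()
library(readr)   # write_lines()

gospels_for_stylo |>
  mutate(
    file_name = paste0("corpus/", gospel, "_KJV.txt")
  ) |>
  # pwalk() passes each row's columns to the function by name
  pwalk(function(gospel, full_text, file_name) {
    write_lines(full_text, file_name)
  })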

list.files("corpus")

This code creates one text file for each Gospel and saves it inside the corpus folder.

Output for this example:

Gospel     File created
Matthew    corpus/Matthew_KJV.txt
Mark       corpus/Mark_KJV.txt
Luke       corpus/Luke_KJV.txt
John       corpus/John_KJV.txt
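Before running the analysis, it can help to confirm that every file was written and is not empty. The base R sketch below does this for a throwaway folder with two placeholder files; the folder name, file names, and contents are assumptions for illustration, and the same check could be pointed at the real corpus folder.

```r
# Write two tiny placeholder files into a temporary folder (hypothetical corpus)
corpus_dir <- file.path(tempdir(), "corpus_check")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("one two three four five six", file.path(corpus_dir, "Sample_A.txt"))
writeLines("seven eight nine ten", file.path(corpus_dir, "Sample_B.txt"))

# Count whitespace-separated words in each .txt file
files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)
word_counts <- vapply(files, function(path) {
  text <- paste(readLines(path, warn = FALSE), collapse = " ")
  length(strsplit(trimws(text), "\\s+")[[1]])
}, integer(1))
names(word_counts) <- basename(files)
```

A count of zero, or a missing file name, would point to a problem in the writing step.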

Primary stylo function

Function    What it does
stylo()     Primary function used to compare writing style across texts

“It is quite a long story what this function does. Basically, it is an all-in-one tool for a variety of experiments in computational stylistics”

stylo() Sub-Arguments

Sub-Argument                      What it does
analysis.type                     Decides which stylometric analysis to run
mfw.min / mfw.max / mfw.number    Set how many “most frequent words” are included in the analysis
mfw.incr                          Controls the step size when testing a range of most frequent words
corpus.dir                        Tells stylo where the text files are stored (typically in a corpus folder)
gui                               Controls whether stylo() opens an interactive menu or runs directly from written code
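As a rough illustration of how a minimum, a maximum, and a step size describe a range of settings, the base R sketch below lists the values such a range would cover. The numbers are made up, and the use of seq() is an assumption about the general idea, not a copy of stylo's own loop.

```r
# Hypothetical settings: test 100, 200, ..., 500 most frequent words
mfw_min <- 100
mfw_max <- 500
mfw_incr <- 100
mfw_values <- seq(from = mfw_min, to = mfw_max, by = mfw_incr)

# When the minimum equals the maximum, only one setting is covered
single_value <- seq(from = 300, to = 300, by = 100)
```

Setting the minimum and maximum to the same number is therefore a simple way to request a single analysis.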

Calling the stylo() Function

selected_mfw <- 300 sets the number of most frequent words used as features for the comparison (here, 300)

selected_mfw <- 300

run_stylo() stores the stylo() setup in one reusable function for later analysis

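library(stylo)

run_stylo <- function(mfw_number, analysis_type) {
  stylo(
    gui = FALSE,               # run from code, no interactive menu
    corpus.dir = "corpus",     # folder holding the .txt files
    mfw.min = mfw_number,
    mfw.max = mfw_number,      # min = max, so a single MFW setting is used
    analysis.type = analysis_type
  )
}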
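The wrapper above accepts only two arguments. If further stylo() settings need to be passed through, one option is to forward them with R's `...` mechanism. The sketch below shows that pattern; the name run_stylo_with_options is hypothetical and not part of stylo. Defining the function does not call stylo(); calling it requires the stylo package and a corpus folder.

```r
# Hypothetical variant of run_stylo() that forwards extra settings to stylo()
run_stylo_with_options <- function(mfw_number, analysis_type, ...) {
  stylo::stylo(
    gui = FALSE,
    corpus.dir = "corpus",
    mfw.min = mfw_number,
    mfw.max = mfw_number,
    analysis.type = analysis_type,
    ...  # any additional stylo() settings supplied by the caller
  )
}
```

A call such as run_stylo_with_options(300, "CA", titles.on.graphs = FALSE) would then hand titles.on.graphs on to stylo() unchanged.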

Running Analysis

Next, base R's par() function sets up the plotting environment used to visualize the stylo() results.

par(
  bg = "grey96",
  mar = c(3, 3, 2, 2)
)

# argument names must match the run_stylo() definition
stylo_ca <- run_stylo(
  mfw_number = selected_mfw,
  analysis_type = "CA"
)

Other stylo plotting arguments

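stylo_ca <- stylo(
  # core settings, as in run_stylo()
  gui = FALSE,
  corpus.dir = "corpus",
  mfw.min = selected_mfw,
  mfw.max = selected_mfw,
  analysis.type = "CA",

  # removes stylo's automatic titles
  titles.on.graphs = FALSE,

  # uses stylo's automatic colors for labels
  colors.on.graphs = "colors",

  # adjusts the plot size, font size, and line thickness
  plot.custom.width = 8,
  plot.custom.height = 5,
  plot.font.size = 11,
  plot.line.thickness = 2,

  # displays the dendrogram horizontally
  dendrogram.layout.horizontal = TRUE
)

# adds a custom title to the finished plot
title("CA Plot of the Gospels Using 300 Most Frequent Words")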

Final Plotted Cluster Analysis
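As a rough illustration of the kind of computation behind a cluster analysis plot, the base R sketch below clusters four placeholder texts from a made-up frequency table. The data are invented, and the choice of a Delta-style distance with Ward linkage is an assumption; stylo's own defaults and implementation may differ.

```r
# Made-up relative frequencies (percent) for four placeholder texts
freqs <- matrix(
  c(6.1, 3.0, 2.2,
    6.0, 3.1, 2.1,
    4.2, 4.0, 1.0,
    4.1, 4.1, 1.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("text_A", "text_B", "text_C", "text_D"),
    c("the", "and", "of")
  )
)

# Standardize each word's frequencies so no single word dominates
z_scores <- scale(freqs)

# Delta-style distance: mean absolute difference of z-scores
delta_dist <- dist(z_scores, method = "manhattan") / ncol(z_scores)

# Hierarchical clustering, then cut the tree into two groups
clusters <- hclust(delta_dist, method = "ward.D")
groups <- cutree(clusters, k = 2)
```

Calling plot(clusters) draws the dendrogram. In this toy data, text_A and text_B fall into one group and text_C and text_D into the other, because their frequency profiles were constructed to be similar.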