Домашнее задание – Консенсусная сеть

Анализ по датасету

Author

Ярослав Курганов

Published

05.12.26

Abstract

Домашнее задание

#добавляем необходимые библиотеки
library(dendextend)


---------------------
Welcome to dendextend version 1.19.1
Type citation('dendextend') for how to cite the package.

Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/

Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
You may ask questions at stackoverflow, use the r and dendextend tags: 
     https://stackoverflow.com/questions/tagged/dendextend

    To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
---------------------


Attaching package: 'dendextend'

The following object is masked from 'package:stats':

    cutree

library(stylo)


### stylo version: 0.7.5 ###

If you plan to cite this software (please do!), use the following reference:
    Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
    a package for computational text analysis. R Journal 8(1): 107-121.
    <https://journal.r-project.org/archive/2016/RJ-2016-007/index.html>

To get full BibTeX entry, type: citation("stylo")

library(ape)


Attaching package: 'ape'

The following objects are masked from 'package:dendextend':

    ladderize, rotate

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.6.0
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ dplyr::where()  masks ape::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggsci)

Warning: package 'ggsci' was built under R version 4.5.2

library(phangorn)
library(TreeTools)

Warning: package 'TreeTools' was built under R version 4.5.2

Registered S3 method overwritten by 'TreeTools':
  method   from    
  [.phyDat phangorn

library(igraph)

Warning: package 'igraph' was built under R version 4.5.2


Attaching package: 'igraph'

The following object is masked from 'package:phangorn':

    diversity

The following objects are masked from 'package:lubridate':

    %--%, union

The following objects are masked from 'package:dplyr':

    as_data_frame, groups, union

The following objects are masked from 'package:purrr':

    compose, simplify

The following object is masked from 'package:tidyr':

    crossing

The following object is masked from 'package:tibble':

    as_data_frame

The following objects are masked from 'package:ape':

    degree, edges, mst, ring

The following objects are masked from 'package:stats':

    decompose, spectrum

The following object is masked from 'package:base':

    union

library(ggraph)
library(philentropy)

url <- "https://raw.githubusercontent.com/locusclassicus/text_analysis_2024/main/files/table_with_frequencies.txt"
lines <- readLines(url)
texts <- gsub('"', '', strsplit(lines[1], " ")[[1]])
words <- gsub('"', '', sapply(lines[-1], function(l) strsplit(l, " ")[[1]][1]))
freqs <- lapply(lines[-1], function(l) as.numeric(strsplit(l, " ")[[1]][-1]))
df <- do.call(rbind, freqs)
rownames(df) <- words
colnames(df) <- texts
df <- t(df)

#находим топ-500 слов по средней частоте
mean_freq <- colMeans(df)
top_words <- names(sort(mean_freq, decreasing = TRUE)[1:500])
df_selected <- df[, top_words]

#пишем функцию для дерева
get_tree <- function(df) {
  sampled_cols <- sample(ncol(df), size = ncol(df), replace = TRUE)
  df_boot <- df[, sampled_cols]
  df_z <- scale(df_boot)
  dist_mx <- dist(df_z, method = "manhattan")
  hc <- hclust(dist_mx, method = "average")
  as.phylo(hc)
}

set.seed(42)
trees <- map(1:100, ~get_tree(df_selected))

# делаем консенсусное дерево
cons <- consensus(trees, p = 0.5, rooted = FALSE)

# распределяем цвета по авторам
authors <- str_remove(cons$tip.label, "_.+")
unique_authors <- unique(authors)
pal <- pal_igv()(length(unique_authors))
col_map <- setNames(pal, unique_authors)
tip_colors <- col_map[authors]

#визаулизация -- делаем древо
par(mar = c(0,0,3,0))

plot(cons,
     tip.color = tip_colors,
     main = "Консенсусное дерево",
     font = 1,
     cex = 0.7,
     align.tip.label = TRUE)

#добавляем легенду
legend("topleft", 
       legend = unique_authors,
       col = pal,
       pch = 15,
       cex = 0.7,
       title = "Авторы")

По итогу

В анализе использованы 500 наиболее частотных слов

Консенснусное дерево – на основе 100 деревьев.

Как мы видим, “Тихий Дон” находится в “нижнем” кластере дерева вместе с другими произведениями Шолохова: “Поднятая целина”, “Донские рассказы” — но в то же время рядом остается “Они сражались за Родину” Островского. Нахождение “Тихого Дона” в одном месте с другими шолоховскими произведениями показывает их схожесть.

Также занятно, что другое произведение писателя “Судьба человека” от этого массива текстов отдалилось – и оказалось стилистически ближе к Севскому, отчасти Платонову.