Analysis of the GTA San Andreas Script

Project repository: https://github.com/Gjaset/Natural-Language-Processing-applied-to-the-GTA-San-Andreas-script

Script reference: https://gamefaqs.gamespot.com/ps2/914983-grand-theft-auto-san-andreas/faqs/36175

Introduction

Word-frequency analysis of a text corpus, such as the dialogue and script of GTA: San Andreas, reveals linguistic patterns that reflect both the game's narrative design and the construction of its characters and settings.

Beyond counting how often terms appear, studying how they co-occur shows which concepts are related to one another, exposing the semantic and thematic links that reinforce the setting and the narrative message.

In this context, statistical techniques and systematic counting offer an objective way to explore a cultural and narrative universe as large as this video game's.

Justification

Text analysis applied to GTA: San Andreas is relevant because it turns an entertainment product into an object of quantitative research.

While the player's experience centers on narrative and visuals, studying word frequency and correlation gives a structured, replicable view of the game's linguistic background.

Moreover, this approach does not require complex artificial intelligence tools: counting techniques, text mining and network analysis are enough to produce clear, accessible empirical evidence that enriches the interpretation of the game's script.

Objectives

Detect and quantify the frequency of key terms in the GTA: San Andreas text corpus.
Identify relationships between words to reveal narrative and thematic patterns.
Present the results through charts and visualizations that make the data easier to interpret.

Main Characters

The main characters of GTA San Andreas analyzed throughout this study are listed below. They were selected for their narrative relevance and their frequency of appearance in the game's script.

Key	Name	Role
CJ	Carl Johnson	Protagonist
SWEET	Sweet Johnson	CJ's brother
BIG SMOKE	Big Smoke	Traitor
RYDER	Ryder	Grove Street member
KENDL	Kendl Johnson	CJ's sister
CESAR	Cesar Vialpando	Kendl's boyfriend
CATALINA	Catalina	CJ's ex-girlfriend
WOOZIE	Wu Zi Mu	Leader of the Triads
TENPENNY	Officer Tenpenny	Corrupt police officer
PULASKI	Officer Pulaski	Corrupt police officer
TRUTH	The Truth	Hippie dealer
TORENO	Mike Toreno	Secret agent
MADD DOGG	Madd Dogg	Rapper
ZERO	Zero	Technician

Initial setup

Before any analysis, the text corpus of GTA: San Andreas had to be prepared and structured. This involved collecting the dialogue and script, normalizing it (removing symbols, fixing formatting and standardizing letter case) and splitting it into individual words.

This initial setup keeps the data clean and organized, so that the counting and analysis techniques applied later behave consistently and reliably.
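
As a minimal sketch of the normalization and segmentation steps just described, the base R snippet below processes two invented lines (they are not quotes from the script). It lowercases the text, replaces every character other than letters, digits, apostrophes and spaces with a space, splits on spaces and drops tokens that contain digits. The report's own pipeline relies on tidytext's `unnest_tokens()`, whose tokenization rules differ in detail, so this is only an approximation; `normalize_line` is a hypothetical helper name.

```r
# Toy input: two invented lines, not quotes from the game script
raw_lines <- c("CJ: Hey, Sweet! What's up?", "SWEET: Mission 2 starts NOW...")

normalize_line <- function(x) {
  x <- tolower(x)                      # standardize case
  x <- gsub("[^a-z0-9' ]", " ", x)     # replace symbols with spaces
  trimws(gsub(" +", " ", x))           # collapse repeated spaces
}

tokens <- unlist(strsplit(normalize_line(raw_lines), " ", fixed = TRUE))
tokens <- tokens[!grepl("[0-9]", tokens)]   # drop tokens that contain digits
print(tokens)
```

The digit filter mirrors the `filter(!grepl("[0-9]", word))` step used in the report's code.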

# Load required libraries
library(readr)  # provides read_lines(); the original loaded "reader"
library(tidyverse)
library(tidytext)
library(magrittr)
library(gridExtra)
library(wordcloud)
library(boot)
library(RColorBrewer)
library(reshape2)
library(igraph)
library(purrr)
library(networkD3)
library(kableExtra)

cat("Initializing GTA San Andreas analysis environment...\n")

## Initializing GTA San Andreas analysis environment...

# Read the script and tokenize it into words
gta_script <- read_lines("guionGTA.txt") %>% unlist()
gta_script <- tibble(line = seq_along(gta_script), text = gta_script) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!is.na(word)) %>%
  filter(!grepl(pattern = "[0-9]", x = word))

# Character names to preserve
character_names <- c("CJ", "CARL", "SWEET", "RYDER", "BIG SMOKE", "CESAR", 
                     "KENDL", "WOOZIE", "CATALINA", "TENPENNY", "PULASKI", 
                     "ZERO", "TRUTH", "TORENO", "MADD", "DOGG")
character_names_lower <- tolower(character_names)
data(stop_words)
filtered_stop_words <- stop_words %>% filter(!word %in% character_names_lower)
gta_script %<>% anti_join(filtered_stop_words, by = "word")

# Offensive-content filter (NRC + AFINN), excluding character names
nrc_negative <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("anger", "disgust", "fear")) %>% select(word)
afinn_negative <- get_sentiments("afinn") %>%
  filter(value <= -3) %>% select(word)
offensive_words <- bind_rows(nrc_negative, afinn_negative) %>% distinct(word) %>%
  filter(!tolower(word) %in% tolower(character_names))
cat("Tokens before offensive filter:", nrow(gta_script), "\n")

## Tokens before offensive filter: 15971

gta_script %<>% anti_join(offensive_words, by = "word")
cat("Tokens after offensive filter:", nrow(gta_script), "\n")

## Tokens after offensive filter: 14323

# Quick frequencies
top_words <- gta_script %>% count(word, sort = TRUE) %>% head(10)
knitr::kable(top_words, caption = "Top 10 words after preparation and filtering")

Top 10 words after preparation and filtering
word	n
cj	1327
mission	371
sweet	288
hey	251
smoke	201
yeah	188
cesar	175
carl	154
woozie	152
ryder	149


## Word frequency analysis

Frequency analysis counts how many times a given term appears in the corpus.

Although simple, this technique is the basis for more complex studies, because it shows which themes, characters or concepts recur most often in the game's script.

In this project, word frequency is not treated as an isolated count: it is also compared against correlations between terms. This gives a deeper view of how ideas and narratives connect within GTA: San Andreas.
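
To make the counting step concrete, here is a small base R sketch on an invented token vector (an assumption, not data from the script). It is roughly what `count(word, sort = TRUE)` does in the tidyverse code used in this report.

```r
# Invented tokens standing in for the cleaned corpus
tokens <- c("cj", "mission", "cj", "sweet", "cj", "mission", "grove")

# table() counts occurrences; sort() orders them from most to least frequent
word_counts <- sort(table(tokens), decreasing = TRUE)
freq_table <- data.frame(
  word = names(word_counts),
  n = as.integer(word_counts),
  stringsAsFactors = FALSE
)
print(freq_table)
```

The counts always sum to the number of tokens, which is a quick sanity check after any filtering step.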


### Most frequent words

``` r
# Load and process the data
gta_script <- read_lines("guionGTA.txt")
gta_script <- tibble(
  line = seq_along(gta_script),
  text = gta_script
) %>%
  unnest_tokens(input = text, output = word) %>%
  filter(!is.na(word)) %>%
  filter(!grepl(pattern = "[0-9]", x = word))

# Remove stop words while preserving character names
character_names <- c("CJ", "CARL", "SWEET", "RYDER", "BIG SMOKE", "CESAR", 
                    "KENDL", "WOOZIE", "CATALINA", "TENPENNY", "PULASKI", 
                    "ZERO", "TRUTH", "TORENO", "MADD DOGG")

character_names_lower <- tolower(character_names)
filtered_stop_words <- stop_words %>%
  filter(!word %in% character_names_lower)

gta_script %<>%
  anti_join(x = ., y = filtered_stop_words, by = "word")

# Plot the most frequent words
gta_script %>%
  count(word, sort = TRUE) %>%
  filter(n >= 100) %>%
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(x = word, y = n)) + 
  geom_col(fill="#335f3f", alpha = 0.8) +
  theme_light() +
  coord_flip() +
  xlab(NULL) +
  ylab("Frequency") +
  ggtitle("GTA SA: Word Count")
```

Offensive/violent content filter (lexicon-based)

# Use NRC (strong negative emotions) and AFINN (very negative scores)
cat("Using NRC and AFINN lexicons to filter offensive/violent terms...\n")

## Using NRC and AFINN lexicons to filter offensive/violent terms...

nrc_negative <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("anger", "disgust", "fear")) %>%
  select(word)

afinn_negative <- get_sentiments("afinn") %>%
  filter(value <= -3) %>%
  select(word)

offensive_words <- bind_rows(nrc_negative, afinn_negative) %>%
  distinct(word) %>%
  # Exclude character names
  filter(!tolower(word) %in% tolower(character_names))

cat("Terms flagged for filtering:", nrow(offensive_words), "\n")

## Terms flagged for filtering: 2591

cat("Tokens before filtering:", nrow(gta_script), "\n")

## Tokens before filtering: 15971

gta_script %<>% anti_join(offensive_words, by = "word")

cat("Tokens after filtering:", nrow(gta_script), "\n")

## Tokens after filtering: 14323

# Quick look at frequent terms after filtering
gta_script %>% count(word, sort = TRUE) %>% head(10) %>% knitr::kable()

word	n
cj	1327
mission	371
sweet	288
hey	251
smoke	201
yeah	188
cesar	175
carl	154
woozie	152
ryder	149

Counting words per character shows how central each figure is to the game's narrative. Comparing individual vocabularies brings out differences in style, frequency of appearance and role within the story.

Main characters such as CJ tend to have a broad vocabulary tied to direct action, while other characters may concentrate on narrower themes such as business, violence or family. Each voice in the game thereby reinforces its own identity and contributes to the overall construction of the plot.

Word cloud

par(mfrow = c(1,1), mar=c(2,2,4,2), mgp=c(2,1,0))
set.seed(123)
gta_script %>%
  count(word, sort = TRUE) %>%
  with(wordcloud(
    words = word,
    freq = n,
    max.words = 50,
    colors = "#335f3f",
    scale = c(5, 3),  # Size range: largest and smallest word
    min.freq = 2,       # Minimum frequency
    rot.per = 0.35,     # 35% of words rotated
    random.order = FALSE # Most frequent words in the center
  ))
title(main = "Word Cloud - GTA SA", cex.main = 2)

The word cloud gives a visual reading of those frequencies: the more frequent a term, the larger it is drawn. For GTA: San Andreas, it shows at a glance which concepts dominate the conversations, such as references to homie, CJ, Casino or the missions. The word cloud therefore not only summarizes the counts but also helps identify the most repeated narrative axes of the game.

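Frequency Analysis by Character and Correlations

Word frequency by character

# Function to identify main characters (extended).
# Note: tokens are single words, so the two-word entries "BIG SMOKE" and
# "MADD DOGG" never match; neither character appears in the counts below.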
get_main_characters <- function(word){
  word_upper <- toupper(word)
  case_when(
    word_upper %in% c("CJ", "CARL") ~ "cj",
    word_upper == "SWEET" ~ "sweet",
    word_upper == "RYDER" ~ "ryder",
    word_upper == "CESAR" ~ "cesar",
    word_upper == "KENDL" ~ "kendl",
    word_upper %in% c("BIG SMOKE") ~ "big_smoke",
    word_upper == "CATALINA" ~ "catalina",
    word_upper == "WOOZIE" ~ "woozie",
    word_upper == "TENPENNY" ~ "tenpenny",
    word_upper == "PULASKI" ~ "pulaski",
    word_upper == "TRUTH" ~ "truth",
    word_upper == "TORENO" ~ "toreno",
    word_upper %in% c("MADD DOGG") ~ "madd_dogg",
    word_upper == "ZERO" ~ "zero",
    TRUE ~ NA_character_
  )
}

# Process the script to attribute words to each character
script_with_characters <- gta_script %>%
  mutate(
    is_character = !is.na(get_main_characters(word)),
    character = get_main_characters(word)
  ) %>%
  fill(character, .direction = "down") %>%
  filter(!is_character, !is.na(character)) %>%
  select(word, character)

# Build one dataset per character
create_character_dataset <- function(character_name){
  script_with_characters %>% 
    filter(character == character_name) %>%
    select(word)
}

# Build the datasets for all characters
script_cj <- create_character_dataset("cj")
script_sweet <- create_character_dataset("sweet")
script_ryder <- create_character_dataset("ryder")
script_cesar <- create_character_dataset("cesar")
script_kendl <- create_character_dataset("kendl")
script_big_smoke <- create_character_dataset("big_smoke")
script_catalina <- create_character_dataset("catalina")
script_woozie <- create_character_dataset("woozie")
script_tenpenny <- create_character_dataset("tenpenny")
script_pulaski <- create_character_dataset("pulaski")
script_truth <- create_character_dataset("truth")
script_toreno <- create_character_dataset("toreno")
script_madd_dogg <- create_character_dataset("madd_dogg")
script_zero <- create_character_dataset("zero")

# Show the word count per character
cat("Words per character:\n")

## Words per character:

script_with_characters %>%
  count(character, sort = TRUE) %>%
  knitr::kable()

character	n
cj	6550
sweet	958
woozie	619
cesar	614
toreno	525
truth	496
ryder	427
tenpenny	307
zero	298
catalina	271
pulaski	179
kendl	156
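
The per-character counts above rely on a fill-down heuristic: every token that matches a character name is treated as a speaker tag, and the tokens that follow are attributed to that speaker until the next tag. The base R sketch below reproduces the idea on an invented token stream; `speakers` and the tokens are assumptions for illustration only.

```r
# Invented token stream: a speaker tag followed by that speaker's words
tokens <- c("cj", "hey", "homie", "sweet", "grove", "street", "cj", "yeah")
speakers <- c("cj", "sweet")   # assumed set of speaker tags

is_tag <- tokens %in% speakers
# cumsum() numbers each run that starts at a tag; an index of 0 would mean
# "before the first tag", which does not occur in this toy input
run_id <- cumsum(is_tag)
current_speaker <- tokens[is_tag][run_id]

attributed <- data.frame(
  word = tokens[!is_tag],
  character = current_speaker[!is_tag],
  stringsAsFactors = FALSE
)
print(table(attributed$character))
```

One consequence of this design is that a name mentioned inside someone else's line also switches the speaker, which may shift words toward frequently mentioned characters.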

# Build the table of relative frequencies
bind_rows(
  mutate(.data = script_cj, author = "cj"),
  mutate(.data = script_sweet, author = "sweet")
) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion, fill = 0) -> word_frequencies

# Show the most frequent shared words
word_frequencies %>%
  filter(cj != 0, sweet != 0) %>%
  arrange(desc(cj), desc(sweet)) %>%
  head(10) %>%
  knitr::kable(caption = "Top 10 words shared by CJ and Sweet")

Top 10 words shared by CJ and Sweet
word	cj	sweet
mission	0.0322137	0.0344468
hey	0.0287023	0.0156576
smoke	0.0201527	0.0344468
yeah	0.0183206	0.0187891
loc	0.0149618	0.0020877
og	0.0131298	0.0031315
madd	0.0108397	0.0041754
screen	0.0096183	0.0125261
dogg	0.0096183	0.0020877
car	0.0090076	0.0093946
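
Each proportion in this table is a word's count divided by the total number of words attributed to that character (`n / sum(n)`). As a worked check, the sketch below back-computes the counts of "mission" from the reported proportions and the per-character totals (6550 for CJ, 958 for Sweet). The counts themselves are an inference and are not printed in the report.

```r
# Totals per character, taken from the per-character count table
total_cj <- 6550
total_sweet <- 958

# Counts of "mission" back-computed from the reported proportions
# (an inference, not a value printed in the report)
n_mission_cj <- round(0.0322137 * total_cj)
n_mission_sweet <- round(0.0344468 * total_sweet)

prop_cj <- n_mission_cj / total_cj
prop_sweet <- n_mission_sweet / total_sweet
print(c(cj = prop_cj, sweet = prop_sweet))
```

Working with proportions rather than raw counts is what makes the two characters comparable despite CJ having far more attributed words.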

Correlation Analysis

# Correlation over the whole vocabulary
cor_all <- cor.test(x=word_frequencies$sweet, y=word_frequencies$cj)
cat("Correlation over the whole vocabulary:\n")

## Correlation over the whole vocabulary:

print(cor_all)

## 
##  Pearson's product-moment correlation
## 
## data:  word_frequencies$sweet and word_frequencies$cj
## t = 40.815, df = 2081, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6422248 0.6899714
## sample estimates:
##       cor 
## 0.6667818

# Correlation based only on shared words
shared_words <- word_frequencies %>%
  filter(cj != 0, sweet != 0)
cor_shared <- cor.test(x=shared_words$sweet, y=shared_words$cj)
cat("\nCorrelation based on shared words:\n")

## 
## Correlation based on shared words:

print(cor_shared)

## 
##  Pearson's product-moment correlation
## 
## data:  shared_words$sweet and shared_words$cj
## t = 15.591, df = 277, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6156489 0.7415525
## sample estimates:
##       cor 
## 0.6836546
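
The two outputs can be cross-checked by hand. For a Pearson correlation, the test statistic follows from the estimate and the degrees of freedom as t = r * sqrt(df / (1 - r^2)), with df = n - 2, so the full vocabulary holds 2083 distinct words and the shared subset 279. The sketch below recomputes t from the reported values; `t_from_r` is a hypothetical helper name.

```r
# Reported estimates and degrees of freedom from the two cor.test() outputs
r_all <- 0.6667818;    df_all <- 2081
r_shared <- 0.6836546; df_shared <- 277

# t statistic of a Pearson correlation: t = r * sqrt(df / (1 - r^2))
t_from_r <- function(r, df) r * sqrt(df / (1 - r^2))

print(round(c(all = t_from_r(r_all, df_all),
              shared = t_from_r(r_shared, df_shared)), 3))
```

The results agree with the reported t values (40.815 and 15.591) up to rounding. The much larger t for the full vocabulary comes mostly from the larger sample, not from a stronger correlation.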

Correlations

# Bootstrap for the relationship between relative frequencies

suppressMessages(suppressWarnings(library(boot)))

correlation_function <- function(data, indices){
  d <- data[indices, ]
  return(cor(d$sweet,d$cj))
}

correlation_data <- word_frequencies %>%
  select(sweet, cj)

set.seed(123)
bootstrap_correlation <- boot(data = correlation_data, statistic = correlation_function, R=2000)
boot.ci(bootstrap_correlation,  type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_correlation, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 0.4675,  0.7901 )  
## Calculations and Intervals on Original Scale

hist (bootstrap_correlation$t,
      main="Bootstrap Correlation Distribution",
      xlab="r",
      col = "#039BE5",
      border = "white")

#bootstrap analysis for the correlation of shared words

shared_correlation_function <- function(data, indices){
  d <- data[indices, ]
  return(cor(d$sweet, d$cj))
}
shared_correlation_data <- shared_words %>%
  select(sweet, cj)

set.seed(123)
bootstrap_shared_correlation <- boot(data = shared_correlation_data, statistic = shared_correlation_function, R = 2000)
boot.ci(bootstrap_shared_correlation, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootstrap_shared_correlation, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 0.4339,  0.8249 )  
## Calculations and Intervals on Original Scale

hist(bootstrap_shared_correlation$t,
     main = "Bootstrap Distribution (shared words)",
     xlab ="r",
     col ="#039BE5",
     border ="white")

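Bootstrap analysis applies resampling to check how stable a result is. Here the rows of the word-frequency table are resampled with replacement 2,000 times and the correlation between CJ and Sweet is recomputed each time; the spread of those replicates gives a confidence interval that does not depend on the distributional assumptions behind cor.test. Both 95% percentile intervals, (0.4675, 0.7901) for the whole vocabulary and (0.4339, 0.8249) for shared words, lie well above zero, so the positive correlation is unlikely to be a product of chance, although both are considerably wider than the parametric intervals.

In the case of GTA: San Andreas, this supports the view that the vocabulary shared by the two brothers reflects consistent patterns in the script rather than accidental results. Thematic associations such as violence and money, or drugs and gangs, are not tested directly by this step; they are explored through the bigram network.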
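
The percentile bootstrap can be sketched in a few lines of base R. The data below are synthetic (generated with `runif` and `rnorm`, not taken from the script), and the interval is computed with `quantile()`; `boot.ci()` from the boot package, used in this report, interpolates slightly differently, so the numbers would not match exactly.

```r
set.seed(123)
# Synthetic paired proportions; not derived from the game script
n <- 200
x <- runif(n)
y <- 0.7 * x + rnorm(n, sd = 0.2)
toy <- data.frame(sweet = x, cj = y)

# Resample rows with replacement and recompute the correlation each time
boot_r <- replicate(2000, {
  idx <- sample(nrow(toy), replace = TRUE)
  cor(toy$sweet[idx], toy$cj[idx])
})

# Percentile interval: the 2.5% and 97.5% quantiles of the replicates
ci <- quantile(boot_r, probs = c(0.025, 0.975))
print(ci)
```

Resampling whole rows keeps each pair of proportions together, which is what preserves the dependence being measured.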

Sentiment Analysis

Emotionally charged words

# Sentiment analysis
positive_words <- get_sentiments("bing") %>%
  filter(sentiment == "positive") %>%
  mutate(sentiment = "Positive")
negative_words <- get_sentiments("bing") %>%
  filter(sentiment == "negative") %>%
  mutate(sentiment = "Negative")

sentiment_words <- bind_rows(positive_words, negative_words)

# Plot the emotionally charged words
gta_script %>%
  inner_join(sentiment_words, by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  filter(n > 8) %>%
  mutate(n = ifelse(sentiment == "Negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y=n, fill = sentiment)) +
  geom_col() +
  scale_fill_manual(values = brewer.pal(8,"Dark2")[c(2,5)]) +
  # No fixed axis limits: every bar has |n| > 8, so ylim = c(-7, 7) would
  # crop all of them to the same length
  coord_flip() +
  labs(
    title = "GTA SA: Sentiment Count",
    y = "Frequency",
    x = NULL
  ) +
  theme_minimal()

Count-based sentiment analysis labels each word as positive or negative according to a lexicon (Bing in this case); words absent from the lexicon are left unclassified. This helps gauge the overall tone of the dialogue and, when applied scene by scene, identify those with the strongest emotional charge.

In GTA: San Andreas, negative words predominate, which reflects the conflict-driven, violent tone of the story. Note that this count is taken after the earlier filter removed the most strongly negative terms. Positive expressions also appear in moments of family dialogue or camaraderie, which gives the narrative an interesting contrast.
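
The counting procedure can be illustrated with a tiny hand-made lexicon. Both the tokens and the lexicon below are invented for the example; the report itself uses the Bing lexicon through `get_sentiments("bing")`. The sketch keeps only tokens found in the lexicon, counts them per sentiment, and mirrors negative counts below zero as in the diverging bar chart.

```r
# Invented tokens and a tiny hand-made lexicon (both are assumptions;
# the report itself uses the Bing lexicon from tidytext)
tokens <- c("love", "trouble", "family", "trouble", "respect", "dead", "love")
lexicon <- data.frame(
  word = c("love", "respect", "trouble", "dead"),
  sentiment = c("Positive", "Positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Keep only tokens present in the lexicon (equivalent to an inner join)
matched <- merge(data.frame(word = tokens, stringsAsFactors = FALSE),
                 lexicon, by = "word")
counts <- as.data.frame(table(word = matched$word, sentiment = matched$sentiment),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
# Negative counts are mirrored below zero for a diverging bar chart
counts$signed <- ifelse(counts$sentiment == "Negative", -counts$Freq, counts$Freq)
print(counts)
```

The token "family" is dropped because it is not in the toy lexicon, which is how unclassified words are handled by the join.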

Word cloud by sentiment

par(mfrow = c(1,1), mar = c(2,2,4,2), mgp = c(2,1,0))
set.seed(123)
gta_script %>%
  inner_join(sentiment_words, by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(
    colors = brewer.pal(8, "Dark2")[c(2, 5)],
    max.words = 50,
    title.size = 2.5,    # Larger title
    scale = c(4, 3),   # Size range: largest and smallest word
    rot.per = 0.35,      # 35% of words rotated
    random.order = FALSE  # Most frequent words in the center
  )
title("Sentiment Analysis - GTA SA", cex.main = 2)

The sentiment cloud shows the most frequent positive and negative words side by side. It gives a quick view of how emotions are distributed in the game and makes it easy to compare the two vocabularies.

Word Network Analysis

Bigram network

# Bigram processing
gta_raw_script <- read_lines("guionGTA.txt")
gta_raw_script <- tibble(
  line = seq_along(gta_raw_script),
  text = gta_raw_script
)

gta_script_bigrams <- gta_raw_script %>%
  unnest_tokens(input = text, output = bigram, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

# Build the bigram network (accented vowels are mapped to plain ones)
replacement_list <- list(
  'á' = 'a',
  'é' = 'e',
  'í' = 'i',
  'ó' = 'o',
  'ú' = 'u'
)

gta_bigram_counts <- gta_script_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!grepl(pattern = '[0-9]', x = word1)) %>%
  filter(!grepl(pattern = '[0-9]', x = word2)) %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  mutate(word1 = chartr(
    old = names(replacement_list) %>% str_c(collapse = ''),
    new = replacement_list %>% str_c(collapse = ''),
    x = word1)) %>%
  mutate(word2 = chartr(
    old = names(replacement_list) %>% str_c(collapse = ''),
    new = replacement_list %>% str_c(collapse = ''),
    x = word2)) %>%
  filter(!is.na(word1)) %>%
  filter(!is.na(word2)) %>%
  count(word1, word2, sort = TRUE) %>%
  rename(weight = n)

# Plot the network with a high threshold
g <- gta_bigram_counts %>%
  filter(weight > 16) %>%
  graph_from_data_frame(directed = FALSE)

set.seed(123)
plot(
  g,
  layout = layout_with_fr,
  vertex.color = 1,
  vertex.frame.color = 1,
  vertex.size = 3,
  vertex.label.color = "black",
  vertex.label.cex = 1,
  vertex.label.dist = 1,
  main = "Word network (threshold = 16)"
)

The word network with a high threshold (16) shows only the strongest connections between terms. This makes it possible to identify clear thematic groups and solid relationships in the discourse, giving a concentrated view of the linguistic patterns.

In GTA: San Andreas, these connections show well-defined thematic groups, such as the relationship between drugs and violence, or between money and power. This level of analysis helps identify the core of the discourse, that is, the ideas that sustain the story and that constantly appear linked to one another.

# Plot the network with a low threshold
g <- gta_bigram_counts %>%
  filter(weight > 2) %>%
  graph_from_data_frame(directed = FALSE)

set.seed(123)
plot(
  g,
  layout = layout_with_kk,
  vertex.color = 1,
  vertex.frame.color = 1,
  vertex.size = 3,
  vertex.label = NA,
  main = "Word network (threshold = 2)"
)

The word network with a low threshold (2) includes many more connections, which exposes broader associations between terms. Although it may contain more noise, this representation gives a more complete view of the possible relationships in the game's language.

In this case the graph reflects a wider panorama of possible associations, showing how the game's different themes (family, gangs, police or business) end up intertwined. Although more weak relationships appear, this perspective helps convey the complexity of the script and how multiple concepts connect in the narrative of GTA: San Andreas.
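
The effect of the threshold is easy to see on a toy example. The base R sketch below forms bigrams from an invented token sequence, counts them and keeps only pairs above a threshold. In the report, `unnest_tokens(token = "ngrams", n = 2)` builds the bigrams line by line, so this is only an approximation of that step.

```r
# Invented token sequence standing in for one cleaned line of text
tokens <- c("grove", "street", "families", "grove", "street", "og", "grove", "street")

# Pair each token with its successor to form bigrams
bigrams <- data.frame(
  word1 = head(tokens, -1),
  word2 = tail(tokens, -1),
  stringsAsFactors = FALSE
)
counts <- aggregate(weight ~ word1 + word2,
                    data = cbind(bigrams, weight = 1), FUN = sum)
counts <- counts[order(-counts$weight), ]

# A threshold keeps only the stronger edges, as in the two network plots
strong_edges <- counts[counts$weight > 2, ]
print(strong_edges)
```

Raising the threshold removes the pairs seen only once and leaves the single recurring pair, which is the same trade-off between noise and coverage discussed for the two plots.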

Centrality in the word network (bigrams)

1. Closeness centrality

# Build the bigram graph for centrality measures (no threshold).
# Note: the edges carry a `weight` attribute (bigram counts), and igraph's
# closeness() and betweenness() read it as edge length by default, so a
# frequent bigram counts as a longer step. weights = NA is the usual way
# to request hop-based values instead.
bigram_graph <- gta_bigram_counts %>%
  select(word1, word2, weight) %>%
  graph_from_data_frame(directed = FALSE)

# Closeness
closeness_centrality <- closeness(bigram_graph, mode = "all", normalized = TRUE)
closeness_df <- data.frame(
  word = names(closeness_centrality),
  closeness = as.numeric(closeness_centrality),
  stringsAsFactors = FALSE
) %>% arrange(desc(closeness)) %>% head(15)

ggplot(closeness_df, aes(x = reorder(word, closeness), y = closeness, fill = closeness)) +
  geom_col(show.legend = FALSE) +
  scale_fill_viridis_c(option = "plasma") +
  coord_flip() +
  labs(title = "Closeness Centrality - Top 15", x = "Word", y = "Closeness (norm)") +
  theme_minimal()

2. Betweenness centrality

# Betweenness
betweenness_centrality <- betweenness(bigram_graph, directed = FALSE, normalized = TRUE)
betweenness_df <- data.frame(
  word = names(betweenness_centrality),
  betweenness = as.numeric(betweenness_centrality)
) %>% arrange(desc(betweenness)) %>% head(15)

ggplot(betweenness_df, aes(x = reorder(word, betweenness), y = betweenness, fill = betweenness)) +
  geom_col(show.legend = FALSE) +
  scale_fill_viridis_c(option = "inferno") +
  coord_flip() +
  labs(title = "Betweenness Centrality - Top 15", x = "Word", y = "Betweenness (norm)") +
  theme_minimal()

3. Degree centrality

# Degree
degree_centrality <- degree(bigram_graph, mode = "all", normalized = TRUE)
degree_df <- data.frame(
  word = names(degree_centrality),
  degree = as.numeric(degree_centrality)
) %>% arrange(desc(degree)) %>% head(15)

ggplot(degree_df, aes(x = reorder(word, degree), y = degree, fill = degree)) +
  geom_col(show.legend = FALSE) +
  scale_fill_viridis_c(option = "cividis") +
  coord_flip() +
  labs(title = "Degree Centrality - Top 15", x = "Word", y = "Degree (norm)") +
  theme_minimal()

4. Eigenvector centrality

# Eigenvector
eigen_centrality_vec <- eigen_centrality(bigram_graph, directed = FALSE)
eigen_df <- data.frame(
  word = names(eigen_centrality_vec$vector),
  eigenvector = as.numeric(eigen_centrality_vec$vector)
) %>% arrange(desc(eigenvector)) %>% head(15)

ggplot(eigen_df, aes(x = reorder(word, eigenvector), y = eigenvector, fill = eigenvector)) +
  geom_col(show.legend = FALSE) +
  scale_fill_viridis_c(option = "mako") +
  coord_flip() +
  labs(title = "Eigenvector Centrality - Top 15", x = "Word", y = "Eigenvector") +
  theme_minimal()

Network cohesion (words)

network_density <- edge_density(bigram_graph)
transitivity_global <- transitivity(bigram_graph, type = "global")
transitivity_local <- transitivity(bigram_graph, type = "local")
components_list <- components(bigram_graph)
num_components <- components_list$no
component_sizes <- components_list$csize

gc_nodes <- which(components_list$membership == which.max(component_sizes))
gc_subgraph <- induced_subgraph(bigram_graph, gc_nodes)
network_diameter <- diameter(gc_subgraph)
avg_path_length <- mean_distance(gc_subgraph)
assortativity_degree_val <- assortativity_degree(bigram_graph)

cat("Network cohesion summary (words):\n\n")

## Network cohesion summary (words):

print(data.frame(
  Densidad = round(network_density, 4),
  Transitividad_Global = round(transitivity_global, 4),
  Componentes = num_components,
  Diametro = network_diameter,
  Camino_Promedio = round(avg_path_length, 4),
  Asortatividad_Grado = round(assortativity_degree_val, 4)
))

##   Densidad Transitividad_Global Componentes Diametro Camino_Promedio
## 1   0.0014               0.0255         189       35          5.4472
##   Asortatividad_Grado
## 1             -0.0686

# Words with the highest local clustering
clustering_df <- data.frame(
  word = names(transitivity_local),
  local_clustering = transitivity_local
) %>% filter(!is.na(local_clustering)) %>% arrange(desc(local_clustering)) %>% head(20)

ggplot(clustering_df, aes(x = reorder(word, local_clustering), y = local_clustering, fill = local_clustering)) +
  geom_col() + coord_flip() + theme_minimal() +
  scale_fill_viridis_c(option = "mako", direction = -1) +
  labs(title = "Top 20 by Local Clustering", x = "Word", y = "Clustering")

Clustering (communities) in the word network

# Community detection algorithms
comunidades_louvain <- cluster_louvain(bigram_graph)
comunidades_labelprop <- cluster_label_prop(bigram_graph)
comunidades_walktrap <- cluster_walktrap(bigram_graph, steps = 4)

comparison_algorithms <- data.frame(
  Algoritmo = c("Louvain", "Label Propagation", "Walktrap"),
  Num_Comunidades = c(length(comunidades_louvain), length(comunidades_labelprop), length(comunidades_walktrap)),
  Modularidad = round(c(
    modularity(bigram_graph, membership(comunidades_louvain)),
    modularity(bigram_graph, membership(comunidades_labelprop)),
    modularity(bigram_graph, membership(comunidades_walktrap))
  ), 4)
)

knitr::kable(comparison_algorithms, caption = "Comparison of community detection algorithms")

Comparison of community detection algorithms
Algoritmo	Num_Comunidades	Modularidad
Louvain	217	0.6917
Label Propagation	357	0.6222
Walktrap	545	0.5632
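
The modularity column measures how much denser the links inside communities are than would be expected if edges were placed at random with the same node degrees. As a hedged illustration, the base R sketch below computes Q = (1 / 2m) * sum_ij (A_ij - k_i k_j / 2m) * [c_i = c_j] for a toy unweighted graph of two triangles joined by one edge. The graph and the membership vector are invented; igraph's `modularity()` additionally uses edge weights when they are present.

```r
# Toy undirected graph: two triangles (1-2-3 and 4-5-6) joined by edge 3-4
edges <- rbind(c(1, 2), c(2, 3), c(1, 3), c(4, 5), c(5, 6), c(4, 6), c(3, 4))
n <- 6
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1                # symmetric adjacency matrix

membership <- c(1, 1, 1, 2, 2, 2)   # one community per triangle
k <- rowSums(A)                     # node degrees
m <- sum(A) / 2                     # number of edges

# Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * [c_i == c_j]
same <- outer(membership, membership, "==")
Q <- sum((A - outer(k, k) / (2 * m)) * same) / (2 * m)
print(round(Q, 4))
```

For this toy graph Q equals 5/14, about 0.357. The values near 0.6 to 0.7 in the table therefore indicate a clearly stronger community structure than two loosely joined triangles.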

# Plot by community (Louvain)
layout_fr <- layout_with_fr(bigram_graph, niter = 300)
colors_comm <- rainbow(length(comunidades_louvain))
vertex_cols <- colors_comm[membership(comunidades_louvain)]

plot(bigram_graph,
     layout = layout_fr,
     vertex.size = 3,
     vertex.label = NA,
     vertex.color = vertex_cols,
     vertex.frame.color = NA,
     edge.width = 0.2,
     edge.color = rgb(0,0,0,0.05),
     main = "Word Network by Community (Louvain)")

Character Network and Interaction Analysis

This block replicates and extends the analysis in main.R: it builds networks between characters from their co-occurrences on the same line and classifies them by interaction type (mission, conversation, conflict).
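
Co-occurrence by line can be expressed compactly as a matrix product. The base R sketch below uses three invented lines and a reduced cast (both assumptions): it builds a line-by-character presence matrix with whole-word matching and then multiplies it by its transpose to count shared lines. The report's own function uses nested loops and plain substring matching instead.

```r
# Invented lines and a reduced cast; whole-word, case-insensitive matching
lines <- c("CJ meets Sweet and Ryder", "Sweet calls CJ", "Ryder drives alone")
cast <- c("CJ", "SWEET", "RYDER")

# presence[i, j] is TRUE when character j is named in line i
presence <- sapply(cast, function(name) {
  grepl(paste0("\\b", name, "\\b"), toupper(lines))
})

# crossprod() counts, for every pair, the lines in which both appear
cooc <- crossprod(presence * 1)
diag(cooc) <- 0     # ignore self-pairs
print(cooc)
```

The resulting symmetric matrix has the same shape as the adjacency matrix passed to `graph_from_adjacency_matrix()`.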

# Extended list of main characters
main_characters <- c(
  "CJ",                   # Protagonist (merges CJ and CARL)
  "SWEET",                # CJ's brother
  "CESAR",                # CJ's brother-in-law
  "KENDL",                # CJ's sister
  "RYDER",                # Friend/traitor
  "BIG SMOKE",            # Friend/main traitor
  "CATALINA",             # CJ's ex-girlfriend
  "WOOZIE",               # Leader of the Triads
  "TENPENNY",             # Main corrupt police officer
  "PULASKI",              # Corrupt police officer
  "TRUTH",                # Conspiracy-minded hippie
  "TORENO",               # Government agent
  "MADD DOGG",            # Rapper
  "ZERO"                  # Engineer/technician
)

# Function to detect co-occurrences per line
find_character_mentions <- function(text, characters) {
  mentions <- matrix(0, nrow = length(characters), ncol = length(characters))
  rownames(mentions) <- characters
  colnames(mentions) <- characters
  text <- toupper(text)
  for(i in seq_along(text)) {
    present_chars <- characters[sapply(characters, function(x) grepl(x, text[i]))]
    if(length(present_chars) > 1) {
      for(j in 1:(length(present_chars)-1)) {
        for(k in (j+1):length(present_chars)) {
          mentions[present_chars[j], present_chars[k]] <- mentions[present_chars[j], present_chars[k]] + 1
          mentions[present_chars[k], present_chars[j]] <- mentions[present_chars[k], present_chars[j]] + 1
        }
      }
    }
  }
  mentions
}

gta_script_raw <- read_lines("guionGTA.txt")

# Merge name variants of the same character
gta_script_raw <- str_replace_all(gta_script_raw, "\\bCARL\\b", "CJ")

adjacency_matrix <- find_character_mentions(gta_script_raw, main_characters)

character_graph <- graph_from_adjacency_matrix(adjacency_matrix, mode = "undirected", weighted = TRUE)

# Basic metrics
# Note: the graph carries a `weight` attribute (co-occurrence counts), and
# igraph's diameter() and mean_distance() read it as edge length by default,
# so the values below are sums of counts, not numbers of hops.
print(list(
  densidad = edge_density(character_graph),
  diametro = diameter(character_graph),
  camino_promedio = mean_distance(character_graph),
  clustering_global = transitivity(character_graph)
))

## $densidad
## [1] 0.4285714
## 
## $diametro
## [1] 41
## 
## $camino_promedio
## [1] 10.12088
## 
## $clustering_global
## [1] 0.5822785
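
The density can be read as a fraction of possible pairs. With 14 characters there are choose(14, 2) = 91 possible undirected pairs, so the reported density implies 39 connected pairs; that count is back-computed here and is not printed in the report. The sketch below reproduces the arithmetic.

```r
n_characters <- 14                       # length of main_characters
possible_pairs <- choose(n_characters, 2)
reported_density <- 0.4285714

# Number of connected pairs implied by the reported density
# (back-computed here, not printed in the report)
implied_edges <- round(reported_density * possible_pairs)
print(c(possible_pairs = possible_pairs, implied_edges = implied_edges))
```

The same reasoning puts the diameter in context: in a connected graph of 14 nodes no hop-based shortest path can exceed 13 steps, so a diameter of 41 suggests that edge weights are being summed as distances.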

set.seed(123)
plot(character_graph,
     vertex.color = "#335f3f",
     vertex.size = 30,
     vertex.label.color = "white",
     vertex.label.cex = 0.8,
     edge.width = E(character_graph)$weight/5,
     main = "Character Interaction Network - GTA SA")

Interactions by type (mission, conversation, conflict)

classify_interactions <- function(text) {
  mission_keywords <- c("MISSION", "OBJECTIVE", "TASK", "GO", "GET", "BRING", "FIND", 
                        "COMPLETE", "FINISH", "START", "BEGIN", "DO")
  conversation_keywords <- c("TALK", "SAY", "TELL", "SPEAK", "ASK", "ANSWER", 
                             "CHAT", "DISCUSS", "EXPLAIN", "MENTION", "REPLY")
  conflict_keywords <- c("FIGHT", "KILL", "SHOOT", "ATTACK", "ANGRY", "MAD",
                         "HATE", "ENEMY", "BATTLE", "WAR", "VIOLENCE", "GANG")
  txt <- toupper(text)
  # Match whole words only: a plain substring search lets "GO", "DO" or "MAD"
  # hit words such as "GOT" or "MADD DOGG". Inflected forms ("KILLED") are
  # no longer matched as a result.
  has_keyword <- function(keywords) {
    grepl(paste0("\\b(", paste(keywords, collapse = "|"), ")\\b"), txt)
  }
  # The first matching category wins, in this order
  case_when(
    has_keyword(mission_keywords) ~ "MISSION",
    has_keyword(conversation_keywords) ~ "CONVERSATION",
    has_keyword(conflict_keywords) ~ "CONFLICT",
    TRUE ~ "OTHER"
  )
}
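
Keyword classification of this kind is sensitive to how the match is made. The short base R sketch below contrasts a plain substring search with a whole-word search on one invented line; the two keywords are taken from the mission list.

```r
line <- "MADD DOGG GOT A GOOD DEAL"   # invented line

# Plain substring search: "DO" is found inside "DOGG", "GO" inside "GOT"
substring_hits <- sapply(c("GO", "DO"), grepl, x = line, fixed = TRUE)

# Whole-word search with \\b boundaries
word_hits <- sapply(c("GO", "DO"), function(k) grepl(paste0("\\b", k, "\\b"), line))

print(rbind(substring_hits, word_hits))
```

With substring matching, any line that names Madd Dogg would be read as containing a mission keyword, so short keywords are best matched as whole words.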

create_typed_mentions <- function(text, characters, interaction_type) {
  mentions <- matrix(0, nrow = length(characters), ncol = length(characters))
  rownames(mentions) <- characters
  colnames(mentions) <- characters
  text <- toupper(text)
  for(i in seq_along(text)) {
    if(classify_interactions(text[i]) == interaction_type) {
      present_chars <- characters[sapply(characters, function(x) grepl(x, text[i]))]
      if(length(present_chars) > 1) {
        for(j in 1:(length(present_chars)-1)) {
          for(k in (j+1):length(present_chars)) {
            mentions[present_chars[j], present_chars[k]] <- mentions[present_chars[j], present_chars[k]] + 1
            mentions[present_chars[k], present_chars[j]] <- mentions[present_chars[k], present_chars[j]] + 1
          }
        }
      }
    }
  }
  mentions
}

mission_graph <- graph_from_adjacency_matrix(create_typed_mentions(gta_script_raw, main_characters, "MISSION"), mode = "undirected", weighted = TRUE)
conversation_graph <- graph_from_adjacency_matrix(create_typed_mentions(gta_script_raw, main_characters, "CONVERSATION"), mode = "undirected", weighted = TRUE)
conflict_graph <- graph_from_adjacency_matrix(create_typed_mentions(gta_script_raw, main_characters, "CONFLICT"), mode = "undirected", weighted = TRUE)

par(mfrow = c(1,3), mar = c(1,1,2,1))
plot(mission_graph, vertex.color = "#335f3f", vertex.size = 22, vertex.label.color = "white", edge.width = pmax(1, E(mission_graph)$weight/3), main = "Missions")
plot(conversation_graph, vertex.color = "#335f3f", vertex.size = 22, vertex.label.color = "white", edge.width = pmax(1, E(conversation_graph)$weight/3), main = "Conversations")
plot(conflict_graph, vertex.color = "#335f3f", vertex.size = 22, vertex.label.color = "white", edge.width = pmax(1, E(conflict_graph)$weight/3), main = "Conflicts")

par(mfrow = c(1,1))
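
Because the interaction types are assigned by keyword, the way each keyword is matched has a large effect on the result. The following sketch contrasts plain substring matching with whole-word matching for two of the short keywords (GO and DO). The example lines are invented for illustration and are not quotes from the script.

```r
# Hypothetical lines, upper-cased as in the classifier
example_lines <- c("MADD DOGG IS ON THE ROOF",
                   "GO GET THE CAR",
                   "GOOD TO SEE YOU")

# Substring match: "DO" fires on "DOGG" and "GO" fires on "GOOD"
substring_hit <- grepl("DO|GO", example_lines)

# Whole-word match using \\b word boundaries
word_hit <- grepl("\\b(DO|GO)\\b", example_lines)

print(data.frame(line = example_lines, substring_hit, word_hit))
```

With substring matching all three lines look like missions; with word boundaries only the second one does.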

# Centrality metrics (helper that can be reused for any of the graphs above)
# Note: betweenness() and closeness() use the `weight` attribute as an edge
# length by default, so heavily co-occurring pairs count as far apart.
calc_metrics <- function(g){
  data.frame(
    character = V(g)$name,
    degree = degree(g),
    betweenness = betweenness(g),
    closeness = closeness(g),
    eigenvector = eigen_centrality(g)$vector,
    row.names = NULL  # avoid repeating the names as row labels
  )
}

knitr::kable(calc_metrics(character_graph) %>% arrange(desc(degree)) %>% head(10), caption = "Top 10 Characters - General Network (by degree)")

Top 10 Characters - General Network (by degree)
character	degree	betweenness	closeness	eigenvector
CJ	13	12.0	0.0062112	1.0000000
SWEET	8	0.0	0.0098039	0.6218662
CESAR	8	15.0	0.0121951	0.4885145
KENDL	7	27.5	0.0126582	0.1866408
TENPENNY	7	29.0	0.0123457	0.2606088
TRUTH	6	2.5	0.0108696	0.2679901
MADD DOGG	6	21.5	0.0121951	0.1321540
BIG SMOKE	5	10.0	0.0077519	0.3955531
RYDER	4	0.0	0.0085470	0.3631199
WOOZIE	4	9.5	0.0116279	0.2081957

Adjacency matrix of character interactions

# Convert the adjacency matrix to a data frame for a tidy display
adj_df <- as.data.frame(adjacency_matrix)
adj_df <- tibble(personaje = rownames(adjacency_matrix)) %>% bind_cols(adj_df)

# Show a compact table with custom styling
knitr::kable(
  adj_df, 
  caption = "Adjacency Matrix - Line-Level Co-occurrences between Characters",
  align = c('l', rep('c', ncol(adj_df) - 1)),
  format = "html"
) %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center",
    font_size = 11
  ) %>%
  kableExtra::column_spec(1, bold = TRUE, width = "8em") %>%
  kableExtra::scroll_box(width = "100%", height = "500px")

Adjacency Matrix - Line-Level Co-occurrences between Characters
personaje	CJ	SWEET	CESAR	KENDL	RYDER	BIG SMOKE	CATALINA	WOOZIE	TENPENNY	PULASKI	TRUTH	TORENO	MADD DOGG	ZERO
CJ	0	48	42	9	25	27	27	20	19	11	22	12	11	14
SWEET	48	0	6	6	11	12	0	0	3	0	3	0	2	0
CESAR	42	6	0	9	0	0	0	1	1	0	4	1	1	0
KENDL	9	6	9	0	0	0	0	1	1	0	4	0	1	0
RYDER	25	11	0	0	0	10	0	0	3	0	0	0	0	0
BIG SMOKE	27	12	0	0	10	0	0	0	5	3	0	0	0	0
CATALINA	27	0	0	0	0	0	0	0	0	0	0	0	0	0
WOOZIE	20	0	1	1	0	0	0	0	0	0	0	0	0	2
TENPENNY	19	3	1	1	3	5	0	0	0	11	0	0	0	0
PULASKI	11	0	0	0	0	3	0	0	11	0	0	0	0	0
TRUTH	22	3	4	4	0	0	0	0	0	0	0	0	1	2
TORENO	12	0	1	0	0	0	0	0	0	0	0	0	1	0
MADD DOGG	11	2	1	1	0	0	0	0	0	0	1	1	0	0
ZERO	14	0	0	0	0	0	0	2	0	0	2	0	0	0
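
The matrix above is enough to re-derive several of the reported network figures in base R. The sketch below copies the table by hand and computes degree, total co-occurrences and weighted shortest paths with a plain Floyd-Warshall loop. It assumes the graph metrics were computed with co-occurrence counts acting as edge lengths, which is igraph's default when a `weight` attribute is present; the variable names are illustrative.

```r
chars <- c("CJ", "SWEET", "CESAR", "KENDL", "RYDER", "BIG SMOKE", "CATALINA",
           "WOOZIE", "TENPENNY", "PULASKI", "TRUTH", "TORENO", "MADD DOGG", "ZERO")
m <- matrix(c(
   0, 48, 42,  9, 25, 27, 27, 20, 19, 11, 22, 12, 11, 14,
  48,  0,  6,  6, 11, 12,  0,  0,  3,  0,  3,  0,  2,  0,
  42,  6,  0,  9,  0,  0,  0,  1,  1,  0,  4,  1,  1,  0,
   9,  6,  9,  0,  0,  0,  0,  1,  1,  0,  4,  0,  1,  0,
  25, 11,  0,  0,  0, 10,  0,  0,  3,  0,  0,  0,  0,  0,
  27, 12,  0,  0, 10,  0,  0,  0,  5,  3,  0,  0,  0,  0,
  27,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
  20,  0,  1,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  2,
  19,  3,  1,  1,  3,  5,  0,  0,  0, 11,  0,  0,  0,  0,
  11,  0,  0,  0,  0,  3,  0,  0, 11,  0,  0,  0,  0,  0,
  22,  3,  4,  4,  0,  0,  0,  0,  0,  0,  0,  0,  1,  2,
  12,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,
  11,  2,  1,  1,  0,  0,  0,  0,  0,  0,  1,  1,  0,  0,
  14,  0,  0,  0,  0,  0,  0,  2,  0,  0,  2,  0,  0,  0
), nrow = 14, byrow = TRUE, dimnames = list(chars, chars))

degree   <- rowSums(m > 0)               # number of distinct partners
strength <- rowSums(m)                   # total co-occurring lines
n_edges  <- sum(m[upper.tri(m)] > 0)     # undirected edges

# Floyd-Warshall shortest paths, treating each count as an edge length
d <- ifelse(m > 0, m, Inf)
diag(d) <- 0
for (k in chars) for (i in chars) for (j in chars) {
  d[i, j] <- min(d[i, j], d[i, k] + d[k, j])
}

weighted_diameter <- max(d[upper.tri(d)])
weighted_mean     <- mean(d[upper.tri(d)])

print(sort(degree, decreasing = TRUE))
print(c(edges = n_edges, diameter = weighted_diameter, mean_distance = weighted_mean))
```

Under this reading a high count makes two characters "far apart", which is the opposite of how co-occurrence is usually interpreted. Passing `weights = NA` to the igraph functions, or inverting the counts (for example `1 / count`), would give distances in which frequent pairs are close.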

# Also export to CSV for external use
write.csv(adj_df, "character_interactions_adjacency.csv", row.names = FALSE)
cat("✓ Exported: character_interactions_adjacency.csv\n")

## ✓ Exported: character_interactions_adjacency.csv

Adjacency Matrix Heatmap

# Redefine the function if it does not exist (so the chunk also runs in isolation)
if(!exists("find_character_mentions")) {
  find_character_mentions <- function(text, characters) {
    mentions <- matrix(0, nrow = length(characters), ncol = length(characters))
    rownames(mentions) <- characters
    colnames(mentions) <- characters
    text <- toupper(text)
    for(i in seq_along(text)) {
      present_chars <- characters[sapply(characters, function(x) grepl(x, text[i]))]
      if(length(present_chars) > 1) {
        for(j in 1:(length(present_chars)-1)) {
          for(k in (j+1):length(present_chars)) {
            mentions[present_chars[j], present_chars[k]] <- mentions[present_chars[j], present_chars[k]] + 1
            mentions[present_chars[k], present_chars[j]] <- mentions[present_chars[k], present_chars[j]] + 1
          }
        }
      }
    }
    return(mentions)
  }
}

# Make sure the raw script lines are loaded
raw_lines <- read_lines("guionGTA.txt")
adjacency_matrix <- find_character_mentions(raw_lines, main_characters)
matrix_data <- adjacency_matrix

if(requireNamespace("pheatmap", quietly = TRUE)) {
  pheatmap::pheatmap(
    matrix_data,
    color = colorRampPalette(c("#FFFFFF", "#E8F5E9", "#A5D6A7", "#4CAF50", "#2E7D32"))(100),
    main = "Adjacency Matrix - Character Interactions",
    fontsize = 12,
    fontsize_row = 12,
    fontsize_col = 12,
    cluster_rows = FALSE,
    cluster_cols = FALSE,
    display_numbers = TRUE,
    number_color = "black",
    border_color = "grey60",
    cellwidth = 30,
    cellheight = 30
  )
} else {
  # Simple fallback when pheatmap is not installed.
  # image() draws matrix rows along the x axis and columns along the y axis.
  image(matrix_data,
        main = "Adjacency Matrix (fallback without pheatmap)",
        col = colorRampPalette(c("#FFFFFF", "#E8F5E9", "#A5D6A7", "#4CAF50", "#2E7D32"))(50),
        axes = FALSE)
  axis(1, at = seq(0,1,length.out = nrow(matrix_data)), labels = rownames(matrix_data), las = 2, cex.axis = 0.6)
  axis(2, at = seq(0,1,length.out = ncol(matrix_data)), labels = colnames(matrix_data), las = 2, cex.axis = 0.6)
}

# Export again (this chunk may have regenerated the matrix)
adj_df_heat <- tibble(personaje = rownames(matrix_data)) %>% bind_cols(as.data.frame(matrix_data))
write.csv(adj_df_heat, "character_interactions_adjacency.csv", row.names = FALSE)
cat("✓ Heatmap generated and matrix exported: character_interactions_adjacency.csv\n")

## ✓ Heatmap generated and matrix exported: character_interactions_adjacency.csv

Interpreting the Adjacency Matrix

* Each cell holds the number of script lines in which the two characters are mentioned together (line-level co-occurrences).
* Darker green indicates more frequent co-occurrence.
* The matrix is symmetric: the A-B count equals the B-A count.
* The main diagonal is zero because self-interactions are not counted.
* High values may indicate recurring shared scenes or strong narrative ties.
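
The line-level counting behind the matrix can be sketched on a few made-up lines. This is a minimal base R illustration of the same idea as `find_character_mentions`, using `combn()` to enumerate pairs; the character names and lines are hypothetical.

```r
# Hypothetical names and lines, for illustration only
characters <- c("ANN", "BOB", "CAT")
script_lines <- c("ANN: HEY BOB, WHERE IS CAT?",
                  "BOB: ASK ANN.",
                  "CAT: NOBODY HOME.")

cooc <- matrix(0, 3, 3, dimnames = list(characters, characters))
for (line in script_lines) {
  # Names that appear anywhere in the line (plain substring match)
  present <- characters[vapply(characters, grepl, logical(1), x = line, fixed = TRUE)]
  if (length(present) > 1) {
    pairs <- combn(present, 2)  # one column per unordered pair
    for (p in seq_len(ncol(pairs))) {
      a <- pairs[1, p]
      b <- pairs[2, p]
      cooc[a, b] <- cooc[a, b] + 1
      cooc[b, a] <- cooc[b, a] + 1
    }
  }
}
print(cooc)
```

Each line adds one to every pair of names it contains, in both directions, so the result is symmetric with a zero diagonal.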


Interactive Character Interaction Network

if (requireNamespace("networkD3", quietly = TRUE)) {
  library(networkD3)
  library(htmltools)
  
  # Use the adjacency matrix computed earlier
  if (!exists("adjacency_matrix")) {
    raw_lines <- read_lines("guionGTA.txt")
    adjacency_matrix <- find_character_mentions(raw_lines, main_characters)
  }
  
  # Build the graph from the matrix
  char_graph <- graph_from_adjacency_matrix(adjacency_matrix, mode = "undirected", weighted = TRUE)
  
  # Convert to networkD3 format
  char_d3 <- igraph_to_networkD3(char_graph)
  
  # Compute metrics for node size and grouping
  deg <- degree(char_graph, mode = "all")
  betw <- betweenness(char_graph, directed = FALSE)
  
  node_metrics <- tibble(
    name = V(char_graph)$name,
    degree = as.numeric(deg),
    betweenness = as.numeric(betw)
  )
  
  # Enrich nodes with the metrics
  char_d3$nodes <- char_d3$nodes %>%
    mutate(name = as.character(name)) %>%
    left_join(node_metrics, by = "name") %>%
    mutate(
      degree = replace_na(degree, 0),
      betweenness = replace_na(betweenness, 0),
      # Size based on degree (more connections = larger node).
      # if/else is used instead of ifelse(): with a length-one test, ifelse()
      # returns a single value and would give every node the same size.
      size = 15 + if (max(degree) > 0) 25 * degree / max(degree) else 0,
      # Group based on betweenness centrality (values 1 to 4)
      group = if (max(betweenness) > 0) 1 + floor(3 * betweenness / max(betweenness)) else 1
    )
  
  # Scale link values for display
  char_d3$links <- char_d3$links %>%
    mutate(value = pmax(1, value / 3))
  
  # Build the interactive visualization
  viz_characters <- forceNetwork(
    Links = char_d3$links,
    Nodes = char_d3$nodes,
    Source = "source",
    Target = "target",
    NodeID = "name",
    Group = "group",
    Nodesize = "size",
    Value = "value",
    opacity = 0.95,
    fontSize = 18,
    zoom = TRUE,
    linkDistance = 150,
    charge = -500,
    # Scale width by link value; a constant such as 2 draws every link the same
    linkWidth = JS("function(d) { return Math.sqrt(d.value); }"),
    linkColour = "#66666680",
    bounded = FALSE,
    legend = TRUE,
    opacityNoHover = 1,
    fontFamily = "Arial, sans-serif"
  )
  
  # Save a standalone file
  saveNetwork(viz_characters, "interactive_character_interactions.html")
  message("✓ Saved interactive_character_interactions.html")
  
  # Display with a title and description
  tagList(
    tags$h3("Interactive Character Interaction Network"),
    tags$p(paste0(
      "Visualization of ", nrow(char_d3$nodes), " main characters with ",
      nrow(char_d3$links), " connections. Node size reflects degree (the number of ",
      "distinct characters each one co-occurs with), and line thickness reflects co-occurrence frequency."
    )),
    viz_characters
  )
} else {
  cat("**networkD3 not available:** To view the interactive network, install the package with `install.packages('networkD3')`\n")
}

Interactive Character Interaction Network

Visualization of 14 main characters with 39 connections. Node size reflects degree (the number of distinct characters each one co-occurs with), and line thickness reflects co-occurrence frequency.
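
One R detail matters when node sizes are scaled from a metric: `ifelse()` returns a result shaped like its test, so a length-one test such as `max(degree) > 0` yields a single number rather than one value per node. The sketch below shows the difference with made-up degrees.

```r
degree <- c(13, 8, 4, 1)  # hypothetical degrees

# ifelse() with a scalar test keeps only the first element of the "yes" value
size_scalar <- 15 + ifelse(max(degree) > 0, 25 * degree / max(degree), 0)

# if/else evaluates the whole vector
size_vector <- 15 + if (max(degree) > 0) 25 * degree / max(degree) else 0

print(size_scalar)
print(round(size_vector, 2))
```

Inside `mutate()`, the single value from the first form would be recycled to every row, so all nodes would be drawn at the size of the first one.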

Skip-gram Analysis (context windows)

To avoid long computation times, an optional token limit is included. Adjust max_tokens to the resources available.

max_tokens <- Inf  # Can be set to, for example, 15000 to speed things up

gta_skipgram_indexed <- gta_script %>%
  select(word) %>%
  mutate(word_id = row_number()) %>%
  { if(is.finite(max_tokens)) dplyr::slice(., 1:min(n(), max_tokens)) else . }

window_size <- 4
gta_skipgrams <- gta_skipgram_indexed %>%
  mutate(window_start = pmax(1, word_id - window_size),
         window_end   = pmin(n(), word_id + window_size))

skipgram_pairs <- purrr::map_dfr(seq_len(nrow(gta_skipgrams)), function(i){
  focus <- gta_skipgrams$word[i]
  rng <- gta_skipgrams$window_start[i]:gta_skipgrams$window_end[i]
  rng <- rng[rng != i]
  tibble(focus_word = focus, context_word = gta_skipgrams$word[rng], distance = abs(i - rng))
})

skipgram_weighted <- skipgram_pairs %>%
  mutate(weight = 1 / distance) %>%
  group_by(focus_word, context_word) %>%
  summarise(total_weight = sum(weight), count = n(), mean_distance = mean(distance), .groups = 'drop') %>%
  arrange(desc(total_weight))

head(skipgram_weighted, 20) %>% knitr::kable(caption = "Top 20 skip-grams by weight")

Top 20 skip-grams by weight
focus_word	context_word	total_weight	count	mean_distance
cj	mission	256.41667	331	1.731118
mission	cj	256.41667	331	1.731118
cj	cj	244.83333	592	2.915541
cj	hey	169.50000	260	2.080769
hey	cj	169.50000	260	2.080769
cj	sweet	132.91667	264	2.587121
sweet	cj	132.91667	264	2.587121
cj	yeah	117.58333	182	2.082418
yeah	cj	117.58333	182	2.082418
loc	og	103.50000	129	1.666667
og	loc	103.50000	129	1.666667
cesar	cj	93.75000	175	2.388571
cj	cesar	93.75000	175	2.388571
cj	smoke	86.58333	168	2.505952
smoke	cj	86.58333	168	2.505952
dogg	madd	84.08333	107	1.691589
madd	dogg	84.08333	107	1.691589
cj	woozie	82.91667	156	2.397436
woozie	cj	82.91667	156	2.397436
cj	ryder	82.75000	158	2.512658
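
The weighting scheme can be traced by hand on a tiny example. This base R sketch mirrors the logic above (a symmetric window and a weight of 1 / distance) on a made-up four-token stream; the tokens and the window of 2 are assumptions chosen to keep the output small.

```r
tokens <- c("a", "b", "a", "c")  # hypothetical token stream
window_size <- 2

# One row per (focus, context) pair within the window
pairs <- do.call(rbind, lapply(seq_along(tokens), function(i) {
  lo <- max(1, i - window_size)
  hi <- min(length(tokens), i + window_size)
  ctx <- setdiff(lo:hi, i)
  data.frame(focus_word = tokens[i],
             context_word = tokens[ctx],
             distance = abs(i - ctx))
}))

# Closer context words contribute more
pairs$weight <- 1 / pairs$distance

weighted <- aggregate(weight ~ focus_word + context_word, data = pairs, FUN = sum)
weighted <- weighted[order(-weighted$weight), ]
print(weighted)
```

Because the window is symmetric, every pair appears in both directions with the same total, which is why the table lists rows such as cj/mission and mission/cj with identical values.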

Interactive Skip-gram Network

if (requireNamespace("networkD3", quietly = TRUE)) {
  library(networkD3)
  library(htmltools)
  
  # Local threshold (uses the existing one if it is defined)
  th <- if (exists("threshold")) threshold else 10
  
  # Prepare the edge data
  sg_df <- skipgram_weighted %>%
    filter(total_weight > th) %>%
    transmute(source = focus_word, target = context_word, value = total_weight)
  
  if (nrow(sg_df) > 0) {
    # Build the graph and convert it to networkD3 format
    sg_graph <- graph_from_data_frame(sg_df, directed = TRUE)
    sg_d3 <- igraph_to_networkD3(sg_graph)
    
    # Size based on outgoing strength; group based on betweenness.
    # The edge attribute is called `value`, not `weight`, so it is passed
    # explicitly; otherwise strength() falls back to the plain out-degree.
    out_w <- strength(sg_graph, mode = "out", weights = E(sg_graph)$value)
    betw <- betweenness(sg_graph, directed = TRUE)
    
    node_map <- tibble(
      name = igraph::V(sg_graph)$name, 
      out_weight = as.numeric(out_w),
      betweenness = as.numeric(betw)
    )
    
    sg_d3$nodes <- sg_d3$nodes %>%
      mutate(name = as.character(name)) %>%
      left_join(node_map, by = "name") %>%
      mutate(
        out_weight = replace_na(out_weight, 0),
        betweenness = replace_na(betweenness, 0),
        # if/else instead of ifelse(): a length-one test would make ifelse()
        # return a single value shared by every node
        size = 6 + if (max(out_weight) > 0) 20 * out_weight / max(out_weight) else 0,
        group = if (max(betweenness) > 0) 1 + floor(4 * betweenness / max(betweenness)) else 1
      )
    
    # Scale link values
    sg_d3$links <- sg_d3$links %>% 
      mutate(value = pmax(0.8, value / 8))
    
    # Interactive visualization
    viz <- forceNetwork(
      Links = sg_d3$links, 
      Nodes = sg_d3$nodes,
      Source = "source", 
      Target = "target", 
      NodeID = "name",
      Group = "group", 
      Nodesize = "size", 
      Value = "value",
      opacity = 0.9, 
      fontSize = 16, 
      zoom = TRUE,
      linkDistance = 90, 
      charge = -350, 
      linkWidth = 1.3,
      linkColour = "#88888880",
      bounded = FALSE,
      legend = FALSE,
      opacityNoHover = 1,
      fontFamily = "Arial, sans-serif"
    )
    
    saveNetwork(viz, "interactive_skipgram.html")
    message("✓ Saved interactive_skipgram.html")
    
    # Return with a title for display in the document
    tagList(
      tags$h3("Interactive Skip-gram Network"),
      tags$p(paste0("Interactive visualization of ", nrow(sg_d3$nodes), 
                    " words with ", nrow(sg_d3$links), 
                    " contextual connections (threshold = ", th, ")")),
      viz
    )
  } else {
    tagList(
      tags$h3("Interactive Skip-gram Network"),
      tags$p(tags$em(paste0(
        "There are not enough edges above the threshold (", th, 
        ") to build the interactive visualization. ",
        "Try lowering the threshold or increasing max_tokens."
      )))
    )
  }
} else {
  cat("**networkD3 not available:** The interactive skip-gram visualization is skipped.\n\n")
  cat("To enable it, install the package with: `install.packages('networkD3')`\n")
}

Interactive Skip-gram Network

Interactive visualization of 103 words with 351 contextual connections (threshold = 10)
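
The colour groups in the network come from binning betweenness with `1 + floor(4 * betweenness / max(betweenness))`. A short sketch with made-up scores shows how that formula distributes nodes; the capped variant is only a suggestion, not part of the original analysis.

```r
betw <- c(0, 2, 5, 10)  # hypothetical betweenness scores

# Formula used for the skip-gram network: groups run from 1 to 5,
# and only nodes with the maximum score reach group 5
group <- 1 + floor(4 * betw / max(betw))

# Possible variant: cap at 4 so the top group is not reserved for the maximum
group_capped <- pmin(4, group)

print(rbind(betw, group, group_capped))
```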

Conclusion

Taken together, the analyses give a deeper view of the language and themes that structure GTA: San Andreas. From word frequencies to association networks, the results show a narrative built around violence, gangs, money and power that also incorporates family, friendship and belonging. Beyond surfacing patterns in the script, these techniques offer a critical look at the social and cultural background that inspires the game's story.