Introduction

This analysis develops a predictive model for detecting cyber intrusions using the LightGBM algorithm. The work covers:

- Exploratory data analysis (EDA)
- Preprocessing
- Modeling with LightGBM
- Model evaluation
- Interpretation of results

Detailed Description of the Dataset Variables

Network Variables

1. `network_packet_size` (network packet size)

Type: Quantitative, continuous
Range: 64-1270 bytes
Meaning: Size in bytes of the packets transmitted over the network
Security relevance:
- Very small packets may indicate network probing
- Unusually large packets may signal buffer overflow attempts

2. `protocol_type` (protocol type)

Type: Qualitative, nominal
Values: TCP, UDP, ICMP
Description:
- TCP: Reliable connections (web, email, file transfer)
- UDP: Fast transmissions (video, voice, DNS)
- ICMP: Network messaging (ping, traceroute)
Associated risks:
- ICMP is frequently used in DoS (Denial of Service) attacks
- UDP is vulnerable to amplification attacks

3. `encryption_used` (encryption type)

Type: Qualitative, nominal
Values: AES, DES, None
Security levels:
- AES: Current standard (256-bit recommended)
- DES: Obsolete (vulnerable to brute-force attacks)
- None: No protection (risk of interception)

User Behavior Variables

4. `login_attempts` (login attempts)

Type: Quantitative, discrete
Range: 1-13
Typical range: 1-3 (legitimate users)
Risk threshold: >5 attempts per session

5. `failed_logins` (failed login attempts)

Type: Quantitative, discrete
Range: 0-5
Expected values: 0-1 for normal users
Critical indicator: >3 consecutive failures

6. `session_duration` (session duration)

Type: Quantitative, continuous (seconds)
Typical distribution:
- Web sessions: 2-15 minutes
- SSH connections: up to several hours
Red flag: Sessions >8 hours without activity

7. `unusual_time_access` (access at an unusual time)

Type: Binary (0/1)
Definition: Access outside typical working hours (7am-6pm)
Relevance: 85% of attacks occur outside working hours

8. `ip_reputation_score` (IP reputation)

Type: Quantitative, continuous
Scale: 0 (clean) to 1 (malicious)
Alert threshold: >0.7

9. `browser_type` (browser type)

Type: Qualitative, nominal
Common values: Chrome, Firefox, Edge, Safari, Unknown
Anomalies:
- Modified User-Agents
- Outdated browsers
- "Unknown" value (possible automation)
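
The alert thresholds listed for these variables can be read as simple rule-based flags. The sketch below is illustrative only: the `sessions` data frame and its values are made up, and the `flag_*` and `n_flags` column names are hypothetical. Only the thresholds themselves come from the descriptions above, and the `failed_logins` rule is applied to the per-session count because no information about consecutive failures is assumed to be available.

```r
# Toy sessions (made-up values, not taken from the dataset)
sessions <- data.frame(
  login_attempts      = c(2, 7, 4),
  failed_logins       = c(0, 4, 1),
  ip_reputation_score = c(0.10, 0.85, 0.72),
  browser_type        = c("Chrome", "Unknown", "Firefox")
)

# One logical flag per documented threshold
sessions$flag_logins  <- sessions$login_attempts > 5
sessions$flag_failed  <- sessions$failed_logins > 3
sessions$flag_ip      <- sessions$ip_reputation_score > 0.7
sessions$flag_browser <- sessions$browser_type == "Unknown"

# Number of rules triggered by each session
flag_cols <- c("flag_logins", "flag_failed", "flag_ip", "flag_browser")
sessions$n_flags <- rowSums(sessions[, flag_cols])

print(sessions[, c("login_attempts", "failed_logins", "ip_reputation_score", "n_flags")])
```

Such flags are only a baseline for intuition; the LightGBM model learns its own split points from the data rather than using these fixed cut-offs.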

Target Variable

10. `attack_detected` (attack detected)

Type: Binary (0/1)
Definition:
- 0: Validated normal activity
- 1: Confirmed intrusion

Initial Setup

# Core libraries
library(readr)
library(dplyr)
library(ggplot2)
library(readxl)
library(caret)

# Data summaries
library(summarytools)

# Preprocessing
library(fastDummies)

# Modeling
library(lightgbm)

# Evaluation
library(pROC)

# Plot style
theme_set(theme_minimal() + 
          theme(plot.title = element_text(hjust = 0.5, face = "bold"),
                legend.position = "bottom"))

Data Loading

cybersecurity_data <- read_excel("C:/Users/Katherine/Documents/UNIVERSIDAD/2025/I SEMESTRE 2025/Seminario/Semestral/cybersecurity_intrusion_data.xlsx")

# Dataset partition (80% train, 20% test)
set.seed(123) 
index <- createDataPartition(cybersecurity_data$attack_detected, 
                           p = 0.8, 
                           list = FALSE)

train_data <- cybersecurity_data[index, ]
test_data <- cybersecurity_data[-index, ]
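
To make the idea of an 80/20 partition that preserves the class mix concrete, here is a minimal base R sketch. It is not the `caret` implementation: the `toy` data frame is made up, and `rows_by_class`, `train_idx`, `train_toy` and `test_toy` are hypothetical names used only for illustration.

```r
set.seed(123)

# Made-up data: 55 normal sessions (0) and 45 attacks (1)
toy <- data.frame(
  id = 1:100,
  attack_detected = rep(c(0, 1), times = c(55, 45))
)

# Sample 80% of the row indices within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$attack_detected)
train_idx <- unlist(
  lapply(rows_by_class, function(rows) sample(rows, size = round(0.8 * length(rows)))),
  use.names = FALSE
)

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# The attack share is the same in both partitions
print(c(train = mean(train_toy$attack_detected),
        test  = mean(test_toy$attack_detected)))
```

Sampling within each class keeps the share of attacks equal in train and test, which makes metrics on the two sets directly comparable.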

1. Exploratory Data Analysis (EDA)

Distribution of the Target Variable

train_data |>
  mutate(attack_detected = as.factor(attack_detected)) |>
  count(attack_detected) |>
  ggplot(aes(x = attack_detected, y = n, fill = attack_detected)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = n), vjust = -0.5) +
  labs(title = "Distribution of Detected Attacks",
       x = "Attack Detected",
       y = "Count") +
  scale_fill_manual(values = c("#f8bbd0", "#bbdefb")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))

The plot shows the distribution of the target variable attack_detected in the training set: 55.5% of the sessions (4,231 cases) are normal activity (value 0), while 44.5% (3,399 cases) are detected intrusions (value 1).

This split is moderately balanced between classes, which is favorable for training the predictive model: it reduces the risk of bias toward the majority class and supports better generalization.
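
As a quick check, the class shares can be recomputed from the counts shown in the plot. This is a small sketch that uses only those two counts; the names `counts`, `shares` and `imbalance_ratio` are illustrative.

```r
# Class counts in the training set, as reported above
counts <- c(normal = 4231, attack = 3399)

# Percentage of each class, rounded to one decimal
shares <- round(100 * counts / sum(counts), 1)
print(shares)

# Ratio of majority to minority class (1 would be perfectly balanced)
imbalance_ratio <- max(counts) / min(counts)
print(round(imbalance_ratio, 2))
```

A ratio this close to 1 is why no resampling or class weighting is applied before training.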

Analysis of Categorical Variables

# Helper for bar charts of a categorical column
plot_categorical <- function(var, title, colors) {
  train_data |>
    mutate(!!sym(var) := as.factor(!!sym(var))) |>
    count(!!sym(var)) |>
    ggplot(aes(x = !!sym(var), y = n, fill = !!sym(var))) +
    geom_col(show.legend = FALSE) +
    geom_text(aes(label = n), vjust = -0.5) +
    labs(title = title,
         x = var,
         y = "Count") +
    scale_fill_manual(values = colors) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
}

# Protocol Type
p1 <- plot_categorical("protocol_type", "Distribution by Protocol Type",
                      c("#e8eaf6", "#fce4ec", "#f3e5f5"))

# Encryption Used
p2 <- plot_categorical("encryption_used", "Distribution by Encryption Type",
                      c("#b3e5fc", "#b2dfdb", "#fff3e0"))

# Browser Type
p3 <- plot_categorical("browser_type", "Distribution by Browser",
                      c("#ffcdd2", "#f8bbd0", "#e1bee7", "#d1c4e9", "#c5cae9"))

# Unusual Time Access
p4 <- plot_categorical("unusual_time_access", "Access at Unusual Times",
                      c("#fff9c4", "#ffe0b2"))

(p1)

TCP is the most used protocol, with 5,302 records, followed by UDP with 1,924 and, to a lesser extent, ICMP with 404. The presence of UDP and ICMP is still relevant, since both can serve as attack vectors, for example in amplification attacks (UDP) or DoS attacks (ICMP).

(p2)

AES is the most used encryption type, with 3,743 records, followed by DES with 2,295 and finally None (no encryption) with 1,592.

This distribution suggests that a considerable share of the traffic has adequate security measures (AES), while a significant volume of data is still protected with DES, an algorithm considered obsolete and vulnerable.

(p3)

Chrome dominates with 4,127 records, followed by Firefox with 1,557 and Edge with 1,173. In contrast, Safari and Unknown are much less frequent, with 385 and 388 records respectively.

The "Unknown" category may indicate access through automated agents or tools that hide the User-Agent, which is a possible sign of anomalous or malicious activity.

(p4)

The plot shows the frequency of sessions according to whether they occurred within (0) or outside (1) typical working hours (7am-6pm).

The large majority of accesses (6,484 cases) occur during normal hours, while 1,146 accesses (15.0%) were recorded at unusual times.

Although they are a minority, these accesses outside working hours may indicate anomalous behavior or intrusion attempts.

Analysis of Numeric Variables

# Descriptive statistics
vars_continuas <- train_data %>%
  select(network_packet_size, login_attempts, failed_logins,
         session_duration, ip_reputation_score)

descr(vars_continuas,
      stats = "common",
      transpose = TRUE,
      round.digits = 2) |>
  tb() |>
  kableExtra::kable(align = "c") |>
  kableExtra::kable_styling(full_width = FALSE)

variable	mean	sd	min	med	max	n.valid	n	pct.valid
failed_logins	1.5056356	1.0332814	0.0000000	1.0000000	5.0000000	7630	7630	100
ip_reputation_score	0.3319164	0.1769069	0.0024968	0.3160605	0.9242992	7630	7630	100
login_attempts	4.0394495	1.9693258	1.0000000	4.0000000	13.0000000	7630	7630	100
network_packet_size	502.3728702	198.4153965	64.0000000	502.0000000	1270.0000000	7630	7630	100
session_duration	789.7117430	780.7463393	0.5000000	554.8337738	7190.3922126	7630	7630	100

On average, users made 4.04 login attempts per session, with a mean of 1.51 failed logins. Both averages are above the ranges described as typical for legitimate users (1-3 attempts, 0-1 failures) but below the risk thresholds (more than 5 attempts, more than 3 failures); the maxima (13 attempts, 5 failures) show that some sessions do exceed those thresholds.

The IP reputation score has a mean of 0.33 on the 0 (clean) to 1 (malicious) scale, so most IPs have a fairly trustworthy reputation, although some values approach malicious levels (maximum of 0.92).

The average network packet size is 502 bytes, which is consistent with typical traffic, although there are extreme cases of up to 1,270 bytes.

Finally, the mean session duration is about 790 seconds (13 minutes), well above the median of about 555 seconds, so the distribution is right-skewed. Durations range from 0.5 seconds to almost 2 hours (7,190 seconds); the extremes may indicate automated or potentially compromised sessions.

2. Data Preprocessing

preprocess_data <- function(data) {
  processed <- data %>% 
    select(-session_id, -attack_detected) %>%
    mutate(across(c(protocol_type, encryption_used, browser_type), as.factor))
  
  dummy_cols(
    processed,
    select_columns = c("protocol_type", "encryption_used", "browser_type"),
    remove_selected_columns = TRUE,
    remove_first_dummy = TRUE
  ) %>% 
    as.matrix()
}

X_train <- preprocess_data(train_data)
X_test <- preprocess_data(test_data)

y_train <- train_data$attack_detected
y_test <- test_data$attack_detected
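
To show what the dummy encoding produces, here is a small sketch on made-up data. It deliberately uses base R's `model.matrix()` instead of `fastDummies::dummy_cols()` so that it runs without extra packages; the `toy` data frame and `X_toy` are hypothetical, and dropping the reference level here plays the same role as `remove_first_dummy = TRUE` in the function above.

```r
# Made-up sample with one categorical and one numeric column
toy <- data.frame(
  protocol_type  = factor(c("TCP", "UDP", "ICMP", "TCP")),
  login_attempts = c(3, 5, 2, 4)
)

# Expand the factor into 0/1 columns; the first level (ICMP) is the
# reference and gets no column. The intercept column is dropped.
X_toy <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]
print(X_toy)
```

A categorical variable with k levels becomes k - 1 indicator columns, which is consistent with the 14 features LightGBM reports using during training (6 numeric or binary columns plus 2 + 2 + 4 dummies).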

3. Modeling with LightGBM

Model Configuration

# Datasets for LightGBM
dtrain <- lgb.Dataset(data = X_train, label = y_train)
dtest <- lgb.Dataset(data = X_test, label = y_test)

# Model parameters
params <- list(
  objective = "binary",
  metric = "auc",
  boosting = "gbdt",
  num_leaves = 25,
  max_depth = 4,
  min_data_in_leaf = 200,
  learning_rate = 0.01,
  feature_fraction = 0.5,
  lambda_l1 = 10,
  lambda_l2 = 10,
  seed = 123
)

# Training
model <- lgb.train(
  params = params,
  data = dtrain,
  valids = list(train = dtrain, test = dtest),
  nrounds = 500,
  early_stopping_rounds = 30,
  verbose = 1
)

## [LightGBM] [Info] Number of positive: 3399, number of negative: 4231
## [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001374 seconds.
## You can set `force_col_wise=true` to remove the overhead.
## [LightGBM] [Info] Total Bins 803
## [LightGBM] [Info] Number of data points in the train set: 7630, number of used features: 14
## [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.445478 -> initscore=-0.218957
## [LightGBM] [Info] Start training from score -0.218957
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [1]:  train's auc:0.854516  test's auc:0.863584 
## Will train until there is no improvement in 30 rounds.
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [2]:  train's auc:0.87605  test's auc:0.882832 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [3]:  train's auc:0.878798  test's auc:0.883336 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [4]:  train's auc:0.882154  test's auc:0.874524 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [5]:  train's auc:0.881681  test's auc:0.873121 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [6]:  train's auc:0.881338  test's auc:0.872695 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [7]:  train's auc:0.882075  test's auc:0.873614 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [8]:  train's auc:0.882075  test's auc:0.873614 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [9]:  train's auc:0.881937  test's auc:0.873534 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [10]:  train's auc:0.88518  test's auc:0.876854 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [11]:  train's auc:0.88518  test's auc:0.876854 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [12]:  train's auc:0.885389  test's auc:0.87736 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [13]:  train's auc:0.886622  test's auc:0.882983 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [14]:  train's auc:0.886622  test's auc:0.882983 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [15]:  train's auc:0.886622  test's auc:0.882983 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [16]:  train's auc:0.886841  test's auc:0.884254 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [17]:  train's auc:0.886841  test's auc:0.884254 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [18]:  train's auc:0.887008  test's auc:0.884649 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [19]:  train's auc:0.887561  test's auc:0.884783 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [20]:  train's auc:0.887561  test's auc:0.884907 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [21]:  train's auc:0.887561  test's auc:0.884907 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [22]:  train's auc:0.887572  test's auc:0.885225 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [23]:  train's auc:0.88756  test's auc:0.885481 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [24]:  train's auc:0.888211  test's auc:0.885451 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [25]:  train's auc:0.888211  test's auc:0.885451 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [26]:  train's auc:0.889309  test's auc:0.887346 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [27]:  train's auc:0.889488  test's auc:0.887283 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [28]:  train's auc:0.889487  test's auc:0.887295 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [29]:  train's auc:0.889487  test's auc:0.887295 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [30]:  train's auc:0.890028  test's auc:0.887704 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [31]:  train's auc:0.890141  test's auc:0.887802 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [32]:  train's auc:0.890141  test's auc:0.887802 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [33]:  train's auc:0.890141  test's auc:0.887802 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [34]:  train's auc:0.890204  test's auc:0.887714 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [35]:  train's auc:0.890095  test's auc:0.887735 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [36]:  train's auc:0.890095  test's auc:0.887735 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [37]:  train's auc:0.8901  test's auc:0.887732 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [38]:  train's auc:0.890087  test's auc:0.887677 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [39]:  train's auc:0.890235  test's auc:0.887366 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [40]:  train's auc:0.8901  test's auc:0.887732 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [41]:  train's auc:0.890053  test's auc:0.887659 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [42]:  train's auc:0.890121  test's auc:0.887906 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [43]:  train's auc:0.890115  test's auc:0.887709 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [44]:  train's auc:0.890115  test's auc:0.887709 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [45]:  train's auc:0.890115  test's auc:0.887709 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [46]:  train's auc:0.890073  test's auc:0.888179 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [47]:  train's auc:0.890045  test's auc:0.887948 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [48]:  train's auc:0.89005  test's auc:0.887368 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [49]:  train's auc:0.890111  test's auc:0.887735 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [50]:  train's auc:0.890209  test's auc:0.888006 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [51]:  train's auc:0.890209  test's auc:0.888006 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [52]:  train's auc:0.89019  test's auc:0.88805 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [53]:  train's auc:0.89019  test's auc:0.88805 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [54]:  train's auc:0.890189  test's auc:0.88805 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [55]:  train's auc:0.890189  test's auc:0.88805 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [56]:  train's auc:0.89014  test's auc:0.887936 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [57]:  train's auc:0.89014  test's auc:0.887936 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [58]:  train's auc:0.890108  test's auc:0.887834 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [59]:  train's auc:0.890212  test's auc:0.887851 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [60]:  train's auc:0.890251  test's auc:0.887674 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [61]:  train's auc:0.890251  test's auc:0.887674 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [62]:  train's auc:0.890251  test's auc:0.887674 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [63]:  train's auc:0.890263  test's auc:0.887435 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [64]:  train's auc:0.890263  test's auc:0.887435 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [65]:  train's auc:0.890263  test's auc:0.887435 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [66]:  train's auc:0.890217  test's auc:0.887356 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [67]:  train's auc:0.890217  test's auc:0.887356 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [68]:  train's auc:0.890271  test's auc:0.887411 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [69]:  train's auc:0.890271  test's auc:0.887411 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [70]:  train's auc:0.890257  test's auc:0.887238 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [71]:  train's auc:0.890257  test's auc:0.887238 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [72]:  train's auc:0.890386  test's auc:0.887284 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [73]:  train's auc:0.890386  test's auc:0.887284 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [74]:  train's auc:0.890386  test's auc:0.887284 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [75]:  train's auc:0.890386  test's auc:0.887284 
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## [76]:  train's auc:0.890386  test's auc:0.887284 
## Early stopping, best iteration is: [46]:  train's auc:0.890073  test's auc:0.888179

4. Model Evaluation

Evolution of the Gini Coefficient

# Compute metrics
auc_train <- model$record_evals$train$auc$eval
auc_test <- model$record_evals$test$auc$eval
gini_train <- 2 * unlist(auc_train) - 1
gini_test <- 2 * unlist(auc_test) - 1

gini_df <- data.frame(
  iter = seq_along(gini_train),
  train = gini_train,
  test = gini_test
)

# Plot (linewidth replaces the deprecated size argument for lines in ggplot2 >= 3.4)
ggplot(gini_df, aes(x = iter)) +
  ylim(0.5, 1) +
  geom_line(aes(y = train, color = "Train"), linewidth = 1.2) +
  geom_line(aes(y = test, color = "Test"), linewidth = 1.2) +
  scale_color_manual(values = c("Train" = "#9fa8da", "Test" = "#f48fb1")) +
  labs(
    title = "Evolution of the Gini Coefficient during Training",
    x = "Iteration",
    y = "Gini",
    color = "Dataset"
  )+
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5))

The Gini coefficient (2 × AUC − 1) measures how well a model separates the classes, in this case intrusion vs. normal activity.

After a few initial iterations, the Gini quickly reaches a plateau in both sets, stabilizing at about 0.78 for both Train and Test (0.780 and 0.776 at the best iteration).

A Gini above 0.75 indicates high predictive power for distinguishing normal sessions from intrusions.
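
The relationship between AUC and Gini used in the code above can be checked by hand with the AUC values logged at the best iteration (46). This is a small worked sketch; `auc_best` and `gini_best` are illustrative names.

```r
# AUC values logged at the best iteration (46) of the training run
auc_best <- c(train = 0.890073, test = 0.888179)

# Gini = 2 * AUC - 1: rescales AUC so that 0 is random and 1 is perfect
gini_best <- 2 * auc_best - 1
print(round(gini_best, 3))
```

Because the transformation is linear, the Gini curve has exactly the same shape as the AUC curve; only the scale changes.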

ROC Curve

# Compute ROC curves
roc_train <- roc(y_train, predict(model, X_train))
roc_test <- roc(y_test, predict(model, X_test))

# Data frame for ggplot
roc_df <- bind_rows(
  data.frame(
    fpr = 1 - roc_train$specificities,
    tpr = roc_train$sensitivities,
    dataset = "Train"
  ),
  data.frame(
    fpr = 1 - roc_test$specificities,
    tpr = roc_test$sensitivities,
    dataset = "Test"
  )
)

# Plot (linewidth replaces the deprecated size argument for lines in ggplot2 >= 3.4)
ggplot(roc_df, aes(x = fpr, y = tpr, color = dataset)) +
  geom_line(linewidth = 1) +
  geom_abline(linetype = "dashed", color = "gray") +
  labs(title = "ROC Curves",
       x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)",
       caption = paste("AUC Train:", round(auc(roc_train), 3),
                      "| AUC Test:", round(auc(roc_test), 3))) +
  scale_color_manual(values = c("#d1c4e9", "#e57373"))

Both curves rise well above the diagonal (which represents random classification), indicating that the model performs well.

The area under the curve (AUC) is 0.89 for Train and 0.888 for Test. The small gap between the two shows that the model's ability to distinguish normal sessions from intrusions is consistent across both sets.
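
AUC can be read as the probability that a randomly chosen intrusion receives a higher score than a randomly chosen normal session. The sketch below computes it from ranks in base R on made-up scores; it is an illustration of that definition, not how `pROC` computes it internally, and `auc_rank` is a hypothetical helper.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic, rescaled)
auc_rank <- function(labels, scores) {
  r <- rank(scores)                 # average ranks, so ties are handled
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  u <- sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2
  u / (n_pos * n_neg)
}

# Made-up example: two normal sessions and two intrusions
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

# 3 of the 4 (intrusion, normal) pairs are ordered correctly
print(auc_rank(labels, scores))
```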

Confusion Matrix

# Confusion matrix for the test set
conf_matrix <- table(
  Predicted = factor(round(predict(model, X_test)), levels = c(1, 0)),
  Actual = factor(y_test, levels = c(0, 1))
) 

# Plot
ggplot(as.data.frame(conf_matrix), 
       aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 8, color = "black") +
  scale_fill_gradient(low = "white", high = "#f8bbd0") +
  labs(title = "Confusion Matrix (Test Set)",
       x = "Actual Value",
       y = "Prediction") +
  theme(legend.position = "none")

True Negatives (TN): 1042 → normal sessions correctly classified as normal.
False Positives (FP): 0 → no normal sessions were wrongly classified as intrusions.
False Negatives (FN): 236 → intrusions wrongly classified as normal.
True Positives (TP): 629 → intrusions correctly identified.

With 236 false negatives out of 865 actual intrusions, about 1 in 4 intrusions (27.3%) goes undetected.

In cybersecurity settings this is a risk, since some threats go unnoticed.

Even so, overall performance is solid, and the model has potential if it is tuned to reduce false negatives.
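
One common way to reduce false negatives is to lower the classification threshold, since `round()` on the predicted probabilities amounts to a cut-off of about 0.5. The sketch below illustrates the trade-off on made-up scores; the data and the `metrics_at` helper are hypothetical, and the thresholds 0.5 and 0.35 are arbitrary examples rather than tuned values.

```r
# Made-up labels and predicted probabilities (4 intrusions, 4 normal sessions)
labels <- c(1, 1, 1, 1, 0, 0, 0, 0)
scores <- c(0.90, 0.70, 0.45, 0.30, 0.40, 0.20, 0.10, 0.05)

# Precision and recall when predicting "intrusion" for scores >= threshold
metrics_at <- function(threshold) {
  pred <- as.integer(scores >= threshold)
  tp <- sum(pred == 1 & labels == 1)
  fp <- sum(pred == 1 & labels == 0)
  fn <- sum(pred == 0 & labels == 1)
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

result <- rbind(metrics_at(0.5), metrics_at(0.35))
print(result)
```

In this toy case, lowering the threshold raises recall from 0.50 to 0.75 at the cost of precision dropping from 1.00 to 0.75. On the real model, the threshold would need to be chosen on a validation set separate from the final test set.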

Classification Metrics

# Train (rows = actual class, columns = predicted class)
conf_matrix_train <- table(
  Actual    = y_train,
  Predicted = predict(model, X_train) |> round()
) |> as.matrix()

# Test (rows = actual class, columns = predicted class)
conf_matrix_test <- table(
  Actual    = y_test,
  Predicted = predict(model, X_test) |> round()
) |> as.matrix()

# Compute metrics

# Accuracy 
accuracy_train <- sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
accuracy_test  <- sum(diag(conf_matrix_test))  / sum(conf_matrix_test)

# Precision = TP / (TP + FP): divide by the "predicted 1" column total
precision_train <- conf_matrix_train["1","1"] / sum(conf_matrix_train[,"1"])
precision_test  <- conf_matrix_test["1","1"]   / sum(conf_matrix_test[,"1"])

# Recall = TP / (TP + FN): divide by the "actual 1" row total
recall_train <- conf_matrix_train["1","1"] / sum(conf_matrix_train["1",])
recall_test  <- conf_matrix_test["1","1"]   / sum(conf_matrix_test["1",])

# F1-Score
f1_train <- 2 * precision_train * recall_train / (precision_train + recall_train)
f1_test  <- 2 * precision_test * recall_test / (precision_test + recall_test)

# Results
metrics <- data.frame(
  Dataset = c("Train", "Test"),
  Accuracy = c(
    round(accuracy_train * 100, 2), 
    round(accuracy_test * 100, 2)
  ),
  Precision = c(
    round(precision_train * 100, 2), 
    round(precision_test * 100, 2)
  ),
  Recall = c(
    round(recall_train * 100, 2), 
    round(recall_test * 100, 2)
  ),
  F1 = c(
    round(f1_train, 4), 
    round(f1_test, 4)
  )
)

metrics

##   Dataset Accuracy Precision Recall     F1
## 1   Train    87.04       100  70.90 0.8297
## 2    Test    87.62       100  72.72 0.8420

Accuracy: The model is correct in about 87% of cases, in both training (87.04%) and test (87.62%), which indicates good generalization.
Precision: It is 100% in both sets: every session classified as an intrusion really was one, which matches the 0 false positives in the confusion matrix.
Recall: The model detects 70.90% (Train) and 72.72% (Test) of the actual intrusions (629 of 865 in test), so roughly 27% of attacks are missed at the default 0.5 threshold.
F1 Score: The harmonic mean of precision and recall is 0.83 in Train and 0.84 in Test. It is held down by recall, so the main room for improvement lies in catching more intrusions rather than in reducing false alarms.

5. Model Interpretation

Variable Importance

# Importance table
lgb.importance(model)

##                 Feature         Gain      Cover  Frequency
##                  <char>        <num>      <num>      <num>
## 1:        failed_logins 0.4487938856 0.17914107 0.13855422
## 2:       login_attempts 0.2985852159 0.23112251 0.20481928
## 3:  ip_reputation_score 0.2144655762 0.17149014 0.15060241
## 4: browser_type_Unknown 0.0347169007 0.13298742 0.12048193
## 5:     session_duration 0.0026356011 0.17891548 0.23493976
## 6:  network_packet_size 0.0006810178 0.07864690 0.12048193
## 7: encryption_used_None 0.0001218027 0.02769648 0.03012048

# Importance plot
lgb.importance(model) |> 
  lgb.plot.importance(top_n = 20)
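
The Gain column in the importance table is a share that sums to 1 across features, so cumulative gain shows how concentrated the model is. The sketch below simply re-enters the Gain values printed above; `gain` and `cum_gain` are illustrative names.

```r
# Gain values copied from the importance table above
gain <- c(
  failed_logins        = 0.4487938856,
  login_attempts       = 0.2985852159,
  ip_reputation_score  = 0.2144655762,
  browser_type_Unknown = 0.0347169007,
  session_duration     = 0.0026356011,
  network_packet_size  = 0.0006810178,
  encryption_used_None = 0.0001218027
)

# Cumulative share of total gain, from most to least important feature
cum_gain <- cumsum(sort(gain, decreasing = TRUE))
print(round(cum_gain, 4))
```

The first three features together account for about 96% of the total gain, which is why they dominate the interpretation of the model.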

Conclusions

The model achieved good performance, with an AUC of 0.888 on the test set.
The most important variables for detection were failed_logins, login_attempts, and ip_reputation_score.
At the default 0.5 threshold the model is conservative: it raises no false positives in the test set (0 of 1,042 normal sessions) but misses 236 of 865 intrusions (about 27%), so its errors lean entirely toward false negatives.

# Save model
lgb.save(model, "modelo_lgbm.txt")

Cyber Intrusion Analysis with LightGBM

Katherine Flores (8-987-656)

17/07/2025
