This analysis aims to develop a predictive model for detecting cyber intrusions using the LightGBM algorithm. The work covers:
Exploratory data analysis (EDA)
Preprocessing
Modeling with LightGBM
Model evaluation
Interpretation of results
network_packet_size (network packet size)
Type: Continuous quantitative
Range: 64-1270 bytes
Meaning: Size in bytes of the packets transmitted over the network
Security relevance:
Very small packets can indicate network probing
Unusually large packets can signal buffer overflow attempts
protocol_type (protocol type)
Type: Nominal categorical
Values: TCP, UDP, ICMP
Description:
TCP: Reliable connections (web, email, file transfer)
UDP: Fast transmissions (video, voice, DNS)
ICMP: Network messaging (ping, traceroute)
Associated risks:
ICMP is frequently used in DoS (Denial of Service) attacks
UDP is vulnerable to amplification attacks
encryption_used (encryption type)
Type: Nominal categorical
Values: AES, DES, None
Security levels:
AES: Current standard (256-bit recommended)
DES: Obsolete (vulnerable to brute-force attacks)
None: No protection (risk of interception)
login_attempts (login attempts)
Type: Discrete quantitative
Range: 1-13
Typical range: 1-3 (legitimate users)
Risk threshold: >5 attempts per session
failed_logins (failed login attempts)
Type: Discrete quantitative
Range: 0-5
Expected values: 0-1 for normal users
Critical indicator: >3 consecutive failures
session_duration (session duration)
Type: Continuous quantitative (seconds)
Typical distribution:
Web sessions: 2-15 minutes
SSH connections: up to several hours
Red flag: sessions >8 hours without activity
unusual_time_access (access at unusual hours)
Type: Binary (0/1)
Definition: Access outside typical business hours (7am-6pm)
Relevance: 85% of attacks occur outside business hours
ip_reputation_score (IP reputation)
Type: Continuous quantitative
Scale: 0 (clean) to 1 (malicious)
Alert threshold: >0.7
attack_detected (attack detected)
Type: Binary (0/1)
Definition:
0: Validated normal activity
1: Confirmed intrusion
# Main libraries
library(readr)
library(dplyr)
library(ggplot2)
library(readxl)
library(caret)
# Data visualization
library(summarytools)
# Preprocessing
library(fastDummies)
# Modeling
library(lightgbm)
# Evaluation
library(pROC)
# Plot styling
theme_set(theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "bottom"))
cybersecurity_data <- read_excel("C:/Users/Katherine/Documents/UNIVERSIDAD/2025/I SEMESTRE 2025/Seminario/Semestral/cybersecurity_intrusion_data.xlsx")
# Dataset split (80% train, 20% test)
set.seed(123)
index <- createDataPartition(cybersecurity_data$attack_detected,
p = 0.8,
list = FALSE)
train_data <- cybersecurity_data[index, ]
test_data <- cybersecurity_data[-index, ]
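As a quick sanity check (not part of the original analysis), the stratified split can be verified to preserve the class mix in both partitions:
# Class proportions should be nearly identical in train and test
prop.table(table(train_data$attack_detected))
prop.table(table(test_data$attack_detected))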
train_data |>
mutate(attack_detected = as.factor(attack_detected)) |>
count(attack_detected) |>
ggplot(aes(x = attack_detected, y = n, fill = attack_detected)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = n), vjust = -0.5) +
labs(title = "Distribución de Ataques Detectados",
x = "Ataque Detectado",
y = "Conteo") +
scale_fill_manual(values = c("#f8bbd0", "#bbdefb")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
The chart shows the distribution of the target variable attack_detected in the training set: 55.4% of sessions (4,231 cases) correspond to normal activity (value 0), while 44.6% (3,399 cases) represent detected intrusions (value 1).
This proportion reflects a moderately balanced class distribution, which is favorable for training the predictive model: it reduces the risk of bias toward the majority class and supports better generalization.
# Helper function for bar charts of categorical variables
plot_categorical <- function(var, title, colors) {
train_data |>
mutate(!!sym(var) := as.factor(!!sym(var))) |>
count(!!sym(var)) |>
ggplot(aes(x = !!sym(var), y = n, fill = !!sym(var))) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = n), vjust = -0.5) +
labs(title = title,
x = var,
y = "Conteo") +
scale_fill_manual(values = colors) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
}
# Protocol Type
p1 <- plot_categorical("protocol_type", "Distribution by Protocol Type",
                       c("#e8eaf6", "#fce4ec", "#f3e5f5"))
# Encryption Used
p2 <- plot_categorical("encryption_used", "Distribution by Encryption Type",
                       c("#b3e5fc", "#b2dfdb", "#fff3e0"))
# Browser Type
p3 <- plot_categorical("browser_type", "Distribution by Browser",
                       c("#ffcdd2", "#f8bbd0", "#e1bee7", "#d1c4e9", "#c5cae9"))
# Unusual Time Access
p4 <- plot_categorical("unusual_time_access", "Access at Unusual Hours",
                       c("#fff9c4", "#ffe0b2"))
TCP is the most used protocol, with 5,302 records, followed by UDP with 1,924 and, to a much lesser extent, ICMP with 404. The presence of UDP and ICMP is still relevant, since both can serve as attack vectors, as in amplification attacks (UDP) or DoS attacks (ICMP).
AES is the most common encryption type, with 3,743 records, followed by DES with 2,295 and finally None (no encryption) with 1,592.
This distribution suggests that a considerable share of the traffic carries adequate protection (AES), while a significant volume is still protected with DES, an algorithm considered obsolete and vulnerable.
Chrome dominates with 4,127 records, followed by Firefox with 1,557 and Edge with 1,173. In contrast, Safari and Unknown appear far less often, with 385 and 388 records respectively.
The "Unknown" category may indicate access by automated agents or tools that hide the User-Agent, a possible sign of anomalous or malicious activity.
The last chart shows session frequency by whether access occurred inside (0) or outside (1) typical business hours (7am-6pm).
The vast majority of accesses (6,484 cases) occur at normal times, while 1,146 were recorded at unusual hours.
Although a minority, accesses outside business hours can point to anomalous behavior or intrusion attempts, as the sketch below quantifies.
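A hedged follow-up (not part of the original code) comparing the attack rate inside and outside business hours; since attack_detected is coded 0/1, its mean within each group is the attack proportion:
# Attack rate by access time window
train_data |>
  group_by(unusual_time_access) |>
  summarise(sessions = n(),
            attack_rate = mean(attack_detected))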
# Descriptive statistics for the continuous variables
vars_continuas <- train_data %>%
select(network_packet_size, login_attempts, failed_logins,
session_duration, ip_reputation_score)
descr(vars_continuas,
stats = "common",
transpose = TRUE,
round.digits = 2) |>
tb() |>
kableExtra::kable(align = "c") |>
kableExtra::kable_styling(full_width = FALSE)
| variable            |   mean |     sd |   min |    med |     max | n.valid |    n | pct.valid |
|---------------------|-------:|-------:|------:|-------:|--------:|--------:|-----:|----------:|
| failed_logins       |   1.51 |   1.03 |  0.00 |   1.00 |    5.00 |    7630 | 7630 |       100 |
| ip_reputation_score |   0.33 |   0.18 |  0.00 |   0.32 |    0.92 |    7630 | 7630 |       100 |
| login_attempts      |   4.04 |   1.97 |  1.00 |   4.00 |   13.00 |    7630 | 7630 |       100 |
| network_packet_size | 502.37 | 198.42 | 64.00 | 502.00 | 1270.00 |    7630 | 7630 |       100 |
| session_duration    | 789.71 | 780.75 |  0.50 | 554.83 | 7190.39 |    7630 | 7630 |       100 |
On average, users made 4.04 login attempts, with a mean of 1.51 failed attempts; since the average already approaches the risk threshold of more than 5 attempts per session, a nontrivial share of sessions is potentially suspicious.
The IP reputation score has a mean of 0.33, indicating that most IPs are moderately trustworthy, although some values approach malicious levels (maximum of 0.92).
The average network packet size is 502 bytes, consistent with typical traffic, although extreme cases reach 1,270 bytes.
Finally, the average session duration is roughly 790 seconds (13 minutes), with values ranging from fractions of a second to nearly 2 hours, which may indicate automated or potentially compromised sessions; the sketch below probes those extremes.
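A small exploratory sketch (hypothetical; it uses the 99th percentile rather than the data dictionary's 8-hour red flag, since the observed maximum is just under 2 hours):
# Sessions above the 99th percentile of duration, and their attack rate
duration_p99 <- quantile(train_data$session_duration, 0.99)
train_data |>
  filter(session_duration > duration_p99) |>
  summarise(n = n(), attack_rate = mean(attack_detected))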
# Preprocessing: drop the identifier and the target, coerce the categorical
# variables to factors, then one-hot encode them (dropping the first dummy
# of each to avoid redundant columns)
preprocess_data <- function(data) {
  processed <- data %>%
    select(-session_id, -attack_detected) %>%
    mutate(across(c(protocol_type, encryption_used, browser_type), as.factor))
  dummy_cols(
    processed,
    select_columns = c("protocol_type", "encryption_used", "browser_type"),
    remove_selected_columns = TRUE,
    remove_first_dummy = TRUE
  ) %>%
    as.matrix()
}
X_train <- preprocess_data(train_data)
X_test <- preprocess_data(test_data)
y_train <- train_data$attack_detected
y_test <- test_data$attack_detected
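Because train and test are encoded separately, a category absent from one split would silently yield mismatched columns. A one-line guard, added here as a suggested precaution rather than original code:
# Abort if the encoded train and test matrices differ in their columns
stopifnot(identical(colnames(X_train), colnames(X_test)))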
# LightGBM datasets (the validation set is derived from the training one so
# both share the same feature binning)
dtrain <- lgb.Dataset(data = X_train, label = y_train)
dtest <- lgb.Dataset.create.valid(dtrain, data = X_test, label = y_test)
# Model parameters
params <- list(
objective = "binary",
metric = "auc",
boosting = "gbdt",
num_leaves = 25,
max_depth = 4,
min_data_in_leaf = 200,
learning_rate = 0.01,
feature_fraction = 0.5,
lambda_l1 = 10,
lambda_l2 = 10,
seed = 123
)
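These hyperparameters could also be validated with k-fold cross-validation before fitting the final model; a minimal sketch using lgb.cv from the same package (not part of the original workflow):
# 5-fold cross-validation with the same parameters and early stopping
cv <- lgb.cv(
  params = params,
  data = dtrain,
  nrounds = 500,
  nfold = 5,
  early_stopping_rounds = 30,
  verbose = -1
)
cv$best_iter  # iteration with the best mean validation AUC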
# Training
model <- lgb.train(
params = params,
data = dtrain,
valids = list(train = dtrain, test = dtest),
nrounds = 500,
early_stopping_rounds = 30,
verbose = 1
)
## [LightGBM] [Info] Number of positive: 3399, number of negative: 4231
## [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001374 seconds.
## You can set `force_col_wise=true` to remove the overhead.
## [LightGBM] [Info] Total Bins 803
## [LightGBM] [Info] Number of data points in the train set: 7630, number of used features: 14
## [LightGBM] [Info] [binary:BoostFromScore]: pavg=0.445478 -> initscore=-0.218957
## [LightGBM] [Info] Start training from score -0.218957
## [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
## (the same warning is emitted at every subsequent iteration; duplicates omitted)
## [1]: train's auc:0.854516 test's auc:0.863584
## Will train until there is no improvement in 30 rounds.
## [2]: train's auc:0.87605 test's auc:0.882832
## [3]: train's auc:0.878798 test's auc:0.883336
## [4]: train's auc:0.882154 test's auc:0.874524
## [5]: train's auc:0.881681 test's auc:0.873121
## [6]: train's auc:0.881338 test's auc:0.872695
## [7]: train's auc:0.882075 test's auc:0.873614
## [8]: train's auc:0.882075 test's auc:0.873614
## [9]: train's auc:0.881937 test's auc:0.873534
## [10]: train's auc:0.88518 test's auc:0.876854
## [11]: train's auc:0.88518 test's auc:0.876854
## [12]: train's auc:0.885389 test's auc:0.87736
## [13]: train's auc:0.886622 test's auc:0.882983
## [14]: train's auc:0.886622 test's auc:0.882983
## [15]: train's auc:0.886622 test's auc:0.882983
## [16]: train's auc:0.886841 test's auc:0.884254
## [17]: train's auc:0.886841 test's auc:0.884254
## [18]: train's auc:0.887008 test's auc:0.884649
## [19]: train's auc:0.887561 test's auc:0.884783
## [20]: train's auc:0.887561 test's auc:0.884907
## [21]: train's auc:0.887561 test's auc:0.884907
## [22]: train's auc:0.887572 test's auc:0.885225
## [23]: train's auc:0.88756 test's auc:0.885481
## [24]: train's auc:0.888211 test's auc:0.885451
## [25]: train's auc:0.888211 test's auc:0.885451
## [26]: train's auc:0.889309 test's auc:0.887346
## [27]: train's auc:0.889488 test's auc:0.887283
## [28]: train's auc:0.889487 test's auc:0.887295
## [29]: train's auc:0.889487 test's auc:0.887295
## [30]: train's auc:0.890028 test's auc:0.887704
## [31]: train's auc:0.890141 test's auc:0.887802
## [32]: train's auc:0.890141 test's auc:0.887802
## [33]: train's auc:0.890141 test's auc:0.887802
## [34]: train's auc:0.890204 test's auc:0.887714
## [35]: train's auc:0.890095 test's auc:0.887735
## [36]: train's auc:0.890095 test's auc:0.887735
## [37]: train's auc:0.8901 test's auc:0.887732
## [38]: train's auc:0.890087 test's auc:0.887677
## [39]: train's auc:0.890235 test's auc:0.887366
## [40]: train's auc:0.8901 test's auc:0.887732
## [41]: train's auc:0.890053 test's auc:0.887659
## [42]: train's auc:0.890121 test's auc:0.887906
## [43]: train's auc:0.890115 test's auc:0.887709
## [44]: train's auc:0.890115 test's auc:0.887709
## [45]: train's auc:0.890115 test's auc:0.887709
## [46]: train's auc:0.890073 test's auc:0.888179
## [47]: train's auc:0.890045 test's auc:0.887948
## [48]: train's auc:0.89005 test's auc:0.887368
## [49]: train's auc:0.890111 test's auc:0.887735
## [50]: train's auc:0.890209 test's auc:0.888006
## [51]: train's auc:0.890209 test's auc:0.888006
## [52]: train's auc:0.89019 test's auc:0.88805
## [53]: train's auc:0.89019 test's auc:0.88805
## [54]: train's auc:0.890189 test's auc:0.88805
## [55]: train's auc:0.890189 test's auc:0.88805
## [56]: train's auc:0.89014 test's auc:0.887936
## [57]: train's auc:0.89014 test's auc:0.887936
## [58]: train's auc:0.890108 test's auc:0.887834
## [59]: train's auc:0.890212 test's auc:0.887851
## [60]: train's auc:0.890251 test's auc:0.887674
## [61]: train's auc:0.890251 test's auc:0.887674
## [62]: train's auc:0.890251 test's auc:0.887674
## [63]: train's auc:0.890263 test's auc:0.887435
## [64]: train's auc:0.890263 test's auc:0.887435
## [65]: train's auc:0.890263 test's auc:0.887435
## [66]: train's auc:0.890217 test's auc:0.887356
## [67]: train's auc:0.890217 test's auc:0.887356
## [68]: train's auc:0.890271 test's auc:0.887411
## [69]: train's auc:0.890271 test's auc:0.887411
## [70]: train's auc:0.890257 test's auc:0.887238
## [71]: train's auc:0.890257 test's auc:0.887238
## [72]: train's auc:0.890386 test's auc:0.887284
## [73]: train's auc:0.890386 test's auc:0.887284
## [74]: train's auc:0.890386 test's auc:0.887284
## [75]: train's auc:0.890386 test's auc:0.887284
## [76]: train's auc:0.890386 test's auc:0.887284
## Early stopping, best iteration is: [46]: train's auc:0.890073 test's auc:0.888179
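The early-stopping result can also be read directly off the booster instead of the log; a short sketch, assuming the standard lightgbm accessor fields:
# Best iteration, its test AUC, and the equivalent Gini (2 * AUC - 1)
model$best_iter           # 46
model$best_score          # ~0.888, the test AUC at the best iteration
2 * model$best_score - 1  # ~0.776, the corresponding Gini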
# Extract the recorded AUC history and convert it to Gini (2 * AUC - 1)
auc_train <- model$record_evals$train$auc$eval
auc_test <- model$record_evals$test$auc$eval
gini_train <- 2 * unlist(auc_train) - 1
gini_test <- 2 * unlist(auc_test) - 1
gini_df <- data.frame(
iter = seq_along(gini_train),
train = gini_train,
test = gini_test
)
# Visualization
ggplot(gini_df, aes(x = iter)) +
  ylim(0.5, 1) +
  geom_line(aes(y = train, color = "Train"), linewidth = 1.2) +
  geom_line(aes(y = test, color = "Test"), linewidth = 1.2) +
  scale_color_manual(values = c("Train" = "#9fa8da", "Test" = "#f48fb1")) +
  labs(
    title = "Gini Coefficient Evolution During Training",
    x = "Iteration",
    y = "Gini",
    color = "Set"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
The Gini coefficient measures how well the model separates the classes, in this case intrusion vs. normal activity.
After a few initial iterations, the Gini quickly reaches a plateau on both sets, stabilizing around 0.78 for both Train and Test (consistent with Gini = 2 × AUC − 1 applied to the final AUCs of 0.890 and 0.888).
A Gini above 0.75 indicates strong discriminative power for separating normal sessions from intrusions.
# Compute ROC curves
roc_train <- roc(y_train, predict(model, X_train))
roc_test <- roc(y_test, predict(model, X_test))
# Data frame for ggplot
roc_df <- bind_rows(
data.frame(
fpr = 1 - roc_train$specificities,
tpr = roc_train$sensitivities,
dataset = "Train"
),
data.frame(
fpr = 1 - roc_test$specificities,
tpr = roc_test$sensitivities,
dataset = "Test"
)
)
# Plot
ggplot(roc_df, aes(x = fpr, y = tpr, color = dataset)) +
  geom_line(linewidth = 1) +
  geom_abline(linetype = "dashed", color = "gray") +
  labs(title = "ROC Curves",
       x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)",
       caption = paste("AUC Train:", round(auc(roc_train), 3),
                       "| AUC Test:", round(auc(roc_test), 3))) +
  scale_color_manual(values = c("#d1c4e9", "#e57373"))
Both curves rise well above the diagonal (which represents random classification), indicating good model performance.
The area under the curve (AUC) is 0.890 for Train and 0.888 for Test, reflecting a consistent, robust ability to distinguish normal sessions from intrusions.
# Confusion matrix for the test set (implicit 0.5 threshold via round())
conf_matrix <- table(
Predicted = factor(round(predict(model, X_test)), levels = c(1, 0)),
Actual = factor(y_test, levels = c(0, 1))
)
# Visualization
ggplot(as.data.frame(conf_matrix),
aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 8, color = "black") +
scale_fill_gradient(low = "white", high = "#f8bbd0") +
labs(title = "Matriz de Confusión (Conjunto de Test)",
x = "Valor Real",
y = "Predicción") +
theme(legend.position = "none")
True Negatives (TN): 1042 → normal sessions correctly classified as normal.
False Positives (FP): 0 → no normal sessions misclassified as intrusions.
False Negatives (FN): 236 → intrusions incorrectly classified as normal.
True Positives (TP): 629 → intrusions correctly identified.
With 236 false negatives, roughly 1 in 4 intrusions goes undetected.
In cybersecurity settings this is a real risk, since those threats pass unnoticed.
Even so, overall performance is solid, and the model has clear potential if adjusted to reduce false negatives, for example by lowering the decision threshold (see the sketch below).
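One hedged way to act on this, sketched below: read a candidate threshold off the ROC curve with pROC's coords() (Youden's J), or simply lower the 0.5 cutoff implied by round(); the 0.35 value is purely illustrative.
# Threshold maximizing Youden's J (sensitivity + specificity - 1) on the test ROC
best_thr <- pROC::coords(roc_test, "best", best.method = "youden",
                         ret = "threshold")
best_thr
# Confusion matrix at a lower, illustrative cutoff of 0.35
pred_adj <- as.integer(predict(model, X_test) > 0.35)
table(Predicted = pred_adj, Actual = y_test)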
# Train
conf_matrix_train <- table(
  Actual = y_train,
  Predicted = predict(model, X_train) |> round()
) |> as.matrix()
# Test
conf_matrix_test <- table(
  Actual = y_test,
  Predicted = predict(model, X_test) |> round()
) |> as.matrix()
# Metric computation (rows index Actual, columns index Predicted)
# Accuracy
accuracy_train <- sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
accuracy_test <- sum(diag(conf_matrix_test)) / sum(conf_matrix_test)
# Precision: TP / (TP + FP), correct among predicted positives (column "1")
precision_train <- conf_matrix_train["1","1"] / sum(conf_matrix_train[,"1"])
precision_test <- conf_matrix_test["1","1"] / sum(conf_matrix_test[,"1"])
# Recall: TP / (TP + FN), detected among actual positives (row "1")
recall_train <- conf_matrix_train["1","1"] / sum(conf_matrix_train["1",])
recall_test <- conf_matrix_test["1","1"] / sum(conf_matrix_test["1",])
# F1-Score
f1_train <- 2 * precision_train * recall_train / (precision_train + recall_train)
f1_test <- 2 * precision_test * recall_test / (precision_test + recall_test)
# Results
metrics <- data.frame(
Dataset = c("Train", "Test"),
Accuracy = c(
round(accuracy_train * 100, 2),
round(accuracy_test * 100, 2)
),
Precision = c(
round(precision_train * 100, 2),
round(precision_test * 100, 2)
),
Recall = c(
round(recall_train * 100, 2),
round(recall_test * 100, 2)
),
F1 = c(
round(f1_train, 4),
round(f1_test, 4)
)
)
metrics
##   Dataset Accuracy Precision Recall     F1
## 1   Train    87.04       100  70.90 0.8297
## 2    Test    87.62       100  72.72 0.8420
Accuracy: the model is correct in roughly 87% of cases, both in training (87.04%) and test (87.62%), indicating good generalization.
Precision: every session flagged as an intrusion really was one (100% in both sets), consistent with the zero false positives in the confusion matrix; at the 0.5 threshold the model raises no false alarms.
Recall: 70.90% (Train) and 72.72% (Test) of actual intrusions are detected, matching the 236 false negatives seen above; this is the metric to improve, since a missed intrusion is the costlier error here.
F1 Score: the balance between precision and recall remains high (0.83 in Train, 0.84 in Test), although the asymmetry between perfect precision and moderate recall again suggests lowering the classification threshold.
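The importance table below appears without the chunk that produced it; presumably it came from something like the following call (lgb.importance is the standard lightgbm helper, shown here as an assumption):
# Gain, cover, and frequency importance from the trained booster
lgb.importance(model)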
## Feature Gain Cover Frequency
## <char> <num> <num> <num>
## 1: failed_logins 0.4487938856 0.17914107 0.13855422
## 2: login_attempts 0.2985852159 0.23112251 0.20481928
## 3: ip_reputation_score 0.2144655762 0.17149014 0.15060241
## 4: browser_type_Unknown 0.0347169007 0.13298742 0.12048193
## 5: session_duration 0.0026356011 0.17891548 0.23493976
## 6: network_packet_size 0.0006810178 0.07864690 0.12048193
## 7: encryption_used_None 0.0001218027 0.02769648 0.03012048
The model achieved good performance, with an AUC of 0.888 on the test set.
The most important variables for detection were failed_logins, login_attempts, and ip_reputation_score.
At the default 0.5 threshold the model trades recall for precision: it raises no false alarms but misses roughly a quarter of intrusions, so tuning the threshold toward higher recall is the natural next step.
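As a closing sketch (a hypothetical deployment step, not part of the original analysis), the booster can be persisted and reloaded to score new sessions, provided they pass through the same preprocess_data() pipeline:
# Save and reload the trained booster
lgb.save(model, "lightgbm_intrusion_model.txt")
model_loaded <- lgb.load("lightgbm_intrusion_model.txt")
# new_sessions must carry the same raw columns as train_data (including an
# attack_detected placeholder, which preprocess_data() drops):
# scores <- predict(model_loaded, preprocess_data(new_sessions))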