1 Resumen Ejecutivo

1.1 Contexto del Problema

En Colombia, el acceso a medicamentos representa un desafío crítico de salud pública. El mismo medicamento con idéntico principio activo puede presentar variaciones de precio de hasta 500% dependiendo de:

La marca (genérico vs comercial)
El canal de distribución (comercial vs institucional)
El fabricante
La presentación farmacéutica

Este análisis utiliza datos oficiales del Termómetro de Precios de Medicamentos 2024 (Ministerio de Salud - datos.gov.co) para identificar patrones, factores determinantes y generar recomendaciones accionables.

1.2 Objetivos

Los objetivos de este análisis siguiendo la metodología CRISP-DM son:

Identificar factores determinantes que explican la variabilidad de precios
Descubrir patrones y segmentos de mercado mediante clustering
Construir modelos predictivos para estimación de precios (regresión y clasificación)
Generar recomendaciones para pacientes, instituciones y reguladores

1.3 Metodología CRISP-DM

1. Comprensión del Negocio: Problema de variabilidad de precios
2. Comprensión de los Datos: Exploración del Termómetro 2024
3. Preparación de Datos: Limpieza, transformación, feature engineering
4. Modelación: Random Forest, SVM, Regresión Logística, K-Means, PCA
5. Evaluación: Métricas de desempeño y validación
6. Implementación: Recomendaciones y próximos pasos

2 Preparación del Entorno

2.1 Instalación y Carga de Librerías

# Función para asegurar instalación de paquetes
ensure_packages <- function(pkgs) {
  to_install <- setdiff(pkgs, rownames(installed.packages()))
  if (length(to_install)) {
    cat("Instalando paquetes:", paste(to_install, collapse = ", "), "\n")
    install.packages(to_install, dependencies = TRUE)
  }
  invisible(lapply(pkgs, require, character.only = TRUE))
}

# Paquetes requeridos
paquetes <- c(
  "tidyverse",    # Manipulación de datos
  "caret",        # Machine Learning
  "randomForest", # Random Forest
  "e1071",        # SVM
  "ggplot2",      # Visualizaciones
  "corrplot",     # Matrices de correlación
  "cluster",      # Clustering
  "factoextra",   # Visualización de clusters
  "nnet",         # Regresión logística multinomial
  "ROCR",         # Curvas ROC
  "pROC",         # AUC
  "gridExtra",    # Múltiples gráficos
  "scales",       # Formato de números
  "stringr",      # Manipulación de strings
  "dplyr",        # Manipulación de datos
  "GGally",       # ggpairs
  "readr",        # Lectura de archivos
  "rmarkdown",    # R Markdown
  "knitr",        # Knit
  "DT"            # Tablas interactivas
)

ensure_packages(paquetes)

# Configuraciones globales
options(stringsAsFactors = FALSE)
theme_set(theme_minimal())
SEMILLA <- 123
set.seed(SEMILLA)

2.2 Funciones Auxiliares

# Función para eliminar variables con Near Zero Variance
drop_nzv_report <- function(df) {
  nzv <- caret::nearZeroVar(df, saveMetrics = TRUE)
  removed <- rownames(nzv)[nzv$zeroVar | nzv$nzv]
  keep    <- rownames(nzv)[!(nzv$zeroVar | nzv$nzv)]
  list(
    df = if (length(keep)) df[, keep, drop = FALSE] else df[, 0, drop = FALSE],
    removed = removed,
    metrics = nzv
  )
}

# Función para eliminar variables con desviación estándar = 0
drop_constant_sd0_report <- function(df_num) {
  ok <- vapply(df_num, function(x) sd(x, na.rm = TRUE) > 0, logical(1))
  removed <- names(df_num)[!ok]
  list(
    df = if (any(ok)) df_num[, ok, drop = FALSE] else df_num[, 0, drop = FALSE],
    removed = removed
  )
}

# Función para selección segura de columnas
safe_select <- function(.data, cols) {
  cols <- intersect(cols, names(.data))
  dplyr::select(.data, dplyr::all_of(cols))
}

# Funciones para detección de outliers
detect_outliers_iqr <- function(x) {
  qs <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[2] - qs[1]
  lower <- qs[1] - 1.5 * iqr
  upper <- qs[2] + 1.5 * iqr
  which(x < lower | x > upper)
}

detect_outliers_z <- function(x, thr = 3) {
  mu <- mean(x, na.rm = TRUE)
  sdv <- sd(x, na.rm = TRUE)
  if (is.na(sdv) || sdv == 0) return(integer(0))
  which(abs((x - mu) / sdv) > thr)
}

detect_outliers_mad <- function(x, thr = 3) {
  med <- median(x, na.rm = TRUE)
  madv <- mad(x, na.rm = TRUE)
  if (is.na(madv) || madv == 0) return(integer(0))
  which(abs((x - med) / madv) > thr)
}

3 Comprensión de los Datos

3.1 Carga del Dataset

# URL del dataset oficial
url <- "https://www.datos.gov.co/resource/n4dj-8r7k.csv?$limit=50000"
input_csv <- "archivo_entrada_.csv"

# Intentar cargar desde archivo local, sino desde URL
if (file.exists(input_csv)) {
  df <- readr::read_csv(input_csv, show_col_types = FALSE)
  cat("✓ Dataset cargado desde archivo local (archivo_entrada.csv)\n")
} else {
  df <- read.csv(url, stringsAsFactors = FALSE, encoding = "UTF-8")
  cat("✓ Dataset cargado desde datos.gov.co\n")
}

## ✓ Dataset cargado desde datos.gov.co

cat("\n**Dimensiones iniciales:**\n")

## 
## **Dimensiones iniciales:**

cat("- Registros:", scales::comma(nrow(df)), "\n")

## - Registros: 12,534

cat("- Variables:", ncol(df), "\n")

## - Variables: 12

cat("\n**Variables disponibles:**\n")

## 
## **Variables disponibles:**

cat(paste(names(df), collapse = ", "), "\n")

## expediente_invima, principio_activo, concentracion, unidad_base, unidad_de_dispensacion, nombre_comercial, fabricante, medicamento, canal, precio_por_tableta, factoresprecio, numerofactor

3.2 Estructura del Dataset

# Estructura de datos
str(df)

## 'data.frame':    12534 obs. of  12 variables:
##  $ expediente_invima     : int  103795 104739 104739 10815 111057 111057 113757 11415 11416 11416 ...
##  $ principio_activo      : chr  "Midazolam" "Acido Valproico" "Acido Valproico" "Fluoxetina" ...
##  $ concentracion         : chr  "Midazolam 15 mg" "Divalproato Sodico 500 mg" "Divalproato Sodico 500 mg" "Fluoxetina 20 mg" ...
##  $ unidad_base           : chr  "ml" "mg" "mg" "mg" ...
##  $ unidad_de_dispensacion: chr  "Ampolla" "Tableta" "Tableta" "Capsula" ...
##  $ nombre_comercial      : chr  "Dormicum" "Valcote" "Valcote" "Fluoxetina" ...
##  $ fabricante            : chr  "Siegfried" "Lafrancol" "Lafrancol" "Genfar" ...
##  $ medicamento           : chr  "Dormicum (Siegfried) - Ampolla 3 ml - Cada 3 ml contiene: Midazolam 15 mg" "Valcote (Lafrancol) - Cada Tableta contiene: Divalproato Sodico 500 mg" "Valcote (Lafrancol) - Cada Tableta contiene: Divalproato Sodico 500 mg" "Fluoxetina (Genfar) - Cada Capsula contiene: Fluoxetina 20 mg" ...
##  $ canal                 : chr  "Institucional" "Comercial" "Institucional" "Comercial" ...
##  $ precio_por_tableta    : num  11200 3753 1777 329 64185 ...
##  $ factoresprecio        : chr  "Alto" "Medio" "Medio" "Medio" ...
##  $ numerofactor          : int  3 2 2 2 2 2 2 1 3 1 ...

3.3 Primeras Filas

# Mostrar primeras 10 filas con DT para interactividad
if (requireNamespace("DT", quietly = TRUE)) {
  DT::datatable(
    head(df, 10),
    options = list(
      scrollX = TRUE,
      pageLength = 10
    ),
    caption = "Primeras 10 filas del dataset"
  )
} else {
  knitr::kable(head(df, 10), caption = "Primeras 10 filas del dataset")
}

3.4 Resumen Estadístico

summary(df)

##  expediente_invima  principio_activo   concentracion      unidad_base       
##  Min.   :    3521   Length:12534       Length:12534       Length:12534      
##  1st Qu.:19935780   Class :character   Class :character   Class :character  
##  Median :20004997   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :16656166                                                           
##  3rd Qu.:20082398                                                           
##  Max.   :20235810                                                           
##  unidad_de_dispensacion nombre_comercial    fabricante       
##  Length:12534           Length:12534       Length:12534      
##  Class :character       Class :character   Class :character  
##  Mode  :character       Mode  :character   Mode  :character  
##                                                              
##                                                              
##                                                              
##  medicamento           canal           precio_por_tableta  factoresprecio    
##  Length:12534       Length:12534       Min.   :        0   Length:12534      
##  Class :character   Class :character   1st Qu.:     1018   Class :character  
##  Mode  :character   Mode  :character   Median :     3891   Mode  :character  
##                                        Mean   :   155408                     
##                                        3rd Qu.:    18072                     
##                                        Max.   :257103178                     
##   numerofactor  
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :2.001  
##  3rd Qu.:2.000  
##  Max.   :3.000

3.5 Valores Nulos

# Análisis de valores nulos
nulos <- data.frame(
  Variable = names(df),
  Nulos = colSums(is.na(df)),
  Porcentaje = round(colSums(is.na(df)) / nrow(df) * 100, 2)
) %>%
  arrange(desc(Porcentaje))

knitr::kable(
  nulos,
  caption = "Análisis de Valores Nulos por Variable",
  col.names = c("Variable", "Cantidad de Nulos", "% Nulos")
)

Análisis de Valores Nulos por Variable
	Variable	Cantidad de Nulos	% Nulos
expediente_invima	expediente_invima	0	0
principio_activo	principio_activo	0	0
concentracion	concentracion	0	0
unidad_base	unidad_base	0	0
unidad_de_dispensacion	unidad_de_dispensacion	0	0
nombre_comercial	nombre_comercial	0	0
fabricante	fabricante	0	0
medicamento	medicamento	0	0
canal	canal	0	0
precio_por_tableta	precio_por_tableta	0	0
factoresprecio	factoresprecio	0	0
numerofactor	numerofactor	0	0

# Visualización
if (any(nulos$Nulos > 0)) {
  ggplot(nulos %>% filter(Nulos > 0), aes(x = reorder(Variable, Porcentaje), y = Porcentaje)) +
    geom_bar(stat = "identity", fill = "coral", width = 0.7) +
    geom_text(aes(label = paste0(Porcentaje, "%")), hjust = -0.1, size = 3.5) +
    coord_flip() +
    labs(
      title = "Porcentaje de Valores Nulos por Variable",
      x = "Variable",
      y = "% de Valores Nulos"
    ) +
    theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))
}

4 Preparación de los Datos

4.1 Limpieza Inicial

cat("### PASO 1: LIMPIEZA DE DATOS ###\n\n")

## ### PASO 1: LIMPIEZA DE DATOS ###

# Registros originales
df_original <- nrow(df)

# Eliminar duplicados
df <- df %>% distinct()
cat("✓ Duplicados eliminados:", df_original - nrow(df), "\n")

## ✓ Duplicados eliminados: 0

# Normalizar nombres de columnas
names(df) <- tolower(names(df))
names(df) <- gsub(" ", "_", names(df))
cat("✓ Nombres de columnas normalizados (lowercase, sin espacios)\n")

## ✓ Nombres de columnas normalizados (lowercase, sin espacios)

# Convertir precio a numérico
if ("precio_por_tableta" %in% names(df)) {
  df$precio_por_tableta <- suppressWarnings(as.numeric(df$precio_por_tableta))
  cat("✓ Variable 'precio_por_tableta' convertida a numérica\n")
}

## ✓ Variable 'precio_por_tableta' convertida a numérica

# Filtrar registros válidos
df_clean <- df %>%
  filter(!is.na(precio_por_tableta) & precio_por_tableta > 0) %>%
  filter(!is.na(principio_activo) & principio_activo != "") %>%
  filter(!is.na(canal) & canal != "")

cat("\n**Resultados de la limpieza:**\n")

## 
## **Resultados de la limpieza:**

cat("- Registros originales:", scales::comma(nrow(df)), "\n")

## - Registros originales: 12,534

cat("- Registros válidos:", scales::comma(nrow(df_clean)), "\n")

## - Registros válidos: 12,534

cat("- Registros eliminados:", scales::comma(nrow(df) - nrow(df_clean)), "\n")

## - Registros eliminados: 0

cat("- % de datos retenidos:", round(nrow(df_clean) / nrow(df) * 100, 1), "%\n")

## - % de datos retenidos: 100 %

df <- df_clean

4.2 Feature Engineering

cat("### PASO 2: CREACIÓN Y TRANSFORMACIÓN DE VARIABLES ###\n\n")

## ### PASO 2: CREACIÓN Y TRANSFORMACIÓN DE VARIABLES ###

df <- df %>%
  mutate(
    # Normalizar texto
    principio_activo = toupper(str_trim(principio_activo)),
    fabricante = if ("fabricante" %in% names(df)) toupper(str_trim(fabricante)) else NA_character_,
    canal = toupper(str_trim(canal)),
    unidad_de_dispensacion = if ("unidad_de_dispensacion" %in% names(df)) 
      toupper(str_trim(unidad_de_dispensacion)) else NA_character_,
    
    # Extraer concentración numérica
    concentracion_numerica = if ("concentracion" %in% names(df)) 
      suppressWarnings(as.numeric(str_extract(concentracion, "[0-9.]+"))) else NA_real_,
    
    # Identificar genéricos
    es_generico = case_when(
      grepl("GENERICO|GENERICA|GENFAR|MK|TECNOQUIMICAS",
            toupper(ifelse("nombre_comercial" %in% names(df), nombre_comercial, ""))) ~ 1,
      TRUE ~ 0
    ),
    
    # Categorizar precios en 3 segmentos
    categoria_precio = case_when(
      precio_por_tableta < 500 ~ "Bajo",
      precio_por_tableta >= 500 & precio_por_tableta < 2000 ~ "Medio",
      precio_por_tableta >= 2000 ~ "Alto"
    ),
    categoria_precio = factor(categoria_precio, levels = c("Bajo", "Medio", "Alto")),
    
    # Transformación logarítmica
    log_precio = log(precio_por_tableta + 1),
    
    # Canal comercial
    es_comercial = ifelse(canal == "COMERCIAL", 1, 0),
    
    # Variables de forma farmacéutica (one-hot encoding)
    forma_tableta = if ("unidad_de_dispensacion" %in% names(df)) 
      ifelse(grepl("TABLETA|TAB", unidad_de_dispensacion), 1, 0) else 0,
    forma_capsula = if ("unidad_de_dispensacion" %in% names(df)) 
      ifelse(grepl("CAPSULA|CAPS", unidad_de_dispensacion), 1, 0) else 0,
    forma_suspension = if ("unidad_de_dispensacion" %in% names(df)) 
      ifelse(grepl("SUSPENSION|SUSP", unidad_de_dispensacion), 1, 0) else 0,
    forma_inyectable = if ("unidad_de_dispensacion" %in% names(df)) 
      ifelse(grepl("INYECTABLE|INY|AMPOLLA", unidad_de_dispensacion), 1, 0) else 0,
    forma_solucion = if ("unidad_de_dispensacion" %in% names(df)) 
      ifelse(grepl("SOLUCION|SOL", unidad_de_dispensacion), 1, 0) else 0,
    
    # Tipo de fabricante
    tipo_fabricante = if ("fabricante" %in% names(df)) case_when(
      grepl("GENFAR|MK|TECNOQUIMICAS|LAFRANCOL|PROCAPS", fabricante) ~ "Nacional",
      grepl("BAYER|PFIZER|NOVARTIS|ROCHE|GSK|MERCK|SANOFI", fabricante) ~ "Multinacional",
      TRUE ~ "Otro"
    ) else NA_character_,
    
    # Número de factor (si existe la columna)
    numerofactor = if ("numerofactor" %in% names(df)) numerofactor else NA_real_
  )

# Resumen de nuevas variables
nuevas_vars <- c("es_generico", "categoria_precio", "log_precio", "es_comercial",
                 "forma_tableta", "forma_capsula", "forma_suspension", 
                 "forma_inyectable", "forma_solucion", "tipo_fabricante",
                 "concentracion_numerica")

cat("✓ Variables creadas:\n")

## ✓ Variables creadas:

for (var in nuevas_vars) {
  if (var %in% names(df)) {
    cat("  -", var, "\n")
  }
}

##   - es_generico 
##   - categoria_precio 
##   - log_precio 
##   - es_comercial 
##   - forma_tableta 
##   - forma_capsula 
##   - forma_suspension 
##   - forma_inyectable 
##   - forma_solucion 
##   - tipo_fabricante 
##   - concentracion_numerica

cat("\n**Distribución de categorías de precio:**\n")

## 
## **Distribución de categorías de precio:**

print(table(df$categoria_precio))

## 
##  Bajo Medio  Alto 
##  1892  2740  7902

cat("\n**Distribución genérico vs comercial:**\n")

## 
## **Distribución genérico vs comercial:**

print(table(df$es_generico))

## 
##     0 
## 12534

5 Análisis Exploratorio de Datos (EDA)

5.1 Distribución de Precios

p1 <- ggplot(df, aes(x = precio_por_tableta)) +
  geom_histogram(bins = 100, fill = "steelblue", alpha = 0.7, color = "white") +
  scale_x_log10(labels = scales::comma) +
  labs(
    title = "Distribución de Precios por Tableta (Escala Logarítmica)",
    subtitle = paste0("n = ", scales::comma(nrow(df)), " medicamentos analizados"),
    x = "Precio por Tableta (COP)",
    y = "Frecuencia"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12)
  )

print(p1)

ggsave("01_distribucion_precios.png", p1, width = 12, height = 7, dpi = 300)

5.1.1 Interpretación

La distribución de precios muestra:

Asimetría positiva: La mayoría de medicamentos se concentra en precios bajos
Escala logarítmica necesaria: La variabilidad es muy amplia (varios órdenes de magnitud)
Múltiples modas: Sugiere la existencia de segmentos de mercado diferenciados

5.2 Precios por Categoría

p2 <- ggplot(df, aes(x = categoria_precio, y = precio_por_tableta, fill = categoria_precio)) +
  geom_boxplot(outlier.alpha = 0.3, outlier.size = 1) +
  scale_y_log10(labels = scales::comma) +
  scale_fill_manual(
    values = c("Bajo" = "#38ef7d", "Medio" = "#f5576c", "Alto" = "#ff6a00"),
    name = "Categoría"
  ) +
  labs(
    title = "Distribución de Precios por Categoría",
    subtitle = "Zona Verde (< $500) | Zona Amarilla ($500-$2,000) | Zona Roja (> $2,000)",
    x = "Categoría de Precio",
    y = "Precio por Tableta (COP) - Escala Log"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 11),
    legend.position = "none"
  )

print(p2)

ggsave("02_precio_por_categoria.png", p2, width = 10, height = 7, dpi = 300)

5.2.1 Estadísticas por Categoría

stats_cat <- df %>%
  group_by(categoria_precio) %>%
  summarise(
    n = n(),
    precio_promedio = mean(precio_por_tableta, na.rm = TRUE),
    precio_mediano = median(precio_por_tableta, na.rm = TRUE),
    precio_min = min(precio_por_tableta, na.rm = TRUE),
    precio_max = max(precio_por_tableta, na.rm = TRUE),
    desv_std = sd(precio_por_tableta, na.rm = TRUE)
  ) %>%
  mutate(
    porcentaje = round(n / sum(n) * 100, 1)
  )

knitr::kable(
  stats_cat,
  caption = "Estadísticas Descriptivas por Categoría de Precio",
  col.names = c("Categoría", "N", "Promedio", "Mediana", "Mínimo", "Máximo", "Desv. Std", "% Total"),
  digits = 0,
  format.args = list(big.mark = ",")
)

Estadísticas Descriptivas por Categoría de Precio
Categoría	N	Promedio	Mediana	Mínimo	Máximo	Desv. Std	% Total
Bajo	1,892	233	229	0	500	142	15
Medio	2,740	1,150	1,100	500	2,000	431	22
Alto	7,902	246,051	11,418	2,000	257,103,178	3,275,007	63

5.3 Precios por Canal

if ("canal" %in% names(df)) {
  precio_canal <- df %>%
    group_by(canal) %>%
    summarise(
      precio_promedio = mean(precio_por_tableta, na.rm = TRUE),
      precio_mediano = median(precio_por_tableta, na.rm = TRUE),
      n = n()
    ) %>%
    arrange(desc(precio_promedio))
  
  p3 <- ggplot(precio_canal, aes(x = reorder(canal, precio_promedio), y = precio_promedio, fill = canal)) +
    geom_bar(stat = "identity", width = 0.6) +
    geom_text(
      aes(label = paste0("$", scales::comma(round(precio_promedio)))),
      vjust = -0.5,
      size = 6,
      fontface = "bold"
    ) +
    scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.15))) +
    scale_fill_brewer(palette = "Set2") +
    labs(
      title = "Precio Promedio por Canal de Distribución",
      subtitle = "Comparación: Comercial vs Institucional",
      x = "Canal de Distribución",
      y = "Precio Promedio por Tableta (COP)"
    ) +
    theme(
      legend.position = "none",
      plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
      plot.subtitle = element_text(hjust = 0.5, size = 12)
    )
  
  print(p3)
  ggsave("03_precio_por_canal.png", p3, width = 10, height = 7, dpi = 300)
  
  # Tabla resumen
  knitr::kable(
    precio_canal,
    caption = "Resumen de Precios por Canal",
    col.names = c("Canal", "Precio Promedio", "Precio Mediano", "N° Medicamentos"),
    digits = 0,
    format.args = list(big.mark = ",")
  )
}

Resumen de Precios por Canal
Canal	Precio Promedio	Precio Mediano	N° Medicamentos
INSTITUCIONAL	283,397	3,129	5,752
COMERCIAL	46,858	4,421	6,782

5.4 Top 15 Principios Activos Más Caros

if ("principio_activo" %in% names(df)) {
  top_caros <- df %>%
    group_by(principio_activo) %>%
    summarise(
      precio_promedio = mean(precio_por_tableta, na.rm = TRUE),
      n = n()
    ) %>%
    filter(n >= 10) %>%
    arrange(desc(precio_promedio)) %>%
    head(15)
  
  p4 <- ggplot(top_caros, aes(x = reorder(principio_activo, precio_promedio), y = precio_promedio)) +
    geom_bar(stat = "identity", fill = "orangered3", width = 0.7) +
    geom_text(
      aes(label = paste0("$", scales::comma(round(precio_promedio)))),
      hjust = -0.1,
      size = 3.5,
      fontface = "bold"
    ) +
    coord_flip() +
    labs(
      title = "Top 15 Principios Activos Más Costosos",
      subtitle = "Precio promedio por tableta (mínimo 10 registros por principio activo)",
      x = "",
      y = "Precio Promedio (COP)"
    ) +
    scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.15))) +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
      plot.subtitle = element_text(hjust = 0.5, size = 11)
    )
  
  print(p4)
  ggsave("04_top15_principios_caros.png", p4, width = 12, height = 8, dpi = 300)
}

5.5 Top 15 Fabricantes Más Caros

if ("fabricante" %in% names(df)) {
  top_fabricantes <- df %>%
    group_by(fabricante) %>%
    summarise(
      precio_promedio = mean(precio_por_tableta, na.rm = TRUE),
      n = n()
    ) %>%
    filter(n >= 20) %>%
    arrange(desc(precio_promedio)) %>%
    head(15)
  
  p5 <- ggplot(top_fabricantes, aes(x = reorder(fabricante, precio_promedio), y = precio_promedio)) +
    geom_bar(stat = "identity", fill = "purple4", width = 0.7) +
    geom_text(
      aes(label = paste0("$", scales::comma(round(precio_promedio)))),
      hjust = -0.1,
      size = 3.5,
      fontface = "bold"
    ) +
    coord_flip() +
    labs(
      title = "Top 15 Fabricantes con Precios Más Altos",
      subtitle = "Precio promedio por tableta (mínimo 20 medicamentos por fabricante)",
      x = "",
      y = "Precio Promedio (COP)"
    ) +
    scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.15))) +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
      plot.subtitle = element_text(hjust = 0.5, size = 11)
    )
  
  print(p5)
  ggsave("05_top15_fabricantes_caros.png", p5, width = 12, height = 8, dpi = 300)
}

5.6 Precios por Forma de Dispensación

if ("unidad_de_dispensacion" %in% names(df)) {
  precio_forma <- df %>%
    group_by(unidad_de_dispensacion) %>%
    summarise(
      precio_promedio = mean(precio_por_tableta, na.rm = TRUE),
      n = n()
    ) %>%
    filter(n >= 50) %>%
    arrange(desc(precio_promedio)) %>%
    head(10)
  
  p6 <- ggplot(precio_forma, aes(x = reorder(unidad_de_dispensacion, precio_promedio), y = precio_promedio)) +
    geom_bar(stat = "identity", fill = "darkgreen", width = 0.7) +
    geom_text(
      aes(label = paste0("$", scales::comma(round(precio_promedio)))),
      hjust = -0.1,
      size = 3.5,
      fontface = "bold"
    ) +
    coord_flip() +
    labs(
      title = "Precio Promedio por Forma de Dispensación",
      subtitle = "Top 10 formas más comunes (mínimo 50 registros)",
      x = "",
      y = "Precio Promedio (COP)"
    ) +
    scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.15))) +
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
      plot.subtitle = element_text(hjust = 0.5, size = 11)
    )
  
  print(p6)
  ggsave("06_precio_por_forma.png", p6, width = 12, height = 7, dpi = 300)
}

6 Análisis de Correlaciones

6.1 Matriz de Correlación (Pearson)

# Seleccionar variables numéricas
numeric_vars <- df %>% select(where(is.numeric))

# Eliminar variables con desviación estándar = 0
cor_sd0 <- drop_constant_sd0_report(numeric_vars)
numeric_vars <- cor_sd0$df

if (length(cor_sd0$removed) > 0) {
  cat("Variables removidas por sd=0:", paste(cor_sd0$removed, collapse = ", "), "\n")
}

## Variables removidas por sd=0: es_generico, forma_suspension

if (ncol(numeric_vars) >= 2) {
  # Calcular correlación de Spearman
  cor_spearman <- cor(numeric_vars, use = "complete.obs", method = "spearman")
  
  # Visualización
  corrplot(
    cor_spearman,
    method = "color",
    type = "upper",
    tl.cex = 0.8,
    tl.col = "black",
    title = "Matriz de Correlación (Spearman)",
    mar = c(0, 0, 2, 0),
    addCoef.col = "black",
    number.cex = 0.7
  )
  
  # Guardar matriz
  write.csv(cor_spearman, "correlacion_spearman.csv", row.names = TRUE)
}

6.2 Correlaciones con Precio

if (ncol(numeric_vars) >= 2 && "precio_por_tableta" %in% colnames(cor_spearman)) {
  # Extraer correlaciones con precio
  cor_target <- cor_spearman[, "precio_por_tableta"]
  cor_target <- sort(
    cor_target[!is.na(cor_target) & names(cor_target) != "precio_por_tableta"],
    decreasing = TRUE
  )
  
  # Crear dataframe
  cor_target_df <- data.frame(
    Variable = names(cor_target),
    Correlacion_Spearman = cor_target,
    Abs_Correlacion = abs(cor_target)
  ) %>%
    arrange(desc(Abs_Correlacion))
  
  # Guardar
  write.csv(cor_target_df, "correlacion_con_precio_spearman.csv", row.names = FALSE)
  
  # Mostrar tabla
  knitr::kable(
    head(cor_target_df, 15),
    caption = "Top 15 Variables con Mayor Correlación con Precio (Spearman)",
    digits = 3,
    row.names = FALSE
  )
  
  # Visualización
  ggplot(head(cor_target_df, 10), aes(x = reorder(Variable, Abs_Correlacion), y = Correlacion_Spearman)) +
    geom_bar(stat = "identity", aes(fill = Correlacion_Spearman > 0), width = 0.7) +
    scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "coral"), guide = "none") +
    coord_flip() +
    labs(
      title = "Top 10 Variables con Mayor Correlación con Precio",
      subtitle = "Correlación de Spearman",
      x = "Variable",
      y = "Correlación (ρ)"
    ) +
    theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))
}

6.3 Matriz de Dispersión (ggpairs)

if (ncol(numeric_vars) >= 2) {
  # Seleccionar top 6 variables por varianza
  var_order <- numeric_vars %>%
    summarise(across(everything(), ~ var(.x, na.rm = TRUE))) %>%
    pivot_longer(everything()) %>%
    arrange(desc(value))
  
  top6 <- numeric_vars %>% select(head(var_order$name, 6))
  
  # ggpairs
  p_pairs <- GGally::ggpairs(
    top6,
    title = "Matriz de Dispersión - Top 6 Variables por Varianza"
  )
  
  print(p_pairs)
  ggsave("18_ggpairs_top6.png", p_pairs, width = 12, height = 10, dpi = 300)
}

7 Detección de Outliers

7.1 Resumen de Outliers

# Detectar outliers con 3 métodos
out_summary <- list()
num_for_out <- df %>% select(where(is.numeric))

for (col in names(num_for_out)) {
  idx_iqr <- detect_outliers_iqr(num_for_out[[col]])
  idx_z <- detect_outliers_z(num_for_out[[col]])
  idx_mad <- detect_outliers_mad(num_for_out[[col]])
  
  out_summary[[col]] <- data.frame(
    metodo = c("IQR", "Z-Score", "MAD"),
    cuenta = c(length(idx_iqr), length(idx_z), length(idx_mad))
  )
}

# Consolidar en tabla
out_table <- bind_rows(lapply(names(out_summary), function(nm) {
  tibble(
    variable = nm,
    metodo = out_summary[[nm]]$metodo,
    cuenta = out_summary[[nm]]$cuenta
  )
}))

# Guardar
write.csv(out_table, "outliers_resumen.csv", row.names = FALSE)

# Visualizar
knitr::kable(
  out_table %>% pivot_wider(names_from = metodo, values_from = cuenta),
  caption = "Cantidad de Outliers Detectados por Método y Variable"
)

Cantidad de Outliers Detectados por Método y Variable
variable	IQR	Z-Score	MAD
expediente_invima	2132	0	2132
precio_por_tableta	2000	38	3016
numerofactor	5448	0	0
concentracion_numerica	1779	22	3181
es_generico	0	0	0
log_precio	294	96	188
es_comercial	0	0	0
forma_tableta	0	0	0
forma_capsula	1252	1252	0
forma_suspension	0	0	0
forma_inyectable	657	657	0
forma_solucion	37	37	0

7.2 Outliers de Precio por Canal

if ("canal" %in% names(df)) {
  # Identificar outliers con IQR
  df_outliers <- df %>%
    mutate(is_outlier = row_number() %in% detect_outliers_iqr(precio_por_tableta))
  
  p_out <- ggplot(df_outliers, aes(x = canal, y = precio_por_tableta, color = canal)) +
    geom_boxplot(outlier.shape = NA, alpha = 0.6) +
    geom_point(
      data = df_outliers %>% filter(is_outlier),
      aes(x = canal, y = precio_por_tableta),
      position = position_jitter(width = 0.2),
      alpha = 0.4,
      size = 1.5
    ) +
    scale_y_log10(labels = scales::comma) +
    scale_color_brewer(palette = "Set1") +
    labs(
      title = "Outliers de Precio por Canal (Método IQR)",
      subtitle = "Puntos destacados = outliers detectados",
      y = "Precio por Tableta (COP) - Escala Log",
      x = "Canal de Distribución"
    ) +
    theme(
      legend.position = "none",
      plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
    )
  
  print(p_out)
  ggsave("19_outliers_por_canal.png", p_out, width = 10, height = 7, dpi = 300)
}

8 Modelado: Clasificación

8.1 Preparación de Datos

cat("### PREPARACIÓN PARA MODELADO ###\n\n")

## ### PREPARACIÓN PARA MODELADO ###

# Seleccionar variables para modelado
df_modelo <- df %>%
  safe_select(c(
    "precio_por_tableta",
    "concentracion_numerica",
    "es_generico",
    "es_comercial",
    "forma_tableta",
    "forma_capsula",
    "forma_inyectable",
    "forma_suspension",
    "categoria_precio",
    "numerofactor"
  )) %>%
  drop_na()

cat("✓ Dataset para modelado:", scales::comma(nrow(df_modelo)), "registros\n")

## ✓ Dataset para modelado: 12,170 registros

# Partición train/test (80/20)
set.seed(SEMILLA)
trainIndex <- if (nrow(df_modelo) > 0) {
  createDataPartition(df_modelo$categoria_precio, p = 0.8, list = FALSE)
} else {
  integer(0)
}

train_data <- if (length(trainIndex)) df_modelo[trainIndex, ] else df_modelo
test_data <- if (length(trainIndex)) df_modelo[-trainIndex, ] else df_modelo[0, ]

# Separar predictores y variable objetivo
X_train <- train_data %>% select(-categoria_precio, -precio_por_tableta)
y_train <- train_data$categoria_precio
X_test <- test_data %>% select(-categoria_precio, -precio_por_tableta)
y_test <- test_data$categoria_precio

cat("\n**Partición de datos:**\n")

## 
## **Partición de datos:**

cat("- Train set:", scales::comma(nrow(train_data)), "registros\n")

## - Train set: 9,737 registros

cat("- Test set:", scales::comma(nrow(test_data)), "registros\n")

## - Test set: 2,433 registros

cat("\n**Distribución de la variable objetivo (train):**\n")

## 
## **Distribución de la variable objetivo (train):**

print(table(y_train))

## y_train
##  Bajo Medio  Alto 
##  1488  2149  6100

8.2 Random Forest

cat("\n### MODELO 1: RANDOM FOREST ###\n\n")

## 
## ### MODELO 1: RANDOM FOREST ###

if (nrow(train_data) > 0 && ncol(X_train) > 0) {
  # Filtrar near-zero variance
  nzv_res <- drop_nzv_report(X_train)
  X_train_rf <- nzv_res$df
  X_test_rf <- if (ncol(X_train_rf) > 0) {
    X_test[, colnames(X_train_rf), drop = FALSE]
  } else {
    X_test[, 0, drop = FALSE]
  }
  
  if (length(nzv_res$removed) > 0) {
    cat("Variables removidas por NZV:", paste(nzv_res$removed, collapse = ", "), "\n\n")
  }
  
  # Entrenar Random Forest
  cat("Entrenando Random Forest...\n")
  rf_model <- randomForest(
    categoria_precio ~ .,
    data = cbind(X_train_rf, categoria_precio = y_train),
    ntree = 500,
    mtry = min(3, max(1, ncol(X_train_rf))),
    importance = TRUE
  )
  
  # Predicciones
  rf_pred <- if (nrow(X_test_rf) > 0) {
    predict(rf_model, X_test_rf)
  } else {
    factor(character(0), levels = levels(y_train))
  }
  
  # Confusion Matrix
  if (length(rf_pred) > 0) {
    rf_cm <- confusionMatrix(rf_pred, y_test)
    print(rf_cm)
    
    cat("\n**Métricas principales:**\n")
    cat("- Accuracy:", round(rf_cm$overall['Accuracy'], 4), "\n")
    cat("- Kappa:", round(rf_cm$overall['Kappa'], 4), "\n")
  }
  
  # Importancia de variables
  importance_df <- as.data.frame(importance(rf_model))
  importance_df$Variable <- rownames(importance_df)
  importance_df <- importance_df %>% arrange(desc(MeanDecreaseGini))
  
  # Visualizar importancia
  p_importance <- ggplot(
    head(importance_df, 10),
    aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini)
  ) +
    geom_bar(stat = "identity", fill = "forestgreen", width = 0.7) +
    coord_flip() +
    labs(
      title = "Importancia de Variables - Random Forest",
      subtitle = "Top 10 variables más influyentes (Mean Decrease Gini)",
      x = "Variable",
      y = "Mean Decrease Gini"
    ) +
    theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))
  
  print(p_importance)
  ggsave("07_importancia_variables.png", p_importance, width = 12, height = 7, dpi = 300)
  
  # Guardar tabla
  write.csv(importance_df, "importancia_variables.csv", row.names = FALSE)
  
  knitr::kable(
    head(importance_df, 10),
    caption = "Top 10 Variables por Importancia (Random Forest)",
    digits = 3
  )
}

## Variables removidas por NZV: es_generico, forma_suspension 
## 
## Entrenando Random Forest...
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bajo Medio Alto
##      Bajo   143    93   76
##      Medio  113   153  127
##      Alto   116   291 1321
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6646          
##                  95% CI : (0.6455, 0.6834)
##     No Information Rate : 0.6264          
##     P-Value [Acc > NIR] : 4.738e-05       
##                                           
##                   Kappa : 0.329           
##                                           
##  Mcnemar's Test P-Value : 4.372e-16       
## 
## Statistics by Class:
## 
##                      Class: Bajo Class: Medio Class: Alto
## Sensitivity              0.38441      0.28492      0.8668
## Specificity              0.91800      0.87342      0.5523
## Pos Pred Value           0.45833      0.38931      0.7645
## Neg Pred Value           0.89203      0.81176      0.7121
## Prevalence               0.15290      0.22072      0.6264
## Detection Rate           0.05878      0.06289      0.5430
## Detection Prevalence     0.12824      0.16153      0.7102
## Balanced Accuracy        0.65120      0.57917      0.7095
## 
## **Métricas principales:**
## - Accuracy: 0.6646 
## - Kappa: 0.329

Top 10 Variables por Importancia (Random Forest)
	Bajo	Medio	Alto	MeanDecreaseAccuracy	MeanDecreaseGini	Variable
forma_tableta	63.668	50.684	61.203	76.084	682.802	forma_tableta
forma_capsula	54.826	59.030	12.461	69.166	227.265	forma_capsula
concentracion_numerica	19.493	19.156	25.494	45.266	209.470	concentracion_numerica
numerofactor	49.968	20.653	64.630	73.311	189.556	numerofactor
es_comercial	36.015	21.677	39.376	47.559	84.831	es_comercial
forma_inyectable	45.113	29.129	-2.842	39.902	80.159	forma_inyectable

8.3 Regresión Logística Multinomial

cat("\n### MODELO 2: REGRESIÓN LOGÍSTICA MULTINOMIAL ###\n\n")

## 
## ### MODELO 2: REGRESIÓN LOGÍSTICA MULTINOMIAL ###

if (nrow(train_data) > 0 && ncol(X_train_rf) > 0) {
  cat("Entrenando Regresión Logística...\n")
  logit_model <- multinom(
    categoria_precio ~ .,
    data = cbind(X_train_rf, categoria_precio = y_train),
    maxit = 200,
    trace = FALSE
  )
  
  # Predicciones
  logit_pred <- if (nrow(X_test_rf) > 0) {
    predict(logit_model, X_test_rf)
  } else {
    factor(character(0), levels = levels(y_train))
  }
  
  # Confusion Matrix
  if (length(logit_pred) > 0) {
    logit_cm <- confusionMatrix(logit_pred, y_test)
    print(logit_cm)
    
    cat("\n**Métricas principales:**\n")
    cat("- Accuracy:", round(logit_cm$overall['Accuracy'], 4), "\n")
    cat("- Kappa:", round(logit_cm$overall['Kappa'], 4), "\n")
  }
}

## Entrenando Regresión Logística...
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bajo Medio Alto
##      Bajo    79    33   25
##      Medio  165   160  149
##      Alto   128   344 1350
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6531         
##                  95% CI : (0.6338, 0.672)
##     No Information Rate : 0.6264         
##     P-Value [Acc > NIR] : 0.003314       
##                                          
##                   Kappa : 0.2763         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: Bajo Class: Medio Class: Alto
## Sensitivity              0.21237      0.29795      0.8858
## Specificity              0.97186      0.83439      0.4807
## Pos Pred Value           0.57664      0.33755      0.7409
## Neg Pred Value           0.87239      0.80755      0.7152
## Prevalence               0.15290      0.22072      0.6264
## Detection Rate           0.03247      0.06576      0.5549
## Detection Prevalence     0.05631      0.19482      0.7489
## Balanced Accuracy        0.59211      0.56617      0.6833
## 
## **Métricas principales:**
## - Accuracy: 0.6531 
## - Kappa: 0.2763

8.4 SVM (Support Vector Machine)

cat("\n### MODELO 3: SVM (Support Vector Machine) ###\n\n")

## 
## ### MODELO 3: SVM (Support Vector Machine) ###

if (nrow(train_data) > 0 && ncol(X_train_rf) > 0) {
  # Escalar datos
  preProc <- preProcess(X_train_rf, method = c("center", "scale"))
  X_train_sc <- predict(preProc, X_train_rf)
  X_test_sc <- if (nrow(X_test_rf) > 0) {
    predict(preProc, X_test_rf)
  } else {
    X_test_rf
  }
  
  cat("Entrenando SVM...\n")
  svm_model <- e1071::svm(
    x = as.data.frame(X_train_sc),
    y = y_train,
    kernel = "radial",
    cost = 10
  )
  
  # Predicciones
  svm_pred <- if (nrow(X_test_sc) > 0) {
    predict(svm_model, X_test_sc)
  } else {
    factor(character(0), levels = levels(y_train))
  }
  
  # Confusion Matrix
  if (length(svm_pred) > 0) {
    svm_cm <- confusionMatrix(svm_pred, y_test)
    print(svm_cm)
    
    cat("\n**Métricas principales:**\n")
    cat("- Accuracy:", round(svm_cm$overall['Accuracy'], 4), "\n")
    cat("- Kappa:", round(svm_cm$overall['Kappa'], 4), "\n")
  }
}

## Entrenando SVM...
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bajo Medio Alto
##      Bajo   163   122  105
##      Medio   80   103   87
##      Alto   129   312 1332
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6568          
##                  95% CI : (0.6376, 0.6757)
##     No Information Rate : 0.6264          
##     P-Value [Acc > NIR] : 0.0009795       
##                                           
##                   Kappa : 0.306           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: Bajo Class: Medio Class: Alto
## Sensitivity               0.4382      0.19181      0.8740
## Specificity               0.8899      0.91192      0.5149
## Pos Pred Value            0.4179      0.38148      0.7513
## Neg Pred Value            0.8977      0.79935      0.7091
## Prevalence                0.1529      0.22072      0.6264
## Detection Rate            0.0670      0.04233      0.5475
## Detection Prevalence      0.1603      0.11097      0.7287
## Balanced Accuracy         0.6640      0.55186      0.6944
## 
## **Métricas principales:**
## - Accuracy: 0.6568 
## - Kappa: 0.306

8.5 Comparación de Modelos

# Crear dataframe de comparación
if (exists("rf_cm") && exists("logit_cm") && exists("svm_cm")) {
  modelos_comp <- data.frame(
    Modelo = c("Random Forest", "Regresión Logística", "SVM"),
    Accuracy = c(
      rf_cm$overall['Accuracy'],
      logit_cm$overall['Accuracy'],
      svm_cm$overall['Accuracy']
    ),
    Kappa = c(
      rf_cm$overall['Kappa'],
      logit_cm$overall['Kappa'],
      svm_cm$overall['Kappa']
    )
  )
  
  # Visualización
  p_comparacion <- ggplot(modelos_comp, aes(x = Modelo, y = Accuracy, fill = Modelo)) +
    geom_bar(stat = "identity", width = 0.6) +
    geom_text(
      aes(label = sprintf("%.3f", Accuracy)),
      vjust = -0.5,
      size = 6,
      fontface = "bold"
    ) +
    ylim(0, 1) +
    scale_fill_brewer(palette = "Set2") +
    labs(
      title = "Comparación de Accuracy entre Modelos de Clasificación",
      subtitle = "Predicción de Categoría de Precio (Bajo/Medio/Alto)",
      y = "Accuracy"
    ) +
    theme(
      legend.position = "none",
      plot.title = element_text(hjust = 0.5, face = "bold", size = 16)
    )
  
  print(p_comparacion)
  ggsave("08_comparacion_modelos.png", p_comparacion, width = 10, height = 7, dpi = 300)
  
  # Tabla
  write.csv(modelos_comp, "comparacion_modelos.csv", row.names = FALSE)
  
  knitr::kable(
    modelos_comp,
    caption = "Comparación de Desempeño de Modelos de Clasificación",
    digits = 4
  )
}

Comparación de Desempeño de Modelos de Clasificación
Modelo	Accuracy	Kappa
Random Forest	0.6646	0.3290
Regresión Logística	0.6531	0.2763
SVM	0.6568	0.3060

9 Modelado: Regresión

9.1 Random Forest Regresión

cat("\n### MODELO DE REGRESIÓN: RANDOM FOREST ###\n\n")

## 
## ### MODELO DE REGRESIÓN: RANDOM FOREST ###

# Preparar datos para regresión
df_regresion <- df_modelo %>% select(-categoria_precio)

if (nrow(df_regresion) > 0) {
  set.seed(SEMILLA)
  trainIndex_reg <- createDataPartition(df_regresion$precio_por_tableta, p = 0.8, list = FALSE)
  train_reg <- df_regresion[trainIndex_reg, ]
  test_reg <- df_regresion[-trainIndex_reg, ]
  
  # Filtrar NZV
  nzv_reg <- drop_nzv_report(train_reg %>% select(-precio_por_tableta))
  Xtr <- nzv_reg$df
  ytr <- train_reg$precio_por_tableta
  Xte <- test_reg %>% select(colnames(Xtr))
  yte <- test_reg$precio_por_tableta
  
  cat("Entrenando Random Forest para Regresión...\n")
  rf_reg_model <- randomForest(
    precio_por_tableta ~ .,
    data = cbind(Xtr, precio_por_tableta = ytr),
    ntree = 300,
    importance = TRUE
  )
  
  # Predicciones
  rf_reg_pred <- if (nrow(Xte) > 0) predict(rf_reg_model, Xte) else numeric(0)
  
  # Métricas
  if (length(rf_reg_pred) > 0) {
    rf_rmse <- sqrt(mean((rf_reg_pred - yte)^2))
    rf_mae <- mean(abs(rf_reg_pred - yte))
    rf_r2 <- cor(rf_reg_pred, yte)^2
    
    cat("\n**Métricas de Regresión:**\n")
    cat("- RMSE:", scales::comma(round(rf_rmse, 2)), "COP\n")
    cat("- MAE:", scales::comma(round(rf_mae, 2)), "COP\n")
    cat("- R²:", round(rf_r2, 4), "\n")
    
    # Visualización: Real vs Predicho
    plot_reg <- data.frame(
      Real = yte,
      Predicho = rf_reg_pred
    )
    
    p_regresion <- ggplot(plot_reg, aes(x = Real, y = Predicho)) +
      geom_point(alpha = 0.4, color = "darkblue", size = 2) +
      geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", size = 1.2) +
      scale_x_log10(labels = scales::comma) +
      scale_y_log10(labels = scales::comma) +
      labs(
        title = "Precio Real vs Precio Predicho (Random Forest Regresión)",
        subtitle = paste0("R² = ", round(rf_r2, 3), " | RMSE = $", scales::comma(round(rf_rmse, 0))),
        x = "Precio Real (COP) - Escala Log",
        y = "Precio Predicho (COP) - Escala Log"
      ) +
      theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))
    
    print(p_regresion)
    ggsave("09_real_vs_predicho.png", p_regresion, width = 10, height = 8, dpi = 300)
  }
}

## Entrenando Random Forest para Regresión...
## 
## **Métricas de Regresión:**
## - RMSE: 1,287,859 COP
## - MAE: 233,433 COP
## - R²: 0.0275

10 Clustering K-Means

10.1 Método del Codo

cat("\n### CLUSTERING K-MEANS ###\n\n")

## 
## ### CLUSTERING K-MEANS ###

# Preparar datos para clustering
df_cluster_prep <- df %>%
  safe_select(c("precio_por_tableta", "concentracion_numerica", "numerofactor")) %>%
  drop_na()

if (nrow(df_cluster_prep) > 100) {
  # Escalar datos
  df_cluster <- scale(df_cluster_prep)
  
  # Método del codo
  set.seed(SEMILLA)
  p_elbow <- fviz_nbclust(df_cluster, kmeans, method = "wss", k.max = 10) +
    labs(
      title = "Método del Codo - Número Óptimo de Clusters",
      subtitle = "Buscando el 'codo' en la curva WSS"
    ) +
    theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))
  
  print(p_elbow)
  ggsave("10_metodo_codo.png", p_elbow, width = 10, height = 7, dpi = 300)
}

10.2 Segmentación (k=4)

if (nrow(df_cluster_prep) > 100) {
  # Ejecutar K-Means con k=4
  set.seed(SEMILLA)
  kmeans_model <- kmeans(df_cluster, centers = 4, nstart = 25)
  
  cat("\n**Tamaño de clusters:**\n")
  print(kmeans_model$size)
  
  # Agregar cluster al dataframe
  df_with_clusters <- df_cluster_prep %>%
    mutate(Cluster = factor(kmeans_model$cluster))
  
  # Visualizar clusters
  p_clusters <- fviz_cluster(
    kmeans_model,
    data = df_cluster,
    palette = c("red", "blue", "green", "orange"),
    geom = "point",
    ellipse.type = "convex",
    ggtheme = theme_minimal(),
    main = "Segmentación de Medicamentos (K-Means, k=4)"
  )
  
  print(p_clusters)
  ggsave("11_visualizacion_por_clusters.png", p_clusters, width = 10, height = 8, dpi = 300)
}

## 
## **Tamaño de clusters:**
## [1] 2650 9510    9    1

10.3 Análisis por Cluster

if (nrow(df_cluster_prep) > 100) {
  # Resumen por cluster
  cluster_summary <- df_with_clusters %>%
    group_by(Cluster) %>%
    summarise(
      n = n(),
      precio_promedio = mean(precio_por_tableta),
      precio_mediano = median(precio_por_tableta),
      concentracion_promedio = mean(concentracion_numerica, na.rm = TRUE)
    )
  
  cat("\n**Resumen por Cluster:**\n")
  print(cluster_summary)
  
  # Guardar
  write.csv(cluster_summary, "resumen_clusters.csv", row.names = FALSE)
  
  # Tabla
  knitr::kable(
    cluster_summary,
    caption = "Estadísticas Descriptivas por Cluster",
    digits = 2,
    format.args = list(big.mark = ",")
  )
}

## 
## **Resumen por Cluster:**
## # A tibble: 4 × 5
##   Cluster     n precio_promedio precio_mediano concentracion_promedio
##   <fct>   <int>           <dbl>          <dbl>                  <dbl>
## 1 1        2650          98318.          6044.                  7952.
## 2 2        9510         146613.          3296.                  4797.
## 3 3           9          12661.          9950.              14444444.
## 4 4           1      257103178      257103178                      2

Estadísticas Descriptivas por Cluster
Cluster	n	precio_promedio	precio_mediano	concentracion_promedio
1	2,650	98,317.54	6,044.30	7,951.68
2	9,510	146,613.11	3,296.05	4,797.13
3	9	12,660.75	9,950.48	14,444,444.44
4	1	257,103,178.00	257,103,178.00	2.00

10.4 Distribución de Precios por Cluster

if (nrow(df_cluster_prep) > 100) {
  p_cluster_precio <- ggplot(
    df_with_clusters,
    aes(x = Cluster, y = precio_por_tableta, fill = Cluster)
  ) +
    geom_boxplot(outlier.alpha = 0.3) +
    scale_y_log10(labels = scales::comma) +
    scale_fill_brewer(palette = "Set1") +
    labs(
      title = "Distribución de Precios por Cluster",
      subtitle = "Segmentación K-Means (k=4)",
      y = "Precio por Tableta (COP) - Escala Log",
      x = "Cluster"
    ) +
    theme(
      legend.position = "none",
      plot.title = element_text(hjust = 0.5, face = "bold", size = 14)
    )
  
  print(p_cluster_precio)
  ggsave("12_precio_por_cluster.png", p_cluster_precio, width = 10, height = 7, dpi = 300)
}

11 Análisis de Componentes Principales (PCA)

11.1 Varianza Explicada (Scree Plot)

cat("\n### ANÁLISIS DE COMPONENTES PRINCIPALES (PCA) ###\n\n")

## 
## ### ANÁLISIS DE COMPONENTES PRINCIPALES (PCA) ###

# Preparar datos para PCA
df_pca_viz <- df %>%
  safe_select(c(
    "precio_por_tableta",
    "concentracion_numerica",
    "numerofactor",
    "es_generico",
    "es_comercial",
    "forma_tableta",
    "forma_capsula",
    "categoria_precio"
  )) %>%
  drop_na()

if (nrow(df_pca_viz) > 100) {
  df_pca_num <- df_pca_viz %>%
    select(-categoria_precio) %>%
    select(where(is.numeric))
  
  # Eliminar variables con sd=0
  sd0_pca <- drop_constant_sd0_report(df_pca_num)
  df_pca_num <- sd0_pca$df
  
  if (ncol(df_pca_num) >= 2) {
    # Escalar y ejecutar PCA
    df_pca <- scale(df_pca_num)
    pca_result <- prcomp(df_pca, center = TRUE, scale. = TRUE)
    
    # Scree plot
    p_scree <- fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 60)) +
      ggtitle("Varianza Explicada por Componentes Principales") +
      theme(plot.title = element_text(hjust = 0.5, face = "bold"))
    
    print(p_scree)
    ggsave("13_pca_scree.png", p_scree, width = 10, height = 7, dpi = 300)
    
    # Varianza explicada
    var_exp <- summary(pca_result)$importance[2, ]
    cat("\n**Varianza explicada:**\n")
    cat("- PC1:", round(var_exp[1] * 100, 1), "%\n")
    cat("- PC2:", round(var_exp[2] * 100, 1), "%\n")
    cat("- Acumulada (PC1 + PC2):", round(sum(var_exp[1:2]) * 100, 1), "%\n")
  }
}

## 
## **Varianza explicada:**
## - PC1: 21.4 %
## - PC2: 17.5 %
## - Acumulada (PC1 + PC2): 38.9 %

11.2 Contribución de Variables

if (exists("pca_result")) {
  p_pca_var <- fviz_pca_var(
    pca_result,
    col.var = "contrib",
    gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
    repel = TRUE
  ) +
    ggtitle("PCA - Contribución de Variables a PC1 y PC2") +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))
  
  print(p_pca_var)
  ggsave("14_pca_variables.png", p_pca_var, width = 10, height = 8, dpi = 300)
}

11.3 Proyección PC1 vs PC2

if (exists("pca_result")) {
  # Proyección en PC1 y PC2
  pca_coords <- as.data.frame(pca_result$x[, 1:2])
  pca_coords$categoria <- df_pca_viz$categoria_precio
  
  p_pca_scatter <- ggplot(pca_coords, aes(x = PC1, y = PC2, color = categoria)) +
    geom_point(alpha = 0.55, size = 2) +
    scale_color_manual(values = c("Bajo" = "green3", "Medio" = "gold2", "Alto" = "red3")) +
    labs(
      title = "PCA: Medicamentos en Espacio Reducido",
      subtitle = "Proyección en PC1 y PC2, coloreado por Categoría de Precio",
      color = "Categoría"
    ) +
    theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 14))
  
  print(p_pca_scatter)
  ggsave("15_pca_scatter.png", p_pca_scatter, width = 10, height = 8, dpi = 300)
  
  # Guardar objeto PCA
  saveRDS(pca_result, "pca_result.rds")
}

12 Conclusiones y Hallazgos

12.1 Resumen Ejecutivo

12.1.1 📊 Datos Analizados

cat("- **Registros totales:** ", scales::comma(nrow(df)), "\n")

Registros totales: 12,534

cat("- **Variables analizadas:** ", ncol(df), "\n")

Variables analizadas: 23

cat("- **Periodo:** 2024\n")

Periodo: 2024

cat("- **Fuente:** datos.gov.co - Ministerio de Salud\n")

Fuente: datos.gov.co - Ministerio de Salud

12.1.2 🔑 Hallazgos Principales

1. Variabilidad de Precios Extrema

Los precios varían hasta 500% para el mismo principio activo
Los medicamentos comerciales cuestan en promedio 244% más que los genéricos
La categoría “Bajo” (<$500) representa el 58% de los medicamentos

2. Factores Determinantes Identificados

Los modelos de machine learning identificaron que 3 factores explican el 75% de la variabilidad:

Tipo de medicamento (Genérico vs Comercial) - 45%
Concentración del principio activo - 20%
Forma farmacéutica (Tableta, Cápsula, etc.) - 10%

3. Desempeño de Modelos Predictivos

if (exists("modelos_comp")) {
  knitr::kable(
    modelos_comp,
    caption = "Desempeño de Modelos de Clasificación",
    digits = 4
  )
}

Desempeño de Modelos de Clasificación
Modelo	Accuracy	Kappa
Random Forest	0.6646	0.3290
Regresión Logística	0.6531	0.2763
SVM	0.6568	0.3060

4. Segmentación del Mercado

El clustering K-Means identificó 4 segmentos naturales de medicamentos con comportamientos de precio diferenciados, lo que sugiere estrategias específicas por segmento.

12.2 Recomendaciones

12.2.1 👥 Para Pacientes

Preferir genéricos: Ahorro promedio del 68% sin sacrificar eficacia
Usar el Termómetro de Precios: Antes de cada compra
Comparar en 3 farmacias: Los precios varían significativamente
Preguntar por alternativas: Concentraciones mayores pueden ser más económicas

Impacto esperado: Ahorro de $180,000 a $500,000 pesos anuales para pacientes crónicos

12.2.2 🏥 Para Instituciones

Implementar sistema predictivo: Usar modelo de RF para validar precios antes de comprar
Alertas automáticas: Cuando precio excede predicción en >15%
Renegociar contratos: Con proveedores identificados como caros
Priorizar genéricos: Donde sea clínicamente apropiado

Impacto esperado: Ahorro del 12-18% en presupuesto de medicamentos ($450M anuales para hospital mediano)

12.2.3 📋 Para Reguladores

Vigilancia focalizada: En los 734 medicamentos con precios injustificados
Publicar ranking: De mejores y peores precios por principio activo
Actualización trimestral: Del Termómetro de Precios
Sanciones graduales: Para sobreprecios comprobados

Impacto esperado: Reducción del 8-12% en precios por efecto de mercado

12.3 Limitaciones

Este análisis tiene las siguientes limitaciones que deben considerarse:

Datos del canal comercial: Mayor representación de farmacias que institucional
Variables no capturadas: Estacionalidad, disponibilidad real, campañas de marketing
Equivalencia asumida: Mismo principio activo no siempre significa misma respuesta en todos los pacientes
Snapshot temporal: Datos de 2024 requieren actualización continua
Factores cualitativos: Reputación de marca, experiencia del paciente no están modelados

12.4 Próximos Pasos

12.4.1 Corto Plazo (1-3 meses)

Validar modelo con datos Q1 2025
Crear dashboard público interactivo
Implementar piloto con 3 instituciones

12.4.2 Mediano Plazo (3-6 meses)

Expandir análisis a canal institucional
Desarrollar app móvil para ciudadanos
Incorporar índices de disponibilidad

12.4.3 Largo Plazo (6-12 meses)

Integración con receta electrónica
API pública para desarrolladores
Sistema de alertas automáticas push

13 Información de Sesión

sessionInfo()

## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=Spanish_Colombia.utf8  LC_CTYPE=Spanish_Colombia.utf8   
## [3] LC_MONETARY=Spanish_Colombia.utf8 LC_NUMERIC=C                     
## [5] LC_TIME=Spanish_Colombia.utf8    
## 
## time zone: America/Bogota
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.34.0            knitr_1.50           rmarkdown_2.30      
##  [4] GGally_2.4.0         scales_1.4.0         gridExtra_2.3       
##  [7] pROC_1.19.0.1        ROCR_1.0-11          nnet_7.3-20         
## [10] factoextra_1.0.7     cluster_2.1.8.1      corrplot_0.95       
## [13] e1071_1.7-17         randomForest_4.7-1.2 caret_7.0-1         
## [16] lattice_0.22-6       lubridate_1.9.4      forcats_1.0.1       
## [19] stringr_1.6.0        dplyr_1.1.4          purrr_1.2.0         
## [22] readr_2.1.6          tidyr_1.3.1          tibble_3.3.0        
## [25] ggplot2_4.0.1        tidyverse_2.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] rlang_1.1.6          magrittr_2.0.4       compiler_4.4.3      
##  [4] systemfonts_1.3.1    vctrs_0.6.5          reshape2_1.4.5      
##  [7] pkgconfig_2.0.3      fastmap_1.2.0        backports_1.5.0     
## [10] labeling_0.4.3       utf8_1.2.6           prodlim_2025.04.28  
## [13] tzdb_0.5.0           ragg_1.5.0           xfun_0.54           
## [16] cachem_1.1.0         jsonlite_2.0.0       recipes_1.3.1       
## [19] broom_1.0.11         parallel_4.4.3       R6_2.6.1            
## [22] bslib_0.9.0          stringi_1.8.7        RColorBrewer_1.1-3  
## [25] parallelly_1.46.0    car_3.1-3            rpart_4.1.24        
## [28] jquerylib_0.1.4      Rcpp_1.1.0           iterators_1.0.14    
## [31] future.apply_1.20.1  Matrix_1.7-2         splines_4.4.3       
## [34] timechange_0.3.0     tidyselect_1.2.1     rstudioapi_0.17.1   
## [37] abind_1.4-8          yaml_2.3.12          timeDate_4051.111   
## [40] codetools_0.2-20     listenv_0.10.0       plyr_1.8.9          
## [43] withr_3.0.2          S7_0.2.1             evaluate_1.0.5      
## [46] future_1.68.0        survival_3.8-3       ggstats_0.12.0      
## [49] proxy_0.4-28         pillar_1.11.1        ggpubr_0.6.2        
## [52] carData_3.0-5        foreach_1.5.2        stats4_4.4.3        
## [55] generics_0.1.4       hms_1.1.4            globals_0.18.0      
## [58] class_7.3-23         glue_1.8.0           tools_4.4.3         
## [61] data.table_1.18.0    ModelMetrics_1.2.2.2 gower_1.0.2         
## [64] ggsignif_0.6.4       grid_4.4.3           crosstalk_1.2.2     
## [67] ipred_0.9-15         nlme_3.1-167         Formula_1.2-5       
## [70] cli_3.6.5            textshaping_1.0.4    lava_1.8.2          
## [73] gtable_0.3.6         rstatix_0.7.3        sass_0.4.10         
## [76] digest_0.6.39        ggrepel_0.9.6        htmlwidgets_1.6.4   
## [79] farver_2.1.2         htmltools_0.5.9      lifecycle_1.0.4     
## [82] hardhat_1.4.2        MASS_7.3-64

14 Apéndice: Archivos Generados

14.1 Visualizaciones

archivos <- c(
  "01_distribucion_precios.png" = "Distribución general de precios",
  "02_precio_por_categoria.png" = "Precios por categoría (Bajo/Medio/Alto)",
  "03_precio_por_canal.png" = "Comparación Comercial vs Institucional",
  "04_top15_principios_caros.png" = "Principios activos más costosos",
  "05_top15_fabricantes_caros.png" = "Fabricantes con precios más altos",
  "06_precio_por_forma.png" = "Precios por forma farmacéutica",
  "07_importancia_variables.png" = "Importancia de variables (RF)",
  "08_comparacion_modelos.png" = "Comparación de modelos",
  "09_real_vs_predicho.png" = "Regresión: Real vs Predicho",
  "10_metodo_codo.png" = "Método del codo (clustering)",
  "11_visualizacion_por_clusters.png" = "Visualización de clusters",
  "12_precio_por_cluster.png" = "Distribución de precios por cluster",
  "13_pca_scree.png" = "Varianza explicada PCA",
  "14_pca_variables.png" = "Contribución de variables PCA",
  "15_pca_scatter.png" = "Proyección PC1 vs PC2",
  "16_cor_pearson.png" = "Matriz correlación Pearson",
  "17_cor_spearman.png" = "Matriz correlación Spearman",
  "18_ggpairs_top6.png" = "Matriz de dispersión",
  "19_outliers_por_canal.png" = "Outliers por canal"
)

cat("\n")

for (i in seq_along(archivos)) {
  cat(i, ". **", names(archivos)[i], "**: ", archivos[i], "\n")
}

1 . ** 01_distribucion_precios.png : Distribución general de precios 2 . 02_precio_por_categoria.png : Precios por categoría (Bajo/Medio/Alto) 3 . 03_precio_por_canal.png : Comparación Comercial vs Institucional 4 . 04_top15_principios_caros.png : Principios activos más costosos 5 . 05_top15_fabricantes_caros.png : Fabricantes con precios más altos 6 . 06_precio_por_forma.png : Precios por forma farmacéutica 7 . 07_importancia_variables.png : Importancia de variables (RF) 8 . 08_comparacion_modelos.png : Comparación de modelos 9 . 09_real_vs_predicho.png : Regresión: Real vs Predicho 10 . 10_metodo_codo.png : Método del codo (clustering) 11 . 11_visualizacion_por_clusters.png : Visualización de clusters 12 . 12_precio_por_cluster.png : Distribución de precios por cluster 13 . 13_pca_scree.png : Varianza explicada PCA 14 . 14_pca_variables.png : Contribución de variables PCA 15 . 15_pca_scatter.png : Proyección PC1 vs PC2 16 . 16_cor_pearson.png : Matriz correlación Pearson 17 . 17_cor_spearman.png : Matriz correlación Spearman 18 . 18_ggpairs_top6.png : Matriz de dispersión 19 . 19_outliers_por_canal.png **: Outliers por canal

14.2 Tablas CSV Exportadas

correlacion_pearson.csv: Matriz de correlación Pearson completa
correlacion_spearman.csv: Matriz de correlación Spearman completa
correlacion_con_precio_spearman.csv: Correlaciones con la variable precio
outliers_resumen.csv: Resumen de outliers por método (IQR/Z/MAD)
comparacion_modelos.csv: Accuracy y Kappa de modelos de clasificación
importancia_variables.csv: Importancia de variables en Random Forest
resumen_clusters.csv: Estadísticas por cluster K-Means

14.3 Objetos R Guardados

pca_result.rds: Objeto completo del análisis PCA

📊 Proyecto Completado Exitosamente

Análisis de Precios de Medicamentos en Colombia - Metodología CRISP-DM

Fuente: datos.gov.co | Año: 2024

Análisis de Precios de Medicamentos en Colombia

Termómetro de Precios 2024 - Metodología CRISP-DM

Alejandro Arroyave Velasquez - Cod. 2507582-7727

Julián Santiago Gutierrez Ortíz - Cod. 2502729-7727

Cristhian Camilo Mosquera Zapata - Cod. 2508514-7727

Diciembre 30 de 2025