Name: Francisco Gonzalez
Student ID: 24002914
library(tidyverse) # Includes dplyr, ggplot2, etc.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(recipes)
##
## Attaching package: 'recipes'
##
## The following object is masked from 'package:stringr':
##
## fixed
##
## The following object is masked from 'package:stats':
##
## step
library(MASS) # This can mask dplyr::select()
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(Metrics) # For RMSE
##
## Attaching package: 'Metrics'
##
## The following objects are masked from 'package:caret':
##
## precision, recall
# Make sure select() refers to the dplyr version (MASS masks it)
select <- dplyr::select
# Read the data file
walmart <- read_csv("walmart.csv")
## Rows: 6435 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (7): Store, Weekly_Sales, Holiday_Flag, Temperature, Fuel_Price, CPI, Un...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Overall structure
glimpse(walmart)
## Rows: 6,435
## Columns: 8
## $ Store <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Date <chr> "05-02-2010", "12-02-2010", "19-02-2010", "26-02-2010", "…
## $ Weekly_Sales <dbl> 1643691, 1641957, 1611968, 1409728, 1554807, 1439542, 147…
## $ Holiday_Flag <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Temperature <dbl> 42.31, 38.51, 39.93, 46.63, 46.50, 57.79, 54.58, 51.45, 6…
## $ Fuel_Price <dbl> 2.572, 2.548, 2.514, 2.561, 2.625, 2.667, 2.720, 2.732, 2…
## $ CPI <dbl> 211.0964, 211.2422, 211.2891, 211.3196, 211.3501, 211.380…
## $ Unemployment <dbl> 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 7…
summary(walmart)
## Store Date Weekly_Sales Holiday_Flag
## Min. : 1 Length:6435 Min. : 209986 Min. :0.00000
## 1st Qu.:12 Class :character 1st Qu.: 553350 1st Qu.:0.00000
## Median :23 Mode :character Median : 960746 Median :0.00000
## Mean :23 Mean :1046965 Mean :0.06993
## 3rd Qu.:34 3rd Qu.:1420159 3rd Qu.:0.00000
## Max. :45 Max. :3818686 Max. :1.00000
## Temperature Fuel_Price CPI Unemployment
## Min. : -2.06 Min. :2.472 Min. :126.1 Min. : 3.879
## 1st Qu.: 47.46 1st Qu.:2.933 1st Qu.:131.7 1st Qu.: 6.891
## Median : 62.67 Median :3.445 Median :182.6 Median : 7.874
## Mean : 60.66 Mean :3.359 Mean :171.6 Mean : 7.999
## 3rd Qu.: 74.94 3rd Qu.:3.735 3rd Qu.:212.7 3rd Qu.: 8.622
## Max. :100.14 Max. :4.468 Max. :227.2 Max. :14.313
# Continuous variables to analyze
vars_cont <- c("Weekly_Sales", "Temperature", "Fuel_Price", "CPI", "Unemployment")
# Density plots
walmart %>%
pivot_longer(all_of(vars_cont), names_to = "variable", values_to = "valor") %>%
ggplot(aes(x = valor, fill = variable)) +
geom_density(alpha = 0.5) +
facet_wrap(~ variable, scales = "free") +
labs(title = "Gráficas de densidad para variables continuas")
# Bar chart for Holiday_Flag
ggplot(walmart, aes(x = as.factor(Holiday_Flag))) +
geom_bar(fill = "skyblue") +
labs(title = "Cantidad de semanas con y sin feriado", x = "Holiday_Flag", y = "Cantidad")
# Bar chart for Store
ggplot(walmart, aes(x = as.factor(Store))) +
geom_bar(fill = "orange") +
labs(title = "Cantidad de registros por tienda", x = "Store", y = "Cantidad")
# Boxplot: Weekly_Sales vs Holiday_Flag
ggplot(walmart, aes(x = as.factor(Holiday_Flag), y = Weekly_Sales, fill = as.factor(Holiday_Flag))) +
geom_boxplot() +
labs(title = "Distribución de ventas por semanas con/ sin feriado", x = "Holiday_Flag")
# Convert the date column to Date format
walmart <- walmart %>%
mutate(Date = dmy(Date))
# Time series
ggplot(walmart, aes(x = Date, y = Weekly_Sales)) +
geom_line(color = "dodgerblue") +
labs(title = "Ventas semanales a lo largo del tiempo", x = "Fecha", y = "Weekly Sales")
cor_data <- walmart %>%
select(all_of(vars_cont)) %>%
na.omit()
ggcorr(cor_data, label = TRUE, label_round = 2, hjust = 0.75, size = 3) +
labs(title = "Matriz de correlación entre variables continuas")
The vast majority of the weeks in the dataset are not holiday weeks (Holiday_Flag = 0); only a small percentage are holiday weeks (Holiday_Flag = 1). Holidays are therefore infrequent events, and their effect must be analyzed carefully, since the data are unbalanced with respect to this variable.
The number of records is very uniform across stores: each store reports roughly the same number of sales weeks. This makes comparative analysis easier and reduces the risk of bias from missing data at any particular store.
The boxplot shows that, on average, holiday weeks have higher weekly sales than non-holiday weeks. However, regular weeks show more variability and more outliers. This suggests holidays can be an important factor behind sales spikes, but not the only one.
The time-series plot shows strong variability in weekly sales, with very noticeable peaks on certain dates (likely tied to special events or high seasons such as Black Friday or Christmas). Some seasonality is visible, although not perfectly cyclical; the overlay sketch below marks the holiday weeks explicitly.
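To check that intuition, a small sketch (not part of the original pipeline) can overlay the holiday weeks on the aggregate series; peaks that coincide with red points are candidates for holiday-driven spikes.
# Sketch: mean sales per week with holiday weeks highlighted in red
ventas_semana <- walmart %>%
group_by(Date) %>%
summarise(mean_sales = mean(Weekly_Sales), holiday = max(Holiday_Flag), .groups = "drop")
ggplot(ventas_semana, aes(x = Date, y = mean_sales)) +
geom_line(color = "grey60") +
geom_point(data = filter(ventas_semana, holiday == 1), color = "red", size = 1.5) +
labs(title = "Mean weekly sales with holiday weeks marked in red", x = "Date", y = "Mean Weekly_Sales")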
The correlation matrix shows no strong linear relationships between the continuous numeric variables and sales (Weekly_Sales).
The largest correlation in absolute value is between Unemployment and CPI (-0.3).
All correlations between the predictors and sales fall between -0.11 and +0.14, i.e., very low.
This suggests that explaining sales will likely require non-linear models and models that account for interactions between variables; the Spearman check sketched below is one quick way to probe for monotone non-linear association.
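A hedged sanity check: compare Pearson and Spearman correlations against Weekly_Sales. A clear gap between the two for some predictor would hint at a monotone but non-linear relationship that Pearson understates.
# Sketch: Pearson vs. Spearman correlation of each predictor with Weekly_Sales
sapply(c("Temperature", "Fuel_Price", "CPI", "Unemployment"), function(v) {
c(pearson  = cor(walmart[[v]], walmart$Weekly_Sales, use = "complete.obs"),
spearman = cor(walmart[[v]], walmart$Weekly_Sales, method = "spearman", use = "complete.obs"))
})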
The density plots show that:
CPI (Consumer Price Index) and Unemployment have multimodal distributions, possibly due to economic shifts or to differences between regions/stores.
Fuel_Price and Temperature show bimodal, skewed distributions, reflecting geographic and seasonal variability.
Weekly_Sales is strongly right-skewed, with mostly "normal" sales plus some extreme values (high-sales outliers); the log-scale sketch below makes this easier to see.
The high dispersion and the presence of outliers in sales justify the use of robust metrics and more advanced predictive models.
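As a complement (a sketch, not part of the original pipeline), plotting Weekly_Sales on a log10 axis makes the right skew easier to read and hints that a log transform could stabilize the variance for some models. Sales are strictly positive (minimum 209,986 per the summary above), so the log scale is well defined.
# Sketch: Weekly_Sales density on a log10 axis
ggplot(walmart, aes(x = Weekly_Sales)) +
geom_density(fill = "skyblue", alpha = 0.5) +
scale_x_log10(labels = scales::label_comma()) +
labs(title = "Density of Weekly_Sales (log10 scale)", x = "Weekly_Sales (log10)", y = "Density")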
set.seed(123) # For reproducibility
trainIndex <- createDataPartition(walmart$Weekly_Sales, p = 0.8, list = FALSE)
walmart_train <- walmart[trainIndex, ]
walmart_test <- walmart[-trainIndex, ]
# Check the split sizes
nrow(walmart_train)
## [1] 5151
nrow(walmart_test)
## [1] 1284
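One caveat worth noting: Weekly_Sales is a time series, so a random partition lets training rows postdate some test rows. For the model comparison below this is acceptable, but if the goal were strict forecasting evaluation, a chronological split along these lines would be safer (the cutoff date below is illustrative only, chosen near the 80% mark of the date range).
# Hedged alternative (not used below): chronological train/test split
cutoff <- dmy("01-04-2012") # hypothetical cutoff, roughly 80% through the dates
walmart_train_ts <- filter(walmart, Date < cutoff)
walmart_test_ts <- filter(walmart, Date >= cutoff)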
# Decompose the date and create new variables
walmart_train <- walmart_train %>%
mutate(
year = year(Date),
month = month(Date),
day = day(Date),
week = week(Date),
weekday = wday(Date, label = TRUE),
is_weekend = ifelse(weekday %in% c("Sat", "Sun"), 1, 0)
) %>%
select(-Date) # Drop the original Date column
walmart_test <- walmart_test %>%
mutate(
year = year(Date),
month = month(Date),
day = day(Date),
week = week(Date),
weekday = wday(Date, label = TRUE),
is_weekend = ifelse(weekday %in% c("Sat", "Sun"), 1, 0)
) %>%
select(-Date)
# Build the recipe
receta <- recipe(Weekly_Sales ~ ., data = walmart_train) %>%
# 1. Impute missing values
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
# 2. Encode categorical variables
step_dummy(all_nominal_predictors()) %>%
# 3. Outlier treatment (winsorize at the 1st/99th percentiles)
step_mutate_at(all_numeric_predictors(), fn = ~scales::squish(.x, quantile(.x, c(0.01, 0.99)))) %>%
# 4. Transform continuous variables (Yeo-Johnson)
step_YeoJohnson(all_numeric_predictors()) %>%
# 5. Center and scale the variables
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
# Prep the recipe on the training set
receta_prep <- prep(receta, training = walmart_train)
# Apply the recipe to train and test
train_ready <- bake(receta_prep, walmart_train)
test_ready <- bake(receta_prep, walmart_test)
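A few sanity checks on the baked data are cheap insurance (a sketch; the expected results are noted in the comments):
# Same columns in train and test, and no NAs left after imputation
dim(train_ready); dim(test_ready)
setdiff(names(train_ready), names(test_ready)) # should be character(0)
anyNA(train_ready); anyNA(test_ready)          # should both be FALSE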
# Repeated cross-validation setup
ctrl <- trainControl(
method = "repeatedcv",
number = 5, # number of folds
repeats = 3, # repetitions
verboseIter = TRUE
)
set.seed(123)
modelo_lm <- train(
Weekly_Sales ~ .,
data = train_ready,
method = "lm",
trControl = ctrl,
metric = "RMSE"
)
print(modelo_lm)
## Linear Regression
##
## 5151 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 520584.6 0.1534752 426649.2
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
set.seed(123)
modelo_ridge <- train(
Weekly_Sales ~ .,
data = train_ready,
method = "glmnet",
tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 10)),
trControl = ctrl,
metric = "RMSE"
)
print(modelo_ridge)
## glmnet
##
## 5151 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0001 520651.9 0.1532886 427234.1
## 0.1112 520651.9 0.1532886 427234.1
## 0.2223 520651.9 0.1532886 427234.1
## 0.3334 520651.9 0.1532886 427234.1
## 0.4445 520651.9 0.1532886 427234.1
## 0.5556 520651.9 0.1532886 427234.1
## 0.6667 520651.9 0.1532886 427234.1
## 0.7778 520651.9 0.1532886 427234.1
## 0.8889 520651.9 0.1532886 427234.1
## 1.0000 520651.9 0.1532886 427234.1
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.
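The RMSE column is identical across the whole lambda grid, which suggests the penalties tried (0.0001 to 1) are negligible relative to the scale of Weekly_Sales (hundreds of thousands). A hedged sketch with a much wider, log-spaced grid (the values are illustrative) would let the ridge penalty actually shrink the coefficients; the same observation applies to the LASSO grid below.
# Sketch: a wider, log-spaced lambda grid for ridge (illustrative values)
grid_wide <- expand.grid(alpha = 0, lambda = 10^seq(1, 6, length.out = 25))
set.seed(123)
modelo_ridge_wide <- train(
Weekly_Sales ~ .,
data = train_ready,
method = "glmnet",
tuneGrid = grid_wide,
trControl = ctrl,
metric = "RMSE"
)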
set.seed(123)
modelo_lasso <- train(
Weekly_Sales ~ .,
data = train_ready,
method = "glmnet",
tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 10)),
trControl = ctrl,
metric = "RMSE"
)
print(modelo_lasso)
## glmnet
##
## 5151 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.0001 520604.4 0.1533199 426800.2
## 0.1112 520604.4 0.1533199 426800.2
## 0.2223 520604.4 0.1533199 426800.2
## 0.3334 520604.4 0.1533199 426800.2
## 0.4445 520604.4 0.1533199 426800.2
## 0.5556 520604.4 0.1533199 426800.2
## 0.6667 520604.4 0.1533199 426800.2
## 0.7778 520604.4 0.1533199 426800.2
## 0.8889 520604.4 0.1533199 426800.2
## 1.0000 520604.4 0.1533199 426800.2
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 1.
set.seed(123)
modelo_enet <- train(
Weekly_Sales ~ .,
data = train_ready,
method = "glmnet",
tuneLength = 10,
trControl = ctrl,
metric = "RMSE"
)
print(modelo_enet)
## glmnet
##
## 5151 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.1 89.44116 520654.6 0.1531705 426859.1
## 0.1 206.62069 520653.4 0.1531742 426858.3
## 0.1 477.32059 520629.8 0.1532437 426844.2
## 0.1 1102.67248 520611.3 0.1532964 426833.6
## 0.1 2547.31645 520611.9 0.1532967 426854.3
## 0.1 5884.63139 520629.7 0.1532625 426934.2
## 0.1 13594.26174 520696.7 0.1531903 427133.6
## 0.1 31404.50781 521023.9 0.1528662 427732.8
## 0.1 72548.48620 522311.8 0.1519171 429769.2
## 0.1 167596.41266 526686.7 0.1490333 435560.0
## 0.2 89.44116 520673.6 0.1531194 426862.3
## 0.2 206.62069 520652.0 0.1531834 426849.1
## 0.2 477.32059 520625.6 0.1532610 426833.0
## 0.2 1102.67248 520606.4 0.1533148 426821.9
## 0.2 2547.31645 520619.1 0.1532806 426855.4
## 0.2 5884.63139 520655.6 0.1532084 426946.8
## 0.2 13594.26174 520810.9 0.1529502 427205.0
## 0.2 31404.50781 521394.0 0.1521987 428049.9
## 0.2 72548.48620 523486.0 0.1501696 431049.9
## 0.2 167596.41266 531310.7 0.1375391 439885.2
## 0.3 89.44116 520643.5 0.1532095 426840.5
## 0.3 206.62069 520629.6 0.1532492 426831.2
## 0.3 477.32059 520613.8 0.1532942 426821.2
## 0.3 1102.67248 520608.2 0.1533112 426819.6
## 0.3 2547.31645 520628.2 0.1532593 426859.1
## 0.3 5884.63139 520692.6 0.1531231 426966.8
## 0.3 13594.26174 520953.0 0.1526353 427303.6
## 0.3 31404.50781 521796.3 0.1514775 428416.9
## 0.3 72548.48620 525006.7 0.1471976 432734.4
## 0.3 167596.41266 535209.3 0.1267333 443284.2
## 0.4 89.44116 520607.1 0.1533116 426811.9
## 0.4 206.62069 520607.1 0.1533116 426811.9
## 0.4 477.32059 520607.1 0.1533116 426811.9
## 0.4 1102.67248 520610.8 0.1533052 426819.3
## 0.4 2547.31645 520639.8 0.1532309 426864.8
## 0.4 5884.63139 520741.1 0.1530045 426996.1
## 0.4 13594.26174 521094.0 0.1523292 427404.3
## 0.4 31404.50781 522211.4 0.1507742 428870.9
## 0.4 72548.48620 526950.9 0.1423497 434707.9
## 0.4 167596.41266 538334.3 0.1188935 446079.7
## 0.5 89.44116 520609.4 0.1533052 426810.9
## 0.5 206.62069 520609.4 0.1533052 426810.9
## 0.5 477.32059 520609.4 0.1533052 426810.9
## 0.5 1102.67248 520614.2 0.1532965 426820.3
## 0.5 2547.31645 520654.3 0.1531940 426873.1
## 0.5 5884.63139 520796.7 0.1528673 427033.4
## 0.5 13594.26174 521231.7 0.1520508 427501.5
## 0.5 31404.50781 522684.6 0.1499140 429427.6
## 0.5 72548.48620 529042.2 0.1364896 436651.2
## 0.5 167596.41266 540383.5 0.1187296 448324.4
## 0.6 89.44116 520609.5 0.1533042 426808.9
## 0.6 206.62069 520609.5 0.1533042 426808.9
## 0.6 477.32059 520609.6 0.1533040 426809.0
## 0.6 1102.67248 520618.2 0.1532865 426821.8
## 0.6 2547.31645 520671.8 0.1531482 426883.3
## 0.6 5884.63139 520857.7 0.1527159 427073.5
## 0.6 13594.26174 521378.0 0.1517581 427623.8
## 0.6 31404.50781 523252.8 0.1487480 430102.7
## 0.6 72548.48620 530704.6 0.1322401 438184.6
## 0.6 167596.41266 542786.0 0.1187296 450772.8
## 0.7 89.44116 520605.6 0.1533151 426803.8
## 0.7 206.62069 520605.6 0.1533151 426803.8
## 0.7 477.32059 520611.0 0.1533021 426808.9
## 0.7 1102.67248 520621.9 0.1532773 426823.1
## 0.7 2547.31645 520691.7 0.1530953 426895.1
## 0.7 5884.63139 520921.4 0.1525568 427117.6
## 0.7 13594.26174 521519.5 0.1514946 427751.4
## 0.7 31404.50781 523917.7 0.1472371 430871.9
## 0.7 72548.48620 532171.4 0.1288016 439598.7
## 0.7 167596.41266 545615.1 0.1187296 453455.7
## 0.8 89.44116 520605.1 0.1533170 426802.4
## 0.8 206.62069 520606.2 0.1533140 426803.3
## 0.8 477.32059 520613.4 0.1532952 426809.7
## 0.8 1102.67248 520626.5 0.1532658 426825.0
## 0.8 2547.31645 520714.9 0.1530324 426908.2
## 0.8 5884.63139 520977.5 0.1524233 427155.4
## 0.8 13594.26174 521670.6 0.1512133 427903.6
## 0.8 31404.50781 524665.0 0.1454147 431701.7
## 0.8 72548.48620 533838.8 0.1242849 441143.5
## 0.8 167596.41266 548927.8 0.1187296 456430.3
## 0.9 89.44116 520604.1 0.1533201 426800.7
## 0.9 206.62069 520606.8 0.1533122 426803.1
## 0.9 477.32059 520614.4 0.1532929 426809.4
## 0.9 1102.67248 520630.5 0.1532563 426826.8
## 0.9 2547.31645 520736.6 0.1529757 426920.4
## 0.9 5884.63139 521025.8 0.1523178 427187.2
## 0.9 13594.26174 521828.9 0.1509197 428078.4
## 0.9 31404.50781 525486.5 0.1432790 432576.0
## 0.9 72548.48620 535402.3 0.1200413 442601.8
## 0.9 167596.41266 552787.5 0.1187296 459725.5
## 1.0 89.44116 520604.4 0.1533199 426800.2
## 1.0 206.62069 520608.0 0.1533093 426803.3
## 1.0 477.32059 520615.5 0.1532902 426809.4
## 1.0 1102.67248 520635.2 0.1532446 426829.1
## 1.0 2547.31645 520761.0 0.1529106 426935.6
## 1.0 5884.63139 521073.1 0.1522192 427214.3
## 1.0 13594.26174 522001.7 0.1505882 428279.3
## 1.0 31404.50781 526350.9 0.1409330 433428.5
## 1.0 72548.48620 536379.7 0.1187766 443706.7
## 1.0 167596.41266 557268.5 0.1187296 463325.6
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.9 and lambda = 89.44116.
# Stepwise selection in both directions (using MASS::stepAIC)
modelo_base <- lm(Weekly_Sales ~ ., data = train_ready)
modelo_step <- stepAIC(modelo_base, direction = "both", trace = FALSE)
summary(modelo_step)
##
## Call:
## lm(formula = Weekly_Sales ~ Store + Holiday_Flag + Temperature +
## Fuel_Price + CPI + Unemployment + year + month + day + week,
## data = train_ready)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1063299 -377392 -43437 369674 2586887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1048203 7250 144.583 < 2e-16 ***
## Store -208447 7614 -27.375 < 2e-16 ***
## Holiday_Flag 11836 7516 1.575 0.115368
## Temperature -50490 8739 -5.778 8.02e-09 ***
## Fuel_Price 38713 13127 2.949 0.003202 **
## CPI -74806 8307 -9.005 < 2e-16 ***
## Unemployment -31303 8203 -3.816 0.000137 ***
## year -21095 13660 -1.544 0.122591
## month -408272 229079 -1.782 0.074770 .
## day -52314 21888 -2.390 0.016883 *
## week 462830 231649 1.998 0.045772 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 520300 on 5140 degrees of freedom
## Multiple R-squared: 0.1551, Adjusted R-squared: 0.1535
## F-statistic: 94.38 on 10 and 5140 DF, p-value: < 2.2e-16
set.seed(123)
modelo_rf <- train(
Weekly_Sales ~ .,
data = train_ready,
method = "rf",
trControl = ctrl,
metric = "RMSE"
)
print(modelo_rf)
## Random Forest
##
## 5151 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 482142.8 0.5424033 405112.36
## 9 119318.5 0.9568783 62647.24
## 17 118972.9 0.9557760 59681.82
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 17.
set.seed(123)
modelo_svr <- train(
Weekly_Sales ~ .,
data = train_ready,
method = "svmRadial",
trControl = ctrl,
metric = "RMSE"
)
print(modelo_svr)
## Support Vector Machines with Radial Basis Function Kernel
##
## 5151 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 572115.4 0.1506667 464026.1
## 0.50 572113.4 0.1506667 464023.1
## 1.00 572108.7 0.1506667 464017.1
##
## Tuning parameter 'sigma' was held constant at a value of 0.0798933
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0798933 and C = 1.
set.seed(123)
modelo_knn <- train(
Weekly_Sales ~ .,
data = train_ready,
method = "knn",
tuneLength = 10,
trControl = ctrl,
metric = "RMSE"
)
print(modelo_knn)
## k-Nearest Neighbors
##
## 5151 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 463421.6 0.3316778 360516.7
## 7 461045.4 0.3358834 366385.4
## 9 464005.9 0.3285318 373437.1
## 11 466916.8 0.3223288 379084.5
## 13 468907.1 0.3194696 383870.7
## 15 469793.2 0.3201420 386466.7
## 17 471319.0 0.3185307 388772.3
## 19 473108.0 0.3159907 391151.2
## 21 474550.9 0.3143294 393127.0
## 23 475918.0 0.3122234 394860.2
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
set.seed(123)
modelo_xgb <- train(
Weekly_Sales ~ .,
data = train_ready,
method = "xgbTree",
trControl = ctrl,
metric = "RMSE"
)
print(modelo_xgb)
## eXtreme Gradient Boosting
##
## 5151 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds RMSE Rsquared
## 0.3 1 0.6 0.50 50 446555.9 0.3993904
## 0.3 1 0.6 0.50 100 415331.6 0.4863203
## 0.3 1 0.6 0.50 150 393650.0 0.5415892
## 0.3 1 0.6 0.75 50 447020.8 0.4013148
## 0.3 1 0.6 0.75 100 416928.4 0.4841730
## 0.3 1 0.6 0.75 150 396638.9 0.5358938
## 0.3 1 0.6 1.00 50 448516.2 0.3983143
## 0.3 1 0.6 1.00 100 420528.2 0.4752059
## 0.3 1 0.6 1.00 150 401637.0 0.5256644
## 0.3 1 0.8 0.50 50 442420.1 0.4126977
## 0.3 1 0.8 0.50 100 409037.5 0.5066466
## 0.3 1 0.8 0.50 150 384486.3 0.5707171
## 0.3 1 0.8 0.75 50 444506.2 0.4115379
## 0.3 1 0.8 0.75 100 411502.6 0.5007406
## 0.3 1 0.8 0.75 150 389213.2 0.5587995
## 0.3 1 0.8 1.00 50 445771.4 0.4097926
## 0.3 1 0.8 1.00 100 415017.6 0.4960528
## 0.3 1 0.8 1.00 150 393794.9 0.5492864
## 0.3 2 0.6 0.50 50 260785.8 0.8281261
## 0.3 2 0.6 0.50 100 178784.8 0.9149293
## 0.3 2 0.6 0.50 150 146512.8 0.9396387
## 0.3 2 0.6 0.75 50 247989.1 0.8467994
## 0.3 2 0.6 0.75 100 168349.5 0.9249603
## 0.3 2 0.6 0.75 150 135665.2 0.9477143
## 0.3 2 0.6 1.00 50 256641.3 0.8375185
## 0.3 2 0.6 1.00 100 169721.5 0.9243536
## 0.3 2 0.6 1.00 150 136976.8 0.9464841
## 0.3 2 0.8 0.50 50 240110.2 0.8535609
## 0.3 2 0.8 0.50 100 163088.2 0.9283745
## 0.3 2 0.8 0.50 150 137565.3 0.9459400
## 0.3 2 0.8 0.75 50 238823.3 0.8567059
## 0.3 2 0.8 0.75 100 160151.4 0.9311280
## 0.3 2 0.8 0.75 150 133131.8 0.9491600
## 0.3 2 0.8 1.00 50 229285.3 0.8701582
## 0.3 2 0.8 1.00 100 151421.8 0.9382400
## 0.3 2 0.8 1.00 150 126917.6 0.9532285
## 0.3 3 0.6 0.50 50 167716.4 0.9240753
## 0.3 3 0.6 0.50 100 126686.3 0.9524459
## 0.3 3 0.6 0.50 150 117841.2 0.9577752
## 0.3 3 0.6 0.75 50 158477.9 0.9321882
## 0.3 3 0.6 0.75 100 118281.9 0.9583969
## 0.3 3 0.6 0.75 150 110371.3 0.9628705
## 0.3 3 0.6 1.00 50 160259.9 0.9313016
## 0.3 3 0.6 1.00 100 117579.2 0.9590085
## 0.3 3 0.6 1.00 150 109269.7 0.9636112
## 0.3 3 0.8 0.50 50 151624.0 0.9381105
## 0.3 3 0.8 0.50 100 120132.1 0.9570055
## 0.3 3 0.8 0.50 150 114207.8 0.9602070
## 0.3 3 0.8 0.75 50 146453.6 0.9421535
## 0.3 3 0.8 0.75 100 114360.7 0.9608240
## 0.3 3 0.8 0.75 150 109158.9 0.9635395
## 0.3 3 0.8 1.00 50 142825.1 0.9447975
## 0.3 3 0.8 1.00 100 111703.0 0.9624581
## 0.3 3 0.8 1.00 150 104569.6 0.9664854
## 0.4 1 0.6 0.50 50 432465.3 0.4370758
## 0.4 1 0.6 0.50 100 398705.0 0.5299983
## 0.4 1 0.6 0.50 150 374490.3 0.5881341
## 0.4 1 0.6 0.75 50 434490.7 0.4365939
## 0.4 1 0.6 0.75 100 401657.6 0.5233837
## 0.4 1 0.6 0.75 150 378208.4 0.5810694
## 0.4 1 0.6 1.00 50 434658.8 0.4362409
## 0.4 1 0.6 1.00 100 403686.8 0.5203349
## 0.4 1 0.6 1.00 150 383138.7 0.5731912
## 0.4 1 0.8 0.50 50 428932.7 0.4505511
## 0.4 1 0.8 0.50 100 390587.9 0.5528308
## 0.4 1 0.8 0.50 150 363118.9 0.6204952
## 0.4 1 0.8 0.75 50 431011.9 0.4478431
## 0.4 1 0.8 0.75 100 394343.3 0.5451558
## 0.4 1 0.8 0.75 150 367882.9 0.6109078
## 0.4 1 0.8 1.00 50 432287.6 0.4486303
## 0.4 1 0.8 1.00 100 396811.8 0.5409369
## 0.4 1 0.8 1.00 150 372554.4 0.6032543
## 0.4 2 0.6 0.50 50 219175.5 0.8749445
## 0.4 2 0.6 0.50 100 153534.7 0.9329504
## 0.4 2 0.6 0.50 150 134575.1 0.9463124
## 0.4 2 0.6 0.75 50 213367.4 0.8826329
## 0.4 2 0.6 0.75 100 145120.1 0.9405726
## 0.4 2 0.6 0.75 150 126578.8 0.9522186
## 0.4 2 0.6 1.00 50 217442.3 0.8780528
## 0.4 2 0.6 1.00 100 144523.1 0.9410373
## 0.4 2 0.6 1.00 150 124482.2 0.9538141
## 0.4 2 0.8 0.50 50 202273.7 0.8923482
## 0.4 2 0.8 0.50 100 145300.6 0.9399452
## 0.4 2 0.8 0.50 150 128625.0 0.9505953
## 0.4 2 0.8 0.75 50 194227.0 0.9028682
## 0.4 2 0.8 0.75 100 136403.2 0.9467432
## 0.4 2 0.8 0.75 150 121709.4 0.9556857
## 0.4 2 0.8 1.00 50 191305.5 0.9050270
## 0.4 2 0.8 1.00 100 132505.1 0.9490966
## 0.4 2 0.8 1.00 150 118821.3 0.9574491
## 0.4 3 0.6 0.50 50 146269.3 0.9386306
## 0.4 3 0.6 0.50 100 123485.6 0.9535004
## 0.4 3 0.6 0.50 150 118923.5 0.9563225
## 0.4 3 0.6 0.75 50 147441.5 0.9378626
## 0.4 3 0.6 0.75 100 118593.5 0.9571174
## 0.4 3 0.6 0.75 150 113107.8 0.9604277
## 0.4 3 0.6 1.00 50 137347.5 0.9464375
## 0.4 3 0.6 1.00 100 111179.6 0.9622910
## 0.4 3 0.6 1.00 150 105604.1 0.9655954
## 0.4 3 0.8 0.50 50 135431.3 0.9467505
## 0.4 3 0.8 0.50 100 119453.8 0.9562407
## 0.4 3 0.8 0.50 150 115849.7 0.9584072
## 0.4 3 0.8 0.75 50 127129.5 0.9530758
## 0.4 3 0.8 0.75 100 112338.0 0.9612504
## 0.4 3 0.8 0.75 150 107796.0 0.9639973
## 0.4 3 0.8 1.00 50 125585.5 0.9545457
## 0.4 3 0.8 1.00 100 107614.6 0.9646061
## 0.4 3 0.8 1.00 150 103209.5 0.9670986
## MAE
## 366355.91
## 341288.28
## 323095.98
## 366178.78
## 342168.85
## 325320.59
## 367806.72
## 344979.20
## 329874.45
## 362768.62
## 335665.85
## 315236.01
## 363747.25
## 337323.31
## 319166.01
## 364733.17
## 340202.04
## 322756.73
## 205916.69
## 131973.27
## 102643.30
## 195447.84
## 122837.04
## 93193.70
## 202566.09
## 123174.36
## 92968.58
## 186619.87
## 118013.77
## 95030.11
## 185250.40
## 114660.80
## 90109.45
## 176821.49
## 106233.57
## 84081.76
## 120585.16
## 83403.49
## 74608.29
## 112818.88
## 76669.14
## 68993.02
## 114989.76
## 76110.94
## 68340.25
## 106884.97
## 77197.36
## 71002.56
## 102269.11
## 72686.46
## 66866.50
## 99019.05
## 70715.49
## 64560.74
## 355027.79
## 327174.83
## 306291.40
## 356726.91
## 329941.51
## 310157.36
## 356081.17
## 331469.93
## 314159.67
## 351042.10
## 319816.73
## 296961.87
## 352998.60
## 323344.29
## 300982.79
## 354139.21
## 325068.59
## 305125.62
## 167983.92
## 108951.59
## 91604.12
## 162399.92
## 100792.75
## 84009.26
## 166242.79
## 99871.47
## 81892.93
## 153368.33
## 101842.49
## 85941.61
## 145089.82
## 93163.44
## 79881.64
## 142175.99
## 89056.85
## 76688.78
## 101450.42
## 79080.28
## 74235.96
## 102213.95
## 76449.28
## 70668.11
## 94562.44
## 70829.53
## 65798.32
## 92585.74
## 75796.33
## 71618.40
## 83839.58
## 69594.45
## 65414.32
## 83762.68
## 67147.91
## 63177.31
##
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
## parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
## = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
## = 1.
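The tuning table above is hard to scan; caret train objects have a ggplot() method, so a one-liner (a sketch) plots the resampled RMSE across the tuning grid instead.
# Sketch: visualize the cross-validated RMSE over the xgbTree tuning grid
ggplot(modelo_xgb) + labs(title = "XGBoost tuning profile (CV RMSE)")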
# Helper to compute the test-set RMSE for a fitted model
calc_rmse <- function(modelo, datos_test, outcome = "Weekly_Sales") {
pred <- predict(modelo, newdata = datos_test)
Metrics::rmse(datos_test[[outcome]], pred) # namespaced to avoid shadowing rmse()
}
# Models fit with train() (everything except the plain lm() stepwise fit)
rmse_lm <- calc_rmse(modelo_lm, test_ready)
rmse_ridge <- calc_rmse(modelo_ridge, test_ready)
rmse_lasso <- calc_rmse(modelo_lasso, test_ready)
rmse_enet <- calc_rmse(modelo_enet, test_ready)
rmse_rf <- calc_rmse(modelo_rf, test_ready)
rmse_svr <- calc_rmse(modelo_svr, test_ready)
rmse_knn <- calc_rmse(modelo_knn, test_ready)
rmse_xgb <- calc_rmse(modelo_xgb, test_ready)
# Stepwise (base lm)
pred_step <- predict(modelo_step, newdata = test_ready)
rmse_step <- rmse(test_ready$Weekly_Sales, pred_step)
# Extract the relevant information from each model
# For lm and stepwise
vars_lm <- names(coef(modelo_lm$finalModel))
vars_step <- names(coef(modelo_step))
# For the glmnet models (coefficients evaluated at the selected lambda)
vars_ridge <- rownames(coef(modelo_ridge$finalModel, s = modelo_ridge$bestTune$lambda))
coef_lasso <- coef(modelo_lasso$finalModel, s = modelo_lasso$bestTune$lambda)
vars_lasso <- rownames(coef_lasso)[coef_lasso[, 1] != 0] # keep only non-zero coefficients
coef_enet <- coef(modelo_enet$finalModel, s = modelo_enet$bestTune$lambda)
vars_enet <- rownames(coef_enet)[coef_enet[, 1] != 0]
# For the remaining models, include all predictor variables
vars_rf <- modelo_rf$finalModel$xNames
vars_svr <- names(test_ready)[-which(names(test_ready) == "Weekly_Sales")]
vars_knn <- names(test_ready)[-which(names(test_ready) == "Weekly_Sales")]
vars_xgb <- names(test_ready)[-which(names(test_ready) == "Weekly_Sales")]
# Assemble the results table
resultados <- tibble(
Id = 1:9,
Modelo = c(
"Regresión Lineal",
"Ridge",
"LASSO",
"Elastic Net",
"Stepwise",
"Random Forest",
"SVR",
"KNN",
"XGBoost"
),
Hyperparametros = c(
NA,
paste("lambda=", modelo_ridge$bestTune$lambda),
paste("lambda=", modelo_lasso$bestTune$lambda),
paste("alpha=", modelo_enet$bestTune$alpha, "lambda=", modelo_enet$bestTune$lambda),
NA,
paste("mtry=", modelo_rf$bestTune$mtry),
paste("C=", modelo_svr$bestTune$C, "sigma=", modelo_svr$bestTune$sigma),
paste("k=", modelo_knn$bestTune$kmax),
paste("nrounds=", modelo_xgb$bestTune$nrounds)
),
RMSE = c(
rmse_lm,
rmse_ridge,
rmse_lasso,
rmse_enet,
rmse_step,
rmse_rf,
rmse_svr,
rmse_knn,
rmse_xgb
),
Variables = list(
vars_lm,
vars_ridge,
vars_lasso,
vars_enet,
vars_step,
vars_rf,
vars_svr,
vars_knn,
vars_xgb
)
)
# Sort by RMSE and assign tags
resultados <- resultados %>%
arrange(RMSE) %>%
mutate(
Tag = case_when(
row_number() == 1 ~ "Champion",
row_number() <= 3 ~ "Challenger",
TRUE ~ "Deprecated"
)
)
print(resultados)
## # A tibble: 9 × 6
## Id Modelo Hyperparametros RMSE Variables Tag
## <int> <chr> <chr> <dbl> <list> <chr>
## 1 9 XGBoost "nrounds= 150" 8.94e4 <chr> Cham…
## 2 6 Random Forest "mtry= 17" 1.15e5 <chr> Chal…
## 3 8 KNN "k= " 4.54e5 <chr> Chal…
## 4 2 Ridge "lambda= 1" 5.29e5 <chr> Depr…
## 5 4 Elastic Net "alpha= 0.9 lambda= 89.44116466… 5.30e5 <chr> Depr…
## 6 3 LASSO "lambda= 1" 5.30e5 <chr> Depr…
## 7 1 Regresión Lineal <NA> 5.30e5 <chr> Depr…
## 8 5 Stepwise <NA> 5.30e5 <chr> Depr…
## 9 7 SVR "C= 1 sigma= 0.0798933042025618" 5.66e5 <chr> Depr…
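A quick bar chart (a sketch using the resultados table just printed) makes the gap between the tree ensembles and the linear models immediately visible.
# Sketch: test RMSE by model, colored by tag
ggplot(resultados, aes(x = reorder(Modelo, RMSE), y = RMSE, fill = Tag)) +
geom_col() +
coord_flip() +
labs(title = "Test RMSE by model", x = NULL, y = "RMSE (test)")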
# Assume modelo_xgb is the Champion; change the name if your Champion differs.
champion_pred <- predict(modelo_xgb, newdata = test_ready)
champion_real <- test_ready$Weekly_Sales
champion_resid <- champion_real - champion_pred
# If your Champion were another model, e.g. modelo_rf, use:
# champion_pred <- predict(modelo_rf, newdata = test_ready)
# champion_real <- test_ready$Weekly_Sales
# champion_resid <- champion_real - champion_pred
# Data frame for the plot
df_plot <- tibble(
Real = champion_real,
Prediccion = champion_pred
)
ggplot(df_plot, aes(x = Real, y = Prediccion)) +
geom_point(alpha = 0.4, color = "dodgerblue") +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(
title = "Predicción vs Valor Real",
x = "Valor Real",
y = "Predicción"
) +
theme_minimal()
df_residuos <- tibble(
Prediccion = champion_pred,
Residuo = champion_resid
)
ggplot(df_residuos, aes(x = Prediccion, y = Residuo)) +
geom_point(alpha = 0.4, color = "tomato") +
geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
labs(
title = "Residuos vs Predicción",
x = "Predicción",
y = "Residuo"
) +
theme_minimal()
ggplot(df_residuos, aes(x = Residuo)) +
geom_histogram(bins = 30, fill = "purple", color = "white", alpha = 0.7) +
labs(
title = "Histograma de residuos",
x = "Residuo",
y = "Frecuencia"
) +
theme_minimal()
The XGBoost model turned out to be the "Champion", achieving the lowest RMSE on the test set (≈ 8.94e4 per the results table above).
It clearly outperforms the rest of the models, especially the traditional ones (multiple linear regression and the penalized models) as well as the other machine-learning algorithms such as Random Forest and SVR.
The second-best model was Random Forest (RMSE ≈ 1.15e5), and KNN placed as a Challenger, although at a considerable distance from the Champion.
The linear models (regression, ridge, LASSO, elastic net, and stepwise) performed far worse, showing that the relationship between the predictors and weekly sales is highly non-linear and complex, which justifies the use of boosting models.
The plot shows the points tightly aligned along the red diagonal line, indicating that the Champion model's predictions are very close to the actual values in most cases.
Few points lie far from the line, indicating a small number of outliers or large errors.
The residuals are distributed symmetrically around zero, concentrated at low and mid-range predictions.
No systematic pattern is visible (no curvature or clear vertical "bands"), which indicates the model leaves no major structural bias unexplained.
There is some dispersion at high predictions, but that is expected given the range of weekly sales (natural outliers in retail).
The residuals follow an approximately normal distribution centered at zero, with most errors below ±100,000.
The right tail reflects a few cases of atypically high or low sales where the model does not fully succeed, but these are rare.
The Champion model (XGBoost) captures the complexity of the problem very well; it is robust to weekly variation and able to explain the variability in sales even in the presence of external and seasonal factors.
For production use or real recommendations, we suggest keeping this model while monitoring outliers (extremely high/low sales) and investigating atypical cases; the importance sketch below is a natural starting point for that monitoring.
The poor performance of the linear models indicates that the relationship between the predictors and sales is complex and likely depends on interactions, not just simple main effects.
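As a first step toward that monitoring, caret's varImp() can rank the predictors driving the Champion (a sketch; for xgbTree it is based on xgboost's own importance measure).
# Sketch: top predictors for the Champion model
importancia_xgb <- varImp(modelo_xgb)
plot(importancia_xgb, top = 10)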
write_csv(resultados, "resultados_modelos_walmart.csv")