Lab 3: Walmart Sales Prediction

Nombre: Francisco Gonzalez

Carnet: 24002914

1. EDA – Exploratory Data Analysis

1.1. Cargar paquetes y datos

library(tidyverse)   # Incluye dplyr, ggplot2, etc.

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(recipes)

## 
## Attaching package: 'recipes'
## 
## The following object is masked from 'package:stringr':
## 
##     fixed
## 
## The following object is masked from 'package:stats':
## 
##     step

library(MASS)        # Esto puede causar conflicto con select()

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(Metrics)     # Para RMSE

## 
## Attaching package: 'Metrics'
## 
## The following objects are masked from 'package:caret':
## 
##     precision, recall

# función select es la de dplyr
select <- dplyr::select


# lectura del archivo
walmart <- read_csv("walmart.csv")

## Rows: 6435 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Date
## dbl (7): Store, Weekly_Sales, Holiday_Flag, Temperature, Fuel_Price, CPI, Un...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Estructura general
glimpse(walmart)

## Rows: 6,435
## Columns: 8
## $ Store        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Date         <chr> "05-02-2010", "12-02-2010", "19-02-2010", "26-02-2010", "…
## $ Weekly_Sales <dbl> 1643691, 1641957, 1611968, 1409728, 1554807, 1439542, 147…
## $ Holiday_Flag <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Temperature  <dbl> 42.31, 38.51, 39.93, 46.63, 46.50, 57.79, 54.58, 51.45, 6…
## $ Fuel_Price   <dbl> 2.572, 2.548, 2.514, 2.561, 2.625, 2.667, 2.720, 2.732, 2…
## $ CPI          <dbl> 211.0964, 211.2422, 211.2891, 211.3196, 211.3501, 211.380…
## $ Unemployment <dbl> 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 7…

summary(walmart)

##      Store        Date            Weekly_Sales      Holiday_Flag    
##  Min.   : 1   Length:6435        Min.   : 209986   Min.   :0.00000  
##  1st Qu.:12   Class :character   1st Qu.: 553350   1st Qu.:0.00000  
##  Median :23   Mode  :character   Median : 960746   Median :0.00000  
##  Mean   :23                      Mean   :1046965   Mean   :0.06993  
##  3rd Qu.:34                      3rd Qu.:1420159   3rd Qu.:0.00000  
##  Max.   :45                      Max.   :3818686   Max.   :1.00000  
##   Temperature       Fuel_Price         CPI         Unemployment   
##  Min.   : -2.06   Min.   :2.472   Min.   :126.1   Min.   : 3.879  
##  1st Qu.: 47.46   1st Qu.:2.933   1st Qu.:131.7   1st Qu.: 6.891  
##  Median : 62.67   Median :3.445   Median :182.6   Median : 7.874  
##  Mean   : 60.66   Mean   :3.359   Mean   :171.6   Mean   : 7.999  
##  3rd Qu.: 74.94   3rd Qu.:3.735   3rd Qu.:212.7   3rd Qu.: 8.622  
##  Max.   :100.14   Max.   :4.468   Max.   :227.2   Max.   :14.313

1.2. Gráficas de densidad para variables continuas

# Variables continuas a analizar
vars_cont <- c("Weekly_Sales", "Temperature", "Fuel_Price", "CPI", "Unemployment")

# Gráficas de densidad
walmart %>%
  pivot_longer(all_of(vars_cont), names_to = "variable", values_to = "valor") %>%
  ggplot(aes(x = valor, fill = variable)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Gráficas de densidad para variables continuas")

1.3. Gráficas de barras para Holiday_Flag y Store + boxplot con Weekly_Sales

# Gráfica de barras para Holiday_Flag
ggplot(walmart, aes(x = as.factor(Holiday_Flag))) +
  geom_bar(fill = "skyblue") +
  labs(title = "Cantidad de semanas con y sin feriado", x = "Holiday_Flag", y = "Cantidad")

# Gráfica de barras para Store
ggplot(walmart, aes(x = as.factor(Store))) +
  geom_bar(fill = "orange") +
  labs(title = "Cantidad de registros por tienda", x = "Store", y = "Cantidad")

# Boxplot: Weekly_Sales vs Holiday_Flag
ggplot(walmart, aes(x = as.factor(Holiday_Flag), y = Weekly_Sales, fill = as.factor(Holiday_Flag))) +
  geom_boxplot() +
  labs(title = "Distribución de ventas por semanas con/ sin feriado", x = "Holiday_Flag")

1.4. Serie temporal de Weekly_Sales

# Convertimos la fecha a formato Date
walmart <- walmart %>%
  mutate(Date = dmy(Date))

# Serie temporal
ggplot(walmart, aes(x = Date, y = Weekly_Sales)) +
  geom_line(color = "dodgerblue") +
  labs(title = "Ventas semanales a lo largo del tiempo", x = "Fecha", y = "Weekly Sales")

1.5. Matriz de correlación entre variables continuas

cor_data <- walmart %>%
  select(all_of(vars_cont)) %>%
  na.omit()
ggcorr(cor_data, label = TRUE, label_round = 2, hjust = 0.75, size = 3) +
  labs(title = "Matriz de correlación entre variables continuas")

Conclusiones del EDA

A) Análisis de distribución de semanas con y sin feriado

La gran mayoría de las semanas en el dataset no corresponden a semanas con feriado (Holiday_Flag = 0). Solamente un pequeño porcentaje representa semanas con feriado (Holiday_Flag = 1). Esto indica que los feriados son eventos poco frecuentes, por lo que su efecto debe analizarse cuidadosamente ya que los datos están desbalanceados respecto a esta variable.

B) Registros por tienda

El número de registros es muy uniforme entre tiendas: cada tienda tiene aproximadamente la misma cantidad de semanas de ventas reportadas. Esto facilita los análisis comparativos y reduce el riesgo de sesgos por falta de datos en alguna tienda en particular.

C) Ventas en semanas con y sin feriado

El boxplot revela que, en promedio, las semanas con feriado presentan ventas semanales mayores respecto a las semanas sin feriado. Sin embargo, también hay más variabilidad y valores atípicos en semanas regulares. Esto sugiere que los feriados pueden ser un factor importante para explicar picos de ventas, pero no son el único.

D) Evolución temporal de ventas

El gráfico de serie temporal muestra una fuerte variabilidad en las ventas semanales, con algunos picos muy notorios en ciertas fechas (posiblemente asociados a eventos especiales o temporadas altas, como Black Friday o Navidad). Se observa cierta estacionalidad, aunque no perfectamente cíclica.

E) Correlación entre variables continuas

La matriz de correlación evidencia que no existen relaciones lineales fuertes entre las variables numéricas continuas y las ventas (Weekly_Sales).

La mayor correlación (en valor absoluto) es entre Unemployment y CPI (-0.3).
Entre las variables y las ventas, todas las correlaciones están entre -0.11 y +0.14, es decir, muy bajas.
Esto indica que, para explicar las ventas, es probable que se requieran modelos no lineales y que consideren interacciones entre variables.

F) Distribución de variables continuas

Las densidades muestran que:

CPI (Índice de Precios al Consumidor) y Unemployment tienen distribuciones multimodales, posiblemente debido a cambios económicos o a diferencias entre regiones/tiendas.
Fuel_Price y Temperature presentan distribuciones bimodales y asimétricas, reflejando la variabilidad geográfica y estacional.
Weekly_Sales tiene una distribución fuertemente asimétrica a la derecha, con una mayoría de ventas “normales” y algunos valores extremos (outliers de ventas altas).
La alta dispersión y presencia de outliers en ventas justifica el uso de métricas robustas y modelos avanzados en la predicción.

Modelos de Regresión

2.1 Partición de datos Train/Test

set.seed(123) # Para reproducibilidad
trainIndex <- createDataPartition(walmart$Weekly_Sales, p = 0.8, list = FALSE)
walmart_train <- walmart[trainIndex, ]
walmart_test  <- walmart[-trainIndex, ]

# Revisa proporciones
nrow(walmart_train)

## [1] 5151

nrow(walmart_test)

## [1] 1284

2.2. Ingeniería de variables sobre Date (con `lubridate`)

# Descomposición de la fecha y creación de nuevas variables
walmart_train <- walmart_train %>%
  mutate(
    year = year(Date),
    month = month(Date),
    day = day(Date),
    week = week(Date),
    weekday = wday(Date, label = TRUE),
    is_weekend = ifelse(weekday %in% c("Sat", "Sun"), 1, 0)
  ) %>%
  select(-Date) # Eliminamos la columna Date original

walmart_test <- walmart_test %>%
  mutate(
    year = year(Date),
    month = month(Date),
    day = day(Date),
    week = week(Date),
    weekday = wday(Date, label = TRUE),
    is_weekend = ifelse(weekday %in% c("Sat", "Sun"), 1, 0)
  ) %>%
  select(-Date)

2.3. Pipeline de preprocesamiento con `recipes`

# Creamos el recipe
receta <- recipe(Weekly_Sales ~ ., data = walmart_train) %>%
  # 1. Imputación de valores faltantes
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  # 2. Codificación de variables categóricas
  step_dummy(all_nominal_predictors()) %>%
  # 3. Tratamiento de outliers (Winsorización por ejemplo)
  step_mutate_at(all_numeric_predictors(), fn = ~scales::squish(.x, quantile(.x, c(0.01, 0.99)))) %>%
  # 4. Transformación de variables continuas (normalización)
  step_YeoJohnson(all_numeric_predictors()) %>%
  # 5. Escalado de variables
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# Preparamos el recipe con el train
receta_prep <- prep(receta, training = walmart_train)

# Aplicamos la receta a train y test
train_ready <- bake(receta_prep, walmart_train)
test_ready  <- bake(receta_prep, walmart_test)

2.4. Definir control de entrenamiento y validación cruzada (repeatedcv)

# Configuración de cross-validation repetida
ctrl <- trainControl(
  method = "repeatedcv",
  number = 5, # número de folds
  repeats = 3, # repeticiones
  verboseIter = TRUE
)

2.5. Entrenamiento de modelos (usando `caret`)

2.5.1. Regresión Lineal Múltiple

set.seed(123)
modelo_lm <- train(
  Weekly_Sales ~ ., 
  data = train_ready,
  method = "lm",
  trControl = ctrl,
  metric = "RMSE"
)

print(modelo_lm)

## Linear Regression 
## 
## 5151 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   520584.6  0.1534752  426649.2
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

2.5.2. Ridge Regression (`glmnet`)

set.seed(123)
modelo_ridge <- train(
  Weekly_Sales ~ ., 
  data = train_ready,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 0, lambda = seq(0.0001, 1, length = 10)),
  trControl = ctrl,
  metric = "RMSE"
)

print(modelo_ridge)

## glmnet 
## 
## 5151 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.0001  520651.9  0.1532886  427234.1
##   0.1112  520651.9  0.1532886  427234.1
##   0.2223  520651.9  0.1532886  427234.1
##   0.3334  520651.9  0.1532886  427234.1
##   0.4445  520651.9  0.1532886  427234.1
##   0.5556  520651.9  0.1532886  427234.1
##   0.6667  520651.9  0.1532886  427234.1
##   0.7778  520651.9  0.1532886  427234.1
##   0.8889  520651.9  0.1532886  427234.1
##   1.0000  520651.9  0.1532886  427234.1
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 1.

2.5.3. LASSO Regression

set.seed(123)
modelo_lasso <- train(
  Weekly_Sales ~ ., 
  data = train_ready,
  method = "glmnet",
  tuneGrid = expand.grid(alpha = 1, lambda = seq(0.0001, 1, length = 10)),
  trControl = ctrl,
  metric = "RMSE"
)

print(modelo_lasso)

## glmnet 
## 
## 5151 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.0001  520604.4  0.1533199  426800.2
##   0.1112  520604.4  0.1533199  426800.2
##   0.2223  520604.4  0.1533199  426800.2
##   0.3334  520604.4  0.1533199  426800.2
##   0.4445  520604.4  0.1533199  426800.2
##   0.5556  520604.4  0.1533199  426800.2
##   0.6667  520604.4  0.1533199  426800.2
##   0.7778  520604.4  0.1533199  426800.2
##   0.8889  520604.4  0.1533199  426800.2
##   1.0000  520604.4  0.1533199  426800.2
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 1.

2.5.4. Elastic Net

set.seed(123)
modelo_enet <- train(
  Weekly_Sales ~ ., 
  data = train_ready,
  method = "glmnet",
  tuneLength = 10,
  trControl = ctrl,
  metric = "RMSE"
)

print(modelo_enet)

## glmnet 
## 
## 5151 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE      Rsquared   MAE     
##   0.1        89.44116  520654.6  0.1531705  426859.1
##   0.1       206.62069  520653.4  0.1531742  426858.3
##   0.1       477.32059  520629.8  0.1532437  426844.2
##   0.1      1102.67248  520611.3  0.1532964  426833.6
##   0.1      2547.31645  520611.9  0.1532967  426854.3
##   0.1      5884.63139  520629.7  0.1532625  426934.2
##   0.1     13594.26174  520696.7  0.1531903  427133.6
##   0.1     31404.50781  521023.9  0.1528662  427732.8
##   0.1     72548.48620  522311.8  0.1519171  429769.2
##   0.1    167596.41266  526686.7  0.1490333  435560.0
##   0.2        89.44116  520673.6  0.1531194  426862.3
##   0.2       206.62069  520652.0  0.1531834  426849.1
##   0.2       477.32059  520625.6  0.1532610  426833.0
##   0.2      1102.67248  520606.4  0.1533148  426821.9
##   0.2      2547.31645  520619.1  0.1532806  426855.4
##   0.2      5884.63139  520655.6  0.1532084  426946.8
##   0.2     13594.26174  520810.9  0.1529502  427205.0
##   0.2     31404.50781  521394.0  0.1521987  428049.9
##   0.2     72548.48620  523486.0  0.1501696  431049.9
##   0.2    167596.41266  531310.7  0.1375391  439885.2
##   0.3        89.44116  520643.5  0.1532095  426840.5
##   0.3       206.62069  520629.6  0.1532492  426831.2
##   0.3       477.32059  520613.8  0.1532942  426821.2
##   0.3      1102.67248  520608.2  0.1533112  426819.6
##   0.3      2547.31645  520628.2  0.1532593  426859.1
##   0.3      5884.63139  520692.6  0.1531231  426966.8
##   0.3     13594.26174  520953.0  0.1526353  427303.6
##   0.3     31404.50781  521796.3  0.1514775  428416.9
##   0.3     72548.48620  525006.7  0.1471976  432734.4
##   0.3    167596.41266  535209.3  0.1267333  443284.2
##   0.4        89.44116  520607.1  0.1533116  426811.9
##   0.4       206.62069  520607.1  0.1533116  426811.9
##   0.4       477.32059  520607.1  0.1533116  426811.9
##   0.4      1102.67248  520610.8  0.1533052  426819.3
##   0.4      2547.31645  520639.8  0.1532309  426864.8
##   0.4      5884.63139  520741.1  0.1530045  426996.1
##   0.4     13594.26174  521094.0  0.1523292  427404.3
##   0.4     31404.50781  522211.4  0.1507742  428870.9
##   0.4     72548.48620  526950.9  0.1423497  434707.9
##   0.4    167596.41266  538334.3  0.1188935  446079.7
##   0.5        89.44116  520609.4  0.1533052  426810.9
##   0.5       206.62069  520609.4  0.1533052  426810.9
##   0.5       477.32059  520609.4  0.1533052  426810.9
##   0.5      1102.67248  520614.2  0.1532965  426820.3
##   0.5      2547.31645  520654.3  0.1531940  426873.1
##   0.5      5884.63139  520796.7  0.1528673  427033.4
##   0.5     13594.26174  521231.7  0.1520508  427501.5
##   0.5     31404.50781  522684.6  0.1499140  429427.6
##   0.5     72548.48620  529042.2  0.1364896  436651.2
##   0.5    167596.41266  540383.5  0.1187296  448324.4
##   0.6        89.44116  520609.5  0.1533042  426808.9
##   0.6       206.62069  520609.5  0.1533042  426808.9
##   0.6       477.32059  520609.6  0.1533040  426809.0
##   0.6      1102.67248  520618.2  0.1532865  426821.8
##   0.6      2547.31645  520671.8  0.1531482  426883.3
##   0.6      5884.63139  520857.7  0.1527159  427073.5
##   0.6     13594.26174  521378.0  0.1517581  427623.8
##   0.6     31404.50781  523252.8  0.1487480  430102.7
##   0.6     72548.48620  530704.6  0.1322401  438184.6
##   0.6    167596.41266  542786.0  0.1187296  450772.8
##   0.7        89.44116  520605.6  0.1533151  426803.8
##   0.7       206.62069  520605.6  0.1533151  426803.8
##   0.7       477.32059  520611.0  0.1533021  426808.9
##   0.7      1102.67248  520621.9  0.1532773  426823.1
##   0.7      2547.31645  520691.7  0.1530953  426895.1
##   0.7      5884.63139  520921.4  0.1525568  427117.6
##   0.7     13594.26174  521519.5  0.1514946  427751.4
##   0.7     31404.50781  523917.7  0.1472371  430871.9
##   0.7     72548.48620  532171.4  0.1288016  439598.7
##   0.7    167596.41266  545615.1  0.1187296  453455.7
##   0.8        89.44116  520605.1  0.1533170  426802.4
##   0.8       206.62069  520606.2  0.1533140  426803.3
##   0.8       477.32059  520613.4  0.1532952  426809.7
##   0.8      1102.67248  520626.5  0.1532658  426825.0
##   0.8      2547.31645  520714.9  0.1530324  426908.2
##   0.8      5884.63139  520977.5  0.1524233  427155.4
##   0.8     13594.26174  521670.6  0.1512133  427903.6
##   0.8     31404.50781  524665.0  0.1454147  431701.7
##   0.8     72548.48620  533838.8  0.1242849  441143.5
##   0.8    167596.41266  548927.8  0.1187296  456430.3
##   0.9        89.44116  520604.1  0.1533201  426800.7
##   0.9       206.62069  520606.8  0.1533122  426803.1
##   0.9       477.32059  520614.4  0.1532929  426809.4
##   0.9      1102.67248  520630.5  0.1532563  426826.8
##   0.9      2547.31645  520736.6  0.1529757  426920.4
##   0.9      5884.63139  521025.8  0.1523178  427187.2
##   0.9     13594.26174  521828.9  0.1509197  428078.4
##   0.9     31404.50781  525486.5  0.1432790  432576.0
##   0.9     72548.48620  535402.3  0.1200413  442601.8
##   0.9    167596.41266  552787.5  0.1187296  459725.5
##   1.0        89.44116  520604.4  0.1533199  426800.2
##   1.0       206.62069  520608.0  0.1533093  426803.3
##   1.0       477.32059  520615.5  0.1532902  426809.4
##   1.0      1102.67248  520635.2  0.1532446  426829.1
##   1.0      2547.31645  520761.0  0.1529106  426935.6
##   1.0      5884.63139  521073.1  0.1522192  427214.3
##   1.0     13594.26174  522001.7  0.1505882  428279.3
##   1.0     31404.50781  526350.9  0.1409330  433428.5
##   1.0     72548.48620  536379.7  0.1187766  443706.7
##   1.0    167596.41266  557268.5  0.1187296  463325.6
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.9 and lambda = 89.44116.

2.5.5. Stepwise Regression

# Stepwise con both directions (usando MASS::stepAIC)
modelo_base <- lm(Weekly_Sales ~ ., data = train_ready)
modelo_step <- stepAIC(modelo_base, direction = "both", trace = FALSE)

summary(modelo_step)

## 
## Call:
## lm(formula = Weekly_Sales ~ Store + Holiday_Flag + Temperature + 
##     Fuel_Price + CPI + Unemployment + year + month + day + week, 
##     data = train_ready)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1063299  -377392   -43437   369674  2586887 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1048203       7250 144.583  < 2e-16 ***
## Store         -208447       7614 -27.375  < 2e-16 ***
## Holiday_Flag    11836       7516   1.575 0.115368    
## Temperature    -50490       8739  -5.778 8.02e-09 ***
## Fuel_Price      38713      13127   2.949 0.003202 ** 
## CPI            -74806       8307  -9.005  < 2e-16 ***
## Unemployment   -31303       8203  -3.816 0.000137 ***
## year           -21095      13660  -1.544 0.122591    
## month         -408272     229079  -1.782 0.074770 .  
## day            -52314      21888  -2.390 0.016883 *  
## week           462830     231649   1.998 0.045772 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 520300 on 5140 degrees of freedom
## Multiple R-squared:  0.1551, Adjusted R-squared:  0.1535 
## F-statistic: 94.38 on 10 and 5140 DF,  p-value: < 2.2e-16

2.5.6. Random Forest

set.seed(123)
modelo_rf <- train(
  Weekly_Sales ~ ., 
  data = train_ready,
  method = "rf",
  trControl = ctrl,
  metric = "RMSE"
)

print(modelo_rf)

## Random Forest 
## 
## 5151 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    2    482142.8  0.5424033  405112.36
##    9    119318.5  0.9568783   62647.24
##   17    118972.9  0.9557760   59681.82
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 17.

2.5.7. SVR (Support Vector Regression)

set.seed(123)
modelo_svr <- train(
  Weekly_Sales ~ ., 
  data = train_ready,
  method = "svmRadial",
  trControl = ctrl,
  metric = "RMSE"
)

print(modelo_svr)

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 5151 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE      Rsquared   MAE     
##   0.25  572115.4  0.1506667  464026.1
##   0.50  572113.4  0.1506667  464023.1
##   1.00  572108.7  0.1506667  464017.1
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0798933
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.0798933 and C = 1.

2.5.8. KNN

set.seed(123)
modelo_knn <- train(
  Weekly_Sales ~ ., 
  data = train_ready,
  method = "knn",
  tuneLength = 10,
  trControl = ctrl,
  metric = "RMSE"
)

print(modelo_knn)

## k-Nearest Neighbors 
## 
## 5151 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   MAE     
##    5  463421.6  0.3316778  360516.7
##    7  461045.4  0.3358834  366385.4
##    9  464005.9  0.3285318  373437.1
##   11  466916.8  0.3223288  379084.5
##   13  468907.1  0.3194696  383870.7
##   15  469793.2  0.3201420  386466.7
##   17  471319.0  0.3185307  388772.3
##   19  473108.0  0.3159907  391151.2
##   21  474550.9  0.3143294  393127.0
##   23  475918.0  0.3122234  394860.2
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.

2.5.9. XGBoost

set.seed(123)
modelo_xgb <- train(
  Weekly_Sales ~ ., 
  data = train_ready,
  method = "xgbTree",
  trControl = ctrl,
  metric = "RMSE"
)

print(modelo_xgb)

## eXtreme Gradient Boosting 
## 
## 5151 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 4119, 4122, 4120, 4122, 4121, 4120, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  RMSE      Rsquared 
##   0.3  1          0.6               0.50        50      446555.9  0.3993904
##   0.3  1          0.6               0.50       100      415331.6  0.4863203
##   0.3  1          0.6               0.50       150      393650.0  0.5415892
##   0.3  1          0.6               0.75        50      447020.8  0.4013148
##   0.3  1          0.6               0.75       100      416928.4  0.4841730
##   0.3  1          0.6               0.75       150      396638.9  0.5358938
##   0.3  1          0.6               1.00        50      448516.2  0.3983143
##   0.3  1          0.6               1.00       100      420528.2  0.4752059
##   0.3  1          0.6               1.00       150      401637.0  0.5256644
##   0.3  1          0.8               0.50        50      442420.1  0.4126977
##   0.3  1          0.8               0.50       100      409037.5  0.5066466
##   0.3  1          0.8               0.50       150      384486.3  0.5707171
##   0.3  1          0.8               0.75        50      444506.2  0.4115379
##   0.3  1          0.8               0.75       100      411502.6  0.5007406
##   0.3  1          0.8               0.75       150      389213.2  0.5587995
##   0.3  1          0.8               1.00        50      445771.4  0.4097926
##   0.3  1          0.8               1.00       100      415017.6  0.4960528
##   0.3  1          0.8               1.00       150      393794.9  0.5492864
##   0.3  2          0.6               0.50        50      260785.8  0.8281261
##   0.3  2          0.6               0.50       100      178784.8  0.9149293
##   0.3  2          0.6               0.50       150      146512.8  0.9396387
##   0.3  2          0.6               0.75        50      247989.1  0.8467994
##   0.3  2          0.6               0.75       100      168349.5  0.9249603
##   0.3  2          0.6               0.75       150      135665.2  0.9477143
##   0.3  2          0.6               1.00        50      256641.3  0.8375185
##   0.3  2          0.6               1.00       100      169721.5  0.9243536
##   0.3  2          0.6               1.00       150      136976.8  0.9464841
##   0.3  2          0.8               0.50        50      240110.2  0.8535609
##   0.3  2          0.8               0.50       100      163088.2  0.9283745
##   0.3  2          0.8               0.50       150      137565.3  0.9459400
##   0.3  2          0.8               0.75        50      238823.3  0.8567059
##   0.3  2          0.8               0.75       100      160151.4  0.9311280
##   0.3  2          0.8               0.75       150      133131.8  0.9491600
##   0.3  2          0.8               1.00        50      229285.3  0.8701582
##   0.3  2          0.8               1.00       100      151421.8  0.9382400
##   0.3  2          0.8               1.00       150      126917.6  0.9532285
##   0.3  3          0.6               0.50        50      167716.4  0.9240753
##   0.3  3          0.6               0.50       100      126686.3  0.9524459
##   0.3  3          0.6               0.50       150      117841.2  0.9577752
##   0.3  3          0.6               0.75        50      158477.9  0.9321882
##   0.3  3          0.6               0.75       100      118281.9  0.9583969
##   0.3  3          0.6               0.75       150      110371.3  0.9628705
##   0.3  3          0.6               1.00        50      160259.9  0.9313016
##   0.3  3          0.6               1.00       100      117579.2  0.9590085
##   0.3  3          0.6               1.00       150      109269.7  0.9636112
##   0.3  3          0.8               0.50        50      151624.0  0.9381105
##   0.3  3          0.8               0.50       100      120132.1  0.9570055
##   0.3  3          0.8               0.50       150      114207.8  0.9602070
##   0.3  3          0.8               0.75        50      146453.6  0.9421535
##   0.3  3          0.8               0.75       100      114360.7  0.9608240
##   0.3  3          0.8               0.75       150      109158.9  0.9635395
##   0.3  3          0.8               1.00        50      142825.1  0.9447975
##   0.3  3          0.8               1.00       100      111703.0  0.9624581
##   0.3  3          0.8               1.00       150      104569.6  0.9664854
##   0.4  1          0.6               0.50        50      432465.3  0.4370758
##   0.4  1          0.6               0.50       100      398705.0  0.5299983
##   0.4  1          0.6               0.50       150      374490.3  0.5881341
##   0.4  1          0.6               0.75        50      434490.7  0.4365939
##   0.4  1          0.6               0.75       100      401657.6  0.5233837
##   0.4  1          0.6               0.75       150      378208.4  0.5810694
##   0.4  1          0.6               1.00        50      434658.8  0.4362409
##   0.4  1          0.6               1.00       100      403686.8  0.5203349
##   0.4  1          0.6               1.00       150      383138.7  0.5731912
##   0.4  1          0.8               0.50        50      428932.7  0.4505511
##   0.4  1          0.8               0.50       100      390587.9  0.5528308
##   0.4  1          0.8               0.50       150      363118.9  0.6204952
##   0.4  1          0.8               0.75        50      431011.9  0.4478431
##   0.4  1          0.8               0.75       100      394343.3  0.5451558
##   0.4  1          0.8               0.75       150      367882.9  0.6109078
##   0.4  1          0.8               1.00        50      432287.6  0.4486303
##   0.4  1          0.8               1.00       100      396811.8  0.5409369
##   0.4  1          0.8               1.00       150      372554.4  0.6032543
##   0.4  2          0.6               0.50        50      219175.5  0.8749445
##   0.4  2          0.6               0.50       100      153534.7  0.9329504
##   0.4  2          0.6               0.50       150      134575.1  0.9463124
##   0.4  2          0.6               0.75        50      213367.4  0.8826329
##   0.4  2          0.6               0.75       100      145120.1  0.9405726
##   0.4  2          0.6               0.75       150      126578.8  0.9522186
##   0.4  2          0.6               1.00        50      217442.3  0.8780528
##   0.4  2          0.6               1.00       100      144523.1  0.9410373
##   0.4  2          0.6               1.00       150      124482.2  0.9538141
##   0.4  2          0.8               0.50        50      202273.7  0.8923482
##   0.4  2          0.8               0.50       100      145300.6  0.9399452
##   0.4  2          0.8               0.50       150      128625.0  0.9505953
##   0.4  2          0.8               0.75        50      194227.0  0.9028682
##   0.4  2          0.8               0.75       100      136403.2  0.9467432
##   0.4  2          0.8               0.75       150      121709.4  0.9556857
##   0.4  2          0.8               1.00        50      191305.5  0.9050270
##   0.4  2          0.8               1.00       100      132505.1  0.9490966
##   0.4  2          0.8               1.00       150      118821.3  0.9574491
##   0.4  3          0.6               0.50        50      146269.3  0.9386306
##   0.4  3          0.6               0.50       100      123485.6  0.9535004
##   0.4  3          0.6               0.50       150      118923.5  0.9563225
##   0.4  3          0.6               0.75        50      147441.5  0.9378626
##   0.4  3          0.6               0.75       100      118593.5  0.9571174
##   0.4  3          0.6               0.75       150      113107.8  0.9604277
##   0.4  3          0.6               1.00        50      137347.5  0.9464375
##   0.4  3          0.6               1.00       100      111179.6  0.9622910
##   0.4  3          0.6               1.00       150      105604.1  0.9655954
##   0.4  3          0.8               0.50        50      135431.3  0.9467505
##   0.4  3          0.8               0.50       100      119453.8  0.9562407
##   0.4  3          0.8               0.50       150      115849.7  0.9584072
##   0.4  3          0.8               0.75        50      127129.5  0.9530758
##   0.4  3          0.8               0.75       100      112338.0  0.9612504
##   0.4  3          0.8               0.75       150      107796.0  0.9639973
##   0.4  3          0.8               1.00        50      125585.5  0.9545457
##   0.4  3          0.8               1.00       100      107614.6  0.9646061
##   0.4  3          0.8               1.00       150      103209.5  0.9670986
##   MAE      
##   366355.91
##   341288.28
##   323095.98
##   366178.78
##   342168.85
##   325320.59
##   367806.72
##   344979.20
##   329874.45
##   362768.62
##   335665.85
##   315236.01
##   363747.25
##   337323.31
##   319166.01
##   364733.17
##   340202.04
##   322756.73
##   205916.69
##   131973.27
##   102643.30
##   195447.84
##   122837.04
##    93193.70
##   202566.09
##   123174.36
##    92968.58
##   186619.87
##   118013.77
##    95030.11
##   185250.40
##   114660.80
##    90109.45
##   176821.49
##   106233.57
##    84081.76
##   120585.16
##    83403.49
##    74608.29
##   112818.88
##    76669.14
##    68993.02
##   114989.76
##    76110.94
##    68340.25
##   106884.97
##    77197.36
##    71002.56
##   102269.11
##    72686.46
##    66866.50
##    99019.05
##    70715.49
##    64560.74
##   355027.79
##   327174.83
##   306291.40
##   356726.91
##   329941.51
##   310157.36
##   356081.17
##   331469.93
##   314159.67
##   351042.10
##   319816.73
##   296961.87
##   352998.60
##   323344.29
##   300982.79
##   354139.21
##   325068.59
##   305125.62
##   167983.92
##   108951.59
##    91604.12
##   162399.92
##   100792.75
##    84009.26
##   166242.79
##    99871.47
##    81892.93
##   153368.33
##   101842.49
##    85941.61
##   145089.82
##    93163.44
##    79881.64
##   142175.99
##    89056.85
##    76688.78
##   101450.42
##    79080.28
##    74235.96
##   102213.95
##    76449.28
##    70668.11
##    94562.44
##    70829.53
##    65798.32
##    92585.74
##    75796.33
##    71618.40
##    83839.58
##    69594.45
##    65414.32
##    83762.68
##    67147.91
##    63177.31
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nrounds = 150, max_depth = 3, eta
##  = 0.4, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
##  = 1.

3. Evaluación y Resumen de Modelos

3.1. Calcular predicciones y RMSE en test

# Función para calcular RMSE en test
calc_rmse <- function(modelo, datos_test, outcome = "Weekly_Sales") {
  pred <- predict(modelo, newdata = datos_test)
  rmse <- rmse(datos_test[[outcome]], pred)
  return(rmse)
}

# Modelos tipo train() (los que no son lm() puro)
rmse_lm     <- calc_rmse(modelo_lm, test_ready)
rmse_ridge  <- calc_rmse(modelo_ridge, test_ready)
rmse_lasso  <- calc_rmse(modelo_lasso, test_ready)
rmse_enet   <- calc_rmse(modelo_enet, test_ready)
rmse_rf     <- calc_rmse(modelo_rf, test_ready)
rmse_svr    <- calc_rmse(modelo_svr, test_ready)
rmse_knn    <- calc_rmse(modelo_knn, test_ready)
rmse_xgb    <- calc_rmse(modelo_xgb, test_ready)

# Stepwise (lm base)
pred_step <- predict(modelo_step, newdata = test_ready)
rmse_step <- rmse(test_ready$Weekly_Sales, pred_step)

3.2. Extraer hiperparámetros y variables seleccionadas

# Extrae info relevante de cada modelo
# Para lm y stepwise
vars_lm     <- names(coef(modelo_lm$finalModel))
vars_step   <- names(coef(modelo_step))

# Para modelos glmnet
vars_ridge  <- rownames(coef(modelo_ridge$finalModel, modelo_ridge$bestTune$lambda))
vars_lasso  <- rownames(coef(modelo_lasso$finalModel, modelo_lasso$bestTune$lambda))[coef(modelo_lasso$finalModel, modelo_lasso$bestTune$lambda)[,1]!=0]
vars_enet   <- rownames(coef(modelo_enet$finalModel, modelo_enet$bestTune$lambda))[coef(modelo_enet$finalModel, modelo_enet$bestTune$lambda)[,1]!=0]

# Para los otros modelos, incluimos todas las variables predictoras
vars_rf     <- modelo_rf$finalModel$xNames
vars_svr    <- names(test_ready)[-which(names(test_ready) == "Weekly_Sales")]
vars_knn    <- names(test_ready)[-which(names(test_ready) == "Weekly_Sales")]
vars_xgb    <- names(test_ready)[-which(names(test_ready) == "Weekly_Sales")]

3.3. Armar dataframe de resumen de resultados

# Armamos la tabla
resultados <- tibble(
  Id = 1:9,
  Modelo = c(
    "Regresión Lineal",
    "Ridge",
    "LASSO",
    "Elastic Net",
    "Stepwise",
    "Random Forest",
    "SVR",
    "KNN",
    "XGBoost"
  ),
  Hyperparametros = c(
    NA,
    paste("lambda=", modelo_ridge$bestTune$lambda),
    paste("lambda=", modelo_lasso$bestTune$lambda),
    paste("alpha=", modelo_enet$bestTune$alpha, "lambda=", modelo_enet$bestTune$lambda),
    NA,
    paste("mtry=", modelo_rf$bestTune$mtry),
    paste("C=", modelo_svr$bestTune$C, "sigma=", modelo_svr$bestTune$sigma),
    paste("k=", modelo_knn$bestTune$kmax),
    paste("nrounds=", modelo_xgb$bestTune$nrounds)
  ),
  RMSE = c(
    rmse_lm,
    rmse_ridge,
    rmse_lasso,
    rmse_enet,
    rmse_step,
    rmse_rf,
    rmse_svr,
    rmse_knn,
    rmse_xgb
  ),
  Variables = list(
    vars_lm,
    vars_ridge,
    vars_lasso,
    vars_enet,
    vars_step,
    vars_rf,
    vars_svr,
    vars_knn,
    vars_xgb
  )
)

3.4. Asignar el “Tag” (Champion, Challenger, Deprecated)

# Ordenamos por RMSE y taggeamos
resultados <- resultados %>%
  arrange(RMSE) %>%
  mutate(
    Tag = case_when(
      row_number() == 1 ~ "Champion",
      row_number() <= 3 ~ "Challenger",
      TRUE ~ "Deprecated"
    )
  )
print(resultados)

## # A tibble: 9 × 6
##      Id Modelo           Hyperparametros                    RMSE Variables Tag  
##   <int> <chr>            <chr>                             <dbl> <list>    <chr>
## 1     9 XGBoost          "nrounds= 150"                   8.94e4 <chr>     Cham…
## 2     6 Random Forest    "mtry= 17"                       1.15e5 <chr>     Chal…
## 3     8 KNN              "k= "                            4.54e5 <chr>     Chal…
## 4     2 Ridge            "lambda= 1"                      5.29e5 <chr>     Depr…
## 5     4 Elastic Net      "alpha= 0.9 lambda= 89.44116466… 5.30e5 <chr>     Depr…
## 6     3 LASSO            "lambda= 1"                      5.30e5 <chr>     Depr…
## 7     1 Regresión Lineal  <NA>                            5.30e5 <chr>     Depr…
## 8     5 Stepwise          <NA>                            5.30e5 <chr>     Depr…
## 9     7 SVR              "C= 1 sigma= 0.0798933042025618" 5.66e5 <chr>     Depr…

3.5. Predicciones del modelo Champion

# Supón que modelo_xgb es el Champion. Cambia el nombre si tu Champion es otro.
champion_pred <- predict(modelo_xgb, newdata = test_ready)
champion_real <- test_ready$Weekly_Sales
champion_resid <- champion_real - champion_pred

# Si tu Champion es otro, por ejemplo modelo_rf, usa:
# champion_pred <- predict(modelo_rf, newdata = test_ready)
# champion_real <- test_ready$Weekly_Sales
# champion_resid <- champion_real - champion_pred

3.6. Gráfico de Predicciones vs. Valores Reales

# Dataframe para plot
df_plot <- tibble(
  Real = champion_real,
  Prediccion = champion_pred
)

ggplot(df_plot, aes(x = Real, y = Prediccion)) +
  geom_point(alpha = 0.4, color = "dodgerblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(
    title = "Predicción vs Valor Real",
    x = "Valor Real",
    y = "Predicción"
  ) +
  theme_minimal()

3. Gráfico de Residuos vs Predicción

df_residuos <- tibble(
  Prediccion = champion_pred,
  Residuo = champion_resid
)

ggplot(df_residuos, aes(x = Prediccion, y = Residuo)) +
  geom_point(alpha = 0.4, color = "tomato") +
  geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
  labs(
    title = "Residuos vs Predicción",
    x = "Predicción",
    y = "Residuo"
  ) +
  theme_minimal()

4. Histograma de Residuos

ggplot(df_residuos, aes(x = Residuo)) +
  geom_histogram(bins = 30, fill = "purple", color = "white", alpha = 0.7) +
  labs(
    title = "Histograma de residuos",
    x = "Residuo",
    y = "Frecuencia"
  ) +
  theme_minimal()

3.5. Comentario final sobre los resultados

A) Selección de Modelo Champion

El modelo XGBoost resultó ser el “Champion”, logrando el menor RMSE en el set de test (RMSE ≈ 90,138.42).
Supera significativamente al resto de modelos, especialmente a los tradicionales como regresión lineal múltiple y modelos penalizados, así como a otros algoritmos de machine learning como Random Forest y SVR.

B) Desempeño General de Modelos

El segundo mejor modelo fue Random Forest (RMSE ≈ 114,762), y KNN se ubicó como Challenger, aunque con una diferencia considerable respecto al Champion.
Los modelos lineales (regresión, ridge, lasso, elastic net y stepwise) mostraron un desempeño mucho más bajo, evidenciando que la relación entre las variables y las ventas semanales es altamente no lineal y compleja, justificando el uso de modelos de boosting.

C) Diagnóstico Gráfico del Modelo Champion

a) Predicción vs Valor Real

El gráfico muestra que los puntos están muy alineados respecto a la línea diagonal roja, indicando que las predicciones del modelo Champion son muy cercanas a los valores reales para la mayoría de los casos.
Hay pocos puntos alejados, lo que indica un bajo número de outliers o errores grandes.

b) Residuos vs Predicción

Los residuos se distribuyen de manera simétrica en torno al cero, con mayor concentración cerca de los valores bajos y medianos de predicción.
No se observa un patrón sistemático (no hay curvaturas ni “bandas” verticales claras), lo que indica que el modelo no deja sesgos estructurales importantes sin explicar.
Hay cierta dispersión en predicciones altas, pero es esperable dado el rango de ventas semanales (outliers naturales en negocios de retail).

c) Histograma de Residuos

Los residuos siguen una distribución aproximadamente normal y centrada en cero, con la mayoría de errores menores a ±100,000.
La cola derecha indica algunos casos de ventas atípicamente altas o bajas donde el modelo no acierta completamente, pero son pocos casos.

D) Recomendaciones y Hallazgos

El modelo Champion (XGBoost) captura muy bien la complejidad del problema, siendo robusto ante las variaciones semanales y capaz de explicar la variabilidad de ventas incluso con factores externos y estacionales.
Para producción o recomendaciones reales, se sugiere mantener este modelo, pero monitorear outliers (ventas extremadamente altas/bajas) e investigar los casos atípicos.
La baja performance de modelos lineales indica que la relación de las variables predictoras con las ventas es compleja y potencialmente dependiente de interacciones, no solo efectos simples.

3.6. Guardar los resultados a CSV

write_csv(resultados, "resultados_modelos_walmart.csv")

R Notebook