library(ggplot2)
library(forecast)
library(tseries)
library(zoo)
library(lubridate)
library(TSA)
library(xts)
library(dplyr)
library(scales)
library(vars)
library(readxl)
library(car)
library(glmnet)
library(gridExtra)
Time series analysis is a statistical technique that involves analyzing data points collected or recorded at specific time intervals. This type of analysis is used to identify patterns, trends, cycles, and seasonal variations in the data over time. The primary goal of time series analysis is to forecast future values based on historical data, making it particularly useful in fields such as finance, economics, weather forecasting, and many other areas where data evolves over time (Chatfield, 2003).
Stationarity is a fundamental concept in time series analysis, which implies that the statistical properties of the series, such as mean and variance, remain constant over time. Stationary time series are easier to model and predict because their behavior is consistent over time. Non-stationary data, on the other hand, may exhibit trends, seasonality, or varying variance, complicating the analysis and potentially leading to misleading results. Ensuring stationarity often involves transforming or differencing the data to remove trends and seasonality, allowing for more accurate modeling and forecasting (Box et al., 2015).
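To make the idea of stationarity concrete, here is a minimal illustrative sketch (using simulated data only, not the Coca-Cola series): a random walk is non-stationary, but its first difference is stationary, and the same adf.test() from the tseries package used later in this report can verify this.
# Illustrative only: simulate a random walk and test stationarity before and after differencing
set.seed(123)
random_walk <- cumsum(rnorm(200)) # non-stationary level series
adf.test(random_walk)             # typically fails to reject the unit-root null
adf.test(diff(random_walk))       # typically rejects it after first differencing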
Time series analysis is crucial for business intelligence because it enables organizations to make data-driven decisions by forecasting future trends, demand, or performance based on historical data. It can be applied to various business problems, such as predicting sales, stock prices, or customer behavior. This analysis helps businesses anticipate changes, manage inventory, optimize marketing strategies, and improve operational efficiency by understanding patterns over time and responding proactively.
Sources:
Chatfield, C. (2003). The Analysis of Time Series: An Introduction. Chapman and Hall/CRC.
Box, G.E.P., Jenkins, G.M., Reinsel, G.C., & Ljung, G.M. (2015). Time Series Analysis: Forecasting and Control. Wiley.
For this problem, our objective is to build a predictive model for Arca Continental, focused on its biggest and most popular business in Latin America, Coca-Cola. We want a model that can successfully forecast a designated target variable, allowing us to predict the company's box sales (referred to as "grids") in the metropolitan area of Guadalajara, Jalisco.
# Display the structure of the dataset
str(cocacola)
## tibble [48 × 15] (S3: tbl_df/tbl/data.frame)
## $ tperiod : POSIXct[1:48], format: "2021-01-15" "2021-02-15" ...
## $ sales_unitboxes : num [1:48] 5516689 5387496 5886747 6389182 6448275 ...
## $ consumer_sentiment: num [1:48] 38.1 37.5 38.5 37.8 38 ...
## $ CPI : num [1:48] 87.1 87.3 87.6 87.4 87 ...
## $ inflation_rate : num [1:48] -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
## $ unemp_rate : num [1:48] 0.0523 0.0531 0.0461 0.051 0.0552 ...
## $ gdp_percapita : num [1:48] 11660 11660 11660 11626 11626 ...
## $ itaee : num [1:48] 104 104 104 108 108 ...
## $ itaee_growth : num [1:48] 0.0497 0.0497 0.0497 0.0318 0.0318 ...
## $ pop_density : num [1:48] 98.5 98.5 98.5 98.8 98.8 ...
## $ job_density : num [1:48] 18.3 18.5 18.6 18.7 18.7 ...
## $ pop_minwage : num [1:48] 9.66 9.66 9.66 9.59 9.59 ...
## $ exchange_rate : num [1:48] 14.7 14.9 15.2 15.2 15.3 ...
## $ max_temperature : num [1:48] 28 31 29 32 34 32 29 29 29 29 ...
## $ holiday_month : num [1:48] 0 0 0 1 0 0 0 0 1 0 ...
# Convert 'holiday_month' to a factor
cocacola$holiday_month <- as.factor(cocacola$holiday_month)
# Convert 'tperiod' to a Date object if it isn't already one
cocacola$tperiod <- as.Date(cocacola$tperiod, format = "%d-%m-%Y")
# Extract the day (which represents the year) and the month
day_month <- format(cocacola$tperiod, "%d-%m")
# Convert the day to a proper year (e.g., day 15 becomes year 2015) and rebuild the date
corrected_year <- as.numeric(substr(day_month, 1, 2)) + 2000
corrected_month_day <- paste0(corrected_year, "-", substr(day_month, 4, 5))
# Convert back to Date object
cocacola$tperiod <- as.Date(paste0(corrected_month_day, "-01"), format = "%Y-%m-%d")
# Check the results
head(cocacola$tperiod)
## [1] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01"
## [6] "2015-06-01"
# structure after conversions
str(cocacola)
## tibble [48 × 15] (S3: tbl_df/tbl/data.frame)
## $ tperiod : Date[1:48], format: "2015-01-01" "2015-02-01" ...
## $ sales_unitboxes : num [1:48] 5516689 5387496 5886747 6389182 6448275 ...
## $ consumer_sentiment: num [1:48] 38.1 37.5 38.5 37.8 38 ...
## $ CPI : num [1:48] 87.1 87.3 87.6 87.4 87 ...
## $ inflation_rate : num [1:48] -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
## $ unemp_rate : num [1:48] 0.0523 0.0531 0.0461 0.051 0.0552 ...
## $ gdp_percapita : num [1:48] 11660 11660 11660 11626 11626 ...
## $ itaee : num [1:48] 104 104 104 108 108 ...
## $ itaee_growth : num [1:48] 0.0497 0.0497 0.0497 0.0318 0.0318 ...
## $ pop_density : num [1:48] 98.5 98.5 98.5 98.8 98.8 ...
## $ job_density : num [1:48] 18.3 18.5 18.6 18.7 18.7 ...
## $ pop_minwage : num [1:48] 9.66 9.66 9.66 9.59 9.59 ...
## $ exchange_rate : num [1:48] 14.7 14.9 15.2 15.2 15.3 ...
## $ max_temperature : num [1:48] 28 31 29 32 34 32 29 29 29 29 ...
## $ holiday_month : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 2 1 ...
ggplot(cocacola, aes(x=tperiod, y=sales_unitboxes)) +
geom_line() +
labs(title='Box Sales Over Time', x='Time', y='sales_unitboxes') +
theme(axis.text.x = element_text(angle=45, hjust=1))
cocacola$Moving_Avg <- rollmean(cocacola$sales_unitboxes, 12, fill = NA)
# Plot moving average
ggplot(cocacola, aes(x=tperiod)) +
geom_line(aes(y=sales_unitboxes, colour="Original")) +
geom_line(aes(y=Moving_Avg, colour="12-Period Moving Average")) +
labs(title='Moving Average Plot', x='Date', y='Sales (Unit Boxes)') +
scale_colour_manual("", breaks = c("Original", "12-Period Moving Average"), values = c("blue", "red"))
## Warning: Removed 11 rows containing missing values or values outside the scale range
## (`geom_line()`).
cocacola_ts <- ts(cocacola$sales_unitboxes, frequency=12, start=c(2015, 1))
decomposed <- decompose(cocacola_ts, type="additive")
plot(decomposed)
adf_test <- adf.test(cocacola$sales_unitboxes, alternative="stationary")
## Warning in adf.test(cocacola$sales_unitboxes, alternative = "stationary"):
## p-value smaller than printed p-value
print(paste("ADF Statistic: ", adf_test$statistic))
## [1] "ADF Statistic: -4.42816565181968"
print(paste("p-value: ", adf_test$p.value))
## [1] "p-value: 0.01"
sales_ts <- ts(cocacola$sales_unitboxes, frequency=12)
# Plot the autocorrelation function (ACF)
acf(sales_ts, main="Autocorrelation of Sales (Unit Boxes)")
# Plot the partial autocorrelation function (PACF)
pacf(sales_ts, main="Partial Autocorrelation of Sales (Unit Boxes)")
The dependent variable sales_unitboxes shows a general upward trend over time as well as clear seasonality. Despite this, the ADF test rejects the unit-root null hypothesis (p-value = 0.01), and the ACF plot shows significant autocorrelation.
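As an optional cross-check (not part of the original workflow), the forecast package's ndiffs() and nsdiffs() functions estimate how many regular and seasonal differences the series would need to become stationary, complementing the ADF and ACF results above.
# Optional cross-check of the stationarity conclusions
ndiffs(sales_ts)  # 0 would agree with the ADF rejection of a unit root
nsdiffs(sales_ts) # a value > 0 would point to remaining seasonal non-stationarity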
# ARMA
p_values <- 0:5
q_values <- 0:5
best_aic_arma <- Inf
best_order_arma <- c(0, 0)
best_model_arma <- NULL
for (p in p_values) {
for (q in q_values) {
model <- tryCatch({
arima(sales_ts, order = c(p, 0, q)) # ARMA model (no differencing, d = 0)
}, error = function(e) NULL)
if (!is.null(model)) {
current_aic <- AIC(model)
if (current_aic < best_aic_arma) {
best_aic_arma <- current_aic
best_order_arma <- c(p, q)
best_model_arma <- model
}
}
}
}
## Warning in stats::arima(x = x, order = order, seasonal = seasonal, xreg = xreg,
## : possible convergence problem: optim gave code = 1
# ARIMA
p_values <- 0:5
d_values <- 1:2
q_values <- 0:5
best_aic_arima <- Inf
best_order_arima <- c(0, 1, 0)
best_model_arima <- NULL
for (p in p_values) {
for (d in d_values) {
for (q in q_values) {
model <- tryCatch({
arima(sales_ts, order = c(p, d, q)) # ARIMA model
}, error = function(e) NULL)
if (!is.null(model)) {
current_aic <- AIC(model)
if (current_aic < best_aic_arima) {
best_aic_arima <- current_aic
best_order_arima <- c(p, d, q)
best_model_arima <- model
}
}
}
}
}
## Warning in log(s2): NaNs produced
## Warning in log(s2): NaNs produced
## Warning in log(s2): NaNs produced
## Warning in log(s2): NaNs produced
## Warning in log(s2): NaNs produced
# AR (Autoregressive)
p_values <- 0:5
best_aic_ar <- Inf
best_order_ar <- c(0, 0)
best_model_ar <- NULL
for (p in p_values) {
model <- tryCatch({
arima(sales_ts, order = c(p, 0, 0)) # AR model (no MA component, q = 0)
}, error = function(e) NULL)
if (!is.null(model)) {
current_aic <- AIC(model)
if (current_aic < best_aic_ar) {
best_aic_ar <- current_aic
best_order_ar <- c(p, 0)
best_model_ar <- model
}
}
}
# MA (Moving Average)
q_values <- 0:5
best_aic_ma <- Inf
best_order_ma <- c(0, 0)
best_model_ma <- NULL
for (q in q_values) {
model <- tryCatch({
arima(sales_ts, order = c(0, 0, q)) # MA model (no AR component, p = 0)
}, error = function(e) NULL)
if (!is.null(model)) {
current_aic <- AIC(model)
if (current_aic < best_aic_ma) {
best_aic_ma <- current_aic
best_order_ma <- c(0, q)
best_model_ma <- model
}
}
}
# Compare the models based on AIC
modelos <- c("ARMA", "ARIMA", "AR", "MA")
mejores_aic <- c(best_aic_arma, best_aic_arima, best_aic_ar, best_aic_ma)
mejor_modelo <- modelos[which.min(mejores_aic)]
mejor_aic <- min(mejores_aic)
print(paste("The best model is:", mejor_modelo))
## [1] "The best model is: ARIMA"
print(paste("The best AIC is:", mejor_aic))
## [1] "The best AIC is: 1354.53371359874"
# Show the summary of the best model
if (mejor_modelo == "ARMA") {
summary(best_model_arma)
} else if (mejor_modelo == "ARIMA") {
summary(best_model_arima)
} else if (mejor_modelo == "AR") {
summary(best_model_ar)
} else if (mejor_modelo == "MA") {
summary(best_model_ma)
}
##
## Call:
## arima(x = sales_ts, order = c(p, d, q))
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 ma5
## -0.7718 -0.5521 -0.5689 -0.3367 -0.8268 0.0062 0.7522
## s.e. 0.1690 0.1435 0.1924 0.2089 0.1744 0.1985 0.1731
##
## sigma^2 estimated as 1.885e+11: log likelihood = -669.27, aic = 1352.53
##
## Training set error measures:
## Warning in trainingaccuracy(object, test, d, D): test elements must be within
## sample
## ME RMSE MAE MPE MAPE
## Training set NaN NaN NaN NaN NaN
# Define the candidate values of p, d and q
p_values <- 0:5
d_values <- 1:2
q_values <- 0:5
# Initialize the best AIC, best order and best model
best_aic_arima <- Inf
best_order_arima <- c(0, 1, 0)
best_model_arima <- NULL
# Iterate over the possible values of p, d and q
for (p in p_values) {
for (d in d_values) {
for (q in q_values) {
model <- tryCatch({
Arima(sales_ts, order = c(p, d, q)) # ARIMA model (forecast::Arima)
}, error = function(e) NULL)
if (!is.null(model)) {
current_aic <- AIC(model)
if (current_aic < best_aic_arima) {
best_aic_arima <- current_aic
best_order_arima <- c(p, d, q)
best_model_arima <- model
}
}
}
}
}
# Check whether a model was found
if (!is.null(best_model_arima)) {
print(paste("Best ARIMA model:", paste(best_order_arima, collapse = ",")))
print(paste("Best ARIMA AIC:", best_aic_arima))
summary(best_model_arima)
# Generate a 5-period forecast
forecast_arima <- forecast(best_model_arima, h = 5)
print(forecast_arima)
# Plot the forecast
autoplot(forecast_arima) +
labs(title = "5-Period Forecast", x = "Time", y = "Sales (Unit Boxes)") +
theme_minimal()
} else {
print("No valid ARIMA model was found.")
}
## [1] "Best ARIMA model: 2,2,5"
## [1] "Best ARIMA AIC: 1354.53371359874"
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 5 6823548 6196309 7450787 5864269 7782828
## Feb 5 6790215 6031908 7548521 5630485 7949944
## Mar 5 6909395 6092700 7726089 5660368 8158421
## Apr 5 6735581 5913620 7557542 5478500 7992662
## May 5 6822849 5996149 7649549 5558520 8087177
Instructions: From the time series dataset, select 2 explanatory variables that might affect sales_unitboxes.
# Compute the correlation between sales_unitboxes and the other variables
correlation_matrix <- cor(cocacola[, c("sales_unitboxes",
"consumer_sentiment",
"gdp_percapita",
"CPI",
"inflation_rate",
"unemp_rate",
"exchange_rate",
"max_temperature",
"Moving_Avg")],
use = "complete.obs")
# View the correlation matrix
print(correlation_matrix)
## sales_unitboxes consumer_sentiment gdp_percapita CPI
## sales_unitboxes 1.00000000 0.35524547 0.05786549 0.08807625
## consumer_sentiment 0.35524547 1.00000000 -0.52631149 -0.40446243
## gdp_percapita 0.05786549 -0.52631149 1.00000000 0.86618848
## CPI 0.08807625 -0.40446243 0.86618848 1.00000000
## inflation_rate -0.55212057 -0.52620078 -0.01727899 0.10332937
## unemp_rate 0.04231135 0.51952301 -0.76038173 -0.80824729
## exchange_rate -0.06014764 -0.73031564 0.66028999 0.54067761
## max_temperature 0.68348780 0.03848755 0.37713494 0.17746354
## Moving_Avg 0.19129530 -0.20416077 0.73121504 0.65558384
## inflation_rate unemp_rate exchange_rate max_temperature
## sales_unitboxes -0.55212057 0.04231135 -0.06014764 0.68348780
## consumer_sentiment -0.52620078 0.51952301 -0.73031564 0.03848755
## gdp_percapita -0.01727899 -0.76038173 0.66028999 0.37713494
## CPI 0.10332937 -0.80824729 0.54067761 0.17746354
## inflation_rate 1.00000000 -0.19810803 0.41019512 -0.55467320
## unemp_rate -0.19810803 1.00000000 -0.63120125 -0.11731604
## exchange_rate 0.41019512 -0.63120125 1.00000000 0.13404524
## max_temperature -0.55467320 -0.11731604 0.13404524 1.00000000
## Moving_Avg -0.10970141 -0.51257546 0.52480755 0.25792091
## Moving_Avg
## sales_unitboxes 0.1912953
## consumer_sentiment -0.2041608
## gdp_percapita 0.7312150
## CPI 0.6555838
## inflation_rate -0.1097014
## unemp_rate -0.5125755
## exchange_rate 0.5248076
## max_temperature 0.2579209
## Moving_Avg 1.0000000
# Keep only the correlations with sales_unitboxes
sales_correlations <- correlation_matrix[1, -1] # first row of the matrix, excluding the self-correlation of sales_unitboxes
# View the correlations sorted by absolute value
print(sales_correlations[order(abs(sales_correlations), decreasing = TRUE)])
## max_temperature inflation_rate consumer_sentiment Moving_Avg
## 0.68348780 -0.55212057 0.35524547 0.19129530
## CPI exchange_rate gdp_percapita unemp_rate
## 0.08807625 -0.06014764 0.05786549 0.04231135
# Difference max_temperature and inflation_rate to make them stationary
max_temperature_diff <- diff(cocacola$max_temperature)
inflation_rate_diff <- diff(cocacola$inflation_rate)
# Scale the exogenous variables (max_temperature_diff and inflation_rate_diff)
# and align them by dropping the first observation of sales_unitboxes
explanatory_variables <- scale(cbind(max_temperature_diff, inflation_rate_diff))
# Make sure there are no NAs in the scaled exogenous variables or in sales_unitboxes
sales_ts <- ts(na.omit(cocacola$sales_unitboxes)[-1], frequency = 12, start = c(2015, 1)) # drop the first observation so sales_unitboxes matches the differenced variables
# Use the best ARIMA order found previously to fit the ARIMAX model
best_model_arimax <- Arima(sales_ts, order = best_order_arima, xreg = explanatory_variables)
# Show the summary of the fitted ARIMAX model
summary(best_model_arimax)
## Series: sales_ts
## Regression with ARIMA(2,2,5) errors
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 ma5
## -0.7896 -0.5474 -0.5440 -0.3685 -0.8064 -0.0319 0.7796
## s.e. 0.1707 0.1443 0.1853 0.2102 0.1695 0.2031 0.1736
## max_temperature_diff inflation_rate_diff
## -16830.07 146.7015
## s.e. 58146.01 38965.7032
##
## sigma^2 = 2.352e+11: log likelihood = -654.79
## AIC=1329.59 AICc=1336.06 BIC=1347.65
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -67568.01 424484.1 311138.3 -1.446068 4.960722 0.9214931
## ACF1
## Training set -0.0099564
# Produce a forecast for the next 5 periods
# Difference the values of max_temperature and inflation_rate used for the forecast
# (here the first six historical observations are used as a stand-in for future values)
max_temperature_forecast_diff <- diff(cocacola$max_temperature[1:6])
inflation_rate_forecast_diff <- diff(cocacola$inflation_rate[1:6])
# Scale the projected variables and reuse the training column names so that
# forecast() matches the regressors to those used when fitting the model
explanatory_forecast <- scale(cbind(max_temperature_forecast_diff, inflation_rate_forecast_diff))
colnames(explanatory_forecast) <- colnames(explanatory_variables)
# Generate the forecast from the ARIMAX model using the projected, scaled exogenous variables
forecast_arimax <- forecast(best_model_arimax, xreg = explanatory_forecast, h = 5)
# Build a table with the forecast results
forecast_table <- data.frame(
Period = time(forecast_arimax$mean),
Forecast = as.numeric(forecast_arimax$mean),
Lower_95 = as.numeric(forecast_arimax$lower[,2]),
Upper_95 = as.numeric(forecast_arimax$upper[,2])
)
# Print the forecast table
print(forecast_table)
## Period Forecast Lower_95 Upper_95
## 1 2018.917 6795278 5808470 7782087
## 2 2019.000 6763246 5564154 7962339
## 3 2019.083 6863740 5571412 8156067
## 4 2019.167 6696910 5394784 7999037
## 5 2019.250 6808127 5496554 8119700
# Plot the forecast
autoplot(forecast_arimax) +
labs(title = "5-Period ARIMAX Forecast with Scaled and Differenced Regressors",
x = "Time", y = "Sales (Unit Boxes)") +
theme_minimal()
There is very likely a positive relationship between max_temperature and Coca-Cola's unit-box sales. As the maximum temperature rises, particularly in warmer months or regions, demand for cold beverages like Coca-Cola tends to increase: people are more inclined to buy refreshing drinks in hot weather to stay hydrated and cool down. Conversely, lower temperatures may reduce demand, as people prefer hot drinks or other beverages better suited to colder weather.
There is likely a negative relationship between inflation_rate and Coca-Cola's unit-box sales. An increase in the inflation rate generally raises the cost of goods and services, which can translate into higher prices for Coca-Cola. As prices go up, consumers' purchasing power declines, potentially reducing sales volume. During periods of high inflation, consumers tend to prioritize essential items over discretionary spending, and Coca-Cola, being a non-essential item, may see reduced sales as consumers cut back on such purchases.
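As a quick, informal check on these hypothesized signs (an addition, not part of the models fitted above), cross-correlation functions between the differenced exogenous variables and the differenced sales series can be inspected; the sales_diff object below is created only for this illustration.
# Exploratory cross-correlations; sales_diff is defined only for this check
sales_diff <- diff(cocacola$sales_unitboxes)
ccf(max_temperature_diff, sales_diff, main = "CCF: differenced max temperature vs differenced sales")
ccf(inflation_rate_diff, sales_diff, main = "CCF: differenced inflation rate vs differenced sales")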
# Build a new dataframe with all the series correctly aligned
aligned_data <- data.frame(
tperiod = cocacola$tperiod[-1], # drop the first observation of tperiod
sales_unitboxes = cocacola$sales_unitboxes[-1], # drop the first observation
max_temperature_diff = max_temperature_diff, # already aligned
inflation_rate_diff = inflation_rate_diff # already aligned
)
# Plot sales on the primary axis, with max_temperature_diff and inflation_rate_diff on a secondary axis
ggplot() +
# Plot sales_unitboxes on the primary axis
geom_line(data = aligned_data, aes(x = tperiod, y = sales_unitboxes, color = "Sales (Unit Boxes)"), linewidth = 1) +
# Plot max_temperature_diff on the secondary axis (rescaled for visual comparison)
geom_line(data = aligned_data, aes(x = tperiod, y = max_temperature_diff * 200000, color = "Max Temperature (Differenced)"), linewidth = 1, linetype = "dashed") +
# Plot inflation_rate_diff on the secondary axis (rescaled for visual comparison)
geom_line(data = aligned_data, aes(x = tperiod, y = inflation_rate_diff * 2000000, color = "Inflation Rate (Differenced)"), linewidth = 1, linetype = "dotted") +
# Configure the axes and their scales
scale_y_continuous(
name = "Sales (Unit Boxes)",
sec.axis = sec_axis(~ . / 100000, name = "Temperature (°C) and Differenced Inflation Rate (%)")
) +
# Labels and title
labs(title = "Sales, Differenced Max Temperature and Differenced Inflation Rate with Secondary Axis",
x = "Time", color = "") +
# Minimal theme and legend placement
theme_minimal() +
theme(legend.position = "bottom")
# Compute the ACF for the differenced max_temperature series
acf_max_temperature_diff <- Acf(max_temperature_diff, main = "ACF of max_temperature (Differenced)")
# Compute the ACF for the differenced inflation_rate series
acf_inflation_rate_diff <- Acf(inflation_rate_diff, main = "ACF of inflation_rate (Differenced)")
# ADF test for the differenced max_temperature series
adf_test_max_temp_diff <- adf.test(max_temperature_diff, alternative = "stationary")
print(adf_test_max_temp_diff)
##
## Augmented Dickey-Fuller Test
##
## data: max_temperature_diff
## Dickey-Fuller = -4.0323, Lag order = 3, p-value = 0.01639
## alternative hypothesis: stationary
# ADF test for the differenced inflation_rate series
adf_test_inflation_diff <- adf.test(inflation_rate_diff, alternative = "stationary")
print(adf_test_inflation_diff)
##
## Augmented Dickey-Fuller Test
##
## data: inflation_rate_diff
## Dickey-Fuller = -3.895, Lag order = 3, p-value = 0.02207
## alternative hypothesis: stationary
Max Temperature: The ADF test statistic of -4.0323 is well below the usual critical values (about -3.43 at the 5% significance level), and the p-value (0.01639) is less than 0.05, so we can reject the null hypothesis of a unit root in favor of the alternative that the differenced series is stationary.
Inflation Rate: The ADF test statistic of -3.895 is also below the critical values (e.g., -3.43 at the 5% significance level), and the p-value (0.02207) is less than 0.05, allowing us to reject the null hypothesis of non-stationarity; this series is also stationary after differencing.
ADF Conclusion: These tests suggest that the statistical properties of both series remain constant over time once differenced, making them suitable for time series techniques that assume stationarity.
Autocorrelation analysis after differencing: Both series show a significant spike at lag 1, indicating that the differenced data is not completely free of autocorrelation. However, most other lags are not significant, which implies that differencing removed much of the autocorrelation in the original series. In summary, while some autocorrelation remains, particularly at lag 1 for both series, the data does not exhibit strong autocorrelation at other lags.
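A complementary check that could be added here is the KPSS test from the tseries package, whose null hypothesis is stationarity (the reverse of the ADF test); a small ADF p-value together with a large KPSS p-value gives additional comfort that differencing was sufficient.
# Complementary stationarity check: the KPSS null hypothesis is stationarity
kpss.test(max_temperature_diff)
kpss.test(inflation_rate_diff)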
# Build a dataframe with the aligned differenced series (sales_unitboxes and inflation_rate_diff)
var_data <- data.frame(
sales_unitboxes = cocacola$sales_unitboxes[-1], # align by dropping the first observation
inflation_rate_diff = inflation_rate_diff # differenced series already prepared
)
# Select the optimal number of lags based on AIC (equivalent to maxlags=4 and ic='aic' in Python)
lag_selection <- VARselect(var_data, lag.max = 4, type = "const")
optimal_lag <- lag_selection$selection["AIC(n)"] # extract the optimal number of lags according to AIC
# Estimate the VAR model with the optimal number of lags
VAR_model <- VAR(var_data, p = optimal_lag, type = "const")
# Show the summary of the fitted VAR model
summary(VAR_model)
##
## VAR Estimation Results:
## =========================
## Endogenous variables: sales_unitboxes, inflation_rate_diff
## Deterministic variables: const
## Sample size: 44
## Log Likelihood: -636.938
## Roots of the characteristic polynomial:
## 0.7906 0.7906 0.7632 0.7632 0.7044 0.7044
## Call:
## VAR(y = var_data, p = optimal_lag, type = "const")
##
##
## Estimation results for equation sales_unitboxes:
## ================================================
## sales_unitboxes = sales_unitboxes.l1 + inflation_rate_diff.l1 + sales_unitboxes.l2 + inflation_rate_diff.l2 + sales_unitboxes.l3 + inflation_rate_diff.l3 + const
##
## Estimate Std. Error t value Pr(>|t|)
## sales_unitboxes.l1 4.718e-01 1.514e-01 3.117 0.00353 **
## inflation_rate_diff.l1 -5.059e+05 2.238e+05 -2.260 0.02978 *
## sales_unitboxes.l2 -1.179e-02 1.726e-01 -0.068 0.94590
## inflation_rate_diff.l2 -2.666e+05 2.408e+05 -1.107 0.27529
## sales_unitboxes.l3 6.755e-02 1.649e-01 0.410 0.68445
## inflation_rate_diff.l3 -5.693e+05 2.055e+05 -2.771 0.00870 **
## const 3.105e+06 1.136e+06 2.733 0.00956 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 463100 on 37 degrees of freedom
## Multiple R-Squared: 0.4425, Adjusted R-squared: 0.3521
## F-statistic: 4.895 on 6 and 37 DF, p-value: 0.0008983
##
##
## Estimation results for equation inflation_rate_diff:
## ====================================================
## inflation_rate_diff = sales_unitboxes.l1 + inflation_rate_diff.l1 + sales_unitboxes.l2 + inflation_rate_diff.l2 + sales_unitboxes.l3 + inflation_rate_diff.l3 + const
##
## Estimate Std. Error t value Pr(>|t|)
## sales_unitboxes.l1 6.975e-08 9.946e-08 0.701 0.48752
## inflation_rate_diff.l1 -6.279e-01 1.470e-01 -4.271 0.00013 ***
## sales_unitboxes.l2 1.302e-07 1.134e-07 1.148 0.25825
## inflation_rate_diff.l2 -4.187e-01 1.582e-01 -2.647 0.01186 *
## sales_unitboxes.l3 3.463e-07 1.083e-07 3.196 0.00285 **
## inflation_rate_diff.l3 -2.437e-01 1.350e-01 -1.805 0.07922 .
## const -3.523e+00 7.463e-01 -4.721 3.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## Residual standard error: 0.3042 on 37 degrees of freedom
## Multiple R-Squared: 0.4472, Adjusted R-squared: 0.3576
## F-statistic: 4.989 on 6 and 37 DF, p-value: 0.0007832
##
##
##
## Covariance matrix of residuals:
## sales_unitboxes inflation_rate_diff
## sales_unitboxes 2.145e+11 -4.109e+04
## inflation_rate_diff -4.109e+04 9.256e-02
##
## Correlation matrix of residuals:
## sales_unitboxes inflation_rate_diff
## sales_unitboxes 1.0000 -0.2917
## inflation_rate_diff -0.2917 1.0000
# Forecast the next 5 periods
forecast_steps <- 5
var_forecast <- predict(VAR_model, n.ahead = forecast_steps)
# Create a date index for the forecast (the index starts at the last observed month)
forecast_index <- seq(as.Date(tail(cocacola$tperiod, 1)), by = "month", length.out = forecast_steps)
# Build a dataframe with the forecast results
forecast_df <- data.frame(
Period = forecast_index,
sales_unitboxes_pred = var_forecast$fcst$sales_unitboxes[, 1], # forecast of sales_unitboxes
inflation_rate_pred = var_forecast$fcst$inflation_rate_diff[, 1] # forecast of inflation_rate_diff
)
# Print the forecast
print(forecast_df)
## Period sales_unitboxes_pred inflation_rate_pred
## 1 2018-12-01 6485870 -0.13676581
## 2 2019-01-01 6423873 -0.02262784
## 3 2019-02-01 6635569 0.14764207
## 4 2019-03-01 6607301 -0.02813486
## 5 2019-04-01 6565818 -0.01274115
# Granger causality test: does inflation_rate_diff Granger-cause sales_unitboxes?
granger_test_1 <- causality(VAR_model, cause = "inflation_rate_diff")
print(granger_test_1$Granger)
##
## Granger causality H0: inflation_rate_diff do not Granger-cause
## sales_unitboxes
##
## data: VAR object VAR_model
## F-Test = 3.7763, df1 = 3, df2 = 74, p-value = 0.01403
# Granger causality test: does sales_unitboxes Granger-cause inflation_rate_diff?
granger_test_2 <- causality(VAR_model, cause = "sales_unitboxes")
print(granger_test_2$Granger)
##
## Granger causality H0: sales_unitboxes do not Granger-cause
## inflation_rate_diff
##
## data: VAR object VAR_model
## F-Test = 7.9594, df1 = 3, df2 = 74, p-value = 0.0001142
# create df that combines real sales and forecast
combined_data <- data.frame(
Period = c(as.Date(cocacola$tperiod[-1]), forecast_df$Period), # real dates and forecast
sales_unitboxes = c(cocacola$sales_unitboxes[-1], rep(NA, forecast_steps)), #Real Sales
sales_unitboxes_pred = c(rep(NA, length(cocacola$sales_unitboxes[-1])), forecast_df$sales_unitboxes_pred) # Forecast
)
ggplot() +
# Plot sales_unitboxes
geom_line(data = combined_data, aes(x = Period, y = sales_unitboxes, color = "Sales (Real)"), linewidth = 1) +
# Plot sales_unitboxes_pred
geom_line(data = combined_data, aes(x = Period, y = sales_unitboxes_pred, color = "Sales (Forecast)"), linewidth = 1, linetype = "dashed") +
# Vertical line to mark the start of the forecast
geom_vline(xintercept = as.numeric(forecast_df$Period[1]), color = "yellow", linetype = "dashed", linewidth = 1) +
labs(title = "Monthly Sales and VAR forecast of Unit Boxes",
x = "Date", y = "Sales (Unit Boxes)") +
scale_color_manual(values = c("Sales (Real)" = "blue", "Sales (Forecast)" = "green")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "bottom") +
theme(legend.title = element_blank())
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 47 rows containing missing values or values outside the scale range
## (`geom_line()`).
Both Granger-causality tests reject their null hypotheses at the 5% level. The first test (F = 3.78, p = 0.014) indicates that inflation_rate_diff Granger-causes sales_unitboxes, so past changes in inflation carry some information for predicting sales. The second test (F = 7.96, p = 0.0001) indicates that sales_unitboxes Granger-cause inflation_rate_diff, and this is the stronger of the two relationships: past values of sales provide significant information for predicting changes in the inflation rate.
These results do not imply a structural causal mechanism; they only show that, within this bivariate system, each series helps forecast the other, with the sales-to-inflation direction being the more pronounced one.
The correlation matrix of residuals shows a weak negative correlation (-0.2917) between the sales_unitboxes and inflation_rate_diff residuals. This suggests that, while the shocks to the two equations are somewhat related, they do not exhibit a strong linear relationship, implying that other factors may influence the dynamics between sales and inflation beyond what is captured by their past values.
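One natural extension (not included in the original analysis) is to look at impulse response functions from the vars package, which trace how a shock to one variable propagates through the VAR over time and complement the Granger tests and residual correlations discussed above.
# Sketch of an impulse response analysis for the fitted VAR model
irf_inflation <- irf(VAR_model, impulse = "inflation_rate_diff",
response = "sales_unitboxes", n.ahead = 10, boot = TRUE)
plot(irf_inflation)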
The VAR forecast begins in late 2018, just after a moderate upward trend in real sales. The forecast shows a relatively stable prediction for future sales, with values slightly above the final real sales point. This indicates that the model expects a stabilization of sales based on recent patterns, projecting moderate growth or consistency in the near future.
The forecasted section of the graph maintains the clear seasonal patterns seen in historical data, which suggests that these patterns will persist. Coca-Cola can use this information to prepare for peak demand periods, ensuring inventory and marketing efforts are aligned with these predictable sales surges.
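To gauge how much weight to place on these projections, a simple backtest could be added: hold out the last five observed months, refit the selected ARIMA order on the remaining data, and compare the forecasts with the held-out actuals. The sketch below is an illustrative addition; the objects n_total, train_ts, test_vals and backtest_fit are introduced only for this check.
# Illustrative holdout evaluation of the selected ARIMA order
n_total <- length(sales_ts)
train_ts <- window(sales_ts, end = time(sales_ts)[n_total - 5]) # drop the last 5 months
test_vals <- tail(as.numeric(sales_ts), 5) # held-out actuals
backtest_fit <- Arima(train_ts, order = best_order_arima)
accuracy(forecast(backtest_fit, h = 5), test_vals) # training and test error measures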
Adjust prices for Coca-Cola products based on the weather. For example, offer discounts or promotions on hotter days to encourage more sales. In colder months, you might focus on bundling with other products.
Implement a pricing strategy that adjusts to inflationary pressures. For example, offering the product in smaller containers might make it easier for customers to keep buying it, particularly in countries like Mexico, where many families have low incomes.
We could also consider machine learning models such as neural networks, which can handle time series data and complex non-linear relationships. By feeding these models relevant time series features, we could potentially improve the accuracy of the forecasts made here; a minimal sketch follows.
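As a minimal sketch of the neural-network idea (illustrative only; it would need tuning and validation before being compared with the ARIMA and VAR results), the forecast package's nnetar() function fits a feed-forward autoregressive neural network to the same series.
# Illustrative neural-network autoregression (NNAR) forecast
set.seed(123) # nnetar uses random starting weights
nn_fit <- nnetar(sales_ts) # automatic choice of lags and hidden nodes
nn_forecast <- forecast(nn_fit, h = 5)
autoplot(nn_forecast) +
labs(title = "NNAR forecast (illustrative)", x = "Time", y = "Sales (Unit Boxes)")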