Coca-Cola FEMSA, the second-largest Coca-Cola bottler in Latin America and one of the world’s most significant, serves over 118 million consumers across Mexico, Argentina, Ecuador, the United States, and Peru. With a broad product portfolio, it plays a crucial role in meeting consumer demand, including in the Guadalajara Metropolitan Area, where predictive sales models are being developed to enhance strategic decision-making.
Linear regression analysis is a powerful tool in business intelligence for identifying and quantifying relationships between variables. By analyzing historical data, it helps businesses predict future trends, such as sales, customer demand, or market behavior. This insight enables data-driven decision-making, optimizing strategies like pricing, marketing, and inventory management, ultimately leading to improved operational efficiency and profitability.
The problem situation involves predicting sales in units of boxes for the Guadalajara Metropolitan Area by analyzing various economic and environmental factors. To address this, the project will begin with Exploratory Data Analysis (EDA) to understand the data, followed by hypothesis development to explore relationships between variables. Various regression models, including multiple linear, polynomial, Lasso, and Ridge regressions, will be employed to estimate and refine the predictive models. Finally, diagnostic tests and accuracy metrics will be used to select the best model and interpret the results for strategic decision-making.
# missing values in each column
vis_miss(cocacola)
# Display the structure of the dataset
str(cocacola)
## tibble [48 × 15] (S3: tbl_df/tbl/data.frame)
## $ tperiod : POSIXct[1:48], format: "2021-01-15" "2021-02-15" ...
## $ sales_unitboxes : num [1:48] 5516689 5387496 5886747 6389182 6448275 ...
## $ consumer_sentiment: num [1:48] 38.1 37.5 38.5 37.8 38 ...
## $ CPI : num [1:48] 87.1 87.3 87.6 87.4 87 ...
## $ inflation_rate : num [1:48] -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
## $ unemp_rate : num [1:48] 0.0523 0.0531 0.0461 0.051 0.0552 ...
## $ gdp_percapita : num [1:48] 11660 11660 11660 11626 11626 ...
## $ itaee : num [1:48] 104 104 104 108 108 ...
## $ itaee_growth : num [1:48] 0.0497 0.0497 0.0497 0.0318 0.0318 ...
## $ pop_density : num [1:48] 98.5 98.5 98.5 98.8 98.8 ...
## $ job_density : num [1:48] 18.3 18.5 18.6 18.7 18.7 ...
## $ pop_minwage : num [1:48] 9.66 9.66 9.66 9.59 9.59 ...
## $ exchange_rate : num [1:48] 14.7 14.9 15.2 15.2 15.3 ...
## $ max_temperature : num [1:48] 28 31 29 32 34 32 29 29 29 29 ...
## $ holiday_month : num [1:48] 0 0 0 1 0 0 0 0 1 0 ...
# Convert 'holiday_month' to a factor
cocacola$holiday_month <- as.factor(cocacola$holiday_month)
# Convert 'tperiod' to a Date object if it isn't already one
cocacola$tperiod <- as.Date(cocacola$tperiod, format = "%d-%m-%Y")
# Extract the day (which represents the year) and the month
day_month <- format(cocacola$tperiod, "%d-%m")
# Convert the day to a proper year (e.g., 0015 to 2015) and create the correct date
corrected_year <- as.numeric(substr(day_month, 1, 2)) + 2000
corrected_month_day <- paste0(corrected_year, "-", substr(day_month, 4, 5))
# Convert back to Date object
cocacola$tperiod <- as.Date(paste0(corrected_month_day, "-01"), format = "%Y-%m-%d")
# Check the results
head(cocacola$tperiod)
## [1] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01"
## [6] "2015-06-01"
# structure after conversions
str(cocacola)
## tibble [48 × 15] (S3: tbl_df/tbl/data.frame)
## $ tperiod : Date[1:48], format: "2015-01-01" "2015-02-01" ...
## $ sales_unitboxes : num [1:48] 5516689 5387496 5886747 6389182 6448275 ...
## $ consumer_sentiment: num [1:48] 38.1 37.5 38.5 37.8 38 ...
## $ CPI : num [1:48] 87.1 87.3 87.6 87.4 87 ...
## $ inflation_rate : num [1:48] -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
## $ unemp_rate : num [1:48] 0.0523 0.0531 0.0461 0.051 0.0552 ...
## $ gdp_percapita : num [1:48] 11660 11660 11660 11626 11626 ...
## $ itaee : num [1:48] 104 104 104 108 108 ...
## $ itaee_growth : num [1:48] 0.0497 0.0497 0.0497 0.0318 0.0318 ...
## $ pop_density : num [1:48] 98.5 98.5 98.5 98.8 98.8 ...
## $ job_density : num [1:48] 18.3 18.5 18.6 18.7 18.7 ...
## $ pop_minwage : num [1:48] 9.66 9.66 9.66 9.59 9.59 ...
## $ exchange_rate : num [1:48] 14.7 14.9 15.2 15.2 15.3 ...
## $ max_temperature : num [1:48] 28 31 29 32 34 32 29 29 29 29 ...
## $ holiday_month : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 2 1 ...
sales_unitboxes: This is the dependent variable. Understanding its distribution helps identify patterns, outliers, or trends over time.
inflation_rate: Inflation affects consumer purchasing power, which can influence sales. High inflation might reduce consumer spending, lowering sales, while low inflation could boost sales.
exchange_rate: Fluctuations in the exchange rate can impact costs, especially if raw materials or finished goods are imported. This, in turn, can influence pricing strategies and sales volumes.
consumer_sentiment: This measures how consumers feel about the economy, which can affect their willingness to spend on non-essential items like soft drinks. Positive sentiment often correlates with higher sales.
unemp_rate: Higher unemployment typically means lower disposable income for consumers, which can negatively impact sales. Understanding this variable helps in assessing the impact of economic conditions on sales.
# Compute basic descriptive statistics using summary
summary_stats <- summary(cocacola[, c("sales_unitboxes", "inflation_rate", "exchange_rate", "consumer_sentiment", "unemp_rate")])
# Display the basic summary statistics
print(summary_stats)
## sales_unitboxes inflation_rate exchange_rate consumer_sentiment
## Min. :5301755 Min. :-0.5000 Min. :14.69 Min. :28.67
## 1st Qu.:6171767 1st Qu.: 0.1650 1st Qu.:17.38 1st Qu.:35.64
## Median :6461357 Median : 0.3850 Median :18.62 Median :36.76
## Mean :6473691 Mean : 0.3485 Mean :18.18 Mean :37.15
## 3rd Qu.:6819782 3rd Qu.: 0.5575 3rd Qu.:19.06 3rd Qu.:38.14
## Max. :7963063 Max. : 1.7000 Max. :21.39 Max. :44.87
## unemp_rate
## Min. :0.03466
## 1st Qu.:0.04010
## Median :0.04369
## Mean :0.04442
## 3rd Qu.:0.04897
## Max. :0.05517
# Compute additional descriptive statistics using describe function from psych package
describe_stats <- describe(cocacola[, c("sales_unitboxes", "inflation_rate", "exchange_rate", "consumer_sentiment", "unemp_rate")])
# Display the additional descriptive statistics
print(describe_stats)
## cocacola[, c("sales_unitboxes", "inflation_rate", "exchange_rate", "consumer_sentiment", "unemp_rate")]
##
## 5 Variables 48 Observations
## --------------------------------------------------------------------------------
## sales_unitboxes
## n missing distinct Info Mean Gmd .05 .10
## 48 0 48 1 6473691 680321 5491459 5576844
## .25 .50 .75 .90 .95
## 6171767 6461357 6819782 7288957 7396022
##
## lowest : 5301750 5387500 5477870 5516690 5568550
## highest: 7330140 7345040 7423480 7457470 7963060
## --------------------------------------------------------------------------------
## inflation_rate
## n missing distinct Info Mean Gmd .05 .10
## 48 0 41 0.999 0.3485 0.4164 -0.3330 -0.1900
## .25 .50 .75 .90 .95
## 0.1650 0.3850 0.5575 0.6510 0.8255
##
## lowest : -0.5 -0.45 -0.34 -0.32 -0.26, highest: 0.7 0.78 0.85 1.03 1.7
## --------------------------------------------------------------------------------
## exchange_rate
## n missing distinct Info Mean Gmd .05 .10
## 48 0 48 1 18.18 1.797 15.23 15.42
## .25 .50 .75 .90 .95
## 17.38 18.62 19.06 20.16 20.30
##
## lowest : 14.6926 14.9213 15.2262 15.2283 15.2645
## highest: 20.2612 20.2905 20.3032 20.5206 21.3853
## --------------------------------------------------------------------------------
## consumer_sentiment
## n missing distinct Info Mean Gmd .05 .10
## 48 0 48 1 37.15 3.041 33.93 34.63
## .25 .50 .75 .90 .95
## 35.64 36.76 38.14 41.81 42.84
##
## lowest : 28.6679 31.5156 33.7951 34.1893 34.3367
## highest: 42.1327 42.533 43.0057 43.3411 44.8654
## --------------------------------------------------------------------------------
## unemp_rate
## n missing distinct Info Mean Gmd .05 .10
## 48 0 48 1 0.04442 0.006762 0.03648 0.03747
## .25 .50 .75 .90 .95
## 0.04010 0.04369 0.04897 0.05373 0.05413
##
## lowest : 0.0346622 0.0358722 0.0364139 0.0365965 0.0367783
## highest: 0.0538359 0.0539483 0.0542306 0.0547338 0.0551745
## --------------------------------------------------------------------------------
# Create a density plot for the sales_unitboxes variable
ggplot(cocacola, aes(x = sales_unitboxes)) +
geom_density(fill = "blue", alpha = 0.7) +
labs(title = "Density Plot of Coca-Cola Sales (Unit Boxes)",
x = "Sales Unit Boxes",
y = "Density") +
theme_minimal()
# Scatter plot between sales_unitboxes and inflation_rate
ggplot(cocacola, aes(x = inflation_rate, y = sales_unitboxes)) +
geom_point(color = "blue") +
labs(title = "Sales vs. Inflation Rate",
x = "Inflation Rate",
y = "Sales (Unit Boxes)") +
theme_minimal()
# Scatter plot between sales_unitboxes and consumer_sentiment
ggplot(cocacola, aes(x = consumer_sentiment, y = sales_unitboxes)) +
geom_point(color = "green") +
labs(title = "Sales vs. Consumer Sentiment",
x = "Consumer Sentiment",
y = "Sales (Unit Boxes)") +
theme_minimal()
# Histogram of sales_unitboxes
ggplot(cocacola, aes(x = sales_unitboxes)) +
geom_histogram(binwidth = 50000, fill = "blue", color = "black", alpha = 2) +
labs(title = "Histogram of Coca-Cola Sales (Unit Boxes)",
x = "Sales Unit Boxes",
y = "Frequency") +
theme_minimal()
# Categorize unemployment rate into "Low", "Medium", and "High"
cocacola$unemp_rate_category <- cut(cocacola$unemp_rate,
breaks = quantile(cocacola$unemp_rate, probs = c(0, 0.33, 0.67, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE)
# Boxplot of sales by unemployment rate category
ggplot(cocacola, aes(x = unemp_rate_category, y = sales_unitboxes, fill = unemp_rate_category)) +
geom_boxplot() +
labs(title = "Sales Distribution by Unemployment Rate",
x = "Unemployment Rate Category",
y = "Sales (Unit Boxes)") +
theme_minimal() +
scale_fill_manual(values = c("lightblue", "lightgreen", "red"))
# Select relevant numeric variables
numeric_vars <- cocacola[, sapply(cocacola, is.numeric)]
# Calculate correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
# Create a correlation plot
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.cex = 0.8,
title = "Correlation Plot",
mar=c(0,0,1,0))
# Scatter plot of sales vs. max_temperature with a smoother
ggplot(cocacola, aes(x = max_temperature, y = sales_unitboxes)) +
geom_point(color = "pink") +
geom_smooth(method = "loess", color = "blue") +
labs(title = "Sales vs. Maximum Temperature",
x = "Maximum Temperature (°C)",
y = "Sales (Unit Boxes)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Time series plot of sales_unitboxes over time
ggplot(cocacola, aes(x = tperiod, y = sales_unitboxes)) +
geom_line(color = "blue") +
labs(title = "Coca-Cola Sales Over Time",
x = "Time Period",
y = "Sales (Unit Boxes)") +
theme_minimal() +
scale_x_date(date_labels = "%Y-%m", date_breaks = "4 months")
# Perform the Augmented Dickey-Fuller Test
adf_result <- adf.test(cocacola$sales_unitboxes, alternative = "stationary")
## Warning in adf.test(cocacola$sales_unitboxes, alternative = "stationary"):
## p-value smaller than printed p-value
# Display the result
print(adf_result)
##
## Augmented Dickey-Fuller Test
##
## data: cocacola$sales_unitboxes
## Dickey-Fuller = -4.4282, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
sales_ts <- ts(cocacola$sales_unitboxes, start = c(2015, 1), frequency = 12)
# Decompose the time series
decomposed_sales <- decompose(sales_ts)
# Plot the decomposition
plot(decomposed_sales)
Higher temperatures are associated with increased Coca-Cola sales in the Guadalajara Metropolitan Area.
The hypothesis suggests that as temperatures rise, Coca-Cola sales will increase. This relationship is grounded in consumer behavior and market dynamics, where higher temperatures typically drive demand for cold and refreshing beverages like soft drinks. During hotter months, people are more likely to seek out drinks to cool down, leading to increased sales. This pattern is particularly evident in regions with warmer climates or during seasonal peaks like summer, when the consumption of cold drinks tends to surge.
Higher GDP per capita is associated with increased Coca-Cola sales in the Guadalajara Metropolitan Area.
This hypothesis suggests that as GDP per capita increases, indicating higher average income levels, Coca-Cola sales will also rise. Higher GDP per capita generally reflects greater economic prosperity, which can lead to increased consumer spending on both essential and non-essential goods, including beverages like Coca-Cola. As people’s disposable income grows, they are more likely to indulge in such products, driving sales upward in the region.
Positive consumer sentiment is associated with higher Coca-Cola sales in the Guadalajara Metropolitan Area.
Consumer sentiment reflects how optimistic or pessimistic consumers feel about the economy, their financial stability, and their future purchasing power. When consumer sentiment is high, people are more confident in their financial situation, leading them to spend more on non-essential goods, including beverages like Coca-Cola. Conversely, when consumer sentiment is low, individuals are more likely to save money and cut back on discretionary purchases.
modelo_regresion <- lm(sales_unitboxes ~ consumer_sentiment + gdp_percapita + pop_density +
max_temperature, data = cocacola)
summary(modelo_regresion)
##
## Call:
## lm(formula = sales_unitboxes ~ consumer_sentiment + gdp_percapita +
## pop_density + max_temperature, data = cocacola)
##
## Residuals:
## Min 1Q Median 3Q Max
## -779205 -261599 35440 200843 714010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.772e+07 5.913e+06 -4.688 2.79e-05 ***
## consumer_sentiment 3.942e+04 2.131e+04 1.849 0.071288 .
## gdp_percapita -2.308e+03 6.289e+02 -3.669 0.000667 ***
## pop_density 5.451e+05 1.230e+05 4.433 6.32e-05 ***
## max_temperature 1.806e+05 2.294e+04 7.872 7.20e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 374600 on 43 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6086
## F-statistic: 19.27 on 4 and 43 DF, p-value: 3.807e-09
modelo_regresion_polinomial <- lm(sales_unitboxes ~ consumer_sentiment +
gdp_percapita + poly(pop_density, 2) +
poly(max_temperature, 2), data = cocacola)
summary(modelo_regresion_polinomial)
##
## Call:
## lm(formula = sales_unitboxes ~ consumer_sentiment + gdp_percapita +
## poly(pop_density, 2) + poly(max_temperature, 2), data = cocacola)
##
## Residuals:
## Min 1Q Median 3Q Max
## -540894 -245141 46915 230242 767398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26192332.6 7244641.1 3.615 0.000813 ***
## consumer_sentiment 88845.8 26264.3 3.383 0.001589 **
## gdp_percapita -1921.7 570.3 -3.369 0.001650 **
## poly(pop_density, 2)1 3815475.7 1008644.1 3.783 0.000496 ***
## poly(pop_density, 2)2 -1449081.0 481319.5 -3.011 0.004448 **
## poly(max_temperature, 2)1 3040530.6 374009.2 8.130 4.4e-10 ***
## poly(max_temperature, 2)2 806901.8 357052.1 2.260 0.029205 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 333700 on 41 degrees of freedom
## Multiple R-squared: 0.729, Adjusted R-squared: 0.6893
## F-statistic: 18.38 on 6 and 41 DF, p-value: 3.161e-10
X <- model.matrix(sales_unitboxes ~ consumer_sentiment + gdp_percapita + pop_density + max_temperature, data = cocacola)[, -1]
y <- cocacola$sales_unitboxes
modelo_ridge <- glmnet(X, y, alpha = 0)
cv_ridge <- cv.glmnet(X, y, alpha = 0)
best_lambda <- cv_ridge$lambda.min
modelo_ridge_final <- glmnet(X, y, alpha = 0, lambda = best_lambda)
coeficientes_ridge <- coef(modelo_ridge_final)
print(coeficientes_ridge)
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -17580715.971
## consumer_sentiment 51880.120
## gdp_percapita -1047.997
## pop_density 298031.172
## max_temperature 153600.309
X <- model.matrix(sales_unitboxes ~ consumer_sentiment + gdp_percapita + pop_density + max_temperature, data = cocacola)[, -1]
y <- cocacola$sales_unitboxes
modelo_lasso <- glmnet(X, y, alpha = 1)
cv_lasso <- cv.glmnet(X, y, alpha = 1)
best_lambda <- cv_lasso$lambda.min
modelo_lasso_final <- glmnet(X, y, alpha = 1, lambda = best_lambda)
coeficientes_lasso <- coef(modelo_lasso_final)
print(coeficientes_lasso)
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -27031981.630
## consumer_sentiment 40047.354
## gdp_percapita -2219.646
## pop_density 528119.040
## max_temperature 178788.683
vif(modelo_regresion)
## consumer_sentiment gdp_percapita pop_density max_temperature
## 1.270250 8.367803 8.426768 1.222982
vif(modelo_regresion_polinomial)
## GVIF Df GVIF^(1/(2*Df))
## consumer_sentiment 2.430150 1 1.558894
## gdp_percapita 8.669659 1 2.944429
## poly(pop_density, 2) 18.236868 2 2.066510
## poly(max_temperature, 2) 1.423875 2 1.092366
bptest(modelo_regresion)
##
## studentized Breusch-Pagan test
##
## data: modelo_regresion
## BP = 1.8308, df = 4, p-value = 0.7668
bptest(modelo_regresion_polinomial)
##
## studentized Breusch-Pagan test
##
## data: modelo_regresion_polinomial
## BP = 5.0238, df = 6, p-value = 0.5408
shapiro.test(residuals(modelo_regresion))
##
## Shapiro-Wilk normality test
##
## data: residuals(modelo_regresion)
## W = 0.97373, p-value = 0.3515
shapiro.test(residuals(modelo_regresion_polinomial))
##
## Shapiro-Wilk normality test
##
## data: residuals(modelo_regresion_polinomial)
## W = 0.968, p-value = 0.2115
dwtest(modelo_regresion)
##
## Durbin-Watson test
##
## data: modelo_regresion
## DW = 1.8731, p-value = 0.1713
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(modelo_regresion_polinomial)
##
## Durbin-Watson test
##
## data: modelo_regresion_polinomial
## DW = 1.8868, p-value = 0.1378
## alternative hypothesis: true autocorrelation is greater than 0
# Lineal
pred_regresion <- predict(modelo_regresion, newdata = cocacola)
mse_regresion <- mean((pred_regresion - cocacola$sales_unitboxes)^2)
r2_regresion <- summary(modelo_regresion)$r.squared
rmse_regresion <- sqrt(mse_regresion)
aic_regresion <- AIC(modelo_regresion)
# Polinomial
pred_polinomial <- predict(modelo_regresion_polinomial, newdata = cocacola)
mse_polinomial <- mean((pred_polinomial - cocacola$sales_unitboxes)^2)
r2_polinomial <- summary(modelo_regresion_polinomial)$r.squared
rmse_polinomial <- sqrt(mse_polinomial)
aic_polinomial <- AIC(modelo_regresion_polinomial)
# Ridge
pred_ridge <- predict(modelo_ridge_final, newx = X)
rss_ridge <- sum((pred_ridge - y)^2)
tss <- sum((y - mean(y))^2)
r2_ridge <- 1 - (rss_ridge / tss)
mse_ridge <- mean((pred_ridge - y)^2)
rmse_ridge <- sqrt(mse_ridge)
n_parametros_ridge <- sum(coef(modelo_ridge_final) != 0)
rss_ridge <- sum((y - pred_ridge)^2)
sigma2_ridge <- rss_ridge / length(y)
log_lik_ridge <- -0.5 * length(y) * (log(2 * pi) + log(sigma2_ridge) + 1)
aic_ridge <- 2 * n_parametros_ridge - 2 * log_lik_ridge
pred_lasso <- predict(modelo_lasso_final, newx = X)
# Lasso
rss_lasso <- sum((pred_lasso - y)^2) # Residual sum of squares
tss <- sum((y - mean(y))^2) # Total sum of squares
r2_lasso <- 1 - (rss_lasso / tss)
mse_lasso <- mean((pred_lasso - y)^2)
rmse_lasso <- sqrt(mse_lasso)
n_parametros_lasso <- sum(coef(modelo_lasso_final) != 0)
rss_lasso <- sum((y - pred_lasso)^2) # Suma de los residuos al cuadrado
sigma2_lasso <- rss_lasso / length(y) # Estimación de la varianza residual
log_lik_lasso <- -0.5 * length(y) * (log(2 * pi) + log(sigma2_lasso) + 1)
aic_lasso <- 2 * n_parametros_lasso - 2 * log_lik_lasso
cat("Resultados del Modelo Lineal:\n")
## Resultados del Modelo Lineal:
cat("MSE:", mse_regresion, "\n")
## MSE: 125685868520
cat("R^2:", r2_regresion, "\n")
## R^2: 0.6419302
cat("RMSE:", rmse_regresion, "\n")
## RMSE: 354522
cat("AIC:", aic_regresion, "\n\n")
## AIC: 1374.957
cat("Resultados del Modelo Polinomial:\n")
## Resultados del Modelo Polinomial:
cat("MSE:", mse_polinomial, "\n")
## MSE: 95123042885
cat("R^2:", r2_polinomial, "\n")
## R^2: 0.7290014
cat("RMSE:", rmse_polinomial, "\n")
## RMSE: 308420.2
cat("AIC:", aic_polinomial, "\n\n")
## AIC: 1365.583
cat("Resultados del Modelo Ridge:\n")
## Resultados del Modelo Ridge:
cat("MSE:", mse_ridge, "\n")
## MSE: 1.38552e+11
cat("R^2:", r2_ridge, "\n")
## R^2: 0.6052754
cat("RMSE:", rmse_ridge, "\n")
## RMSE: 372225.8
cat("AIC:", aic_ridge, "\n\n")
## AIC: 1377.635
cat("Resultados del Modelo Lasso:\n")
## Resultados del Modelo Lasso:
cat("MSE:", mse_lasso, "\n")
## MSE: 125748172554
cat("R^2:", r2_lasso, "\n")
## R^2: 0.6417527
cat("RMSE:", rmse_lasso, "\n")
## RMSE: 354609.9
cat("AIC:", aic_lasso, "\n\n")
## AIC: 1372.98
resultados <- data.frame(
Modelo = c("Lineal", "Polinomial", "Ridge", "Lasso"),
MSE = c(mse_regresion, mse_polinomial, mse_ridge, mse_lasso),
R2 = c(r2_regresion, r2_polinomial, r2_ridge, r2_lasso),
RMSE = c(rmse_regresion, rmse_polinomial, rmse_ridge, rmse_lasso),
AIC = c(aic_regresion, aic_polinomial, aic_ridge, aic_lasso)
)
# Gráfica para MSE
ggplot(resultados, aes(x = Modelo, y = MSE)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Comparison of MSE between Models", y = "MSE", x = "Model") +
theme_minimal()
# Gráfica para R^2
ggplot(resultados, aes(x = Modelo, y = R2)) +
geom_bar(stat = "identity", fill = "darkgreen") +
labs(title = "Comparison of R^2 between Models", y = "R^2", x = "Model") +
theme_minimal()
# Gráfica para RMSE
ggplot(resultados, aes(x = Modelo, y = RMSE)) +
geom_bar(stat = "identity", fill = "coral") +
labs(title = "Comparison of RMSE between Models", y = "RMSE", x = "Model") +
theme_minimal()
# Gráfica para AIC
ggplot(resultados, aes(x = Modelo, y = AIC)) +
geom_bar(stat = "identity", fill = "purple") +
labs(title = "Comparison of AIC between Models", y = "AIC", x = "Model") +
theme_minimal()
After making all of the models, and comparing them between each other, we can see that the most accurate is the polynomial model. We can obtain this conclusion based on the RMSE we obtained for all the models. A lower RMSE indicates that the values obtained are closer to the real selling values, and the polynomial model has the lowest by a considerable margin. Apart from this fact, the R2 is the highest on the polynomial model. This indicates that the variability is well-explained by the model. In other more simple words, the data fits the model nicely, and so it makes the most sense to utilize it. The third metric we used is MSE. In this case, I honestly forgot what this metric measured, but after some investigation, I concluded that it measures the accuracy of the model, and similarly to the RMSE, we are looking for the lowest value possible. Similarly to the past metrics, the polynomial model is also the lowest, which is good. Lastly, the AIC is also the lowest in the polynomial model, even if it is by a small margin. Lower AIC means a good balance between fit and complexity in a model, which means that this model is appropriate for the analysis being made.
Here are some insights obtained from the model:
As consumer sentiment increases, sales also increase. This is statistically significant (p = 0.0016) suggesting this variable is a key factor.
The GDP per-capita has a negative correlation with the sales. This seems counter-intuitive, but it may be based around changing consumer patterns.
There is a very significant positive impact in sales when the temperature is higher. This makes sense, considering Coca-Cola is a popular refresher when temperatures are high.
The R2 is of .689, which indicates a good fit, and the F statistic is highly significant, which tells us the model is effective in predicting sales patterns.
The VIF value indicates moderate multicollinearity, mainly for the GDP per Capita variable. This indicates that this variable is correlated to other variables in the model, which can lead to less prediction.
Based on the Breusch-Pagan test, we can conclude that heteroscedasticity isn’t an issue in the model, considering the p-value is of 0.5408 and the BP is of 5.02.