FEMSA is a Latin American phenomenom that has taken over a variety of industries all around the country, and it sells to over a 100 million costumers. It shapes the economy in almost all of these countries, because of it’s enormous impact and immense array of products sold. For this report, we’ll be specifically talking about Coca-Cola, the worldwide beverage company, since FEMSA is in charge of bottling for Coca-Cola in Latin America.
Linear regression is a tool that helps data analysts spot relationships between different variables that cause fluctuations and other types of reactions. By using this tool, and comparing against data from past years, we can hope to identify how Coca-Cola’s sales are impacted in Latin America. With the use of this tool, we are hoping to identify data that can help Coca-Cola predict variables such as demand, sales, economic fluctuations, and the behavior of their market as a whole.
For this problem situation, we are looking towards predicting the sales in units in the Metropolitan Area of Guadalajara, Jalisco. Being one of the most important states in Mexican economy, it is imperative we develop hypotheses, and prove them, to be able to improve Coca-Cola’s sales in the future. We will be using a wide variety of regression models to estimate the impact of these variables, and afterwards we will run them against each other to spot the most accurate. After these steps, we will develop insights to make decisions based on our analysis and past data.
# missing values in each column
vis_miss(cocacola)
str(cocacola)
## tibble [48 × 15] (S3: tbl_df/tbl/data.frame)
## $ tperiod : POSIXct[1:48], format: "2021-01-15" "2021-02-15" ...
## $ sales_unitboxes : num [1:48] 5516689 5387496 5886747 6389182 6448275 ...
## $ consumer_sentiment: num [1:48] 38.1 37.5 38.5 37.8 38 ...
## $ CPI : num [1:48] 87.1 87.3 87.6 87.4 87 ...
## $ inflation_rate : num [1:48] -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
## $ unemp_rate : num [1:48] 0.0523 0.0531 0.0461 0.051 0.0552 ...
## $ gdp_percapita : num [1:48] 11660 11660 11660 11626 11626 ...
## $ itaee : num [1:48] 104 104 104 108 108 ...
## $ itaee_growth : num [1:48] 0.0497 0.0497 0.0497 0.0318 0.0318 ...
## $ pop_density : num [1:48] 98.5 98.5 98.5 98.8 98.8 ...
## $ job_density : num [1:48] 18.3 18.5 18.6 18.7 18.7 ...
## $ pop_minwage : num [1:48] 9.66 9.66 9.66 9.59 9.59 ...
## $ exchange_rate : num [1:48] 14.7 14.9 15.2 15.2 15.3 ...
## $ max_temperature : num [1:48] 28 31 29 32 34 32 29 29 29 29 ...
## $ holiday_month : num [1:48] 0 0 0 1 0 0 0 0 1 0 ...
cocacola$holiday_month <- as.factor(cocacola$holiday_month)
# Convert 'tperiod' to a Date object if it isn't already one
cocacola$tperiod <- as.Date(cocacola$tperiod, format = "%d-%m-%Y")
# Extract the day (which represents the year) and the month
day_month <- format(cocacola$tperiod, "%d-%m")
# Convert the day to a proper year (e.g., 0015 to 2015) and create the correct date
corrected_year <- as.numeric(substr(day_month, 1, 2)) + 2000
corrected_month_day <- paste0(corrected_year, "-", substr(day_month, 4, 5))
# Convert back to Date object
cocacola$tperiod <- as.Date(paste0(corrected_month_day, "-01"), format = "%Y-%m-%d")
# Check the results
head(cocacola$tperiod)
## [1] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01"
## [6] "2015-06-01"
str(cocacola)
## tibble [48 × 15] (S3: tbl_df/tbl/data.frame)
## $ tperiod : Date[1:48], format: "2015-01-01" "2015-02-01" ...
## $ sales_unitboxes : num [1:48] 5516689 5387496 5886747 6389182 6448275 ...
## $ consumer_sentiment: num [1:48] 38.1 37.5 38.5 37.8 38 ...
## $ CPI : num [1:48] 87.1 87.3 87.6 87.4 87 ...
## $ inflation_rate : num [1:48] -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
## $ unemp_rate : num [1:48] 0.0523 0.0531 0.0461 0.051 0.0552 ...
## $ gdp_percapita : num [1:48] 11660 11660 11660 11626 11626 ...
## $ itaee : num [1:48] 104 104 104 108 108 ...
## $ itaee_growth : num [1:48] 0.0497 0.0497 0.0497 0.0318 0.0318 ...
## $ pop_density : num [1:48] 98.5 98.5 98.5 98.8 98.8 ...
## $ job_density : num [1:48] 18.3 18.5 18.6 18.7 18.7 ...
## $ pop_minwage : num [1:48] 9.66 9.66 9.66 9.59 9.59 ...
## $ exchange_rate : num [1:48] 14.7 14.9 15.2 15.2 15.3 ...
## $ max_temperature : num [1:48] 28 31 29 32 34 32 29 29 29 29 ...
## $ holiday_month : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 2 1 ...
sales_unitboxes: This variable is dependent, and we will be using the other variables to try and prove why this occurs.
inflation_rate: Inflation is a variable that is very logically intertwined with sales. With a higher inflation, people’s money is worth less, and so they have less to spend o regular basis products that aren’t essential.
exchange_rate: The exchange rate is highly important when different economies are involved. Because Coca-Cola is an international product, it is highly likely import values will be different in the distinct countries involved.
consumer_sentiment: Consumer sentiment reflects how customers feel towards their current income and money being generated. If customers feel a high degree of safety, they are more likely to indulge in Coca-Cola’s products.
unemp_rate: Unemployment rate is an incredibly important variable, since a higher unemployment reflects in less money circulating between Mexican citizens. Since unemployment is more common in lower-classes (which is Coca-Cola’s biggest market), this variable is likely to be of significance.
# Compute basic descriptive statistics using summary
summary_stats <- summary(cocacola[, c("sales_unitboxes", "inflation_rate", "exchange_rate", "consumer_sentiment", "unemp_rate")])
# Display the basic summary statistics
print(summary_stats)
## sales_unitboxes inflation_rate exchange_rate consumer_sentiment
## Min. :5301755 Min. :-0.5000 Min. :14.69 Min. :28.67
## 1st Qu.:6171767 1st Qu.: 0.1650 1st Qu.:17.38 1st Qu.:35.64
## Median :6461357 Median : 0.3850 Median :18.62 Median :36.76
## Mean :6473691 Mean : 0.3485 Mean :18.18 Mean :37.15
## 3rd Qu.:6819782 3rd Qu.: 0.5575 3rd Qu.:19.06 3rd Qu.:38.14
## Max. :7963063 Max. : 1.7000 Max. :21.39 Max. :44.87
## unemp_rate
## Min. :0.03466
## 1st Qu.:0.04010
## Median :0.04369
## Mean :0.04442
## 3rd Qu.:0.04897
## Max. :0.05517
# Compute additional descriptive statistics using describe function from psych package
describe_stats <- describe(cocacola[, c("sales_unitboxes", "inflation_rate", "exchange_rate", "consumer_sentiment", "unemp_rate")])
# Display the additional descriptive statistics
print(describe_stats)
## cocacola[, c("sales_unitboxes", "inflation_rate", "exchange_rate", "consumer_sentiment", "unemp_rate")]
##
## 5 Variables 48 Observations
## --------------------------------------------------------------------------------
## sales_unitboxes
## n missing distinct Info Mean Gmd .05 .10
## 48 0 48 1 6473691 680321 5491459 5576844
## .25 .50 .75 .90 .95
## 6171767 6461357 6819782 7288957 7396022
##
## lowest : 5301750 5387500 5477870 5516690 5568550
## highest: 7330140 7345040 7423480 7457470 7963060
## --------------------------------------------------------------------------------
## inflation_rate
## n missing distinct Info Mean Gmd .05 .10
## 48 0 41 0.999 0.3485 0.4164 -0.3330 -0.1900
## .25 .50 .75 .90 .95
## 0.1650 0.3850 0.5575 0.6510 0.8255
##
## lowest : -0.5 -0.45 -0.34 -0.32 -0.26, highest: 0.7 0.78 0.85 1.03 1.7
## --------------------------------------------------------------------------------
## exchange_rate
## n missing distinct Info Mean Gmd .05 .10
## 48 0 48 1 18.18 1.797 15.23 15.42
## .25 .50 .75 .90 .95
## 17.38 18.62 19.06 20.16 20.30
##
## lowest : 14.6926 14.9213 15.2262 15.2283 15.2645
## highest: 20.2612 20.2905 20.3032 20.5206 21.3853
## --------------------------------------------------------------------------------
## consumer_sentiment
## n missing distinct Info Mean Gmd .05 .10
## 48 0 48 1 37.15 3.041 33.93 34.63
## .25 .50 .75 .90 .95
## 35.64 36.76 38.14 41.81 42.84
##
## lowest : 28.6679 31.5156 33.7951 34.1893 34.3367
## highest: 42.1327 42.533 43.0057 43.3411 44.8654
## --------------------------------------------------------------------------------
## unemp_rate
## n missing distinct Info Mean Gmd .05 .10
## 48 0 48 1 0.04442 0.006762 0.03648 0.03747
## .25 .50 .75 .90 .95
## 0.04010 0.04369 0.04897 0.05373 0.05413
##
## lowest : 0.0346622 0.0358722 0.0364139 0.0365965 0.0367783
## highest: 0.0538359 0.0539483 0.0542306 0.0547338 0.0551745
## --------------------------------------------------------------------------------
# Create a density plot for the sales_unitboxes variable
ggplot(cocacola, aes(x = sales_unitboxes)) +
geom_density(fill = "blue", alpha = 0.7) +
labs(title = "Density Plot of Coca-Cola Sales (Unit Boxes)",
x = "Sales Unit Boxes",
y = "Density") +
theme_minimal()
# Scatter plot between sales_unitboxes and inflation_rate
ggplot(cocacola, aes(x = inflation_rate, y = sales_unitboxes)) +
geom_point(color = "blue") +
labs(title = "Sales vs. Inflation Rate",
x = "Inflation Rate",
y = "Sales (Unit Boxes)") +
theme_minimal()
# Scatter plot between sales_unitboxes and consumer_sentiment
ggplot(cocacola, aes(x = consumer_sentiment, y = sales_unitboxes)) +
geom_point(color = "green") +
labs(title = "Sales vs. Consumer Sentiment",
x = "Consumer Sentiment",
y = "Sales (Unit Boxes)") +
theme_minimal()
# Histogram of sales_unitboxes
ggplot(cocacola, aes(x = sales_unitboxes)) +
geom_histogram(binwidth = 50000, fill = "blue", color = "black", alpha = 2) +
labs(title = "Histogram of Coca-Cola Sales (Unit Boxes)",
x = "Sales Unit Boxes",
y = "Frequency") +
theme_minimal()
# Categorize unemployment rate into "Low", "Medium", and "High"
cocacola$unemp_rate_category <- cut(cocacola$unemp_rate,
breaks = quantile(cocacola$unemp_rate, probs = c(0, 0.33, 0.67, 1), na.rm = TRUE),
labels = c("Low", "Medium", "High"),
include.lowest = TRUE)
# Boxplot of sales by unemployment rate category
ggplot(cocacola, aes(x = unemp_rate_category, y = sales_unitboxes, fill = unemp_rate_category)) +
geom_boxplot() +
labs(title = "Sales Distribution by Unemployment Rate",
x = "Unemployment Rate Category",
y = "Sales (Unit Boxes)") +
theme_minimal() +
scale_fill_manual(values = c("lightblue", "lightgreen", "red"))
# Select relevant numeric variables
numeric_vars <- cocacola[, sapply(cocacola, is.numeric)]
# Calculate correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
# Create a correlation plot
corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.cex = 0.8,
title = "Correlation Plot",
mar=c(0,0,1,0))
# Scatter plot of sales vs. max_temperature with a smoother
ggplot(cocacola, aes(x = max_temperature, y = sales_unitboxes)) +
geom_point(color = "pink") +
geom_smooth(method = "loess", color = "blue") +
labs(title = "Sales vs. Maximum Temperature",
x = "Maximum Temperature (°C)",
y = "Sales (Unit Boxes)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Time series plot of sales_unitboxes over time
ggplot(cocacola, aes(x = tperiod, y = sales_unitboxes)) +
geom_line(color = "blue") +
labs(title = "Coca-Cola Sales Over Time",
x = "Time Period",
y = "Sales (Unit Boxes)") +
theme_minimal() +
scale_x_date(date_labels = "%Y-%m", date_breaks = "4 months")
# Perform the Augmented Dickey-Fuller Test
adf_result <- adf.test(cocacola$sales_unitboxes, alternative = "stationary")
## Warning in adf.test(cocacola$sales_unitboxes, alternative = "stationary"):
## p-value smaller than printed p-value
# Display the result
print(adf_result)
##
## Augmented Dickey-Fuller Test
##
## data: cocacola$sales_unitboxes
## Dickey-Fuller = -4.4282, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
sales_ts <- ts(cocacola$sales_unitboxes, start = c(2015, 1), frequency = 12)
# Decompose the time series
decomposed_sales <- decompose(sales_ts)
# Plot the decomposition
plot(decomposed_sales)
High temperatures (and in return, summer months) have a positive impact on Coca-Cola’s sales.
This hypothesis is self-explanatory if you understand how Coca-Cola’s products function. Since Coca-Cola sells fizzy drinks, which are extremely refreshing and best served cold, it makes sense that summer months are likely to be the peak in Coca-Cola’s sales. On the other hand, cold months are associated with warm drinks, such as coffee and tea. Since these drinks compete with Coca-Cola, their sales are likely to go down.
Lower GDP per capita is negatively correlated with Coca-Cola sales in the Metropolitan Area of Guadalajara.
This hypothesis is suggesting that a lower GDP in a certain area, will decrease Coca-Cola’s sales. This is due to the fact that a low GDP indicates that there isn’t much money being generated in a certain geographical sector, and so there is less money to spend. Because of this, people in these specific areas either don’t have money to spend, or simply differ from buying Coca-Cola since there are other basic needs that require their income to go into.
Higher consumer sentiment in the Guadalajara Metropolitan Area is positively correlated with sales
What the consumer sentiment tells us is how confident people are in their financial stability. If they consider their financial stability to be at risk, whether it be based on economic or political factors, they are less likely to spend on products they can avoid. A big example of this would be Coca-Cola, due to the fact that it’s a low-income centered good, and people are likely to avoid buying it if they feel the need to.
modelo_regresion <- lm(sales_unitboxes ~ consumer_sentiment + gdp_percapita + pop_density +
max_temperature, data = cocacola)
summary(modelo_regresion)
##
## Call:
## lm(formula = sales_unitboxes ~ consumer_sentiment + gdp_percapita +
## pop_density + max_temperature, data = cocacola)
##
## Residuals:
## Min 1Q Median 3Q Max
## -779205 -261599 35440 200843 714010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.772e+07 5.913e+06 -4.688 2.79e-05 ***
## consumer_sentiment 3.942e+04 2.131e+04 1.849 0.071288 .
## gdp_percapita -2.308e+03 6.289e+02 -3.669 0.000667 ***
## pop_density 5.451e+05 1.230e+05 4.433 6.32e-05 ***
## max_temperature 1.806e+05 2.294e+04 7.872 7.20e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 374600 on 43 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6086
## F-statistic: 19.27 on 4 and 43 DF, p-value: 3.807e-09
modelo_regresion_polinomial <- lm(sales_unitboxes ~ consumer_sentiment +
gdp_percapita + poly(pop_density, 2) +
poly(max_temperature, 2), data = cocacola)
summary(modelo_regresion_polinomial)
##
## Call:
## lm(formula = sales_unitboxes ~ consumer_sentiment + gdp_percapita +
## poly(pop_density, 2) + poly(max_temperature, 2), data = cocacola)
##
## Residuals:
## Min 1Q Median 3Q Max
## -540894 -245141 46915 230242 767398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26192332.6 7244641.1 3.615 0.000813 ***
## consumer_sentiment 88845.8 26264.3 3.383 0.001589 **
## gdp_percapita -1921.7 570.3 -3.369 0.001650 **
## poly(pop_density, 2)1 3815475.7 1008644.1 3.783 0.000496 ***
## poly(pop_density, 2)2 -1449081.0 481319.5 -3.011 0.004448 **
## poly(max_temperature, 2)1 3040530.6 374009.2 8.130 4.4e-10 ***
## poly(max_temperature, 2)2 806901.8 357052.1 2.260 0.029205 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 333700 on 41 degrees of freedom
## Multiple R-squared: 0.729, Adjusted R-squared: 0.6893
## F-statistic: 18.38 on 6 and 41 DF, p-value: 3.161e-10
X <- model.matrix(sales_unitboxes ~ consumer_sentiment + gdp_percapita + pop_density + max_temperature, data = cocacola)[, -1]
y <- cocacola$sales_unitboxes
modelo_ridge <- glmnet(X, y, alpha = 0)
cv_ridge <- cv.glmnet(X, y, alpha = 0)
best_lambda <- cv_ridge$lambda.min
modelo_ridge_final <- glmnet(X, y, alpha = 0, lambda = best_lambda)
coeficientes_ridge <- coef(modelo_ridge_final)
print(coeficientes_ridge)
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -17580715.971
## consumer_sentiment 51880.120
## gdp_percapita -1047.997
## pop_density 298031.172
## max_temperature 153600.309
X <- model.matrix(sales_unitboxes ~ consumer_sentiment + gdp_percapita + pop_density + max_temperature, data = cocacola)[, -1]
y <- cocacola$sales_unitboxes
modelo_lasso <- glmnet(X, y, alpha = 1)
cv_lasso <- cv.glmnet(X, y, alpha = 1)
best_lambda <- cv_lasso$lambda.min
modelo_lasso_final <- glmnet(X, y, alpha = 1, lambda = best_lambda)
coeficientes_lasso <- coef(modelo_lasso_final)
print(coeficientes_lasso)
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -27427273.838
## consumer_sentiment 39710.818
## gdp_percapita -2269.745
## pop_density 537812.435
## max_temperature 179847.854
vif(modelo_regresion)
## consumer_sentiment gdp_percapita pop_density max_temperature
## 1.270250 8.367803 8.426768 1.222982
vif(modelo_regresion_polinomial)
## GVIF Df GVIF^(1/(2*Df))
## consumer_sentiment 2.430150 1 1.558894
## gdp_percapita 8.669659 1 2.944429
## poly(pop_density, 2) 18.236868 2 2.066510
## poly(max_temperature, 2) 1.423875 2 1.092366
bptest(modelo_regresion)
##
## studentized Breusch-Pagan test
##
## data: modelo_regresion
## BP = 1.8308, df = 4, p-value = 0.7668
bptest(modelo_regresion_polinomial)
##
## studentized Breusch-Pagan test
##
## data: modelo_regresion_polinomial
## BP = 5.0238, df = 6, p-value = 0.5408
shapiro.test(residuals(modelo_regresion))
##
## Shapiro-Wilk normality test
##
## data: residuals(modelo_regresion)
## W = 0.97373, p-value = 0.3515
shapiro.test(residuals(modelo_regresion_polinomial))
##
## Shapiro-Wilk normality test
##
## data: residuals(modelo_regresion_polinomial)
## W = 0.968, p-value = 0.2115
dwtest(modelo_regresion)
##
## Durbin-Watson test
##
## data: modelo_regresion
## DW = 1.8731, p-value = 0.1713
## alternative hypothesis: true autocorrelation is greater than 0
dwtest(modelo_regresion_polinomial)
##
## Durbin-Watson test
##
## data: modelo_regresion_polinomial
## DW = 1.8868, p-value = 0.1378
## alternative hypothesis: true autocorrelation is greater than 0
# Lineal
pred_regresion <- predict(modelo_regresion, newdata = cocacola)
mse_regresion <- mean((pred_regresion - cocacola$sales_unitboxes)^2)
r2_regresion <- summary(modelo_regresion)$r.squared
rmse_regresion <- sqrt(mse_regresion)
aic_regresion <- AIC(modelo_regresion)
# Polinomial
pred_polinomial <- predict(modelo_regresion_polinomial, newdata = cocacola)
mse_polinomial <- mean((pred_polinomial - cocacola$sales_unitboxes)^2)
r2_polinomial <- summary(modelo_regresion_polinomial)$r.squared
rmse_polinomial <- sqrt(mse_polinomial)
aic_polinomial <- AIC(modelo_regresion_polinomial)
# Ridge
pred_ridge <- predict(modelo_ridge_final, newx = X)
rss_ridge <- sum((pred_ridge - y)^2)
tss <- sum((y - mean(y))^2)
r2_ridge <- 1 - (rss_ridge / tss)
mse_ridge <- mean((pred_ridge - y)^2)
rmse_ridge <- sqrt(mse_ridge)
n_parametros_ridge <- sum(coef(modelo_ridge_final) != 0)
rss_ridge <- sum((y - pred_ridge)^2)
sigma2_ridge <- rss_ridge / length(y)
log_lik_ridge <- -0.5 * length(y) * (log(2 * pi) + log(sigma2_ridge) + 1)
aic_ridge <- 2 * n_parametros_ridge - 2 * log_lik_ridge
pred_lasso <- predict(modelo_lasso_final, newx = X)
# Lasso
rss_lasso <- sum((pred_lasso - y)^2) # Residual sum of squares
tss <- sum((y - mean(y))^2) # Total sum of squares
r2_lasso <- 1 - (rss_lasso / tss)
mse_lasso <- mean((pred_lasso - y)^2)
rmse_lasso <- sqrt(mse_lasso)
n_parametros_lasso <- sum(coef(modelo_lasso_final) != 0)
rss_lasso <- sum((y - pred_lasso)^2) # Suma de los residuos al cuadrado
sigma2_lasso <- rss_lasso / length(y) # Estimación de la varianza residual
log_lik_lasso <- -0.5 * length(y) * (log(2 * pi) + log(sigma2_lasso) + 1)
aic_lasso <- 2 * n_parametros_lasso - 2 * log_lik_lasso
cat("Resultados del Modelo Lineal:\n")
## Resultados del Modelo Lineal:
cat("MSE:", mse_regresion, "\n")
## MSE: 125685868520
cat("R^2:", r2_regresion, "\n")
## R^2: 0.6419302
cat("RMSE:", rmse_regresion, "\n")
## RMSE: 354522
cat("AIC:", aic_regresion, "\n\n")
## AIC: 1374.957
cat("Resultados del Modelo Polinomial:\n")
## Resultados del Modelo Polinomial:
cat("MSE:", mse_polinomial, "\n")
## MSE: 95123042885
cat("R^2:", r2_polinomial, "\n")
## R^2: 0.7290014
cat("RMSE:", rmse_polinomial, "\n")
## RMSE: 308420.2
cat("AIC:", aic_polinomial, "\n\n")
## AIC: 1365.583
cat("Resultados del Modelo Ridge:\n")
## Resultados del Modelo Ridge:
cat("MSE:", mse_ridge, "\n")
## MSE: 1.38552e+11
cat("R^2:", r2_ridge, "\n")
## R^2: 0.6052754
cat("RMSE:", rmse_ridge, "\n")
## RMSE: 372225.8
cat("AIC:", aic_ridge, "\n\n")
## AIC: 1377.635
cat("Resultados del Modelo Lasso:\n")
## Resultados del Modelo Lasso:
cat("MSE:", mse_lasso, "\n")
## MSE: 125697195057
cat("R^2:", r2_lasso, "\n")
## R^2: 0.6418979
cat("RMSE:", rmse_lasso, "\n")
## RMSE: 354538
cat("AIC:", aic_lasso, "\n\n")
## AIC: 1372.961
resultados <- data.frame(
Modelo = c("Lineal", "Polinomial", "Ridge", "Lasso"),
MSE = c(mse_regresion, mse_polinomial, mse_ridge, mse_lasso),
R2 = c(r2_regresion, r2_polinomial, r2_ridge, r2_lasso),
RMSE = c(rmse_regresion, rmse_polinomial, rmse_ridge, rmse_lasso),
AIC = c(aic_regresion, aic_polinomial, aic_ridge, aic_lasso)
)
# Gráfica para MSE
ggplot(resultados, aes(x = Modelo, y = MSE)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Comparison of MSE between Models", y = "MSE", x = "Model") +
theme_minimal()
# Gráfica para R^2
ggplot(resultados, aes(x = Modelo, y = R2)) +
geom_bar(stat = "identity", fill = "darkgreen") +
labs(title = "Comparison of R^2 between Models", y = "R^2", x = "Model") +
theme_minimal()
# Gráfica para RMSE
ggplot(resultados, aes(x = Modelo, y = RMSE)) +
geom_bar(stat = "identity", fill = "coral") +
labs(title = "Comparison of RMSE between Models", y = "RMSE", x = "Model") +
theme_minimal()
# Gráfica para AIC
ggplot(resultados, aes(x = Modelo, y = AIC)) +
geom_bar(stat = "identity", fill = "purple") +
labs(title = "Comparison of AIC between Models", y = "AIC", x = "Model") +
theme_minimal()
After making all of the models, and comparing them between each other, we can see that the most accurate is the polynomial model. We can obtain this conclusion based on the RMSE we obtained for all the models. A lower RMSE indicates that the values obtained are closer to the real selling values, and the polynomial model has the lowest by a considerable margin. Apart from this fact, the R2 is the highest on the polynomial model. This indicates that the variability is well-explained by the model. In other more simple words, the data fits the model nicely, and so it makes the most sense to utilize it. The third metric we used is MSE. In this case, I honestly forgot what this metric measured, but after some investigation, I concluded that it measures the accuracy of the model, and similarly to the RMSE, we are looking for the lowest value possible. Similarly to the past metrics, the polynomial model is also the lowest, which is good. Lastly, the AIC is also the lowest in the polynomial model, even if it is by a small margin. Lower AIC means a good balance between fit and complexity in a model, which means that this model is appropriate for the analysis being made.
Here are some insights obtained from the model:
As consumer sentiment increases, sales also increase. This is statistically significant (p = 0.0016) suggesting this variable is a key factor.
The GDP per-capita has a negative correlation with the sales. This seems counter-intuitive, but it may be based around changing consumer patterns.
There is a very significant positive impact in sales when the temperature is higher. This makes sense, considering Coca-Cola is a popular refresher when temperatures are high.
The R2 is of .689, which indicates a good fit, and the F statistic is highly significant, which tells us the model is effective in predicting sales patterns.
The VIF value indicates moderate multicollinearity, mainly for the GDP per Capita variable. This indicates that this variable is correlated to other variables in the model, which can lead to less prediction.
Based on the Breusch-Pagan test, we can conclude that heteroscedasticity isn’t an issue in the model, considering the p-value is of 0.5408 and the BP is of 5.02.