Estimating a linear regression is one of the most common applications of econometrics. A linear regression analysis estimates the linear relationship between one dependent variable and one or more independent variables.
The three main purposes of a linear regression analysis are:
• To analyze whether the independent variables explain the variation in the dependent variable.
• To identify which independent variables are significant predictors of the dependent variable.
• To estimate the magnitude and sign of these significant predictors.
For these reasons, a linear regression analysis is useful for predicting an outcome, and such predictions are very valuable for a company's success.
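In R, such a relationship is typically estimated with lm(); a minimal sketch of the general form, where y, x1, x2, and df are placeholder names rather than variables from this study:
# Fit a linear regression of a dependent variable y on independent variables x1 and x2
model <- lm(y ~ x1 + x2, data = df)
summary(model) # coefficient estimates, significance tests, and goodness of fit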
Coca-Cola FEMSA is the largest bottler in the Coca-Cola system by sales volume. With a portfolio of 129 leading multi-category brands, it serves more than 261 million people and 2 million points of sale through 49 bottling plants and 268 distribution centers in 10 Latin American countries.
Regarding Coca-Cola FEMSA's finances, its 2020 results were significantly impacted by COVID-19 and the related changes in consumer mobility and behavior across markets. When looking at the 2021 results, the 2020 comparison base is only a partial benchmark; therefore, to facilitate the assessment of each business unit's performance in 2021, the company also reports variations versus 2019 (FEMSA, 2021).
PHASE A: Have you ever wondered how many cases of soft drinks Coca-Cola produces to supply the demand of the Guadalajara Metropolitan area?
Considering the information a company like Coca-Cola generates over time (for example, sales, production, inventories, and macroeconomic variables) and the factors that influence its behavior, we intend to build a model that forecasts the target variable and thus contributes to strategic decision making.
Arca Continental is the second-largest Coca-Cola bottler in Latin America and one of the most important in the world. It serves more than 118 million consumers in Mexico, Argentina, Ecuador, the United States, and Peru, and it has a broad and innovative product portfolio that satisfies the needs of its consumers at competitive prices. The objective of this case is to learn what is involved in building a predictive model for this large company's data, specifically for the Guadalajara Metropolitan Area.
Prior to building the model, a documentary immersion into the company's data and a review of the various forecasting methods must be carried out. The final objective is to deliver a sales forecast in unit boxes (crates) at an aggregate level for the Guadalajara Metropolitan Area and for the main brands.
#DEPENDENT AND INDEPENDENT VARIABLES
#Dependent variable: “sales_unitboxes”
#Independent variables: “exchange_rate”, “gdp_percapita”, “itaee”, “holiday_month”.
I chose sales_unitboxes as the dependent variable because it is the most important variable for Coca-Cola FEMSA: sales are the company's reason for being, and they allow the company to earn income. As independent variables I chose exchange_rate, gdp_percapita, itaee, and holiday_month. I chose exchange_rate because I consider it an indicator of a country's competitiveness with the rest of the world, which is very important for Coca-Cola's sales. I chose gdp_percapita (gross domestic product per capita) because the company's sales depend on the money available to the population. I also chose itaee, the Indicator of State Economic Activity, because sales depend on this indicator. Finally, I included holiday_month, a dummy variable equal to 1 for months that contain a holiday week (public holidays, Easter, or Christmas), because sales are higher in holiday months.
#Data sources and study period (2015-2018, monthly basis)
The data source is an Excel file of Coca-Cola sales (cocacolasales.xlsx), which I now load for the study period (2015-2018, monthly basis).
#Installation of packages before loading the libraries.
#install.packages("tidyverse")
#install.packages("lubridate")
#install.packages("scales")
#install.packages("ggplot2")
#install.packages("hrbrthemes")
#install.packages("extrafont")
#install.packages("pastecs")
#install.packages("dbplyr")
#install.packages("funModeling")
#install.packages("Hmsic")
#install.packages("regclass")
#install.packages("stargazer")
#install.packages("jtools")
#install.packages("effects")
#install.packages("huxtable")
#install.packages("Metrics")
#install.packages("ggstance")
#install.packages("lmtest")
#Loading libraries
library(foreign)
library(stats)
library(tidyverse) ## A set of tools for Data manipulation and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate) ## for date time manipulation
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(scales) ## Formatting numbers and values
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(ggplot2) ## Graphs
library(graphics) ## Graphs
library(hrbrthemes) ## For changing ggplot theme
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(extrafont) ## More font options
## Registering fonts with R
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
## The following object is masked from 'package:tidyr':
##
## extract
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
library(readr)
library(funModeling) ## data visualization
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
library(Hmisc)
library(regclass)
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
##
## Attaching package: 'VGAM'
## The following object is masked from 'package:tidyr':
##
## fill
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
##
## Attaching package: 'regclass'
## The following object is masked from 'package:lattice':
##
## qq
library(stargazer) ## presenting regression results
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(jtools) ## presenting and visualizing regression results
##
## Attaching package: 'jtools'
## The following object is masked from 'package:Hmisc':
##
## %nin%
library(effects)
## Loading required package: carData
## Use the command
## lattice::trellis.par.set(effectsTheme())
## to customize lattice options for effects plots.
## See ?effectsTheme for details.
library(huxtable)
##
## Attaching package: 'huxtable'
## The following objects are masked from 'package:Hmisc':
##
## contents, label, label<-
## The following object is masked from 'package:scales':
##
## number_format
## The following object is masked from 'package:dplyr':
##
## add_rownames
## The following object is masked from 'package:ggplot2':
##
## theme_grey
library(Metrics)
library(ggstance)
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
library(openxlsx)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'lmtest'
## The following object is masked from 'package:VGAM':
##
## lrtest
##Read xlsx of cocacolasales.
cocacolasales <- read.xlsx("/Users/yamileth/Downloads/cocacolasales.xlsx")
#Exploratory data analysis (descriptive statistics and data visualization)
Now I will carry out descriptive statistics and data visualization. Both are very important. Descriptive statistics make the data easier to grasp: they make it possible to present the data in a meaningful and understandable way, which in turn leads to a simpler interpretation of the data set in question.
##Display the structure of the dataset
str(cocacolasales)
## 'data.frame': 48 obs. of 15 variables:
## $ tperiod : num 44211 44242 44270 44301 44331 ...
## $ sales_unitboxes : num 5516689 5387496 5886747 6389182 6448275 ...
## $ consumer_sentiment: num 38.1 37.5 38.5 37.8 38 ...
## $ CPI : num 87.1 87.3 87.6 87.4 87 ...
## $ inflation_rate : num -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
## $ unemp_rate : num 0.0523 0.0531 0.0461 0.051 0.0552 ...
## $ gdp_percapita : num 11660 11660 11660 11626 11626 ...
## $ itaee : num 104 104 104 108 108 ...
## $ itaee_growth : num 0.0497 0.0497 0.0497 0.0318 0.0318 ...
## $ pop_density : num 98.5 98.5 98.5 98.8 98.8 ...
## $ job_density : num 18.3 18.5 18.6 18.7 18.7 ...
## $ pop_minwage : num 9.66 9.66 9.66 9.59 9.59 ...
## $ exchange_rate : num 14.7 14.9 15.2 15.2 15.3 ...
## $ max_temperature : num 28 31 29 32 34 32 29 29 29 29 ...
## $ holiday_month : num 0 0 0 1 0 0 0 0 1 0 ...
The function str() displays the metadata of an object, in this case cocacolasales. I can now see the type of every variable.
#View the names of the columns of the data frame cocacolasales.
names(cocacolasales)
## [1] "tperiod" "sales_unitboxes" "consumer_sentiment"
## [4] "CPI" "inflation_rate" "unemp_rate"
## [7] "gdp_percapita" "itaee" "itaee_growth"
## [10] "pop_density" "job_density" "pop_minwage"
## [13] "exchange_rate" "max_temperature" "holiday_month"
The function names() displays the name of every column of the data frame cocacolasales, which helps me see the variables more clearly.
##Convert the holiday_month dummy (coded 0/1) to a factor
cocacolasales$holiday_month<-as.factor(cocacolasales$holiday_month)
##Data visualization
freq(cocacolasales) #Frequency of holiday_month
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
| holiday_month | frequency | percentage | cumulative_perc |
|---|---|---|---|
| 0 | 36 | 75 | 75 |
| 1 | 12 | 25 | 100 |
summary(cocacolasales) #Descriptive statistics
## tperiod sales_unitboxes consumer_sentiment CPI
## Min. :44211 Min. :5301755 Min. :28.67 Min. : 86.97
## 1st Qu.:44294 1st Qu.:6171767 1st Qu.:35.64 1st Qu.: 89.18
## Median :44378 Median :6461357 Median :36.76 Median : 92.82
## Mean :44379 Mean :6473691 Mean :37.15 Mean : 93.40
## 3rd Qu.:44464 3rd Qu.:6819782 3rd Qu.:38.14 3rd Qu.: 98.40
## Max. :44548 Max. :7963063 Max. :44.87 Max. :103.02
## inflation_rate unemp_rate gdp_percapita itaee
## Min. :-0.5000 Min. :0.03466 Min. :11559 Min. :103.8
## 1st Qu.: 0.1650 1st Qu.:0.04010 1st Qu.:11830 1st Qu.:111.5
## Median : 0.3850 Median :0.04369 Median :12014 Median :113.5
## Mean : 0.3485 Mean :0.04442 Mean :11979 Mean :113.9
## 3rd Qu.: 0.5575 3rd Qu.:0.04897 3rd Qu.:12162 3rd Qu.:117.1
## Max. : 1.7000 Max. :0.05517 Max. :12329 Max. :122.5
## itaee_growth pop_density job_density pop_minwage
## Min. :0.005571 Min. : 98.54 Min. :18.26 Min. : 9.398
## 1st Qu.:0.022376 1st Qu.: 99.61 1st Qu.:19.28 1st Qu.:10.794
## Median :0.029977 Median :100.67 Median :20.39 Median :11.139
## Mean :0.031736 Mean :100.65 Mean :20.38 Mean :11.116
## 3rd Qu.:0.043038 3rd Qu.:101.69 3rd Qu.:21.60 3rd Qu.:11.413
## Max. :0.056536 Max. :102.69 Max. :22.36 Max. :13.026
## exchange_rate max_temperature holiday_month
## Min. :14.69 Min. :26.00 0:36
## 1st Qu.:17.38 1st Qu.:29.00 1:12
## Median :18.62 Median :30.00
## Mean :18.18 Mean :30.50
## 3rd Qu.:19.06 3rd Qu.:32.25
## Max. :21.39 Max. :37.00
##Box plots display numerical data by quartiles, showing how the data are spread.
boxplot(sales_unitboxes~itaee, data = cocacolasales)
boxplot(sales_unitboxes~exchange_rate, data = cocacolasales)
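The two plots above draw one box per distinct value because itaee and exchange_rate are continuous. A short sketch of alternatives, assuming the same data frame: a box plot by the factor holiday_month and scatter plots for the continuous predictors.
# holiday_month is a factor, so one box per level is informative
boxplot(sales_unitboxes ~ holiday_month, data = cocacolasales)
# For continuous predictors, scatter plots show the relationship with sales more directly
plot(cocacolasales$itaee, cocacolasales$sales_unitboxes, xlab = "itaee", ylab = "sales_unitboxes")
plot(cocacolasales$exchange_rate, cocacolasales$sales_unitboxes, xlab = "exchange_rate", ylab = "sales_unitboxes")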
#Estimation method: Ordinary Least Squares
Finally, to find the best regression model for this analysis, I will use Ordinary Least Squares (OLS).
Now I will estimate five regression model specifications. In each of them I estimate the linear relationship between the dependent variable (sales_unitboxes) and one or more independent variables.
modelcc1<-lm(sales_unitboxes~inflation_rate+exchange_rate+unemp_rate+itaee+consumer_sentiment+pop_density+max_temperature+holiday_month,data = cocacolasales)
summary(modelcc1)
##
## Call:
## lm(formula = sales_unitboxes ~ inflation_rate + exchange_rate +
## unemp_rate + itaee + consumer_sentiment + pop_density + max_temperature +
## holiday_month, data = cocacolasales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -570644 -269597 57824 213039 825593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6415123 11552094 0.555 0.58185
## inflation_rate -117838 235742 -0.500 0.61998
## exchange_rate -1625 75303 -0.022 0.98289
## unemp_rate 10086544 17634993 0.572 0.57063
## itaee 114699 33449 3.429 0.00144 **
## consumer_sentiment 52527 25786 2.037 0.04847 *
## pop_density -204957 140944 -1.454 0.15390
## max_temperature 172956 31777 5.443 3.07e-06 ***
## holiday_month1 96311 130682 0.737 0.46554
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 377400 on 39 degrees of freedom
## Multiple R-squared: 0.6703, Adjusted R-squared: 0.6027
## F-statistic: 9.91 on 8 and 39 DF, p-value: 2.149e-07
AIC(modelcc1) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results.
## [1] 1378.997
predictions1= modelcc1 %>% predict(cocacolasales)
rmse(predictions1, cocacolasales$sales_unitboxes) #340,196
## [1] 340195.7
VIF(modelcc1)
## inflation_rate exchange_rate unemp_rate itaee
## 2.783449 4.826452 3.548693 8.340600
## consumer_sentiment pop_density max_temperature holiday_month
## 1.831380 10.906222 2.311084 1.079047
bptest(modelcc1)
##
## studentized Breusch-Pagan test
##
## data: modelcc1
## BP = 3.795, df = 8, p-value = 0.8751
stargazer(modelcc1,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
##
## Regression Results
## ===================================================
## Dependent variable:
## -------------------------------
## sales_unitboxes
## ---------------------------------------------------
## inflation_rate -117,838.100 (235,741.600)
## exchange_rate -1,625.119 (75,303.100)
## unemp_rate 10,086,544.000 (17,634,993.000)
## itaee 114,699.300*** (33,448.620)
## consumer_sentiment 52,526.900** (25,785.990)
## pop_density -204,957.200 (140,944.300)
## max_temperature 172,956.500*** (31,777.240)
## holiday_month1 96,310.840 (130,682.100)
## Constant 6,415,123.000 (11,552,094.000)
## ---------------------------------------------------
## Observations 48
## R2 0.670
## Adjusted R2 0.603
## Residual Std. Error 377,413.200 (df = 39)
## F Statistic 9.910*** (df = 8; 39)
## ===================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In model 1, the AIC is high (1,379), which, compared with the models below, indicates a poor-quality fit. Also, the median residual is 57,824, and it should be as close to 0 as possible. The residual standard error is very high, which means the model is doing a bad job of predicting the Y values. The VIF of pop_density (about 10.9) suggests some multicollinearity, while the Breusch-Pagan test (p = 0.88) does not reject homoscedasticity.
modelcc2 = lm(log(sales_unitboxes)~inflation_rate+log(exchange_rate)+log(unemp_rate)+log(itaee)+log(consumer_sentiment)+log(pop_density)+max_temperature+holiday_month, data = cocacolasales)
AIC(modelcc2) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results.
## [1] -127.1754
predictions2= modelcc2 %>% predict(cocacolasales)
rmse(predictions2, cocacolasales$sales_unitboxes) #6,500,729
## [1] 6500729
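Note that predictions2 is on the log scale while sales_unitboxes is measured in unit boxes, which is why this RMSE is so large. A minimal sketch, assuming the objects above, of computing the RMSE on the original scale:
# Back-transform the log-scale predictions before comparing them with sales in unit boxes
rmse(cocacolasales$sales_unitboxes, exp(predictions2))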
VIF(modelcc2)
## inflation_rate log(exchange_rate) log(unemp_rate)
## 2.756557 5.041757 3.481691
## log(itaee) log(consumer_sentiment) log(pop_density)
## 8.338746 1.887820 10.688383
## max_temperature holiday_month
## 2.292941 1.082271
bptest(modelcc2)
##
## studentized Breusch-Pagan test
##
## data: modelcc2
## BP = 4.3684, df = 8, p-value = 0.8224
stargazer(modelcc2,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
##
## Regression Results
## ===================================================
## Dependent variable:
## ---------------------------
## log(sales_unitboxes)
## ---------------------------------------------------
## inflation_rate -0.013 (0.036)
## log(exchange_rate) -0.032 (0.208)
## log(unemp_rate) 0.071 (0.120)
## log(itaee) 2.190*** (0.581)
## log(consumer_sentiment) 0.313** (0.149)
## log(pop_density) -3.607 (2.156)
## max_temperature 0.027*** (0.005)
## holiday_month1 0.019 (0.020)
## Constant 20.311** (7.649)
## ---------------------------------------------------
## Observations 48
## R2 0.678
## Adjusted R2 0.612
## Residual Std. Error 0.058 (df = 39)
## F Statistic 10.275*** (df = 8; 39)
## ===================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In model 2, the median residual is as close to 0 as possible, which is good. The residual standard error is very low, which means the model is doing a good job, although part of the drop simply reflects that the dependent variable is now in logs. The AIC is low, which indicates a good-quality fit; keep in mind that AIC values are only directly comparable between models that share the same dependent variable, so the log models should mainly be compared with each other.
modelcc3 = lm(log(sales_unitboxes)~lag(log(sales_unitboxes))+inflation_rate+log(exchange_rate)+log(unemp_rate)+log(itaee)+log(consumer_sentiment)+log(pop_density)+max_temperature+holiday_month, data = cocacolasales)
AIC(modelcc3) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results.
## [1] -126.2114
predictions3= modelcc3 %>% predict(cocacolasales)
rmse(predictions3, cocacolasales$sales_unitboxes) #NA
## [1] NA
VIF(modelcc3)
## lag(log(sales_unitboxes)) inflation_rate log(exchange_rate)
## 1.634605 2.755881 4.541737
## log(unemp_rate) log(itaee) log(consumer_sentiment)
## 3.773070 8.516018 1.918313
## log(pop_density) max_temperature holiday_month
## 10.511782 2.639880 1.120721
bptest(modelcc3)
##
## studentized Breusch-Pagan test
##
## data: modelcc3
## BP = 8.2182, df = 9, p-value = 0.5123
stargazer(modelcc3,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
##
## Regression Results
## =====================================================
## Dependent variable:
## ---------------------------
## log(sales_unitboxes)
## -----------------------------------------------------
## lag(log(sales_unitboxes)) 0.232** (0.113)
## inflation_rate -0.011 (0.036)
## log(exchange_rate) -0.080 (0.204)
## log(unemp_rate) -0.015 (0.124)
## log(itaee) 1.933*** (0.603)
## log(consumer_sentiment) 0.276* (0.147)
## log(pop_density) -3.847* (2.143)
## max_temperature 0.024*** (0.005)
## holiday_month1 0.026 (0.020)
## Constant 19.077** (7.558)
## -----------------------------------------------------
## Observations 47
## R2 0.693
## Adjusted R2 0.618
## Residual Std. Error 0.056 (df = 37)
## F Statistic 9.260*** (df = 9; 37)
## =====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In model 3, the residual standard error is very low, which means the model is doing a good job. The estimates are small for most of the variables. The minimum and maximum residuals have roughly the same magnitude, which is good, and the median residual is as close to 0 as possible. The AIC is very low, which indicates a good-quality fit.
modelcc4 = lm(log(sales_unitboxes)~lag(log(sales_unitboxes))+inflation_rate+log(exchange_rate)+log(unemp_rate)+log(itaee)+log(consumer_sentiment)+log(pop_density)+max_temperature+I(max_temperature^2)+holiday_month, data = cocacolasales)
AIC(modelcc4) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results.
## [1] -129.1158
predictions4= modelcc4 %>% predict(cocacolasales)
rmse(predictions4, cocacolasales$sales_unitboxes) #NA
## [1] NA
VIF(modelcc4)
## lag(log(sales_unitboxes)) inflation_rate log(exchange_rate)
## 1.653994 2.946641 4.786665
## log(unemp_rate) log(itaee) log(consumer_sentiment)
## 3.812752 8.549272 1.979111
## log(pop_density) max_temperature I(max_temperature^2)
## 10.512274 550.950061 560.052580
## holiday_month
## 1.133199
bptest(modelcc4)
##
## studentized Breusch-Pagan test
##
## data: modelcc4
## BP = 10.267, df = 10, p-value = 0.4174
stargazer(modelcc4,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
##
## Regression Results
## =====================================================
## Dependent variable:
## ---------------------------
## log(sales_unitboxes)
## -----------------------------------------------------
## lag(log(sales_unitboxes)) 0.208* (0.109)
## inflation_rate 0.007 (0.035)
## log(exchange_rate) -0.171 (0.202)
## log(unemp_rate) 0.010 (0.120)
## log(itaee) 2.005*** (0.582)
## log(consumer_sentiment) 0.226 (0.143)
## log(pop_density) -3.875* (2.062)
## max_temperature -0.117 (0.071)
## I(max_temperature2) 0.002* (0.001)
## holiday_month1 0.022 (0.019)
## Constant 21.918*** (7.412)
## -----------------------------------------------------
## Observations 47
## R2 0.723
## Adjusted R2 0.646
## Residual Std. Error 0.054 (df = 36)
## F Statistic 9.397*** (df = 10; 36)
## =====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In model 4, the residual standard error is very low, which means the model is doing a good job. The estimates are small for most of the variables. The minimum and maximum residuals have roughly the same magnitude, which is good, and the median residual is as close to 0 as possible. The AIC is the lowest so far, which indicates a good-quality fit. The very high VIFs for max_temperature and its square are expected when a variable and its square enter the same model, so they are not a serious multicollinearity concern.
modelcc5 = lm(log(sales_unitboxes)~lag(log(sales_unitboxes))+inflation_rate+log(exchange_rate)+log(unemp_rate)+log(itaee)+log(consumer_sentiment)+log(pop_density)+I(max_temperature^2)+holiday_month, data = cocacolasales)
summary(modelcc5)
##
## Call:
## lm(formula = log(sales_unitboxes) ~ lag(log(sales_unitboxes)) +
## inflation_rate + log(exchange_rate) + log(unemp_rate) + log(itaee) +
## log(consumer_sentiment) + log(pop_density) + I(max_temperature^2) +
## holiday_month, data = cocacolasales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.078985 -0.041619 0.006513 0.034603 0.114113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.7521455 7.4585854 2.648 0.01182 *
## lag(log(sales_unitboxes)) 0.2244024 0.1113765 2.015 0.05123 .
## inflation_rate -0.0050333 0.0353375 -0.142 0.88751
## log(exchange_rate) -0.1018981 0.2017748 -0.505 0.61655
## log(unemp_rate) -0.0105573 0.1222770 -0.086 0.93166
## log(itaee) 1.9707000 0.5943501 3.316 0.00206 **
## log(consumer_sentiment) 0.2716689 0.1437096 1.890 0.06656 .
## log(pop_density) -3.9082666 2.1084995 -1.854 0.07179 .
## I(max_temperature^2) 0.0004043 0.0000813 4.972 1.54e-05 ***
## holiday_month1 0.0260390 0.0196142 1.328 0.19246
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05546 on 37 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7023, Adjusted R-squared: 0.6299
## F-statistic: 9.697 on 9 and 37 DF, p-value: 2.014e-07
AIC(modelcc5) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results.
## [1] -127.7222
predictions5= modelcc5 %>% predict(cocacolasales)
rmse(predictions5, cocacolasales$sales_unitboxes) #NA
## [1] NA
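The RMSE is NA because the lagged term leaves the first month without a prediction, and the predictions are again on the log scale. A minimal sketch, assuming the objects above, that drops the incomplete observation and back-transforms to unit boxes:
# predictions5[1] is NA because lag() has no value for the first month
ok <- !is.na(predictions5)
rmse(cocacolasales$sales_unitboxes[ok], exp(predictions5[ok]))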
VIF(modelcc5)
## lag(log(sales_unitboxes)) inflation_rate log(exchange_rate)
## 1.640331 2.816760 4.578820
## log(unemp_rate) log(itaee) log(consumer_sentiment)
## 3.772462 8.537993 1.905754
## log(pop_density) I(max_temperature^2) holiday_month
## 10.511280 2.683495 1.117767
bptest(modelcc5)
##
## studentized Breusch-Pagan test
##
## data: modelcc5
## BP = 8.4037, df = 9, p-value = 0.494
stargazer(modelcc5,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
##
## Regression Results
## =====================================================
## Dependent variable:
## ---------------------------
## log(sales_unitboxes)
## -----------------------------------------------------
## lag(log(sales_unitboxes)) 0.224* (0.111)
## inflation_rate -0.005 (0.035)
## log(exchange_rate) -0.102 (0.202)
## log(unemp_rate) -0.011 (0.122)
## log(itaee) 1.971*** (0.594)
## log(consumer_sentiment) 0.272* (0.144)
## log(pop_density) -3.908* (2.108)
## I(max_temperature2) 0.0004*** (0.0001)
## holiday_month1 0.026 (0.020)
## Constant 19.752** (7.459)
## -----------------------------------------------------
## Observations 47
## R2 0.702
## Adjusted R2 0.630
## Residual Std. Error 0.055 (df = 37)
## F Statistic 9.697*** (df = 9; 37)
## =====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
For a more complete analysis, I estimated one more regression model. In model 5, the residual standard error is very low, which means the model is doing a good job. The estimates are small for most of the variables. The minimum and maximum residuals have roughly the same magnitude, which is good, and the median residual is as close to 0 as possible. The AIC is low, which indicates a good-quality fit, although it is slightly higher than that of model 4.
Selecting the regression model that best fits the data (AIC, RMSE) and model diagnostics (multicollinearity and heteroscedasticity)
regression_model <- c('modelcc1','modelcc2','modelcc3','modelcc4','modelcc5')
RMSE <- c(340196,6500729,NA,NA,NA)
AIC <- c(1379,-127.18,-126.21,-129.115,-127.722)
Adjusted_R2 <- c(0.603, 0.612, 0.618, 0.646, 0.630)
selection_criteria <- data.frame(regression_model,RMSE,AIC,Adjusted_R2)
selection_criteria #Regression model 4 is selected because it has the lowest AIC, the highest adjusted R2, and no serious multicollinearity problem.
| regression_model | RMSE | AIC | Adjusted_R2 |
|---|---|---|---|
| modelcc1 | 3.4e+05 | 1.38e+03 | 0.603 |
| modelcc2 | 6.5e+06 | -127 | 0.612 |
| modelcc3 | NA | -126 | 0.618 |
| modelcc4 | NA | -129 | 0.646 |
| modelcc5 | NA | -128 | 0.63 |
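The same comparison can also be built directly from the fitted objects instead of typing the values by hand. A sketch, assuming the five models above, where the RMSE is computed in unit boxes (log models back-transformed, NA predictions dropped) and stored in a new, hypothetical object selection_criteria_auto:
models <- list(modelcc1 = modelcc1, modelcc2 = modelcc2, modelcc3 = modelcc3, modelcc4 = modelcc4, modelcc5 = modelcc5)
# RMSE in unit boxes: exponentiate predictions from log models and drop NA rows caused by lag()
rmse_boxes <- function(m) {
  p <- predict(m, cocacolasales)
  if (grepl("^log", deparse(formula(m)[[2]]))) p <- exp(p)
  ok <- !is.na(p)
  rmse(cocacolasales$sales_unitboxes[ok], p[ok])
}
selection_criteria_auto <- data.frame(
  regression_model = names(models),
  RMSE = sapply(models, rmse_boxes),
  AIC = sapply(models, AIC),
  Adjusted_R2 = sapply(models, function(m) summary(m)$adj.r.squared)
)
selection_criteria_auto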
Finally, after estimating the five regression models, I can choose only one. The model I choose is MODEL 4, because it has the lowest AIC, the highest adjusted R2, and no serious multicollinearity problem. As noted above, the Akaike Information Criterion (AIC) provides a means for model selection: the lower the AIC, the better the quality of the regression results, and model 4 best meets that criterion. In conclusion, Coca-Cola FEMSA is a company with a presence all over the world, but it is important that it give continuity to its success. For that reason, I believe it is very important for the company to carry out this type of analysis and to consider the variables on which sales depend; for example, one variable that is very important for the company's sales is itaee, the Indicator of State Economic Activity.
Coca-Cola FEMSA. (2019). Business key facts and figures. Retrieved August 25, 2021, from https://coca-colafemsa.com/en/about/we-are-coca-cola-femsa/business-key-facts-and-figures/
Heij, C., de Boer, P., Franses, P. H., Kloek, T., & van Dijk, H. K. (2004). Econometric Methods with Applications in Business and Economics. Oxford University Press.
Gujarati, D. N., & Porter, D. C. (2008). Basic Econometrics (5th ed.). McGraw-Hill Education.
dbSeer. (2019, August 2). Data science 101: How to use linear regression as your predictive model. https://dbseer.com/data-science-101-how-to-use-linear-regression-as-your-predictive-model/
plot(effect("lag(log(sales_unitboxes))",modelcc4),main="Predicted Sales according to the lag(log(sales_unitboxes))", xlab="lag(log(sales_unitboxes))", ylab="log(sales_unitboxes)")
## Warning in effect.llines(x[good], y[good], lwd = lwd, col = colors[1], lty =
## lines, : spline interpolation may be unstable with only 4 points
## Warning in panel.bands(x[good], y[good], upper[good], lower[good], fill =
## band.colors[1], : spline interpolation may be unstable with only 4 points
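Since the stated objective is a sales forecast in unit boxes, the sketch below shows one way the selected model (modelcc4) could produce a one-month-ahead point forecast. Every regressor value assigned here is a purely illustrative assumption, not data; the last observed row is kept only so that the lag(log(sales_unitboxes)) term can be evaluated for the new month.
# Hypothetical scenario for the next month (all assigned values are illustrative assumptions)
last_obs <- tail(cocacolasales, 1)            # supplies the lagged sales term for the new month
new_month <- last_obs
new_month$inflation_rate     <- 0.40
new_month$exchange_rate      <- 19.5
new_month$unemp_rate         <- 0.045
new_month$itaee              <- 120
new_month$consumer_sentiment <- 40
new_month$pop_density        <- 102.7
new_month$max_temperature    <- 27
new_month$holiday_month      <- factor("1", levels = levels(cocacolasales$holiday_month))
scenario <- rbind(last_obs, new_month)        # row 1 only feeds lag(log(sales_unitboxes)) for row 2
log_forecast <- predict(modelcc4, newdata = scenario)[2]
exp(log_forecast)                             # point forecast in unit boxes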