EVIDENCE 1 - Predictive Model Report

INTRODUCTION

Estimating a linear regression is one of the most common applications of econometrics. A linear regression analysis consists of estimating the linear relationship between one dependent variable and one or more independent variables.

The three main purposes of a linear regression analysis are:
• To analyze whether the independent variables explain the change in the dependent variable.
• To identify which independent variables are significant predictors of the change in the dependent variable.
• To determine the magnitude and sign of these significant predictors.

For these reasons, a linear regression analysis is useful for predicting an outcome, and such predictions are very valuable for the success of an enterprise.
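
To make the idea concrete, the following is a minimal sketch in R with simulated data; the variable names (x1, x2) and the object fit are purely illustrative and are not part of the Coca-Cola data set.

# Minimal illustration of a linear regression on simulated data
set.seed(1)
n  <- 100
x1 <- rnorm(n)                              # first explanatory variable
x2 <- rnorm(n)                              # second explanatory variable
y  <- 2 + 1.5*x1 - 0.8*x2 + rnorm(n)        # y = b0 + b1*x1 + b2*x2 + error
fit <- lm(y ~ x1 + x2)                      # estimate the linear relationship
summary(fit)                                # significance, sign, and magnitude of each predictor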

BACKGROUND

Coca-Cola FEMSA is the largest bottler in the Coca-Cola system by sales volume. With a portfolio of 129 leading multi-category brands, it serves more than 261 million people and 2 million points of sale through 49 bottling plants and 268 distribution centers in 10 Latin American countries.

Regarding the finances of Coca-Cola FEMSA, its 2020 results were significantly impacted by COVID-19 and the related changes in consumer mobility and behavior across markets. When assessing the 2021 results, the 2020 comparison base is only a partial benchmark. Therefore, to facilitate the reader’s assessment of each business unit’s performance in 2021, the company provides a table that includes variations versus 2019 as well. (FEMSA, 2021)

DESCRIPTION OF THE PROBLEM SITUATION

PHASE A: Have you ever wondered how many cases of soft drinks Coca-Cola produces to supply the demand of the Guadalajara Metropolitan Area?

Considering the information generated over time at Coca-Cola (for example: sales, production, inventories, macroeconomic variables, etc.) and the factors that influence its behavior, we intend to build a model that forecasts the target variable and thus contributes to strategic decision making.

Arca Continental Coca-Cola is the second largest Coca-Cola bottling company in Latin America and one of the most important in the world. It serves more than 118 million consumers in Mexico, Argentina, Ecuador, the United States, and Peru. It has a broad and innovative product portfolio that has managed to satisfy the needs of its consumers at competitive prices. The objective of this situation is to learn everything involved in building a predictive model for the data of this large company, specifically for the Guadalajara Metropolitan Area.

Prior to the construction of the model, a documentary immersion into the company’s data and a review of the various forecasting methods must be carried out. The final objective is to deliver a sales forecast in unit boxes at an aggregate level for the Guadalajara Metropolitan Area and for the main brands.

DATA AND METHODOLOGY

#DEPENDENT AND INDEPENDENT VARIABLES
#Dependent variable: “sales_unitboxes”
#Independent variables: “exchange_rate”, “gdp_percapita”, “itaee”, “holiday_month”.

I chose sales_unitboxes as the dependent variable because, for Coca-Cola FEMSA, it is the most important variable: sales are the company’s reason for being and the source of its income. I chose the independent variables exchange_rate, gdp_percapita, itaee, and holiday_month. I chose exchange_rate because I think it is an indicator of a country’s competitiveness with the rest of the world, which is very important for Coca-Cola’s sales. I also chose gdp_percapita (gross domestic product per capita) because the company’s sales depend on the income available to the population. Likewise, I chose itaee, which stands for Indicator of State Economic Activity, because sales depend on this indicator. Finally, I included holiday_month, a dummy variable equal to 1 when a month includes a holiday week (public holidays, Easter, or Christmas); sales depend on this variable because there are more sales in holiday months.

#Data sources and study period (2015-2018, monthly basis)
The data source is an Excel file of Coca-Cola sales (cocacolasales.xlsx), which I load below, covering the study period 2015-2018 on a monthly basis.

#Installation of packages before loading the libraries.
#install.packages("tidyverse")
#install.packages("lubridate")
#install.packages("scales")
#install.packages("ggplot2")
#install.packages("hrbrthemes")
#install.packages("extrafont")
#install.packages("pastecs")
#install.packages("dbplyr")
#install.packages("funModeling")
#install.packages("Hmsic")
#install.packages("regclass")
#install.packages("stargazer")
#install.packages("jtools")
#install.packages("effects")
#install.packages("huxtable")
#install.packages("Metrics")
#install.packages("ggstance")
#install.packages("lmtest")
#Loading libraries 
library(foreign)
library(stats)
library(tidyverse)  ## A set of tools for Data manipulation and visualization
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)  ## for date time manipulation
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(scales)     ## Formatting numbers and values
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(ggplot2)    ## Graphs
library(graphics)   ## Graphs 
library(hrbrthemes) ## For changing ggplot theme
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(extrafont)  ## More font options
## Registering fonts with R
library(pastecs)
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## The following object is masked from 'package:tidyr':
## 
##     extract
library(dbplyr)
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
library(readr)
library(funModeling) ## data visualization 
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
##  / Now in Spanish: librovivodecienciadedatos.ai
library(Hmisc)
library(regclass) 
## Loading required package: bestglm
## Loading required package: leaps
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following object is masked from 'package:tidyr':
## 
##     fill
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
## 
## Attaching package: 'regclass'
## The following object is masked from 'package:lattice':
## 
##     qq
library(stargazer)   ## presenting regression results 
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(jtools)      ## presenting and visualizing regression results
## 
## Attaching package: 'jtools'
## The following object is masked from 'package:Hmisc':
## 
##     %nin%
library(effects)
## Loading required package: carData
## Use the command
##     lattice::trellis.par.set(effectsTheme())
##   to customize lattice options for effects plots.
## See ?effectsTheme for details.
library(huxtable)
## 
## Attaching package: 'huxtable'
## The following objects are masked from 'package:Hmisc':
## 
##     contents, label, label<-
## The following object is masked from 'package:scales':
## 
##     number_format
## The following object is masked from 'package:dplyr':
## 
##     add_rownames
## The following object is masked from 'package:ggplot2':
## 
##     theme_grey
library(Metrics)
library(ggstance)
## 
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh
library(openxlsx)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'lmtest'
## The following object is masked from 'package:VGAM':
## 
##     lrtest
##Read xlsx of cocacolasales.
cocacolasales <- read.xlsx("/Users/yamileth/Downloads/cocacolasales.xlsx")

#Exploratory data analysis (descriptive statistics and data visualization)
Now I am going to perform descriptive statistics and data visualization. Both are very important: descriptive statistics summarize the data and make it possible to present them in a meaningful and understandable way, which in turn simplifies the interpretation of the data set in question.
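
As a possible complement to summary(), the pastecs package loaded above offers a compact numerical summary; a minimal sketch, limited to a few of the variables used later (the argument values shown are the standard stat.desc options):

# Optional complementary summary with pastecs (already loaded above)
# basic = FALSE drops the NA/zero counts; norm = TRUE adds normality statistics
round(stat.desc(cocacolasales[, c("sales_unitboxes", "exchange_rate", "gdp_percapita", "itaee")],
                basic = FALSE, norm = TRUE), 3)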

##Display the structure of the dataset
str(cocacolasales)  
## 'data.frame':    48 obs. of  15 variables:
##  $ tperiod           : num  44211 44242 44270 44301 44331 ...
##  $ sales_unitboxes   : num  5516689 5387496 5886747 6389182 6448275 ...
##  $ consumer_sentiment: num  38.1 37.5 38.5 37.8 38 ...
##  $ CPI               : num  87.1 87.3 87.6 87.4 87 ...
##  $ inflation_rate    : num  -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
##  $ unemp_rate        : num  0.0523 0.0531 0.0461 0.051 0.0552 ...
##  $ gdp_percapita     : num  11660 11660 11660 11626 11626 ...
##  $ itaee             : num  104 104 104 108 108 ...
##  $ itaee_growth      : num  0.0497 0.0497 0.0497 0.0318 0.0318 ...
##  $ pop_density       : num  98.5 98.5 98.5 98.8 98.8 ...
##  $ job_density       : num  18.3 18.5 18.6 18.7 18.7 ...
##  $ pop_minwage       : num  9.66 9.66 9.66 9.59 9.59 ...
##  $ exchange_rate     : num  14.7 14.9 15.2 15.2 15.3 ...
##  $ max_temperature   : num  28 31 29 32 34 32 29 29 29 29 ...
##  $ holiday_month     : num  0 0 0 1 0 0 0 0 1 0 ...

The function str() displays the metadata of the object, in this case cocacolasales. It shows the type of every variable.
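
The str() output also suggests that tperiod is stored as a numeric Excel serial date (values around 44211). If a proper date column were needed for plotting or time-aware operations, a minimal sketch, assuming the file uses Excel’s standard 1900 date system (tperiod_date is an illustrative name):

# Convert the Excel serial number in tperiod to an R Date (assumes the 1900 date system)
cocacolasales$tperiod_date <- as.Date(cocacolasales$tperiod, origin = "1899-12-30")
# openxlsx provides the same conversion: openxlsx::convertToDate(cocacolasales$tperiod)
head(cocacolasales$tperiod_date)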

#View the names of the columns of the data frame cocacolasales.
names(cocacolasales)
##  [1] "tperiod"            "sales_unitboxes"    "consumer_sentiment"
##  [4] "CPI"                "inflation_rate"     "unemp_rate"        
##  [7] "gdp_percapita"      "itaee"              "itaee_growth"      
## [10] "pop_density"        "job_density"        "pop_minwage"       
## [13] "exchange_rate"      "max_temperature"    "holiday_month"

The function names() displays the name of every column of the data frame cocacolasales, which makes it easier to reference the variables.

##Convert characters and integers to factors
cocacolasales$holiday_month<-as.factor(cocacolasales$holiday_month) 
##Data visualization 
freq(cocacolasales)  #Frequency of holiday_month
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

holiday_month   frequency   percentage   cumulative_perc
0               36          75           75
1               12          25           100

summary(cocacolasales) #Descriptive statistics 
##     tperiod      sales_unitboxes   consumer_sentiment      CPI        
##  Min.   :44211   Min.   :5301755   Min.   :28.67      Min.   : 86.97  
##  1st Qu.:44294   1st Qu.:6171767   1st Qu.:35.64      1st Qu.: 89.18  
##  Median :44378   Median :6461357   Median :36.76      Median : 92.82  
##  Mean   :44379   Mean   :6473691   Mean   :37.15      Mean   : 93.40  
##  3rd Qu.:44464   3rd Qu.:6819782   3rd Qu.:38.14      3rd Qu.: 98.40  
##  Max.   :44548   Max.   :7963063   Max.   :44.87      Max.   :103.02  
##  inflation_rate      unemp_rate      gdp_percapita       itaee      
##  Min.   :-0.5000   Min.   :0.03466   Min.   :11559   Min.   :103.8  
##  1st Qu.: 0.1650   1st Qu.:0.04010   1st Qu.:11830   1st Qu.:111.5  
##  Median : 0.3850   Median :0.04369   Median :12014   Median :113.5  
##  Mean   : 0.3485   Mean   :0.04442   Mean   :11979   Mean   :113.9  
##  3rd Qu.: 0.5575   3rd Qu.:0.04897   3rd Qu.:12162   3rd Qu.:117.1  
##  Max.   : 1.7000   Max.   :0.05517   Max.   :12329   Max.   :122.5  
##   itaee_growth       pop_density      job_density     pop_minwage    
##  Min.   :0.005571   Min.   : 98.54   Min.   :18.26   Min.   : 9.398  
##  1st Qu.:0.022376   1st Qu.: 99.61   1st Qu.:19.28   1st Qu.:10.794  
##  Median :0.029977   Median :100.67   Median :20.39   Median :11.139  
##  Mean   :0.031736   Mean   :100.65   Mean   :20.38   Mean   :11.116  
##  3rd Qu.:0.043038   3rd Qu.:101.69   3rd Qu.:21.60   3rd Qu.:11.413  
##  Max.   :0.056536   Max.   :102.69   Max.   :22.36   Max.   :13.026  
##  exchange_rate   max_temperature holiday_month
##  Min.   :14.69   Min.   :26.00   0:36         
##  1st Qu.:17.38   1st Qu.:29.00   1:12         
##  Median :18.62   Median :30.00                
##  Mean   :18.18   Mean   :30.50                
##  3rd Qu.:19.06   3rd Qu.:32.25                
##  Max.   :21.39   Max.   :37.00
##Box plots display numerical data by quartiles, showing how the data are spread.
boxplot(sales_unitboxes~itaee, data = cocacolasales)

boxplot(sales_unitboxes~exchange_rate, data = cocacolasales)
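
Because itaee and exchange_rate are continuous, a box plot per distinct value can be hard to read; a scatter plot is a possible complement. A minimal sketch with the already-loaded ggplot2 (the aesthetic choices are illustrative):

# Scatter plot of sales against itaee with a linear trend line for reference
ggplot(cocacolasales, aes(x = itaee, y = sales_unitboxes)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "ITAEE", y = "Sales (unit boxes)",
       title = "Sales vs. state economic activity (ITAEE)")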

#Estimation method: Ordinary Least Squares
Finally, to find the best regression model for this analysis, I am going to use Ordinary Least Squares (OLS).
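
For reference, OLS chooses the coefficients that minimize the sum of squared residuals; in matrix form the solution is b = (X'X)^(-1) X'y. A minimal sketch that reproduces lm()’s coefficients by hand, using a small illustrative specification (not one of the five models estimated below):

# Hand-computed OLS coefficients, for intuition only
X <- model.matrix(~ exchange_rate + itaee + holiday_month, data = cocacolasales)
y <- cocacolasales$sales_unitboxes
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^(-1) X'y
beta_hat
# Should match the coefficients from lm():
coef(lm(sales_unitboxes ~ exchange_rate + itaee + holiday_month, data = cocacolasales))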

RESULTS ANALYSIS

Regression model specifications (5 different models)

Now I am going to estimate 5 regression model specifications. In each model I estimate the linear relationship between the dependent variable (sales_unitboxes) and one or more independent variables.

MODEL 1

modelcc1<-lm(sales_unitboxes~inflation_rate+exchange_rate+unemp_rate+itaee+consumer_sentiment+pop_density+max_temperature+holiday_month,data = cocacolasales) 
summary(modelcc1)
## 
## Call:
## lm(formula = sales_unitboxes ~ inflation_rate + exchange_rate + 
##     unemp_rate + itaee + consumer_sentiment + pop_density + max_temperature + 
##     holiday_month, data = cocacolasales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -570644 -269597   57824  213039  825593 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6415123   11552094   0.555  0.58185    
## inflation_rate      -117838     235742  -0.500  0.61998    
## exchange_rate         -1625      75303  -0.022  0.98289    
## unemp_rate         10086544   17634993   0.572  0.57063    
## itaee                114699      33449   3.429  0.00144 ** 
## consumer_sentiment    52527      25786   2.037  0.04847 *  
## pop_density         -204957     140944  -1.454  0.15390    
## max_temperature      172956      31777   5.443 3.07e-06 ***
## holiday_month1        96311     130682   0.737  0.46554    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 377400 on 39 degrees of freedom
## Multiple R-squared:  0.6703, Adjusted R-squared:  0.6027 
## F-statistic:  9.91 on 8 and 39 DF,  p-value: 2.149e-07
AIC(modelcc1) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results. 
## [1] 1378.997
predictions1= modelcc1 %>% predict(cocacolasales)
rmse(predictions1, cocacolasales$sales_unitboxes) #340,196
## [1] 340195.7
VIF(modelcc1)
##     inflation_rate      exchange_rate         unemp_rate              itaee 
##           2.783449           4.826452           3.548693           8.340600 
## consumer_sentiment        pop_density    max_temperature      holiday_month 
##           1.831380          10.906222           2.311084           1.079047
bptest(modelcc1)
## 
##  studentized Breusch-Pagan test
## 
## data:  modelcc1
## BP = 3.795, df = 8, p-value = 0.8751
stargazer(modelcc1,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
## 
## Regression Results
## ===================================================
##                           Dependent variable:      
##                     -------------------------------
##                             sales_unitboxes        
## ---------------------------------------------------
## inflation_rate        -117,838.100 (235,741.600)   
## exchange_rate           -1,625.119 (75,303.100)    
## unemp_rate          10,086,544.000 (17,634,993.000)
## itaee                 114,699.300*** (33,448.620)  
## consumer_sentiment     52,526.900** (25,785.990)   
## pop_density           -204,957.200 (140,944.300)   
## max_temperature       172,956.500*** (31,777.240)  
## holiday_month1         96,310.840 (130,682.100)    
## Constant            6,415,123.000 (11,552,094.000) 
## ---------------------------------------------------
## Observations                      48               
## R2                               0.670             
## Adjusted R2                      0.603             
## Residual Std. Error      377,413.200 (df = 39)     
## F Statistic              9.910*** (df = 8; 39)     
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01

In Model 1, the AIC is high (about 1,379), which indicates poor-quality regression results relative to the other specifications. The median residual is 57,824, and it should be as close to 0 as possible. The residual standard error is very high, which means the model does a poor job of predicting the Y values.
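
As a sanity check on the reported AIC, recall that AIC = 2k - 2*log-likelihood, where k counts the estimated parameters (the coefficients plus the residual variance). A minimal sketch:

# Manual AIC for Model 1: should reproduce AIC(modelcc1), approximately 1379
k <- length(coef(modelcc1)) + 1             # 9 coefficients + the error variance
2 * k - 2 * as.numeric(logLik(modelcc1))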

MODEL 2

modelcc2 = lm(log(sales_unitboxes)~inflation_rate+log(exchange_rate)+log(unemp_rate)+log(itaee)+log(consumer_sentiment)+log(pop_density)+max_temperature+holiday_month, data = cocacolasales)
AIC(modelcc2) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results. 
## [1] -127.1754
predictions2= modelcc2 %>% predict(cocacolasales)
rmse(predictions2, cocacolasales$sales_unitboxes) #6,500,729
## [1] 6500729
VIF(modelcc2)
##          inflation_rate      log(exchange_rate)         log(unemp_rate) 
##                2.756557                5.041757                3.481691 
##              log(itaee) log(consumer_sentiment)        log(pop_density) 
##                8.338746                1.887820               10.688383 
##         max_temperature           holiday_month 
##                2.292941                1.082271
bptest(modelcc2)
## 
##  studentized Breusch-Pagan test
## 
## data:  modelcc2
## BP = 4.3684, df = 8, p-value = 0.8224
stargazer(modelcc2,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
## 
## Regression Results
## ===================================================
##                             Dependent variable:    
##                         ---------------------------
##                            log(sales_unitboxes)    
## ---------------------------------------------------
## inflation_rate                -0.013 (0.036)       
## log(exchange_rate)            -0.032 (0.208)       
## log(unemp_rate)                0.071 (0.120)       
## log(itaee)                   2.190*** (0.581)      
## log(consumer_sentiment)       0.313** (0.149)      
## log(pop_density)              -3.607 (2.156)       
## max_temperature              0.027*** (0.005)      
## holiday_month1                 0.019 (0.020)       
## Constant                     20.311** (7.649)      
## ---------------------------------------------------
## Observations                        48             
## R2                                 0.678           
## Adjusted R2                        0.612           
## Residual Std. Error           0.058 (df = 39)      
## F Statistic               10.275*** (df = 8; 39)   
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01

In Model 2, the median residual is as close to 0 as possible, which is good. The residual standard error is very low, meaning the model fits well. The AIC is low, which indicates good-quality regression results.
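
One caveat: predict() applied to Model 2 returns fitted values on the log scale, so the RMSE of 6,500,729 above compares log predictions with sales measured in unit boxes and is not directly comparable with Model 1. A minimal sketch of an RMSE on the original scale, using a naive exp() back-transformation (it ignores the retransformation bias correction):

# RMSE of Model 2 on the unit-box scale via a naive back-transformation
pred2_level <- exp(predict(modelcc2, cocacolasales))
rmse(pred2_level, cocacolasales$sales_unitboxes)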

MODEL 3

modelcc3 = lm(log(sales_unitboxes)~lag(log(sales_unitboxes))+inflation_rate+log(exchange_rate)+log(unemp_rate)+log(itaee)+log(consumer_sentiment)+log(pop_density)+max_temperature+holiday_month, data = cocacolasales)
AIC(modelcc3) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results. 
## [1] -126.2114
predictions3= modelcc3 %>% predict(cocacolasales)
rmse(predictions3, cocacolasales$sales_unitboxes) #NA
## [1] NA
VIF(modelcc3)
## lag(log(sales_unitboxes))            inflation_rate        log(exchange_rate) 
##                  1.634605                  2.755881                  4.541737 
##           log(unemp_rate)                log(itaee)   log(consumer_sentiment) 
##                  3.773070                  8.516018                  1.918313 
##          log(pop_density)           max_temperature             holiday_month 
##                 10.511782                  2.639880                  1.120721
bptest(modelcc3)
## 
##  studentized Breusch-Pagan test
## 
## data:  modelcc3
## BP = 8.2182, df = 9, p-value = 0.5123
stargazer(modelcc3,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
## 
## Regression Results
## =====================================================
##                               Dependent variable:    
##                           ---------------------------
##                              log(sales_unitboxes)    
## -----------------------------------------------------
## lag(log(sales_unitboxes))       0.232** (0.113)      
## inflation_rate                  -0.011 (0.036)       
## log(exchange_rate)              -0.080 (0.204)       
## log(unemp_rate)                 -0.015 (0.124)       
## log(itaee)                     1.933*** (0.603)      
## log(consumer_sentiment)         0.276* (0.147)       
## log(pop_density)                -3.847* (2.143)      
## max_temperature                0.024*** (0.005)      
## holiday_month1                   0.026 (0.020)       
## Constant                       19.077** (7.558)      
## -----------------------------------------------------
## Observations                          47             
## R2                                   0.693           
## Adjusted R2                          0.618           
## Residual Std. Error             0.056 (df = 37)      
## F Statistic                  9.260*** (df = 9; 37)   
## =====================================================
## Note:                     *p<0.1; **p<0.05; ***p<0.01

In Model 3, the residual standard error is very low, which means the model fits well. The estimates are small for most of the variables. The minimum and maximum residuals have about the same magnitude, which is good. The median residual is close to 0. The AIC is very low, which indicates good-quality regression results.
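
The NA in the RMSE above appears because lag(log(sales_unitboxes)) is undefined for the first observation, so the first prediction is missing. A minimal sketch that scores the model on the complete cases only, again with a naive exp() back-transformation to the unit-box scale:

# Drop the first month, where the lagged regressor (and hence the prediction) is NA
pred3 <- predict(modelcc3, newdata = cocacolasales)
ok <- !is.na(pred3)
rmse(exp(pred3[ok]), cocacolasales$sales_unitboxes[ok])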

MODEL 4

modelcc4 = lm(log(sales_unitboxes)~lag(log(sales_unitboxes))+inflation_rate+log(exchange_rate)+log(unemp_rate)+log(itaee)+log(consumer_sentiment)+log(pop_density)+max_temperature+I(max_temperature^2)+holiday_month, data = cocacolasales)
AIC(modelcc4) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results. 
## [1] -129.1158
predictions4= modelcc4 %>% predict(cocacolasales)
rmse(predictions4, cocacolasales$sales_unitboxes) #NA
## [1] NA
VIF(modelcc4)
## lag(log(sales_unitboxes))            inflation_rate        log(exchange_rate) 
##                  1.653994                  2.946641                  4.786665 
##           log(unemp_rate)                log(itaee)   log(consumer_sentiment) 
##                  3.812752                  8.549272                  1.979111 
##          log(pop_density)           max_temperature      I(max_temperature^2) 
##                 10.512274                550.950061                560.052580 
##             holiday_month 
##                  1.133199
bptest(modelcc4)
## 
##  studentized Breusch-Pagan test
## 
## data:  modelcc4
## BP = 10.267, df = 10, p-value = 0.4174
stargazer(modelcc4,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
## 
## Regression Results
## =====================================================
##                               Dependent variable:    
##                           ---------------------------
##                              log(sales_unitboxes)    
## -----------------------------------------------------
## lag(log(sales_unitboxes))       0.208* (0.109)       
## inflation_rate                   0.007 (0.035)       
## log(exchange_rate)              -0.171 (0.202)       
## log(unemp_rate)                  0.010 (0.120)       
## log(itaee)                     2.005*** (0.582)      
## log(consumer_sentiment)          0.226 (0.143)       
## log(pop_density)                -3.875* (2.062)      
## max_temperature                 -0.117 (0.071)       
## I(max_temperature2)             0.002* (0.001)       
## holiday_month1                   0.022 (0.019)       
## Constant                       21.918*** (7.412)     
## -----------------------------------------------------
## Observations                          47             
## R2                                   0.723           
## Adjusted R2                          0.646           
## Residual Std. Error             0.054 (df = 36)      
## F Statistic                 9.397*** (df = 10; 36)   
## =====================================================
## Note:                     *p<0.1; **p<0.05; ***p<0.01

In Model 4, the residual standard error is very low, which means the model fits well. The estimates are small for most of the variables. The minimum and maximum residuals have about the same magnitude, which is good. The median residual is close to 0. The AIC is very low, which indicates good-quality regression results.
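
The very large VIFs on max_temperature and I(max_temperature^2) are expected when a variable and its square enter the model together; centering the temperature before squaring is one common way to reduce that collinearity without changing the fitted curve or the fit statistics. A minimal sketch (temp_c and modelcc4c are illustrative names, not part of the original analysis):

# Re-fit Model 4 with a centered temperature term to ease the polynomial collinearity
cocacolasales$temp_c <- cocacolasales$max_temperature - mean(cocacolasales$max_temperature)
modelcc4c <- lm(log(sales_unitboxes) ~ lag(log(sales_unitboxes)) + inflation_rate +
                  log(exchange_rate) + log(unemp_rate) + log(itaee) +
                  log(consumer_sentiment) + log(pop_density) +
                  temp_c + I(temp_c^2) + holiday_month, data = cocacolasales)
VIF(modelcc4c)   # the temperature VIFs should drop sharply; R2 and AIC are unchanged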

MODEL 5

modelcc5 = lm(log(sales_unitboxes)~lag(log(sales_unitboxes))+inflation_rate+log(exchange_rate)+log(unemp_rate)+log(itaee)+log(consumer_sentiment)+log(pop_density)+I(max_temperature^2)+holiday_month, data = cocacolasales)
summary(modelcc5)
## 
## Call:
## lm(formula = log(sales_unitboxes) ~ lag(log(sales_unitboxes)) + 
##     inflation_rate + log(exchange_rate) + log(unemp_rate) + log(itaee) + 
##     log(consumer_sentiment) + log(pop_density) + I(max_temperature^2) + 
##     holiday_month, data = cocacolasales)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.078985 -0.041619  0.006513  0.034603  0.114113 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               19.7521455  7.4585854   2.648  0.01182 *  
## lag(log(sales_unitboxes))  0.2244024  0.1113765   2.015  0.05123 .  
## inflation_rate            -0.0050333  0.0353375  -0.142  0.88751    
## log(exchange_rate)        -0.1018981  0.2017748  -0.505  0.61655    
## log(unemp_rate)           -0.0105573  0.1222770  -0.086  0.93166    
## log(itaee)                 1.9707000  0.5943501   3.316  0.00206 ** 
## log(consumer_sentiment)    0.2716689  0.1437096   1.890  0.06656 .  
## log(pop_density)          -3.9082666  2.1084995  -1.854  0.07179 .  
## I(max_temperature^2)       0.0004043  0.0000813   4.972 1.54e-05 ***
## holiday_month1             0.0260390  0.0196142   1.328  0.19246    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05546 on 37 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7023, Adjusted R-squared:  0.6299 
## F-statistic: 9.697 on 9 and 37 DF,  p-value: 2.014e-07
AIC(modelcc5) ### Akaike Information Criterion (AIC) provides a means for model selection. The lower the AIC the better the quality of the regression model results. 
## [1] -127.7222
predictions5= modelcc5 %>% predict(cocacolasales)
rmse(predictions5, cocacolasales$sales_unitboxes) #NA
## [1] NA
VIF(modelcc5)
## lag(log(sales_unitboxes))            inflation_rate        log(exchange_rate) 
##                  1.640331                  2.816760                  4.578820 
##           log(unemp_rate)                log(itaee)   log(consumer_sentiment) 
##                  3.772462                  8.537993                  1.905754 
##          log(pop_density)      I(max_temperature^2)             holiday_month 
##                 10.511280                  2.683495                  1.117767
bptest(modelcc5)
## 
##  studentized Breusch-Pagan test
## 
## data:  modelcc5
## BP = 8.4037, df = 9, p-value = 0.494
stargazer(modelcc5,type="text",title="Regression Results",single.row=TRUE,ci=FALSE,ci.level=0.95)
## 
## Regression Results
## =====================================================
##                               Dependent variable:    
##                           ---------------------------
##                              log(sales_unitboxes)    
## -----------------------------------------------------
## lag(log(sales_unitboxes))       0.224* (0.111)       
## inflation_rate                  -0.005 (0.035)       
## log(exchange_rate)              -0.102 (0.202)       
## log(unemp_rate)                 -0.011 (0.122)       
## log(itaee)                     1.971*** (0.594)      
## log(consumer_sentiment)         0.272* (0.144)       
## log(pop_density)                -3.908* (2.108)      
## I(max_temperature2)           0.0004*** (0.0001)     
## holiday_month1                   0.026 (0.020)       
## Constant                       19.752** (7.459)      
## -----------------------------------------------------
## Observations                          47             
## R2                                   0.702           
## Adjusted R2                          0.630           
## Residual Std. Error             0.055 (df = 37)      
## F Statistic                  9.697*** (df = 9; 37)   
## =====================================================
## Note:                     *p<0.1; **p<0.05; ***p<0.01

For a more complete analysis, I estimated another regression model. In Model 5, the residual standard error is very low, which means the model fits well. The estimates are small for most of the variables. The minimum and maximum residuals have about the same magnitude, which is good. The median residual is close to 0. The AIC is very low, which indicates good-quality regression results.

Selecting the regression model that best fits the data (AIC, RMSE) and model diagnostics (multicollinearity and heteroscedasticity)

regression_model <- c('modelcc1','modelcc2','modelcc3','modelcc4','modelcc5')
RMSE <- c(340196,6500729,NA,NA,NA)
AIC <- c(1379,-127.18,-126.21,-129.115,-127.722)
Adjusted_R2 <- c(0.603, 0.612, 0.618, 0.646, 0.630)
selection_criteria <- data.frame(regression_model,RMSE,AIC,Adjusted_R2)
selection_criteria #Regression model 4 is selected because it has the lowest AIC, the highest adjusted R2, and a low multicollinearity problem.
regression_model   RMSE      AIC        Adjusted_R2
modelcc1           3.4e+05   1.38e+03   0.603
modelcc2           6.5e+06   -127       0.612
modelcc3           NA        -126       0.618
modelcc4           NA        -129       0.646
modelcc5           NA        -128       0.630
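
The same comparison can also be assembled programmatically from the fitted model objects, which avoids manual transcription; a minimal sketch:

# Build the AIC / adjusted R-squared comparison directly from the fitted models
models <- list(modelcc1 = modelcc1, modelcc2 = modelcc2, modelcc3 = modelcc3,
               modelcc4 = modelcc4, modelcc5 = modelcc5)
data.frame(regression_model = names(models),
           AIC = sapply(models, AIC),
           Adjusted_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
           row.names = NULL)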

CONCLUSIONS AND RECOMMENDATIONS

Finally, after estimating the five regression models, I can choose only one. The regression model I choose is MODEL 4, because it has the lowest AIC, the highest adjusted R2, and a low multicollinearity problem. As commented above, AIC stands for Akaike Information Criterion and provides a means for model selection: the lower the AIC, the better the quality of the regression model results, and Model 4 satisfies that criterion. In conclusion, as we know, Coca-Cola FEMSA is a company with a presence across many countries, but it is important that it gives continuity to its success. Because of that, I believe it is very important for the company to carry out this type of analysis and to consider the variables on which sales depend; for example, one variable that is very important for the company’s sales is itaee, the Indicator of State Economic Activity.

REFERENCES

APPENDIX

plot(effect("lag(log(sales_unitboxes))",modelcc4),main="Predicted Sales according to the lag(log(sales_unitboxes))", xlab="lag(log(sales_unitboxes))", ylab="log(sales_unitboxes)")
## Warning in effect.llines(x[good], y[good], lwd = lwd, col = colors[1], lty =
## lines, : spline interpolation may be unstable with only 4 points
## Warning in panel.bands(x[good], y[good], upper[good], lower[good], fill =
## band.colors[1], : spline interpolation may be unstable with only 4 points