#Libraries uploading
library(tseries)
library(plotly)
library(xts)
library(dplyr)
library(zoo)
library(tseries)
library(stats)
library(forecast)
library(astsa)
library(corrplot)
library(AER)
library(vars)
library(dynlm)
library(vars)
library(TSstudio)
library(tidyverse)
library(sarima)
library(dygraphs)
library(TSstudio)
library(dlookr)
library(kableExtra)
library(effects)
library(visdat)
library(foreign)
library(dplyr) # data manipulation
library(forcats) # to work with categorical variables
library(ggplot2) # data visualization
library(readr) # read specific csv files
library(janitor) # data exploration and cleaning
library(Hmisc) # several useful functions for data analysis
library(psych) # functions for multivariate analysis
library(naniar) # summaries and visualization of missing values NAs
library(dlookr) # summaries and visualization of missing values NAs
library(corrplot) # correlation plots
library(jtools) # presentation of regression analysis
library(lmtest) # diagnostic checks - linear regression analysis
library(car) # diagnostic checks - linear regression analysis
library(kableExtra) # HTML table attributes
library(tidyverse)
library(caret)
library(glmnet)
library(Metrics)
CC <- read.csv("C:\\Users\\danyb\\OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey\\Docs\\Documentos\\Business Intelligence\\Quinto Semestre\\Introduction to Econometrics\\Examen\\cocacola.csv")
coca-cola dataset will be used for this analysis
The Coca-Cola Company: “Founded in 1886, The Coca-Cola Company is the world’s largest beverage company, refreshing consumers with more than 500 sparkling and still brands. The Coca-Cola Company’s corporate headquarters are in Atlanta with local operations in more than 200 countries around the world”. (Coca-Cola FEMSA / 2016 Annual Report, 2016)
Variables for this data set are the following:
tperiod: date
sales_unitboxes dependent variable: sales coca-cola unit boxes
consumer_sentiment: how consumers feel about the state of the economy
CPI: consumer price index 2018=100
inflation_rate: change in the consumer price index 2018=100
unemp_rate: percentage of the labor force that is unemployed
gdp_percapita: gross domestic population by population
itaee: Indicator of the State Economic Activity - ITAEE
itaee_growth: itaee’s growth rate
pop_density: population per km2
job_density: employed population per km2
pop_minwage: population per km2 earning 1-2 miniumum wages
exchange_rate: exchange rate U.S. - MXN
max_temperature: average max temperature
holiday_month: 1 if month includes a holiday week including: public holiday, easter holiday, and christmas; 0 otherwise
str(CC)
## 'data.frame': 48 obs. of 15 variables:
## $ tperiod : chr "15-ene" "15-feb" "15-mar" "15-abr" ...
## $ sales_unitboxes : num 5516689 5387496 5886747 6389182 6448275 ...
## $ consumer_sentiment: num 38.1 37.5 38.5 37.8 38 ...
## $ CPI : num 87.1 87.3 87.6 87.4 87 ...
## $ inflation_rate : num -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
## $ unemp_rate : num 0.0523 0.0531 0.0461 0.051 0.0552 0.0507 0.0542 0.0547 0.0538 0.0539 ...
## $ gdp_percapita : num 11660 11660 11660 11626 11626 ...
## $ itaee : num 104 104 104 108 108 ...
## $ itaee_growth : num 0.0497 0.0497 0.0497 0.0318 0.0318 0.0318 0.0565 0.0565 0.0565 0.0056 ...
## $ pop_density : num 98.5 98.5 98.5 98.8 98.8 ...
## $ job_density : num 18.3 18.5 18.6 18.7 18.7 ...
## $ pop_minwage : num 9.66 9.66 9.66 9.59 9.59 ...
## $ exchange_rate : num 14.7 14.9 15.2 15.2 15.3 ...
## $ max_temperature : int 28 31 29 32 34 32 29 29 29 29 ...
## $ holiday_month : int 0 0 0 1 0 0 0 0 1 0 ...
It is possible to see that the data set mostly has numeric variables, except for preiod, max_temperature and holiday_month. It will be necessary to transform some of them.
CC$tperiod <- as.Date(CC$tperiod, format = "yyyy-mm-dd") #CORRECT FORMAT FOR DATES
CC$max_temperature<-as.numeric(CC$max_temperature)# In order to make some plots, it needs to be numeric.
dim(CC)
## [1] 48 15
The dataset includes 48 observations, and 15 different variables.
sum(is.na(CC))
## [1] 48
There are no missing values for this dataset.
summary(CC)
## tperiod sales_unitboxes consumer_sentiment CPI
## Min. :NA Min. :5301755 Min. :28.67 Min. : 86.97
## 1st Qu.:NA 1st Qu.:6171767 1st Qu.:35.64 1st Qu.: 89.18
## Median :NA Median :6461357 Median :36.76 Median : 92.82
## Mean :NaN Mean :6473691 Mean :37.15 Mean : 93.40
## 3rd Qu.:NA 3rd Qu.:6819782 3rd Qu.:38.14 3rd Qu.: 98.40
## Max. :NA Max. :7963063 Max. :44.87 Max. :103.02
## NA's :48
## inflation_rate unemp_rate gdp_percapita itaee
## Min. :-0.5000 Min. :0.03470 Min. :11559 Min. :103.8
## 1st Qu.: 0.1650 1st Qu.:0.04010 1st Qu.:11830 1st Qu.:111.5
## Median : 0.3850 Median :0.04370 Median :12014 Median :113.5
## Mean : 0.3485 Mean :0.04442 Mean :11979 Mean :113.9
## 3rd Qu.: 0.5575 3rd Qu.:0.04895 3rd Qu.:12162 3rd Qu.:117.1
## Max. : 1.7000 Max. :0.05520 Max. :12329 Max. :122.5
##
## itaee_growth pop_density job_density pop_minwage
## Min. :0.00560 Min. : 98.54 Min. :18.26 Min. : 9.398
## 1st Qu.:0.02237 1st Qu.: 99.61 1st Qu.:19.28 1st Qu.:10.794
## Median :0.02995 Median :100.67 Median :20.39 Median :11.139
## Mean :0.03172 Mean :100.65 Mean :20.38 Mean :11.116
## 3rd Qu.:0.04300 3rd Qu.:101.69 3rd Qu.:21.60 3rd Qu.:11.413
## Max. :0.05650 Max. :102.69 Max. :22.36 Max. :13.026
##
## exchange_rate max_temperature holiday_month
## Min. :14.69 Min. :26.00 Min. :0.00
## 1st Qu.:17.38 1st Qu.:29.00 1st Qu.:0.00
## Median :18.62 Median :30.00 Median :0.00
## Mean :18.18 Mean :30.50 Mean :0.25
## 3rd Qu.:19.06 3rd Qu.:32.25 3rd Qu.:0.25
## Max. :21.39 Max. :37.00 Max. :1.00
##
Looking at this summary, it is possible to identify that there are not outliers for this data set. All the variables have a similar value for their mean and median. Values for dependent variable are:
tsplot(CC$sales_unitboxes)
It is interesting to see how in this time series plot, sales in unit boxes seems to have an stational pattern, with rises and falls.
ggplot(CC, aes(x = inflation_rate, y = sales_unitboxes)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Inflation Rate",
y = "Sales in box units",
title = "Relation between Sales and Inflation Rate")
## `geom_smooth()` using formula = 'y ~ x'
It is possible to identify two different things in this graph:
Inflation rate seems to have a negative relation with sales.
The majority of inflation rate seems to be on the left side of the graph, which could be a left-skewed distribution.
par(mfrow=c(3, 4))
hist(CC$sales_unitboxes) # log
hist(CC$consumer_sentiment)
hist(CC$CPI) #Not
hist(CC$inflation_rate) #log
hist(CC$unemp_rate)
hist(CC$itaee) #log
hist(CC$itaee_growth) #Not
hist(CC$pop_density)
hist(CC$job_density)
hist(CC$pop_minwage) #log
hist(CC$max_temperature)
hist(CC$exchange_rate)
This histogram’s matrix were done with the intention of detecting non normal distribution in the variables for the dataset. If a normal distribution is not observed in one or more variables, it could be an option to transform them into their log value. An example for this are the variables “inflation_rate”, “itaee”, “pop_minwage”, among others.
CC1 <- CC %>% dplyr::select(-holiday_month, -tperiod)
corrplot(cor(CC1), type = 'upper', order = 'hclust', addCoef.col = 'black', tl.cex = 0.9, tl.srt = 90, mar = c(0,0,0,0),number.cex=1)
This correlation plot shows the relationship between variables in the dataset. It is important to find which of these variables have significant correlation with dependent variable “sales_unitboxes”. The highest correlated variables with dependent variables are:
It is relevant to understand that correlation is not the same as regression, this plot could give a few parameters to start estimating a model, but significative variables for an accurate model could not be the most correlated ones.
(I described the 4 graphs I made)
H1: It might be expected a positive relation between max temperature and dependent variable sales unit boxes.
H2: It might be expected a positive relation between itaee and sales unit boxes.
H3: It might be expected a negative relation between inflation rate and sales unit boxes.
model1<-lm(sales_unitboxes ~ consumer_sentiment + CPI + inflation_rate + unemp_rate+ gdp_percapita + itaee + itaee_growth + pop_density + job_density + log(pop_minwage) + exchange_rate + max_temperature + holiday_month, data=CC)
summary(model1)
##
## Call:
## lm(formula = sales_unitboxes ~ consumer_sentiment + CPI + inflation_rate +
## unemp_rate + gdp_percapita + itaee + itaee_growth + pop_density +
## job_density + log(pop_minwage) + exchange_rate + max_temperature +
## holiday_month, data = CC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -504466 -220234 -6898 196922 758377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -98517471 49342629 -1.997 0.0539 .
## consumer_sentiment 46725 28066 1.665 0.1051
## CPI -139593 101913 -1.370 0.1798
## inflation_rate 27714 282353 0.098 0.9224
## unemp_rate -9381467 19802288 -0.474 0.6387
## gdp_percapita -2627 1475 -1.781 0.0839 .
## itaee 12921 84192 0.153 0.8789
## itaee_growth 4561673 5401479 0.845 0.4043
## pop_density 1459283 762070 1.915 0.0640 .
## job_density -470162 485876 -0.968 0.3400
## log(pop_minwage) 2032031 1597502 1.272 0.2120
## exchange_rate -60533 119054 -0.508 0.6144
## max_temperature 178035 38946 4.571 6.13e-05 ***
## holiday_month 199639 143900 1.387 0.1744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 366300 on 34 degrees of freedom
## Multiple R-squared: 0.7292, Adjusted R-squared: 0.6256
## F-statistic: 7.042 on 13 and 34 DF, p-value: 2.348e-06
This model shows a few significant variables, where max_temperature are highly significant, with a 99% of significance.
model2<-lm(log(sales_unitboxes) ~ consumer_sentiment + inflation_rate + unemp_rate + log(itaee) + pop_density + job_density + log(pop_minwage) + exchange_rate + I(max_temperature^2) + holiday_month, data=CC)
summary(model2)
##
## Call:
## lm(formula = log(sales_unitboxes) ~ consumer_sentiment + inflation_rate +
## unemp_rate + log(itaee) + pop_density + job_density + log(pop_minwage) +
## exchange_rate + I(max_temperature^2) + holiday_month, data = CC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08391 -0.03474 0.00128 0.02903 0.12413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.119e-01 5.842e+00 -0.156 0.8768
## consumer_sentiment 7.108e-03 3.945e-03 1.802 0.0797 .
## inflation_rate 2.778e-02 3.961e-02 0.701 0.4875
## unemp_rate -4.072e-01 2.802e+00 -0.145 0.8853
## log(itaee) 2.997e+00 6.758e-01 4.435 7.96e-05 ***
## pop_density 3.621e-02 6.445e-02 0.562 0.5776
## job_density -1.081e-01 6.676e-02 -1.620 0.1138
## log(pop_minwage) 2.613e-01 2.284e-01 1.144 0.2599
## exchange_rate -2.264e-02 1.458e-02 -1.553 0.1290
## I(max_temperature^2) 5.134e-04 8.448e-05 6.077 4.94e-07 ***
## holiday_month 2.848e-02 2.000e-02 1.424 0.1629
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05574 on 37 degrees of freedom
## Multiple R-squared: 0.7176, Adjusted R-squared: 0.6412
## F-statistic: 9.4 on 10 and 37 DF, p-value: 1.613e-07
For model 2, only 3 variables remain significant. It will be necessary to test these models, in order to interpret the best one, and make some predictions.
#Hetocedasticity
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 15.24, df = 13, p-value = 0.2926
m1res<-resid(model1)
plot(fitted(model1),m1res)
Upon examining this graph and considering a p-value of 0.29, we do not have sufficient evidence to reject the null hypothesis (H0). This suggests the presence of homoscedasticity, indicating that the model’s residuals do not exhibit heteroscedasticity.
#Multicollinearity
vif(model1)
## consumer_sentiment CPI inflation_rate unemp_rate
## 2.302813 93.062556 4.238225 4.736072
## gdp_percapita itaee itaee_growth pop_density
## 48.148366 56.087168 2.179659 338.421913
## job_density log(pop_minwage) exchange_rate max_temperature
## 129.555197 7.368171 12.804899 3.684720
## holiday_month
## 1.388718
This results show some variables with a Variance Inflation Factor (VIF) greater than 10, which suggest the presence of multicollinearity in the model. Indicating that predictor variables are highly correlated with each other.
#Hetocedasticity
bptest(model2)
##
## studentized Breusch-Pagan test
##
## data: model2
## BP = 12.252, df = 10, p-value = 0.2685
m2res<-resid(model2)
plot(fitted(model2),m2res)
Upon examining this graph and considering a p-value of 0.2685, we do not have sufficient evidence to reject the null hypothesis (H0). This suggests the presence of homoscedasticity, indicating that the model’s residuals do not exhibit heteroscedasticity.
#Multicollinearity
vif(model2)
## consumer_sentiment inflation_rate unemp_rate
## 1.964909 3.603140 4.096399
## log(itaee) pop_density job_density
## 12.196889 104.548072 105.638166
## log(pop_minwage) exchange_rate I(max_temperature^2)
## 6.506213 8.296245 2.923265
## holiday_month
## 1.158825
This results show but less than model1 variables with a Variance Inflation Factor (VIF) greater than 10, which suggest the presence of multicollinearity in the model. Indicating that some predictor variables are highly correlated with each other.
AIC(model1,model2)
## df AIC
## model1 15 1379.5498
## model2 12 -129.4325
When using AIC as the selection criterion for models, we can observe that Model 2 has the lowest value. Therefore, it is the one that will be used with the aim of having a precise model. However, there are still issues that need to be addressed, such as the presence of multicollinearity for some variables.
#Adjusting model 2
fixed_model<-lm(log(sales_unitboxes) ~ consumer_sentiment + inflation_rate + unemp_rate + log(itaee)+ log(pop_minwage) + exchange_rate + I(max_temperature^2) + holiday_month, data=CC)
vif(fixed_model)
## consumer_sentiment inflation_rate unemp_rate
## 1.654092 2.974401 2.855593
## log(itaee) log(pop_minwage) exchange_rate
## 4.097255 3.476172 5.520924
## I(max_temperature^2) holiday_month
## 2.249282 1.084519
The variables “pop_density” and “job_density” were not highly relevant for this model and prediction. They were the ones showing the highest VIF, indicating higher levels of multicollinearity. That’s why the decision was made to remove them, and now the model no longer exhibits variables with high multicollinearity values.
AIC(fixed_model)
## [1] -125.3375
It still has a low value for AIC.
bptest(fixed_model)
##
## studentized Breusch-Pagan test
##
## data: fixed_model
## BP = 3.9856, df = 8, p-value = 0.8584
Fixed model has a p-value greater than 0.05, which indicates that there is not enough evidence to reject null hypothesis, which suggest the presence of homocedasticity.
histogram(fixed_model$residuals)
Residuals for this model follows a normal distribution, which suggest that statistically inference might be true.
summary(fixed_model)
##
## Call:
## lm(formula = log(sales_unitboxes) ~ consumer_sentiment + inflation_rate +
## unemp_rate + log(itaee) + log(pop_minwage) + exchange_rate +
## I(max_temperature^2) + holiday_month, data = CC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10327 -0.03970 0.01222 0.03163 0.12988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.714e+00 1.801e+00 4.282 0.000117 ***
## consumer_sentiment 6.305e-03 3.835e-03 1.644 0.108219
## inflation_rate -6.975e-03 3.814e-02 -0.183 0.855849
## unemp_rate 3.612e+00 2.479e+00 1.457 0.153216
## log(itaee) 1.583e+00 4.151e-01 3.813 0.000477 ***
## log(pop_minwage) -9.278e-02 1.769e-01 -0.524 0.602970
## exchange_rate -5.260e-03 1.261e-02 -0.417 0.678765
## I(max_temperature^2) 4.197e-04 7.853e-05 5.345 4.2e-06 ***
## holiday_month 2.064e-02 2.050e-02 1.006 0.320383
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05907 on 39 degrees of freedom
## Multiple R-squared: 0.6657, Adjusted R-squared: 0.5971
## F-statistic: 9.706 on 8 and 39 DF, p-value: 2.765e-07
In order to understand the results of this model, it is necessary to describe the impact that every explanatory variable has into dependent variable (y). This impact for each variable is:
It’s important to emphasize and take into account that the only significant variables for the model are “max_temperature” and “itaee”.
plot(effect("I(max_temperature^2)",fixed_model))
This plots were created in order to make predictions between the X’s regressors and the dependent variable, and to test hypotheses statements established before. Variable max temperature does have a positive relation with dependent variable of sales in units boxes, which indicates that H1 is accepted.
plot(effect("log(itaee)",fixed_model))
Variable itaee does have a positive relation with dependent variable of sales in units boxes, which indicates that H2 is accepted.
plot(effect("inflation_rate",fixed_model))
The effect plot for the inflation variable shows a very slight negative relationship with the dependent variable, but since it is a negative relation, we fail to reject H3.
REFERENCES: Coca-Cola FEMSA / 2016 Annual Report. (2016). Coca-Colafemsa.com. https://coca-colafemsa.com/reportes/reporte-anual-2016/eng/glossary.html