1 Exam Part III

Please download the dataset entertainment_stocks.csv (Canvas). Your task is to perform a time series analysis for a stock price. Please select a stock price and complete the following tasks:

1.0.1 Import Data

library(readr)
entre <- read_csv("entre.csv")
head(entre)
## # A tibble: 6 × 7
##   Date       Disney_Adj_Close Netflix_Adj_Close Nintendo_Adj_Close WBD_Adj_Close
##   <chr>                 <dbl>             <dbl>              <dbl>         <dbl>
## 1 01/01/2007             29.1              3.26               37.1          8.47
## 2 02/01/2007             28.4              3.22               33.1          8.21
## 3 03/01/2007             28.5              3.31               36.3          9.78
## 4 04/01/2007             29.0              3.17               40.0         11.1 
## 5 05/01/2007             29.3              3.13               43.6         12.0 
## 6 06/01/2007             28.6              2.77               45.8         11.8 
## # ℹ 2 more variables: EA_Adj_Close <dbl>, Paramount_Adj_Close <dbl>

1.0.2 Loading required libraries

library(xts)
library(dplyr)
library(zoo)
library(tseries)
library(stats)
library(forecast)
library(astsa)
library(corrplot)
library(AER)
library(vars)
library(dynlm)
library(vars)
library(TSstudio)
library(Metrics)  # For RMSE calculation
library(tidyverse)
library(sarima)
library(dygraphs)

1.0.3 Setting time series format

entre$Date<-as.Date(entre$Date,"%m/%d/%Y") 
summary(entre$Nintendo_Adj_Close)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.12   16.94   33.66   34.87   46.76   80.52

1.1 A) VISUALIZATION (25 pts)

1.1.1 - Plot the stock price / variable using a time series format.

# time series plot 1
plot(entre$Date,entre$Nintendo_Adj_Close,type="l",col="blue", lwd=2, xlab ="Date",ylab ="Adjusted Close Price", main = "Nintendo Adjusted Close Stock Price")

# time series plot 2
entrexts<-xts(entre$Nintendo_Adj_Close,order.by=entre$Date)
plot(entrexts)

1.1.2 -Decompose the time series data in observed, trend, seasonality, and random.

entrets<-ts(entre$Nintendo_Adj_Close,frequency=52,start=c(2015,1))
entrets_decompose<-decompose(entrets)
plot(entrets_decompose)    

1.1.3 - Briefly comment on the following components:

  1. Do the time series data show a trend? At an initial observation, there appears to be a parabolic positive trend. In early 2015, the stock price was notably high. Over time, it declined, only to rise again. However, in recent years, there’s been a discernible decline. Although there is a trend, it’s neither linear nor consistent. From my understanding of the company, several external factors have influenced these highs and lows. The recent dip can be attributed to the Switch console becoming outdated post-pandemic with the releases of other significant consoles as PS5 and Xbox Series X, coupled with a significant decrease in annual sales. The limited release of notable games has also severely impacted the company’s performance.
  1. Do the time series data show seasonality? How is the change of the seasonal component over time? Certainly, there appears to be a degree of seasonality in the data. When contextualized, the seasonally decomposed graph highlights spikes in stock prices at the start, middle (likely influenced by summer), and end of the year. This aligns logically with consumer behavior for Nintendo products. Parents and guardians often purchase games during summer and winter vacations, as well as during holiday seasons. These are times when children have more leisure hours or when games are bought as gifts.

1.2 B) ESTIMATION (25 pts)

###- Detect if the time series data is stationary.

# 1) Stationary Test 
adf.test(entre$Nintendo_Adj_Close) 
## 
##  Augmented Dickey-Fuller Test
## 
## data:  entre$Nintendo_Adj_Close
## Dickey-Fuller = -2.1034, Lag order = 5, p-value = 0.5329
## alternative hypothesis: stationary

The result shows that the data is not stationary since we have a p-value of 0.5329 we feil to reject the null hypothesis

1.2.1 - Detect if the time series data shows serial autocorrelation.

# 2) Serial Autocorrelation
# ACF plots: correlation between two periods in a time series is referred as autocorrelation function (acf)
acf(entre$Nintendo_Adj_Close,main="Significant Autocorrelations")  # Generally, non-stationary time series data show the presence of serial autocorrelation.  

> The plot reveals significant serial correlation that diminishes gradually but is still notable even after 6 periods. Given that the data is collected monthly, it’s reasonable to infer that the stock performance from one month carries over to subsequent months. This lingering effect results in a high degree of autocorrelation, especially in the initial lags.

1.2.2 - Estimate 2 different time series regression models. You might want to consider ARMA (p,q) and / or ARIMA (p,d,q).

1.2.2.1 ARMA model

# Model the Time Series using ARMA(1,1)
NINTENDO_ARMA <- arima(log(entre$Nintendo_Adj_Close), order=c(1,0,1))
# Using the arima() function with d=0 to simulate an ARMA model like described in class (during examn session)

# Display the model summary for diagnostics
summary(NINTENDO_ARMA)
## 
## Call:
## arima(x = log(entre$Nintendo_Adj_Close), order = c(1, 0, 1))
## 
## Coefficients:
##          ar1      ma1  intercept
##       0.9697  -0.0074     3.2829
## s.e.  0.0176   0.0724     0.3142
## 
## sigma^2 estimated as 0.0229:  log likelihood = 88.7,  aic = -169.41
## 
## Training set error measures:
##                        ME      RMSE        MAE        MPE     MAPE     MASE
## Training set -0.002856949 0.1513374 0.08573275 -0.3286407 2.657598 1.018259
##                     ACF1
## Training set 0.001404071
# Plot the diagnostics for the ARMA model
tsdiag(NINTENDO_ARMA)

# Compute the estimated stock prices
NINTENDO_estimated_stock_price <- exp(fitted(NINTENDO_ARMA))

# Plot the estimated stock prices
plot(NINTENDO_estimated_stock_price, main="Estimated Stock Prices from ARMA(1,1)")

# Diagnostic Tests
Box.test(NINTENDO_ARMA$residuals, lag=5, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  NINTENDO_ARMA$residuals
## X-squared = 3.3084, df = 5, p-value = 0.6526
# Check for stationarity in the estimated stock prices using Augmented Dickey-Fuller test
adf.test(na.omit(NINTENDO_estimated_stock_price), alternative="stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(NINTENDO_estimated_stock_price)
## Dickey-Fuller = -2.0914, Lag order = 5, p-value = 0.5379
## alternative hypothesis: stationary

1.2.2.2 ARIMA model

# Log Transform the Time Series Data
nintendo_log <- log(entre$Nintendo_Adj_Close)

# Plot the log-transformed stock price
plot(entre$Date, nintendo_log, 
     type="l", col="blue", lwd=2, 
     xlab="Date", ylab="log(Stock Price)", 
     main="Nintendo Stock Price Log Transformation")

# Plot the first difference of the log-transformed stock price
plot(diff(nintendo_log), 
     type="l", ylab="First Order Difference", 
     main="Difference of Nintendo Stock Price Log Transformation")

# Model the Time Series using ARIMA(1,1,1)
NINTENDO_ARIMA <- Arima(nintendo_log, order=c(1,1,1))

# Display the model summary for diagnostics
summary(NINTENDO_ARIMA)
## Series: nintendo_log 
## ARIMA(1,1,1) 
## 
## Coefficients:
##           ar1     ma1
##       -0.1150  0.0914
## s.e.   1.0303  1.0377
## 
## sigma^2 = 0.02348:  log likelihood = 88.28
## AIC=-170.56   AICc=-170.43   BIC=-160.8
## 
## Training set error measures:
##                        ME     RMSE        MAE        MPE     MAPE      MASE
## Training set -0.006733967 0.152019 0.08374382 -0.3518187 2.592671 0.9946365
##                      ACF1
## Training set -0.001815814
# Plot the diagnostics for the ARIMA model
tsdiag(NINTENDO_ARIMA)

# Diagnostic Tests
# a) Check for autocorrelation in the residuals using ACF
acf(NINTENDO_ARIMA$residuals, main="ACF - ARIMA (1,1,1)")

# b) Use the Ljung-Box test to check for autocorrelation in residuals
Box.test(NINTENDO_ARIMA$residuals, lag=1, type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  NINTENDO_ARIMA$residuals
## X-squared = 0.000643, df = 1, p-value = 0.9798
# c) Check for stationarity in the residuals using Augmented Dickey-Fuller test
adf.test(NINTENDO_ARIMA$residuals, alternative="stationary")
## Warning in adf.test(NINTENDO_ARIMA$residuals, alternative = "stationary"):
## p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  NINTENDO_ARIMA$residuals
## Dickey-Fuller = -4.951, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

1.3 C) EVALUATION (25 pts)

1.3.1 - Based on diagnostic tests, compare the 2 estimated time series regression models,

and select the results that you consider might generate the best forecast.

ARMA(1,1) Model:

  • Autocorrelation: The Ljung-Box test shows a p-value of 0.6526, suggesting the residuals are not significantly autocorrelated.
  • AIC: A lower AIC value of -169.41 suggests better model fit.
  • Stationarity: The Augmented Dickey-Fuller test p-value of 0.5379 indicates that the series is not stationary.

ARIMA(1,1,1) Model:

  • Autocorrelation: The Ljung-Box test has a p-value of 0.9798, indicating no significant autocorrelation in residuals.
  • AIC: A slightly higher AIC of -170.56, although differences are marginal.
  • Stationarity: The Augmented Dickey-Fuller test p-value of 0.01 indicates the residuals are stationary, unlike the ARMA model.

Selection: The ARIMA(1,1,1) model outperforms in terms of stationarity and exhibits no significant autocorrelation, making it more suitable for stock prediction. The AIC difference between the two models is minimal, so the edge in stationarity and autocorrelation makes ARIMA(1,1,1) the preferred choice for forecasting.

1.4 D) FORECAST (25 pts)

1.4.1 - By using the selected model, make a forecast for the next 5 periods. In doing so,

include a time series plot showing your forecast.

# Forecast the next 5 periods using the selected ARIMA model (in log scale)
NINTENDO_forecast_log <- forecast(NINTENDO_ARIMA, h=5)

# Convert the forecasted log values and their upper and lower bounds to the original scale
NINTENDO_forecast_values <- exp(NINTENDO_forecast_log$mean)
NINTENDO_forecast_upper <- exp(NINTENDO_forecast_log$upper)
NINTENDO_forecast_lower <- exp(NINTENDO_forecast_log$lower)

# Print the forecasted values along with their upper and lower bounds
cat("Forecasted Values for next 5 periods:\n")
## Forecasted Values for next 5 periods:
print(NINTENDO_forecast_values)
## Time Series:
## Start = 193 
## End = 197 
## Frequency = 1 
## [1] 10.42853 10.42755 10.42766 10.42765 10.42765
cat("\nUpper Bound (95% Prediction Interval):\n")
## 
## Upper Bound (95% Prediction Interval):
print(NINTENDO_forecast_upper[,2])
## Time Series:
## Start = 193 
## End = 197 
## Frequency = 1 
## [1] 14.08137 15.86594 17.40811 18.82627 20.17302
cat("\nLower Bound (95% Prediction Interval):\n")
## 
## Lower Bound (95% Prediction Interval):
print(NINTENDO_forecast_lower[,2])
## Time Series:
## Start = 193 
## End = 197 
## Frequency = 1 
## [1] 7.723275 6.853283 6.246295 5.775753 5.390165
# Plot the historical stock prices (on original scale)
plot(entre$Nintendo_Adj_Close, type="l", col="blue", lwd=2, xlab="Time", ylab="Stock Price", main="Nintendo Stock Price with Forecast", xlim=c(1, length(entre$Nintendo_Adj_Close)+5))

# Add the prediction intervals as shaded regions
polygon(c(seq(length(entre$Nintendo_Adj_Close)+1, length(entre$Nintendo_Adj_Close)+5), rev(seq(length(entre$Nintendo_Adj_Close)+1, length(entre$Nintendo_Adj_Close)+5))), 
        c(NINTENDO_forecast_upper[,2], rev(NINTENDO_forecast_lower[,2])), 
        col=rgb(0.8,0.8,0.8), border=NA)

# Add the forecasted stock prices to the plot (after shaded region to ensure it's on top)
lines(c(rep(NA, length(entre$Nintendo_Adj_Close)), NINTENDO_forecast_values), col="red", lwd=2)

# Add legend
legend("topright", legend=c("Actual", "Forecast", "95% Prediction Interval"), 
       fill=c("blue", "red", rgb(0.8,0.8,0.8)))

Based on the results we can see that:

  • 2. Stable Short-Term Outlook: The forecast suggests a relatively stable stock price for Nintendo in the immediate future, hovering around the 10.43 mark.

  • 1. Significant Upcoming Catalysts: With the anticipated release of major game titles in 2023 and a potential new console in 2024, Nintendo has significant bullish catalysts on the horizon that could push stock prices upwards.

  • 3.High Forecast Uncertainty: The wide gap between the prediction interval’s upper and lower bounds indicates considerable uncertainty. This underscores the importance of closely monitoring market responses to Nintendo’s releases and other related news.

2 Exam Part IV

2.0.1 Import Data

library(readr)
cocac <- read_csv("cocac.csv")
head(cocac)
## # A tibble: 6 × 15
##   tperiod sales_unitboxes consumer_sentiment   CPI inflation_rate unemp_rate
##   <chr>             <dbl>              <dbl> <dbl>          <dbl>      <dbl>
## 1 15-ene         5516689.               38.1  87.1          -0.09     0.0523
## 2 15-feb         5387496.               37.5  87.3           0.19     0.0531
## 3 15-mar         5886747.               38.5  87.6           0.41     0.0461
## 4 15-abr         6389182.               37.8  87.4          -0.26     0.051 
## 5 15-may         6448275.               38.0  87.0          -0.5      0.0552
## 6 15-jun         6697947.               39.1  87.1           0.17     0.0507
## # ℹ 9 more variables: gdp_percapita <dbl>, itaee <dbl>, itaee_growth <dbl>,
## #   pop_density <dbl>, job_density <dbl>, pop_minwage <dbl>,
## #   exchange_rate <dbl>, max_temperature <dbl>, holiday_month <dbl>

2.1 A) EXPLORATORY DATA ANALYSIS (20 pts):

2.1.1 - Briefly describe the dataset. For example, what is the structure of the dataset?

str(cocac)
## spc_tbl_ [48 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ tperiod           : chr [1:48] "15-ene" "15-feb" "15-mar" "15-abr" ...
##  $ sales_unitboxes   : num [1:48] 5516689 5387496 5886747 6389182 6448275 ...
##  $ consumer_sentiment: num [1:48] 38.1 37.5 38.5 37.8 38 ...
##  $ CPI               : num [1:48] 87.1 87.3 87.6 87.4 87 ...
##  $ inflation_rate    : num [1:48] -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
##  $ unemp_rate        : num [1:48] 0.0523 0.0531 0.0461 0.051 0.0552 0.0507 0.0542 0.0547 0.0538 0.0539 ...
##  $ gdp_percapita     : num [1:48] 11660 11660 11660 11626 11626 ...
##  $ itaee             : num [1:48] 104 104 104 108 108 ...
##  $ itaee_growth      : num [1:48] 0.0497 0.0497 0.0497 0.0318 0.0318 0.0318 0.0565 0.0565 0.0565 0.0056 ...
##  $ pop_density       : num [1:48] 98.5 98.5 98.5 98.8 98.8 ...
##  $ job_density       : num [1:48] 18.3 18.5 18.6 18.7 18.7 ...
##  $ pop_minwage       : num [1:48] 9.66 9.66 9.66 9.59 9.59 ...
##  $ exchange_rate     : num [1:48] 14.7 14.9 15.2 15.2 15.3 ...
##  $ max_temperature   : num [1:48] 28 31 29 32 34 32 29 29 29 29 ...
##  $ holiday_month     : num [1:48] 0 0 0 1 0 0 0 0 1 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   tperiod = col_character(),
##   ..   sales_unitboxes = col_double(),
##   ..   consumer_sentiment = col_double(),
##   ..   CPI = col_double(),
##   ..   inflation_rate = col_double(),
##   ..   unemp_rate = col_double(),
##   ..   gdp_percapita = col_double(),
##   ..   itaee = col_double(),
##   ..   itaee_growth = col_double(),
##   ..   pop_density = col_double(),
##   ..   job_density = col_double(),
##   ..   pop_minwage = col_double(),
##   ..   exchange_rate = col_double(),
##   ..   max_temperature = col_double(),
##   ..   holiday_month = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

2.1.2 - How many observations include the dataset? Is there any presence of missing

values in the dataset?

# Number of observations
nobs <- nrow(cocac)

# Check for missing values
missing_values <- sum(is.na(cocac))

cat("Number of observations:", nobs, "\n")
## Number of observations: 48
cat("Number of missing values:", missing_values, "\n")
## Number of missing values: 0

2.1.3 - Include summary of descriptive statistics. What is the mean, min, and max values

of the dependent variable?

# Summary of dependent variables: Coca Cola Sales (in unit boxes)
summary_stats <- summary(cocac$sales_unitboxes)

# Descriptive Statistics
mean_value <- mean(cocac$sales_unitboxes, na.rm = TRUE)
min_value <- min(cocac$sales_unitboxes, na.rm = TRUE)
max_value <- max(cocac$sales_unitboxes, na.rm = TRUE)

# Print results
list(
  Summary = summary_stats,
  Mean = mean_value,
  Min = min_value,
  Max = max_value
)
## $Summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 5301755 6171767 6461357 6473691 6819782 7963063 
## 
## $Mean
## [1] 6473691
## 
## $Min
## [1] 5301755
## 
## $Max
## [1] 7963063

The dataset contains monthly data across a 4-year span, with a total of 48 observations and no missing values. It appears to track various economic, environmental, and demographic indicators, such as consumer sentiment, inflation rate, unemployment rate, GDP per capita, and even maximum temperatures. The primary metric of interest is sales_unitboxes, which varies from about 5.3 million to almost 8 million over the examined period. On average, around 6.47 million units were sold in a given month. The lack of missing values suggests that the dataset is comprehensive and well-maintained for the period in question. The various economic indicators in the dataset suggest a potential focus on understanding how broader economic and environmental factors might influence sales.

2.2 B) DATA VISUALIZATION (20 pts):

2.2.1 - Show at least 4-5 plots including pair-wised graphs, frequency plots, correlation

matrix plots, scatter plots, box plots, and / or histogram plots, etc.

2.2.1.1 1. Correlation Plot

This plot showcases the correlation between sales_unitboxes and other numeric columns.

cor_matrix <- cor(cocac[,-1])  # Exclude the character column 'tperiod' 
corrplot(cor_matrix, method = "circle")

2.2.1.2 2. Distribution Plot (Histogram)

This plot displays the distribution of the sales_unitboxes.

ggplot(cocac, aes(x=sales_unitboxes)) +
  geom_histogram(aes(y=..density..), fill="blue", alpha=0.5, bins=30) +
  geom_density(color="red") +
  labs(title="Distribution of Sales Unitboxes", x="Sales", y="Density")

2.2.1.3 3. Bar Plot for Sales vs Consumer Sentiment

This bar plot displays the average sales for different levels of consumer sentiment.

cocac$consumer_sentiment_group <- cut(cocac$consumer_sentiment, breaks=4, labels=c("Low", "Medium-Low", "Medium-High", "High"))

ggplot(cocac, aes(x=consumer_sentiment_group, y=sales_unitboxes)) +
  geom_bar(stat="summary", fun="mean", aes(fill=consumer_sentiment_group)) +
  labs(title="Sales vs Consumer Sentiment", x="Consumer Sentiment", y="Average Sales")

2.2.1.4 4. Bar Plot for Sales vs CPI (Consumer Price Index)

This bar plot showcases the average sales at different CPI levels.

cocac$CPI_group <- cut(cocac$CPI, breaks=4, labels=c("Low", "Medium-Low", "Medium-High", "High"))

ggplot(cocac, aes(x=CPI_group, y=sales_unitboxes)) +
  geom_bar(stat="summary", fun="mean", aes(fill=CPI_group)) +
  labs(title="Sales vs CPI", x="CPI", y="Average Sales")

2.2.1.5 5. Bar Plot for Sales vs Max Temperature

This will help us understand if there’s any seasonality in sales related to temperature or if temperature affects sales in some other way.

cocac$max_temp_group <- cut(cocac$max_temperature, breaks=4, labels=c("Low", "Medium-Low", "Medium-High", "High"))

ggplot(cocac, aes(x=max_temp_group, y=sales_unitboxes)) +
  geom_bar(stat="summary", fun="mean", aes(fill=max_temp_group)) +
  labs(title="Sales vs Max Temperature", x="Max Temperature", y="Average Sales")

2.2.2 - Select 1-2 plots and briefly describe the data patterns.

Observations: The histogram for sales distribution reveals a modest right-skewed pattern, hinting at the presence of atypical data. This skewness can be attributed to the uncommonly high sales figures, perhaps due to large-scale purchases by major clients like Walmart or HEB. However, these are exceptions and not Coca Cola’s primary market segment. Instead, most sales seem to be more commonly linked to traditional local stores, or “tienditas,” and local distributors, explaining the right skew. The correlation plot uncovers several independent variables exhibiting strong negative correlations with one another, hinting at potential multicollinearity issues in future regression analyses. Notably, max temperature stands out as the sole variable with a strong positive correlation to sales, suggesting a potential rise in sales with increasing temperatures. However, the nature of this relationship might be polynomial, peaking at a certain temperature before perhaps plateauing or even declining. The remaining graphs generally conform to expected patterns, with no other surprising insights.

2.3 C) LINEAR REGRESSION MODEL SPECIFICATION (20 pts):

2.3.1 - Briefly describe the hypotheses statements. That is, what is the expected impact

of the explanatory variable X on the dependent variable Y.

Hypotheses Statements: Based on our observations, for our linear regression models the dependent variable is Y, which represents sales, and the independent variables are all the other columns in the dataset. Our general hypothesis for each variable X can be formulated as:

H0 : The variable X has no impact on Y (i.e., the coefficient of X is 0).

H1 : The variable X has a significant impact on Y (the coefficient of X is not 0).

(Where must probably the variables that are going to have a more significant effect are max temperature, inflation rate and itaee)

2.3.2 - Estimate 2 different multiple linear regression model specifications.

Model 1: Using All Variables

model1 <- lm(sales_unitboxes ~ . -tperiod, data=cocac)  
# Excluding 'tperiod' as it's a character variable
summary(model1)
## 
## Call:
## lm(formula = sales_unitboxes ~ . - tperiod, data = cocac)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -418406 -211931  -24558  147931  656317 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                         -78983801   58943596  -1.340   0.1923  
## consumer_sentiment                      77995      93960   0.830   0.4143  
## CPI                                   -162562     132726  -1.225   0.2321  
## inflation_rate                          93966     407734   0.230   0.8196  
## unemp_rate                          -10965110   23997833  -0.457   0.6517  
## gdp_percapita                           -2319       2078  -1.116   0.2751  
## itaee                                   -2317     124403  -0.019   0.9853  
## itaee_growth                          6012858    6408657   0.938   0.3571  
## pop_density                           1320901     898293   1.470   0.1539  
## job_density                           -466194     606196  -0.769   0.4491  
## pop_minwage                            189226     207035   0.914   0.3695  
## exchange_rate                          -13413     235506  -0.057   0.9550  
## max_temperature                         -5148      96343  -0.053   0.9578  
## holiday_month                          166356     181185   0.918   0.3673  
## consumer_sentiment_groupMedium-Low     212739     663090   0.321   0.7510  
## consumer_sentiment_groupMedium-High     26089     797382   0.033   0.9742  
## consumer_sentiment_groupHigh           173919    1436310   0.121   0.9046  
## CPI_groupMedium-Low                    169908     401303   0.423   0.6756  
## CPI_groupMedium-High                   423387     669445   0.632   0.5328  
## CPI_groupHigh                          312535     781102   0.400   0.6925  
## max_temp_groupMedium-Low               358934     313480   1.145   0.2631  
## max_temp_groupMedium-High              989243     548510   1.804   0.0834 .
## max_temp_groupHigh                    1678503     756458   2.219   0.0358 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 372600 on 25 degrees of freedom
## Multiple R-squared:  0.794,  Adjusted R-squared:  0.6127 
## F-statistic:  4.38 on 22 and 25 DF,  p-value: 0.0002801

Interpretation: The regression results indicate that many predictor variables aren’t statistically significant in explaining the variance in sales_unitboxes, as their p-values are well above the typical 0.05 threshold. Notably, the variable max_temp_groupHigh shows significance, suggesting an influential relationship between higher temperatures and sales. Given the current model’s non-significance of some predictors and my interest in exploring a polynomial relationship with max_temperature, a refined model specification is needed

Model 2: Selected Variables and Polynomial relationship

model2 <- lm(sales_unitboxes ~ max_temperature + I(max_temperature^2) + unemp_rate + itaee_growth + pop_density + job_density + holiday_month + inflation_rate, data=cocac)
summary(model2)
## 
## Call:
## lm(formula = sales_unitboxes ~ max_temperature + I(max_temperature^2) + 
##     unemp_rate + itaee_growth + pop_density + job_density + holiday_month + 
##     inflation_rate, data = cocac)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -773165 -291017   84863  227545  733861 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          -11023554   35491994  -0.311   0.7578  
## max_temperature        -966744     558838  -1.730   0.0916 .
## I(max_temperature^2)     17590       8984   1.958   0.0574 .
## unemp_rate            41050210   18637484   2.203   0.0336 *
## itaee_growth          -3590732    4825519  -0.744   0.4613  
## pop_density             295442     405218   0.729   0.4703  
## job_density             -46841     430458  -0.109   0.9139  
## holiday_month            97649     151757   0.643   0.5237  
## inflation_rate          -43232     227906  -0.190   0.8505  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 431600 on 39 degrees of freedom
## Multiple R-squared:  0.5687, Adjusted R-squared:  0.4803 
## F-statistic: 6.429 on 8 and 39 DF,  p-value: 2.567e-05

2.4 D) RESULTS ANALYSIS (40 pts):

2.4.1 - Evaluate each regression model using model diagnostics (e.g., multicollinearity

and heteroscedasticity).

2.4.1.1 Model 1 Diagnostic Tests

# Load necessary libraries
library(car)
library(lmtest)

# Model 1 Diagnostics

# Check for multicollinearity using VIF
vif_values_model1 <- vif(model1)
print(vif_values_model1)
##                                GVIF Df GVIF^(1/(2*Df))
## consumer_sentiment        24.946823  1        4.994680
## CPI                      152.568047  1       12.351844
## inflation_rate             8.542516  1        2.922758
## unemp_rate                 6.723036  1        2.592882
## gdp_percapita             92.341357  1        9.609441
## itaee                    118.364451  1       10.879543
## itaee_growth               2.965726  1        1.722128
## pop_density              454.504354  1       21.319108
## job_density              194.923670  1       13.961507
## pop_minwage               14.625876  1        3.824379
## exchange_rate             48.431367  1        6.959265
## max_temperature           21.794374  1        4.668444
## holiday_month              2.128012  1        1.458771
## consumer_sentiment_group 379.473045  3        2.690689
## CPI_group                713.339485  3        2.989161
## max_temp_group           100.360730  3        2.155728
# Check for heteroscedasticity using Breusch-Pagan test
bptest_result_model1 <- bptest(model1)
print(bptest_result_model1)
## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 19.543, df = 22, p-value = 0.6116
# Plot residuals to check their distribution
hist(residuals(model1), main="Residual Distribution for Model 1", xlab="Residuals", border="blue", col="lightblue")

> Upon analyzing the diagnostics for Model 1, we observe potential concerns. The Variance Inflation Factor (VIF) values indicate significant multicollinearity for several variables, notably CPI, gdp_percapita, itaee, pop_density, job_density, consumer_sentiment_group, and CPI_group, suggesting redundancy among these predictors. Meanwhile, the Breusch-Pagan test result (p-value = 0.6116) does not indicate heteroscedasticity. Finally, the distribution of residuals shows a slight right skew, implying that our model might be leaving out some influential observations or nonlinear patterns in the data but overall a good performance for residuals.

2.4.1.2 Model 2 Diagnostic Tests

# Model 2 Diagnostics

# Check for multicollinearity using VIF
vif_values_model2 <- vif(model2)
print(vif_values_model2)
##      max_temperature I(max_temperature^2)           unemp_rate 
##           546.441765           551.349276             3.021783 
##         itaee_growth          pop_density          job_density 
##             1.253003            68.920224            73.242896 
##        holiday_month       inflation_rate 
##             1.112488             1.988888
# Check for heteroscedasticity using Breusch-Pagan test
bptest_result_model2 <- bptest(model2)
print(bptest_result_model2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model2
## BP = 4.8361, df = 8, p-value = 0.7749
# Plot residuals to check their distribution
hist(residuals(model2), main="Residual Distribution for Model 2", xlab="Residuals", border="blue", col="lightblue")

> For Model 2, the diagnostic tests indicate mixed findings. While the VIF values for most of the predictors are within acceptable limits, suggesting no major concerns of multicollinearity, the max_temperature and its squared term have notably high VIF values and also the variables realted to density. This indicates potential multicollinearity between these two pairs of variables, which is expected as one is derived from the other. The Breusch-Pagan test result (p-value = 0.7749) suggests that heteroscedasticity isn’t a concern for this model. However, the peculiar shape of the residual distribution, resembling a normal distribution but lacking tails, hints at possible anomalies or truncations in the data that may need further investigation.

2.4.2 - Select the regression model that better fits the data (e.g., AIC and / or RMSE).

# Assuming model1 and model2 are already defined

# For Model 1
aic_model1 <- AIC(model1)
r2_model1 <- summary(model1)$r.squared
rmse_model1 <- rmse(predict(model1), cocac$sales_unitboxes)

# For Model 2
aic_model2 <- AIC(model2)
r2_model2 <- summary(model2)$r.squared
rmse_model2 <- rmse(predict(model2), cocac$sales_unitboxes)

# Print the results
cat("Model 1 Metrics:\n")
## Model 1 Metrics:
cat("AIC:", aic_model1, "\nR-squared:", r2_model1, "\nRMSE:", rmse_model1, "\n\n")
## AIC: 1384.423 
## R-squared: 0.7939877 
## RMSE: 268909.4
cat("Model 2 Metrics:\n")
## Model 2 Metrics:
cat("AIC:", aic_model2, "\nR-squared:", r2_model2, "\nRMSE:", rmse_model2)
## AIC: 1391.885 
## R-squared: 0.5687287 
## RMSE: 389076.2

Based on the metrics provided, Model 1 appears to be the better fit. The AIC (Akaike Information Criterion) for Model 1 is lower than that for Model 2, suggesting that Model 1 provides a better balance of goodness of fit and parsimony. Furthermore, the R-squared for Model 1 is significantly higher, indicating that it explains more of the variability in the data than Model 2. Lastly, the RMSE (Root Mean Square Error) for Model 1 is also smaller, which implies that Model 1’s predictions are, on average, closer to the actual values. Given these metrics, Model 1 would be the preferred choice at this stage. However, Model 2 might be improved with some modifications, possibly bridging the gap in performance or even surpassing Model 1. But based on the current evidence in accuracy measures and diagnostic test, Model 1 is the better fit for the data.

2.4.3 - Interpret the regression results of selected regression model. That is, describe

how much is the impact of the explanatory variables X’s on the dependent variable Y.

In the context of the Model 1 regression results, the impact of explanatory variables on the dependent variable, Coca-Cola sales (measured in unit boxes), is evident. The consumer_sentiment suggests that with every unit increase in sentiment, the sales of Coca-Cola increase by 77,995 unit boxes, holding all other factors constant. This showcases the positive relation between consumer confidence and consumption behavior. Furthermore, the CPI coefficient indicates that for every unit rise in the Consumer Price Index, the sales dip by 162,562 unit boxes. Similarly, factors like unemployment rate (unemp_rate) and GDP per capita (gdp_percapita) have respective negative impacts on the sales, emphasizing the significance of economic conditions on consumer purchasing patterns.

The pop_density, though not statistically significant at conventional levels, suggests that as population density increases by one unit, sales increase by 1,320,901 unit boxes. On the other hand, the max_temperature variable doesn’t exhibit a statistically significant linear relationship with sales in this model. The categories within variables, such as consumer sentiment groups and max temperature groups, further provide nuances. Specifically, the sales appear to be significantly higher in high max temperature conditions compared to the reference category.

2.4.3.1 Key insights include:

Positive Impact: Higher consumer sentiment is associated with increased sales. Higher population density is related to greater sales, indicating urban areas might be key markets.

Negative Impact: A rise in CPI, reflecting inflationary pressures, corresponds to decreased sales. Economic indicators such as unemployment rate and GDP per capita negatively influence sales, showcasing the importance of overall economic health.

Significant Observations: The sales are notably higher during months with higher maximum temperatures, reinforcing the belief in the seasonal nature of soft drink sales.

2.4.4 - Use the selected regression results to make predictions between the X’s

regressors and the dependent variable (e.g., effects plots).

# Installing and loading necessary packages
library(effects)
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
# Most relevant individual effect plots

# Individual effect plot for Consumer Sentiment
plot(effect("consumer_sentiment", model1))

# Individual effect plot for CPI
plot(effect("CPI", model1))

# Individual effect plot for Maximum Temperature
plot(effect("max_temperature", model1))

# Individual effect plot for Unemployment Rate
plot(effect("unemp_rate", model1))

# Global comparison of prediction

# Getting predicted values from the model
predicted_values <- predict(model1)

# Actual vs Predicted plot
plot(cocac$sales_unitboxes, predicted_values, 
     main="Actual vs Predicted Sales", 
     xlab="Actual Sales", 
     ylab="Predicted Sales")
abline(0,1, col="red") # Line of perfect prediction

> While the results from Model 1 suggest that the predictions align relatively well with the actual values, making it a seemingly acceptable fit for the data at hand, it’s crucial to approach it with caution. The persistent issue of multicollinearity in the model could lead to biased predictions. Hence, refining Model 2, which potentially offers a more robust framework by addressing the multicollinearity concerns, would be a more prudent route to ensure accurate and unbiased forecasting.

David Dominguez a01570975