Please download the dataset entertainment_stocks.csv (Canvas). Your task is to perform a time series analysis for a stock price. Please select a stock price and complete the following tasks:
library(readr)
entre <- read_csv("entre.csv")
head(entre)
## # A tibble: 6 × 7
## Date Disney_Adj_Close Netflix_Adj_Close Nintendo_Adj_Close WBD_Adj_Close
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01/01/2007 29.1 3.26 37.1 8.47
## 2 02/01/2007 28.4 3.22 33.1 8.21
## 3 03/01/2007 28.5 3.31 36.3 9.78
## 4 04/01/2007 29.0 3.17 40.0 11.1
## 5 05/01/2007 29.3 3.13 43.6 12.0
## 6 06/01/2007 28.6 2.77 45.8 11.8
## # ℹ 2 more variables: EA_Adj_Close <dbl>, Paramount_Adj_Close <dbl>
library(xts)
library(dplyr)
library(zoo)
library(tseries)
library(stats)
library(forecast)
library(astsa)
library(corrplot)
library(AER)
library(vars)
library(dynlm)
library(vars)
library(TSstudio)
library(Metrics) # For RMSE calculation
library(tidyverse)
library(sarima)
library(dygraphs)
entre$Date<-as.Date(entre$Date,"%m/%d/%Y")
summary(entre$Nintendo_Adj_Close)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.12 16.94 33.66 34.87 46.76 80.52
# time series plot 1
plot(entre$Date,entre$Nintendo_Adj_Close,type="l",col="blue", lwd=2, xlab ="Date",ylab ="Adjusted Close Price", main = "Nintendo Adjusted Close Stock Price")
# time series plot 2
entrexts<-xts(entre$Nintendo_Adj_Close,order.by=entre$Date)
plot(entrexts)
entrets<-ts(entre$Nintendo_Adj_Close,frequency=52,start=c(2015,1))
entrets_decompose<-decompose(entrets)
plot(entrets_decompose)
- Do the time series data show a trend? At an initial observation, there appears to be a parabolic positive trend. In early 2015, the stock price was notably high. Over time, it declined, only to rise again. However, in recent years, there’s been a discernible decline. Although there is a trend, it’s neither linear nor consistent. From my understanding of the company, several external factors have influenced these highs and lows. The recent dip can be attributed to the Switch console becoming outdated post-pandemic with the releases of other significant consoles as PS5 and Xbox Series X, coupled with a significant decrease in annual sales. The limited release of notable games has also severely impacted the company’s performance.
- Do the time series data show seasonality? How is the change of the seasonal component over time? Certainly, there appears to be a degree of seasonality in the data. When contextualized, the seasonally decomposed graph highlights spikes in stock prices at the start, middle (likely influenced by summer), and end of the year. This aligns logically with consumer behavior for Nintendo products. Parents and guardians often purchase games during summer and winter vacations, as well as during holiday seasons. These are times when children have more leisure hours or when games are bought as gifts.
###- Detect if the time series data is stationary.
# 1) Stationary Test
adf.test(entre$Nintendo_Adj_Close)
##
## Augmented Dickey-Fuller Test
##
## data: entre$Nintendo_Adj_Close
## Dickey-Fuller = -2.1034, Lag order = 5, p-value = 0.5329
## alternative hypothesis: stationary
The result shows that the data is not stationary since we have a p-value of 0.5329 we feil to reject the null hypothesis
# 2) Serial Autocorrelation
# ACF plots: correlation between two periods in a time series is referred as autocorrelation function (acf)
acf(entre$Nintendo_Adj_Close,main="Significant Autocorrelations") # Generally, non-stationary time series data show the presence of serial autocorrelation.
> The plot reveals significant serial correlation that diminishes
gradually but is still notable even after 6 periods. Given that the data
is collected monthly, it’s reasonable to infer that the stock
performance from one month carries over to subsequent months. This
lingering effect results in a high degree of autocorrelation, especially
in the initial lags.
# Model the Time Series using ARMA(1,1)
NINTENDO_ARMA <- arima(log(entre$Nintendo_Adj_Close), order=c(1,0,1))
# Using the arima() function with d=0 to simulate an ARMA model like described in class (during examn session)
# Display the model summary for diagnostics
summary(NINTENDO_ARMA)
##
## Call:
## arima(x = log(entre$Nintendo_Adj_Close), order = c(1, 0, 1))
##
## Coefficients:
## ar1 ma1 intercept
## 0.9697 -0.0074 3.2829
## s.e. 0.0176 0.0724 0.3142
##
## sigma^2 estimated as 0.0229: log likelihood = 88.7, aic = -169.41
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.002856949 0.1513374 0.08573275 -0.3286407 2.657598 1.018259
## ACF1
## Training set 0.001404071
# Plot the diagnostics for the ARMA model
tsdiag(NINTENDO_ARMA)
# Compute the estimated stock prices
NINTENDO_estimated_stock_price <- exp(fitted(NINTENDO_ARMA))
# Plot the estimated stock prices
plot(NINTENDO_estimated_stock_price, main="Estimated Stock Prices from ARMA(1,1)")
# Diagnostic Tests
Box.test(NINTENDO_ARMA$residuals, lag=5, type="Ljung-Box")
##
## Box-Ljung test
##
## data: NINTENDO_ARMA$residuals
## X-squared = 3.3084, df = 5, p-value = 0.6526
# Check for stationarity in the estimated stock prices using Augmented Dickey-Fuller test
adf.test(na.omit(NINTENDO_estimated_stock_price), alternative="stationary")
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(NINTENDO_estimated_stock_price)
## Dickey-Fuller = -2.0914, Lag order = 5, p-value = 0.5379
## alternative hypothesis: stationary
# Log Transform the Time Series Data
nintendo_log <- log(entre$Nintendo_Adj_Close)
# Plot the log-transformed stock price
plot(entre$Date, nintendo_log,
type="l", col="blue", lwd=2,
xlab="Date", ylab="log(Stock Price)",
main="Nintendo Stock Price Log Transformation")
# Plot the first difference of the log-transformed stock price
plot(diff(nintendo_log),
type="l", ylab="First Order Difference",
main="Difference of Nintendo Stock Price Log Transformation")
# Model the Time Series using ARIMA(1,1,1)
NINTENDO_ARIMA <- Arima(nintendo_log, order=c(1,1,1))
# Display the model summary for diagnostics
summary(NINTENDO_ARIMA)
## Series: nintendo_log
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1
## -0.1150 0.0914
## s.e. 1.0303 1.0377
##
## sigma^2 = 0.02348: log likelihood = 88.28
## AIC=-170.56 AICc=-170.43 BIC=-160.8
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.006733967 0.152019 0.08374382 -0.3518187 2.592671 0.9946365
## ACF1
## Training set -0.001815814
# Plot the diagnostics for the ARIMA model
tsdiag(NINTENDO_ARIMA)
# Diagnostic Tests
# a) Check for autocorrelation in the residuals using ACF
acf(NINTENDO_ARIMA$residuals, main="ACF - ARIMA (1,1,1)")
# b) Use the Ljung-Box test to check for autocorrelation in residuals
Box.test(NINTENDO_ARIMA$residuals, lag=1, type="Ljung-Box")
##
## Box-Ljung test
##
## data: NINTENDO_ARIMA$residuals
## X-squared = 0.000643, df = 1, p-value = 0.9798
# c) Check for stationarity in the residuals using Augmented Dickey-Fuller test
adf.test(NINTENDO_ARIMA$residuals, alternative="stationary")
## Warning in adf.test(NINTENDO_ARIMA$residuals, alternative = "stationary"):
## p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: NINTENDO_ARIMA$residuals
## Dickey-Fuller = -4.951, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
and select the results that you consider might generate the best forecast.
ARMA(1,1) Model:
ARIMA(1,1,1) Model:
Selection: The ARIMA(1,1,1) model outperforms in terms of stationarity and exhibits no significant autocorrelation, making it more suitable for stock prediction. The AIC difference between the two models is minimal, so the edge in stationarity and autocorrelation makes ARIMA(1,1,1) the preferred choice for forecasting.
include a time series plot showing your forecast.
# Forecast the next 5 periods using the selected ARIMA model (in log scale)
NINTENDO_forecast_log <- forecast(NINTENDO_ARIMA, h=5)
# Convert the forecasted log values and their upper and lower bounds to the original scale
NINTENDO_forecast_values <- exp(NINTENDO_forecast_log$mean)
NINTENDO_forecast_upper <- exp(NINTENDO_forecast_log$upper)
NINTENDO_forecast_lower <- exp(NINTENDO_forecast_log$lower)
# Print the forecasted values along with their upper and lower bounds
cat("Forecasted Values for next 5 periods:\n")
## Forecasted Values for next 5 periods:
print(NINTENDO_forecast_values)
## Time Series:
## Start = 193
## End = 197
## Frequency = 1
## [1] 10.42853 10.42755 10.42766 10.42765 10.42765
cat("\nUpper Bound (95% Prediction Interval):\n")
##
## Upper Bound (95% Prediction Interval):
print(NINTENDO_forecast_upper[,2])
## Time Series:
## Start = 193
## End = 197
## Frequency = 1
## [1] 14.08137 15.86594 17.40811 18.82627 20.17302
cat("\nLower Bound (95% Prediction Interval):\n")
##
## Lower Bound (95% Prediction Interval):
print(NINTENDO_forecast_lower[,2])
## Time Series:
## Start = 193
## End = 197
## Frequency = 1
## [1] 7.723275 6.853283 6.246295 5.775753 5.390165
# Plot the historical stock prices (on original scale)
plot(entre$Nintendo_Adj_Close, type="l", col="blue", lwd=2, xlab="Time", ylab="Stock Price", main="Nintendo Stock Price with Forecast", xlim=c(1, length(entre$Nintendo_Adj_Close)+5))
# Add the prediction intervals as shaded regions
polygon(c(seq(length(entre$Nintendo_Adj_Close)+1, length(entre$Nintendo_Adj_Close)+5), rev(seq(length(entre$Nintendo_Adj_Close)+1, length(entre$Nintendo_Adj_Close)+5))),
c(NINTENDO_forecast_upper[,2], rev(NINTENDO_forecast_lower[,2])),
col=rgb(0.8,0.8,0.8), border=NA)
# Add the forecasted stock prices to the plot (after shaded region to ensure it's on top)
lines(c(rep(NA, length(entre$Nintendo_Adj_Close)), NINTENDO_forecast_values), col="red", lwd=2)
# Add legend
legend("topright", legend=c("Actual", "Forecast", "95% Prediction Interval"),
fill=c("blue", "red", rgb(0.8,0.8,0.8)))
Based on the results we can see that:
2. Stable Short-Term Outlook: The forecast suggests a relatively stable stock price for Nintendo in the immediate future, hovering around the 10.43 mark.
1. Significant Upcoming Catalysts: With the anticipated release of major game titles in 2023 and a potential new console in 2024, Nintendo has significant bullish catalysts on the horizon that could push stock prices upwards.
3.High Forecast Uncertainty: The wide gap between the prediction interval’s upper and lower bounds indicates considerable uncertainty. This underscores the importance of closely monitoring market responses to Nintendo’s releases and other related news.
library(readr)
cocac <- read_csv("cocac.csv")
head(cocac)
## # A tibble: 6 × 15
## tperiod sales_unitboxes consumer_sentiment CPI inflation_rate unemp_rate
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15-ene 5516689. 38.1 87.1 -0.09 0.0523
## 2 15-feb 5387496. 37.5 87.3 0.19 0.0531
## 3 15-mar 5886747. 38.5 87.6 0.41 0.0461
## 4 15-abr 6389182. 37.8 87.4 -0.26 0.051
## 5 15-may 6448275. 38.0 87.0 -0.5 0.0552
## 6 15-jun 6697947. 39.1 87.1 0.17 0.0507
## # ℹ 9 more variables: gdp_percapita <dbl>, itaee <dbl>, itaee_growth <dbl>,
## # pop_density <dbl>, job_density <dbl>, pop_minwage <dbl>,
## # exchange_rate <dbl>, max_temperature <dbl>, holiday_month <dbl>
str(cocac)
## spc_tbl_ [48 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ tperiod : chr [1:48] "15-ene" "15-feb" "15-mar" "15-abr" ...
## $ sales_unitboxes : num [1:48] 5516689 5387496 5886747 6389182 6448275 ...
## $ consumer_sentiment: num [1:48] 38.1 37.5 38.5 37.8 38 ...
## $ CPI : num [1:48] 87.1 87.3 87.6 87.4 87 ...
## $ inflation_rate : num [1:48] -0.09 0.19 0.41 -0.26 -0.5 0.17 0.15 0.21 0.37 0.51 ...
## $ unemp_rate : num [1:48] 0.0523 0.0531 0.0461 0.051 0.0552 0.0507 0.0542 0.0547 0.0538 0.0539 ...
## $ gdp_percapita : num [1:48] 11660 11660 11660 11626 11626 ...
## $ itaee : num [1:48] 104 104 104 108 108 ...
## $ itaee_growth : num [1:48] 0.0497 0.0497 0.0497 0.0318 0.0318 0.0318 0.0565 0.0565 0.0565 0.0056 ...
## $ pop_density : num [1:48] 98.5 98.5 98.5 98.8 98.8 ...
## $ job_density : num [1:48] 18.3 18.5 18.6 18.7 18.7 ...
## $ pop_minwage : num [1:48] 9.66 9.66 9.66 9.59 9.59 ...
## $ exchange_rate : num [1:48] 14.7 14.9 15.2 15.2 15.3 ...
## $ max_temperature : num [1:48] 28 31 29 32 34 32 29 29 29 29 ...
## $ holiday_month : num [1:48] 0 0 0 1 0 0 0 0 1 0 ...
## - attr(*, "spec")=
## .. cols(
## .. tperiod = col_character(),
## .. sales_unitboxes = col_double(),
## .. consumer_sentiment = col_double(),
## .. CPI = col_double(),
## .. inflation_rate = col_double(),
## .. unemp_rate = col_double(),
## .. gdp_percapita = col_double(),
## .. itaee = col_double(),
## .. itaee_growth = col_double(),
## .. pop_density = col_double(),
## .. job_density = col_double(),
## .. pop_minwage = col_double(),
## .. exchange_rate = col_double(),
## .. max_temperature = col_double(),
## .. holiday_month = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
values in the dataset?
# Number of observations
nobs <- nrow(cocac)
# Check for missing values
missing_values <- sum(is.na(cocac))
cat("Number of observations:", nobs, "\n")
## Number of observations: 48
cat("Number of missing values:", missing_values, "\n")
## Number of missing values: 0
of the dependent variable?
# Summary of dependent variables: Coca Cola Sales (in unit boxes)
summary_stats <- summary(cocac$sales_unitboxes)
# Descriptive Statistics
mean_value <- mean(cocac$sales_unitboxes, na.rm = TRUE)
min_value <- min(cocac$sales_unitboxes, na.rm = TRUE)
max_value <- max(cocac$sales_unitboxes, na.rm = TRUE)
# Print results
list(
Summary = summary_stats,
Mean = mean_value,
Min = min_value,
Max = max_value
)
## $Summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5301755 6171767 6461357 6473691 6819782 7963063
##
## $Mean
## [1] 6473691
##
## $Min
## [1] 5301755
##
## $Max
## [1] 7963063
The dataset contains monthly data across a 4-year span, with a total of 48 observations and no missing values. It appears to track various economic, environmental, and demographic indicators, such as consumer sentiment, inflation rate, unemployment rate, GDP per capita, and even maximum temperatures. The primary metric of interest is sales_unitboxes, which varies from about 5.3 million to almost 8 million over the examined period. On average, around 6.47 million units were sold in a given month. The lack of missing values suggests that the dataset is comprehensive and well-maintained for the period in question. The various economic indicators in the dataset suggest a potential focus on understanding how broader economic and environmental factors might influence sales.
matrix plots, scatter plots, box plots, and / or histogram plots, etc.
This plot showcases the correlation between sales_unitboxes and other numeric columns.
cor_matrix <- cor(cocac[,-1]) # Exclude the character column 'tperiod'
corrplot(cor_matrix, method = "circle")
This plot displays the distribution of the sales_unitboxes.
ggplot(cocac, aes(x=sales_unitboxes)) +
geom_histogram(aes(y=..density..), fill="blue", alpha=0.5, bins=30) +
geom_density(color="red") +
labs(title="Distribution of Sales Unitboxes", x="Sales", y="Density")
This bar plot displays the average sales for different levels of consumer sentiment.
cocac$consumer_sentiment_group <- cut(cocac$consumer_sentiment, breaks=4, labels=c("Low", "Medium-Low", "Medium-High", "High"))
ggplot(cocac, aes(x=consumer_sentiment_group, y=sales_unitboxes)) +
geom_bar(stat="summary", fun="mean", aes(fill=consumer_sentiment_group)) +
labs(title="Sales vs Consumer Sentiment", x="Consumer Sentiment", y="Average Sales")
This bar plot showcases the average sales at different CPI levels.
cocac$CPI_group <- cut(cocac$CPI, breaks=4, labels=c("Low", "Medium-Low", "Medium-High", "High"))
ggplot(cocac, aes(x=CPI_group, y=sales_unitboxes)) +
geom_bar(stat="summary", fun="mean", aes(fill=CPI_group)) +
labs(title="Sales vs CPI", x="CPI", y="Average Sales")
This will help us understand if there’s any seasonality in sales related to temperature or if temperature affects sales in some other way.
cocac$max_temp_group <- cut(cocac$max_temperature, breaks=4, labels=c("Low", "Medium-Low", "Medium-High", "High"))
ggplot(cocac, aes(x=max_temp_group, y=sales_unitboxes)) +
geom_bar(stat="summary", fun="mean", aes(fill=max_temp_group)) +
labs(title="Sales vs Max Temperature", x="Max Temperature", y="Average Sales")
Observations: The histogram for sales distribution reveals a modest right-skewed pattern, hinting at the presence of atypical data. This skewness can be attributed to the uncommonly high sales figures, perhaps due to large-scale purchases by major clients like Walmart or HEB. However, these are exceptions and not Coca Cola’s primary market segment. Instead, most sales seem to be more commonly linked to traditional local stores, or “tienditas,” and local distributors, explaining the right skew. The correlation plot uncovers several independent variables exhibiting strong negative correlations with one another, hinting at potential multicollinearity issues in future regression analyses. Notably, max temperature stands out as the sole variable with a strong positive correlation to sales, suggesting a potential rise in sales with increasing temperatures. However, the nature of this relationship might be polynomial, peaking at a certain temperature before perhaps plateauing or even declining. The remaining graphs generally conform to expected patterns, with no other surprising insights.
of the explanatory variable X on the dependent variable Y.
Hypotheses Statements: Based on our observations, for our linear regression models the dependent variable is Y, which represents sales, and the independent variables are all the other columns in the dataset. Our general hypothesis for each variable X can be formulated as:
H0 : The variable X has no impact on Y (i.e., the coefficient of X is 0).
H1 : The variable X has a significant impact on Y (the coefficient of X is not 0).
(Where must probably the variables that are going to have a more significant effect are max temperature, inflation rate and itaee)
Model 1: Using All Variables
model1 <- lm(sales_unitboxes ~ . -tperiod, data=cocac)
# Excluding 'tperiod' as it's a character variable
summary(model1)
##
## Call:
## lm(formula = sales_unitboxes ~ . - tperiod, data = cocac)
##
## Residuals:
## Min 1Q Median 3Q Max
## -418406 -211931 -24558 147931 656317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -78983801 58943596 -1.340 0.1923
## consumer_sentiment 77995 93960 0.830 0.4143
## CPI -162562 132726 -1.225 0.2321
## inflation_rate 93966 407734 0.230 0.8196
## unemp_rate -10965110 23997833 -0.457 0.6517
## gdp_percapita -2319 2078 -1.116 0.2751
## itaee -2317 124403 -0.019 0.9853
## itaee_growth 6012858 6408657 0.938 0.3571
## pop_density 1320901 898293 1.470 0.1539
## job_density -466194 606196 -0.769 0.4491
## pop_minwage 189226 207035 0.914 0.3695
## exchange_rate -13413 235506 -0.057 0.9550
## max_temperature -5148 96343 -0.053 0.9578
## holiday_month 166356 181185 0.918 0.3673
## consumer_sentiment_groupMedium-Low 212739 663090 0.321 0.7510
## consumer_sentiment_groupMedium-High 26089 797382 0.033 0.9742
## consumer_sentiment_groupHigh 173919 1436310 0.121 0.9046
## CPI_groupMedium-Low 169908 401303 0.423 0.6756
## CPI_groupMedium-High 423387 669445 0.632 0.5328
## CPI_groupHigh 312535 781102 0.400 0.6925
## max_temp_groupMedium-Low 358934 313480 1.145 0.2631
## max_temp_groupMedium-High 989243 548510 1.804 0.0834 .
## max_temp_groupHigh 1678503 756458 2.219 0.0358 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 372600 on 25 degrees of freedom
## Multiple R-squared: 0.794, Adjusted R-squared: 0.6127
## F-statistic: 4.38 on 22 and 25 DF, p-value: 0.0002801
Interpretation: The regression results indicate that many predictor variables aren’t statistically significant in explaining the variance in sales_unitboxes, as their p-values are well above the typical 0.05 threshold. Notably, the variable max_temp_groupHigh shows significance, suggesting an influential relationship between higher temperatures and sales. Given the current model’s non-significance of some predictors and my interest in exploring a polynomial relationship with max_temperature, a refined model specification is needed
Model 2: Selected Variables and Polynomial relationship
model2 <- lm(sales_unitboxes ~ max_temperature + I(max_temperature^2) + unemp_rate + itaee_growth + pop_density + job_density + holiday_month + inflation_rate, data=cocac)
summary(model2)
##
## Call:
## lm(formula = sales_unitboxes ~ max_temperature + I(max_temperature^2) +
## unemp_rate + itaee_growth + pop_density + job_density + holiday_month +
## inflation_rate, data = cocac)
##
## Residuals:
## Min 1Q Median 3Q Max
## -773165 -291017 84863 227545 733861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11023554 35491994 -0.311 0.7578
## max_temperature -966744 558838 -1.730 0.0916 .
## I(max_temperature^2) 17590 8984 1.958 0.0574 .
## unemp_rate 41050210 18637484 2.203 0.0336 *
## itaee_growth -3590732 4825519 -0.744 0.4613
## pop_density 295442 405218 0.729 0.4703
## job_density -46841 430458 -0.109 0.9139
## holiday_month 97649 151757 0.643 0.5237
## inflation_rate -43232 227906 -0.190 0.8505
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 431600 on 39 degrees of freedom
## Multiple R-squared: 0.5687, Adjusted R-squared: 0.4803
## F-statistic: 6.429 on 8 and 39 DF, p-value: 2.567e-05
and heteroscedasticity).
# Load necessary libraries
library(car)
library(lmtest)
# Model 1 Diagnostics
# Check for multicollinearity using VIF
vif_values_model1 <- vif(model1)
print(vif_values_model1)
## GVIF Df GVIF^(1/(2*Df))
## consumer_sentiment 24.946823 1 4.994680
## CPI 152.568047 1 12.351844
## inflation_rate 8.542516 1 2.922758
## unemp_rate 6.723036 1 2.592882
## gdp_percapita 92.341357 1 9.609441
## itaee 118.364451 1 10.879543
## itaee_growth 2.965726 1 1.722128
## pop_density 454.504354 1 21.319108
## job_density 194.923670 1 13.961507
## pop_minwage 14.625876 1 3.824379
## exchange_rate 48.431367 1 6.959265
## max_temperature 21.794374 1 4.668444
## holiday_month 2.128012 1 1.458771
## consumer_sentiment_group 379.473045 3 2.690689
## CPI_group 713.339485 3 2.989161
## max_temp_group 100.360730 3 2.155728
# Check for heteroscedasticity using Breusch-Pagan test
bptest_result_model1 <- bptest(model1)
print(bptest_result_model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 19.543, df = 22, p-value = 0.6116
# Plot residuals to check their distribution
hist(residuals(model1), main="Residual Distribution for Model 1", xlab="Residuals", border="blue", col="lightblue")
> Upon analyzing the diagnostics for Model 1, we observe potential
concerns. The Variance Inflation Factor (VIF) values indicate
significant multicollinearity for several variables, notably CPI,
gdp_percapita, itaee, pop_density, job_density,
consumer_sentiment_group, and CPI_group, suggesting redundancy among
these predictors. Meanwhile, the Breusch-Pagan test result (p-value =
0.6116) does not indicate heteroscedasticity. Finally, the distribution
of residuals shows a slight right skew, implying that our model might be
leaving out some influential observations or nonlinear patterns in the
data but overall a good performance for residuals.
# Model 2 Diagnostics
# Check for multicollinearity using VIF
vif_values_model2 <- vif(model2)
print(vif_values_model2)
## max_temperature I(max_temperature^2) unemp_rate
## 546.441765 551.349276 3.021783
## itaee_growth pop_density job_density
## 1.253003 68.920224 73.242896
## holiday_month inflation_rate
## 1.112488 1.988888
# Check for heteroscedasticity using Breusch-Pagan test
bptest_result_model2 <- bptest(model2)
print(bptest_result_model2)
##
## studentized Breusch-Pagan test
##
## data: model2
## BP = 4.8361, df = 8, p-value = 0.7749
# Plot residuals to check their distribution
hist(residuals(model2), main="Residual Distribution for Model 2", xlab="Residuals", border="blue", col="lightblue")
> For Model 2, the diagnostic tests indicate mixed findings. While
the VIF values for most of the predictors are within acceptable limits,
suggesting no major concerns of multicollinearity, the max_temperature
and its squared term have notably high VIF values and also the variables
realted to density. This indicates potential multicollinearity between
these two pairs of variables, which is expected as one is derived from
the other. The Breusch-Pagan test result (p-value = 0.7749) suggests
that heteroscedasticity isn’t a concern for this model. However, the
peculiar shape of the residual distribution, resembling a normal
distribution but lacking tails, hints at possible anomalies or
truncations in the data that may need further investigation.
# Assuming model1 and model2 are already defined
# For Model 1
aic_model1 <- AIC(model1)
r2_model1 <- summary(model1)$r.squared
rmse_model1 <- rmse(predict(model1), cocac$sales_unitboxes)
# For Model 2
aic_model2 <- AIC(model2)
r2_model2 <- summary(model2)$r.squared
rmse_model2 <- rmse(predict(model2), cocac$sales_unitboxes)
# Print the results
cat("Model 1 Metrics:\n")
## Model 1 Metrics:
cat("AIC:", aic_model1, "\nR-squared:", r2_model1, "\nRMSE:", rmse_model1, "\n\n")
## AIC: 1384.423
## R-squared: 0.7939877
## RMSE: 268909.4
cat("Model 2 Metrics:\n")
## Model 2 Metrics:
cat("AIC:", aic_model2, "\nR-squared:", r2_model2, "\nRMSE:", rmse_model2)
## AIC: 1391.885
## R-squared: 0.5687287
## RMSE: 389076.2
Based on the metrics provided, Model 1 appears to be the better fit. The AIC (Akaike Information Criterion) for Model 1 is lower than that for Model 2, suggesting that Model 1 provides a better balance of goodness of fit and parsimony. Furthermore, the R-squared for Model 1 is significantly higher, indicating that it explains more of the variability in the data than Model 2. Lastly, the RMSE (Root Mean Square Error) for Model 1 is also smaller, which implies that Model 1’s predictions are, on average, closer to the actual values. Given these metrics, Model 1 would be the preferred choice at this stage. However, Model 2 might be improved with some modifications, possibly bridging the gap in performance or even surpassing Model 1. But based on the current evidence in accuracy measures and diagnostic test, Model 1 is the better fit for the data.
how much is the impact of the explanatory variables X’s on the dependent variable Y.
In the context of the Model 1 regression results, the impact of explanatory variables on the dependent variable, Coca-Cola sales (measured in unit boxes), is evident. The consumer_sentiment suggests that with every unit increase in sentiment, the sales of Coca-Cola increase by 77,995 unit boxes, holding all other factors constant. This showcases the positive relation between consumer confidence and consumption behavior. Furthermore, the CPI coefficient indicates that for every unit rise in the Consumer Price Index, the sales dip by 162,562 unit boxes. Similarly, factors like unemployment rate (unemp_rate) and GDP per capita (gdp_percapita) have respective negative impacts on the sales, emphasizing the significance of economic conditions on consumer purchasing patterns.
The pop_density, though not statistically significant at conventional levels, suggests that as population density increases by one unit, sales increase by 1,320,901 unit boxes. On the other hand, the max_temperature variable doesn’t exhibit a statistically significant linear relationship with sales in this model. The categories within variables, such as consumer sentiment groups and max temperature groups, further provide nuances. Specifically, the sales appear to be significantly higher in high max temperature conditions compared to the reference category.
Positive Impact: Higher consumer sentiment is associated with increased sales. Higher population density is related to greater sales, indicating urban areas might be key markets.
Negative Impact: A rise in CPI, reflecting inflationary pressures, corresponds to decreased sales. Economic indicators such as unemployment rate and GDP per capita negatively influence sales, showcasing the importance of overall economic health.
Significant Observations: The sales are notably higher during months with higher maximum temperatures, reinforcing the belief in the seasonal nature of soft drink sales.
regressors and the dependent variable (e.g., effects plots).
# Installing and loading necessary packages
library(effects)
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
# Most relevant individual effect plots
# Individual effect plot for Consumer Sentiment
plot(effect("consumer_sentiment", model1))
# Individual effect plot for CPI
plot(effect("CPI", model1))
# Individual effect plot for Maximum Temperature
plot(effect("max_temperature", model1))
# Individual effect plot for Unemployment Rate
plot(effect("unemp_rate", model1))
# Global comparison of prediction
# Getting predicted values from the model
predicted_values <- predict(model1)
# Actual vs Predicted plot
plot(cocac$sales_unitboxes, predicted_values,
main="Actual vs Predicted Sales",
xlab="Actual Sales",
ylab="Predicted Sales")
abline(0,1, col="red") # Line of perfect prediction
> While the results from Model 1 suggest that the predictions align
relatively well with the actual values, making it a seemingly acceptable
fit for the data at hand, it’s crucial to approach it with caution. The
persistent issue of multicollinearity in the model could lead to biased
predictions. Hence, refining Model 2, which potentially offers a more
robust framework by addressing the multicollinearity concerns, would be
a more prudent route to ensure accurate and unbiased forecasting.
David Dominguez a01570975