

Exploring the Data

Introduction

For this project, I use a dataset containing time series data of the weekly sales of a Walmart store in the US. The dataset was taken from https://www.kaggle.com/vik2012kvs/walmart-dataretail-analysis.

Libraries
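
The original list of packages was not shown; the following reconstruction is an assumption based on the functions used below, and the original setup may have differed.

library(dplyr)      # data manipulation and the %>% pipe
library(ggplot2)    # plotting
library(reshape2)   # melt() for reshaping data before plotting
library(moments)    # skewness() and kurtosis()
library(tsibble)    # as_tsibble(), stretch_tsibble()
library(fable)      # ETS() and ARIMA() models
library(feasts)     # gg_tsresiduals() and related diagnostics
library(forecast)   # ndiffs() and auto.arima()
library(lmtest)     # grangertest()
library(vars)       # VARselect(), VAR(), causality()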

Importing the Data

We import the data and print the first couple of rows.

df <- read.csv("Walmart_Store_sales.csv")
# Keep only store 36 and drop the now-constant Store column,
# since we analyze a single store.
df <- df[which(df$Store==36),]
df <- df[,colnames(df)!="Store"]
head(df)
##            Date Weekly_Sales Holiday_Flag Temperature Fuel_Price      CPI
## 5006 05-02-2010     467546.7            0       45.97      2.545 209.8530
## 5007 12-02-2010     469563.7            1       46.11      2.539 209.9970
## 5008 19-02-2010     470281.0            0       45.66      2.472 210.0451
## 5009 26-02-2010     447519.4            0       50.87      2.520 210.0772
## 5010 05-03-2010     480203.4            0       51.33      2.574 210.1093
## 5011 12-03-2010     441434.2            0       61.96      2.619 210.1414
##      Unemployment
## 5006        8.554
## 5007        8.554
## 5008        8.554
## 5009        8.554
## 5010        8.554
## 5011        8.554

The dataset contains the date, the weekly sales (Weekly_Sales), whether the week contained a holiday (Holiday_Flag), the average temperature that week (Temperature), the regional fuel price (Fuel_Price), the prevailing consumer price index (CPI), and the prevailing unemployment rate (Unemployment). For this assignment, I aim to forecast weekly sales while using the other variables as predictors.

Descriptive Statistics & Missing Values

Next, I calculate descriptive statistics, the length of the series and the number of missing values.

descriptive_df <- as.data.frame(cbind(sapply(df[,2:ncol(df)], mean),
                                      sapply(df[,2:ncol(df)], sd),
                                      sapply(df[,2:ncol(df)], min),
                                      sapply(df[,2:ncol(df)], max)))
colnames(descriptive_df) <- c("Mean", "SD", "Min", "Max")
round(descriptive_df, 2)
##                   Mean       SD       Min       Max
## Weekly_Sales 373511.99 60725.17 270677.98 489372.02
## Holiday_Flag      0.07     0.26      0.00      1.00
## Temperature      71.16    12.09     41.16     87.64
## Fuel_Price        3.20     0.45      2.47      3.93
## CPI             214.73     4.32    209.12    222.11
## Unemployment      7.87     0.70      6.23      8.55
paste("The length of the time series is", nrow(df), "weeks and it ranges from", min(df$Date), "to", max(df$Date))
## [1] "The length of the time series is 143 weeks and it ranges from 01-04-2011 to 31-12-2010"
paste("There are", sum(is.na(df)), "missing values")
## [1] "There are 0 missing values"

The variables take on values of very different scales; Weekly_Sales takes on very large values while Fuel_Price takes on very low values. Also, from the low mean of Holiday_Flag we can already tell that there are very few values of 1 for that variable; in other words, there is a class imbalance. The time series is 143 weeks long. Note that the range printed above (01-04-2011 to 31-12-2010) is misleading: min() and max() were applied to the Date column while it was still a character vector, so the comparison is alphabetical rather than chronological. From the data itself, the series actually runs from 05-02-2010 to 26-10-2012. There are no missing values.

Plotting Distributions

Subsequently, we plot the distribution of the target variable Weekly_Sales and calculate its skewness and kurtosis.

df %>%
  ggplot(aes(x=Weekly_Sales)) +
  geom_histogram(bins=30) +
  theme_classic()

paste("Weekly_Sales has a skewness of", skewness(df$Weekly_Sales))
## [1] "Weekly_Sales has a skewness of 0.190745780352522"
paste("Weekly_Sales has a kurtosis of", kurtosis(df$Weekly_Sales))
## [1] "Weekly_Sales has a kurtosis of 1.79444130585377"

When just eyeballing the histogram, Weekly_Sales does not look particularly normally distributed; however, both the skewness (0.19) and kurtosis (1.79) are within acceptable bounds for assuming normality (Jones, 1969).

Jones, T. A. (1969). Skewness and kurtosis as criteria of normality in observed frequency distributions. Journal of Sedimentary Research, 39(4), 1622-1627.
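
If a formal check is preferred, and assuming the skewness and kurtosis above were computed with the moments package, its jarque.test() combines both statistics into a single normality test:

# Jarque-Bera test: the null hypothesis is skewness 0 and kurtosis 3,
# i.e. values consistent with a normal distribution.
jarque.test(df$Weekly_Sales)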

Next, we plot the distributions of the predictors.

melt_df <- melt(df[,c(1,4:6)], id.vars = "Date")
ggplot(melt_df, aes(x=value)) + 
  geom_histogram(bins = 30) +
  facet_wrap( ~ variable, scales="free") +
  theme_minimal()

df %>% ggplot(aes(x=Holiday_Flag)) + geom_bar() + theme_classic()

Temperature looks roughly bell-shaped but left-skewed. Fuel_Price and CPI seem to follow a bimodal distribution. As indicated earlier, Holiday_Flag contains a class imbalance with very few values of 1.

Detecting Outliers

We check for outliers using boxplots.

melt_df <- melt(df[,c(1:2, 4:ncol(df))], id.vars = "Date")
ggplot(melt_df, aes(x=variable, y=value)) + 
  geom_boxplot() + 
  facet_wrap( ~ variable, scales="free") +
  theme_minimal()

There seem to be no outliers in the continuous variables.

Equal Intervals & Seasonality

Next, to check whether the intervals between observations are of equal length, I calculate the differences between subsequent dates and test whether any of them differ from 7. I also print the first few values to check that the differences were calculated correctly.

df$Date <- as.Date(df$Date, format = "%d-%m-%Y")

date_diff <- vector()
# Compute the gap in days between each pair of consecutive dates
# (equivalent to as.numeric(diff(df$Date))).
for (i in 2:nrow(df)){
  date_diff <- c(date_diff, df$Date[i] - df$Date[i-1])
}

head(date_diff)
## [1] 7 7 7 7 7 7
paste("There are", sum(date_diff != 7), "differences between subsequent dates that are not 7 (days)")
## [1] "There are 0 differences between subsequent dates that are not 7 (days)"

All of the differences are 7 days. Hence, the intervals between observations are of equal length.

In terms of seasonality, because the data are weekly observations, the only seasonality I would expect is within a year and within a month. I expect seasonality within a month because there are likely consistent causes of changes in weekly sales that occur in the same week each month; for example, weekly sales may increase in the week most people receive their salary. I would also expect annual patterns, as people likely buy more around Christmas, and this is consistent over the years. Patterns within a week or a day cannot be observed with this data.

Creating ts and tsibble Objects

To conduct the analysis, the data needs to be in ts format. We set the frequency to the number of weeks in a year (365.25/7), so that a period represents a week. Furthermore, we subset all except the date column and set the start date to the sixth period of 2010. The week number of the first date in the time series is actually 5, but after inspection, setting the start date to the sixth week created the correct ts and tsibble objects.

ts <- ts(df[, 2:ncol(df)], start = c(2010, 6),  frequency = 365.25 / 7)
str(ts)
##  Time-Series [1:143, 1:6] from 2010 to 2013: 467547 469564 470281 447519 480203 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:6] "Weekly_Sales" "Holiday_Flag" "Temperature" "Fuel_Price" ...

We also create a tsibble object from the ts object and print its structure.

tsbl <- as_tsibble(ts, pivot_longer = FALSE)
str(tsbl)
## tbl_ts [143 x 7] (S3: tbl_ts/tbl_df/tbl/data.frame)
##  $ index       : week [1:143] 2010 W05, 2010 W06, 2010 W07, 2010 W08, 2010 W09, 2010 W1...
##    ..@ week_start: num 1
##  $ Weekly_Sales: num [1:143] 467547 469564 470281 447519 480203 ...
##  $ Holiday_Flag: num [1:143] 0 1 0 0 0 0 0 0 0 0 ...
##  $ Temperature : num [1:143] 46 46.1 45.7 50.9 51.3 ...
##  $ Fuel_Price  : num [1:143] 2.54 2.54 2.47 2.52 2.57 ...
##  $ CPI         : num [1:143] 210 210 210 210 210 ...
##  $ Unemployment: num [1:143] 8.55 8.55 8.55 8.55 8.55 ...
##  - attr(*, "key")= tibble [1 x 1] (S3: tbl_df/tbl/data.frame)
##   ..$ .rows: list<int> [1:1] 
##   .. ..$ : int [1:143] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..@ ptype: int(0) 
##  - attr(*, "index")= chr "index"
##   ..- attr(*, "ordered")= logi TRUE
##  - attr(*, "index2")= chr "index"
##  - attr(*, "interval")= interval [1:1] 1W
##   ..@ .regular: logi TRUE

As shown in the structure of the tsibble, the object contains the correct starting week, which is 5.

Plotting the Time Series

We print a sequence plot and autocorrelation function of the time series of the outcome variable Weekly_Sales.

autoplot(ts[,1], ylab='Weekly Sales')

acf(ts[,1])

We see that there are trends in the data. From a big-picture perspective, weekly sales are decreasing over time. Within each year, weekly sales stay relatively stable or even increase right at the end of the year and in the first quarter, followed by a steep decrease after the first quarter. Zooming in further, there may also be shorter up-and-down movements at a monthly or bi-weekly level. The autocorrelation function also suggests structure: the correlations are high and stay quite high at larger lags, and they rise and fall in a cyclic manner, again indicating trends in the time series.
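
To inspect the suspected annual pattern more directly, we could overlay the years in a seasonal plot (a sketch, assuming the feasts package is loaded):

# Each line is one year of weekly sales, making within-year patterns
# directly comparable across years.
tsbl %>%
  gg_season(Weekly_Sales, period = "year")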

Modeling

Formulating the Forecasting Horizon

Since the aim is to forecast the weekly sales of a Walmart store, mid- to long-term forecasts are likely more important than short-term forecasts. The store is probably not able to change much about its operations or offerings within a week, a month or even a few months; it might take quite a few months to react meaningfully to a forecast made in the present. The store might need to hire new personnel, find new suppliers or renovate. Walmart may even want to pursue a new market opportunity, like expanding the branch, when high sales figures are expected. Hence, a forecast horizon of a year, or approximately 52 weeks, seems suitable. The last week of the time series is week 43 of 2012. Hence, the forecasting question is: how can we expect weekly sales to change from week 43 of 2012 to week 43 of 2013?

ETS modeling

We run the ETS model using the tsibble object we created.

fit1 <- tsbl %>% model(ETS(log(Weekly_Sales)))
## Warning: Seasonal periods (`period`) of length greather than 24 are not
## supported by ETS. Seasonality will be ignored.
report(fit1)
## Series: Weekly_Sales 
## Model: ETS(M,A,N) 
## Transformation: log(Weekly_Sales) 
##   Smoothing parameters:
##     alpha = 0.1996145 
##     beta  = 0.0001000008 
## 
##   Initial states:
##      l[0]         b[0]
##  13.04425 -0.003458122
## 
##   sigma^2:  0
## 
##       AIC      AICc       BIC 
## -130.1326 -129.6946 -115.3184

We get a message saying that seasonal periods of length greater than 24 are not supported by ETS, and that seasonality is therefore ignored. I looked up the issue, and it results from the 'frequency' argument specified when creating the ts object, which is larger than 24 since there are around 52 weeks in a year. In section 12.1 of Hyndman & Athanasopoulos (2018), they mention that weekly data can be difficult to work with because most forecasting methods cannot handle a seasonal period as large as 52. They recommend other methods to deal with this, but these methods are unrelated to ETS and therefore do not answer the questions of this assignment. Hence, we accept for now that the seasonal component is ignored (even though there may be one) and interpret the results accordingly.

Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.
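
For reference, one approach they recommend for weekly data is dynamic harmonic regression: an ARIMA model with Fourier terms to capture the annual seasonality. A minimal sketch of what this could look like with fable (the number of harmonics K = 6 is an arbitrary choice here, and the model is not pursued further):

# Fourier terms model the annual cycle with 6 sine/cosine pairs;
# PDQ(0, 0, 0) switches off the (unsupported) seasonal ARIMA part.
fit_dhr <- tsbl %>%
  model(ARIMA(log(Weekly_Sales) ~ fourier(K = 6) + PDQ(0, 0, 0)))

We now return to the ETS fit and plot its components.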

autoplot(components(fit1))
## Warning: Removed 1 row(s) containing missing values (geom_path).

We get a warning that 1 row contained missing values; however, we checked for missing values before and there were none, so we assume the warning is not problematic. We can see that although the seasonal component was ignored, the model did extract a level and slope component, and the results partly match our expectations. The level mostly captures the big-picture decrease in the time series, while the slope captures the fluctuations within the year: at the end and beginning of the year weekly sales are stable or even increasing, and after the first quarter they consistently decrease.

gg_tsresiduals(fit1)

From the sequence plot of the residuals, it is hard to determine by eye whether any trend remains. The autocorrelations are much smaller than before, meaning the model captured a good deal of structure from the data; however, more than a few of them are still quite large, so some important structure was missed. The histogram of the residuals is left-skewed.
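
To back this eyeballing up with a formal check, a Ljung-Box test could be applied to the innovation residuals (a sketch using the fable workflow; the choice of 24 lags is arbitrary):

# A small p-value would indicate that significant autocorrelation
# remains in the residuals.
augment(fit1) %>%
  features(.innov, ljung_box, lag = 24)

Next, we plot the forecast.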

fit1 %>% forecast(h=52) %>% autoplot(tsbl)
## Warning: Seasonal periods (`period`) of length greather than 24 are not
## supported by ETS. Seasonality will be ignored.

From the forecast plot, we can see that the result is a linear decrease in weekly sales. The forecast does not reflect the changes in weekly sales during the year (stable at the beginning, decreasing after the first quarter), nor any shorter seasonal pattern. Because the model ignored the seasonal component, we cannot tell whether the forecast would show smaller fluctuations if that component had been included, or whether we would get the same forecast because there is no structure in the time series at that level. Perhaps the SARIMA model will shed light on this concern.

SARIMA modeling

We run the SARIMA model using the tsibble object we created.

fit2 <- tsbl %>% model(ARIMA(Weekly_Sales))
report(fit2)
## Series: Weekly_Sales 
## Model: ARIMA(1,0,1)(0,1,0)[52] w/ drift 
## 
## Coefficients:
##          ar1      ma1   constant
##       0.8661  -0.4424  -9282.569
## s.e.  0.0751   0.1248   1010.284
## 
## sigma^2 estimated as 330764004:  log likelihood=-1020.47
## AIC=2048.93   AICc=2049.4   BIC=2058.98

The result is an ARIMA(1,0,1)(0,1,0)[52] model with drift. Hence, the seasonal period is 52, which makes sense as we have weekly data. The non-seasonal AR and MA orders are both 1, the seasonal AR and MA orders are both 0, and one seasonal difference is applied. Note that the information criteria of this model cannot be compared directly with those of the ETS model: the ETS model was fitted to log-transformed sales, and information criteria are in any case not comparable across model classes. We therefore compare the models through residual diagnostics and cross-validation below. The model expression is as follows:

\[ (1 - \phi_1 B)(1 - B^{52}) y_t = c + (1 + \theta_1 B) \varepsilon_t \]

In the equation, the non-seasonal AR and MA polynomials appear together with the seasonal difference (1 - B^52) and the drift constant c; the seasonal AR and MA polynomials are absent because their orders are 0.
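
Expanding the backshift polynomials makes the implied recursion explicit:

\[ y_t = y_{t-52} + \phi_1 (y_{t-1} - y_{t-53}) + c + \varepsilon_t + \theta_1 \varepsilon_{t-1} \]

Each forecast is thus anchored at the value 52 weeks earlier, adjusted by the AR and MA terms and the (negative) drift constant.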

gg_tsresiduals(fit2)

The sequence plot of the residuals shows values very close to zero for roughly the first year of the series. This is at least partly an artifact of the seasonal difference of order 52, as the model has no seasonal information for that first year, so the near-zero residuals there should not be read as a perfect fit. Beyond that, there does not seem to be any pattern when eyeballing the sequence plot. None of the autocorrelations are high, meaning the model did very well at capturing the structure of the time series; here it also seems to outperform the ETS model. The histogram likewise shows that many of the residuals are very close to zero.

fit2 %>% forecast(h=52) %>% autoplot(tsbl)  

The forecast very closely follows the pattern of the historical data. It reflects the within-year changes, with weekly sales more stable at the beginning of the year and decreasing later on, and it also reflects shorter-term ups and downs. These are two features the ETS model was not able to produce, as it simply forecast a linear decrease. Earlier, we could not tell whether the absence of these patterns in the ETS forecast meant they were not really there; now that the SARIMA model captures them, it seems likely that the earlier absence was due to the limitations of the ETS model rather than to a lack of structure at that level.

Cross Validation

We compare the models using cross-validation. We expect the SARIMA model to capture the structure of the time series better, since its residuals showed considerably less remaining autocorrelation than those of the ETS model. We compare horizons of 1, 26 and 52 weeks.

tsbl.CV.h1 <- tsbl %>% 
              slice(-n()) %>%    
              stretch_tsibble(.init = 10) %>%   
              model(ETS(Weekly_Sales),ARIMA(Weekly_Sales)) %>%
              forecast(h = 1)

tsbl.CV.h26 <- tsbl %>% 
               slice(-n()) %>%    
               stretch_tsibble(.init = 10) %>%   
               model(ETS(Weekly_Sales),ARIMA(Weekly_Sales)) %>%
               forecast(h = 26)

tsbl.CV.h52 <- tsbl %>% 
               slice(-n()) %>%    
               stretch_tsibble(.init = 10) %>%   
               model(ETS(Weekly_Sales),ARIMA(Weekly_Sales)) %>%
               forecast(h = 52)

tsbl.CV.h1 %>% accuracy(tsbl) -> mod.sel.h1
tsbl.CV.h26 %>% accuracy(tsbl)-> mod.sel.h26
tsbl.CV.h52 %>% accuracy(tsbl) -> mod.sel.h52

mod.sel.h1[,2:9]
## # A tibble: 2 x 8
##   .type     ME   RMSE    MAE    MPE  MAPE  MASE RMSSE
##   <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Test   -785. 20283. 16631. -0.247  4.55 0.229 0.267
## 2 Test  -4256. 20477. 16608. -1.40   4.56 0.229 0.269
mod.sel.h26[,2:9]
## # A tibble: 2 x 8
##   .type      ME   RMSE    MAE   MPE  MAPE  MASE RMSSE
##   <chr>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Test   -8495. 31564. 25165. -2.06  7.12 0.346 0.415
## 2 Test  -20829. 33754. 27190. -6.08  7.81 0.374 0.444
mod.sel.h52[,2:9]
## # A tibble: 2 x 8
##   .type      ME   RMSE    MAE    MPE  MAPE  MASE RMSSE
##   <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Test  -20084. 45275. 35804.  -5.43  10.4 0.493 0.595
## 2 Test  -36374. 49451. 41123. -10.8   12.1 0.566 0.650

For the horizon of 1 week, 5 of the 7 measures indicate that the SARIMA model produces better forecasts (i.e. values closer to zero). For the horizons of 26 and 52 weeks, all 7 measures do. Hence, as expected, the results suggest that the SARIMA model produces better forecasts than the ETS model.

The ETS model ignored the seasonal component because there were too many periods within a year, which likely reduced its ability to capture the structure of the data. Hence, it is interesting to see what happens if we transform the weekly data to monthly data, thereby reducing the number of periods in a year. This should even the playing field, as the seasonal component will no longer be ignored by the ETS model, so our data no longer restricts the capabilities of the ETS model. However, by reducing the number of periods per year we can expect to see fewer patterns in the data, because patterns within a month can no longer be examined. It is hard to convert the data to exact calendar months, so instead we use four-week periods. First we create an empty vector, then we loop over the first week of each four-week period, subset the four weeks belonging to that period, and save their sum in the vector. We create a ts object and set the frequency to 12, because setting it to the actual number of four-week periods in a year (13) resulted in errors later on.

Monthly_Sales <- vector()
for (i in seq(1, 140, 4)){
  # Sum each block of four consecutive weeks. Note the parentheses in
  # i:(i+3); writing i:i+3 would evaluate as (i:i)+3, i.e. the single
  # week i+3, instead of the intended four weeks.
  Monthly_Sales <- c(Monthly_Sales, sum(df$Weekly_Sales[i:(i+3)]))
}
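
As an aside, the same aggregation can be written more idiomatically (a sketch; it relies on the first 140 weeks forming exactly 35 blocks of four):

# Fill a 4 x 35 matrix column-wise with consecutive weeks,
# then sum each column (i.e. each four-week block).
Monthly_Sales <- colSums(matrix(df$Weekly_Sales[1:140], nrow = 4))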

ts_m <- ts(Monthly_Sales, start = c(2010, 2),  frequency = 12)
tsbl_m <- as_tsibble(ts_m)

autoplot(ts_m, ylab='Monthly Sales')

acf(ts_m)

The result is a flattened version of the earlier sequence plot, with autocorrelations that are substantially lower than for the weekly data. This is likely the result of discarding the patterns that occur within the months. Next, we run the models and evaluate their outputs.

fit1 <- tsbl_m %>% model(ETS((value)))
report(fit1)
## Series: value 
## Model: ETS(M,N,N) 
## Transformation: (value) 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##      l[0]
##  446873.5
## 
##   sigma^2:  0.003
## 
##      AIC     AICc      BIC 
## 822.0437 822.8179 826.7098
autoplot(components(fit1))

gg_tsresiduals(fit1)

fit1 %>% forecast(h=12) %>% autoplot(tsbl_m)  

fit2 <- tsbl_m %>% model(ARIMA(value))
report(fit2)
## Series: value 
## Model: ARIMA(0,0,0)(0,1,0)[12] w/ drift 
## 
## Coefficients:
##         constant
##       -66302.960
## s.e.    4256.757
## 
## sigma^2 estimated as 435837418:  log likelihood=-260.89
## AIC=525.78   AICc=526.38   BIC=528.05
gg_tsresiduals(fit2)

fit2 %>% forecast(h=12) %>% autoplot(tsbl_m)

We see that the SARIMA model attains lower information criteria than the ETS model, although, as noted before, such criteria are not directly comparable across model classes. The ETS model seems to capture the data well using only a level component, while the SARIMA model was slightly better at reducing the residual autocorrelations. We also see that the ETS model forecasts monthly sales to be completely flat, without any fluctuations, whereas the SARIMA model forecasts fluctuations. Subsequently, we compare the models using cross-validation as before, with the horizons expressed in months instead of weeks (1, 6 and 12 months).

tsbl_m.CV.h1 <- tsbl_m %>% 
              slice(-n()) %>%    
              stretch_tsibble(.init = 10) %>%   
              model(ETS(value),ARIMA(value)) %>%
              forecast(h = 1)

tsbl_m.CV.h6 <- tsbl_m %>% 
               slice(-n()) %>%    
               stretch_tsibble(.init = 10) %>%   
               model(ETS(value),ARIMA(value)) %>%
               forecast(h = 6)

tsbl_m.CV.h12 <- tsbl_m %>% 
               slice(-n()) %>%    
               stretch_tsibble(.init = 10) %>%   
               model(ETS(value),ARIMA(value)) %>%
               forecast(h = 12)

tsbl_m.CV.h1 %>% accuracy(tsbl_m) -> mod.sel.h1
tsbl_m.CV.h6 %>% accuracy(tsbl_m) -> mod.sel.h6
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 5 observations are missing between 2013 Jan and 2013 May
tsbl_m.CV.h12 %>% accuracy(tsbl_m) -> mod.sel.h12
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 11 observations are missing between 2013 Jan and 2013 Nov
mod.sel.h1[,2:9]
## # A tibble: 2 x 8
##   .type     ME   RMSE    MAE   MPE  MAPE  MASE RMSSE
##   <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Test  -8763. 28053. 18708. -2.37  5.59 0.276 0.398
## 2 Test  -5426. 17990. 13252. -1.73  3.98 0.196 0.255
mod.sel.h6[,2:9]
## # A tibble: 2 x 8
##   .type      ME   RMSE    MAE   MPE  MAPE  MASE RMSSE
##   <chr>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Test  -19752. 40516. 31030. -5.77  9.47 0.459 0.575
## 2 Test  -13109. 34566. 26708. -4.46  8.37 0.395 0.490
mod.sel.h12[,2:9]
## # A tibble: 2 x 8
##   .type      ME   RMSE    MAE    MPE  MAPE  MASE RMSSE
##   <chr>   <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Test  -37083. 58730. 46220. -11.4   14.5 0.683 0.833
## 2 Test  -25150. 44074. 35094.  -8.31  11.2 0.519 0.625

All measures at all three horizons indicate that the ETS model produces better forecasts. Thus, interestingly, the transformation to (approximately) monthly data has led to the ETS model outperforming the SARIMA model. This indicates that the ETS model is limited in its ability to model weekly data, supporting the idea, also discussed in Hyndman & Athanasopoulos (2018, section 12.1), that some models have difficulty with weekly data.


Formulating a Research Question

Since we did not come across any indications of an intervention in the time series when exploring the data, we cannot use synthetic control or interrupted time series methods. Hence, we use Granger causality. I want to investigate the causal relationship between weekly sales and unemployment. Weekly sales may cause unemployment because an increase in sales may increase profits, which may then be used to hire more personnel, thereby decreasing unemployment. Unemployment may also cause weekly sales, because a decrease in unemployment means that customers are more able to buy products, thereby increasing sales. Hence, it is interesting to investigate which variable causes which. Although Walmart cannot easily intervene on weekly sales or unemployment, it would benefit from knowing this relationship, for example by adapting its inventory planning to expected changes in unemployment. The research question is therefore: what is the causal relationship between weekly sales and unemployment?

Granger Causality

Differencing

I create two separate ts objects for the two variables and plot them.

ws <- ts[,1]
unemp <- ts[,6]
plot.ts(ws, ylab = "Weekly Sales")

plot.ts(unemp, ylab = "Unemployment")

Just from eyeballing, the reverse of what we expected seems to be happening: over time, both unemployment and weekly sales are decreasing, suggesting a positive relationship between them instead of the negative one we expected. We already know from the previous part of the project that weekly sales is not stationary, and unemployment does not look stationary either. Thus, we check whether we need to apply differencing.

ndiffs(ws, alpha=0.05, test=c("kpss")) 
## [1] 1
ndiffs(unemp, alpha=0.05, test=c("kpss")) 
## [1] 1

The result indicates we should apply a first difference to both weekly sales and unemployment. Hence, this is what we do.

ws_d <- diff(ws)
unemp_d <- diff(unemp)
plot.ts(ws_d, ylab='Weekly Sales')

plot.ts(unemp_d, ylab='Unemployment')

After first differencing, both weekly sales and unemployment look approximately stationary.
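
As a quick sanity check, ndiffs should now report that no further differencing is needed:

ndiffs(ws_d, alpha = 0.05, test = "kpss")     # expected to return 0
ndiffs(unemp_d, alpha = 0.05, test = "kpss")  # expected to return 0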

Testing for Granger Causality

First, we let the auto.arima function choose the best ARIMA model to get an indication for how many lags we should use in testing for Granger causality.

auto.arima(ws_d, stepwise = FALSE, approximation = FALSE)
## Warning: The time series frequency has been rounded to support seasonal
## differencing.
## Series: ws_d 
## ARIMA(0,0,1)(0,1,0)[52] 
## 
## Coefficients:
##           ma1
##       -0.5412
## s.e.   0.0955
## 
## sigma^2 estimated as 342340999:  log likelihood=-1011.59
## AIC=2027.19   AICc=2027.33   BIC=2032.19
auto.arima(unemp_d, stepwise = FALSE, approximation = FALSE)
## Series: unemp_d 
## ARIMA(2,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2      ma3     mean
##       1.7630  -0.9798  -1.9505  1.2883  -0.1915  -0.0158
## s.e.  0.0234   0.0310   0.1065  0.2150   0.1175   0.0041
## 
## sigma^2 estimated as 0.005319:  log likelihood=172.55
## AIC=-331.11   AICc=-330.27   BIC=-310.42

For weekly sales an MA(1) model was chosen (ignoring the seasonal component), and for unemployment an ARMA(2,3) model was chosen. Hence, we will test Granger causality with a lag of 1 and a lag of 3.

We start with a lag of 1. First, we test whether weekly sales is a Granger cause of unemployment, by testing an AR(1) model for unemployment against the same model extended with lag-1 values of weekly sales as a predictor. We do this using the grangertest function.
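
Under the hood, grangertest compares two nested OLS regressions with an F-test. A minimal sketch of the lag-1 comparison it performs (u, w and n are hypothetical helper objects introduced only for illustration):

u <- as.numeric(unemp_d)
w <- as.numeric(ws_d)
n <- length(u)
restricted   <- lm(u[-1] ~ u[-n])           # unemp_d explained by its own first lag
unrestricted <- lm(u[-1] ~ u[-n] + w[-n])   # ... plus the first lag of ws_d
anova(restricted, unrestricted)             # F-test on the added lag, as in grangertest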

grangertest(unemp_d ~ ws_d, order = 1)
## Granger causality test
## 
## Model 1: unemp_d ~ Lags(unemp_d, 1:1) + Lags(ws_d, 1:1)
## Model 2: unemp_d ~ Lags(unemp_d, 1:1)
##   Res.Df Df     F  Pr(>F)  
## 1    138                   
## 2    139 -1 6.173 0.01417 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value of the Wald test is below 0.05. Thus, we reject the null hypothesis that the fits of the two models are equivalent. The results therefore indicate that weekly sales is a prima facie (Granger) cause of unemployment. Next, we perform the same analysis in the opposite direction.

grangertest(ws_d ~ unemp_d, order = 1)
## Granger causality test
## 
## Model 1: ws_d ~ Lags(ws_d, 1:1) + Lags(unemp_d, 1:1)
## Model 2: ws_d ~ Lags(ws_d, 1:1)
##   Res.Df Df      F  Pr(>F)  
## 1    138                    
## 2    139 -1 2.7816 0.09762 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the p-value of the Wald-test is above 0.05. Thus, we do not reject the null hypothesis that the fit of the models is equivalent. We conclude that both models have similar fit. Hence, the results indicate that unemployment is not a Granger cause of weekly sales. Next, we repeat both tests but with a lag of 3.

grangertest(unemp_d ~ ws_d, order = 3)
## Granger causality test
## 
## Model 1: unemp_d ~ Lags(unemp_d, 1:3) + Lags(ws_d, 1:3)
## Model 2: unemp_d ~ Lags(unemp_d, 1:3)
##   Res.Df Df      F  Pr(>F)  
## 1    132                    
## 2    135 -3 3.4784 0.01789 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see no change. As before, we conclude that the models do not have similar fit. Therefore, the results indicate that Weekly Sales is a prima facie cause of unemployment.

grangertest(ws_d ~ unemp_d, order = 3)
## Granger causality test
## 
## Model 1: ws_d ~ Lags(ws_d, 1:3) + Lags(unemp_d, 1:3)
## Model 2: ws_d ~ Lags(ws_d, 1:3)
##   Res.Df Df      F Pr(>F)
## 1    132                 
## 2    135 -3 0.1249 0.9452

We also do not see a change in these results. As before, we conclude that the models have similar fit. Therefore, the results indicate that unemployment is not a Granger cause of weekly sales.

We reached the same conclusion for both lags, so at this point our results are consistent. However, we want to make sure these results are robust. Hence, in the next section we try another implementation of Granger causality: vector autoregressive (VAR) models instead of the univariate models used so far.

Assumptions

In testing for Granger causality, if we want to rule out that the statistical dependence arises from unobserved common causes, we have to assume causal sufficiency, i.e. that there are no unobserved common causes. Furthermore, the application of Granger causality assumes that (1) the data is stationary and (2) the data can be described by a linear model.

As explained, to check robustness we use a different implementation of Granger causality based on vector autoregressive (VAR) models. Instead of comparing a univariate model with and without lagged predictors, we fit a single multivariate VAR model to both series and examine the cross-lagged relationships.

VARselect(cbind(ws_d, unemp_d))
## $selection
## AIC(n)  HQ(n)  SC(n) FPE(n) 
##      3      3      3      3 
## 
## $criteria
##                   1            2            3            4            5
## AIC(n) 1.493154e+01 1.474568e+01 1.418004e+01 1.422423e+01 1.420754e+01
## HQ(n)  1.498478e+01 1.483443e+01 1.430429e+01 1.438398e+01 1.440278e+01
## SC(n)  1.506257e+01 1.496408e+01 1.448579e+01 1.461734e+01 1.468801e+01
## FPE(n) 3.052747e+06 2.535128e+06 1.440127e+06 1.505533e+06 1.481132e+06
##                   6            7            8            9           10
## AIC(n) 1.423483e+01 1.427723e+01 1.425768e+01 1.427909e+01 1.431051e+01
## HQ(n)  1.446557e+01 1.454346e+01 1.455942e+01 1.461632e+01 1.468324e+01
## SC(n)  1.480266e+01 1.493241e+01 1.500022e+01 1.510899e+01 1.522776e+01
## FPE(n) 1.522879e+06 1.589926e+06 1.560566e+06 1.596171e+06 1.649458e+06

We see that all selection criteria suggest we should use a VAR(3) model.

var3_out <- VAR(cbind(ws_d, unemp_d), p = 3)
summary(var3_out)
## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: ws_d, unemp_d 
## Deterministic variables: const 
## Sample size: 139 
## Log Likelihood: -1372.427 
## Roots of the characteristic polynomial:
## 0.8992 0.8992 0.8179 0.2918 0.2656 0.2656
## Call:
## VAR(y = cbind(ws_d, unemp_d), p = 3)
## 
## 
## Estimation results for equation ws_d: 
## ===================================== 
## ws_d = ws_d.l1 + unemp_d.l1 + ws_d.l2 + unemp_d.l2 + ws_d.l3 + unemp_d.l3 + const 
## 
##              Estimate Std. Error t value Pr(>|t|)    
## ws_d.l1    -6.105e-01  6.618e-02  -9.225 5.98e-16 ***
## unemp_d.l1 -2.347e+03  1.852e+04  -0.127   0.8994    
## ws_d.l2    -6.332e-01  6.680e-02  -9.479  < 2e-16 ***
## unemp_d.l2 -5.355e+03  1.848e+04  -0.290   0.7724    
## ws_d.l3    -6.485e-01  6.866e-02  -9.446  < 2e-16 ***
## unemp_d.l3  9.514e+03  1.843e+04   0.516   0.6065    
## const      -3.692e+03  1.484e+03  -2.488   0.0141 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 15960 on 132 degrees of freedom
## Multiple R-Squared: 0.5602,  Adjusted R-squared: 0.5403 
## F-statistic: 28.03 on 6 and 132 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation unemp_d: 
## ======================================== 
## unemp_d = ws_d.l1 + unemp_d.l1 + ws_d.l2 + unemp_d.l2 + ws_d.l3 + unemp_d.l3 + const 
## 
##              Estimate Std. Error t value Pr(>|t|)   
## ws_d.l1     9.119e-07  3.124e-07   2.919  0.00413 **
## unemp_d.l1 -3.921e-02  8.742e-02  -0.448  0.65454   
## ws_d.l2     6.682e-07  3.153e-07   2.119  0.03595 * 
## unemp_d.l2 -4.069e-04  8.723e-02  -0.005  0.99629   
## ws_d.l3     2.616e-07  3.241e-07   0.807  0.42100   
## unemp_d.l3 -2.482e-02  8.698e-02  -0.285  0.77582   
## const      -1.544e-02  7.004e-03  -2.204  0.02923 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.07533 on 132 degrees of freedom
## Multiple R-Squared: 0.08028, Adjusted R-squared: 0.03848 
## F-statistic:  1.92 on 6 and 132 DF,  p-value: 0.08203 
## 
## 
## 
## Covariance matrix of residuals:
##               ws_d    unemp_d
## ws_d     2.547e+08 -1.144e+02
## unemp_d -1.144e+02  5.675e-03
## 
## Correlation matrix of residuals:
##             ws_d  unemp_d
## ws_d     1.00000 -0.09512
## unemp_d -0.09512  1.00000

We see that there are no significant cross-lagged effects of unemployment on weekly sales at any lag at an alpha of 0.05. Hence, the results indicate that unemployment is not a Granger cause of weekly sales. Furthermore, we see two significant cross-lagged effects of weekly sales on unemployment, specifically at lags 1 and 2. Thus, the results indicate that weekly sales is a Granger cause of unemployment. These results confirm what we found in part 2a. One more thing we may try is the Granger causality test of the vars package.

causality(var3_out, cause = "unemp_d")$Granger
## 
##  Granger causality H0: unemp_d do not Granger-cause ws_d
## 
## data:  VAR object var3_out
## F-Test = 0.1249, df1 = 3, df2 = 264, p-value = 0.9453
causality(var3_out, cause = "ws_d")$Granger
## 
##  Granger causality H0: ws_d do not Granger-cause unemp_d
## 
## data:  VAR object var3_out
## F-Test = 3.4784, df1 = 3, df2 = 264, p-value = 0.01653

Here we see that these results also support all the results we found so far, as we find that weekly sales is a Granger cause of unemployment and unemployment is not a Granger cause of weekly sales. Overall, we conclude that our results are robust.

One important thing to take into consideration is that although we found weekly sales to be a Granger cause of unemployment, the coefficients in the VAR analysis are positive, indicating a positive relationship. This contradicts our expectations: we expected a negative effect, because an increase in weekly sales may lead to higher profits, which may be used to hire more personnel, which in turn decreases unemployment. Instead, the results suggest that an increase in weekly sales leads to an increase in unemployment, which is highly counter-intuitive. A more likely explanation is that an unobserved variable not included in our analysis is confounding the relationship. For example, increased competition in the market may be driving down the weekly sales of this specific Walmart store while also decreasing unemployment, as more businesses enter the market and more work becomes available. Other potential confounders should therefore be taken into account before drawing a more substantiated conclusion.