Scenario

Corporación Favorita is a large Ecuadorian-based grocery store. Over recent years, Corporación Favorita has experienced growth as well as difficult times. On April 16th 2016, an earthquake with a magnitude of 7.8 struck Ecuador and greatly affected sales for several weeks. Corporación Favorita now wants to analyze the data from the last few years and get a better understanding of the company’s growth and where it can make improvements. In this analysis, 54 of Corporación Favorita stores were used to collect data. To get a strong understanding of how the company is performing, the total sales per day will be used to train a forecasting model. Individual stores did not get analyzed. This will be left for a future study. The type of goods sold by the company and oil-prices over the same period of time will be analyzed as well to gain more insight on the company’s progess.

Ask

Corporación Favorita desires an accurate forecast of future sales to better understand inventory needs.

Prepare

The data for this analysis can be accessed by following this link. The following data sets were used during this project. * train.csv: This will be used to train the forecasting model. This is a data set compiling each stores performance since 2013 * oil.csv: This data set tracks the oil prices over the same period of time as the previous data sets * holidays_events.csv: This data set tracks where a holiday falls and how it is categorized.

These data sets were uploaded and cleaned using SQL.

Process

SQL queries were used to create the following tables. * training_clean_v02: Table consists of each day from 01-01-2013 until 08-15-2017. For eeach date, there is a corresponding “daily_sales” which totals the amount of sales that occurred for all of the stores combined. Duplicates and null entries were removed. * holidays_events_clean: Duplicates and null entries were removed. * oil_prices_clean: Duplicates and null entries were removed.

Analyze

These data sets were then uploaded to RStudio for further analysis. To begin, a time-series graph will be created to visualize the total sales for Corporación Favorita on a given day.

# Ensure that the date column is viewed in date format
training_clean_v02$date <- as.Date(training_clean_v02$date)


# Create a time-series
traints <- ts(training_clean_v02$daily_sales,
              start = as.numeric(format(min(training_clean_v02$date), "%Y")), 
              end = as.numeric(format(max(training_clean_v02$date), "%Y")), 
              frequency = 365)

Now, a forecasting model will be created, using the training_clean_v02 and the above time-series.

# The acf function will be used to learn more about the time-series. The consistent crossing of the blue-dotted line indicates that past values of the time series are correlated with the current values.
acf(traints)

# The PACF helps us understand the direct influence of a past observation on the current observation, removing the influence of the intervening observations.
pacf(traints)

# The ADF Test is used to check for the presence of a unit root in a time series. The presence of a unit root indicates that the time series is non-stationary, and its statistical properties change over time.
adf.test(traints)
## Warning in adf.test(traints): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  traints
## Dickey-Fuller = -6.7665, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary

Share

Based on the ACF and PACF tests, it seems that there is some level of seasonality and also an Autoregressive component. The presence of both of these components suggests that an ARIMA model may be appropriate for modeling the time-series.

# Forumulate the ARIMA model
sales_model <- auto.arima(traints, ic = "aic", trace = TRUE)
## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,1,2)(1,0,1)[365] with drift         : Inf
##  ARIMA(0,1,0)             with drift         : 39021.37
##  ARIMA(1,1,0)(1,0,0)[365] with drift         : Inf
##  ARIMA(0,1,1)(0,0,1)[365] with drift         : Inf
##  ARIMA(0,1,0)                                : 39019.4
##  ARIMA(0,1,0)(1,0,0)[365] with drift         : Inf
##  ARIMA(0,1,0)(0,0,1)[365] with drift         : Inf
##  ARIMA(0,1,0)(1,0,1)[365] with drift         : Inf
##  ARIMA(1,1,0)             with drift         : 39006.64
##  ARIMA(1,1,0)(0,0,1)[365] with drift         : Inf
##  ARIMA(1,1,0)(1,0,1)[365] with drift         : Inf
##  ARIMA(2,1,0)             with drift         : 38795.13
##  ARIMA(2,1,0)(1,0,0)[365] with drift         : Inf
##  ARIMA(2,1,0)(0,0,1)[365] with drift         : Inf
##  ARIMA(2,1,0)(1,0,1)[365] with drift         : Inf
##  ARIMA(3,1,0)             with drift         : 38778.76
##  ARIMA(3,1,0)(1,0,0)[365] with drift         : Inf
##  ARIMA(3,1,0)(0,0,1)[365] with drift         : Inf
##  ARIMA(3,1,0)(1,0,1)[365] with drift         : Inf
##  ARIMA(4,1,0)             with drift         : 38675.9
##  ARIMA(4,1,0)(1,0,0)[365] with drift         : Inf
##  ARIMA(4,1,0)(0,0,1)[365] with drift         : Inf
##  ARIMA(4,1,0)(1,0,1)[365] with drift         : Inf
##  ARIMA(5,1,0)             with drift         : 38229.51
##  ARIMA(5,1,0)(1,0,0)[365] with drift         : Inf
##  ARIMA(5,1,0)(0,0,1)[365] with drift         : Inf
##  ARIMA(5,1,0)(1,0,1)[365] with drift         : Inf
##  ARIMA(5,1,1)             with drift         : 38128.21
##  ARIMA(5,1,1)(1,0,0)[365] with drift         : Inf
##  ARIMA(5,1,1)(0,0,1)[365] with drift         : Inf
##  ARIMA(5,1,1)(1,0,1)[365] with drift         : Inf
##  ARIMA(4,1,1)             with drift         : 38355.46
##  ARIMA(5,1,2)             with drift         : Inf
##  ARIMA(4,1,2)             with drift         : Inf
##  ARIMA(5,1,1)                                : 38126.53
##  ARIMA(5,1,1)(1,0,0)[365]                    : Inf
##  ARIMA(5,1,1)(0,0,1)[365]                    : Inf
##  ARIMA(5,1,1)(1,0,1)[365]                    : Inf
##  ARIMA(4,1,1)                                : 38354.01
##  ARIMA(5,1,0)                                : 38227.69
##  ARIMA(5,1,2)                                : Inf
##  ARIMA(4,1,0)                                : 38673.97
##  ARIMA(4,1,2)                                : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(5,1,1)                                : 38162.2
## 
##  Best model: ARIMA(5,1,1)
# Determine the forecast range. The training_clean_v02 data set goes up to 08-15-2017, threfore we will forecast 16 more days.
sales_forecast <- forecast(sales_model, level = c(95), h = 16)

forecast_dates <- seq(max(training_clean_v02$date) + 1, length.out = 16, by = "days")

# Forecasted values will be placed into a table as well as shown visually.
sales_forecast <- data.frame(date = forecast_dates, predicted_sales = sales_forecast$mean)

ggplot(data = training_clean_v02, aes(x = date, y = daily_sales)) + 
  geom_line() +
  labs(x = "Date", y = "Value", title = "Time Series Plot") +
  geom_line(data = sales_forecast, aes(x = date, y = predicted_sales), color = "blue")

print(sales_forecast)
##          date predicted_sales
## 1  2017-08-16       1000831.2
## 2  2017-08-17       1295193.4
## 3  2017-08-18        679329.8
## 4  2017-08-19        764395.7
## 5  2017-08-20       1039842.5
## 6  2017-08-21       1068742.7
## 7  2017-08-22        930934.4
## 8  2017-08-23       1131242.8
## 9  2017-08-24       1050183.8
## 10 2017-08-25        847320.1
## 11 2017-08-26        871795.0
## 12 2017-08-27       1010324.0
## 13 2017-08-28        970321.6
## 14 2017-08-29        996388.4
## 15 2017-08-30       1067889.6
## 16 2017-08-31       1006531.0

Now that an estimate of the total sales has been made, it is appropriate to look at the top performing categories of sales. Below are the Top 10 areas where Corporación Favorita made the most amount of sales.

# Filter the top N categories to display
top_N <- 10
top_families <- family_sales_clean %>% 
  arrange(desc(total_sales)) %>% 
  head(top_N)

# Create a custom formatting function for the y-axis labels
millions_format <- function(x) {
  x_millions <- x / 1e6
  paste0(format(x_millions, big.mark = ","), "M")
}

# Plot the horizontal bar chart using ggplot2 with formatted y-axis labels
ggplot(top_families, aes(x = reorder(family, total_sales), y = total_sales)) +
  geom_bar(stat = "identity", width = 0.6) +
  labs(x = "Product Category", y = "Total Sales", title = "Top Total Sales by Family Since 2013") +
  theme(axis.text.y = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip() +
  scale_y_continuous(labels = millions_format)

As an added layer of insight, the oil prices can be viewed over the same period of time.

ggplot(oil_prices_clean, aes(x = date, y = oil_price)) +
  geom_line() +
  labs(title = "Oil Prices Over Time", x = "Date", y = "Oil Price")

Key Findings

  • There is a 95% confidence level that the forecasted values for total sales throughout August are accurate.
print(sales_forecast)
##          date predicted_sales
## 1  2017-08-16       1000831.2
## 2  2017-08-17       1295193.4
## 3  2017-08-18        679329.8
## 4  2017-08-19        764395.7
## 5  2017-08-20       1039842.5
## 6  2017-08-21       1068742.7
## 7  2017-08-22        930934.4
## 8  2017-08-23       1131242.8
## 9  2017-08-24       1050183.8
## 10 2017-08-25        847320.1
## 11 2017-08-26        871795.0
## 12 2017-08-27       1010324.0
## 13 2017-08-28        970321.6
## 14 2017-08-29        996388.4
## 15 2017-08-30       1067889.6
## 16 2017-08-31       1006531.0
  • Beverages and Grocery I have contributed greatly to Corporación Favorita’s performance.
  • There was a breif decline in performance in 2016, however, by considering the price of oil at that time, it is likely that Corporación Favorita was not alone in this trend.