Analysis of Sales and Profits for a Retail Store in R and MS Power BI

Independent Data Analysis Project

Published

October 23, 2024

Modified

October 23, 2024

Abstract

This article has three sections: analysis of store sales, time series analysis, and the development of a dashboard in Power BI. In the store sales analysis, I test hypotheses and build a regression model. In the time series section, I forecast store sales from historical data. Finally, I create a dashboard for the retail store to inform managers of developments in the business.

Keywords

Data analysis, R, Microsoft Power BI, Descriptive Analysis, Linear Regression, Hypothesis testing, Dashboards, Time series analysis

PART A: Analyzing Store Sales

Introduction

In today’s dynamic retail landscape, understanding sales trends and profitability factors is crucial for the sustained success of any business. The retail sector is highly competitive and demands a nuanced analysis of various factors influencing sales and profitability. With this perspective, my analysis focuses on the trends and drivers of sales and profitability in a New Jersey store environment. New Jersey, known for its diverse consumer demographics and a vibrant economy, serves as an ideal setting to investigate the intricate dynamics impacting retail sales.

The analysis is especially critical with the rise of e-commerce and changing consumer preferences. These dynamics make it imperative for traditional brick-and-mortar stores to adapt and innovate. Our analysis aims to provide valuable insights into the sales and profitability of the store (R Core Team 2023).

The analysis proceeds as follows:

  1. Loading and examining the data.
  2. Exploratory data analysis.
  3. Hypothesis testing.
  4. Regression analysis.

This analysis provides practical implications for store operators aiming to enhance their financial performance in a challenging market environment.

## Load packages and options ----
if(!require(pacman)){
  install.packages('pacman')
  library(pacman)
}

p_load(tidyverse, janitor, skimr, xgboost, 
       kableExtra, stargazer, rio, mice, naniar, 
       modeltime, tidyquant, timetk, anytime, ggpubr, 
       glue, doParallel, patchwork, tidymodels,
       data.table)

p_load_gh("datarootsio/artyfarty")

theme_set(artyfarty::theme_bain())
options(digits = 2)
options(scipen = 999)

## Nice table function 
nice_table <- function(mydata, caption, full_width = FALSE){
  mydata %>% 
    kableExtra::kbl(booktabs = TRUE,
                    caption = caption) %>% 
    kableExtra::kable_classic(
      full_width = full_width,
      latex_options = "hold_position"
    )
}

## Speed
## Hasten code execution by parallel computing
all_cores <- parallel::detectCores(logical = FALSE)
cl <- makeCluster(all_cores)
registerDoParallel(cl)
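Once the heavy computations later in the document have finished, the parallel workers can be released. A minimal sketch, commented out here so that the cluster stays available for the forecasting section:

## Release the parallel workers at the end of the session ----
# parallel::stopCluster(cl)
# foreach::registerDoSEQ()  # revert to sequential execution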

Objectives

General objective

To analyze sales trends and evaluate factors affecting profitability of a New Jersey store.

Specific objectives

  1. Exploratory data analysis: Descriptive statistics on the sales and profitability of the store.

  2. Hypothesis testing: Test the hypothesis that the profit margin of the store is significantly greater than zero (One-sample t-test).

  3. Hypothesis testing: Compare the profitability of corporate consumers and ordinary consumers (Two-sample t-test).

  4. Regression model: Run a regression model to establish variables that have a strong relationship with the profitability of the store.

Data

## Read the data ----
sales_tbl <- rio::import("data/New_jersey_only.xlsx",
                         sheet = 1)

The data for this section comes from a retail store in New Jersey. The data spans the years 2011 to 2014 and has 130 rows and 23 columns. The variables in the data are as follows.

## Column names ----
names(sales_tbl)
 [1] "order_id"       "order_date"     "ship_date"      "ship_mode"     
 [5] "customer_name"  "segment"        "state"          "country"       
 [9] "market"         "region"         "product_id"     "category"      
[13] "sub_category"   "product_name"   "sales"          "quantity"      
[17] "discount"       "profit"         "shipping_cost"  "order_priority"
[21] "year"           "interval"       "order_month"   

Missing Values and Duplicates

The data has neither missing values nor duplicates. Most retail data tends to be reasonably complete due to the use of POS terminals to capture transactions. We proceed to feature engineering and variable selection for analysis and modelling (Wickham, Çetinkaya-Rundel, and Grolemund 2023).
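
Before moving on, the completeness claim can be verified; a minimal sketch using the already-loaded naniar and janitor packages:

## Check for missing values and duplicated rows ----
naniar::n_miss(sales_tbl)       # total count of missing cells (expected to be 0)
janitor::get_dupes(sales_tbl)   # returns any fully duplicated rows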

Feature Engineering

To create new features, I start by converting the order_date and ship_date variables into dates and compute the difference between them to create a new variable, interval. I then create an order_month variable that captures the month of the corresponding sale.

## Data wrangling and feature engineering ----
## calculate interval between shopping and shipping
sales_tbl <- sales_tbl %>% 
  ## Convert to dates ----
  mutate(
    order_date = anytime(order_date),
    ship_date = anytime(ship_date))
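
The interval and order_month features described above are not created in the chunk as shown; a hedged sketch of how they could be derived, assuming interval is measured in days between ordering and shipping:

## Derive the shipping interval (in days) and the month of each order ----
sales_tbl <- sales_tbl %>% 
  mutate(
    interval = as.numeric(difftime(ship_date, order_date, units = "days")),
    order_month = lubridate::month(order_date, label = TRUE, abbr = TRUE)
  )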

Analysis

I commence the analysis by employing exploratory data analysis (EDA) to shed light on the store’s sales landscape, providing a foundation to understand the financial performance of the store. Following EDA, I conduct two hypothesis tests, including both one-sample and two-sample t-tests. Finally, I execute a regression analysis to explore the factors influencing the profitability of products in the store.

Exploring the Data

Figure 1 illustrates the distribution of the relevant variables in the dataset. Notably, profits cluster slightly above zero, underscoring the narrow profit margins typical in retail businesses in the United States and globally. Selling prices exhibit a pronounced rightward skew, indicating that most retail items are sold individually at relatively low prices. While quantities sold are not as heavily skewed as selling prices, they still display a rightward skew, suggesting that a majority of consumers make purchases in small quantities. In terms of the number of transactions, binders emerge as the most popular items in the store, followed by storage equipment, while tables are the least popular.

Table 1 and Table 2 provide summary statistics. Table 1 outlines summaries for the numeric variables, revealing that the largest single sale is USD 9,100 (sales variable). Notably, the least profitable sale produces a loss of USD 31.37, while the most profitable sale yields a profit of USD 2,366. Sales has the highest standard deviation (SD), at USD 890.53, while discount has the lowest, at 0.04. Shipping costs range from USD 0.14 to a high of USD 516.90, the latter potentially for bulk items.

Table 2 offers a summary of the character variables. Because the data covers a single store in New Jersey, the state, country, market, and region variables each contain only one value. The sales encompass 3 categories and 16 sub-categories of items, with roughly 128 unique products stocked and sold. Additionally, the store targets 3 segments: consumer, home office, and corporate, yet the majority of sales are directed towards the consumer segment, as we will further explore.

## Plotting distributions of main variables ----

# Profits
profits <- sales_tbl %>% 
  ggplot(mapping = aes(x = profit)) + 
  geom_density() + 
  labs(x = "Profit",
       y = "",
       title = "Distribution of Profit Variable",
       subtitle = "Most of the Profits are Marginal")

# Sales
sales <- sales_tbl %>% 
  ggplot(mapping = aes(x = sales)) + 
  geom_density() + 
  labs(x = "Selling Price",
       y = "",
       title = "Distribution of the Sales Variable",
       subtitle = "Sales are skewed to the right indicating most products have low prices")

# Quantity
quantity <- sales_tbl %>% 
  ggplot(mapping = aes(x = quantity)) + 
  geom_histogram() + 
  labs(x = "Quantity of Sales",
       y = "",
       title = "Distribution of Quantity Sold Variable",
       subtitle = "Most products sell in low quantities")

# Category
categories <- sales_tbl %>%
  mutate(sub_category = fct_rev(fct_infreq(sub_category))) %>% 
  ggplot(mapping = aes(x = sub_category)) + 
  geom_bar() + 
  labs(x = "",
       y = "",
       title = "Sales by Product Category",
       subtitle = "Binders have the highest demand while tables have the least") + coord_flip()

## Combine the plots ----
(profits | sales) / (quantity | categories)

Distribution of Variables
## Summary for Numeric variables -----
sales_tbl %>% 
  dplyr::select(where(is.numeric), 
                -year) %>% 
  skimr::skim_without_charts() %>% 
  dplyr::select(-n_missing, 
                -skim_type,
                -complete_rate) %>% 
  rename(
         Variable = skim_variable,
         Mean = numeric.mean,
         SD = numeric.sd,
         Min = numeric.p0,
         Q1 = numeric.p25,
         Median = numeric.p50,
         Q3 = numeric.p75,
         Max = numeric.p100) %>% 
  nice_table(caption = "Summary Statistics",
             full_width = TRUE)
Summary Statistics
Variable Mean SD Min Q1 Median Q3 Max
sales 275.1 890.53 4.00 20.0 67.0 209 9100.0
quantity 3.5 1.81 1.00 2.0 3.0 5 9.0
discount 0.0 0.04 0.00 0.0 0.0 0 0.3
profit 75.2 236.39 -31.37 6.6 18.8 56 2366.0
shipping_cost 27.0 72.48 0.14 1.4 5.1 18 516.9
interval 4.4 1.58 0.00 4.0 4.0 6 7.0
## Summary for character variables -----
sales_tbl %>% 
  dplyr::select(where(is.character)) %>% 
  skimr::skim_without_charts() %>% 
  dplyr::select(-n_missing,
                -skim_type,
                -complete_rate,
                - character.empty,
                -character.whitespace) %>% 
  nice_table(caption = "Summary Statistics: Character Variables",
             full_width = TRUE)
Summary Statistics: Character Variables
skim_variable character.min character.max character.n_unique
order_id 14 14 61
ship_mode 8 14 4
customer_name 9 19 61
segment 8 11 3
state 10 10 1
country 13 13 1
market 2 2 1
region 4 4 1
product_id 15 15 129
category 9 15 3
sub_category 3 11 16
product_name 7 94 128
order_priority 3 8 4
order_month 3 3 12

We now turn to Figure 2. Panel 1 displays the trends in sales and profits for the store. Notably, there is a significant increase in sales between 2011 and 2014, accompanied by a less pronounced upward trend in profits. This observation suggests that the store may be bolstering sales by either reducing profit margins or offering higher discounts. The number of transactions fluctuates monthly, with an increase from January to a peak in December. Despite this overall rising trend, certain months, such as July and October, deviate from the pattern, yet still exhibit higher sales than January.

The trends in total sales and profits by month follow a similar pattern, as depicted in Panel 3. Notably, paper and labels boast the highest profit margins, exceeding 10%. Conversely, tables have the lowest margin, which is negative, indicating that tables are sold below cost. This may imply that the store is clearing slow-moving table stock.

# Plotting trends in pertinent variables ----

sales_year <- sales_tbl %>% 
  summarise(
    sales = sum(sales),
    profit = sum(profit),
    .by = year
  ) %>% 
  pivot_longer(
    -year,
    names_to = "variable"
  ) %>% 
  ggplot(mapping = aes(x = year, y = value,
                       color = variable)) + 
  geom_line(linewidth = 3) + 
  scale_color_manual(values = pal("pony")) + 
  theme(legend.title = element_blank(),
        legend.position = 'top') + 
  labs(title = "Trends in Sales and Profits, 2011-14",
       subtitle = "Sales are sising faster than profits")

## Number of transactions -----
trans_month <- sales_tbl %>% 
  ggplot(mapping = aes(x = order_month)) + 
  geom_bar() + 
  scale_fill_manual(values = pal("pony")) + 
  labs(x = "", y = "Transactions",
       title = "Number of Transactions by Month",
       subtitle = "Gradual rise in transactions from January, peaking in December.")

## Value of transactions by month ----
sales_month <- sales_tbl %>% 
  dplyr::summarise(
    total_sales = sum(sales),
    total_profit = sum(profit),
    .by = order_month
  ) %>% 
  pivot_longer(-order_month,
               names_to = "Var") %>% 
  ggplot(aes(x = order_month, 
             y = value, 
             fill = Var)) + 
  geom_col(position = 'dodge') + 
  scale_fill_manual(values = pal("pony")) + 
  theme(
    legend.title = element_blank(),
    legend.position = 'top'
  ) +  
  labs(x = "", y = "Sales/Profits",
       title = "Sales/ Profits by Month",
       subtitle = "Gradual rise in sales and profits from January, peaking in December\n Sales rise faster than profits.")

##################################################
## Profits by product category ----
profit_subcategory <- sales_tbl %>% 
  dplyr::summarise(
    total_sales = sum(sales),
    total_profit = sum(profit),
    .by = sub_category
  ) %>% 
  mutate(margin = total_profit / total_sales) %>% 
  mutate(sub_category = fct_reorder(sub_category, margin)) %>% 
  ggplot(mapping = aes(x = sub_category,
                       y = margin)) + 
  geom_col(show.legend = FALSE) + 
  labs(x = "", y = "Profit Margin",
       title = "Profit Margin by Product Sub-category",
       subtitle = "Paper has the highest profit margin, while tables have a negative\n margin, meaning each table is sold at below cost") + coord_flip()

#######################################################
# Combine the plots ----
(sales_year | trans_month) / (sales_month + profit_subcategory)

Graphical Summaries

Hypothesis testing

We initiate hypothesis testing concerning the store’s profitability. In the first test, we assess whether the profitability of the store significantly differs from zero. Subsequently, in the second test, we examine whether the profitability of the corporate segment significantly differs from that of the consumer segment. T-tests hinge on the assumption of data normality (Black 2023); hence, we present a graphical representation of profitability by segment to verify this assumption. Upon observation, the distribution of profit by segment is skewed, so we take the logarithm of profits when performing the t-tests; because the minimum profit is about USD -31, a constant of 32 is added before taking logs. The figure below shows the distribution of the logarithm of profits, which is closer to normal. With this confirmation, we proceed to the hypothesis testing phase (Dalgaard 2008).

#| fig-cap: "Profits by Segment"
#| fig-height: 8
#| fig-width: 6
# Profits by segment ----
sales_tbl %>% 
  ## Add 32 so all profits are positive (minimum profit is about -31) before taking logs
  ggplot(mapping = aes(x = log(profit + 32), color = segment)) + 
  geom_density(alpha = 0.5) + 
  labs(title = "Profit Distribution by Segment (Log Scale)",
       x = "Profit, Log Scale", y = "") + 
  theme(
    legend.position = "top",
    legend.title = element_blank()
  )

One-sample T-test

We seek to establish whether the profit margin of the store is significantly different from zero (One-sample t-test). The following are the hypotheses.

H0: The average profit of the store is zero.

H1: The average profit of the store is not equal to zero.

Note that this is a two-tailed test. We compute the t-statistic as follows.

\(t = \frac{\bar{x} - \mu}{s / \sqrt{n}}\)

Where \(\bar{x}\) is the sample mean, \(\mu\) is the hypothesized mean, \(s\) is the sample standard deviation, and \(n\) is the sample size.
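
To make the formula concrete, the statistic and its two-sided p-value can be computed by hand and compared with the `t.test()` output that follows; a minimal sketch:

## Compute the one-sample t statistic and p-value manually ----
x <- sales_tbl$profit
n <- length(x)
t_stat  <- (mean(x) - 0) / (sd(x) / sqrt(n))   # hypothesized mean mu = 0
p_value <- 2 * pt(-abs(t_stat), df = n - 1)    # two-sided p-value
c(t = t_stat, p = p_value)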

## One sample t-test ----
sales_tbl %>% 
  pull(profit) %>% 
  t.test(mu = 0, alternative = "two.sided")

    One Sample t-test

data:  .
t = 4, df = 129, p-value = 0.0004
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  34 116
sample estimates:
mean of x 
       75 

Subsequently, we determine the p-value associated with the t-value at \(n - 1\) degrees of freedom. If the p-value is less than or equal to the chosen significance level (5% in this instance), we reject the null hypothesis in favour of the alternative. In our analysis, the observed p-value (0.0004) is well below this threshold. Consequently, we conclude that the mean profit of the store is not equal to zero.

Two-sample T-test

In this section, we conduct a comparison of profitability between corporate consumers and ordinary consumers using a Two-sample t-test. The hypotheses are outlined as follows:

H0: The true difference in means between group Consumer and group Corporate is equal to 0.

H1: The true difference in means between group Consumer and group Corporate is NOT equal to 0.

## Two sample t-test ----
t.test(log(profit + 32) ~ segment, data = sales_tbl %>% 
         dplyr::filter(segment %in% c('Consumer', 'Corporate')) %>% 
         mutate(segment = factor(segment))
       )

    Welch Two Sample t-test

data:  log(profit + 32) by segment
t = -0.8, df = 105, p-value = 0.5
alternative hypothesis: true difference in means between group Consumer and group Corporate is not equal to 0
95 percent confidence interval:
 -0.48  0.22
sample estimates:
 mean in group Consumer mean in group Corporate 
                    4.1                     4.2 

The p-value of 0.5 means that we fail to reject the null hypothesis. Hence, there is no significant difference in average profit between the consumer and corporate segments. Store management should therefore cater to both segments equally well.

Regression analysis

In this regression analysis, I investigate the variables that exhibit a noteworthy association with profit. After some iteration, I select the following variables:

  • Sales.
  • Quantity.
  • Discount.

For each variable, the null hypothesis posits that there is no substantial relationship between the variable and profits, while the alternative hypothesis asserts that a significant relationship exists between each variable and profits (Lantz 2019).

## Linear regression model -----
lm(profit ~ sales + quantity + discount, 
   data = sales_tbl) |> 
  summary()

Call:
lm(formula = profit ~ sales + quantity + discount, data = sales_tbl)

Residuals:
   Min     1Q Median     3Q    Max 
-426.7  -10.2   -2.6    7.8  369.3 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   18.46723   12.08554    1.53               0.129    
sales          0.25798    0.00639   40.40 <0.0000000000000002 ***
quantity      -3.68389    3.16663   -1.16               0.247    
discount    -302.20553  148.08567   -2.04               0.043 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62 on 126 degrees of freedom
Multiple R-squared:  0.933, Adjusted R-squared:  0.931 
F-statistic:  583 on 3 and 126 DF,  p-value: <0.0000000000000002

The p-value for sales is essentially zero and the p-value for discount is 0.043; we therefore reject the null hypothesis for these two variables, both of which exhibit a significant relationship with profit at the 5% level. Quantity, with a p-value of 0.247, is not statistically significant. The coefficient of 0.25798 for sales indicates that, all else remaining constant, an additional dollar of sales corresponds to roughly a USD 0.26 increase in profit.

Quantity has a coefficient of -3.68, suggesting a negative association between larger sales quantities and profit, likely because higher quantities attract larger discounts; on average, each additional unit sold corresponds to a USD 3.68 reduction in profit, ceteris paribus, although this effect is not statistically significant. Discounts, in contrast, exert a substantial impact on profitability, with a unit increase in the discount rate corresponding to a USD 302.21 reduction in profit, on average, while all other factors remain constant.

The model’s overall significance is reinforced by the p-value of the F-statistic, which is nearly zero. The F-test of overall significance assesses whether the linear regression model offers a better fit to the data than a model with no independent variables. Judging by the \(R^{2}\), the model explains 93.3% of the variation in profits using the three variables.
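
The fitted coefficients can also be used to generate a prediction for a hypothetical order; a minimal sketch, in which the values of sales, quantity, and discount are illustrative rather than taken from the data:

## Predict the profit of a hypothetical order from the fitted model ----
profit_lm <- lm(profit ~ sales + quantity + discount, data = sales_tbl)
new_order <- tibble(sales = 500, quantity = 3, discount = 0.1)   # hypothetical order
predict(profit_lm, newdata = new_order, interval = "prediction")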

Conclusion

This analysis centered on the trends and drivers of sales and profitability within a New Jersey store environment. Given New Jersey’s reputation for diverse consumer demographics and economic vibrancy, it provides an ideal backdrop for examining the complex dynamics that influence retail sales. Our findings indicate that the store consistently generates significant profits. Furthermore, we observed comparable profitability between the consumer and corporate segments. Notably, our investigation revealed that sales and discounts emerge as the primary drivers of the store’s profitability, while quantity shows a weaker, statistically insignificant association.

These insights carry practical implications for the management of this specific store and other similar establishments. Understanding the key factors influencing profitability allows for informed decision-making and strategic planning, contributing to the overall success and sustainability of retail operations in comparable environments.

## Time series forecasting ------

PART B: Time Series Analysis

Background

In the dynamic realm of business, understanding and predicting future trends in sales is paramount for strategic decision-making. Time series forecasting serves as a valuable tool in this endeavor, providing a systematic approach to analyze patterns and make informed predictions based on historical data. For our analysis, we focus on the total sales data recorded by the store across quarters from 2011 to 2014 (Hyndman and Athanasopoulos 2013).

Time series forecasting involves the application of mathematical models to historical data points, unraveling patterns and trends over time. This method allows us to uncover insights into the store’s sales performance, identify recurring patterns, and project future outcomes. By delving into quarterly sales data, we aim to discern any seasonality, trends, or fluctuations that may influence the store’s financial trajectory.

Moreover, accurate sales predictions can aid the store in optimizing inventory management, resource allocation, and overall operational efficiency. Whether influenced by seasonal demand, economic factors, or other external variables, a robust time series forecasting analysis equips businesses with the foresight to navigate challenges and capitalize on opportunities.

Our goal is to contribute to the store’s strategic planning by developing a reliable forecasting model. Through this analysis, we seek to provide actionable insights that empower the store to proactively adapt to market dynamics and enhance its overall financial performance.

Data

The data comprises quarterly total sales data for a store spanning the years 2011 to 2014. Each entry in the data represents the aggregated sales figure for a specific quarter during this period. The quarterly sales data provides a snapshot of the store’s financial performance over time, allowing for the identification of trends, patterns, and potential seasonality. Key elements include the total sales amount recorded at regular intervals throughout the specified years, forming the basis for a time series analysis aimed at forecasting future sales trends. The dataset’s structure enables a detailed exploration of the store’s revenue dynamics and serves as the foundation for developing predictive models to enhance strategic decision-making.

## Import time series data ----
sales_ts_tbl <- rio::import("data/New_jersey_only.xlsx",
                         sheet = 2) %>% clean_names() %>% 
  mutate(year = str_extract(qtr_year, "\\d{4}$")) %>% 
  mutate(date = case_when(
    str_detect(qtr_year, "Qtr1") ~ "April 1",
    str_detect(qtr_year, "Qtr2") ~ "July 1",
    str_detect(qtr_year, "Qtr3") ~ "October 1",
    .default = "December 1"
  )) %>% 
  mutate(
    final_date = glue('{date}, {year}')
  ) %>% 
  dplyr::select(-year, -date, -qtr_year) %>% 
  relocate(final_date) %>% 
  mutate(
    final_date = mdy(final_date)
  ) %>% 
  mutate(the_month = lubridate::month(final_date, 
                                      label = TRUE,
                                      abbr = TRUE))

The quarterly sales data has 16 rows and 3 columns. The data starts in 2011 and ends in 2014. Let us examine the first few rows of the data.

## First few data observations ----
head(sales_ts_tbl)
  final_date sales the_month
1 2011-04-01   115       Apr
2 2011-07-01  1661       Jul
3 2011-10-01  2069       Oct
4 2011-12-01   347       Dec
5 2012-04-01   287       Apr
6 2012-07-01   345       Jul

We also look at the final 6 rows of the data to see where it terminates.

## Last few observations ----
tail(sales_ts_tbl)
   final_date sales the_month
11 2013-10-01   868       Oct
12 2013-12-01  5062       Dec
13 2014-04-01   187       Apr
14 2014-07-01  3261       Jul
15 2014-10-01  1727       Oct
16 2014-12-01  4308       Dec

Objective

The objective of this analysis is to predict the sales in the first quarter of 2015, given the historical prices.

Exploring the Data

We start by exploring the data. The most sensible starting point is to look at the trend in sales over time. Figure 4 shows the trend, which is clearly not stationary.

Definition: Stationarity

Stationarity can be defined in precise mathematical terms. Informally, we mean a flat-looking series: one with no trend, a constant variance over time, a constant autocorrelation structure over time, and no periodic fluctuations (seasonality).

There is a notable spike in sales in late 2013, indicating a period of exceptional retail activity that was not experienced in the previous years. The year 2012 also had noticeably low sales. From 2014, there appears to be a sustained uptick in sales. The purpose of this exercise is to predict the sales to expect in the first quarter of 2015. The blue line in the chart is a plot of moving averages of sales.

# Trends in sales 2011-14 ----

sales_ts_tbl %>% 
  plot_time_series(final_date, sales,
                   .x_lab = "",
                   .y_lab = "Sales, USD",
                   .title = "Trend in Sales, 2011-2014",
                   .interactive = FALSE)

Trends in Sales, 2011-14
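
As a complementary, hedged check of the non-stationarity noted above, an augmented Dickey-Fuller test could be applied to the quarterly series. This sketch assumes the tseries package, which is not loaded earlier, and with only 16 observations the test has little power:

## Augmented Dickey-Fuller test for stationarity (illustrative only) ----
p_load(tseries)
tseries::adf.test(sales_ts_tbl$sales)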

Forecasting

We start by splitting the data into a training set and a testing set. I then fit the following models to predict sales.

  • ARIMA model.

  • Boosted ARIMA model.

  • Exponential smoothing model.

  • Prophet Model.

  • Random forest model.

## Time series train-test split ----
splits <- initial_time_split(sales_ts_tbl, prop = 0.8)
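
A quick sanity check of the split sizes (a sketch; with 16 quarterly rows and prop = 0.8, roughly the last three to four quarters land in the test set):

## Inspect the sizes of the training and testing sets ----
nrow(training(splits))
nrow(testing(splits))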

ARIMA Model

Autoregressive Integrated Moving Average (ARIMA) is a statistical method commonly used for time series forecasting. Let’s break down the components:

  1. Autoregressive (AR):
  • The “auto-regressive” part refers to the model’s dependence on its own past values.
  • In simpler terms, it assumes that the future value of a variable is influenced by its own past values. If sales were high in the previous quarter, it might indicate a trend of higher sales in the next quarter.
  2. Integrated (I):
  • The “integrated” component involves differencing the time series data to make it stationary.
  • Stationarity means that the statistical properties of a time series, like mean and variance, remain constant over time. Differencing involves subtracting the current value from the previous one, helping stabilize erratic fluctuations.
  3. Moving Average (MA):
  • The “moving average” component deals with modeling the relationship between the current value and a residual error from a moving average model applied to past observations.
  • It helps capture short-term fluctuations in the data that might not be explained by the autoregressive component alone.

In essence, ARIMA combines these three elements to create a model that considers the auto-regressive nature of the data, incorporates differencing to stabilize trends, and includes a moving average to account for short-term variations. The parameters of the ARIMA model (often denoted as p, d, q) represent the order of auto-regression, differencing, and moving average, respectively. By fitting this model to historical data, ARIMA enables us to make predictions about future values in a time series.

## ARIMA Model ----
arima_ord <- arima_reg() %>% 
  set_engine(engine = "auto_arima") %>% 
  fit(sales ~ final_date, 
      data = training(splits))

Boosted ARIMA Model

A Boosted ARIMA model is an advanced version of the traditional ARIMA model that incorporates the principles of boosting. Let’s break down the components:

  1. ARIMA Base Model:
  • The foundation of a Boosted ARIMA model is the Autoregressive Integrated Moving Average (ARIMA) model, which includes autoregressive, differencing, and moving average components.
  • This base ARIMA model captures the underlying trends and patterns in the time series data.
  2. Boosting Technique:
  • Boosting is a machine learning ensemble technique that combines the predictive power of multiple models to improve overall accuracy.
  • In the context of a Boosted ARIMA model, multiple instances of the ARIMA base model are created sequentially, each focusing on correcting errors made by the previous models.
  3. Weighted Voting:
  • The predictions from each ARIMA base model are combined through a weighted voting system.
  • The weights assigned to each model depend on its accuracy in capturing patterns and making predictions. Models that perform well receive higher weights in the ensemble.
  4. Adaptive Learning:
  • Boosting is an adaptive learning technique, meaning it adjusts the weights of the models based on their performance on specific instances.
  • Models that struggle with certain patterns in the data are given more weight in subsequent iterations to improve accuracy in predicting those patterns.

In summary, a Boosted ARIMA model leverages the strength of multiple ARIMA models through boosting, creating a powerful ensemble that can better capture complex patterns and improve forecasting accuracy. This approach is particularly useful when dealing with time series data that exhibits non-linear or intricate relationships, allowing the model to adapt and enhance its predictive capabilities over time.

## Boosted ARIMA model ----
model_fit_arima_boosted <- arima_boost(
    min_n = 2,
    learn_rate = 0.015
) %>%
    set_engine(engine = "auto_arima_xgboost") %>% 
  fit(sales ~ final_date + factor(the_month), data = training(splits))

Exponential Smoothing

Exponential Smoothing is a popular time series forecasting method that emphasizes the weighted influence of recent observations on predicting future values. Here’s a breakdown of its key components:

  1. Exponential Weighting:
  • Exponential smoothing assigns exponentially decreasing weights to past observations, giving more significance to recent data points.
  • The most recent observations are considered more important, and the influence of older observations diminishes rapidly over time.
  2. Smoothing Parameter (\(\alpha\)):
  • The smoothing parameter, denoted as α (alpha), determines the weight assigned to the most recent observation.
  • A higher α gives more weight to recent observations, making the model more responsive to short-term fluctuations. A lower α emphasizes older observations and provides a smoother trend.
  3. Forecast Initialization:
  • The forecasting process begins with an initial forecast, often derived from historical data.
  • Subsequent forecasts are calculated using a combination of the most recent observation and the forecast made in the previous period.
  4. Adaptability:
  • Exponential smoothing adapts to changes in the time series over time. As new observations become available, the model updates its forecasts based on the most recent data.
  • This adaptability is beneficial for capturing evolving trends and patterns in the time series.

In summary, Exponential Smoothing is a simple yet effective method for time series forecasting. It provides flexibility in balancing the influence of recent and historical observations, making it suitable for datasets with varying levels of seasonality and trend. The smoothing parameter allows users to fine-tune the model’s responsiveness to changes in the underlying patterns of the time series data.

## Exponential smoothing model ----
smoothing <- exp_smoothing() %>% 
  set_engine(engine = "ets") %>% 
  fit(sales ~ final_date + factor(the_month), data = training(splits))

Prophet Model

Prophet, developed by Facebook, is a robust time series forecasting model designed for daily observations with multiple seasonality components. Its additive decomposition breaks down time series data into trend, seasonality, and holidays, accommodating various cycles such as daily, weekly, and yearly patterns. The model automatically detects changepoints to adapt to shifts in underlying patterns, and users can include custom seasonality components and holidays to enhance accuracy. Prophet’s unique strength lies in its ability to handle uncertainties by providing uncertainty intervals for forecasts, enabling users to gauge the reliability of predictions. With a user-friendly interface and automatic features, Prophet proves to be an effective and flexible tool for forecasting in diverse time series scenarios.

## prophet model ----
model_fit_prophet <- prophet_reg() %>%
    set_engine(engine = "prophet") %>%
    fit(sales ~ final_date + factor(the_month), data = training(splits))

Random Forest Model

The Random Forest model is an ensemble learning technique widely used for both classification and regression tasks. Its fundamental principle involves constructing a multitude of decision trees during training and outputting the average (for regression) or mode (for classification) prediction of the individual trees. Each decision tree in the forest is built using a subset of the features and a subset of the training data, introducing randomness and diversity. This diversity, achieved through a process known as bootstrapped sampling and random feature selection, helps mitigate over-fitting and enhances the model’s robustness. During prediction, the random forest aggregates the results from individual trees, providing a more accurate and stable outcome. Known for its versatility, scalability, and ability to handle complex datasets, the Random Forest model has found applications in various fields, from finance to healthcare, owing to its capacity to deliver reliable predictions and effectively capture intricate relationships within data.

## Random forest model ----
rf_model <- rand_forest() %>% 
  set_engine(engine = "ranger") %>% 
  set_mode("regression") %>% 
  fit(sales ~ final_date + factor(the_month), data = training(splits))

Models Metrics

The table below shows the model metrics. The boosted ARIMA model outperforms the other models on most metrics (MAE, MASE, SMAPE, and RMSE), although the random forest attains a slightly higher R-squared. I select the boosted ARIMA model for making predictions of sales in the next quarter.

# Boosted ARIMA Model ----
models_tbl <- modeltime_table(
  arima_ord,
  model_fit_arima_boosted,
  smoothing,
  model_fit_prophet,
  rf_model
)

# models_tbl
## Forecasting table ----
calibration_tbl <- models_tbl %>% 
  modeltime_calibrate(new_data = testing(splits))

# calibration_tbl
# Model metrics ----
#| label: metrics
calibration_tbl %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(
        .interactive = FALSE
    )
Accuracy Table
.model_id .model_desc .type mae mape mase smape rmse rsq
1 ARIMA(0,0,0) WITH NON-ZERO MEAN Test 1414 295 0.59 74 1570 NA
2 ARIMA(0,0,0) WITH NON-ZERO MEAN W/ XGBOOST ERRORS Test 1182 255 0.49 66 1394 0.57
3 ETS(M,N,N) Test 1414 255 0.59 76 1631 NA
4 PROPHET W/ REGRESSORS Test 5564 959 2.32 119 5689 0.48
5 RANGER Test 2080 519 0.87 80 2380 0.64
# Making and plotting predictions ----
calibration_tbl %>%
    modeltime_forecast(
        new_data    = testing(splits),
        actual_data = sales_ts_tbl
    ) %>%
    plot_modeltime_forecast(
      .legend_max_width = 25, # For mobile screens
      .interactive      = FALSE
    )

Predictions using Time Series Models
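
The test-set forecasts above do not yet answer the stated objective of predicting the first quarter of 2015. A hedged sketch of how that forecast could be produced with modeltime, refitting the calibrated models on the full series and projecting one additional quarter (the future month regressor is rebuilt the same way as in the data preparation step):

## Refit on the full series and forecast the next quarter (sketch) ----
future_tbl <- sales_ts_tbl %>% 
  future_frame(.date_var = final_date, .length_out = 1) %>% 
  mutate(the_month = lubridate::month(final_date, label = TRUE, abbr = TRUE))

calibration_tbl %>% 
  modeltime_refit(data = sales_ts_tbl) %>% 
  modeltime_forecast(
    new_data    = future_tbl,
    actual_data = sales_ts_tbl
  ) %>% 
  plot_modeltime_forecast(.interactive = FALSE)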

Conclusion

In conclusion, after a comprehensive analysis of various time series forecasting models, the Boosted ARIMA model emerges as the most accurate and reliable predictor for the given dataset. Leveraging the strengths of the traditional ARIMA model with the enhanced predictive capabilities offered by boosting, the Boosted ARIMA model demonstrates superior accuracy in capturing the intricate patterns and dynamics within the time series data. Its adaptability to evolving trends and ability to correct errors through iterative learning contribute to its outstanding performance. This finding underscores the effectiveness of combining the foundational principles of ARIMA with the boosting technique, highlighting the model’s potential to provide valuable insights and optimize forecasting efforts for the specific context of the dataset under examination (Deckler, Powell, and Gordon 2022).

Challenge: The data was neither granular enough (for instance, daily data) nor available over a long enough period (only 16 data points) to yield more accurate predictions. With more granular data covering a longer period, the models would likely have been more accurate.

PART C: Power BI Dashboard

Background

In the pursuit of extracting actionable insights from the provided quarterly sales data spanning the years 2011 to 2014, the deployment of a Power BI dashboard stands as a crucial tool for enhanced visualization and analysis. This dashboard aims to furnish a comprehensive overview of the store’s sales performance, elucidating trends, patterns, and potential influencing factors. Through the utilization of Power BI’s capabilities, our goal is to convert raw data into meaningful visualizations that facilitate a profound comprehension of the store’s financial landscape. The primary objective is to empower stakeholders with an intuitive platform for real-time monitoring, trend identification, and informed decision-making, thereby fostering a data-driven paradigm to optimize the store’s operations and enhance overall financial performance (Palma-Ruiz et al. 2022).

I base my dashboard on the following core components (Singh et al. 2022):

  • Overview first: The dashboard should, at a glance, offer a bird's-eye view of the business.
  • Drill-down capability: Users of the dashboard should be able to drill down to the level of detail they desire.
  • Details on demand: Where users require detailed analysis, the dashboard should allow them to obtain it.

MS PowerBI Dashboard Components

The dashboard captures the following components.

  • Total sales.

Total sales are an important driver of a firm's profitability, as we saw earlier in the regression analysis.

  • Total profits.

Profits are the central determinants of the survival of any business. Managers would be interested in profitability as it determines whether the business is a going concern.

  • Average discount.

We saw from the regression analysis that discounts, while good at attracting customers, are detrimental to profits.

  • Trends in sales.

To see the bigger picture of trends in the business, it is critical to examine the trend in the key driver of any business: the ability to make sales.

  • Profit by segment.

Managers take interest in the segments of the business that are the most profitable. This knowledge will enable them to make decisions, for instance, by giving incentives to customers to buy more goods.

  • Sales by Category and sub-category.

Sales by category and sub-category give managers a bird's-eye view of the fastest- and slowest-moving goods. This has implications for management in terms of inventory planning.

Output

# PowerBI Dashboard template ----
knitr::include_graphics("Dashboard_17_12-2023_page-0001.jpg")

PowerBI Dashboard

Conclusion

In this section I have created a dashboard that helps managers maintain an overview of the business. I have based the dashboard on the following core principles of data visualization for dashboards:

  • Overview first.
  • Drill down capability.
  • Details on demand.

The dashboard rests on the core of any business: sales and profits. I segmented the sales and profits using categories of products and regions.

References

Black, Ken. 2023. Business Statistics: For Contemporary Decision Making. John Wiley & Sons.
Dalgaard, Peter. 2008. Introductory Statistics with R. Springer.
Deckler, Greg, Brett Powell, and Leon Gordon. 2022. Mastering Microsoft Power BI: Expert Techniques to Create Interactive Insights for Effective Data Analytics and Business Intelligence. Packt Publishing Ltd.
Hyndman, Rob J, and George Athanasopoulos. 2013. Forecasting: Principles and Practice, 3rd Edition. OTexts. https://api.semanticscholar.org/CorpusID:215822294.
Lantz, Brett. 2019. Machine Learning with R: Expert Techniques for Predictive Modeling. Packt Publishing Ltd.
Palma-Ruiz, Jesús Manuel, Angel Torres-Toukoumidis, Sonia Esther González-Moreno, and Herik Germán Valles-Baca. 2022. “An Overview of the Gaming Industry Across Nations: Using Analytics with Power BI to Forecast and Identify Key Influencers.” Heliyon.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Singh, Gurpreet, Jaspreet Singh, Chander Prabha, et al. 2022. “Data Visualization and Its Key Fundamentals: A Comprehensive Survey.” In 2022 7th International Conference on Communication and Electronics Systems (ICCES), 1710–14. IEEE.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media.

Appendices

Appendix 1: Task Requirements

The aim of this project is to provide you with an opportunity to apply various BI and data analytics techniques we’ve explored during the course. The project allows you to delve deep into a topic of your choice and to showcase your analytical skills. You should find data file(s) from online platforms like Kaggle. You will do detailed data analysis, visualization, statistical tests, and a well-articulated conclusion. You can select your data file relevant to the following domains:

  • Learning/Behavioural Analytics.
  • Statistics/Data Analytics.
  • Finance/Stock Investments.
  • Digital Marketing or Advertising.
  • Logistics/Operations Management

Evaluation Criteria: Your report will be evaluated based on:

  • Choice of topic and justification (10%)
  • Understanding of the data including data description and preparation (10%).
  • Depth of analysis, interpretation, and findings (60%).
  • Overall presentation, including graphics and visuals (20%)

Requirements:

  • Descriptive Analysis: Utilize techniques like boxplots, tables, cross tabulation, and scatter charts. Provide interpretations for each analysis.

  • Statistical Tests: Conduct one 1-sample and one 2-sample t-test. Formulate appropriate hypotheses and draw conclusions based on p-values.

  • Linear Regression: Carry out linear regression with at least two independent variables (preferably one numerical and one categorical). Interpret r-squared values, coefficients, p-values, and make a prediction using the coefficients.

  • Time Series Analysis: Forecast using at least two methods. Compare the methods using residual squares.

  • Dashboard Creation: Develop a dashboard using either pivot tables/pivot charts or Power BI. Each graph/chart should have a clear purpose and you should be able to explain the usefulness of each.

Submission:

Your submission should include:

  • A ‘detailed report’ (MAX 1500 words) describing your data file, screenshots of your work from excel and/or power BI, interpretations, any challenges you faced and any business recommendations.

  • Raw Data file(s) used for the project.

  • Final excel/Power BI file(s) containing all your work, with any accompanying instructions, descriptions or interpretations.