Dataset Description and Business Relevance

Dataset: Sample Sales Data
Source: C:/Users/Muhammad.shaamel/Downloads/advdataanalytics/data/sales_data_sample.csv
Size: ~2,823 rows × 25 columns
Time Period: 2003–2005

This dataset contains retail sales transaction records including order details, customer information, product lines, and geographic data. It is useful for analyzing sales patterns, customer behavior, seasonal trends, and market performance by region and product category.

Variable Types

Load Libraries

library(tidyverse)
library(knitr)
library(lubridate)

Load & Prepare Data

# Read CSV (absolute path as provided)
sales_data <- read.csv("C:/Users/Muhammad.shaamel/Downloads/advdataanalytics/data/sales_data_sample.csv")

# ORDERDATE in this dataset can appear with or without time; normalize safely
# Strip time part if present and parse as mdy
sales_data$ORDERDATE <- lubridate::mdy(gsub(" .*", "", as.character(sales_data$ORDERDATE)))

# Derived time variables (recompute even if present)
sales_data$QTR_ID   <- lubridate::quarter(sales_data$ORDERDATE)
sales_data$MONTH_ID <- lubridate::month(sales_data$ORDERDATE)
sales_data$YEAR_ID  <- lubridate::year(sales_data$ORDERDATE)

cat("Dataset loaded:", nrow(sales_data), "rows,", ncol(sales_data), "columns")
## Dataset loaded: 2823 rows, 25 columns

Descriptive Analytics

Missing Values Analysis

missing_counts <- colSums(is.na(sales_data))
missing_data <- data.frame(
  Variable = names(missing_counts),
  Missing_Count = as.integer(missing_counts),
  Missing_Percent = round((missing_counts / nrow(sales_data)) * 100, 1)
) %>% 
  dplyr::filter(Missing_Count > 0) %>%
  dplyr::arrange(desc(Missing_Percent))

if (nrow(missing_data) == 0) {
  cat("No missing values detected.")
} else {
  kable(missing_data, caption = "Missing Values Summary")
}
Missing Values Summary
Variable Missing_Count Missing_Percent
TERRITORY 1074 38.0

Summary Statistics

key_vars <- sales_data[, c("QUANTITYORDERED", "PRICEEACH", "SALES", "MSRP")]
summary_stats <- do.call(cbind, lapply(key_vars, function(x) {
  c(Mean = round(mean(x, na.rm = TRUE), 2),
    Median = round(median(x, na.rm = TRUE), 2),
    SD = round(sd(x, na.rm = TRUE), 2),
    Min = round(min(x, na.rm = TRUE), 2),
    Max = round(max(x, na.rm = TRUE), 2))
}))
kable(summary_stats, caption = "Summary Statistics for Key Variables")
Summary Statistics for Key Variables
QUANTITYORDERED PRICEEACH SALES MSRP
Mean 35.09 83.66 3553.89 100.72
Median 35.00 95.70 3184.80 99.00
SD 9.74 20.17 1841.87 40.19
Min 6.00 26.88 482.13 33.00
Max 97.00 100.00 14082.80 214.00
# Business metrics
shipped_rate <- round(sum(sales_data$STATUS == "Shipped", na.rm = TRUE) / nrow(sales_data) * 100, 1)
cat("Total Revenue: $", format(sum(sales_data$SALES, na.rm = TRUE), big.mark = ","), "\n")
## Total Revenue: $ 10,032,629
cat("Average Order Value: $", round(mean(sales_data$SALES, na.rm = TRUE), 2), "\n")
## Average Order Value: $ 3553.89
cat("Total Orders:", nrow(sales_data), "\n")
## Total Orders: 2823
cat("Shipping Success Rate:", shipped_rate, "%\n")
## Shipping Success Rate: 92.7 %

Distribution Analysis

# Sales distribution
hist(sales_data$SALES, breaks = 30, main = "Sales Amount Distribution",
     xlab = "Sales ($)", col = "lightblue")

# Product line performance
product_perf <- sales_data %>%
  group_by(PRODUCTLINE) %>%
  summarise(
    Orders = n(),
    Total_Revenue = sum(SALES, na.rm = TRUE),
    Avg_Revenue = round(mean(SALES, na.rm = TRUE), 2),
    .groups = "drop"
  ) %>%
  arrange(desc(Total_Revenue))

kable(head(product_perf), caption = "Top Product Lines by Revenue")
Top Product Lines by Revenue
PRODUCTLINE Orders Total_Revenue Avg_Revenue
Classic Cars 967 3919615.7 4053.38
Vintage Cars 607 1903150.8 3135.34
Motorcycles 331 1166388.3 3523.83
Trucks and Buses 301 1127789.8 3746.81
Planes 306 975003.6 3186.29
Ships 234 714437.1 3053.15
# Order status distribution
status_dist <- table(sales_data$STATUS)
kable(data.frame(Status = names(status_dist), Count = as.numeric(status_dist)),
      caption = "Order Status Distribution")
Order Status Distribution
Status Count
Cancelled 60
Disputed 14
In Process 41
On Hold 44
Resolved 47
Shipped 2617

Time Series Analysis

# Monthly sales trend
monthly_sales <- sales_data %>%
  mutate(YearMonth = format(ORDERDATE, "%Y-%m")) %>%
  group_by(YearMonth) %>%
  summarise(Monthly_Revenue = sum(SALES, na.rm = TRUE), .groups = 'drop') %>%
  arrange(YearMonth)

kable(head(monthly_sales, 12), caption = "Sample Monthly Sales",
      format.args = list(big.mark = ","))
Sample Monthly Sales
YearMonth Monthly_Revenue
2003-01 129,753.6
2003-02 140,836.2
2003-03 174,504.9
2003-04 201,609.5
2003-05 192,673.1
2003-06 168,082.6
2003-07 187,731.9
2003-08 197,809.3
2003-09 263,973.4
2003-10 568,291.0
2003-11 1,029,837.7
2003-12 261,876.5
# Yearly comparison
yearly_sales <- sales_data %>%
  group_by(YEAR_ID) %>%
  summarise(
    Total_Sales = sum(SALES, na.rm = TRUE),
    Order_Count = n(),
    Avg_Order_Value = round(mean(SALES, na.rm = TRUE), 2),
    .groups = 'drop'
  ) %>%
  arrange(YEAR_ID) %>%
  mutate(YEAR_ID = as.character(YEAR_ID))  # keep the year out of big.mark formatting

kable(yearly_sales, caption = "Year-over-Year Performance",
      format.args = list(big.mark = ","))
Year-over-Year Performance
YEAR_ID Total_Sales Order_Count Avg_Order_Value
2003 3,516,980 1,000 3,516.98
2004 4,724,163 1,345 3,512.39
2005 1,791,487 478 3,747.88

Metrics for Inline Insights

# Correlation and leader (computed here so inline R has them ready)
corr_q_sales <- suppressWarnings(
  cor(sales_data$QUANTITYORDERED, sales_data$SALES, use = "complete.obs")
)
shipped_rate <- round(sum(sales_data$STATUS == "Shipped", na.rm = TRUE) / nrow(sales_data) * 100, 1)
product_leader <- if (nrow(product_perf) > 0) product_perf$PRODUCTLINE[1] else NA

Key Insights

Data Quality: Core business fields are generally complete; the main gap is the optional TERRITORY field (38% missing; see the Missing Values Summary).

Sales Performance:
- Total revenue of $10,032,629 across 2,823 order lines
- Average order value $3553.89
- 92.7% of orders successfully shipped

Business Patterns:
- Classic Cars currently leads total revenue among product lines
- Sales distribution is right-skewed (typical for order values)
- Seasonal effects visible in quarterly/monthly aggregations

Correlation: Quantity ordered vs sales amount shows a positive relationship (r = 0.55), as expected.
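
As a quick add-on (not part of the original chunks), cor.test checks that this correlation is statistically distinguishable from zero:

# Hypothetical add-on: two-sided Pearson test of the quantity–sales correlation
ct <- cor.test(sales_data$QUANTITYORDERED, sales_data$SALES)
round(ct$estimate, 2)   # should match corr_q_sales (~0.55)
ct$p.value              # with n = 2,823 this will be very small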

Predictive Business Objective

Proposed Model: Monthly Sales Revenue Forecasting
Business Problem: Forecast monthly sales 3–6 months ahead to improve inventory planning, cash-flow management, and staffing.

Approach: Use historical trends, seasonal indicators, product-mix signals, and regional factors to predict revenue.

Expected Benefits:
- Reduce inventory costs via better demand prediction
- Improve cash-flow planning accuracy
- Enable proactive marketing during predicted slow periods
- Support realistic target setting

Target Variable: Monthly total sales revenue
Key Features: Lagged revenue (sketched below), month/quarter seasonality, product line mix, geography
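
To make the lagged-revenue feature concrete, here is a minimal dplyr sketch built on the monthly_sales table from the time-series section; the column names rev_lag1, rev_lag12, and month_num are mine, purely illustrative:

# Sketch: previous-month and same-month-last-year revenue as model features
monthly_feats <- monthly_sales %>%
  arrange(YearMonth) %>%
  mutate(
    rev_lag1  = dplyr::lag(Monthly_Revenue, 1),      # last month's revenue
    rev_lag12 = dplyr::lag(Monthly_Revenue, 12),     # same month, previous year
    month_num = as.integer(substr(YearMonth, 6, 7))  # "2003-01" -> 1, for seasonality
  )
head(monthly_feats)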

Reference

Kyanyoga (2016). Sample Sales Data. Kaggle. https://www.kaggle.com/datasets/kyanyoga/sample-sales-data

Question 1(b): SQL on the dataset using sqldf

# Ensure sqldf is available
if (!requireNamespace("sqldf", quietly = TRUE)) install.packages("sqldf")
library(sqldf)

Query 1 — Which product lines generated the highest revenue in 2004 from orders that shipped?

Why this matters:
Identifying revenue-leading product lines in a specific year and fulfillment state (“Shipped”) helps merchandise and finance teams focus inventory, promotions, and supplier negotiations where it pays off.

SQL query

SELECT
  PRODUCTLINE,
  COUNT(*)              AS order_lines,
  SUM(SALES)            AS total_revenue,
  AVG(SALES)            AS avg_line_revenue
FROM sales_data
WHERE YEAR_ID = 2004
  AND STATUS = 'Shipped'            -- filter (fulfillment)
GROUP BY PRODUCTLINE                 -- grouping
HAVING COUNT(*) >= 10                -- guardrail for tiny groups
ORDER BY total_revenue DESC          -- sorting (by revenue)
LIMIT 10;

Results

sql1 <- "
SELECT
  PRODUCTLINE,
  COUNT(*)              AS order_lines,
  SUM(SALES)            AS total_revenue,
  AVG(SALES)            AS avg_line_revenue
FROM sales_data
WHERE YEAR_ID = 2004
  AND STATUS = 'Shipped'
GROUP BY PRODUCTLINE
HAVING COUNT(*) >= 10
ORDER BY total_revenue DESC
LIMIT 10;
"
res1 <- sqldf(sql1)
knitr::kable(res1, caption = "Top Product Lines by Revenue (Shipped, 2004)",
             format.args = list(big.mark = ",", scientific = FALSE))
Top Product Lines by Revenue (Shipped, 2004)
PRODUCTLINE order_lines total_revenue avg_line_revenue
Classic Cars 425 1,702,871.5 4,006.757
Vintage Cars 275 883,004.4 3,210.925
Motorcycles 164 560,545.2 3,417.959
Trucks and Buses 138 509,109.6 3,689.200
Planes 150 468,552.2 3,123.682
Ships 99 292,522.8 2,954.775
Trains 36 111,441.4 3,095.595

My takeaways

  • Leading product line: Classic Cars with total revenue $1,702,872 across 425 order lines.
  • Mean revenue per line for that leader ≈ $4,006.76 — useful for margin targeting and pricing tests.
  • Use this table to prioritize inventory buys and vendor co-marketing ahead of the next comparable peak season.

Query 2 — Among USA customers in 2005 with shipped orders, who are the top customers by average order-line revenue (min. 3 lines)?

Why this matters:
Finding high-value customers (by spend per line) in a key market guides account-based marketing, loyalty perks, and tailored service levels.

SQL query

SELECT
  CUSTOMERNAME,
  COUNT(*)              AS order_lines,
  SUM(SALES)            AS total_revenue,
  AVG(SALES)            AS avg_line_revenue
FROM sales_data
WHERE COUNTRY = 'USA'                 -- filter (market)
  AND YEAR_ID = 2005                  -- filter (time)
  AND STATUS = 'Shipped'              -- filter (fulfillment)
GROUP BY CUSTOMERNAME                 -- grouping (by customer)
HAVING COUNT(*) >= 3                  -- ensure meaningful activity
ORDER BY avg_line_revenue DESC,       -- sorting (primary by avg ticket)
         total_revenue DESC           -- tie-breaker by total
LIMIT 10;

Results

sql2 <- "
SELECT
  CUSTOMERNAME,
  COUNT(*)              AS order_lines,
  SUM(SALES)            AS total_revenue,
  AVG(SALES)            AS avg_line_revenue
FROM sales_data
WHERE COUNTRY = 'USA'
  AND YEAR_ID = 2005
  AND STATUS = 'Shipped'
GROUP BY CUSTOMERNAME
HAVING COUNT(*) >= 3
ORDER BY avg_line_revenue DESC, total_revenue DESC
LIMIT 10;
"
res2 <- sqldf(sql2)
knitr::kable(res2, caption = "Top USA Customers by Avg Order-Line Revenue (Shipped, 2005)",
             format.args = list(big.mark = ",", scientific = FALSE))
Top USA Customers by Avg Order-Line Revenue (Shipped, 2005)
CUSTOMERNAME order_lines total_revenue avg_line_revenue
Gift Depot Inc. 6 31,648.47 5,274.745
FunGiftIdeas.com 8 37,557.66 4,694.708
The Sharp Gifts Warehouse 9 37,526.84 4,169.649
Corporate Gift Ideas Co. 13 54,203.62 4,169.509
Collectables For Less Inc. 8 31,474.78 3,934.347
Mini Gifts Distributors Ltd. 56 205,993.93 3,678.463
Mini Creations Ltd. 3 11,021.30 3,673.767
Technics Stores Inc. 4 13,529.57 3,382.392

Interpretation

  • Highest avg order-line revenue customer: Gift Depot Inc. at $5,274.74 per line across 6 lines (total $31,648.47).
  • Strong candidates for ABM, early access to limited stock, and tailored cross-sell bundles.
  • Replicate this cut for other countries/years to build a geo-customer heat map for sales ops.

Notes

  • Results operate at the order-line level (each row ≈ an item line on an order). If you must analyze at order-header level, aggregate line items to unique ORDERNUMBER first (see the sketch after these notes).
  • You can easily adapt filters (e.g., DEALSIZE = 'Large', PRODUCTLINE IN (...), QTR_ID IN (1,2)), and change sorting priorities (ORDER BY total_revenue DESC) to answer adjacent business questions.
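
A minimal dplyr sketch of that order-header roll-up (the order_level name is mine):

# Roll line items up to one row per order before order-level analysis
order_level <- sales_data %>%
  group_by(ORDERNUMBER) %>%
  summarise(order_revenue = sum(SALES, na.rm = TRUE),
            line_count    = n(),
            .groups = "drop")
nrow(order_level)  # number of unique orders (fewer than the 2,823 lines)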

Question 1(c): Two simple business opinions → hypotheses → tests

Below are two opinions I formed after looking at the data. For each, I state a clear hypothesis, use a basic statistical test, show the results, and give a short, non-technical interpretation.


Opinion 1

Question (business): Do orders placed in Q4 (holiday period) have higher average sales per line than orders in Q1–Q3?
Why this matters: If Q4 really runs higher, we can plan inventory, staffing, and promos earlier.

Hypotheses
- H0: Average SALES in Q4 = Average SALES in Q1–Q3
- H1: Average SALES in Q4 > Average SALES in Q1–Q3

Test chosen: Two-sample t-test (Welch). It compares the means of two groups.

library(dplyr)

# Keep needed columns and drop NAs
df <- sales_data %>%
  select(SALES, QTR_ID) %>%
  filter(!is.na(SALES), !is.na(QTR_ID))

# Define two groups: Q4 vs non-Q4
df <- df %>%
  mutate(group = ifelse(QTR_ID == 4, "Q4", "Q1toQ3"))

# Quick look at group means
op1_means <- df %>%
  group_by(group) %>%
  summarise(n = n(),
            mean_sales = mean(SALES),
            sd_sales = sd(SALES),
            .groups = "drop")

knitr::kable(op1_means, digits = 2, caption = "Average line revenue: Q4 vs Q1–Q3")
Average line revenue: Q4 vs Q1–Q3
group n mean_sales sd_sales
Q1toQ3 1729 3561.51 1879.12
Q4 1094 3541.85 1782.18
# One-sided Welch t-test of H1 (Q4 > Q1–Q3). With group levels ordered
# alphabetically as (Q1toQ3, Q4), alternative = "less" tests
# mean(Q1toQ3) < mean(Q4), i.e., Q4 higher.
t_res <- t.test(SALES ~ group, data = df, alternative = "less")

# Save key numbers for a plain summary
mean_q4     <- round(op1_means$mean_sales[op1_means$group=="Q4"], 2)
mean_nonq4  <- round(op1_means$mean_sales[op1_means$group=="Q1toQ3"], 2)
pval1       <- signif(t_res$p.value, 3)

Result:
- Average per line — Q4: $3,541.85 vs Q1–Q3: $3,561.51
- One-sided t-test p-value: ≈ 0.61, so no evidence that Q4 runs higher (the Q4 mean is in fact slightly lower)

Interpretation:
If the p-value is below 0.05, I treat this as evidence that Q4 runs higher on average. If it’s above 0.05, I don’t have enough evidence from this sample to claim a difference. Either way, the table shows the actual averages, which is useful for planning.


Opinion 2

Question (business): Is the Shipped success rate higher for Large deals than for Small deals?
Why this matters: If Large deals ship more reliably, ops can keep prioritizing them; if not, we may need to fix small-deal handling.

Hypotheses
- H0: Shipped rate (Large) = Shipped rate (Small)
- H1: Shipped rate (Large) > Shipped rate (Small)

Test chosen: Two-proportion test (prop.test). It compares two percentages.

# Build shipped flag and keep only Small/Large
ship_df <- sales_data %>%
  filter(!is.na(DEALSIZE), DEALSIZE %in% c("Small","Large"),
         !is.na(STATUS)) %>%
  mutate(SHIPPED = ifelse(STATUS == "Shipped", 1, 0))

# Counts by deal size
summ <- ship_df %>%
  group_by(DEALSIZE) %>%
  summarise(
    shipped = sum(SHIPPED),
    total   = n(),
    rate    = shipped/total,
    .groups = "drop"
  ) %>%
  arrange(DEALSIZE)

knitr::kable(summ, digits = 3, caption = "Shipped counts and rates by deal size (Small vs Large)")
Shipped counts and rates by deal size (Small vs Large)
DEALSIZE shipped total rate
Large 143 157 0.911
Small 1196 1282 0.933
# Two-proportion test (one-sided: Large > Small)
x <- c(summ$shipped[summ$DEALSIZE=="Small"], summ$shipped[summ$DEALSIZE=="Large"])
n <- c(summ$total[summ$DEALSIZE=="Small"],   summ$total[summ$DEALSIZE=="Large"])
prop_res <- prop.test(x = x, n = n, alternative = "less") 
# 'less' here tests Small < Large, which matches H1 (Large > Small)

pval2      <- signif(prop_res$p.value, 3)
rate_small <- round(summ$rate[summ$DEALSIZE=="Small"], 3)
rate_large <- round(summ$rate[summ$DEALSIZE=="Large"], 3)

Result:
- Shipped rate — Small: 0.933 ; Large: 0.911
- 2-proportion test p-value: 0.805

Interpretation:
If the p-value is below 0.05, I read this as evidence that Large deals ship more successfully than Small deals in this dataset. If it’s above 0.05, the sample does not give me enough evidence to claim a real difference, even if the raw rates look different.


My takeaway

These two quick checks give a basic business read: in this sample, Q4 lines do not run “richer” than Q1–Q3, and Large deals do not ship more reliably than Small ones. I would still fold these checks into stock planning and ops priorities, and I’d re-run them each quarter as more data comes in.

Question 1(d): Multiple Linear Regression (with stepwise)

Goal (in my own words): Predict the dollar value per order line (SALES). I’ll keep it simple and only use a few sensible predictors.

What I did before modeling (pre-processing) and why:
- Kept complete rows for the variables I’ll use → otherwise the model can’t fit.
- Logged the target (log(SALES)) because sales values are right-skewed; log usually makes the pattern more linear.
- Standardised (z-scored) numeric predictors so the coefficients are on a comparable scale and stepwise doesn’t “prefer” variables just due to units.
- Turned time and category fields into factors (QTR_ID, YEAR_ID, PRODUCTLINE, DEALSIZE) so the model can compare groups.

library(dplyr)

lm_df <- sales_data %>%
  select(SALES, QUANTITYORDERED, PRICEEACH, MSRP, QTR_ID, YEAR_ID, PRODUCTLINE, DEALSIZE) %>%
  filter(!is.na(SALES), SALES > 0,
         !is.na(QUANTITYORDERED), !is.na(PRICEEACH), !is.na(MSRP),
         !is.na(QTR_ID), !is.na(YEAR_ID),
         !is.na(PRODUCTLINE), !is.na(DEALSIZE)) %>%
  mutate(
    log_SALES = log(SALES),
    PRODUCTLINE = factor(PRODUCTLINE),
    DEALSIZE    = factor(DEALSIZE),
    QTR_ID      = factor(QTR_ID),
    YEAR_ID     = factor(YEAR_ID)
  )

# Standardise key numeric predictors
num_vars <- c("QUANTITYORDERED","PRICEEACH","MSRP")
lm_df[paste0(num_vars, "_z")] <- scale(lm_df[num_vars])

# Quick sanity check
knitr::kable(head(lm_df[, c("SALES","log_SALES","QUANTITYORDERED_z","PRICEEACH_z","MSRP_z","QTR_ID","YEAR_ID","PRODUCTLINE","DEALSIZE")]),
             caption = "Sample of data after simple pre-processing")
Sample of data after simple pre-processing
SALES log_SALES QUANTITYORDERED_z PRICEEACH_z MSRP_z QTR_ID YEAR_ID PRODUCTLINE DEALSIZE
2871.00 7.962416 -0.5227982 0.5968718 -0.1422206 1 2003 Motorcycles Small
2765.90 7.925121 -0.1121814 -0.1144301 -0.1422206 2 2003 Motorcycles Small
3884.34 8.264708 0.6063980 0.5492864 -0.1422206 3 2003 Motorcycles Medium
3746.70 8.228631 1.0170147 -0.0197551 -0.1422206 3 2003 Motorcycles Medium
5205.27 8.557427 1.4276315 0.8100145 -0.1422206 4 2003 Motorcycles Medium
3479.76 8.154719 0.0931270 0.6444571 -0.1422206 4 2003 Motorcycles Medium

Fit the model: I start with a reasonable full model and let stepwise (both directions) trim it using AIC.

# Full model
lm_full <- lm(log_SALES ~ QUANTITYORDERED_z + PRICEEACH_z + MSRP_z +
                QTR_ID + YEAR_ID + PRODUCTLINE + DEALSIZE,
              data = lm_df)

# Stepwise (both directions)
lm_step <- step(lm_full, direction = "both", trace = 0)

summary(lm_step)
## 
## Call:
## lm(formula = log_SALES ~ QUANTITYORDERED_z + PRICEEACH_z + MSRP_z + 
##     QTR_ID + YEAR_ID + PRODUCTLINE + DEALSIZE, data = lm_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.96148 -0.08311 -0.00936  0.07971  0.61706 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  8.412055   0.014683 572.894  < 2e-16 ***
## QUANTITYORDERED_z            0.233724   0.003146  74.293  < 2e-16 ***
## PRICEEACH_z                  0.282935   0.004039  70.047  < 2e-16 ***
## MSRP_z                       0.077464   0.003730  20.770  < 2e-16 ***
## QTR_ID2                     -0.042663   0.007505  -5.685 1.44e-08 ***
## QTR_ID3                     -0.015892   0.008387  -1.895 0.058213 .  
## QTR_ID4                     -0.013213   0.007259  -1.820 0.068849 .  
## YEAR_ID2004                  0.007055   0.005428   1.300 0.193800    
## YEAR_ID2005                 -0.029704   0.008355  -3.555 0.000384 ***
## PRODUCTLINEMotorcycles       0.003297   0.008379   0.394 0.693963    
## PRODUCTLINEPlanes           -0.010674   0.008788  -1.215 0.224635    
## PRODUCTLINEShips            -0.023508   0.009785  -2.402 0.016354 *  
## PRODUCTLINETrains            0.033082   0.015604   2.120 0.034084 *  
## PRODUCTLINETrucks and Buses -0.006100   0.008677  -0.703 0.482099    
## PRODUCTLINEVintage Cars     -0.018471   0.007046  -2.622 0.008799 ** 
## DEALSIZEMedium              -0.288072   0.012082 -23.843  < 2e-16 ***
## DEALSIZESmall               -0.442847   0.015380 -28.794  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.129 on 2806 degrees of freedom
## Multiple R-squared:  0.9392, Adjusted R-squared:  0.9388 
## F-statistic:  2707 on 16 and 2806 DF,  p-value: < 2.2e-16

Assumption checks:
- Residuals should look roughly centered around 0 with no pattern.
- QQ plot should be roughly straight (normal-ish residuals).

par(mfrow = c(1,2))
plot(lm_step$fitted.values, residuals(lm_step),
     xlab="Fitted log(SALES)", ylab="Residuals",
     main="Residuals vs Fitted"); abline(h=0,lty=2)
qqnorm(residuals(lm_step)); qqline(residuals(lm_step), col=2)

par(mfrow = c(1,1))

Simple cross-validation: 5-fold CV RMSE/R² using `caret`.

if (!requireNamespace("caret", quietly = TRUE)) install.packages("caret")
library(caret)

set.seed(123)
cv_ctrl <- trainControl(method = "cv", number = 5)
lm_cv <- train(formula(lm_step), data = lm_df, method = "lm", trControl = cv_ctrl)

lm_cv$results  # shows RMSE and R-squared across folds
##   intercept      RMSE  Rsquared       MAE      RMSESD  RsquaredSD       MAESD
## 1      TRUE 0.1294282 0.9383337 0.0989613 0.007369609 0.007559473 0.004359889

Interpreting coefficients (manager view): Because the target is log(SALES) and predictors are z-scored, a +1 SD change in a numeric predictor changes expected sales by ~exp(beta)−1 percent.

coefs <- coef(lm_step)

pct_change <- function(term) {
  if (term %in% names(coefs)) round(100*(exp(coefs[term]) - 1), 1) else NA
}

chg_qty   <- pct_change("QUANTITYORDERED_z")
chg_price <- pct_change("PRICEEACH_z")
chg_msrp  <- pct_change("MSRP_z")

r2_train <- summary(lm_step)$r.squared
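
The chunk above stores chg_qty, chg_price, and chg_msrp but never prints them; a one-liner (my addition) surfaces them for the manager view:

cat("Approx. % change in expected SALES per +1 SD move:",
    "Quantity", chg_qty, "% | Price each", chg_price, "% | MSRP", chg_msrp, "%\n")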

My Takeaway: I’m not claiming a perfect predictor here; I just want a directionally useful model. The signs tell me the way each driver moves expected sales (holding others steady). The CV numbers help me not over-trust the training fit.


Question 1(e): Binary classification with Logistic Regression

Business framing: I’ll predict whether an order line ships successfully.
- Target: SHIPPED_FLAG = 1 if STATUS == "Shipped", else 0
- Inputs: A few practical fields available at order time (quantity, price, MSRP, quarter, year, product line, deal size)

clf_df <- sales_data %>%
  select(STATUS, QUANTITYORDERED, PRICEEACH, MSRP, QTR_ID, YEAR_ID, PRODUCTLINE, DEALSIZE) %>%
  filter(!is.na(STATUS),
         !is.na(QUANTITYORDERED), !is.na(PRICEEACH), !is.na(MSRP),
         !is.na(QTR_ID), !is.na(YEAR_ID),
         !is.na(PRODUCTLINE), !is.na(DEALSIZE)) %>%
  mutate(
    SHIPPED_FLAG = ifelse(STATUS == "Shipped", 1L, 0L),
    PRODUCTLINE = factor(PRODUCTLINE),
    DEALSIZE    = factor(DEALSIZE),
    QTR_ID      = factor(QTR_ID),
    YEAR_ID     = factor(YEAR_ID)
  )

# Train/test split (70/30)
set.seed(123)
idx <- sample(seq_len(nrow(clf_df)), size = 0.7*nrow(clf_df))
train_df <- clf_df[idx, ]
test_df  <- clf_df[-idx, ]

Fit a plain logistic regression:

logit_fit <- glm(SHIPPED_FLAG ~ QUANTITYORDERED + PRICEEACH + MSRP +
                   QTR_ID + YEAR_ID + PRODUCTLINE + DEALSIZE,
                 data = train_df, family = binomial())
summary(logit_fit)
## 
## Call:
## glm(formula = SHIPPED_FLAG ~ QUANTITYORDERED + PRICEEACH + MSRP + 
##     QTR_ID + YEAR_ID + PRODUCTLINE + DEALSIZE, family = binomial(), 
##     data = train_df)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   5.599899   1.416071   3.955 7.67e-05 ***
## QUANTITYORDERED               0.001981   0.012412   0.160   0.8732    
## PRICEEACH                     0.000274   0.008657   0.032   0.9747    
## MSRP                          0.001929   0.004125   0.468   0.6400    
## QTR_ID2                      -2.854973   0.315424  -9.051  < 2e-16 ***
## QTR_ID3                      14.502998 544.985781   0.027   0.9788    
## QTR_ID4                      -1.887710   0.398386  -4.738 2.15e-06 ***
## YEAR_ID2004                  -0.465532   0.284382  -1.637   0.1016    
## YEAR_ID2005                  -2.855769   0.334134  -8.547  < 2e-16 ***
## PRODUCTLINEMotorcycles        1.164998   0.495388   2.352   0.0187 *  
## PRODUCTLINEPlanes            -0.329384   0.360143  -0.915   0.3604    
## PRODUCTLINEShips             -1.552835   0.339134  -4.579 4.68e-06 ***
## PRODUCTLINETrains             1.043427   1.132780   0.921   0.3570    
## PRODUCTLINETrucks and Buses  -0.144872   0.363076  -0.399   0.6899    
## PRODUCTLINEVintage Cars      -0.412190   0.306075  -1.347   0.1781    
## DEALSIZEMedium               -0.407100   0.520008  -0.783   0.4337    
## DEALSIZESmall                -0.104351   0.684772  -0.152   0.8789    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 995.54  on 1975  degrees of freedom
## Residual deviance: 687.26  on 1959  degrees of freedom
## AIC: 721.26
## 
## Number of Fisher Scoring iterations: 18

Evaluate fit: confusion matrix (0.5 cutoff), ROC AUC, and a simple pseudo-R².

# Predictions (probability of 'Shipped')
test_df$pred_prob <- predict(logit_fit, newdata = test_df, type = "response")
test_df$pred_flag <- ifelse(test_df$pred_prob >= 0.5, 1L, 0L)

# Confusion matrix
cm <- table(Actual = test_df$SHIPPED_FLAG, Predicted = test_df$pred_flag)

acc  <- sum(diag(cm))/sum(cm)
sens <- ifelse(sum(cm[2,])>0, cm["1","1"] / sum(cm[2,]), NA)  # recall for Shipped
spec <- ifelse(sum(cm[1,])>0, cm["0","0"] / sum(cm[1,]), NA)  # true negative rate

# ROC AUC
if (!requireNamespace("pROC", quietly = TRUE)) install.packages("pROC")
library(pROC)
roc_obj <- roc(response = test_df$SHIPPED_FLAG, predictor = test_df$pred_prob)
auc_val <- as.numeric(auc(roc_obj))

# Pseudo-R^2 (McFadden)
if (!requireNamespace("pscl", quietly = TRUE)) install.packages("pscl")
library(pscl)
pseudo <- pR2(logit_fit)  # McFadden et al.
## fitting null model for pseudo-r2
list(
  Confusion_Matrix = cm,
  Accuracy = round(acc, 3),
  Sensitivity_ShP = round(sens, 3),
  Specificity_Not = round(spec, 3),
  AUC = round(auc_val, 3),
  McFadden_R2 = round(pseudo["McFadden"], 3)
)
## $Confusion_Matrix
##       Predicted
## Actual   0   1
##      0  28  41
##      1  16 762
## 
## $Accuracy
## [1] 0.933
## 
## $Sensitivity_ShP
## [1] 0.979
## 
## $Specificity_Not
## [1] 0.406
## 
## $AUC
## [1] 0.929
## 
## $McFadden_R2
## McFadden 
##     0.31

How I read this:
- Accuracy/Sensitivity/Specificity tell me how well the model separates shipped vs not-shipped on unseen data. If sensitivity is decent, we catch most of the shipped lines; if specificity is decent, we’re not over-predicting shipped when it isn’t.
- AUC (closer to 1 is better) shows ranking quality independent of a fixed cutoff; anything much above 0.7 is useful in practice for simple triage (see the cutoff sketch below).
- Pseudo-R² (McFadden) gives a rough sense of overall explanatory power for logistic models (values are usually lower than linear R²).
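
Since specificity at the 0.5 cutoff is weak here, one way to act on the cutoff-free AUC is to let the ROC curve suggest an operating point. A small sketch using pROC (Youden's J criterion; my addition, not part of the original evaluation):

# Cutoff that balances sensitivity and specificity (Youden's J)
best_cut <- pROC::coords(roc_obj, x = "best", best.method = "youden",
                         ret = c("threshold", "sensitivity", "specificity"))
best_cut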

Manager meaning:
- If AUC and accuracy are reasonable, I’d use the probability to prioritise lines at risk (low probability of shipping); a triage sketch follows below.
- Coefficients’ signs indicate drivers: a positive coefficient means higher chance to ship, negative lowers it (while holding others constant). We can turn strong drivers into checklist items for ops (e.g., certain product lines or deal sizes may need extra QA or stock buffers).
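
A usage sketch of the triage idea (the at_risk name and the 20-line cutoff are arbitrary choices of mine):

# Flag the 20 test lines least likely to ship, for ops follow-up
at_risk <- test_df %>%
  arrange(pred_prob) %>%
  select(PRODUCTLINE, DEALSIZE, QTR_ID, pred_prob, SHIPPED_FLAG) %>%
  head(20)
knitr::kable(head(at_risk), digits = 3,
             caption = "Lowest predicted shipping probabilities (triage sketch)")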

I kept the modeling choices simple on purpose: a basic stepwise linear regression for SALES and a plain logistic regression for Shipped.

Question 1(f): Time-series forecast of Monthly Sales Revenue

Variable chosen (time-dependent): Monthly total SALES (sum of order-line sales per calendar month).
Why: Revenue is naturally time-based and important for planning inventory, cash flow, and marketing.

What I actually did here

  • Aggregated order lines to monthly revenue.
  • Ensured a regular monthly sequence (if a month is missing in the data, I treat it as zero revenue for that month).
  • Split into train (all but last 6 months) and test (last 6 months).
  • Fit two models:
    1. ETS (exponential smoothing) – good for level/trend/seasonality without heavy assumptions.
    2. ARIMA (auto.arima) – lets the data pick AR/MA terms and seasonal structure.
  • Compared forecasts on the test period with MAD, MSE, RMSE, MAPE.
    (Note: MAPE ignores months where actual = 0 to avoid divide-by-zero.)

Remark on packages: I used forecast for ETS/ARIMA and accuracy. Install if needed:

install.packages("forecast")

Build the monthly series

library(dplyr)
library(lubridate)
library(ggplot2)
if (!requireNamespace("forecast", quietly = TRUE)) install.packages("forecast")
library(forecast)

# 1) Aggregate to monthly revenue
ts_df <- sales_data %>%
  mutate(YearMonth = floor_date(ORDERDATE, unit = "month")) %>%
  group_by(YearMonth) %>%
  summarise(Monthly_Sales = sum(SALES, na.rm = TRUE), .groups = "drop") %>%
  arrange(YearMonth)

# 2) Make sure every calendar month is present (fill missing with 0)
full_months <- tibble(YearMonth = seq(min(ts_df$YearMonth), max(ts_df$YearMonth), by = "month"))
ts_df_full <- full_months %>%
  left_join(ts_df, by = "YearMonth") %>%
  mutate(Monthly_Sales = ifelse(is.na(Monthly_Sales), 0, Monthly_Sales))

# 3) Convert to ts object (monthly frequency)
ts_start <- c(year(min(ts_df_full$YearMonth)), month(min(ts_df_full$YearMonth)))
y <- ts(ts_df_full$Monthly_Sales, frequency = 12, start = ts_start)

length(y); head(ts_df_full)
## [1] 29
## # A tibble: 6 × 2
##   YearMonth  Monthly_Sales
##   <date>             <dbl>
## 1 2003-01-01       129754.
## 2 2003-02-01       140836.
## 3 2003-03-01       174505.
## 4 2003-04-01       201610.
## 5 2003-05-01       192673.
## 6 2003-06-01       168083.

Train / Test split and baseline plot

h <- 6  # forecast horizon = last 6 months for testing
n <- length(y)
train_y <- head(y, n - h)
test_y  <- tail(y, h)

# Simple plot of the full series
ggplot(ts_df_full, aes(x = YearMonth, y = Monthly_Sales)) +
  geom_line() +
  geom_vline(xintercept = ts_df_full$YearMonth[n - h], linetype = 2) +
  labs(title = "Monthly Sales (with Train/Test Split)",
       x = "Month", y = "Revenue ($)") +
  theme_minimal()

Model 1: ETS (Exponential Smoothing)

ets_fit <- ets(train_y)                 # ETS automatically picks components
ets_fc  <- forecast(ets_fit, h = h)     # Forecast next h months
ets_fit
## ETS(M,N,N) 
## 
## Call:
## ets(y = train_y)
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 222523.8162 
## 
##   sigma:  0.4864
## 
##      AIC     AICc      BIC 
## 617.5286 618.7918 620.9351

Model 2: ARIMA (auto.arima)

arima_fit <- auto.arima(train_y, seasonal = TRUE, stepwise = TRUE, approximation = FALSE)
arima_fc  <- forecast(arima_fit, h = h)
arima_fit
## Series: train_y 
## ARIMA(0,0,1) with non-zero mean 
## 
## Coefficients:
##          ma1       mean
##       0.5375  352818.89
## s.e.  0.1834   70480.62
## 
## sigma^2 = 5.447e+10:  log likelihood = -316.05
## AIC=638.1   AICc=639.36   BIC=641.51

Compare forecasts against the test period

# Helper to compute metrics; MAPE skips months where actual == 0
compute_metrics <- function(actual, pred){
  err  <- actual - pred
  MAD  <- mean(abs(err), na.rm = TRUE)
  MSE  <- mean(err^2,     na.rm = TRUE)
  RMSE <- sqrt(MSE)
  mape_vec <- abs(err / actual)
  MAPE <- mean(mape_vec[is.finite(mape_vec) & !is.nan(mape_vec) & actual != 0], na.rm = TRUE) * 100
  data.frame(MAD = MAD, MSE = MSE, RMSE = RMSE, MAPE = MAPE)
}

ets_metrics   <- compute_metrics(test_y,  ets_fc$mean)
arima_metrics <- compute_metrics(test_y,  arima_fc$mean)

metrics_tbl <- dplyr::bind_rows(
  cbind(Model = "ETS",   round(ets_metrics,   2)),
  cbind(Model = "ARIMA", round(arima_metrics, 2))
)

knitr::kable(metrics_tbl, caption = "Forecast accuracy on the 6-month test set (lower is better)")
Forecast accuracy on the 6-month test set (lower is better)
Model MAD MSE RMSE MAPE
ETS 728279.49 533730822655 730568.8 210.32
ARIMA 90059.82 18742773780 136904.2 25.08

Visualise actuals vs forecasts (test window)

# Build a plotting frame for the test period
plot_test <- ts_df_full %>%
  slice((nrow(ts_df_full) - h + 1):nrow(ts_df_full)) %>%
  mutate(ETS_Forecast   = as.numeric(ets_fc$mean),
         ARIMA_Forecast = as.numeric(arima_fc$mean))

ggplot(plot_test, aes(x = YearMonth)) +
  geom_line(aes(y = Monthly_Sales), linewidth = 1) +
  geom_line(aes(y = ETS_Forecast),   linetype = 2) +
  geom_line(aes(y = ARIMA_Forecast), linetype = 3) +
  labs(title = "Actual vs Forecast (Test Months Only)",
       x = "Month", y = "Revenue ($)",
       subtitle = "Solid = Actual; Dashed = ETS; Dotted = ARIMA") +
  theme_minimal()

What I see and why I chose these models

  • ETS is a good “manager-friendly” choice: it handles level, trend, and seasonality without heavy tuning.
  • ARIMA checks the data’s autocorrelation structure and can sometimes beat ETS when patterns are stable.
  • By keeping the last 6 months as a simple test, I get an honest view of out-of-sample accuracy.

Which model is better here?

  • Look at the table: the model with lower RMSE and MAPE on the test months is the safer pick (pulled programmatically below).
  • MAD (same idea as MAE) is also easy to explain: “on average, we miss the month by about $X.”
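
The pick can come straight from metrics_tbl (a one-line addition):

best_model <- metrics_tbl$Model[which.min(metrics_tbl$RMSE)]
cat("Lower test RMSE:", best_model, "\n")  # ARIMA, per the table above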

Managerial interpretation

  • A workable forecast (even if not perfect) helps set monthly targets, plan inventory purchases, and time promotions.
  • If ETS wins, it suggests a fairly regular seasonal pattern; if ARIMA wins, the series’ autocorrelation mattered more.
  • Either way, use the forecast to flag slow months early (push marketing) and prepare for peaks (stock and staffing).
  • I’d re-fit this every month as new data comes in and keep the 6-month back-test to make sure performance stays reasonable.

Small note on MAPE: If a month’s actual sales are zero, MAPE is undefined. I skipped those in the MAPE calculation so the % error isn’t blown up by division by zero.
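
An alternative I considered but did not use: symmetric MAPE, which stays defined when a month's actual is zero. A sketch for reference only:

# Symmetric MAPE (%): denominator uses actual + forecast, so a zero actual is safe
smape <- function(actual, pred) {
  mean(200 * abs(actual - pred) / (abs(actual) + abs(pred)), na.rm = TRUE)
}
smape(as.numeric(test_y), as.numeric(ets_fc$mean))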

Appendix (Updated): Packages, Reproducibility & Outputs

Remark (packages installed): To reproduce all tables, charts and models in this report I installed the following packages on my machine:

install.packages(c(
  "tidyverse","lubridate","knitr","ggplot2",
  "sqldf","caret","pROC","pscl","forecast"
))

Outputs that were produced in this document

Tables
- Missing Values Summary (missing_data)
- Summary Statistics for key variables (summary_stats)
- Top Product Lines by Revenue (product_perf)
- Order Status Distribution (status_dist)
- Year-over-Year Performance (yearly_sales)
- SQL Results – Product lines 2004 shipped (res1)
- SQL Results – Top USA customers 2005 shipped (res2)
- Opinion 1 means (Q4 vs Q1–Q3) (op1_means)
- Deal-size shipped counts & rates (summ)
- Stepwise CV results (lm_cv$results)
- Forecast Accuracy (ETS vs ARIMA) (metrics_tbl)

Figures
- Histogram: Sales Amount Distribution
- Linear Regression diagnostics (Residuals vs Fitted, QQ plot)
- Time-series: Monthly Sales with Train/Test split
- Forecasts vs Actuals (test window): ETS & ARIMA
- (Optional) ROC curve for logistic regression

Note: If you knit from a clean session, run the earlier chunks (Q1a–Q1f) so the objects exist. The code below only re-prints what’s already been computed.


Re-print key tables (if available)

library(knitr)
safe_kable <- function(obj, caption, ...) {
  if (exists(obj) && !is.null(get(obj))) {
    kable(get(obj), caption = caption, ...)
  } else {
    cat(paste0("*Skipped (object not found): ", obj, "\n\n"))
  }
}

safe_kable("missing_data", "Missing Values Summary")
Missing Values Summary
Variable Missing_Count Missing_Percent
TERRITORY 1074 38.0
if (exists("summary_stats")) kable(summary_stats, caption="Summary Statistics for Key Variables")
Summary Statistics for Key Variables
QUANTITYORDERED PRICEEACH SALES MSRP
Mean 35.09 83.66 3553.89 100.72
Median 35.00 95.70 3184.80 99.00
SD 9.74 20.17 1841.87 40.19
Min 6.00 26.88 482.13 33.00
Max 97.00 100.00 14082.80 214.00
if (exists("product_perf")) kable(head(product_perf), caption="Top Product Lines by Revenue (head)")
Top Product Lines by Revenue (head)
PRODUCTLINE Orders Total_Revenue Avg_Revenue
Classic Cars 967 3919615.7 4053.38
Vintage Cars 607 1903150.8 3135.34
Motorcycles 331 1166388.3 3523.83
Trucks and Buses 301 1127789.8 3746.81
Planes 306 975003.6 3186.29
Ships 234 714437.1 3053.15
if (exists("status_dist"))  kable(data.frame(Status=names(status_dist), Count=as.numeric(status_dist)),
                                  caption="Order Status Distribution")
Order Status Distribution
Status Count
Cancelled 60
Disputed 14
In Process 41
On Hold 44
Resolved 47
Shipped 2617
safe_kable("yearly_sales", "Year-over-Year Performance", format.args = list(big.mark=","))
Year-over-Year Performance
YEAR_ID Total_Sales Order_Count Avg_Order_Value
2003 3,516,980 1,000 3,516.98
2004 4,724,163 1,345 3,512.39
2005 1,791,487 478 3,747.88
safe_kable("res1", "SQL Result 1 – Product lines by revenue (Shipped, 2004)",
           format.args = list(big.mark=",", scientific=FALSE))
SQL Result 1 – Product lines by revenue (Shipped, 2004)
PRODUCTLINE order_lines total_revenue avg_line_revenue
Classic Cars 425 1,702,871.5 4,006.757
Vintage Cars 275 883,004.4 3,210.925
Motorcycles 164 560,545.2 3,417.959
Trucks and Buses 138 509,109.6 3,689.200
Planes 150 468,552.2 3,123.682
Ships 99 292,522.8 2,954.775
Trains 36 111,441.4 3,095.595
safe_kable("res2", "SQL Result 2 – Top USA customers by avg line revenue (Shipped, 2005)",
           format.args = list(big.mark=",", scientific=FALSE))
SQL Result 2 – Top USA customers by avg line revenue (Shipped, 2005)
CUSTOMERNAME order_lines total_revenue avg_line_revenue
Gift Depot Inc. 6 31,648.47 5,274.745
FunGiftIdeas.com 8 37,557.66 4,694.708
The Sharp Gifts Warehouse 9 37,526.84 4,169.649
Corporate Gift Ideas Co. 13 54,203.62 4,169.509
Collectables For Less Inc. 8 31,474.78 3,934.347
Mini Gifts Distributors Ltd. 56 205,993.93 3,678.463
Mini Creations Ltd. 3 11,021.30 3,673.767
Technics Stores Inc. 4 13,529.57 3,382.392
safe_kable("op1_means", "Opinion 1 – Q4 vs Q1–Q3 mean SALES (line level)", digits=2)
Opinion 1 – Q4 vs Q1–Q3 mean SALES (line level)
group n mean_sales sd_sales
Q1toQ3 1729 3561.51 1879.12
Q4 1094 3541.85 1782.18
safe_kable("summ", "Opinion 2 – Shipped counts & rates by deal size (Small vs Large)", digits=3)
Opinion 2 – Shipped counts & rates by deal size (Small vs Large)
DEALSIZE shipped total rate
Large 143 157 0.911
Small 1196 1282 0.933
if (exists("lm_cv")) {
  kable(lm_cv$results, caption="Linear Regression 5-fold CV Results (RMSE, R²)")
} else {
  cat("*Skipped (object not found): lm_cv$results\n\n")
}
Linear Regression 5-fold CV Results (RMSE, R²)
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 0.1294282 0.9383337 0.0989613 0.0073696 0.0075595 0.0043599
safe_kable("metrics_tbl", "Forecast accuracy on 6-month test set (ETS vs ARIMA) – lower = better")
Forecast accuracy on 6-month test set (ETS vs ARIMA) – lower = better
Model MAD MSE RMSE MAPE
ETS 728279.49 533730822655 730568.8 210.32
ARIMA 90059.82 18742773780 136904.2 25.08

Re-print key figures (if available)

# Histogram of SALES (from Q1a)
if (exists("sales_data")) {
  hist(sales_data$SALES, breaks = 30, main = "Sales Amount Distribution (reprint)",
       xlab = "Sales ($)", col = "lightblue")
} else {
  plot.new(); title("Skipped: sales_data not found")
}

# Linear Regression diagnostics (from Q1d)
if (exists("lm_step")) {
  par(mfrow = c(1,2))
  plot(lm_step$fitted.values, residuals(lm_step),
       xlab="Fitted log(SALES)", ylab="Residuals",
       main="Residuals vs Fitted (reprint)"); abline(h=0, lty=2)
  qqnorm(residuals(lm_step)); qqline(residuals(lm_step), col=2)
  par(mfrow = c(1,1))
} else {
  plot.new(); title("Skipped: lm_step not found")
}

# Time-series: Monthly Sales with Train/Test split (from Q1f)
if (exists("ts_df_full")) {
  n <- nrow(ts_df_full); h <- if (exists("h")) h else 6
  library(ggplot2)
  ggplot(ts_df_full, aes(x = YearMonth, y = Monthly_Sales)) +
    geom_line() +
    {if (n > h) geom_vline(xintercept = ts_df_full$YearMonth[n - h], linetype = 2) else NULL} +
    labs(title = "Monthly Sales (Train/Test split – reprint)",
         x = "Month", y = "Revenue ($)") +
    theme_minimal()
} else {
  plot.new(); title("Skipped: ts_df_full not found")
}

# Forecasts vs Actuals (test window) – ETS & ARIMA (from Q1f)
if (exists("ets_fc") && exists("arima_fc") && exists("ts_df_full")) {
  n <- nrow(ts_df_full); h <- length(ets_fc$mean)
  plot_test <- ts_df_full[(n - h + 1):n, ]
  plot_test$ETS_Forecast   <- as.numeric(ets_fc$mean)
  plot_test$ARIMA_Forecast <- as.numeric(arima_fc$mean)

  ggplot(plot_test, aes(x = YearMonth)) +
    geom_line(aes(y = Monthly_Sales), linewidth = 1) +
    geom_line(aes(y = ETS_Forecast),   linetype = 2) +
    geom_line(aes(y = ARIMA_Forecast), linetype = 3) +
    labs(title = "Actual vs Forecast (Test Months – reprint)",
         x = "Month", y = "Revenue ($)",
         subtitle = "Solid = Actual; Dashed = ETS; Dotted = ARIMA") +
    theme_minimal()
} else {
  plot.new(); title("Skipped: forecasts not found (ets_fc/arima_fc)")
}

# Optional: ROC curve (from Q1e)
if (exists("roc_obj")) {
  plot(roc_obj, main = "ROC Curve (reprint)")
} else if (exists("test_df") && "pred_prob" %in% names(test_df) && "SHIPPED_FLAG" %in% names(test_df)) {
  if (!requireNamespace("pROC", quietly = TRUE)) install.packages("pROC")
  library(pROC)
  roc_tmp <- roc(response = test_df$SHIPPED_FLAG, predictor = test_df$pred_prob)
  plot(roc_tmp, main = "ROC Curve (reprint)")
} else {
  plot.new(); title("Skipped: ROC objects not found")
}

Session info (for reproducibility)

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22621)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_Singapore.utf8  LC_CTYPE=English_Singapore.utf8   
## [3] LC_MONETARY=English_Singapore.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Singapore.utf8    
## 
## time zone: Asia/Singapore
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forecast_8.24.0 pscl_1.5.9      pROC_1.19.0.1   caret_7.0-1    
##  [5] lattice_0.22-7  sqldf_0.4-11    RSQLite_2.4.3   gsubfn_0.7     
##  [9] proto_1.0.0     knitr_1.50      lubridate_1.9.4 forcats_1.0.0  
## [13] stringr_1.5.2   dplyr_1.1.4     purrr_1.1.0     readr_2.1.5    
## [17] tidyr_1.3.1     tibble_3.3.0    ggplot2_4.0.0   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4041.110    farver_2.1.2        
##  [4] blob_1.2.4           S7_0.2.0             fastmap_1.2.0       
##  [7] digest_0.6.37        rpart_4.1.24         timechange_0.3.0    
## [10] lifecycle_1.0.4      survival_3.8-3       magrittr_2.0.3      
## [13] compiler_4.5.1       rlang_1.1.6          sass_0.4.10         
## [16] tools_4.5.1          yaml_2.3.10          data.table_1.17.8   
## [19] labeling_0.4.3       curl_7.0.0           bit_4.6.0           
## [22] TTR_0.24.4           plyr_1.8.9           RColorBrewer_1.1-3  
## [25] withr_3.0.2          stats4_4.5.1         nnet_7.3-20         
## [28] grid_4.5.1           xts_0.14.1           colorspace_2.1-1    
## [31] future_1.67.0        globals_0.18.0       scales_1.4.0        
## [34] iterators_1.0.14     MASS_7.3-65          cli_3.6.5           
## [37] rmarkdown_2.29       chron_2.3-62         generics_0.1.4      
## [40] rstudioapi_0.17.1    future.apply_1.20.0  reshape2_1.4.4      
## [43] tzdb_0.5.0           DBI_1.2.3            cachem_1.1.0        
## [46] splines_4.5.1        parallel_4.5.1       urca_1.3-4          
## [49] vctrs_0.6.5          hardhat_1.4.2        Matrix_1.7-3        
## [52] jsonlite_2.0.0       tseries_0.10-58      hms_1.1.3           
## [55] bit64_4.6.0-1        listenv_0.9.1        foreach_1.5.2       
## [58] gower_1.0.2          jquerylib_0.1.4      recipes_1.3.1       
## [61] quantmod_0.4.28      glue_1.8.0           parallelly_1.45.1   
## [64] codetools_0.2-20     stringi_1.8.7        gtable_0.3.6        
## [67] quadprog_1.5-8       lmtest_0.9-40        pillar_1.11.0       
## [70] htmltools_0.5.8.1    ipred_0.9-15         lava_1.8.1          
## [73] R6_2.6.1             tcltk_4.5.1          evaluate_1.0.5      
## [76] memoise_2.0.1        fracdiff_1.5-3       bslib_0.9.0         
## [79] class_7.3-23         Rcpp_1.1.0           nlme_3.1-168        
## [82] prodlim_2025.04.28   xfun_0.53            zoo_1.8-14          
## [85] ModelMetrics_1.2.2.2 pkgconfig_2.0.3

Question 2(a): Decision tree & EMV — Open a Regional DC?

Data given

  • Fixed cost to open the DC: ₹1,200,000 per year
  • Variable cost via the DC: ₹120 per unit; via the existing network: ₹180 per unit
  • Demand scenarios: Low = 8,000 (p = 0.3), Medium = 12,000 (p = 0.5), High = 18,000 (p = 0.2)

Decision tree

  • Decision node: Open DC vs Do Not Open DC
    • If Open DC → Chance node (Low/Med/High demand). Cost on each branch = Fixed + 120 × Demand
    • If Do Not Open DC → Chance node (Low/Med/High demand). Cost on each branch = 180 × Demand
  • EMV rule: pick the decision with the lower expected annual cost.

R code: compute branch costs & EMV

library(dplyr)
library(knitr)

# Inputs
fixed_dc   <- 1200000
vc_dc      <- 120
c_no_dc    <- 180

scenarios <- tibble::tibble(
  Scenario = c("Low","Medium","High"),
  Demand   = c(8000, 12000, 18000),
  Prob     = c(0.3,    0.5,    0.2)
)

# Costs per scenario under each decision
results <- scenarios %>%
  mutate(
    Cost_OpenDC    = fixed_dc + vc_dc * Demand,
    Cost_NoDC      = c_no_dc * Demand,
    Delta_Saving   = Cost_NoDC - Cost_OpenDC   # +ve means opening DC saves money
  )

kable(results, caption = "Per-scenario annual cost (₹) under each decision",
      format.args = list(big.mark=","), digits = c(0, 0, 2, 0, 0, 0))  # keep Prob unrounded
Per-scenario annual cost (₹) under each decision
Scenario Demand Prob Cost_OpenDC Cost_NoDC Delta_Saving
Low 8,000 0.3 2,160,000 1,440,000 -720,000
Medium 12,000 0.5 2,640,000 2,160,000 -480,000
High 18,000 0.2 3,360,000 3,240,000 -120,000
# Expected (EMV = expected cost) for each decision
emv_open  <- sum(results$Prob * results$Cost_OpenDC)
emv_no    <- sum(results$Prob * results$Cost_NoDC)

emv_tbl <- tibble::tibble(
  Decision = c("Open DC","Do Not Open DC"),
  EMV_Expected_Cost = c(emv_open, emv_no)
)

kable(emv_tbl, caption = "EMV (expected annual cost) by decision (₹)",
      format.args = list(big.mark=","), digits = 0)
EMV (expected annual cost) by decision (₹)
Decision EMV_Expected_Cost
Open DC 2,640,000
Do Not Open DC 2,160,000
# Also show expected saving if we opened the DC (vs no DC)
exp_saving_if_open <- emv_no - emv_open  # +ve => opening DC is better
cat("Expected saving if we OPEN the DC (vs stay as-is): ₹",
    format(round(exp_saving_if_open,0), big.mark=","), "\n")
## Expected saving if we OPEN the DC (vs stay as-is): ₹ -480,000

Results

  • EMV (Open DC): ₹2,640,000
  • EMV (Do Not Open DC): ₹2,160,000
  • Recommendation (by EMV): Do Not Open DC, the option with the lower expected annual cost.

Quick sense-check (break-even): Opening the DC beats status quo when
Fixed + 120×Demand < 180×Demand ⇒ 1,200,000 < 60×Demand ⇒ Demand > 20,000.
Our “High” case is 18,000, still below 20,000, so opening the DC doesn’t pay off in any of the given scenarios — which is consistent with the EMV table.

Manager takeaway: Given today’s demand outlook, the safer, cheaper choice is not to open the regional DC. If demand projections rise above ~20k units/year, revisit the decision.

Question 2(b): Sensitivity + Monte Carlo (triangular demand)

1) Sensitivity (break-even demand)

We compare annual cost to Open DC vs cost to Not Open:

  • Open DC cost = Fixed + 120 × Demand
  • No DC cost = 180 × Demand

Break-even when they’re equal: \[ 1{,}200{,}000 + 120D = 180D \Rightarrow 1{,}200{,}000 = 60D \Rightarrow D^\* = \frac{1{,}200{,}000}{60} = \mathbf{20{,}000\ units} \]

So opening only makes sense if annual demand exceeds 20,000 units.
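
The same break-even drops out of one line of R, using the given cost parameters:

fixed_dc <- 1200000; vc_dc <- 120; c_no_dc <- 180
fixed_dc / (c_no_dc - vc_dc)  # break-even demand: 1,200,000 / 60 = 20,000 units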


2) Monte Carlo simulation (triangular demand)

Assumption: Profit difference comes only from cost savings (selling price/revenue same either way).
So, opening the DC “wins” when Cost_Open < Cost_No.

  • Demand ~ Triangular(min = 8,000; mode = 12,000; max = 18,000)
  • Iterations: 10,000 (≥ 1,000 as required)
  • Metric: Probability that opening DC yields higher profit (i.e., lower cost)

Remark: Triangular is bounded between min and max, so simulated demands will always be ≤ 18,000, which is still below the break-even 20,000.

set.seed(123)

# Given cost parameters
fixed_dc <- 1200000
vc_dc    <- 120
c_no_dc  <- 180

# Triangular parameters
a <- 8000   # min
m <- 12000  # mode
b <- 18000  # max

# Custom triangular RNG (inverse CDF), so no extra package needed
rtri_inv <- function(n, a, m, b){
  u <- runif(n)
  Fc <- (m - a) / (b - a)
  x <- ifelse(
    u < Fc,
    a + sqrt(u * (b - a) * (m - a)),
    b - sqrt((1 - u) * (b - a) * (b - m))
  )
  return(x)
}

N <- 10000
demand_sim <- rtri_inv(N, a, m, b)

# Compute annual costs under each decision
cost_open <- fixed_dc + vc_dc * demand_sim
cost_no   <- c_no_dc * demand_sim

# "Savings" if we open (positive = opening is better)
savings <- cost_no - cost_open

# Probability opening DC is better (i.e., savings > 0)
prob_open_better <- mean(savings > 0)

# Also report expected savings (useful sense-check)
exp_saving <- mean(savings)

# Print results
cat("Break-even demand (analytic): 20,000 units\n")
## Break-even demand (analytic): 20,000 units
cat("Simulation: P(opening is better) =", round(prob_open_better, 4), "\n")
## Simulation: P(opening is better) = 0
cat("Simulation: Expected saving if open (₹) =", round(exp_saving, 0), "\n")
## Simulation: Expected saving if open (₹) = -441274

Quick plot (simulated demand vs. break-even)

library(ggplot2)
ggplot(data.frame(Demand = demand_sim), aes(x = Demand)) +
  geom_histogram(bins = 40, fill = "skyblue", color = "white") +
  geom_vline(xintercept = 20000, linetype = 2) +
  labs(title = "Triangular Demand Simulation (min=8k, mode=12k, max=18k)",
       subtitle = "Dashed line = 20,000-unit break-even",
       x = "Annual Demand (units)", y = "Frequency") +
  theme_minimal()

Results

  • Break-even demand: 20,000 units.
  • Simulated probability opening is better: should be ~0.0000 because the triangular max is 18,000, which never reaches the 20,000 break-even.
  • Expected saving if we open: should be negative (i.e., opening costs more on average).
    > For reference, the expected demand of a triangular distribution is \((a+m+b)/3 = (8{,}000+12{,}000+18{,}000)/3 \approx 12{,}667\).
    > Expected saving ≈ \(60 \times 12{,}667 - 1{,}200{,}000 \approx -₹440{,}000\) per year.

Manager takeaway

Under the stated uncertainty (8k–18k demand, most likely 12k), the chance that opening the DC beats the status quo is essentially 0%. Unless demand forecasts credibly move beyond 20,000 units/year, do not open the DC. If new information suggests higher demand, rerun this exact simulation with the updated range/mode.
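
If forecasts do move, the re-run is only a few lines. A hypothetical upside scenario (the range and mode below are invented for illustration, not a forecast):

# Hypothetical upside: demand ~ Triangular(min = 14k, mode = 20k, max = 28k)
demand_up <- rtri_inv(10000, a = 14000, m = 20000, b = 28000)
mean((c_no_dc - vc_dc) * demand_up > fixed_dc)  # P(opening the DC is better)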

Question 2(c): Mini-DC option (0, 1, or 2) — MIP model and solution

What we’re deciding:
- Open 0, 1, or 2 mini-DCs (each mini-DC has fixed cost and capacity).
- How many units to ship via each open mini-DC vs the existing network.

Given
- Mini-DC fixed cost = ₹700,000 (per mini-DC)
- Mini-DC variable cost = ₹130 per unit
- Mini-DC capacity = 9,000 units
- Existing network (no DC) variable cost = ₹180 per unit
- Demand \(D\) (I’ll solve for Low=8k, Medium=12k, High=18k like in 2a, and you can plug any \(D\) you want)

Mathematical formulation (Mixed-Integer Linear Program)

Decision variables
- \(y_1, y_2 \in \{0,1\}\): open mini-DC 1 or 2?
- \(x_1, x_2 \ge 0\): units shipped via mini-DC 1 and 2
- \(x_0 \ge 0\): units shipped via existing network

Objective (minimize total annual cost) \[ \min\ C = 700{,}000(y_1+y_2) + 130(x_1+x_2) + 180x_0 \]

Constraints \[ \begin{aligned} &x_0 + x_1 + x_2 = D &\text{(meet demand)}\\ &x_1 \le 9{,}000\cdot y_1,\quad x_2 \le 9{,}000\cdot y_2 &\text{(capacity only if opened)}\\ &x_0, x_1, x_2 \ge 0,\quad y_1, y_2 \in \{0,1\} \end{aligned} \]

R code (lpSolve) and results

# If needed: install.packages("lpSolve")
library(lpSolve)
library(dplyr)
library(knitr)

solve_minidc <- function(D){
  # Vars order: [x0, x1, x2, y1, y2]
  obj <- c(180, 130, 130, 700000, 700000)

  # Constraints:
  # 1) Demand: x0 + x1 + x2 = D
  # 2) x1 - 9000*y1 <= 0
  # 3) x2 - 9000*y2 <= 0
  const.mat <- rbind(
    c(1, 1, 1, 0, 0),
    c(0, 1, 0, -9000, 0),
    c(0, 0, 1, 0, -9000)
  )
  const.dir <- c("=", "<=", "<=")
  const.rhs <- c(D, 0, 0)

  # Solve MIP
  out <- lp(direction = "min",
            objective.in = obj,
            const.mat = const.mat,
            const.dir = const.dir,
            const.rhs = const.rhs,
            all.bin = FALSE,
            binary.vec = c(4,5))  # y1,y2 binary

  sol <- out$solution
  tibble(
    Demand = D,
    TotalCost = out$objval,
    x0 = sol[1], x1 = sol[2], x2 = sol[3],
    y1 = sol[4], y2 = sol[5],
    MinisOpened = sol[4] + sol[5]
  )
}

scenarios <- c(8000, 12000, 18000)
res_tbl <- bind_rows(lapply(scenarios, solve_minidc))

# Add "No mini-DC" baseline for comparison
baseline <- tibble(
  Demand = scenarios,
  BaselineCost_NoMini = 180 * scenarios
)

final_tbl <- res_tbl %>%
  left_join(baseline, by = "Demand") %>%
  mutate(Choice = ifelse(TotalCost < BaselineCost_NoMini, 
                         paste0("Open ", MinisOpened, " mini-DC(s)"),
                         "Open 0 (use existing network)"))

kable(final_tbl, caption = "Mini-DC MIP solution vs baseline (per demand scenario)",
      format.args = list(big.mark=","), digits = 0)
Mini-DC MIP solution vs baseline (per demand scenario)
Demand TotalCost x0 x1 x2 y1 y2 MinisOpened BaselineCost_NoMini Choice
8,000 1,440,000 8,000 0 0 0 0 0 1,440,000 Open 0 (use existing network)
12,000 2,160,000 12,000 0 0 0 0 0 2,160,000 Open 0 (use existing network)
18,000 3,240,000 18,000 0 0 0 0 0 3,240,000 Open 0 (use existing network)

(What the table shows — worked numbers)
For these costs and capacities, the solver selects 0 mini-DCs in all three scenarios:

  • D = 8,000:
    • Baseline = 180×8,000 = ₹1,440,000
    • 1 mini (if forced) = 700,000 + 130×8,000 = ₹1,740,000 (worse)
  • D = 12,000:
    • Baseline = ₹2,160,000
    • 1 mini (cap 9k) + rest baseline = 700,000 + 130×9,000 + 180×3,000 = ₹2,410,000 (worse)
    • 2 minis = 1,400,000 + 130×12,000 = ₹2,960,000 (worse)
  • D = 18,000:
    • Baseline = ₹3,240,000
    • 1 mini (9k) + rest baseline = 700,000 + 130×9,000 + 180×9,000 = ₹3,490,000 (worse)
    • 2 minis (cap 18k) = 1,400,000 + 130×18,000 = ₹3,740,000 (worse)

Why this happens?

  • A mini-DC saves ₹50 per unit vs the baseline (180 → 130) but adds a fixed ₹700k and only 9k capacity.
    • Max per-mini saving at full capacity = 50×9,000 = ₹450k, which doesn’t cover the ₹700k fixed cost.
    • Even two minis (18k capacity) save at most 50×18,000 = ₹900k, still less than the combined fixed cost of ₹1.4M.
  • So under these inputs, opening mini-DCs cannot beat the existing network (quick arithmetic check below).
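
The arithmetic behind those bullets, as a quick check:

save_per_unit <- 180 - 130   # ₹50 saved per unit routed via a mini-DC
save_per_unit * 9000         # best case, 1 mini at capacity: ₹450,000 < ₹700,000 fixed
save_per_unit * 18000        # best case, 2 minis at capacity: ₹900,000 < ₹1,400,000 fixed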

Manager recommendation

With today’s numbers, the cost-minimizing choice is to open 0 mini-DCs and ship everything through the existing network.
If the fixed cost per mini drops below ~₹450k, or the baseline per-unit cost rises, or mini-DC capacity increases materially, this conclusion could change—then re-run the same model with the updated inputs.
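
A sketch of that re-run, sweeping the per-mini fixed cost for the High-demand case (the grid values are illustrative):

# What-if: at D = 18,000, how low must the per-mini fixed cost fall?
fixed_grid <- seq(300000, 700000, by = 100000)
data.frame(
  Fixed_per_mini = fixed_grid,
  Cost_two_minis = 2 * fixed_grid + 130 * 18000,  # both minis at full capacity
  Cost_baseline  = 180 * 18000                    # existing network only
)
# two minis beat the baseline once Fixed_per_mini < ₹450,000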

Remark (packages): I used lpSolve to solve this small mixed-integer model. If knitting on a new machine, install.packages("lpSolve").