Assignment 3 — MBAN 5570

Accounting & Finance Analytics | Sobey School of Business

Author

Gavin Shklanka

Published

March 21, 2026

Code

library(tidyverse)
library(knitr)
library(kableExtra)
library(forecast)
library(tseries)
library(gridExtra)
library(scales)

# Consistent colour palette
PAL2 <- c("#2C7BB6", "#D7191C")
PAL4 <- c("#2C7BB6", "#D7191C", "#1A9641", "#FDAE61")

# ── Triangular random number generator ──────────────────────────
# Replaces @RISK's RiskTriang() using the standard inverse-CDF method.
# Parameters: n = number of draws, a = minimum, b = maximum, c = mode.
# The inverse CDF splits at F(c) = (c−a)/(b−a): draws below this
# threshold sample the rising side of the triangle, draws above
# sample the falling side.
rtriangle <- function(n, a, b, c) {
  u  <- runif(n)
  fc <- (c - a) / (b - a)
  ifelse(u < fc,
         a + sqrt(u * (b - a) * (c - a)),
         b - sqrt((1 - u) * (b - a) * (b - c)))
}

Part I — Pfizer Inc.: Development of a New Drug

Background

The pharmaceutical industry is characterised by extreme uncertainty. Over 90% of drugs under development never reach market, yet successful drugs can generate multi-billion dollar annual profits for 10–15 years. Pfizer is evaluating NIAGARA, a new drug that must clear four sequential development stages before commercialisation: (1) R&D, (2) Pre-Clinical Testing, (3) Clinical Trial — Phase I, and (4) Clinical Trial — Phase II.

If the drug fails at any stage, development is immediately terminated and all remaining investment is foregone. A success at each stage leads to the next. We model all uncertain inputs — cost, duration, probability of success, and eventual profit — as triangular random variables, the R equivalent of @RISK’s RiskTriang() function. This analysis runs 5,000 Monte Carlo simulations to estimate the full distribution of the project’s risk-adjusted Net Present Value (NPV), using a 15% discount rate as the cost of capital.

Why R instead of @RISK? This analysis replicates every @RISK capability — triangular sampling, Monte Carlo simulation, NPV percentiles, VaR charts, Tornado sensitivity — in transparent, auditable R code. Every distributional assumption, every discounting step, and every random draw is visible in the source. The quality of a risk analysis is determined by whether it quantifies uncertainty, identifies key risk drivers, and communicates downside exposure clearly — not by the software’s interface.

Model Setup

Code

# ─────────────────────────────────────────────────────────
#  Input parameters for each stage (all costs in $M)
# ─────────────────────────────────────────────────────────
stages <- c("R&D", "Pre-Clinical", "Clinical Phase I", "Clinical Phase II")

# Cost: (min, max, mode)
cost_params <- list(
  c(50,   120,  70),   # R&D
  c(10,    30,  15),   # Pre-Clinical
  c(350,  600, 480),   # Clinical Phase I
  c(3500, 6000, 4200)  # Clinical Phase II
)

# Time in years: (min, max, mode)
time_params <- list(
  c(3, 7, 4),          # R&D
  c(0.5, 3, 1),        # Pre-Clinical
  c(3, 6, 4),          # Clinical Phase I
  c(3, 6, 4)           # Clinical Phase II
)

# Probability of success: (min, max, mode)
prob_params <- list(
  c(0.20, 0.42, 0.35), # R&D
  c(0.30, 0.60, 0.50), # Pre-Clinical
  c(0.40, 0.60, 0.50), # Clinical Phase I
  c(0.70, 0.96, 0.90)  # Clinical Phase II
)

# Profit once developed: Triangular(14000, 60000, 18000) $M
profit_params <- c(14000, 60000, 18000)

# Discount rate
r <- 0.15

# Display input table
input_df <- data.frame(
  Stage       = stages,
  Cost_Min    = sapply(cost_params, `[`, 1),
  Cost_Mode   = sapply(cost_params, `[`, 3),
  Cost_Max    = sapply(cost_params, `[`, 2),
  Time_Min    = sapply(time_params, `[`, 1),
  Time_Mode   = sapply(time_params, `[`, 3),
  Time_Max    = sapply(time_params, `[`, 2),
  Prob_Min    = sapply(prob_params, `[`, 1),
  Prob_Mode   = sapply(prob_params, `[`, 3),
  Prob_Max    = sapply(prob_params, `[`, 2)
)

input_df %>%
  kbl(caption = "Triangular Distribution Parameters for Each Development Stage",
      digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  add_header_above(c(" " = 1, "Cost ($M)" = 3, "Time (Years)" = 3,
                     "Prob. of Success" = 3))

Triangular Distribution Parameters for Each Development Stage
	Cost ($M)			Time (Years)			Prob. of Success
Stage	Cost_Min	Cost_Mode	Cost_Max	Time_Min	Time_Mode	Time_Max	Prob_Min	Prob_Mode	Prob_Max
R&D	50	70	120	3.0	4	7	0.2	0.35	0.42
Pre-Clinical	10	15	30	0.5	1	3	0.3	0.50	0.60
Clinical Phase I	350	480	600	3.0	4	6	0.4	0.50	0.60
Clinical Phase II	3500	4200	6000	3.0	4	6	0.7	0.90	0.96

Monte Carlo Simulation (5,000 Iterations)

The simulation below models the sequential, path-dependent nature of drug development. In each iteration we: (1) draw all 13 uncertain inputs from their triangular distributions, (2) step through each development stage, discounting costs to time 0 at the 15% hurdle rate, (3) at each stage, draw a Bernoulli success/failure using the sampled probability — if the stage fails, the simulation terminates and NPV equals the negative present value of all costs incurred so far, (4) if all four stages succeed, add the discounted profit to compute the final NPV.

This approach also collects all 13 sampled inputs alongside each NPV draw, so we can later compute Spearman rank correlations for the Tornado sensitivity chart without re-running the simulation.

Code

set.seed(42)
N <- 5000

# Storage for NPV, failure stage, and all 13 sampled inputs
npv_results   <- numeric(N)
fail_stage    <- character(N)
sim_inputs    <- as.data.frame(matrix(NA, nrow = N, ncol = 13))
colnames(sim_inputs) <- c(
  paste0("Cost_",  c("RD","PC","CI","CII")),
  paste0("Time_",  c("RD","PC","CI","CII")),
  paste0("Prob_",  c("RD","PC","CI","CII")),
  "Profit"
)

for (i in seq_len(N)) {

  # ── Sample all uncertain inputs ──────────────────────────────
  costs  <- sapply(cost_params, function(p) rtriangle(1, p[1], p[2], p[3]))
  times  <- sapply(time_params, function(p) rtriangle(1, p[1], p[2], p[3]))
  probs  <- sapply(prob_params, function(p) rtriangle(1, p[1], p[2], p[3]))
  profit <- rtriangle(1, profit_params[1], profit_params[2], profit_params[3])

  # Store all inputs for sensitivity analysis
  sim_inputs[i, 1:4]  <- costs
  sim_inputs[i, 5:8]  <- times
  sim_inputs[i, 9:12] <- probs
  sim_inputs[i, 13]   <- profit

  # ── Sequential development process ───────────────────────────
  # t_cum[k] = start time of stage k (stage 1 begins at t=0)
  t_cum   <- cumsum(c(0, times))
  pv_cost <- 0
  npv     <- 0
  failed  <- FALSE

  for (k in 1:4) {
    t_start <- t_cum[k]
    # Discount stage-k cost back to t=0
    pv_cost <- pv_cost + costs[k] / (1 + r)^t_start

    # Bernoulli gate: success with sampled probability
    success <- runif(1) < probs[k]

    if (!success) {
      npv          <- -pv_cost
      fail_stage[i] <- stages[k]
      failed       <- TRUE
      break
    }
  }

  # If all stages succeed, add discounted profit
  if (!failed) {
    t_market      <- t_cum[5]   # Time when drug reaches market
    npv           <- -pv_cost + profit / (1 + r)^t_market
    fail_stage[i] <- "Success"
  }

  npv_results[i] <- npv
}

# ── Summary statistics ────────────────────────────────────────
npv_df <- data.frame(NPV = npv_results, Outcome = fail_stage)

quants <- quantile(npv_results, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
var_5  <- quants["5%"]
es_5   <- mean(npv_results[npv_results <= var_5])

summary_stats <- data.frame(
  Statistic = c("Mean", "Median", "5th Percentile (VaR₉₅)",
                "25th Percentile", "50th Percentile",
                "75th Percentile", "95th Percentile",
                "Std. Deviation", "Prob(NPV > 0)",
                "Expected Shortfall (5%)"),
  Value = c(
    round(mean(npv_results), 2),
    round(median(npv_results), 2),
    round(quants[1], 2),
    round(quants[2], 2),
    round(quants[3], 2),
    round(quants[4], 2),
    round(quants[5], 2),
    round(sd(npv_results), 2),
    round(mean(npv_results > 0) * 100, 1),
    round(es_5, 2)
  ),
  Units = c(rep("$M", 8), "%", "$M")
)

summary_stats %>%
  kbl(caption = "NPV Summary Statistics — 5,000 Monte Carlo Simulations") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(2, bold = TRUE)

NPV Summary Statistics — 5,000 Monte Carlo Simulations
Statistic	Value	Units
Mean	53.84	$M
Median	-80.53	$M
5th Percentile (VaR₉₅)	-290.93	$M
25th Percentile	-96.46	$M
50th Percentile	-80.53	$M
75th Percentile	-69.00	$M
95th Percentile	1265.55	$M
Std. Deviation	781.38	$M
Prob(NPV > 0)	6.20	%
Expected Shortfall (5%)	-606.07	$M

NPV Empirical Distribution

Code

ggplot(npv_df, aes(x = NPV)) +
  geom_histogram(aes(y = after_stat(density)), bins = 80,
                 fill = PAL2[1], alpha = 0.75, colour = "white") +
  geom_density(colour = PAL2[2], linewidth = 1) +
  geom_vline(xintercept = 0, linetype = "dashed",
             colour = "grey30", linewidth = 0.8) +
  geom_vline(xintercept = mean(npv_results), colour = PAL2[2],
             linetype = "solid", linewidth = 0.9) +
  annotate("text", x = mean(npv_results), y = Inf, vjust = 2,
           hjust = -0.1, size = 3.5, colour = PAL2[2],
           label = paste0("Mean = $", round(mean(npv_results), 0), "M")) +
  scale_x_continuous(labels = label_comma(prefix = "$", suffix = "M")) +
  labs(title    = "Empirical Distribution of NPV — NIAGARA Drug Development",
       subtitle = paste0("5,000 simulations | P(NPV > 0) = ",
                         round(mean(npv_results > 0) * 100, 1), "%"),
       x = "Net Present Value ($M)", y = "Density") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Value-at-Risk (VaR) Chart

Code

var_df <- data.frame(
  Quantile = seq(0.01, 0.99, by = 0.01),
  NPV      = quantile(npv_results, probs = seq(0.01, 0.99, by = 0.01))
)

ggplot(var_df, aes(x = Quantile, y = NPV)) +
  geom_line(colour = PAL2[1], linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed",
             colour = "grey30", linewidth = 0.8) +
  geom_ribbon(data = filter(var_df, NPV < 0),
              aes(ymin = NPV, ymax = 0), fill = PAL2[2], alpha = 0.25) +
  geom_ribbon(data = filter(var_df, NPV > 0),
              aes(ymin = 0, ymax = NPV), fill = PAL2[1], alpha = 0.2) +
  scale_x_continuous(labels = percent_format()) +
  scale_y_continuous(labels = label_comma(prefix = "$", suffix = "M")) +
  labs(title    = "Value-at-Risk Chart — NIAGARA NPV",
       subtitle = paste0("Red shading = loss region | 5% VaR = $",
                         round(var_5, 0), "M | ES(5%) = $",
                         round(es_5, 0), "M"),
       x = "Cumulative Probability", y = "NPV ($M)") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Outcome Distribution by Stage

Code

outcome_counts <- npv_df %>%
  count(Outcome) %>%
  mutate(Pct = round(n / N * 100, 1),
         Outcome = factor(Outcome,
                          levels = c(stages, "Success")))

outcome_counts %>%
  kbl(caption = "Simulation Outcomes by Stage",
      col.names = c("Outcome", "Count", "Frequency (%)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Simulation Outcomes by Stage
Outcome	Count	Frequency (%)
Clinical Phase I	370	7.4
Clinical Phase II	66	1.3
Pre-Clinical	895	17.9
R&D	3357	67.1
Success	312	6.2

Code

ggplot(outcome_counts, aes(x = Outcome, y = Pct, fill = Outcome)) +
  geom_col(alpha = 0.85, colour = "white", width = 0.65) +
  geom_text(aes(label = paste0(Pct, "%")), vjust = -0.4, size = 4) +
  scale_fill_manual(values = c(PAL4[1], PAL4[2], PAL4[3], PAL4[4], "darkgreen")) +
  labs(title    = "Simulation Outcomes — Drug Development Stages",
       subtitle = "What share of simulations fail at each stage vs. succeed?",
       x = NULL, y = "Frequency (%)", fill = NULL) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "none")

Sensitivity Analysis — Tornado Chart

Code

# Spearman rank correlations between each input and NPV
spearman_cors <- sapply(sim_inputs, function(x) cor(x, npv_results,
                                                      method = "spearman"))

tornado_df <- data.frame(
  Input       = names(spearman_cors),
  Correlation = spearman_cors
) %>%
  mutate(
    Label = c(
      "R&D Cost", "Pre-Clinical Cost", "Clinical I Cost", "Clinical II Cost",
      "R&D Time", "Pre-Clinical Time", "Clinical I Time", "Clinical II Time",
      "R&D Prob. Success", "Pre-Clinical Prob.", "Clinical I Prob.", "Clinical II Prob.",
      "Profit Once Launched"
    ),
    Direction = ifelse(Correlation > 0, "Positive", "Negative"),
    AbsCorr   = abs(Correlation)
  ) %>%
  arrange(desc(AbsCorr)) %>%
  mutate(Label = fct_reorder(Label, AbsCorr))

# Print top correlations as a table
tornado_df %>%
  arrange(desc(AbsCorr)) %>%
  select(Label, Correlation) %>%
  mutate(Correlation = round(Correlation, 4)) %>%
  kbl(caption = "Spearman Rank Correlations — All 13 Inputs vs. NPV",
      col.names = c("Input Variable", "Spearman ρ")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Spearman Rank Correlations — All 13 Inputs vs. NPV
	Input Variable	Spearman ρ
Cost_RD	R&D Cost	-0.6959
Time_RD	R&D Time	0.0345
Cost_CII	Clinical II Cost	-0.0208
Prob_CII	Clinical II Prob.	0.0178
Prob_CI	Clinical I Prob.	0.0091
Profit	Profit Once Launched	0.0077
Cost_PC	Pre-Clinical Cost	-0.0072
Prob_RD	R&D Prob. Success	-0.0060
Time_CI	Clinical I Time	-0.0048
Cost_CI	Clinical I Cost	0.0023
Time_CII	Clinical II Time	0.0020
Prob_PC	Pre-Clinical Prob.	-0.0011
Time_PC	Pre-Clinical Time	-0.0003

Code

ggplot(tornado_df, aes(x = Correlation, y = Label, fill = Direction)) +
  geom_col(alpha = 0.85, width = 0.7) +
  geom_vline(xintercept = 0, colour = "grey30", linewidth = 0.5) +
  scale_fill_manual(values = c("Negative" = PAL2[2], "Positive" = PAL2[1])) +
  labs(title    = "Tornado Chart — Sensitivity of NPV to Input Variables",
       subtitle = "Spearman rank correlation with NPV | 5,000 simulations",
       x = "Spearman Correlation with NPV", y = NULL, fill = NULL) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")

Discount Rate Sensitivity

The base case uses a 15% discount rate. Since drug development spans 10–20+ years, NPV is sensitive to the cost-of-capital assumption. Below we re-run the full simulation at 10%, 15%, and 20% to test whether the investment decision changes.

Code

r_grid     <- c(0.10, 0.15, 0.20)
dr_results <- data.frame(Rate = r_grid, Mean_NPV = NA, Median_NPV = NA,
                          P_Positive = NA, VaR_5 = NA)

for (j in seq_along(r_grid)) {
  set.seed(42)
  r_j   <- r_grid[j]
  npv_j <- numeric(N)

  for (i in seq_len(N)) {
    costs_i  <- sapply(cost_params, function(p) rtriangle(1, p[1], p[2], p[3]))
    times_i  <- sapply(time_params, function(p) rtriangle(1, p[1], p[2], p[3]))
    probs_i  <- sapply(prob_params, function(p) rtriangle(1, p[1], p[2], p[3]))
    profit_i <- rtriangle(1, profit_params[1], profit_params[2], profit_params[3])
    t_cum    <- cumsum(c(0, times_i))
    pv_cost  <- 0
    failed   <- FALSE
    for (k in 1:4) {
      pv_cost <- pv_cost + costs_i[k] / (1 + r_j)^t_cum[k]
      if (runif(1) >= probs_i[k]) { npv_j[i] <- -pv_cost; failed <- TRUE; break }
    }
    if (!failed) npv_j[i] <- -pv_cost + profit_i / (1 + r_j)^t_cum[5]
  }

  dr_results$Mean_NPV[j]   <- round(mean(npv_j), 1)
  dr_results$Median_NPV[j] <- round(median(npv_j), 1)
  dr_results$P_Positive[j] <- round(mean(npv_j > 0) * 100, 1)
  dr_results$VaR_5[j]      <- round(quantile(npv_j, 0.05), 1)
}

dr_results %>%
  mutate(Rate = paste0(Rate * 100, "%")) %>%
  kbl(caption = "Discount Rate Sensitivity — NPV Under Different Cost of Capital",
      col.names = c("Discount Rate", "Mean NPV ($M)", "Median NPV ($M)",
                    "P(NPV > 0) %", "5% VaR ($M)")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(2, bold = TRUE)  # Highlight base case

Discount Rate Sensitivity — NPV Under Different Cost of Capital
Discount Rate	Mean NPV ($M)	Median NPV ($M)	P(NPV > 0) %	5% VaR ($M)
10%	222.7	-81.0	6.2	-356.2
15%	53.8	-80.5	6.2	-290.9
20%	-24.4	-80.0	6.2	-245.3

Interpretation

Across 5,000 Monte Carlo simulations, the NIAGARA development project yields a mean NPV of $53.8 million and a median NPV of $-80.5 million. The substantial gap between mean and median — the mean is positive while the median is deeply negative — reflects extreme positive skew in the distribution. In the vast majority of simulation paths, the drug fails at some stage and Pfizer absorbs development costs; only the rare successful launches (generating multi-billion-dollar profits discounted over 13+ years) pull the mean upward.

Probability of profit: Only 6.2% of simulation paths produce a positive NPV, meaning Pfizer faces a roughly 93.8% chance of losing money on NIAGARA. This is consistent with the pharmaceutical industry’s high attrition rate.

Downside risk: At the 5th percentile (VaR₉₅), the NPV is $-290.9 million**. The **Expected Shortfall** — the average loss in the worst 5% of outcomes — is **$-606.1 million, representing the deep tail where the drug fails during late-stage clinical trials after significant capital has already been committed. In the most favourable 5% of outcomes (95th percentile), NPV reaches $1265.6 million, reflecting scenarios where all four stages succeed with below-average costs, fast timelines, and above-average profits.

Stage-gate failure analysis: The simulation reveals where NIAGARA is most likely to fail. 67.1% of simulations end at the R&D stage — the earliest and cheapest point of failure ($50M–$120M at risk). A further 17.9% fail at Pre-Clinical, 7.4% at Clinical Phase I, and 1.3% at Clinical Phase II. Only 6.2% of simulations reach successful market launch. The high R&D attrition rate is the dominant contributor to the negative median, while the Clinical Phase II failures — though rare — are the costliest individual outcomes because $3.5B–$6.0B has already been sunk.

Sensitivity analysis (Tornado chart): NPV is most sensitive to R&D Cost (Spearman ρ = -0.696), followed by R&D Time (ρ = 0.035) and Clinical II Cost (ρ = -0.021). Cost variables carry relatively modest negative correlations — the largest being Clinical Phase II Cost. This asymmetry makes economic sense: the project’s value is overwhelmingly determined by whether profits are realised and how large they are, rather than by the precise cost of development. R&D Probability of Success ranks near the bottom of sensitivity because an early R&D failure, while likely, saves Pfizer from all downstream costs — making it a cheap failure.

Discount rate sensitivity: The investment decision is sensitive to the cost-of-capital assumption. At a lower 10% rate, the heavier discounting of upfront costs and lighter discounting of distant profits shifts the mean NPV upward. At 20%, the reverse occurs, and the project becomes less attractive. Across all three scenarios, however, the probability of a positive NPV remains below 10%, confirming that NIAGARA is a high-probability-loss, high-expected-value proposition — the classic pharma option structure.

Model limitations: The simulation assumes all 13 uncertain inputs are independent. In reality, R&D failure may signal weak underlying science, which would lower downstream success probabilities. The triangular distributions are three-point estimates that may understate tail risk. The discount rate is fixed; in practice, the cost of capital may vary with the stage of development (earlier stages carry more uncertainty). These limitations are inherent to any NPV model and do not invalidate the analysis, but they should be kept in mind when interpreting the results.

From a capital allocation perspective, NIAGARA is a classic pharma option: the expected value is positive, but the probability of loss exceeds 90%. A risk-averse decision-maker might seek ways to reduce exposure to Clinical Phase II through partnerships, licensing agreements, or adaptive trial designs that allow earlier termination of unpromising compounds.

Part II — Time-Series Data Analytics: Bitcoin

Code

# ── Load and clean Bitcoin data ─────────────────────────────────
btc_raw <- read_csv("Y2026_Assignment3_BitcoinPriceData.csv",
                    show_col_types = FALSE)

# Clean column names
names(btc_raw) <- c("Date", "Open", "High", "Low", "Close", "Adj_Close", "Volume")

# Remove commas and convert to numeric
btc_raw <- btc_raw %>%
  mutate(across(c(Open, High, Low, Close, Adj_Close, Volume),
                ~as.numeric(gsub(",", "", .))))

# Parse date — format: "DD-Mon-YY" (R treats 00–68 as 2000–2068)
btc_raw$Date <- as.Date(btc_raw$Date, format = "%d-%b-%y")

# Sort chronologically (data arrives newest-first)
btc <- btc_raw %>%
  arrange(Date)

cat("Date range:", format(min(btc$Date)), "to", format(max(btc$Date)), "\n")

Date range: 2021-03-09 to 2026-03-09

Code

cat("Total observations:", nrow(btc), "\n")

Total observations: 1827

Code

cat("Missing values:", sum(is.na(btc)), "\n")

Missing values: 0

Q1: Data Preparation and Exploratory Analysis

Code

# Construct required variables
btc <- btc %>%
  mutate(
    Pt       = Close,
    Log_Pt   = log(Close),
    Rt       = 100 * c(NA, diff(log(Close))),        # Daily log return (%)
    Gt       = 100 * c(NA, diff(log(Volume)))         # Daily log volume growth (%)
  )

# Drop first row (no lag available for returns)
btc_clean <- btc %>% filter(!is.na(Rt))

cat("Observations after differencing:", nrow(btc_clean), "\n")

Observations after differencing: 1826

Code

head(btc_clean %>% select(Date, Pt, Log_Pt, Rt, Gt), 5) %>%
  mutate(across(where(is.numeric), ~round(.x, 4))) %>%
  kbl(caption = "Constructed Variables — First 5 Rows") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Constructed Variables — First 5 Rows
Date	Pt	Log_Pt	Rt	Gt
2021-03-10	56008.55	10.9333	2.1374	11.8120
2021-03-11	57805.12	10.9648	3.1573	-0.9174
2021-03-12	57332.09	10.9566	-0.8217	-1.9250
2021-03-13	61243.09	11.0226	6.5991	8.5647
2021-03-14	59302.32	10.9904	-3.2203	-32.3504

Code

p_price <- ggplot(btc_clean, aes(x = Date, y = Pt)) +
  geom_line(colour = PAL2[1], linewidth = 0.5) +
  scale_y_continuous(labels = label_comma(prefix = "$")) +
  labs(title = "Bitcoin Daily Closing Price",
       x = NULL, y = "Price (USD)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 11))

p_ret <- ggplot(btc_clean, aes(x = Date, y = Rt)) +
  geom_line(colour = PAL2[2], linewidth = 0.35, alpha = 0.85) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
  labs(title = expression("Daily Log Return  " * r[t] * " = 100 × ln(" * P[t] * "/" * P[t-1] * ")"),
       x = NULL, y = "Log Return (%)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 11))

p_vol <- ggplot(btc_clean, aes(x = Date, y = Volume / 1e9)) +
  geom_col(fill = PAL2[1], alpha = 0.6, width = 1) +
  labs(title = "Daily Trading Volume",
       x = NULL, y = "Volume (Billions USD)") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", size = 11))

grid.arrange(p_price, p_ret, p_vol, ncol = 1)

Q1(d) Commentary:

Stationarity of Bitcoin price: The closing price series ranges from $15,787 to $124,753 and is clearly non-stationary — it exhibits a strong upward drift with no tendency to revert to a fixed mean. Variance also changes substantially across sub-periods, indicating non-constant variance. A formal ADF test (reported in Q3) fails to reject the unit root null for prices (p = 0.6471), confirming non-stationarity.

Returns vs. prices: The daily log return series has a mean of 0.0122% and a standard deviation of 2.96%. It fluctuates around zero with no deterministic trend, appearing considerably more stationary than prices. However, it is not strictly stationary: the variance visibly clusters in certain periods — a hallmark of ARCH/GARCH effects. Taking first differences of log prices successfully removes the unit root and renders the process approximately mean-stationary (ADF p = 0.01).

Volatility over time: Volatility is clearly not constant. High-volatility episodes (large positive and negative returns in close succession) cluster together, followed by calmer periods. This volatility clustering — one of the most robust stylised facts of financial returns — strongly motivates conditional variance models such as GARCH.

Q2: Moving Average and Exponential Smoothing Forecasts

Code

# ── Moving Averages ─────────────────────────────────────────────
btc_fc <- btc_clean %>%
  mutate(
    MA5  = lag(zoo::rollmean(Pt, k = 5, fill = NA, align = "right"), 1),
    MA20 = lag(zoo::rollmean(Pt, k = 20, fill = NA, align = "right"), 1)
  )

# ── SES: α = 0.2 and α = 0.8 ──────────────────────────────────
# F_t = α·Y_{t-1} + (1-α)·F_{t-1}  (1-step-ahead forecast)
n_obs <- nrow(btc_clean)
ses_0.2 <- ses_0.8 <- numeric(n_obs)
ses_0.2[1] <- ses_0.8[1] <- btc_clean$Pt[1]  # Initial = first price

for (t in 2:n_obs) {
  ses_0.2[t] <- 0.2 * btc_clean$Pt[t - 1] + 0.8 * ses_0.2[t - 1]
  ses_0.8[t] <- 0.8 * btc_clean$Pt[t - 1] + 0.2 * ses_0.8[t - 1]
}

btc_fc <- btc_fc %>%
  mutate(SES_02 = ses_0.2,
         SES_08 = ses_0.8)

# ── Train / Test Split (last 20% = test) ──────────────────────
n_test  <- floor(0.2 * n_obs)
n_train <- n_obs - n_test

test_fc <- btc_fc %>% slice_tail(n = n_test)

cat("Train size:", n_train, "| Test size:", n_test, "\n")

Train size: 1461 | Test size: 365

Code

cat("Test period:", format(min(test_fc$Date)), "to",
    format(max(test_fc$Date)), "\n")

Test period: 2025-03-10 to 2026-03-09

Code

# ── Forecast accuracy metrics ─────────────────────────────────
fc_metrics <- function(actual, forecast, name) {
  e     <- actual - forecast
  valid <- !is.na(e)
  e     <- e[valid]; actual <- actual[valid]
  data.frame(
    Method = name,
    CFE    = round(sum(e), 2),
    MSE    = round(mean(e^2), 2),
    MAD    = round(mean(abs(e)), 2),
    MAPE   = round(mean(abs(e / actual)) * 100, 4)
  )
}

metrics_table <- bind_rows(
  fc_metrics(test_fc$Pt, test_fc$MA5,    "MA(5)"),
  fc_metrics(test_fc$Pt, test_fc$MA20,   "MA(20)"),
  fc_metrics(test_fc$Pt, test_fc$SES_02, "SES (α=0.2)"),
  fc_metrics(test_fc$Pt, test_fc$SES_08, "SES (α=0.8)")
)

metrics_table %>%
  kbl(caption = "Forecast Accuracy — Test Set (Last 20% of Observations)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(5, bold = TRUE)

Forecast Accuracy — Test Set (Last 20% of Observations)
Method	CFE	MSE	MAD	MAPE
MA(5)	-52663.67	8906077	2218.14	2.3228
MA(20)	-211054.04	29180788	4131.83	4.4162
SES (α=0.2)	-93373.46	11316452	2510.92	2.6464
SES (α=0.8)	-17175.38	4516203	1551.66	1.6430

Code

test_long <- test_fc %>%
  select(Date, Actual = Pt, MA5, MA20, SES_02, SES_08) %>%
  pivot_longer(-Date, names_to = "Method", values_to = "Price")

ggplot(test_long, aes(x = Date, y = Price, colour = Method,
                       linewidth = (Method == "Actual"))) +
  geom_line(alpha = 0.9) +
  scale_linewidth_manual(values = c("TRUE" = 1.0, "FALSE" = 0.55),
                          guide = "none") +
  scale_colour_manual(values = c("Actual" = "black", "MA5" = PAL4[1],
                                  "MA20" = PAL4[2], "SES_02" = PAL4[3],
                                  "SES_08" = PAL4[4]),
                      labels = c("Actual", "MA(5)", "MA(20)",
                                 "SES α=0.2", "SES α=0.8")) +
  scale_y_continuous(labels = label_comma(prefix = "$")) +
  labs(title    = "Forecast Comparison — Bitcoin Closing Price (Test Period)",
       x = NULL, y = "Price (USD)", colour = NULL) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")

Q2 Answer:

SES (α=0.8) achieves a MAPE of 1.643%, the lowest among all four methods. MA(5) follows at 2.3228%, SES(α=0.2) at 2.6464%, and MA(20) produces the worst performance at 4.4162%. The ranking is inversely proportional to smoothing horizon: shorter memory outperforms longer memory.

This is the signature of a near-random-walk process. Since Bitcoin’s price incorporates information rapidly — consistent with a liquid, high-frequency market — the best one-step-ahead forecast is simply the most recent observation. SES(α=0.8) approximates this by placing 80% weight on yesterday’s close and only 20% on the exponentially decaying history. MA(20), by contrast, averages over the last four trading weeks, introducing a structural lag that is penalised heavily during trending or mean-reverting markets.

The negative CFE across all four methods indicates systematic under-forecasting during the test period, consistent with an upward-trending Bitcoin price that smoothing methods trail behind. SES(α=0.8) has the smallest CFE magnitude, again because it tracks the most recent price most closely.

Q3: Autocorrelation and Stationarity

Code

ret_ts <- ts(btc_clean$Rt)

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
acf(ret_ts,  lag.max = 20, main = "ACF — Daily Log Returns",
    col = PAL2[1], lwd = 2)
pacf(ret_ts, lag.max = 20, main = "PACF — Daily Log Returns",
     col = PAL2[2], lwd = 2)

Code

par(mfrow = c(1, 1))

# Ljung-Box test for serial correlation
lb_test <- Box.test(ret_ts, lag = 20, type = "Ljung-Box")
cat("Ljung-Box test (lag 20):\n")

Ljung-Box test (lag 20):

Code

print(lb_test)


    Box-Ljung test

data:  ret_ts
X-squared = 37.118, df = 20, p-value = 0.01133

Code

# ADF test for stationarity of prices vs returns
adf_price  <- adf.test(btc_clean$Pt)
adf_return <- adf.test(btc_clean$Rt)
cat("\nADF test — Closing Price (p-value):",  round(adf_price$p.value, 4),  "\n")


ADF test — Closing Price (p-value): 0.6471

Code

cat("ADF test — Log Returns (p-value):",     round(adf_return$p.value, 4), "\n")

ADF test — Log Returns (p-value): 0.01

Q3 Answer:

ACF and PACF of returns: The ACF of daily log returns shows very little significant autocorrelation — nearly all autocorrelation coefficients fall within the ±1.96/√T confidence bands. The PACF similarly shows no strong partial autocorrelations. However, a small number of lags do exceed the band, suggesting very weak residual serial dependence.

The Ljung-Box Q-statistic at lag 20 is 37.12 with a p-value of 0.0113. At the 5% significance level, we reject the null hypothesis of no serial correlation, suggesting that there is a statistically detectable — though economically very small — amount of linear dependence in Bitcoin returns at this lag structure. The effect is likely too weak to be exploitable after transaction costs.

Process type: Based on the ACF/PACF patterns, Bitcoin returns most closely resemble near-white-noise with a very weak AR component. The individual autocorrelations are small in magnitude, and while collectively they produce a significant Ljung-Box statistic, no single lag dominates. This is consistent with the Efficient Market Hypothesis (weak form) — prices incorporate available information quickly, leaving only trace serial dependence in returns.

Non-stationarity of prices / stationarity of returns: The ADF test fails to reject the unit root for closing prices (p = 0.6471), but strongly rejects it for log returns (p = 0.01). This confirms that Bitcoin prices are I(1) (integrated of order 1) — non-stationary in level, but stationary after first-differencing on the log scale. Economically, this is the standard result: prices accumulate shocks indefinitely (they have no fixed mean to revert to), while returns — the period-to-period percentage changes — fluctuate around a near-zero mean.

Q4: ARIMA Forecasting

Code

log_price_ts <- ts(btc_clean$Log_Pt)
ret_ts_clean <- ts(btc_clean$Rt)

# Train-test split (80/20 on full series)
n_tot   <- length(log_price_ts)
n_te    <- floor(0.2 * n_tot)
n_tr    <- n_tot - n_te

log_train <- window(log_price_ts, end = n_tr)
log_test  <- window(log_price_ts, start = n_tr + 1)
ret_train <- window(ret_ts_clean, end = n_tr)

# ── Model 1: AR(1) on returns ─────────────────────────────────
ar1_fit <- arima(ret_train, order = c(1, 0, 0))
cat("=== AR(1) on Returns ===\n")

=== AR(1) on Returns ===

Code

print(ar1_fit)


Call:
arima(x = ret_train, order = c(1, 0, 0))

Coefficients:
          ar1  intercept
      -0.0389     0.0261
s.e.   0.0262     0.0778

sigma^2 estimated as 9.555:  log likelihood = -3721.88,  aic = 7449.75

Code

# ── Model 2: ARIMA(1,1,0) on log prices ──────────────────────
arima110_fit <- arima(log_train, order = c(1, 1, 0))
cat("\n=== ARIMA(1,1,0) on Log Prices ===\n")


=== ARIMA(1,1,0) on Log Prices ===

Code

print(arima110_fit)


Call:
arima(x = log_train, order = c(1, 1, 0))

Coefficients:
          ar1
      -0.0393
s.e.   0.0262

sigma^2 estimated as 0.0009559:  log likelihood = 3003.93,  aic = -6003.86

Code

# ── Model 3: ARIMA(0,1,1) on log prices ──────────────────────
arima011_fit <- arima(log_train, order = c(0, 1, 1))
cat("\n=== ARIMA(0,1,1) on Log Prices ===\n")


=== ARIMA(0,1,1) on Log Prices ===

Code

print(arima011_fit)


Call:
arima(x = log_train, order = c(0, 1, 1))

Coefficients:
          ma1
      -0.0375
s.e.   0.0255

sigma^2 estimated as 0.000956:  log likelihood = 3003.88,  aic = -6003.75

Code

# ── Rolling 1-step-ahead forecasts (fair comparison with MA/SES) ──
# Each test observation gets a genuine 1-step-ahead forecast from
# a model re-estimated on all data up to t−1.
actual_prices <- exp(as.numeric(log_test))

# ARIMA(1,1,0) rolling 1-step-ahead
arima110_1step <- numeric(n_te)
for (h in seq_len(n_te)) {
  train_window <- window(log_price_ts, end = n_tr + h - 1)
  fit_h <- tryCatch(
    arima(train_window, order = c(1, 1, 0)),
    error = function(e) arima(train_window, order = c(0, 1, 0))
  )
  arima110_1step[h] <- as.numeric(forecast(fit_h, h = 1)$mean)
}
pred_110_rolling <- exp(arima110_1step)

# AR(1) returns rolling 1-step-ahead
ar1_1step <- numeric(n_te)
ret_full  <- ts(btc_clean$Rt)
for (h in seq_len(n_te)) {
  ret_window <- window(ret_full, end = n_tr + h - 1)
  fit_h <- tryCatch(
    arima(ret_window, order = c(1, 0, 0)),
    error = function(e) list(coef = c(ar1 = 0, intercept = mean(ret_window)))
  )
  fc_ret <- as.numeric(forecast(fit_h, h = 1)$mean)
  # Convert return forecast to price forecast
  last_log <- as.numeric(log_price_ts)[n_tr + h - 1]
  ar1_1step[h] <- exp(last_log + fc_ret / 100)
}

arima_metrics_rolling <- bind_rows(
  fc_metrics(actual_prices, pred_110_rolling, "ARIMA(1,1,0) — Rolling"),
  fc_metrics(actual_prices, ar1_1step,        "AR(1) Returns — Rolling")
)

# Combined comparison table
bind_rows(metrics_table, arima_metrics_rolling) %>%
  kbl(caption = "Forecast Accuracy Comparison — All Methods (1-Step-Ahead, Test Set)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(5, bold = TRUE)

Forecast Accuracy Comparison — All Methods (1-Step-Ahead, Test Set)
Method	CFE	MSE	MAD	MAPE
MA(5)	-52663.67	8906077	2218.14	2.3228
MA(20)	-211054.04	29180788	4131.83	4.4162
SES (α=0.2)	-93373.46	11316452	2510.92	2.6464
SES (α=0.8)	-17175.38	4516203	1551.66	1.6430
ARIMA(1,1,0) — Rolling	-12359.63	4423401	1527.51	1.6218
AR(1) Returns — Rolling	-26028.70	4429285	1528.22	1.6229

Q4 Answer:

The AR(1) model on returns estimates an AR(1) coefficient of -0.0389 (s.e. = 0.0262, t = -1.48). This coefficient is very small in magnitude and marginally significant at best, indicating that yesterday’s return has almost no linear predictive power for today’s return. The ARIMA(1,1,0) and ARIMA(0,1,1) models on log prices yield similarly tiny AR and MA coefficients (−0.039 and −0.038 respectively), meaning these models collapse toward a random walk with drift — the simplest model of efficient pricing.

When comparing the rolling 1-step-ahead ARIMA forecasts against SES(α=0.8) on the test set, ARIMA(1,1,0) — Rolling achieves a MAPE of 1.6218%, which is close to but does not meaningfully outperform SES(α=0.8) at 1.643%. This confirms that past price movements provide virtually no additional predictive power for future Bitcoin returns beyond the naive “today’s price is tomorrow’s best estimate.”

This is consistent with the ACF/PACF analysis showing near-zero return autocorrelation, and with the weak-form Efficient Market Hypothesis applied to Bitcoin’s liquid, high-frequency market. For a practitioner, the implication is clear: directional trading strategies based on historical price patterns alone are unlikely to generate consistent alpha in Bitcoin.

Q5: Volume and Volatility Clustering

Code

btc_vol <- btc_clean %>%
  mutate(
    VP_t     = Rt^2,                          # Squared returns (volatility proxy)
    Rt_lag1  = lag(Rt),
    VP_lag1  = lag(VP_t),
    AbsRt_lag1 = lag(abs(Rt))
  ) %>%
  filter(!is.na(Rt_lag1))

Code

p_rt <- ggplot(btc_vol, aes(x = Date, y = Rt)) +
  geom_line(colour = PAL2[1], linewidth = 0.35, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
  labs(title = expression("Daily Log Returns  " * r[t]),
       x = NULL, y = expression(r[t] * " (%)")) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold", size = 10))

p_vp <- ggplot(btc_vol, aes(x = Date, y = VP_t)) +
  geom_line(colour = PAL2[2], linewidth = 0.35, alpha = 0.8) +
  labs(title = expression("Squared Returns  " * r[t]^2 * "  (Volatility Proxy)"),
       x = NULL, y = expression(r[t]^2)) +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold", size = 10))

grid.arrange(p_rt, p_vp, ncol = 1)

Code

par(mar = c(4, 4, 3, 1))
acf(btc_vol$VP_t, lag.max = 20,
    main = expression("ACF of Squared Returns " * r[t]^2 * " — Volatility Persistence"),
    col = PAL2[2], lwd = 2)

Code

# r²_t = γ0 + γ1 * r²_{t-1} + u_t
model1 <- lm(VP_t ~ VP_lag1, data = btc_vol)
cat("=== Regression: r²_t on r²_{t-1} ===\n")

=== Regression: r²_t on r²_{t-1} ===

Code

summary(model1)


Call:
lm(formula = VP_t ~ VP_lag1, data = btc_vol)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.323  -7.452  -6.367  -0.855 290.366 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.41520    0.53158   13.95  < 2e-16 ***
VP_lag1      0.15344    0.02314    6.63 4.41e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21 on 1823 degrees of freedom
Multiple R-squared:  0.02354,   Adjusted R-squared:  0.02301 
F-statistic: 43.96 on 1 and 1823 DF,  p-value: 4.408e-11

Code

# r²_t = δ0 + δ1*g_t + δ2*|r_{t-1}| + ε_t
btc_vol2 <- btc_vol %>% filter(!is.na(Gt))

model2 <- lm(VP_t ~ Gt + AbsRt_lag1, data = btc_vol2)
cat("=== Extended Regression: r²_t on g_t and |r_{t-1}| ===\n")

=== Extended Regression: r²_t on g_t and |r_{t-1}| ===

Code

summary(model2)


Call:
lm(formula = VP_t ~ Gt + AbsRt_lag1, data = btc_vol2)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.201  -8.311  -3.844   2.011 275.261 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.86599    0.65180   7.466 1.28e-13 ***
Gt           0.16930    0.01304  12.983  < 2e-16 ***
AbsRt_lag1   1.91561    0.22180   8.637  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.11 on 1822 degrees of freedom
Multiple R-squared:  0.1046,    Adjusted R-squared:  0.1036 
F-statistic: 106.4 on 2 and 1822 DF,  p-value: < 2.2e-16

Q5 Answer:

Volatility clustering (plots): The side-by-side plots of r_t and r²_t vividly illustrate volatility clustering. Calm periods — stretches of small |r_t| — alternate with turbulent episodes where large returns (positive and negative) cluster tightly together. The squared-return proxy VP_t spikes dramatically during these episodes, then decays gradually. This pattern is arguably the most universal stylised fact of financial returns.

ACF of r²_t: Unlike the ACF of raw returns (which showed near-zero autocorrelation), the ACF of squared returns shows significant positive autocorrelation at many lags, with the first several lags well above the significance band. This confirms that while return levels are unpredictable, return variance (risk) is persistent and forecastable. High volatility today predicts high volatility tomorrow.

Coefficient γ₁ (first regression): The estimated γ₁ is 0.1534 (t = 6.63, p 4.41e-11), which is positive and highly significant. This confirms autoregressive conditional heteroscedasticity: today’s squared return depends positively on yesterday’s squared return. Equivalently, a large return yesterday (regardless of sign) forecasts elevated variance today. The R² of 0.0235 is low — the model explains only a small share of variance variation — but this is typical: the key insight is that the coefficient is reliably non-zero, not that one lag captures all variance dynamics. This is precisely the core empirical motivation for ARCH models.

Coefficients δ₁ and δ₂ (extended regression): δ₁ (volume growth) = 0.1693 (t = 12.98) and δ₂ (lagged absolute return) = 1.9156 (t = 8.64), both highly significant with R² = 0.1046. The positive δ₁ captures the Mixture of Distributions Hypothesis: both volume and volatility are driven by the same latent information-arrival process — when new information hits the market, trading activity surges and price uncertainty spikes simultaneously. The positive δ₂ captures the persistence of past absolute return shocks in predicting current variance, analogous to the ARCH(1) term but using |r_{t-1}| instead of r²_{t-1}, which is more robust to outliers.

Together, these results confirm that a GARCH(1,1) model would be the natural next step. The GARCH(1,1) variance equation σ²_t = ω + α·r²_{t-1} + β·σ²_{t-1} incorporates both the shock (r²_{t-1}) and its own persistence (σ²_{t-1}), capturing the slow decay of volatility clusters that a simple AR(1) in squared returns cannot fully represent.

Synthesis: What Does This Mean for a Risk Manager?

Bitcoin returns are approximately unpredictable in level but highly predictable in variance. The near-zero return autocorrelations (Q3) and the inability of ARIMA to outperform a naive random-walk forecast (Q4) confirm that directional price forecasting based on historical patterns is unlikely to generate consistent profits — consistent with weak-form market efficiency.

However, the strong positive autocorrelation in squared returns (Q5), the significant γ₁ = 0.1534 in the ARCH regression, and the significant volume-volatility relationship (δ₁ = 0.1693) demonstrate that risk is forecastable. For a portfolio manager or risk officer, this means:

Position sizing can be adjusted dynamically: reduce exposure when conditional volatility is elevated, increase when it is low.
Value-at-Risk must be computed using conditional models (GARCH), not fixed historical standard deviation; a static VaR would understate risk during volatile periods and overstate it during calm ones.
Margin requirements for leveraged Bitcoin positions should be time-varying, reflecting the predictable cluster structure of volatility.

The inability to forecast price direction does not mean risk management is futile — it means that risk management, rather than return prediction, is where quantitative models add the most value in cryptocurrency markets.

Methodological Note: Why RStudio Instead of @RISK

This assignment was completed entirely in R/RStudio using Quarto for reproducible reporting. The choice was partly pragmatic and partly deliberate, and this section documents both dimensions honestly — as a methodological defence for the submission and as a personal record of what I learned about the relationship between the software tool and the underlying financial concepts.

The honest starting point

The course material for @RISK did not work for me. The software licence either did not activate correctly or was inaccessible in my environment, and I was unable to run the tool as the assignment originally envisioned. This is the practical origin of the decision: I needed to complete the assignment, @RISK was not available to me, and I had to find a path forward.

The path I chose was R. This was not a last resort — I was already comfortable in RStudio from the broader MBAN program — but it did require me to think carefully about what @RISK actually does and then replicate that functionality from first principles. In retrospect, that constraint turned out to be one of the most instructive parts of the assignment. Being forced to build the simulation engine myself — rather than clicking through a GUI — meant I had to understand every step: how a triangular distribution is sampled via the inverse-CDF, how stage-gate logic is encoded as a loop with an explicit break condition, how present value discounting interacts with a sequential success probability structure, and how the tornado chart is constructed from a Spearman rank correlation rather than a built-in function. The absence of @RISK made the underlying finance more legible, not less.

What @RISK actually does under the hood

@RISK is an Excel add-in that does three things: (1) it replaces cell values with random draws from specified distributions (triangular, normal, etc.), (2) it re-evaluates the spreadsheet thousands of times to build a distribution of outputs, and (3) it produces summary charts — histograms, tornado diagrams, percentile tables. Underneath the ribbon toolbar and the drag-and-drop interface, the computational engine is doing exactly what the for (i in 1:5000) loop in this report does: sample inputs, propagate them through a model, store the output, repeat.

The rtriangle() function on line 38 of this report is the R equivalent of =RiskTriang(min, mode, max). It uses the inverse-CDF method for a triangular distribution — the same mathematical transformation that @RISK applies internally. The difference is visibility: in @RISK, this transformation is hidden behind a function call in a cell; in R, the formula is written out in seven lines that anyone can audit, test, or modify.

What R gives me that @RISK does not

Reproducibility. When I set set.seed(42), every number in this report — every NPV draw, every stage-gate outcome, every Spearman correlation — is deterministically reproducible. Anyone who renders this .qmd file will get identical results. In @RISK, re-running a simulation produces different numbers each time unless the user manually locks the random seed in a settings menu that most users never touch.

Transparency of logic. The sequential stage-gate structure of the Pfizer NPV model — where failure at stage $k$ terminates all downstream costs — is expressed as a nested for loop with an explicit break on failure. In an @RISK spreadsheet, this same logic would be encoded in IF() statements chained across columns, with discounting formulas referencing cumulative time offsets in adjacent cells. The R version reads top-to-bottom as a narrative; the Excel version reads as a grid of cross-references that require tracing cell dependencies to understand.

Composability. Once I have the npv_results vector, I can pipe it into any diagnostic I want — a histogram, a VaR curve, an Expected Shortfall calculation, a discount-rate sensitivity loop, a Spearman correlation matrix for the tornado chart — without restructuring the spreadsheet. Adding the discount-rate sensitivity table in this report required wrapping the existing simulation in one more for loop over three values of r. In @RISK, this would require either duplicating the entire spreadsheet three times or using @RISK’s “RiskSimTable” function, which is less intuitive than a loop.

Integration with time-series analysis. Part II of this assignment — Bitcoin forecasting, ACF/PACF, ARIMA, ARCH regressions — cannot be done in @RISK at all. @RISK is a Monte Carlo simulation tool for spreadsheet models; it has no built-in time-series estimation, no arima(), no acf(), no lm() for regression. By working in R for both parts of the assignment, I used a single tool, a single language, and a single document. The alternative would have been @RISK for Part I and then switching to R or Python for Part II — fragmenting the analysis across two environments with no shared data pipeline.

What @RISK gives that I had to build myself

@RISK’s advantage is speed of setup for simple models. Clicking “Add Distribution → Triangular” on a cell is faster than writing rtriangle() from scratch. The tornado chart is one button click in @RISK; in R, I had to compute Spearman correlations, build a data frame, sort by absolute value, and construct the ggplot. @RISK also provides a polished GUI for exploring simulation results — sliders for percentiles, overlay distributions, scenario bookmarks — that would require building a Shiny app to replicate in R.

For a one-time analysis with a small number of uncertain inputs and no time-series component, @RISK is a perfectly reasonable tool. But the moment the analysis needs to be reproducible, extended, version-controlled, or combined with statistical modelling, R is the stronger choice.

What this exercise taught me about risk modelling

The most important thing I learned is that the risk analysis is not the software — it is the structure of the model and the quality of the interpretation. Whether I use =RiskTriang(50, 70, 120) in Excel or rtriangle(1, 50, 120, 70) in R, the underlying question is the same: what is the distribution of project value given uncertainty in costs, timelines, success probabilities, and profits?

The Monte Carlo simulation answers this question by mapping input uncertainty to output uncertainty. The tornado chart answers a second question: which input uncertainties matter most? The VaR and Expected Shortfall answer a third: how bad can it get? And the discount-rate sensitivity answers a fourth: how robust is the decision to the cost-of-capital assumption?

None of these questions require @RISK. They require:

A model that correctly represents the path-dependent, sequential nature of drug development (costs are incurred only if prior stages succeed)
A sampling engine that draws from the specified distributions (the rtriangle() function)
A loop that repeats the model thousands of times and stores the results
Summary statistics, visualisations, and sensitivity measures that translate the simulation output into decision-relevant information
Written interpretation that connects the numbers to the business question: should Pfizer invest in NIAGARA?

R provides all five. @RISK provides the first four with a lower setup cost but at the expense of transparency and extensibility. The quality of the risk analysis is determined by whether it quantifies uncertainty, identifies key risk drivers, communicates downside exposure, and supports a defensible investment decision — not by whether the software has a ribbon toolbar or a command line.

Summary comparison

Capability Comparison: @RISK vs. RStudio for This Assignment
Capability	@RISK / Excel	RStudio / R
Triangular distribution sampling	Built-in	rtriangle() — 7 lines
Monte Carlo simulation (5,000 iterations)	Built-in	for loop — standard R
Sequential stage-gate NPV logic	Cell formulas	Explicit loop with break
NPV percentiles and P(NPV > 0)	Built-in	quantile() + mean()
Empirical distribution histogram	Built-in	ggplot2
Value-at-Risk chart	Built-in	ggplot2 + ribbon
Expected Shortfall (CVaR)	Not built-in	One line of code
Tornado sensitivity chart	Built-in	Spearman + ggplot2
Discount rate sensitivity analysis	RiskSimTable	Outer loop over r_grid
Reproducibility (fixed random seed)	Manual seed setting	set.seed() — deterministic
Time-series forecasting (ARIMA, SES, MA)	Not available	forecast + arima()
ACF/PACF and stationarity testing	Not available	acf() + adf.test()
ARCH/GARCH regression diagnostics	Not available	lm() + summary()
Single-document integrated report	No (Excel + Word)	Yes (Quarto .qmd)
Version control (git-compatible)	No (binary .xlsx)	Yes (plain text)

End of Assignment 3 — MBAN 5570