Homework04

Introduction

The following RMD contains answers to exercises in chapter 3 of the Kuhn and Johnson book Applied Predictive Modeling textbook for CUNY SPS DATA 624 Spring 2025 http://appliedpredictivemodeling.com/. This chapter focuses on different tools of a forecaster. The exercises answered here include 3.1 and 3.2. ChatGPT was used to help solve some syntax errors in this markdown’s answers for question 2 (3.2).

library(MASS)  
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Q1

Prompt: Show that the F statistic (3.13) for dropping a single coefficient from a model is equal to the square of the corresponding z-score (3.12).

set.seed(42)  

# Generate synthetic data for 100 observations and get 2 independent normal distributions
N <- 100 
x1 <- rnorm(N) 
x2 <- rnorm(N) 

# Calculate a linear relationship with noise 
y <- 5 + 3*x1 + 2*x2 + rnorm(N)  

# Assign the linear model
full_model <- lm(y ~ x1 + x2)

# Assign the linear model of only x1
reduced_model <- lm(y ~ x1)

# Extract residual sum of squares (RSS)
RSS_full <- sum(residuals(full_model)^2)
RSS_reduced <- sum(residuals(reduced_model)^2)

# Count number of coefficients in each model, including intercepts
p_full <- length(coef(full_model)) 
p_reduced <- length(coef(reduced_model)) 

# Compute the F-statistic
numer <- (RSS_reduced - RSS_full) / (p_full - p_reduced)
denom <- (RSS_full / (N - p_full))
F_stat <- numer / denom

# Compute the corresponding z-score for x2
  # Estimated coefficient
beta_hat <- coef(full_model)["x2"] 
  # Standard error
se_beta <- summary(full_model)$coefficients["x2", "Std. Error"]
  # z-score
z_j <- beta_hat / se_beta

# Compare F-stat and squared z-score
cat("F-statistic: ", F_stat, "\n")

## F-statistic:  342.2059

cat("Squared z-score: ", z_j^2, "\n")

## Squared z-score:  342.2059

The code shows the F statistic for dropping a single coefficient from a model is equal to the square of the corresponding z-score. It fits a full and reduced model, computes the F-statistic from the RSS difference, and calculates the z-score from the coefficient’s standard error. Comparing both confirms the relationship F = z^2; 342.2 = 342.2.

Q2

Prompt: Given data on two variables X and Y, consider fitting a cubic polynomial regression model [see textbook]. In addition to plotting the fitted curve, you would like a 95% confidence band about the curve. Consider the following two approaches:

1. At each point x0, form a 95% confidence interval for the linear function [see textbook]

2. Form a 95% confidence set for β as in (3.15), which in turn generates confidence intervals for f(x0).

How do these approaches differ? Which band is likely to be wider? Conduct a small simulation experiment to compare the two methods.

Get data ready

set.seed(42)   

# Generate synthetic data for 200 observations
N2 <- 200 

# Generate the predictor variable PX3 with normal distribution
PX3 <- rnorm(N2)

# Define the true coefficients for the cubic polynomial model (intercept, coef for x, x2, x3)
beta0 <- 5
beta1 <- 2
beta2 <- 1.5
beta3 <- -0.5 

# Generate the response variable Y with some random noise
Y <- beta0 + beta1 * PX3 + beta2 * PX3^2 + beta3 * PX3^3 + rnorm(N2)

1: Confidence interval for linear function

# Fit and check the cubic polynomial regression
model_1 <- lm(Y ~ poly(PX3, 3))
summary(model_1)

## 
## Call:
## lm(formula = Y ~ poly(PX3, 3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.56628 -0.62943 -0.08741  0.60960  2.44617 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.47827    0.06697  96.739   <2e-16 ***
## poly(PX3, 3)1   2.22759    0.94705   2.352   0.0197 *  
## poly(PX3, 3)2  33.47965    0.94705  35.351   <2e-16 ***
## poly(PX3, 3)3 -14.27951    0.94705 -15.078   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9471 on 196 degrees of freedom
## Multiple R-squared:  0.8832, Adjusted R-squared:  0.8814 
## F-statistic: 494.2 on 3 and 196 DF,  p-value: < 2.2e-16

# Construct a sequence of x0 values for prediction
x0_values_1 <- seq(min(PX3), max(PX3), length.out = 100)

# Construct matrix for the cubic polynomial
X0_matrix <- cbind(1, x0_values_1, x0_values_1^2, x0_values_1^3) 

# Get predictions with confidence intervals for the fitted model
predictions_1 <- predict(model_1, newdata = data.frame(PX3 = x0_values_1), interval = "confidence", level = 0.95)

# Construct the confidence interval for each x
plot(PX3, Y, pch = 16, col = "lightblue", main = "95% Confidence Interval for Given Linear Function")
lines(x0_values_1, predictions_1[, "fit"], col = "salmon", lwd = 2)
matlines(x0_values_1, predictions_1[, c("lwr", "upr")], col = "darkgray", lty = 2)

The above code predicts values for the response variable (Y) using a cubic regression model and calculates 95% confidence intervals for each prediction. It then plots the original data points, the predicted curve, and the confidence intervals around the curve. The results show how the predicted values of Y change across x0 and the range of uncertainty around those predictions.

2: Confidence interval for β

# Fit and check a cubic polynomial model with orthogonal polynomials
model_2 <- lm(Y ~ PX3 + I(PX3^2) + I(PX3^3))
summary(model_2)

## 
## Call:
## lm(formula = Y ~ PX3 + I(PX3^2) + I(PX3^3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.56628 -0.62943 -0.08741  0.60960  2.44617 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.00488    0.08200   61.04   <2e-16 ***
## PX3          1.80688    0.11146   16.21   <2e-16 ***
## I(PX3^2)     1.50982    0.05130   29.43   <2e-16 ***
## I(PX3^3)    -0.46033    0.03053  -15.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9471 on 196 degrees of freedom
## Multiple R-squared:  0.8832, Adjusted R-squared:  0.8814 
## F-statistic: 494.2 on 3 and 196 DF,  p-value: < 2.2e-16

# Extract the estimated coefficients and the covariance matrix
beta_hat_2 <- coef(model_2)
cov_matrix_2 <- vcov(model_2)

# Define the critical value for the 95% confidence region from chi-squared
p_2 <- length(beta_hat_2)
alpha_2 <- 0.05
chi_squared_critical_2 <- qchisq(1 - alpha_2, df = p_2)

# Construct the confidence set for beta using the given formula
sigma_hat_sq_2 <- sum(residuals(model_2)^2) / (N2 - p_2) 

# Generate a range of x0 values for prediction
x0_values_2 <- seq(min(PX3), max(PX3), length.out = 100)
X0_matrix_2 <- cbind(1, x0_values_2, x0_values_2^2, x0_values_2^3)

# Predicted values for f(x0) and their variance
f_x0_2 <- X0_matrix_2 %*% beta_hat_2
var_f_x0_2 <- diag(X0_matrix_2 %*% cov_matrix_2 %*% t(X0_matrix_2))

# Calculate the confidence intervals for f(x0)
lower_bound_2 <- f_x0_2 - sqrt(chi_squared_critical_2 * var_f_x0_2)
upper_bound_2 <- f_x0_2 + sqrt(chi_squared_critical_2 * var_f_x0_2)

# Plot the original data, the fitted curve, and the confidence intervals
plot(PX3, Y, pch = 16, col = "lightblue", main = "95% Confidence Interval for Entire Parameter Vector Beta")
lines(x0_values_2, f_x0_2, col = "salmon", lwd = 2) 
matlines(x0_values_2, cbind(lower_bound_2, upper_bound_2), col = "darkgray", lty = 2)

The code fits a cubic polynomial model to the data and calculates the 95% confidence intervals for the predictions. It then plots the original data, the fitted curve, and the confidence intervals, showing the uncertainty in the predictions across the x0 range.

How do these approaches differ? Which band is likely to be wider?

The approaches differ in how they calculate the confidence intervals.

For approach 1, uncertainty is calculated for each value of x in the summation range (0 to 3). Each point is given an interval for how confident we are in the prediction for the Y at each given value.

In approach 2, the uncertainty is calculated on the entire parameter vector (Beta). The entire regression function is what determines the confidence interval.

The wider band is likely to be approach 2’s confidence interval, as it includes variability of the entire model on top of the variability at each point. That is, the confidence interval for the regression function at each point is based on how much uncertainty there is in the coefficients themselves, as well as how those uncertainties propagate through the model. This uncertainty in the overall model is not considered in approach 1, making a smaller confidence interval in that approach 1.

Conduct a small simulation experiment to compare the two methods.

# Width of confidence intervals for Approach 1
width_1 <- predictions_1[, "upr"] - predictions_1[, "lwr"]
avg_width_1 <- mean(width_1)

# Width of confidence intervals for Approach 2
width_2 <- upper_bound_2 - lower_bound_2
avg_width_2 <- mean(width_2)

# Print average widths
cat("Average width for Approach 1:", avg_width_1, "\n")

## Average width for Approach 1: 0.8102816

cat("Average width for Approach 2:", avg_width_2, "\n")

## Average width for Approach 2: 1.265549

Approach 2, as expected, has a higher calculated average width of the confidence interval.