The following RMD contains answers to exercises in chapter 3 of the Kuhn and Johnson book Applied Predictive Modeling textbook for CUNY SPS DATA 624 Spring 2025 http://appliedpredictivemodeling.com/. This chapter focuses on different tools of a forecaster. The exercises answered here include 3.1 and 3.2. ChatGPT was used to help solve some syntax errors in this markdown’s answers for question 2 (3.2).
library(MASS)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Prompt: Show that the F statistic (3.13) for dropping a single coefficient from a model is equal to the square of the corresponding z-score (3.12).
set.seed(42)
# Generate synthetic data for 100 observations and get 2 independent normal distributions
N <- 100
x1 <- rnorm(N)
x2 <- rnorm(N)
# Calculate a linear relationship with noise
y <- 5 + 3*x1 + 2*x2 + rnorm(N)
# Assign the linear model
full_model <- lm(y ~ x1 + x2)
# Assign the linear model of only x1
reduced_model <- lm(y ~ x1)
# Extract residual sum of squares (RSS)
RSS_full <- sum(residuals(full_model)^2)
RSS_reduced <- sum(residuals(reduced_model)^2)
# Count number of coefficients in each model, including intercepts
p_full <- length(coef(full_model))
p_reduced <- length(coef(reduced_model))
# Compute the F-statistic
numer <- (RSS_reduced - RSS_full) / (p_full - p_reduced)
denom <- (RSS_full / (N - p_full))
F_stat <- numer / denom
# Compute the corresponding z-score for x2
# Estimated coefficient
beta_hat <- coef(full_model)["x2"]
# Standard error
se_beta <- summary(full_model)$coefficients["x2", "Std. Error"]
# z-score
z_j <- beta_hat / se_beta
# Compare F-stat and squared z-score
cat("F-statistic: ", F_stat, "\n")
## F-statistic: 342.2059
cat("Squared z-score: ", z_j^2, "\n")
## Squared z-score: 342.2059
The code shows the F statistic for dropping a single coefficient from a model is equal to the square of the corresponding z-score. It fits a full and reduced model, computes the F-statistic from the RSS difference, and calculates the z-score from the coefficient’s standard error. Comparing both confirms the relationship F = z^2; 342.2 = 342.2.
Prompt: Given data on two variables X and Y, consider fitting a cubic polynomial regression model [see textbook]. In addition to plotting the fitted curve, you would like a 95% confidence band about the curve. Consider the following two approaches:
1. At each point x0, form a 95% confidence interval for the linear function [see textbook]
2. Form a 95% confidence set for β as in (3.15), which in turn generates confidence intervals for f(x0).
How do these approaches differ? Which band is likely to be wider? Conduct a small simulation experiment to compare the two methods.
set.seed(42)
# Generate synthetic data for 200 observations
N2 <- 200
# Generate the predictor variable PX3 with normal distribution
PX3 <- rnorm(N2)
# Define the true coefficients for the cubic polynomial model (intercept, coef for x, x2, x3)
beta0 <- 5
beta1 <- 2
beta2 <- 1.5
beta3 <- -0.5
# Generate the response variable Y with some random noise
Y <- beta0 + beta1 * PX3 + beta2 * PX3^2 + beta3 * PX3^3 + rnorm(N2)
# Fit and check the cubic polynomial regression
model_1 <- lm(Y ~ poly(PX3, 3))
summary(model_1)
##
## Call:
## lm(formula = Y ~ poly(PX3, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56628 -0.62943 -0.08741 0.60960 2.44617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.47827 0.06697 96.739 <2e-16 ***
## poly(PX3, 3)1 2.22759 0.94705 2.352 0.0197 *
## poly(PX3, 3)2 33.47965 0.94705 35.351 <2e-16 ***
## poly(PX3, 3)3 -14.27951 0.94705 -15.078 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9471 on 196 degrees of freedom
## Multiple R-squared: 0.8832, Adjusted R-squared: 0.8814
## F-statistic: 494.2 on 3 and 196 DF, p-value: < 2.2e-16
# Construct a sequence of x0 values for prediction
x0_values_1 <- seq(min(PX3), max(PX3), length.out = 100)
# Construct matrix for the cubic polynomial
X0_matrix <- cbind(1, x0_values_1, x0_values_1^2, x0_values_1^3)
# Get predictions with confidence intervals for the fitted model
predictions_1 <- predict(model_1, newdata = data.frame(PX3 = x0_values_1), interval = "confidence", level = 0.95)
# Construct the confidence interval for each x
plot(PX3, Y, pch = 16, col = "lightblue", main = "95% Confidence Interval for Given Linear Function")
lines(x0_values_1, predictions_1[, "fit"], col = "salmon", lwd = 2)
matlines(x0_values_1, predictions_1[, c("lwr", "upr")], col = "darkgray", lty = 2)
The
above code predicts values for the response variable (Y) using a cubic
regression model and calculates 95% confidence intervals for each
prediction. It then plots the original data points, the predicted curve,
and the confidence intervals around the curve. The results show how the
predicted values of Y change across x0 and the range of uncertainty
around those predictions.
# Fit and check a cubic polynomial model with orthogonal polynomials
model_2 <- lm(Y ~ PX3 + I(PX3^2) + I(PX3^3))
summary(model_2)
##
## Call:
## lm(formula = Y ~ PX3 + I(PX3^2) + I(PX3^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56628 -0.62943 -0.08741 0.60960 2.44617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.00488 0.08200 61.04 <2e-16 ***
## PX3 1.80688 0.11146 16.21 <2e-16 ***
## I(PX3^2) 1.50982 0.05130 29.43 <2e-16 ***
## I(PX3^3) -0.46033 0.03053 -15.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9471 on 196 degrees of freedom
## Multiple R-squared: 0.8832, Adjusted R-squared: 0.8814
## F-statistic: 494.2 on 3 and 196 DF, p-value: < 2.2e-16
# Extract the estimated coefficients and the covariance matrix
beta_hat_2 <- coef(model_2)
cov_matrix_2 <- vcov(model_2)
# Define the critical value for the 95% confidence region from chi-squared
p_2 <- length(beta_hat_2)
alpha_2 <- 0.05
chi_squared_critical_2 <- qchisq(1 - alpha_2, df = p_2)
# Construct the confidence set for beta using the given formula
sigma_hat_sq_2 <- sum(residuals(model_2)^2) / (N2 - p_2)
# Generate a range of x0 values for prediction
x0_values_2 <- seq(min(PX3), max(PX3), length.out = 100)
X0_matrix_2 <- cbind(1, x0_values_2, x0_values_2^2, x0_values_2^3)
# Predicted values for f(x0) and their variance
f_x0_2 <- X0_matrix_2 %*% beta_hat_2
var_f_x0_2 <- diag(X0_matrix_2 %*% cov_matrix_2 %*% t(X0_matrix_2))
# Calculate the confidence intervals for f(x0)
lower_bound_2 <- f_x0_2 - sqrt(chi_squared_critical_2 * var_f_x0_2)
upper_bound_2 <- f_x0_2 + sqrt(chi_squared_critical_2 * var_f_x0_2)
# Plot the original data, the fitted curve, and the confidence intervals
plot(PX3, Y, pch = 16, col = "lightblue", main = "95% Confidence Interval for Entire Parameter Vector Beta")
lines(x0_values_2, f_x0_2, col = "salmon", lwd = 2)
matlines(x0_values_2, cbind(lower_bound_2, upper_bound_2), col = "darkgray", lty = 2)
The
code fits a cubic polynomial model to the data and calculates the 95%
confidence intervals for the predictions. It then plots the original
data, the fitted curve, and the confidence intervals, showing the
uncertainty in the predictions across the x0 range.
The approaches differ in how they calculate the confidence intervals.
For approach 1, uncertainty is calculated for each value of x in the summation range (0 to 3). Each point is given an interval for how confident we are in the prediction for the Y at each given value.
In approach 2, the uncertainty is calculated on the entire parameter vector (Beta). The entire regression function is what determines the confidence interval.
The wider band is likely to be approach 2’s confidence interval, as it includes variability of the entire model on top of the variability at each point. That is, the confidence interval for the regression function at each point is based on how much uncertainty there is in the coefficients themselves, as well as how those uncertainties propagate through the model. This uncertainty in the overall model is not considered in approach 1, making a smaller confidence interval in that approach 1.
# Width of confidence intervals for Approach 1
width_1 <- predictions_1[, "upr"] - predictions_1[, "lwr"]
avg_width_1 <- mean(width_1)
# Width of confidence intervals for Approach 2
width_2 <- upper_bound_2 - lower_bound_2
avg_width_2 <- mean(width_2)
# Print average widths
cat("Average width for Approach 1:", avg_width_1, "\n")
## Average width for Approach 1: 0.8102816
cat("Average width for Approach 2:", avg_width_2, "\n")
## Average width for Approach 2: 1.265549
Approach 2, as expected, has a higher calculated average width of the confidence interval.