set.seed(456)
n <- 200
X <- rnorm(n, mean = 3, sd = 0.5)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
mu <- 10 + 5*X - 2*X^2
Y <- mu + epsilon
sim_data <- data.frame(X = X, Y = Y)
fit <- lm(Y ~ X, data = sim_data)
summary(fit)
##
## Call:
## lm(formula = Y ~ X, data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44551 -0.34767 0.05012 0.40626 1.48673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.19825 0.29680 91.64 <2e-16 ***
## X -6.87626 0.09758 -70.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6823 on 198 degrees of freedom
## Multiple R-squared: 0.9617, Adjusted R-squared: 0.9615
## F-statistic: 4965 on 1 and 198 DF, p-value: < 2.2e-16
plot(Y ~ X, data = sim_data, pch = 19, col = "darkgray",
main = "Simulated Data with Fitted Linear Regression")
abline(fit, col = "blue", lwd = 2)
# Refit the same model under the name used in the diagnostics below
# (data unchanged; summary(fit_lin) would reproduce the output shown above)
fit_lin <- lm(Y ~ X, data = sim_data)
par(mfrow = c(2,2))
plot(fit_lin)
par(mfrow = c(1,1))
Coefficients: Both the slope and the intercept are highly statistically significant (p < 0.001). This is what happens when there is a strong relationship between X and Y but the model is misspecified: statistical significance says nothing about whether the functional form is right.
Model fit: R-squared is very high (about 0.96 here), but a high R-squared does not mean the model is correctly specified.
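A quick way to see the misspecification directly is to plot the linear fit's residuals against X; a minimal sketch using the objects created above:
# Residuals of the misspecified linear fit, plotted against X:
# the omitted quadratic term shows up as a clear arch.
plot(sim_data$X, resid(fit_lin), pch = 19, col = "darkgray",
     xlab = "X", ylab = "Residuals",
     main = "Residuals of the Linear Fit vs X")
abline(h = 0, lty = 2)
lines(lowess(sim_data$X, resid(fit_lin)), col = "red", lwd = 2)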
library(lmtest)
library(car)
set.seed(456)
# 1. Recreate the simulated data (same as before)
n <- 200
X <- rnorm(n, mean = 3, sd = 0.5)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
mu <- 10 + 5*X - 2*X^2
Y <- mu + epsilon
sim_data <- data.frame(X = X, Y = Y)
# 2. Fit quadratic model
fit_quad <- lm(Y ~ X + I(X^2), data = sim_data)
# 3. Model summary (coefficients, SE, t, p; R-squared)
summary(fit_quad)
##
## Call:
## lm(formula = Y ~ X + I(X^2), data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18735 -0.33781 0.00291 0.31019 1.28472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.7606 1.0171 11.563 < 2e-16 ***
## X 3.7335 0.6885 5.423 1.71e-07 ***
## I(X^2) -1.7731 0.1145 -15.480 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4595 on 197 degrees of freedom
## Multiple R-squared: 0.9827, Adjusted R-squared: 0.9825
## F-statistic: 5595 on 2 and 197 DF, p-value: < 2.2e-16
# 4. Compare linear vs quadratic (nested models)
fit_lin <- lm(Y ~ X, data = sim_data)
anova(fit_lin, fit_quad) # F-test for added quadratic term
## Analysis of Variance Table
##
## Model 1: Y ~ X
## Model 2: Y ~ X + I(X^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 198 92.185
## 2 197 41.591 1 50.594 239.65 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 5. AIC comparison
AIC(fit_lin, fit_quad)
## df AIC
## fit_lin 3 418.6707
## fit_quad 4 261.4871
# 6. Residual diagnostics: standard plots + tests
par(mfrow = c(2,2))
plot(fit_quad) # 4 default diagnostic plots
par(mfrow = c(1,1))
# Breusch-Pagan test for heteroscedasticity
bptest(fit_quad)
##
## studentized Breusch-Pagan test
##
## data: fit_quad
## BP = 2.9551, df = 2, p-value = 0.2282
# Shapiro-Wilk test for normality of residuals (note: with n=200 the test is sensitive)
shapiro.test(resid(fit_quad))
##
## Shapiro-Wilk normality test
##
## data: resid(fit_quad)
## W = 0.99631, p-value = 0.9152
# Check variance inflation (multicollinearity not an issue here, but just in case)
vif(fit_quad)
## X I(X^2)
## 109.7781 109.7781
Estimated equation (from summary(fit_quad)): the fitted mean function is approximately Y-hat = 11.76 + 3.73*X - 1.77*X^2, reasonably close to the true generating function 10 + 5*X - 2*X^2.
All three coefficients are highly significant (very small p-values), as expected since the data were simulated from this quadratic form.
Residuals versus Fitted: the curvature present in the linear model has disappeared; the residuals now scatter randomly around zero, so the quadratic term captures the systematic structure.
Normal Q-Q: the residuals are approximately normally distributed, with only minor deviations due to randomness; shapiro.test() does not reject normality here (p = 0.92), though with n = 200 the test can detect even small departures.
Scale-Location: no pronounced funneling, so the variance is approximately constant; bptest (Breusch-Pagan) does not reject homoscedasticity here (p = 0.23), far weaker evidence than for the linear model.
Leverage/Influence: no alarming high-leverage, influential points in this simulated sample.
vif: the values are large (about 110 for both terms) because X and X^2 are strongly correlated by construction; this is expected for raw polynomial terms and does not invalidate the fitted curve. Centering X before squaring removes most of the collinearity, as the sketch below shows.
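A minimal sketch of the centering point (assuming the objects above are still in the workspace):
# Refit with a centered predictor: for roughly symmetric X,
# cor(Xc, Xc^2) is near zero, so the VIFs should drop to about 1.
sim_data$Xc <- sim_data$X - mean(sim_data$X)
fit_quad_c <- lm(Y ~ Xc + I(Xc^2), data = sim_data)
vif(fit_quad_c)  # both values should now be close to 1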
The addition of the quadratic term corrects the misspecification in the simple linear model. The quadratic model:
Greatly improves the fit (higher R-squared and a much lower AIC: 261.5 vs 418.7),
Removes the curvature pattern from the residuals, and
Produces residuals that much more nearly satisfy the usual linear regression assumptions of approximate normality and homoscedasticity. Thus the quadratic model fits the data far better and is the appropriate specification for these simulated data.
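To visualize the improvement, the fitted quadratic mean function can be overlaid on the scatterplot; a minimal sketch:
# Overlay the fitted quadratic mean function on the data
plot(Y ~ X, data = sim_data, pch = 19, col = "darkgray",
     main = "Quadratic Fit Captures the Curvature")
xx <- seq(min(sim_data$X), max(sim_data$X), length.out = 200)
lines(xx, predict(fit_quad, newdata = data.frame(X = xx)),
      col = "blue", lwd = 2)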
set.seed(456)
# 1. simulate X and error on sqrt(Y) scale
n <- 200
X <- rnorm(n, mean = 3, sd = 0.5)
sigma_eps <- 0.5
eps <- rnorm(n, mean = 0, sd = sigma_eps)
# 2. mean on sqrt(Y) scale
mu_sqrtY <- 10 + 5*X - 2*X^2
# 3. generate sqrt(Y) and then Y
sqrtY <- mu_sqrtY + eps
Y <- sqrtY^2 # ensure Y >= 0
sim_data <- data.frame(X = X, Y = Y, sqrtY = sqrtY, mu_sqrtY = mu_sqrtY)
# 4. Fit the quadratic model on the raw Y scale, lm(Y ~ X + I(X^2))
fit_Y <- lm(Y ~ X + I(X^2), data = sim_data)
summary(fit_Y)
##
## Call:
## lm(formula = Y ~ X + I(X^2), data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.874 -4.964 -1.585 4.283 25.547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 519.237 18.520 28.04 <2e-16 ***
## X -230.635 12.537 -18.40 <2e-16 ***
## I(X^2) 24.646 2.086 11.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.367 on 197 degrees of freedom
## Multiple R-squared: 0.9619, Adjusted R-squared: 0.9615
## F-statistic: 2485 on 2 and 197 DF, p-value: < 2.2e-16
# 5. Residual diagnostics
par(mfrow = c(2,2))
plot(fit_Y) # includes Residuals vs Fitted, QQ, Scale-Location, Residuals vs Leverage
par(mfrow = c(1,1))
# 6. Formal test for heteroscedasticity (Breusch-Pagan)
if (!requireNamespace("lmtest", quietly = TRUE)) install.packages("lmtest")
library(lmtest)
bptest(fit_Y) # H0: constant variance
##
## studentized Breusch-Pagan test
##
## data: fit_Y
## BP = 44.104, df = 2, p-value = 2.649e-10
# 7. Optional: show how var(Y|X) depends on X (theoretical check)
# Compute sample residual variance by bins of fitted values
sim_data$fitted <- fitted(fit_Y)
sim_data$resid <- resid(fit_Y)
library(dplyr)
sim_data %>%
mutate(bin = cut(fitted, breaks = quantile(fitted, probs = seq(0,1,by=0.1)), include.lowest = TRUE)) %>%
group_by(bin) %>%
summarise(mean_fitted = mean(fitted),
resid_var = var(resid),
n = n()) -> var_by_bin
print(var_by_bin)
## # A tibble: 10 × 4
## bin mean_fitted resid_var n
## <fct> <dbl> <dbl> <int>
## 1 [-13.2,7.03] -1.78 52.4 20
## 2 (7.03,16.2] 10.7 2.67 20
## 3 (16.2,23.9] 19.9 18.4 20
## 4 (23.9,39.5] 33.0 14.9 20
## 5 (39.5,50.7] 45.0 57.0 20
## 6 (50.7,60] 55.3 43.8 20
## 7 (60,72.9] 67.0 102. 20
## 8 (72.9,92.9] 82.2 51.8 20
## 9 (92.9,116] 103. 99.1 20
## 10 (116,178] 136. 181. 20
Yes. The constant-variance assumption is violated. Because the data were generated on the square-root scale, Y = (mu(X) + eps)^2, and for eps ~ N(0, sigma^2) this gives Var(Y | X) = 4*mu(X)^2*sigma^2 + 2*sigma^4, so the spread of Y grows with mu(X)^2. This is seen as:
A clear pattern (non-random curvature and increasing spread) in the Residuals vs Fitted and Scale-Location plots, and
A significant Breusch-Pagan test (p = 2.6e-10) that rejects homoscedasticity.
So OLS on Y with the quadratic term has heteroscedastic residuals.
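As a small illustrative sketch of that derivation (using mu_sqrtY and sigma_eps from the simulation chunk above), the theoretical conditional variance can be plotted against the conditional mean:
# Theoretical moments implied by Y = (mu + eps)^2 with eps ~ N(0, sigma^2):
# E[Y | X] = mu^2 + sigma^2 and Var(Y | X) = 4*mu^2*sigma^2 + 2*sigma^4,
# so the conditional variance is a linear function of the conditional mean.
sim_data$EY       <- sim_data$mu_sqrtY^2 + sigma_eps^2
sim_data$theo_var <- 4 * sim_data$mu_sqrtY^2 * sigma_eps^2 + 2 * sigma_eps^4
plot(sim_data$EY, sim_data$theo_var, pch = 19, col = "darkgray",
     xlab = "E[Y | X]", ylab = "Theoretical Var(Y | X)",
     main = "Conditional Variance Grows with the Conditional Mean")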
library(MASS)
set.seed(456)
# Recreate simulated data
n <- 200
X <- rnorm(n, mean = 3, sd = 0.5)
eps <- rnorm(n, mean = 0, sd = 0.5)
mu_sqrtY <- 10 + 5*X - 2*X^2
sqrtY <- mu_sqrtY + eps
Y <- sqrtY^2
sim_data <- data.frame(X, Y)
# 1. Box–Cox to choose lambda
bc <- boxcox(Y ~ X + I(X^2), data = sim_data)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
## [1] 0.6262626
# 2. Transform Y using chosen lambda
if (abs(lambda_hat) < 0.001) {
Y_bc <- log(Y)
} else {
Y_bc <- (Y^lambda_hat - 1) / lambda_hat
}
sim_data$Y_bc <- Y_bc
# 3. Fit quadratic model on transformed Y
fit_bc <- lm(Y_bc ~ X + I(X^2), data = sim_data)
summary(fit_bc)
##
## Call:
## lm(formula = Y_bc ~ X + I(X^2), data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3279 -1.2993 -0.3936 1.0609 12.4534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.0740 4.5433 17.405 < 2e-16 ***
## X -21.7226 3.0756 -7.063 2.74e-11 ***
## I(X^2) 0.2640 0.5117 0.516 0.606
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.053 on 197 degrees of freedom
## Multiple R-squared: 0.9598, Adjusted R-squared: 0.9594
## F-statistic: 2354 on 2 and 197 DF, p-value: < 2.2e-16
# 4. Diagnostics
par(mfrow = c(2,2))
plot(fit_bc)
par(mfrow = c(1,1))
The chosen lambda is about 0.63, close to 0.5. This makes sense because the true model was generated on the square-root scale (sqrt(Y)).
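Note that boxcox() evaluates the profile log-likelihood on a grid of lambda values (by default spanning -2 to 2), so 0.626 is simply the grid point closest to the optimum; if desired, a finer grid can be supplied near the first estimate, e.g.:
# Zoom in on a finer lambda grid around the first estimate
bc_fine <- boxcox(Y ~ X + I(X^2), data = sim_data,
                  lambda = seq(0.4, 0.9, by = 0.005))
lambda_fine <- bc_fine$x[which.max(bc_fine$y)]
lambda_fine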
Yes, the transformed model fits better: the residual plots show much more nearly constant variance, and the usual model assumptions are more closely satisfied.
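As a quick formal follow-up (output not shown), the Breusch-Pagan test can be rerun on the transformed fit; if the transformation has stabilized the variance, it should no longer reject as decisively as it did for fit_Y:
# Breusch-Pagan test on the Box-Cox-transformed model
bptest(fit_bc)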