Question 1

a Generate data with \(X_1 \sim N\left(\mu = 3, \sigma = 0.5\right)\), sample size \(n = 200\), and error term \(\epsilon \sim N\left(\mu = 0, \sigma = 0.5\right)\).

\(E(Y \mid X) = 10 + 5X - 2X^2\)

set.seed(456)

n <- 200                                   # sample size
X <- rnorm(n, mean = 3, sd = 0.5)          # predictor: X ~ N(3, 0.5)
epsilon <- rnorm(n, mean = 0, sd = 0.5)    # error term: N(0, 0.5)
mu <- 10 + 5*X - 2*X^2                     # true mean E(Y|X)
Y <- mu + epsilon                          # observed response
sim_data <- data.frame(X = X, Y = Y)

fit <- lm(Y ~ X, data = sim_data)
summary(fit)
## 
## Call:
## lm(formula = Y ~ X, data = sim_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44551 -0.34767  0.05012  0.40626  1.48673 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.19825    0.29680   91.64   <2e-16 ***
## X           -6.87626    0.09758  -70.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6823 on 198 degrees of freedom
## Multiple R-squared:  0.9617, Adjusted R-squared:  0.9615 
## F-statistic:  4965 on 1 and 198 DF,  p-value: < 2.2e-16
plot(Y ~ X, data = sim_data, pch = 19, col = "darkgray",
     main = "Simulated Data with Fitted Linear Regression")

abline(fit, col = "blue", lwd = 2)
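
As an optional sanity check, the true mean function can be overlaid on the scatterplot; the gap between the dashed curve and the fitted line previews the misspecification examined in part b). A minimal sketch, run right after the plot call above:

xs <- seq(min(sim_data$X), max(sim_data$X), length.out = 200)
lines(xs, 10 + 5*xs - 2*xs^2, col = "red", lwd = 2, lty = 2)   # true E(Y|X)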

b Fit a simple linear regression using just X. What is the estimated regression equation? Please conduct model estimation, inference, and residual diagnostics. What do you conclude?

set.seed(456)

# Recreate the same data as in part a)
n <- 200
X <- rnorm(n, mean = 3, sd = 0.5)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
mu <- 10 + 5*X - 2*X^2
Y <- mu + epsilon

sim_data <- data.frame(X, Y)

fit_lin <- lm(Y ~ X, data = sim_data)

summary(fit_lin)
## 
## Call:
## lm(formula = Y ~ X, data = sim_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44551 -0.34767  0.05012  0.40626  1.48673 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.19825    0.29680   91.64   <2e-16 ***
## X           -6.87626    0.09758  -70.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6823 on 198 degrees of freedom
## Multiple R-squared:  0.9617, Adjusted R-squared:  0.9615 
## F-statistic:  4965 on 1 and 198 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(fit_lin)

par(mfrow = c(1,1))

Inference

  1. Coefficients: The estimated regression equation is \(\hat{Y} = 27.198 - 6.876X\). Both the intercept and the slope are highly statistically significant (p < 0.001). This reflects the strong association between X and Y, but the model is misspecified: the true mean is quadratic in X.

  2. Model Fit: \(R^2 = 0.9617\) (adjusted \(R^2 = 0.9615\)), which is very high. But a high \(R^2\) does not mean that the model is correctly specified.

Residual Diagnostics Conclusion

  1. Residuals vs Fitted: apparent curvature in the pattern → linearity violated. The linear model cannot capture the quadratic term.
  2. Q-Q Plot: Mild deviations → errors approximately normal.
  3. Scale-Location Plot: Slight funnel shape → some heteroscedasticity.
  4. Residuals vs Leverage: No extreme leverage points.
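
The curvature flagged in the Residuals vs Fitted plot can also be checked formally. A minimal sketch using Ramsey's RESET test from the lmtest package (loaded in part c); a small p-value indicates that powers of the fitted values improve the model, i.e., that the linear specification is inadequate:

library(lmtest)
resettest(fit_lin, power = 2, type = "fitted")   # H0: linear specification is adequate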

c Update the model from part b) by adding a quadratic term. Conduct model estimation, inference, and residual diagnostics. What do you conclude? Does this model seem to fit the data better? Please explain.

library(lmtest)
## Warning: package 'lmtest' was built under R version 4.5.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.5.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
set.seed(456)

# 1. Recreate the simulated data (same as before)
n <- 200
X <- rnorm(n, mean = 3, sd = 0.5)
epsilon <- rnorm(n, mean = 0, sd = 0.5)
mu <- 10 + 5*X - 2*X^2
Y <- mu + epsilon
sim_data <- data.frame(X = X, Y = Y)

# 2. Fit quadratic model
fit_quad <- lm(Y ~ X + I(X^2), data = sim_data)

# 3. Model summary (coefficients, SE, t, p; R-squared)
summary(fit_quad)
## 
## Call:
## lm(formula = Y ~ X + I(X^2), data = sim_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18735 -0.33781  0.00291  0.31019  1.28472 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.7606     1.0171  11.563  < 2e-16 ***
## X             3.7335     0.6885   5.423 1.71e-07 ***
## I(X^2)       -1.7731     0.1145 -15.480  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4595 on 197 degrees of freedom
## Multiple R-squared:  0.9827, Adjusted R-squared:  0.9825 
## F-statistic:  5595 on 2 and 197 DF,  p-value: < 2.2e-16
# 4. Compare linear vs quadratic (nested models)
fit_lin <- lm(Y ~ X, data = sim_data)
anova(fit_lin, fit_quad)    # F-test for added quadratic term
## Analysis of Variance Table
## 
## Model 1: Y ~ X
## Model 2: Y ~ X + I(X^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    198 92.185                                  
## 2    197 41.591  1    50.594 239.65 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 5. AIC comparison
AIC(fit_lin, fit_quad)
##          df      AIC
## fit_lin   3 418.6707
## fit_quad  4 261.4871
# 6. Residual diagnostics: standard plots + tests
par(mfrow = c(2,2))
plot(fit_quad)             # 4 default diagnostic plots

par(mfrow = c(1,1))

# Breusch-Pagan test for heteroscedasticity
bptest(fit_quad)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit_quad
## BP = 2.9551, df = 2, p-value = 0.2282
# Shapiro-Wilk test for normality of residuals (note: with n=200 the test is sensitive)
shapiro.test(resid(fit_quad))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit_quad)
## W = 0.99631, p-value = 0.9152
# Check variance inflation (multicollinearity not an issue here, but just in case)
vif(fit_quad)
##        X   I(X^2) 
## 109.7781 109.7781

Explanation & conclusions

Estimated equation (from summary(fit_quad)): \(\hat{Y} = 11.761 + 3.734X - 1.773X^2\). The coefficient estimates are very close to the true generating values (10, 5, and -2).

All three coefficients are highly significant (very small p-values), as expected since we simulated the data from this quadratic form.

Diagnostics:

Residuals versus fitted plot: the curvature present in the linear model has disappeared; the residuals now scatter randomly around zero, so the quadratic form captures the systematic structure.

Normal Q-Q: the residuals are approximately normally distributed, with only minor discrepancies due to randomness. The Shapiro–Wilk test (W = 0.996, p = 0.915) gives no evidence against normality; with n = 200 this test can detect even minor departures, so the large p-value is reassuring.

Scale-Location: no pronounced funneling; the variance is approximately constant. The Breusch–Pagan test (p = 0.228) does not reject homoscedasticity, much weaker evidence of heteroscedasticity than the linear model shows.

Leverage/Influence: no alarming high-leverage or influential points in this simulated sample.

VIF: the VIFs are large (about 110) because X and X^2 are strongly correlated by construction; this is expected for raw polynomial terms and does not undermine the fitted mean function or its predictions (see the centering sketch below).
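
If the large VIFs are a concern, a standard remedy is to center X before squaring (or to use poly(X, 2)). A minimal sketch; this reparametrization leaves the fitted values unchanged:

sim_data$Xc <- sim_data$X - mean(sim_data$X)      # centered predictor
fit_quad_c <- lm(Y ~ Xc + I(Xc^2), data = sim_data)
vif(fit_quad_c)                                   # expect VIFs near 1
all.equal(fitted(fit_quad), fitted(fit_quad_c))   # same fit, different basis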

The addition of the quadratic term corrects the misspecification in the simple linear model. The quadratic model:

Greatly improves the fit (R-squared rises from 0.962 to 0.983; AIC drops from 418.7 to 261.5),

Removes the curvature pattern from the residuals, and

Produces residuals that more nearly satisfy the usual linear regression assumptions of approximate normality and homoscedasticity. Thus, the quadratic model fits the data far better and is the appropriate specification for these simulated data.
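
A quick visual comparison, using the objects fitted above, makes the improvement concrete:

plot(Y ~ X, data = sim_data, pch = 19, col = "darkgray",
     main = "Linear vs Quadratic Fit")
abline(fit_lin, col = "blue", lwd = 2)
xs <- seq(min(sim_data$X), max(sim_data$X), length.out = 200)
lines(xs, predict(fit_quad, newdata = data.frame(X = xs)), col = "red", lwd = 2)
legend("topleft", legend = c("linear fit", "quadratic fit"),
       col = c("blue", "red"), lwd = 2, bty = "n")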

d Generate Y data using \(E(\sqrt{Y} \mid X) = 10 + 5X - 2X^2\). Fit the data with the model in part c with the quadratic term, i.e., lm(Y ~ X + I(X^2)). Does the assumption of constant error variance appear to be violated?

set.seed(456)

# 1. simulate X and error on sqrt(Y) scale
n <- 200
X <- rnorm(n, mean = 3, sd = 0.5)
sigma_eps <- 0.5
eps <- rnorm(n, mean = 0, sd = sigma_eps)

# 2. mean on sqrt(Y) scale
mu_sqrtY <- 10 + 5*X - 2*X^2

# 3. generate sqrt(Y) and then Y
sqrtY <- mu_sqrtY + eps
Y <- sqrtY^2         # ensure Y >= 0

sim_data <- data.frame(X = X, Y = Y, sqrtY = sqrtY, mu_sqrtY = mu_sqrtY)

# 4. Fit quadratic model on Y (user asked to fit lm(Y ~ X + I(X^2)))
fit_Y <- lm(Y ~ X + I(X^2), data = sim_data)
summary(fit_Y)
## 
## Call:
## lm(formula = Y ~ X + I(X^2), data = sim_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.874  -4.964  -1.585   4.283  25.547 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  519.237     18.520   28.04   <2e-16 ***
## X           -230.635     12.537  -18.40   <2e-16 ***
## I(X^2)        24.646      2.086   11.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.367 on 197 degrees of freedom
## Multiple R-squared:  0.9619, Adjusted R-squared:  0.9615 
## F-statistic:  2485 on 2 and 197 DF,  p-value: < 2.2e-16
# 5. Residual diagnostics
par(mfrow = c(2,2))
plot(fit_Y)          # includes Residuals vs Fitted, QQ, Scale-Location, Residuals vs Leverage

par(mfrow = c(1,1))

# 6. Formal test for heteroscedasticity (Breusch-Pagan)
if (!requireNamespace("lmtest", quietly = TRUE)) install.packages("lmtest")
library(lmtest)
bptest(fit_Y)        # H0: constant variance
## 
##  studentized Breusch-Pagan test
## 
## data:  fit_Y
## BP = 44.104, df = 2, p-value = 2.649e-10
# 7. Optional: show how var(Y|X) depends on X (theoretical check)
# Compute sample residual variance by bins of fitted values
sim_data$fitted <- fitted(fit_Y)
sim_data$resid <- resid(fit_Y)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sim_data %>%
  mutate(bin = cut(fitted, breaks = quantile(fitted, probs = seq(0,1,by=0.1)), include.lowest = TRUE)) %>%
  group_by(bin) %>%
  summarise(mean_fitted = mean(fitted),
            resid_var = var(resid),
            n = n()) -> var_by_bin
print(var_by_bin)
## # A tibble: 10 × 4
##    bin          mean_fitted resid_var     n
##    <fct>              <dbl>     <dbl> <int>
##  1 [-13.2,7.03]       -1.78     52.4     20
##  2 (7.03,16.2]        10.7       2.67    20
##  3 (16.2,23.9]        19.9      18.4     20
##  4 (23.9,39.5]        33.0      14.9     20
##  5 (39.5,50.7]        45.0      57.0     20
##  6 (50.7,60]          55.3      43.8     20
##  7 (60,72.9]          67.0     102.      20
##  8 (72.9,92.9]        82.2      51.8     20
##  9 (92.9,116]        103.       99.1     20
## 10 (116,178]         136.      181.      20

Yes. The constant-variance assumption is violated. Because the data were generated on the \(\sqrt{Y}\) scale, \(Y = (\mu(X) + \epsilon)^2\) with \(\epsilon \sim N(0, \sigma^2)\), so \(\mathrm{Var}(Y \mid X) = 4\mu(X)^2\sigma^2 + 2\sigma^4\), which grows with \(\mu(X)^2\). This shows up as:

A clear pattern of non-random curvature and increasing spread in the Residuals vs Fitted and Scale-Location plots,

A strongly significant Breusch–Pagan test (BP = 44.1, p = 2.6e-10) that rejects homoscedasticity, and

Binned residual variances that broadly increase with the fitted values (see the table above).

So OLS on Y with the quadratic term has heteroscedastic residuals.
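
To make the mechanism concrete, here is a small sketch plotting the theoretical conditional variance from the derivation above against the fitted values; it reuses sigma_eps and the mu_sqrtY and fitted columns created earlier:

var_theory <- 4 * sim_data$mu_sqrtY^2 * sigma_eps^2 + 2 * sigma_eps^4
plot(sim_data$fitted, var_theory, pch = 19, col = "darkgray",
     xlab = "Fitted values", ylab = "Theoretical Var(Y | X)",
     main = "Conditional Variance of Y Grows with the Mean")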

e Now repeat part d but transform Y with MASS::boxcox. What lambda value do you choose? Does this model seem to fit the data better?

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
set.seed(456)

# Recreate simulated data
n <- 200
X <- rnorm(n, mean = 3, sd = 0.5)
eps <- rnorm(n, mean = 0, sd = 0.5)
mu_sqrtY <- 10 + 5*X - 2*X^2
sqrtY <- mu_sqrtY + eps
Y <- sqrtY^2
sim_data <- data.frame(X, Y)

# 1. Box–Cox to choose lambda
bc <- boxcox(Y ~ X + I(X^2), data = sim_data)

lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
## [1] 0.6262626
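# Sketch: approximate 95% CI for lambda from the profile log-likelihood
# (keep lambdas whose log-likelihood is within qchisq(0.95, 1)/2 of the max)
range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1)/2])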
# 2. Transform Y using chosen lambda
if (abs(lambda_hat) < 0.001) {
  Y_bc <- log(Y)
} else {
  Y_bc <- (Y^lambda_hat - 1) / lambda_hat
}

sim_data$Y_bc <- Y_bc

# 3. Fit quadratic model on transformed Y
fit_bc <- lm(Y_bc ~ X + I(X^2), data = sim_data)
summary(fit_bc)
## 
## Call:
## lm(formula = Y_bc ~ X + I(X^2), data = sim_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3279 -1.2993 -0.3936  1.0609 12.4534 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79.0740     4.5433  17.405  < 2e-16 ***
## X           -21.7226     3.0756  -7.063 2.74e-11 ***
## I(X^2)        0.2640     0.5117   0.516    0.606    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.053 on 197 degrees of freedom
## Multiple R-squared:  0.9598, Adjusted R-squared:  0.9594 
## F-statistic:  2354 on 2 and 197 DF,  p-value: < 2.2e-16
# 4. Diagnostics
par(mfrow = c(2,2))
plot(fit_bc)

par(mfrow = c(1,1))

The boxcox profile peaks at \(\hat{\lambda} \approx 0.63\) (the grid maximum used in the code above). We choose the nearby interpretable value \(\lambda = 0.5\), i.e., the square-root transformation. This makes sense because the true model was generated on the square-root scale (\(\sqrt{Y}\)).

Yes, the transformed model fits better: the residual plots show more nearly constant variance, and the model assumptions are more closely satisfied. (Note that on the transformed scale the quadratic term is no longer significant, with p = 0.606.)
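
As a final check (a sketch; output not shown), one could rerun the Breusch–Pagan test on the transformed fit; the evidence against constant variance should be much weaker than the p = 2.6e-10 seen for fit_Y:

bptest(fit_bc)   # lmtest is already loaded from part c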