Question 1

(a) Fitted Regression Equation

model1 <- lm(Y ~ x1)
summary(model1)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 40652.8077  367.18172 110.71577 1.865144e-61
## x1           -515.9282   12.05658 -42.79225 4.671062e-41

Answer: \(\hat{Y} = 4.0652808\times 10^{4} + -515.9282 \times x_1\)

(b) R² Value

summary(model1)$r.squared
## [1] 0.9734209

Answer: \(R^2 = 0.9734\)

This indicates that 97.34% of the variation in Y is explained by x₁.

(c) Residual Plot

plot(residuals(model1), type = "o", pch = 19, col = "blue",
     main = "Residuals vs Time Order",
     xlab = "Observation Number (Time Order)", 
     ylab = "Residuals")
abline(h = 0, col = "red", lty = 2, lwd = 2)

Problem: The residual plot shows a systematic pattern where residuals cluster together by sign (positive residuals followed by positive residuals, then negative residuals followed by negative residuals). This indicates autocorrelation (serial correlation) in the errors, which violates the independence assumption of linear regression.

(d) Runs Test

Hypotheses:

  • \(H_0\): The residuals are randomly distributed (no serial correlation)
  • \(H_1\): The residuals are not randomly distributed (serial correlation exists)
  • Significance level: \(\alpha = 0.01\)
runs_result <- runs.test(residuals(model1))
runs_result
## 
##  Runs Test
## 
## data:  residuals(model1)
## statistic = -6.7226, runs = 3, n1 = 26, n2 = 26, n = 52, p-value =
## 1.785e-11
## alternative hypothesis: nonrandomness

Test Statistic: \(Z = -6.7226\)

P-value: \(0.0000000000178465\)

Conclusion: Since the p-value (0.0000000000178465) is less than \(\alpha = 0.01\), we reject the null hypothesis at the 0.01 significance level. There is sufficient evidence to conclude that the residuals are not randomly distributed, indicating the presence of autocorrelation in the errors.

(e) Durbin-Watson Test

Hypotheses:

  • \(H_0\): \(\rho = 0\) (no first-order autocorrelation in the residuals)
  • \(H_1\): \(\rho \neq 0\) (first-order autocorrelation exists in the residuals)
  • Significance level: \(\alpha = 0.01\)
dw_result <- dwtest(model1, alternative = "two.sided")
dw_result
## 
##  Durbin-Watson test
## 
## data:  model1
## DW = 0.11683, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is not 0

Durbin-Watson Statistic: \(DW = 0.1168\)

P-value: \(0.000000000000000000000000000377435\)

Interpretation: The DW statistic of 0.1168 is much less than 2, indicating positive autocorrelation.

Conclusion: Since the p-value (0.000000000000000000000000000377435) is less than \(\alpha = 0.01\), we reject the null hypothesis at the 0.01 significance level. There is sufficient evidence to conclude that first-order autocorrelation exists in the residuals. The DW statistic indicates positive autocorrelation.

Question 2

(a) Fitted Regression Equation

model2a <- lm(y ~ n)
summary(model2a)$coefficients
##              Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -1.805191 2.04724607 -0.8817654 3.789728e-01
## n            1.060624 0.03519549 30.1352171 6.736090e-76

Answer: \(\hat{y} = -1.8052 + 1.0606 \times n\)

(b) Residual Plot

plot(fitted(model2a), rstandard(model2a), pch = 19, col = "blue",
     main = "Standardized Residuals vs Fitted Values",
     xlab = "Fitted Values", 
     ylab = "Standardized Residuals")
abline(h = 0, col = "red", lty = 2, lwd = 2)
abline(h = c(-2, 2), col = "orange", lty = 2)

Problem: The residual plot shows a funnel (megaphone) shape where the spread of residuals increases as fitted values increase. This indicates heteroscedasticity (non-constant variance), which violates the assumption of constant error variance (homoscedasticity) in linear regression.

(c) Square-Root Transformation

model2c <- lm(sqrt(y) ~ n)
summary(model2c)
## 
## Call:
## lm(formula = sqrt(y) ~ n)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59428 -0.62398  0.04937  0.64028  2.03262 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.612745   0.138339   18.89   <2e-16 ***
## n           0.081519   0.002378   34.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9709 on 198 degrees of freedom
## Multiple R-squared:  0.8558, Adjusted R-squared:  0.8551 
## F-statistic:  1175 on 1 and 198 DF,  p-value: < 2.2e-16

The model has been fitted with \(\sqrt{y}\) as the response variable to stabilize the variance.

(d) Fitted Equation

beta0_c <- coef(model2c)[1]
beta1_c <- coef(model2c)[2]

On transformed scale: \(\sqrt{\hat{y}} = 2.6127 + 0.0815 \times n\)

On original scale: \(\hat{y} = \left(2.6127 + 0.0815 \times n\right)^2\)

(e) Fitted Value at n = 40 (Original Scale)

n_value <- 40
pred_sqrt <- predict(model2c, newdata = data.frame(n = n_value))
pred_original <- pred_sqrt^2

Calculation:

Step 1: Calculate on \(\sqrt{y}\) scale
\(\sqrt{\hat{y}} = 2.6127 + 0.0815 \times 40 = 5.8735\)

Step 2: Square to get original scale
\(\hat{y} = \left(5.8735\right)^2 = 34.4983\)

Answer: When \(n = 40\), \(\hat{y} = 34.4983\)

(f) Log Transformation with Second-Order Term

model2f <- lm(log(y) ~ n + I(n^2))
summary(model2f)
## 
## Call:
## lm(formula = log(y) ~ n + I(n^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4744 -0.1900  0.0220  0.2386  0.7751 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.5104037  0.0768787   19.65   <2e-16 ***
## n            0.0674594  0.0035136   19.20   <2e-16 ***
## I(n^2)      -0.0003820  0.0000337  -11.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3552 on 197 degrees of freedom
## Multiple R-squared:  0.8619, Adjusted R-squared:  0.8605 
## F-statistic: 614.8 on 2 and 197 DF,  p-value: < 2.2e-16

A polynomial regression model with log-transformed response has been fitted: \(\log(y) = \beta_0 + \beta_1 n + \beta_2 n^2 + \epsilon\)

(g) Fitted Equation in Original Scale

beta0_f <- coef(model2f)[1]
beta1_f <- coef(model2f)[2]
beta2_f <- coef(model2f)[3]

On log scale:
\(\log(\hat{y}) = 1.510404 + 0.067459 \times n + -3.82\times 10^{-4} \times n^2\)

On original scale:
\(\hat{y} = \exp\left(1.510404 + 0.067459 \times n + -3.82\times 10^{-4} \times n^2\right)\)

Or equivalently:
\(\hat{y} = e^{1.510404 + 0.067459 \times n + -3.82\times 10^{-4} \times n^2}\)


Selected R Code

Question 1

# Load data and fit model
load("dataQ1.RData")
model1 <- lm(Y ~ x1)

# 1(a) & 1(b)
summary(model1)

# 1(c) Residual plot
plot(residuals(model1), type="o")
abline(h=0, col="red", lty=2)

# 1(d) Runs test
library(randtests)
runs.test(residuals(model1))

# 1(e) Durbin-Watson test
library(lmtest)
dwtest(model1, alternative="two.sided")

Question 2

# Load data and fit model
load("dataQ2.RData")
model2a <- lm(y ~ n)

# 2(a) & 2(b)
summary(model2a)
plot(fitted(model2a), rstandard(model2a))
abline(h=0, col="red", lty=2)

# 2(c) & 2(d) Square-root transformation
model2c <- lm(sqrt(y) ~ n)
summary(model2c)

# 2(e) Prediction at n=40
predict(model2c, newdata=data.frame(n=40))^2

# 2(f) & 2(g) Log polynomial
model2f <- lm(log(y) ~ n + I(n^2))
summary(model2f)