Course Code: STAT 3206 | Course Title: Regression Analysis-II (Lab)
Byers and Williams studied the impact of temperature (the regressor) on the viscosity (the response) of toluene-teraline blends. The following data are for blends with a 0.4 molar fraction of toluene.
| Temperature | 24.9 | 35.0 | 44.9 | 55.1 | 65.2 | 75.2 | 85.2 | 95.2 |
|---|---|---|---|---|---|---|---|---|
| Viscosity | 1.133 | 0.9772 | 0.8532 | 0.7550 | 0.6723 | 0.6021 | 0.5420 | 0.5074 |
Temperature <- c(24.9, 35.0, 44.9, 55.1, 65.2, 75.2, 85.2, 95.2)
Viscosity <- c(1.133, 0.9772, 0.8532, 0.7550, 0.6723, 0.6021, 0.5420, 0.5074)
data <- data.frame(Temperature, Viscosity)
head(data)## Temperature Viscosity
## 1 24.9 1.1330
## 2 35.0 0.9772
## 3 44.9 0.8532
## 4 55.1 0.7550
## 5 65.2 0.6723
## 6 75.2 0.6021
Does it seem likely that a straight line model will be adequate?
plot(Temperature, Viscosity,
main = "Scatterplot of Temperature vs Viscosity",
xlab = "Temperature",
ylab = "Viscosity",
pch = 16,
col = "blue")
abline(lm(Viscosity ~ Temperature, data = data), col = "red", lwd = 2)
lines(smooth.spline(Temperature, Viscosity), col = "green", lwd = 2)
legend("topright",
legend = c("Linear Fit", "Smooth Spline"),
col = c("red", "green"),
lwd = 2)Conclusion: The scatter plot shows a non-linear (curved) decreasing trend between Temperature and Viscosity. A straight line model may not be fully adequate; an exponential or log-transformed model is likely more appropriate.
##
## Call:
## lm(formula = Viscosity ~ Temperature, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.043955 -0.035863 -0.009305 0.019900 0.069559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2815107 0.0468683 27.34 1.58e-07 ***
## Temperature -0.0087578 0.0007284 -12.02 2.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04743 on 6 degrees of freedom
## Multiple R-squared: 0.9602, Adjusted R-squared: 0.9535
## F-statistic: 144.6 on 1 and 6 DF, p-value: 2.007e-05
# Plot the residuals
par(mfrow = c(2, 2))
plot(model, which = 1, main = "Residuals vs Fitted")
plot(model, which = 2, main = "Normal Q-Q")
plot(model, which = 3, main = "Scale-Location")
plot(model, which = 5, main = "Residuals vs Leverage")Conclusion on Model Adequacy: The residual plots reveal a pattern (non-random curvature) in the Residuals vs Fitted plot, indicating that the linear model does not capture the true relationship adequately. The Q-Q plot may also show minor deviations. A transformation is needed.
Physical chemistry suggests: \[\text{Viscosity} = \beta_0 \cdot e^{\beta_1 \cdot \text{Temperature}}\]
Taking natural log of both sides: \[\log(\text{Viscosity}) = \beta_0 + \beta_1 \cdot \text{Temperature}\]
# Apply log transformation to Viscosity
data$log_Viscosity <- log(data$Viscosity)
# Fit a linear regression model on log(Viscosity) ~ Temperature
model_log <- lm(log_Viscosity ~ Temperature, data = data)
# Summary statistics of the model
summary(model_log)##
## Call:
## lm(formula = log_Viscosity ~ Temperature, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02188 -0.01827 -0.01134 0.01220 0.04308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3761794 0.0276526 13.60 9.79e-06 ***
## Temperature -0.0115306 0.0004297 -26.83 1.77e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02799 on 6 degrees of freedom
## Multiple R-squared: 0.9917, Adjusted R-squared: 0.9904
## F-statistic: 719.9 on 1 and 6 DF, p-value: 1.77e-07
# Plot the residuals for log-transformed model
par(mfrow = c(2, 2))
plot(model_log, which = 1, main = "Residuals vs Fitted (log transformation)")
plot(model_log, which = 2, main = "Normal Q-Q (log transformation)")
plot(model_log, which = 3, main = "Scale-Location (log transformation)")
plot(model_log, which = 5, main = "Residuals vs Leverage (log transformation)")Conclusion: The log-transformed model shows much better residual behavior — the residuals are more randomly scattered with no systematic pattern, confirming the exponential model is a better fit than the straight line model.
A solid fuel rocket propellant loses weight after it is produced. The following data are available:
| Months since Production (x) | Weight Loss (y) kg |
|---|---|
| 0.25 | 1.42 |
| 0.50 | 1.39 |
| 0.75 | 1.55 |
| 1.00 | 1.89 |
| 1.25 | 2.43 |
| 1.50 | 3.15 |
| 1.75 | 4.05 |
| 2.00 | 5.15 |
| 2.25 | 6.43 |
| 2.50 | 7.89 |
months_since_production <- c(0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 2.25, 2.50)
weight_loss <- c(1.42, 1.39, 1.55, 1.89, 2.43, 3.15, 4.05, 5.15, 6.43, 7.89)\[\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2\]
# Fit a second-order polynomial model (quadratic model)
model <- lm(weight_loss ~ poly(months_since_production, 2, raw = TRUE))
model##
## Call:
## lm(formula = weight_loss ~ poly(months_since_production, 2, raw = TRUE))
##
## Coefficients:
## (Intercept)
## 1.633
## poly(months_since_production, 2, raw = TRUE)1
## -1.232
## poly(months_since_production, 2, raw = TRUE)2
## 1.495
##
## Call:
## lm(formula = weight_loss ~ poly(months_since_production, 2, raw = TRUE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.005364 -0.002727 0.001045 0.002409 0.003273
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.633000 0.004196 389.2
## poly(months_since_production, 2, raw = TRUE)1 -1.232182 0.007010 -175.8
## poly(months_since_production, 2, raw = TRUE)2 1.494545 0.002484 601.6
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## poly(months_since_production, 2, raw = TRUE)1 5.09e-14 ***
## poly(months_since_production, 2, raw = TRUE)2 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003568 on 7 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.859e+06 on 2 and 7 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: weight_loss
## Df Sum Sq Mean Sq F value
## poly(months_since_production, 2, raw = TRUE) 2 47.31 23.655 1858613
## Residuals 7 0.00 0.000
## Pr(>F)
## poly(months_since_production, 2, raw = TRUE) < 2.2e-16 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: If the overall F-statistic p-value < 0.05, the regression is statistically significant, meaning at least one predictor (linear or quadratic term) significantly explains variation in weight loss.
# Plot observed data
plot(months_since_production, weight_loss,
main = "Weight Loss vs Months Since Production",
xlab = "Months Since Production",
ylab = "Weight Loss (kg)",
pch = 16,
col = "blue")
# Add fitted polynomial curve
x_seq <- seq(min(months_since_production), max(months_since_production), length.out = 200)
y_fitted <- predict(model, newdata = data.frame(months_since_production = x_seq))
lines(x_seq, y_fitted, col = "red", lwd = 2)
legend("topleft", legend = c("Observed", "Fitted Quadratic"),
col = c("blue", "red"), pch = c(16, NA), lty = c(NA, 1), lwd = 2)# Extract p-value for the quadratic term (beta_2)
model_summary <- summary(model)
p_value_quadratic <- model_summary$coefficients[3, 4] # row 3 = quadratic term
cat("p-value for quadratic term (beta_2):", p_value_quadratic, "\n")## p-value for quadratic term (beta_2): 9.255077e-18
# Hypothesis test for H0: B2 = 0 (quadratic term significance)
if (p_value_quadratic < 0.05) {
print("Reject H0: The quadratic term is statistically significant.")
} else {
print("Fail to reject H0: The quadratic term is not statistically significant.")
}## [1] "Reject H0: The quadratic term is statistically significant."
Comment: If \(\beta_2\) is significant (p < 0.05), the quadratic term is necessary and the relationship between weight loss and months is curvilinear. Dropping the quadratic term would lead to a poorly fitting model.
Data with response y and predictors \(x_1, x_2, x_3, x_4\):
y <- c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4)
x1 <- c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10)
x2 <- c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68)
x3 <- c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8)
x4 <- c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12)
# Create a data frame
data <- data.frame(y, x1, x2, x3, x4)# Initial null model (no predictors)
null_model <- lm(y ~ 1, data = data)
# Full model with all predictors
full_model <- lm(y ~ x1 + x2 + x3 + x4, data = data)
# Perform forward selection based on AIC
stepwise_model_forward <- step(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "forward",
trace = 1)## Start: AIC=71.44
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 58.852
## + x2 1 1809.43 906.34 59.178
## + x1 1 1450.08 1265.69 63.519
## + x3 1 776.36 1939.40 69.067
## <none> 2715.76 71.444
##
## Step: AIC=58.85
## y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 28.742
## + x3 1 708.13 175.74 39.853
## <none> 883.87 58.852
## + x2 1 14.99 868.88 60.629
##
## Step: AIC=28.74
## y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.789 47.973 24.974
## + x3 1 23.926 50.836 25.728
## <none> 74.762 28.742
##
## Step: AIC=24.97
## y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 47.973 24.974
## + x3 1 0.10909 47.864 26.944
##
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x4 -0.2365 0.1733 -1.365 0.205395
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
##
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = data)
##
## Coefficients:
## (Intercept) x4 x1 x2
## 71.6483 -0.2365 1.4519 0.4161
# Perform backward elimination based on AIC
stepwise_model_backward <- step(full_model,
direction = "backward",
trace = 1)## Start: AIC=26.94
## y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 24.974
## - x4 1 0.2470 48.111 25.011
## - x2 1 2.9725 50.836 25.728
## <none> 47.864 26.944
## - x1 1 25.9509 73.815 30.576
##
## Step: AIC=24.97
## y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## <none> 47.97 24.974
## - x4 1 9.93 57.90 25.420
## - x2 1 26.79 74.76 28.742
## - x1 1 820.91 868.88 60.629
##
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## x4 -0.2365 0.1733 -1.365 0.205395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
##
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = data)
##
## Coefficients:
## (Intercept) x1 x2 x4
## 71.6483 1.4519 0.4161 -0.2365
# Perform stepwise regression based on AIC (both directions)
stepwise_model_both <- step(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "both",
trace = 1)## Start: AIC=71.44
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 58.852
## + x2 1 1809.43 906.34 59.178
## + x1 1 1450.08 1265.69 63.519
## + x3 1 776.36 1939.40 69.067
## <none> 2715.76 71.444
##
## Step: AIC=58.85
## y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 28.742
## + x3 1 708.13 175.74 39.853
## <none> 883.87 58.852
## + x2 1 14.99 868.88 60.629
## - x4 1 1831.90 2715.76 71.444
##
## Step: AIC=28.74
## y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.79 47.97 24.974
## + x3 1 23.93 50.84 25.728
## <none> 74.76 28.742
## - x1 1 809.10 883.87 58.852
## - x4 1 1190.92 1265.69 63.519
##
## Step: AIC=24.97
## y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 47.97 24.974
## - x4 1 9.93 57.90 25.420
## + x3 1 0.11 47.86 26.944
## - x2 1 26.79 74.76 28.742
## - x1 1 820.91 868.88 60.629
##
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0919 -1.8016 0.2562 1.2818 3.8982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.6483 14.1424 5.066 0.000675 ***
## x4 -0.2365 0.1733 -1.365 0.205395
## x1 1.4519 0.1170 12.410 5.78e-07 ***
## x2 0.4161 0.1856 2.242 0.051687 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.8 on 3 and 9 DF, p-value: 3.323e-08
##
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = data)
##
## Coefficients:
## (Intercept) x4 x1 x2
## 71.6483 -0.2365 1.4519 0.4161
Comment: All three variable selection methods (forward selection, backward elimination, and stepwise regression) are based on AIC minimization. They often converge to the same final model. The selected subset model retains only the predictors that contribute meaningfully to explaining variation in y, while discarding those that inflate variance without improving fit. The final model should be interpreted in terms of parsimony and statistical significance of retained predictors.
A dataset of housing features including square footage (sqft), number of rooms (rooms), age (age), and price (price):
sqft <- c(1500, 1800, 1200, 2000, 2200, 1600, 1700, 1400, 1900, 2500)
rooms <- c(3, 4, 3, 5, 5, 4, 4, 3, 4, 6)
age <- c(10, 15, 20, 5, 2, 10, 8, 12, 4, 1)
price <- c(400000, 500000, 350000, 600000, 650000, 450000, 480000, 380000, 550000, 700000)
# Combine into a data frame
data <- data.frame(sqft, rooms, age, price)# Define the model formula
formula <- price ~ sqft + rooms + age
# Initialize variables
n <- nrow(data)
errors <- numeric(n)
# Leave-One-Out Cross-Validation
for (i in 1:n) {
# Split the data into training and test sets
train_data <- data[-i, ] # all except the i-th row
test_data <- data[i, , drop = FALSE] # only the i-th row
# Fit the linear model on the training data
model_cv <- lm(formula, data = train_data)
# Make prediction for the test set (the i-th observation)
predicted <- predict(model_cv, newdata = test_data)
# Calculate the squared error for this fold
errors[i] <- (test_data$price - predicted)^2
}
# Calculate the RMSE (Root Mean Squared Error)
rmse_loocv <- sqrt(mean(errors))
cat("LOOCV RMSE:", rmse_loocv, "\n")## LOOCV RMSE: 25612.06
# Set the number of folds for cross-validation
k <- 5
n <- nrow(data)
# Create indices for the folds
set.seed(123) # Set seed for reproducibility
folds <- sample(rep(1:k, length.out = n))
# Initialize a vector to store RMSE values for each fold
rmse_values <- numeric(k)
# Loop through each fold
for (i in 1:k) {
# Split the data into training and testing sets
test_indices <- which(folds == i)
train_data <- data[-test_indices, ]
test_data <- data[test_indices, ]
# Fit a linear regression model on the training set
model_fold <- lm(price ~ sqft + rooms + age, data = train_data)
# Predict on the test set
predictions <- predict(model_fold, newdata = test_data)
# Calculate RMSE for this fold
rmse_values[i] <- sqrt(mean((predictions - test_data$price)^2))
}
# Average RMSE across all folds
avg_rmse <- mean(rmse_values)
avg_rmse## [1] 25129.48
# Calculate the average RMSE across all folds
print(paste("RMSE for each fold:", toString(rmse_values)))## [1] "RMSE for each fold: 19521.66614908, 20550.0775918097, 28898.116995628, 39377.8852454853, 17299.6613988924"
## Average 5-Fold CV RMSE: 25129.48
Conclusion: LOOCV uses each observation once as a test case, giving an almost unbiased estimate of prediction error but with higher variance. 5-fold CV balances bias-variance tradeoff and is computationally less expensive. Both methods provide an estimate of how well the model generalizes to unseen data.
Data with response y and two predictors \(x_1\) and \(x_2\):
y <- c(16.68, 11.50, 12.03, 14.88, 13.75, 18.11, 8.00, 17.83, 79.24, 21.50, 40.33, 21.00)
x1 <- c(7, 3, 3, 4, 6, 7, 2, 7, 30, 5, 16, 10)
x2 <- c(560, 220, 340, 80, 150, 330, 110, 210, 1460, 605, 688, 215)
# Combine into a data frame
data <- data.frame(y, x1, x2)\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2\]
##
## Call:
## lm(formula = y ~ x1 + x2, data = data)
##
## Coefficients:
## (Intercept) x1 x2
## 1.74124 1.86784 0.01352
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 4054.3 4054.3 501.6551 3.338e-09 ***
## x2 1 62.0 62.0 7.6707 0.02177 *
## Residuals 9 72.7 8.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comment on Significance: If the overall F-statistic p-value < 0.05, we conclude the regression is statistically significant — i.e., at least one of \(x_1\) or \(x_2\) significantly explains variation in y.
##
## Call:
## lm(formula = y ~ x1 + x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7077 -1.1826 0.1314 1.3163 4.5857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.741239 1.251702 1.391 0.1976
## x1 1.867842 0.241182 7.745 2.87e-05 ***
## x2 0.013521 0.004882 2.770 0.0218 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.843 on 9 degrees of freedom
## Multiple R-squared: 0.9826, Adjusted R-squared: 0.9788
## F-statistic: 254.7 on 2 and 9 DF, p-value: 1.198e-08
Conclusion on Role of Variables:
- If the p-value for \(x_1\) (t-test for \(\beta_1\)) < 0.05, we reject \(H_0: \beta_1 = 0\), concluding \(x_1\) has a significant effect on y.
- If the p-value for \(x_2\) (t-test for \(\beta_2\)) < 0.05, we reject \(H_0: \beta_2 = 0\), concluding \(x_2\) has a significant effect on y.
- Variables with p-values ≥ 0.05 may not contribute significantly beyond the other predictor.
## R-squared: 0.9826364
## Adjusted R-squared: 0.9787778
Interpretation:
- \(R^2\) measures the proportion of total variation in y explained by the model.
- Adjusted \(R^2\) penalizes for adding unnecessary predictors; it is more reliable for comparing models with different numbers of predictors.
- A high Adjusted \(R^2\) (close to 1) indicates the model fits the data well.
End of Lab Report — STAT 3206