Polynomial Regression Models
Boston Housing Analysis
You are a data scientist in Awesome Business Analytics (ABA). ABA is a public company traded in Stock exchange. The CEO of ABA wants to invest in the real estate properties in the Boston area. You are given a dataset containing housing values in the suburbs of Boston in the file Boston.csv.
Data Source:
- Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the
demand for clean air. J. Environ. Economics and Management 5,
81–102.
- Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Data Dictionary
| Attributes | Description | Data Type | Constraints / Rules |
|---|---|---|---|
crim |
Per capita crime rate by town | float64 |
Must be ≥ 0 |
zn |
Proportion of residential land zoned for lots over 25,000 sq.ft | float64 |
Range: 0–100 |
indus |
Proportion of non-retail business acres per town | float64 |
Must be ≥ 0 |
chas |
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) | int64 |
Binary: 0 or 1 |
nox |
Nitric oxides concentration (parts per 10 million) | float64 |
Range: Typically 0.3–1.0 |
rm |
Average number of rooms per dwelling | float64 |
Typically between 3 and 9 |
age |
Proportion of owner-occupied units built before 1940 | float64 |
Range: 0–100 |
dis |
Weighted distances to five Boston employment centers | float64 |
Must be > 0 |
rad |
Accessibility index to radial highways | int64 |
Discrete integer; values typically range from 1 to 24 |
tax |
Property-tax rate per $10,000 | int64 |
Must be > 0 |
ptratio |
Pupil-teacher ratio by town | float64 |
Typical range: 12–22 |
lstat |
% of lower status population | float64 |
Range: 0–100 |
medv |
Median home value (in $1000s) | float64 |
Typical range: 5–50 (capped at 50) |
Environment Preparation
Install necessary packages if not installed
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("gam", quietly = TRUE)) {
install.packages("gam")
}Load the libraries
Question 1
Read the dataset in Boston.csv into R. Call the loaded data Boston. Make sure that you have the directory set to the correct location for the data.
Read the dataset into memory
Display the dimensions of the data frame (number of rows and columns)
## [1] 506 14
Display the first six rows of the data frame to understand its structure
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Display the column names of the data frame
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Question 2
The response is nox and the predictor is
dis. Use the poly() function to fit a cubic
polynomial regression to predict nox using
dis. Report the regression output.
Fit a cubic polynomial regression model
Display the summary statistics of the fitted model
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
A cubic polynomial regression model effectively describes the
complex, non-linear relationship between distance to employment centers
dis and nitrogen oxide concentration nox in
Boston. The model exhibits a strong fit (R-squared
~71.48%), significant coefficients, and accurate predictions,
demonstrating that dis is a powerful predictor of
nox.
Question 3
Your assistant data scientist, Tom Johnson, is considering
predicting nox using dis as a predictor. He
proposes models from degree 5, degree 4, and degree 3, and degree 2
polynomial regression. Please perform cross-validation using caret
package to select the optimal degree for the polynomial and justify your
answer.
Create a function to predict nox using dis
as a predictor with various degrees.
polynomial_cv <- function(data, response, predictor,
degrees = 2:5, seed = 1) {
# Set random seed for reproducibility
set.seed(seed)
# Initialize an empty data frame to store results
cv_results <- data.frame(Degree = integer(), RMSE = numeric())
# Loop through different polynomial degrees
for (degree in degrees) {
# Define polynomial regression formula dynamically
fit <- as.formula(paste(response, "~ poly(", predictor, ",", degree, ")"))
# Set up 10-fold cross-validation
train_control <- trainControl(method = "CV", number = 10)
# Train the polynomial regression model using linear regression
poly_model <- train(fit, data = data,
method = "lm", trControl = train_control)
# Store the degree and corresponding RMSE in the results dataframe
cv_results <- rbind(cv_results,
data.frame(Degree = degree,
RMSE = round(poly_model$results$RMSE, 4)))
}
# Identify the polynomial degree with the lowest RMSE
best_degree <- cv_results$Degree[which.min(cv_results$RMSE)]
# Print cross-validation results
cat("\nDegrees for polynomial regression with corresponding RMSE value \n")
print(cv_results)
# Return the optimal polynomial degree with the lowest RMSE
return(cat("\nOptimal degree for polynomial regression is", best_degree,
"with RMSE value of", min(cv_results$RMSE), "\n"))
}Perform cross-validation to find the best polynomial degree
##
## Degrees for polynomial regression with corresponding RMSE value
## Degree RMSE
## 1 2 0.0635
## 2 3 0.0620
## 3 4 0.0624
## 4 5 0.0643
##
## Optimal degree for polynomial regression is 3 with RMSE value of 0.062
A degree 3 polynomial regression model, validated through 10-fold
cross-validation, is the optimal choice for predicting nitrogen oxide
concentration nox from distance to employment centers
dis in the Boston dataset. It yielded the lowest RMSE,
indicating the best fit, while higher-degree models showed overfitting
and lower-degree models underfitting. This suggests a cubic relationship
between dis and nox, and the cross-validation
assures the model’s reliability for unseen data.
Question 4
Tom just took the DSCI 512. You recommend that he perform the
following GAM analysis.
Section A:
Predict nox using a smoothing spline of degree 3
in dis and a smoothing spline of degree 2 in
medv.
Fit the first GAM model
Display summary of the first model
##
## Call: gam(formula = nox ~ s(dis, 3) + s(medv, 2), data = Boston)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.120427 -0.038086 -0.007356 0.026321 0.216850
##
## (Dispersion Parameter for gaussian family taken to be 0.0034)
##
## Null Deviance: 6.781 on 505 degrees of freedom
## Residual Deviance: 1.7112 on 500.0003 degrees of freedom
## AIC: -1428.839
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(dis, 3) 1 3.8465 3.8465 1123.908 < 2.2e-16 ***
## s(medv, 2) 1 0.2492 0.2492 72.802 < 2.2e-16 ***
## Residuals 500 1.7112 0.0034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(dis, 3) 2 85.724 < 2.2e-16 ***
## s(medv, 2) 1 8.572 0.003569 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A Generalized Additive Model GAM model effectively
predicts nitrogen oxide nox concentrations in Boston using
distance to employment centers dis and median home values
medv. Both predictors show significant, non-linear
relationships with nox, with dis having a
stronger influence than medv. The GAM’s fit is
satisfactory, demonstrating the necessity of non-linear smooth functions
over linear models for accurately representing these relationships.
Section B:
Predict nox using a smoothing spline of degree 2
in dis and a smoothing spline of degree 1 in
medv.
Fit the second GAM model
Display summary of the second model
##
## Call: gam(formula = nox ~ s(dis, 2) + s(medv, 1), data = Boston)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.118926 -0.039568 -0.008386 0.026542 0.219987
##
## (Dispersion Parameter for gaussian family taken to be 0.0037)
##
## Null Deviance: 6.781 on 505 degrees of freedom
## Residual Deviance: 1.8363 on 501.9999 degrees of freedom
## AIC: -1397.131
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(dis, 2) 1 4.0124 4.0124 1096.874 < 2.2e-16 ***
## s(medv, 1) 1 0.2820 0.2820 77.091 < 2.2e-16 ***
## Residuals 502 1.8363 0.0037
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(dis, 2) 1 145.576 < 2.2e-16 ***
## s(medv, 1) 0 39.036 2.444e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This GAM model effectively predicts nitric oxide
concentration nox in the Boston housing dataset using
non-linear relationships with weighted distances to employment centers
dis and median home values medv. Both
predictors show significant, complex, curved effects on
nox, and the model significantly reduces deviance compared
to a null model, indicating a good fit. Residual analysis supports the
model’s robustness and predictive power.
Section C:
Perform anova analysis. Recommend the best model
and justify your answer.
Conduct ANOVA to compare model fit using Chi-Square test
## Analysis of Deviance Table
##
## Model 1: nox ~ s(dis, 2) + s(medv, 1)
## Model 2: nox ~ s(dis, 3) + s(medv, 2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 502 1.8363
## 2 500 1.7112 1.9996 0.12512 1.151e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA analysis compares two models, Model 1 and
Model 2, to determine which provides a better fit for the data. Model 2,
with slightly fewer residual degrees of freedom (500 vs. 502), has a
lower residual deviance (1.7112) compared to Model 1 (1.8363),
indicating a better fit. The additional complexity in Model 2, such as
more smoothing terms for the variables dis and
medv, contributes to this improvement. The Chi-Square test
shows a significant p-value (1.151e-08), confirming that Model 2’s added
complexity leads to a significantly better fit.
In conclusion, Model 2 is the preferred model, as it significantly improves the fit over Model 1, making it the more suitable choice for the data.