Polynomial Regression Models
Boston Housing Analysis
You are a data scientist in Awesome Business Analytics (ABA). ABA is a public company traded in Stock exchange. The CEO of ABA wants to invest in the real estate properties in the Boston area. You are given a dataset containing housing values in the suburbs of Boston in the file Boston.csv.
Data Source:
- Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the
demand for clean air. J. Environ. Economics and Management 5,
81–102.
- Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Data Dictionary
Attributes | Description | Data Type | Constraints / Rules |
---|---|---|---|
crim |
Per capita crime rate by town | float64 |
Must be ≥ 0 |
zn |
Proportion of residential land zoned for lots over 25,000 sq.ft | float64 |
Range: 0–100 |
indus |
Proportion of non-retail business acres per town | float64 |
Must be ≥ 0 |
chas |
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) | int64 |
Binary: 0 or 1 |
nox |
Nitric oxides concentration (parts per 10 million) | float64 |
Range: Typically 0.3–1.0 |
rm |
Average number of rooms per dwelling | float64 |
Typically between 3 and 9 |
age |
Proportion of owner-occupied units built before 1940 | float64 |
Range: 0–100 |
dis |
Weighted distances to five Boston employment centers | float64 |
Must be > 0 |
rad |
Accessibility index to radial highways | int64 |
Discrete integer; values typically range from 1 to 24 |
tax |
Property-tax rate per $10,000 | int64 |
Must be > 0 |
ptratio |
Pupil-teacher ratio by town | float64 |
Typical range: 12–22 |
lstat |
% of lower status population | float64 |
Range: 0–100 |
medv |
Median home value (in $1000s) | float64 |
Typical range: 5–50 (capped at 50) |
Environment Preparation
Install necessary packages if not installed
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("gam", quietly = TRUE)) {
install.packages("gam")
}
Load the libraries
Question 1
Read the dataset in Boston.csv into R. Call the loaded data Boston. Make sure that you have the directory set to the correct location for the data.
Read the dataset into memory
Display the dimensions of the data frame (number of rows and columns)
## [1] 506 14
Display the first six rows of the data frame to understand its structure
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
Display the column names of the data frame
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Question 2
The response is nox
and the predictor is
dis
. Use the poly()
function to fit a cubic
polynomial regression to predict nox
using
dis
. Report the regression output.
Fit a cubic polynomial regression model
Display the summary statistics of the fitted model
##
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 < 2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 < 2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 < 2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: < 2.2e-16
A cubic polynomial regression model effectively describes the
complex, non-linear relationship between distance to employment centers
dis
and nitrogen oxide concentration nox
in
Boston. The model exhibits a strong fit (R-squared
~71.48%), significant coefficients, and accurate predictions,
demonstrating that dis
is a powerful predictor of
nox
.
Question 3
Your assistant data scientist, Tom Johnson, is considering
predicting nox
using dis
as a predictor. He
proposes models from degree 5, degree 4, and degree 3, and degree 2
polynomial regression. Please perform cross-validation using caret
package to select the optimal degree for the polynomial and justify your
answer.
Create a function to predict nox
using dis
as a predictor with various degrees.
polynomial_cv <- function(data, response, predictor,
degrees = 2:5, seed = 1) {
# Set random seed for reproducibility
set.seed(seed)
# Initialize an empty data frame to store results
cv_results <- data.frame(Degree = integer(), RMSE = numeric())
# Loop through different polynomial degrees
for (degree in degrees) {
# Define polynomial regression formula dynamically
fit <- as.formula(paste(response, "~ poly(", predictor, ",", degree, ")"))
# Set up 10-fold cross-validation
train_control <- trainControl(method = "CV", number = 10)
# Train the polynomial regression model using linear regression
poly_model <- train(fit, data = data,
method = "lm", trControl = train_control)
# Store the degree and corresponding RMSE in the results dataframe
cv_results <- rbind(cv_results,
data.frame(Degree = degree,
RMSE = round(poly_model$results$RMSE, 4)))
}
# Identify the polynomial degree with the lowest RMSE
best_degree <- cv_results$Degree[which.min(cv_results$RMSE)]
# Print cross-validation results
cat("\nDegrees for polynomial regression with corresponding RMSE value \n")
print(cv_results)
# Return the optimal polynomial degree with the lowest RMSE
return(cat("\nOptimal degree for polynomial regression is", best_degree,
"with RMSE value of", min(cv_results$RMSE), "\n"))
}
Perform cross-validation to find the best polynomial degree
##
## Degrees for polynomial regression with corresponding RMSE value
## Degree RMSE
## 1 2 0.0635
## 2 3 0.0620
## 3 4 0.0624
## 4 5 0.0643
##
## Optimal degree for polynomial regression is 3 with RMSE value of 0.062
A degree 3 polynomial regression model, validated through 10-fold
cross-validation, is the optimal choice for predicting nitrogen oxide
concentration nox
from distance to employment centers
dis
in the Boston dataset. It yielded the lowest RMSE,
indicating the best fit, while higher-degree models showed overfitting
and lower-degree models underfitting. This suggests a cubic relationship
between dis
and nox
, and the cross-validation
assures the model’s reliability for unseen data.
Question 4
Tom just took the DSCI 512. You recommend that he perform the
following GAM
analysis.
Section A:
Predict nox
using a smoothing spline of degree 3
in dis
and a smoothing spline of degree 2 in
medv
.
Fit the first GAM model
Display summary of the first model
##
## Call: gam(formula = nox ~ s(dis, 3) + s(medv, 2), data = Boston)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.120427 -0.038086 -0.007356 0.026321 0.216850
##
## (Dispersion Parameter for gaussian family taken to be 0.0034)
##
## Null Deviance: 6.781 on 505 degrees of freedom
## Residual Deviance: 1.7112 on 500.0003 degrees of freedom
## AIC: -1428.839
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(dis, 3) 1 3.8465 3.8465 1123.908 < 2.2e-16 ***
## s(medv, 2) 1 0.2492 0.2492 72.802 < 2.2e-16 ***
## Residuals 500 1.7112 0.0034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(dis, 3) 2 85.724 < 2.2e-16 ***
## s(medv, 2) 1 8.572 0.003569 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A Generalized Additive Model GAM
model effectively
predicts nitrogen oxide nox
concentrations in Boston using
distance to employment centers dis
and median home values
medv
. Both predictors show significant, non-linear
relationships with nox
, with dis
having a
stronger influence than medv
. The GAM’s fit is
satisfactory, demonstrating the necessity of non-linear smooth functions
over linear models for accurately representing these relationships.
Section B:
Predict nox
using a smoothing spline of degree 2
in dis
and a smoothing spline of degree 1 in
medv
.
Fit the second GAM model
Display summary of the second model
##
## Call: gam(formula = nox ~ s(dis, 2) + s(medv, 1), data = Boston)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.118926 -0.039568 -0.008386 0.026542 0.219987
##
## (Dispersion Parameter for gaussian family taken to be 0.0037)
##
## Null Deviance: 6.781 on 505 degrees of freedom
## Residual Deviance: 1.8363 on 501.9999 degrees of freedom
## AIC: -1397.131
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(dis, 2) 1 4.0124 4.0124 1096.874 < 2.2e-16 ***
## s(medv, 1) 1 0.2820 0.2820 77.091 < 2.2e-16 ***
## Residuals 502 1.8363 0.0037
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(dis, 2) 1 145.576 < 2.2e-16 ***
## s(medv, 1) 0 39.036 2.444e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This GAM
model effectively predicts nitric oxide
concentration nox
in the Boston housing dataset using
non-linear relationships with weighted distances to employment centers
dis
and median home values medv
. Both
predictors show significant, complex, curved effects on
nox
, and the model significantly reduces deviance compared
to a null model, indicating a good fit. Residual analysis supports the
model’s robustness and predictive power.
Section C:
Perform anova
analysis. Recommend the best model
and justify your answer.
Conduct ANOVA to compare model fit using Chi-Square test
## Analysis of Deviance Table
##
## Model 1: nox ~ s(dis, 2) + s(medv, 1)
## Model 2: nox ~ s(dis, 3) + s(medv, 2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 502 1.8363
## 2 500 1.7112 1.9996 0.12512 1.151e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA
analysis compares two models, Model 1 and
Model 2, to determine which provides a better fit for the data. Model 2,
with slightly fewer residual degrees of freedom (500 vs. 502), has a
lower residual deviance (1.7112) compared to Model 1 (1.8363),
indicating a better fit. The additional complexity in Model 2, such as
more smoothing terms for the variables dis
and
medv
, contributes to this improvement. The Chi-Square test
shows a significant p-value (1.151e-08), confirming that Model 2’s added
complexity leads to a significantly better fit.
In conclusion, Model 2 is the preferred model, as it significantly improves the fit over Model 1, making it the more suitable choice for the data.