Polynomial Regression Models

Boston Housing Analysis

You are a data scientist in Awesome Business Analytics (ABA). ABA is a public company traded in Stock exchange. The CEO of ABA wants to invest in the real estate properties in the Boston area. You are given a dataset containing housing values in the suburbs of Boston in the file Boston.csv.

Data Source:

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Data Dictionary

Attributes	Description	Data Type	Constraints / Rules
`crim`	Per capita crime rate by town	`float64`	Must be ≥ 0
`zn`	Proportion of residential land zoned for lots over 25,000 sq.ft	`float64`	Range: 0–100
`indus`	Proportion of non-retail business acres per town	`float64`	Must be ≥ 0
`chas`	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)	`int64`	Binary: 0 or 1
`nox`	Nitric oxides concentration (parts per 10 million)	`float64`	Range: Typically 0.3–1.0
`rm`	Average number of rooms per dwelling	`float64`	Typically between 3 and 9
`age`	Proportion of owner-occupied units built before 1940	`float64`	Range: 0–100
`dis`	Weighted distances to five Boston employment centers	`float64`	Must be > 0
`rad`	Accessibility index to radial highways	`int64`	Discrete integer; values typically range from 1 to 24
`tax`	Property-tax rate per $10,000	`int64`	Must be > 0
`ptratio`	Pupil-teacher ratio by town	`float64`	Typical range: 12–22
`lstat`	% of lower status population	`float64`	Range: 0–100
`medv`	Median home value (in $1000s)	`float64`	Typical range: 5–50 (capped at 50)

Environment Preparation

Install necessary packages if not installed

if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("gam", quietly = TRUE)) {
  install.packages("gam")
}

Load the libraries

library(caret)
library(gam)

Question 1

Read the dataset in Boston.csv into R. Call the loaded data Boston. Make sure that you have the directory set to the correct location for the data.

Read the dataset into memory

Boston <- read.csv("data/Boston.csv")

Display the dimensions of the data frame (number of rows and columns)

dim(Boston)

## [1] 506  14

Display the first six rows of the data frame to understand its structure

head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

Display the column names of the data frame

colnames(Boston)

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

Question 2

The response is nox and the predictor is dis. Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output.

Fit a cubic polynomial regression model

Boston.fit <- lm(nox ~ poly(dis, 3), data = Boston)

Display the summary statistics of the fitted model

summary(Boston.fit)

## 
## Call:
## lm(formula = nox ~ poly(dis, 3), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.121130 -0.040619 -0.009738  0.023385  0.194904 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131 
## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16

A cubic polynomial regression model effectively describes the complex, non-linear relationship between distance to employment centers dis and nitrogen oxide concentration nox in Boston. The model exhibits a strong fit (R-squared ~71.48%), significant coefficients, and accurate predictions, demonstrating that dis is a powerful predictor of nox.

Question 3

Your assistant data scientist, Tom Johnson, is considering predicting nox using dis as a predictor. He proposes models from degree 5, degree 4, and degree 3, and degree 2 polynomial regression. Please perform cross-validation using caret package to select the optimal degree for the polynomial and justify your answer.

Create a function to predict nox using dis as a predictor with various degrees.

polynomial_cv <- function(data, response, predictor, 
                          degrees = 2:5, seed = 1) {
  
  # Set random seed for reproducibility
  set.seed(seed)
  
  # Initialize an empty data frame to store results
  cv_results <- data.frame(Degree = integer(), RMSE = numeric())
  
  # Loop through different polynomial degrees
  for (degree in degrees) {
    # Define polynomial regression formula dynamically
    fit <- as.formula(paste(response, "~ poly(", predictor, ",", degree, ")"))
    
    # Set up 10-fold cross-validation
    train_control <- trainControl(method = "CV", number = 10)
    
    # Train the polynomial regression model using linear regression
    poly_model <- train(fit, data = data, 
                        method = "lm", trControl = train_control)
    
    # Store the degree and corresponding RMSE in the results dataframe
    cv_results <- rbind(cv_results, 
                        data.frame(Degree = degree, 
                                   RMSE = round(poly_model$results$RMSE, 4)))
  }
  
  # Identify the polynomial degree with the lowest RMSE
  best_degree <- cv_results$Degree[which.min(cv_results$RMSE)]
  
  # Print cross-validation results
  cat("\nDegrees for polynomial regression with corresponding RMSE value \n")
  print(cv_results)
  
  # Return the optimal polynomial degree with the lowest RMSE
  return(cat("\nOptimal degree for polynomial regression is", best_degree,
             "with RMSE value of", min(cv_results$RMSE), "\n"))
}

Perform cross-validation to find the best polynomial degree

polynomial_cv(data = Boston, response = "nox", predictor = "dis")

## 
## Degrees for polynomial regression with corresponding RMSE value 
##   Degree   RMSE
## 1      2 0.0635
## 2      3 0.0620
## 3      4 0.0624
## 4      5 0.0643
## 
## Optimal degree for polynomial regression is 3 with RMSE value of 0.062

A degree 3 polynomial regression model, validated through 10-fold cross-validation, is the optimal choice for predicting nitrogen oxide concentration nox from distance to employment centers dis in the Boston dataset. It yielded the lowest RMSE, indicating the best fit, while higher-degree models showed overfitting and lower-degree models underfitting. This suggests a cubic relationship between dis and nox, and the cross-validation assures the model’s reliability for unseen data.

Question 4

Tom just took the DSCI 512. You recommend that he perform the following GAM analysis.

Section A:

Predict nox using a smoothing spline of degree 3 in dis and a smoothing spline of degree 2 in medv.

Fit the first GAM model

gam1_fit <- gam(nox ~ s(dis, 3) + s(medv, 2), data = Boston)

Display summary of the first model

summary(gam1_fit)

## 
## Call: gam(formula = nox ~ s(dis, 3) + s(medv, 2), data = Boston)
## Deviance Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.120427 -0.038086 -0.007356  0.026321  0.216850 
## 
## (Dispersion Parameter for gaussian family taken to be 0.0034)
## 
##     Null Deviance: 6.781 on 505 degrees of freedom
## Residual Deviance: 1.7112 on 500.0003 degrees of freedom
## AIC: -1428.839 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##             Df Sum Sq Mean Sq  F value    Pr(>F)    
## s(dis, 3)    1 3.8465  3.8465 1123.908 < 2.2e-16 ***
## s(medv, 2)   1 0.2492  0.2492   72.802 < 2.2e-16 ***
## Residuals  500 1.7112  0.0034                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F     Pr(F)    
## (Intercept)                             
## s(dis, 3)         2 85.724 < 2.2e-16 ***
## s(medv, 2)        1  8.572  0.003569 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A Generalized Additive Model GAM model effectively predicts nitrogen oxide nox concentrations in Boston using distance to employment centers dis and median home values medv. Both predictors show significant, non-linear relationships with nox, with dis having a stronger influence than medv. The GAM’s fit is satisfactory, demonstrating the necessity of non-linear smooth functions over linear models for accurately representing these relationships.

Section B:

Predict nox using a smoothing spline of degree 2 in dis and a smoothing spline of degree 1 in medv.

Fit the second GAM model

gam2_fit <- gam(nox ~ s(dis, 2) + s(medv, 1), data = Boston)

Display summary of the second model

summary(gam2_fit)

## 
## Call: gam(formula = nox ~ s(dis, 2) + s(medv, 1), data = Boston)
## Deviance Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.118926 -0.039568 -0.008386  0.026542  0.219987 
## 
## (Dispersion Parameter for gaussian family taken to be 0.0037)
## 
##     Null Deviance: 6.781 on 505 degrees of freedom
## Residual Deviance: 1.8363 on 501.9999 degrees of freedom
## AIC: -1397.131 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##             Df Sum Sq Mean Sq  F value    Pr(>F)    
## s(dis, 2)    1 4.0124  4.0124 1096.874 < 2.2e-16 ***
## s(medv, 1)   1 0.2820  0.2820   77.091 < 2.2e-16 ***
## Residuals  502 1.8363  0.0037                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df  Npar F     Pr(F)    
## (Intercept)                              
## s(dis, 2)         1 145.576 < 2.2e-16 ***
## s(medv, 1)        0  39.036 2.444e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This GAM model effectively predicts nitric oxide concentration nox in the Boston housing dataset using non-linear relationships with weighted distances to employment centers dis and median home values medv. Both predictors show significant, complex, curved effects on nox, and the model significantly reduces deviance compared to a null model, indicating a good fit. Residual analysis supports the model’s robustness and predictive power.

Section C:

Perform anova analysis. Recommend the best model and justify your answer.

Conduct ANOVA to compare model fit using Chi-Square test

anova(gam2_fit, gam1_fit, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: nox ~ s(dis, 2) + s(medv, 1)
## Model 2: nox ~ s(dis, 3) + s(medv, 2)
##   Resid. Df Resid. Dev     Df Deviance  Pr(>Chi)    
## 1       502     1.8363                              
## 2       500     1.7112 1.9996  0.12512 1.151e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA analysis compares two models, Model 1 and Model 2, to determine which provides a better fit for the data. Model 2, with slightly fewer residual degrees of freedom (500 vs. 502), has a lower residual deviance (1.7112) compared to Model 1 (1.8363), indicating a better fit. The additional complexity in Model 2, such as more smoothing terms for the variables dis and medv, contributes to this improvement. The Chi-Square test shows a significant p-value (1.151e-08), confirming that Model 2’s added complexity leads to a significantly better fit.

In conclusion, Model 2 is the preferred model, as it significantly improves the fit over Model 1, making it the more suitable choice for the data.