Polynomial regression and the use of the \(\texttt{poly()}\) function.

set.seed(123)
n = 10
xr = seq(0, n, by = 0.1)  # Generate a sequence of numerical values, from 0 to n=10, by an increment of 0.1
# Generate random data using a sine function plus some random noise
yr = sin(xr/2) + rnorm(length(xr))/2
# Combine x and y into a dataframe (df)
df = data.frame(x = xr, y = yr)
# Plot the data
plot(df)

# Regress y on x
lm.fit = lm(y ~ x, data = df)
abline(lm.fit, col = "red")

# If the degree of the polynomial is large enough, it can approximate most non-linear patterns, but beware of overfitting (more on this below)
# Run a degree-5 polynomial regression (note how the polynomial function of x, up to the 5th degree, is specified)
# Note the arguments inside poly(): the first is the variable you want to transform, the second is the degree of the polynomial (not the number of knots)
poly.5 = lm(y ~ poly(x, 5), data = df)  # Degree-5 polynomial
summary(poly.5)
## 
## Call:
## lm(formula = y ~ poly(x, 5), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.15647 -0.31216 -0.02439  0.31324  1.03605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17831    0.04594   3.881 0.000192 ***
## poly(x, 5)1 -5.48263    0.46171 -11.875  < 2e-16 ***
## poly(x, 5)2 -3.17422    0.46171  -6.875 6.49e-10 ***
## poly(x, 5)3  2.17617    0.46171   4.713 8.34e-06 ***
## poly(x, 5)4  0.72057    0.46171   1.561 0.121932    
## poly(x, 5)5 -0.79754    0.46171  -1.727 0.087353 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4617 on 95 degrees of freedom
## Multiple R-squared:  0.6944, Adjusted R-squared:  0.6784 
## F-statistic: 43.18 on 5 and 95 DF,  p-value: < 2.2e-16
plot(df)
lines(xr, predict(poly.5), col = "red")  # Overlay a line with xr on the horizontal axis and the degree-5 polynomial's fitted values on the vertical axis; the base plot must already exist (created by plot(df) above)

# Note in particular the "raw" argument inside poly(). When raw = FALSE (the default), poly() computes orthogonal polynomials,
# so that each polynomial term of x is orthogonal (uncorrelated) to the lower-order terms. The advantage of this coding is that
# you can see whether a given order significantly improves the fit over the lower orders (by looking at that term's t statistic and p-value),
# and because the terms are uncorrelated, their standard errors are not inflated by collinearity.
# You can therefore use this coding as a quick check of the degree at which the polynomial terms stop being significant,
# because each orthogonal (raw = FALSE) term captures only the variation NOT already explained by the lower-order terms.
# If only the linear (1st-order) term is significant while the higher-order terms are not, the relationship between x and y is largely linear;
# conversely, significant higher-order terms indicate a genuinely non-linear relationship.
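# A quick illustrative check (a sketch, reusing the xr vector generated above): the orthogonal polynomial
# columns are essentially uncorrelated, while the raw powers of x are strongly correlated with one another
round(cor(poly(xr, 5)), 2)              # close to the identity matrix
round(cor(poly(xr, 5, raw = TRUE)), 2)  # large off-diagonal correlations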

# Let's fit the same model again, but this time set raw = TRUE to see the difference:

poly.5T = lm(y ~ poly(x, 5, raw=TRUE), data = df)
summary(poly.5T) # The coefficient estimates are now on the scale of the raw powers of x (which grow very large), so their magnitudes are much smaller and not directly comparable to the orthogonal coding; note also that the raw powers are highly correlated with one another
## 
## Call:
## lm(formula = y ~ poly(x, 5, raw = TRUE), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.15647 -0.31216 -0.02439  0.31324  1.03605 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)  
## (Intercept)              0.3101481  0.2534042   1.224   0.2240  
## poly(x, 5, raw = TRUE)1 -0.2465389  0.5209432  -0.473   0.6371  
## poly(x, 5, raw = TRUE)2  0.4994234  0.3268684   1.528   0.1299  
## poly(x, 5, raw = TRUE)3 -0.1581306  0.0833390  -1.897   0.0608 .
## poly(x, 5, raw = TRUE)4  0.0172687  0.0092052   1.876   0.0637 .
## poly(x, 5, raw = TRUE)5 -0.0006328  0.0003663  -1.727   0.0874 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4617 on 95 degrees of freedom
## Multiple R-squared:  0.6944, Adjusted R-squared:  0.6784 
## F-statistic: 43.18 on 5 and 95 DF,  p-value: < 2.2e-16
# We see that when raw = TRUE the individual p-values change dramatically: the 1st through 3rd order terms, which were highly significant under the orthogonal coding, are no longer significant at the 0.05 level, and only the 3rd, 4th, and 5th order terms are marginally significant (p < 0.1). This happens because the raw powers of x are strongly correlated with one another, which inflates the standard errors of the individual coefficients; the overall fit (R-squared, F-statistic, fitted values) is exactly the same as before.

# One takeaway regarding the raw argument: when you set raw = TRUE, interpret the individual p-values with caution (if at all), because the polynomial terms are highly correlated with the base variable x and with one another.
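# A quick sanity check (a sketch, reusing the models fit above): the two codings describe the same underlying
# model, so the fitted values (and hence R-squared and the F-statistic) are identical; only the coefficient
# parameterization differs
all.equal(fitted(poly.5), fitted(poly.5T))   # should return TRUE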

# Note that we can also specify a polynomial regression using the I() ("as-is") function; the result is identical to poly() with raw = TRUE (i.e., the base variable and its higher-order terms are allowed to be correlated)
# I() tells the formula interface to evaluate its argument arithmetically (e.g., x^2 as x squared) rather than as formula syntax
poly.5I = lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5), data = df)
summary(poly.5I)
## 
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.15647 -0.31216 -0.02439  0.31324  1.03605 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.3101481  0.2534042   1.224   0.2240  
## x           -0.2465389  0.5209432  -0.473   0.6371  
## I(x^2)       0.4994234  0.3268684   1.528   0.1299  
## I(x^3)      -0.1581306  0.0833390  -1.897   0.0608 .
## I(x^4)       0.0172687  0.0092052   1.876   0.0637 .
## I(x^5)      -0.0006328  0.0003663  -1.727   0.0874 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4617 on 95 degrees of freedom
## Multiple R-squared:  0.6944, Adjusted R-squared:  0.6784 
## F-statistic: 43.18 on 5 and 95 DF,  p-value: < 2.2e-16
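# A quick check (a sketch): the coefficient estimates from the I() specification match the raw-polynomial fit
all.equal(unname(coef(poly.5T)), unname(coef(poly.5I)))   # should return TRUE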
# "Overfitting" the non-linearity may not be a good thing as it creates too many inflection points and substantive interpretation will be difficult
# Run a 25-degree polynomial regression
poly.25 = lm(y ~ poly(x, 25), data = df)
summary(poly.25)
## 
## Call:
## lm(formula = y ~ poly(x, 25), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22857 -0.21895  0.01087  0.20265  1.11554 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.17831    0.04534   3.933 0.000186 ***
## poly(x, 25)1  -5.48263    0.45569 -12.032  < 2e-16 ***
## poly(x, 25)2  -3.17422    0.45569  -6.966 1.08e-09 ***
## poly(x, 25)3   2.17617    0.45569   4.776 8.67e-06 ***
## poly(x, 25)4   0.72057    0.45569   1.581 0.118019    
## poly(x, 25)5  -0.79754    0.45569  -1.750 0.084172 .  
## poly(x, 25)6  -0.95432    0.45569  -2.094 0.039616 *  
## poly(x, 25)7   0.01844    0.45569   0.040 0.967821    
## poly(x, 25)8  -0.62784    0.45569  -1.378 0.172365    
## poly(x, 25)9  -0.06985    0.45569  -0.153 0.878583    
## poly(x, 25)10  0.02506    0.45569   0.055 0.956296    
## poly(x, 25)11 -0.19167    0.45569  -0.421 0.675240    
## poly(x, 25)12 -0.59639    0.45569  -1.309 0.194608    
## poly(x, 25)13  0.61533    0.45569   1.350 0.180965    
## poly(x, 25)14 -0.64837    0.45569  -1.423 0.158925    
## poly(x, 25)15 -0.01123    0.45569  -0.025 0.980403    
## poly(x, 25)16  1.10816    0.45569   2.432 0.017408 *  
## poly(x, 25)17  0.14043    0.45569   0.308 0.758813    
## poly(x, 25)18  0.29925    0.45569   0.657 0.513386    
## poly(x, 25)19  0.24567    0.45569   0.539 0.591405    
## poly(x, 25)20  0.38016    0.45569   0.834 0.406778    
## poly(x, 25)21  0.37230    0.45569   0.817 0.416507    
## poly(x, 25)22  0.21186    0.45569   0.465 0.643330    
## poly(x, 25)23 -0.03766    0.45569  -0.083 0.934347    
## poly(x, 25)24 -0.63116    0.45569  -1.385 0.170136    
## poly(x, 25)25 -0.22477    0.45569  -0.493 0.623275    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4557 on 75 degrees of freedom
## Multiple R-squared:  0.765,  Adjusted R-squared:  0.6867 
## F-statistic: 9.767 on 25 and 75 DF,  p-value: 6.881e-15
plot(df)
lines(xr, predict(poly.25), col = "red")

# Is the polynomial regression robust to a small perturbation in the data (here, changing a single y value)?
plot(df)
lines(xr, predict(poly.25), col = "red", lty = 1)     # Set color to "red", line type (lty) to solid line lty = 1
yrm = yr  # Create a new dataframe to store the new value by replicating existing dataframe yr
yrm[31] = yr[31] - 2    # Subtract the value of the 31st observation by 2
new_poly.25 = lm(yrm ~ poly(xr, 25), data = df)  # Regress the new y data on the existing x variable
lines(xr, predict(new_poly.25), col = "blue", lty = 2)

# Set color to "blue", line type to dashed line lty = 2
# Add a legend at the coordinate (5.5, 2.0)
legend(5.5, 2.0, legend=c("25-degree Poly", "New 25-degree Poly"), col=c("red", "blue"), lty=1:2, cex=0.8)
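# A rough (in)stability check (a sketch, reusing the fits above): the largest change in the degree-25
# fitted values caused by altering that single observation
max(abs(predict(poly.25) - predict(new_poly.25)))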

Fit different versions of polynomial regression on the Boston Housing data

Polynomial regression, train-test split, model-fitting, and plotting.

# load required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)  # For train-test split
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(MASS)   # Where the Boston data is stored
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
theme_set(theme_classic())  # Set ggplot2 plotting theme

data("Boston")   # Alternatively, you can use "data("Boston", package = "MASS")"

# Train-test split on the dependent variable "medv" of the data
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)   # 80-20 train-test split
training  <- Boston[training.samples, ]
testing <- Boston[-training.samples, ]

ggplot(training, aes(lstat, medv) ) +   # lstat on the x-axis, medv on the y-axis
  geom_point(alpha = 0.2) +         # Using alpha to adjust transparency level
  labs(x = "% of lower status population", y = "Median value of owner-occupied homes", title = "Scatter plot of the medv vs. lstat") +
  stat_smooth(col = "red")  # Smoothing function
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Fit a linear model between medv and lstat
lm.fit <- lm(medv ~ lstat, data = training)

# Make predictions
lm.pred <- lm.fit %>% predict(testing)    # The pipe operator %>% comes with the tidyverse (magrittr); lm.fit enters as the first argument of predict()

# Or you can use
# lm.pred <- predict(lm.fit, testing)

# Model performance: compare the predictions the trained model makes on the testing data against the actual testing values
data.frame(
  RMSE = RMSE(lm.pred, testing$medv),  # Calculate the RMSE between the predicted and actual values
  R2 = R2(lm.pred, testing$medv)  # Compute R^2 of the model on the testing data
)
##       RMSE       R2
## 1 6.503817 0.513163
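# For reference, a sketch of what the caret helpers compute: RMSE() is the root mean squared prediction
# error, and R2() defaults to the squared correlation between predicted and observed values
sqrt(mean((lm.pred - testing$medv)^2))   # should match the RMSE above
cor(lm.pred, testing$medv)^2             # should match the R2 above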
# Plot the predicted regression line on the bivariate plot
ggplot(training, aes(lstat, medv) ) +
  geom_point(alpha = 0.2) +         # Using alpha to adjust transparency level
  labs(x = "% of lower status population (medv)", y = "Median value of owner-occupied homes (lstat)", title = "Predicted medv as a function of lstat") +
  stat_smooth(method = lm, formula = y ~ x, col = "red")  # Enter your model specification in the formula argument

## Polynomial regression of degree 2: there are two ways to do this, both straightforward
poly.2 <- lm(medv ~ poly(lstat, 2, raw = TRUE), data = training)
# poly.2 <- lm(medv ~ lstat + I(lstat^2), data = training)
poly.2 %>% summary()
## 
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3654  -3.8250  -0.6439   2.2733  25.2922 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 42.573638   0.964887   44.12   <2e-16 ***
## poly(lstat, 2, raw = TRUE)1 -2.267309   0.135846  -16.69   <2e-16 ***
## poly(lstat, 2, raw = TRUE)2  0.041198   0.004095   10.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.501 on 404 degrees of freedom
## Multiple R-squared:  0.6418, Adjusted R-squared:   0.64 
## F-statistic: 361.9 on 2 and 404 DF,  p-value: < 2.2e-16
## Polynomial regression of degree 10
lm(medv ~ poly(lstat, 10, raw = TRUE), data = training) %>%
summary()
## 
## Call:
## lm(formula = medv ~ poly(lstat, 10, raw = TRUE), data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8345  -3.0331  -0.7352   2.2464  26.6078 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    3.961e+01  3.343e+01   1.185    0.237
## poly(lstat, 10, raw = TRUE)1   1.753e+01  3.309e+01   0.530    0.596
## poly(lstat, 10, raw = TRUE)2  -1.020e+01  1.330e+01  -0.767    0.444
## poly(lstat, 10, raw = TRUE)3   2.184e+00  2.876e+00   0.759    0.448
## poly(lstat, 10, raw = TRUE)4  -2.534e-01  3.735e-01  -0.678    0.498
## poly(lstat, 10, raw = TRUE)5   1.784e-02  3.076e-02   0.580    0.562
## poly(lstat, 10, raw = TRUE)6  -7.993e-04  1.641e-03  -0.487    0.626
## poly(lstat, 10, raw = TRUE)7   2.297e-05  5.646e-05   0.407    0.684
## poly(lstat, 10, raw = TRUE)8  -4.116e-07  1.208e-06  -0.341    0.733
## poly(lstat, 10, raw = TRUE)9   4.203e-09  1.459e-08   0.288    0.773
## poly(lstat, 10, raw = TRUE)10 -1.875e-11  7.597e-11  -0.247    0.805
## 
## Residual standard error: 5.192 on 396 degrees of freedom
## Multiple R-squared:  0.6872, Adjusted R-squared:  0.6793 
## F-statistic: 86.98 on 10 and 396 DF,  p-value: < 2.2e-16
## Polynomial regression of degree 5
lm(medv ~ poly(lstat, 5, raw = TRUE), data = training) %>%
summary()
## 
## Call:
## lm(formula = medv ~ poly(lstat, 5, raw = TRUE), data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1519  -3.1235  -0.5927   2.0962  27.1286 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  6.765e+01  4.273e+00  15.831  < 2e-16 ***
## poly(lstat, 5, raw = TRUE)1 -1.177e+01  1.786e+00  -6.588 1.40e-10 ***
## poly(lstat, 5, raw = TRUE)2  1.220e+00  2.580e-01   4.727 3.16e-06 ***
## poly(lstat, 5, raw = TRUE)3 -6.385e-02  1.644e-02  -3.884 0.000120 ***
## poly(lstat, 5, raw = TRUE)4  1.577e-03  4.714e-04   3.345 0.000901 ***
## poly(lstat, 5, raw = TRUE)5 -1.459e-05  4.954e-06  -2.945 0.003421 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.205 on 401 degrees of freedom
## Multiple R-squared:  0.6816, Adjusted R-squared:  0.6777 
## F-statistic: 171.7 on 5 and 401 DF,  p-value: < 2.2e-16
# Plot the predicted regression line on the bivariate plot
ggplot(training, aes(lstat, medv) ) +
  geom_point(alpha = 0.2) +
  labs(x = "% of lower status population (medv)", y = "Median value of owner-occupied homes (lstat)", title = "Predicted medv as a function of lstat") +
  stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE), col = "red")
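# An optional sketch: loop over polynomial degrees 1 to 10 and compare test-set RMSE, reusing the
# training/testing split and the RMSE() helper from above
sapply(1:10, function(d) {
  fit <- lm(medv ~ poly(lstat, d, raw = TRUE), data = training)
  RMSE(predict(fit, testing), testing$medv)
})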

# Log transformation of X: just apply the log() function to x inside the lm() formula

lm.trans <- lm(medv ~ log(lstat), data = training)

# Make predictions
trans.pred <- lm.trans %>% predict(testing)

# Model performance
data.frame(
  RMSE = RMSE(trans.pred, testing$medv),
  R2 = R2(trans.pred, testing$medv)
)
##       RMSE        R2
## 1 5.467124 0.6570091
# Plot the predicted regression line on the bivariate plot
ggplot(training, aes(lstat, medv) ) +
  geom_point(alpha = 0.5) +   # set color transparency level
  labs(x = "% of lower status population (medv)", y = "Median value of owner-occupied homes (lstat)", title = "Predicted medv as a function of log(lstat)") +
  stat_smooth(method = lm, formula = y ~ log(x), col = "red")

# Compare the performance of polynomial regression of degree 5 against log-transformed function
poly.5 <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = training)

poly5.pred <- poly.5 %>% predict(testing)

data.frame(
  RMSE = RMSE(poly5.pred, testing$medv),
  R2 = R2(poly5.pred, testing$medv)
)
##       RMSE        R2
## 1 5.270374 0.6829474
# Now use what you just learned to compare model performance
# Step 1: combine relevant statistics into a dataframe for the ease of manipulation
comparison <- data.frame(
  model = c("log-transformed", "Poly degree 5"),
  RMSE = c(RMSE(trans.pred, testing$medv), RMSE(poly5.pred, testing$medv)),
  R2 = c(R2(trans.pred, testing$medv), R2(poly5.pred, testing$medv))
)
# Step 2: print it into a grid table
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(grid)
grid.table(comparison)

## There is an easier way to do this: simply load the ggiraph and ggiraphExtra packages
##library(ggiraph)
##library(ggiraphExtra)
##library(dplyr)
# ggPredict(lm.trans, se=TRUE, interactive=TRUE)

Other fixes: removing outliers or high-leverage data points

If only a few data points are driving our estimate, meaning that the true relationship between \(X\) and \(Y\) is not seriously affected by non-linearity, then you might consider the following fixes.

## install and load required packages
# install.packages("olsrr")
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.4.3
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
## 
##     cement
## The following object is masked from 'package:datasets':
## 
##     rivers
library(MASS)
data(Boston)
lm.fit <- lm(medv ~ lstat, data = Boston)

# Cook's distance: a measure of both residual (Y - Xb) and leverage (i-th obs's effect on Yhat) 
ols_plot_cooksd_bar(lm.fit)

# Use Cook's distance chart to detect observations that strongly influence fitted values of the model
ols_plot_cooksd_chart(lm.fit)

# DFFITS measures how much the fitted value for an observation changes when that observation is dropped (DFBETA, by contrast, measures the change in each coefficient estimate)
ols_plot_dffits(lm.fit)

# Influence plot from car package
library(car) # add method = "identify" to let you pinpoint suspicious data points
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
influencePlot(lm.fit, id = list(method = "identify"))
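# A sketch of the "removal" fix: drop observations whose Cook's distance exceeds the conventional 4/n
# cutoff (a rule of thumb, not a hard rule) and refit the model
cd <- cooks.distance(lm.fit)
lm.clean <- lm(medv ~ lstat, data = Boston[cd < 4 / nrow(Boston), ])
summary(lm.clean)$r.squared   # compare with summary(lm.fit)$r.squared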

Spline function

Set up a spline function and compare model fit with different numbers of knots.

## load required package splines
# install.packages("splines")
library(splines)

attach(df)  # Attach the data to your current R session

# Set up the knots and boundary, and generate a spline basis of x. Use the "knots" argument to set the internal breakpoints (where the curvature is allowed to change) that define the spline
# "degree": degree of the piecewise polynomial (default is 3, a cubic polynomial)
B = bs(x, knots = c(3), Boundary.knots = c(0, 10), degree = 3)   # One interior knot at x = 3, boundary knots at 0 and 10

spline.fit = lm(y ~ B)
plot(df) # We first generate a base plot and then overlay the splines of predicted values on the base plot
lines(x[x <= 3], predict(spline.fit)[x <= 3], col = "red")  # Plot the fit separately on each side of the interior knot at x = 3
lines(x[x >= 3], predict(spline.fit)[x >= 3], col = "blue")

# The prediction obtained with this spline can be compared to separate regressions on subsets of the data (the dashed lines, lty = 2)

smaller3.fit = lm(y ~ x, subset = x<=3) # on localized subset of data
larger3.fit = lm(y ~ x, subset = x>=3)
lines(x[x <= 3], predict(smaller3.fit)[x <= 3], col = "red", lty = 2)  # lty = 2 means dashed line
lines(x[x >= 3], predict(larger3.fit)[x >= 3], col = "blue", lty = 2)

# An easier way is to simply do the following:
# Specify a bs() spline basis inside the lm() formula: bs() generates a basis matrix representing the family of piecewise polynomials with the specified interior knots and degree, evaluated at x

spline3.fit = lm(y ~ bs(x, knots = c(3), Boundary.knots = c(0,10), degree=3))

# We can set up more knots and include confidence intervals around the spline
B = bs(x, knots = 1:9, Boundary.knots = c(0, 10), degree = 3)  # Nine interior knots at 1 through 9; here we first build the spline basis matrix and then pass it to lm() (note: not every modeling function accepts a pre-built basis matrix this way)
plot(df)
spline9.fit = lm(y ~ B)
lines(x, predict(spline9.fit), col = "red")

spline9.pred = predict(spline9.fit, interval = "confidence") # Add confidence interval
# The prediction matrix "spline9.pred" contains three columns: the fitted value (yhat), the lower CI bound, and the upper CI bound; enter spline9.pred[1, 1:3] to see for yourself
spline9.pred[1, 1:3]
##         fit         lwr         upr 
##  0.01830168 -0.71480678  0.75141015
# We can then plot the data and wrap the distribution of x with predicted values plus 95% CIs (upper and lower bounds)
plot(df, col = "white")
polygon(c(x, rev(x)), c(spline9.pred[, 2], rev(spline9.pred[, 3])), col = "grey", border = NA)  # Shade the confidence band; rev() reverses a vector so the polygon traces the lower bound out and the upper bound back
points(df)  # Adding data points to the base plot
lines(x, spline9.pred[, 1], col = "red") # spline9.pred[,1] are predicted values
abline(v = c(0, 1, 2, 3, 5, 8, 10), lty = 2, col = "grey") # "v" indicates the positions to place vertical lines 
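# To actually compare model fit at different knot sets (a sketch): the single-knot spline is nested within
# the 9-knot spline (same boundary knots and degree), so an F test applies; AIC gives a penalized comparison
anova(spline3.fit, spline9.fit)
AIC(spline3.fit, spline9.fit)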

detach(df)  # Remember to detach the dataframe after you finish this analysis

Smoothers

Spline, LOWESS, kernel, GAM.

## Spline and smoothing spline

library(ISLR)
data(Wage)
library(splines)
attach(Wage)
# Segment age in a more fine-grained way
agelims = range(age)
# range() returns a vector whose first element [1] is the minimum and second element [2] is the maximum of a variable
age.grid = seq(from = agelims[1], to = agelims[2])

spline.fit = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)

# Plotting
plot(Wage$age, Wage$wage, col = "grey")
lines(age.grid, predict(spline.fit, list(age = age.grid)), col = "red", lwd = 2)
abline(v = c(25, 40, 60), lty = 2, col = "lightblue")

# Alternatively, you can fit a smoothing spline with smooth.spline(), which does not require knot selection because smoothness is controlled by a penalty, specified here via the "df" option
# df is the desired equivalent (effective) degrees of freedom, a measure of the smoother's flexibility: larger df gives a wigglier fit
smooth.spline.fit = smooth.spline(age, wage, df = 16)  # 16 effective degrees of freedom (not 16 knots)
lines(smooth.spline.fit, col = "green", lwd = 2)

# Aren't they close?
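# Rather than fixing df = 16 by hand, smooth.spline() can choose the smoothing parameter automatically
# (generalized cross-validation is the default when df is not supplied); a sketch:
auto.spline.fit = smooth.spline(age, wage)
auto.spline.fit$df   # effective degrees of freedom selected by GCV
lines(auto.spline.fit, col = "purple", lwd = 2, lty = 2)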


## LOWESS smoothing with different spans (the span is often called alpha, though lowess() uses the argument "f")

lowess10 <- lowess(age, wage, f = 0.10)  # 10% smoothing span
lowess25 <- lowess(age, wage, f = 0.25)  # 25% smoothing span
lowess50 <- lowess(age, wage, f = 0.50)  # 50% smoothing span

# Plotting
plot(age, wage, main="Lowess Smoothing with different spans", xlab="Age", ylab="Wage", cex = 0.3) # Generate base plot and then overlay LOWESS lines on top
lines(lowess10, col="red")
lines(lowess25, col="green")
lines(lowess50, col="blue")

## Kernel smoothers: wider bandwidths (h) spread the kernel weights over more data points, producing smoother (flatter) curves

kernel01 <- ksmooth(age, wage, kernel = "normal", bandwidth = 0.1,
        range.x = range(age))
kernel2 <- ksmooth(age, wage, kernel = "normal", bandwidth = 2,
        range.x = range(age))
kernel10 <- ksmooth(age, wage, kernel = "normal", bandwidth = 10,
        range.x = range(age))

# Plotting
# x11(width = 4, height = 6) # Set up plotting environment if needed
plot(age, wage, main="Kernel Smoothing with different bandwidths", xlab="Age", ylab="Wage", cex = 0.3)
lines(kernel01, col="red")
lines(kernel2, col="green")
lines(kernel10, col="blue")

detach(Wage)

Generalized Additive Model (GAM)

A generalized linear model in which the response variable (\(Y\)) is related to unknown smooth functions of the predictor variables.
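In symbols, for a single predictor (a sketch of the model fit below): \(y_i = \beta_0 + f(x_i) + \varepsilon_i\), where \(f\) is an unknown smooth function estimated from the data. With several predictors the smooth terms are simply added: \(y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + \varepsilon_i\).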

## GAM function
# Require necessary package: mgcv
# install.packages("mgcv")
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Fit a GAM (Gaussian family, identity link) with wage as the dependent variable and a smooth function of age, s(age), as the explanatory variable

gam.fit <- gam(wage ~ s(age), data = Wage)
summary(gam.fit)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## wage ~ s(age)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 111.7036     0.7282   153.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(age) 5.298  6.399 44.34  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0864   Deviance explained =  8.8%
## GCV = 1594.2  Scale est. = 1590.9    n = 3000
# Plotting
plot(gam.fit, residuals = TRUE, main = "Gam function of Age to Wage")

# Improving aesthetics...

maxage <- max(Wage$age) # Get the maximum value of age
minage <- min(Wage$age) # Get the minimum value of age
age.seq <- seq(minage, maxage, length=300)  # Create a fine grid of 300 age values for prediction
age.seq <- data.frame(age = age.seq)  # Convert to dataframe

gam.pred <- predict(gam.fit, newdata = age.seq, se.fit = TRUE)  # Use the fitted model to predict on the new age grid (se.fit = TRUE also returns standard errors)

# Generate predicted values, lower bound, and upper bound
Age <- age.seq$age  # Actual values
fit <- gam.pred$fit # Predicted values
fit.upper95 <- fit + 1.96 * gam.pred$se.fit  # Upper bound of the 95% CI
fit.lower95 <- fit - 1.96 * gam.pred$se.fit  # Lower bound of the 95% CI

plot(Age, fit, type = "l", lwd=3, xlim=c(18, 80), ylim=c(60, 120), # Set the range of values for x-and y-axis 
     main="GAM function of Age to Wage",  # main title
     ylab="Predicted Wage")    # y-axis title

polygon(c(Age, rev(Age)), 
        c(fit.lower95, rev(fit.upper95)), col = "grey",
        border=NA)

lines(Age, fit, lwd=2, col = "red")

pred.orig <- predict(gam.fit)
partial.resids <- pred.orig + residuals(gam.fit)
# Overlay the partial residuals (fitted values plus residuals) as semi-transparent points
points(Wage$age, partial.resids, pch = 16, col = rgb(0, 0, 1, 0.25))

## Using gam() for logit estimation; note how the binary dependent variable is created on the fly inside the formula with I(wage > 250)

logit.gam <- gam(I(wage > 250) ~ s(age), data = Wage, family = binomial)
summary(logit.gam)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## I(wage > 250) ~ s(age)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.8932     0.1704  -22.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df Chi.sq p-value  
## s(age) 4.319  5.301  15.36  0.0124 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.00608   Deviance explained = 4.06%
## UBRE = -0.76284  Scale est. = 1         n = 3000
# You can also include s.e. estimate in the plot:
# plot(logit.gam, se = TRUE)   # Not shown here
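# A small follow-up sketch: predicted probabilities (rather than log-odds) over a grid of ages,
# obtained with type = "response" in predict(); "age.grid2" is just an illustrative evaluation grid
age.grid2 <- data.frame(age = 18:80)
prob.pred <- predict(logit.gam, newdata = age.grid2, type = "response")
plot(age.grid2$age, prob.pred, type = "l", xlab = "Age", ylab = "Pr(wage > 250)")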