Polynomial regression and the use of the \(\texttt{poly()}\) function.
set.seed(123)
n = 10
xr = seq(0, n, by = 0.1) # Generate a sequence of numerical values, from 0 to n=10, by an increment of 0.1
# Generate random data using a sine function plus some random noise
yr = sin(xr/2) + rnorm(length(xr))/2
# Combine x and y into a dataframe (df)
df = data.frame(x = xr, y = yr)
# Plot the data
plot(df)
# Regress y on x
lm.fit = lm(y ~ x, data = df)
abline(lm.fit, col = "red")
# If the degree of the polynomial is large enough, it can approximate most non-linear patterns, but be careful (see the overfitting example below)!
# Run a degree-5 polynomial regression (note how the polynomial in x, up to the 5th power, is specified)
# Inside poly(), the first argument is the variable you want to transform and the second is the degree of the polynomial
poly.5 = lm(y ~ poly(x, 5), data = df) # Degree-5 polynomial
summary(poly.5)
##
## Call:
## lm(formula = y ~ poly(x, 5), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.15647 -0.31216 -0.02439 0.31324 1.03605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17831 0.04594 3.881 0.000192 ***
## poly(x, 5)1 -5.48263 0.46171 -11.875 < 2e-16 ***
## poly(x, 5)2 -3.17422 0.46171 -6.875 6.49e-10 ***
## poly(x, 5)3 2.17617 0.46171 4.713 8.34e-06 ***
## poly(x, 5)4 0.72057 0.46171 1.561 0.121932
## poly(x, 5)5 -0.79754 0.46171 -1.727 0.087353 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4617 on 95 degrees of freedom
## Multiple R-squared: 0.6944, Adjusted R-squared: 0.6784
## F-statistic: 43.18 on 5 and 95 DF, p-value: < 2.2e-16
plot(df)
lines(xr, predict(poly.5), col = "red") # Draw a line with xr on the horizontal axis and the degree-5 polynomial's predicted values on the vertical axis; the base plot must already exist (created here by plot(df))
# Note in particular the "raw" argument inside poly(), when we set raw to FALSE (the default), poly() will compute orthogonalized polynomials, so that each polynomial term of x is orthogonal (un-related) to the previous one. The advantage of this setting is that one can see whether a certain order in the polynomial significantly improves the regression over the lower orders (by looking at the size of the coefficient); also, since the polynomial terms are orthogonal, they are easier to become statistically significant (if you're chasing p-values...). So you can use this setting as a quick visual check to see at what degree polynomial does a variable stop being significant, because the orthogonally coded (raw = FALSE) polynomial terms capture the quadratic part that has NOT been explained by the original term x. This should become clear when only the linear (1st order) term x is significant while its polynomial terms are insignificant, meaning that the relationship between x and y is largely linear as higher-order terms contribute little variance; it's the other way around when x only becomes significant beyond the second order.
# Let's fit the same model again, but this time set raw = TRUE to see the difference:
poly.5T = lm(y ~ poly(x, 5, raw=TRUE), data = df)
summary(poly.5T) # The coefficient estimates look completely different now: the raw powers of x live on very different scales (x^5 runs up to 10^5), and because the raw terms are highly correlated, the standard errors of the individual estimates are inflated.
##
## Call:
## lm(formula = y ~ poly(x, 5, raw = TRUE), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.15647 -0.31216 -0.02439 0.31324 1.03605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3101481 0.2534042 1.224 0.2240
## poly(x, 5, raw = TRUE)1 -0.2465389 0.5209432 -0.473 0.6371
## poly(x, 5, raw = TRUE)2 0.4994234 0.3268684 1.528 0.1299
## poly(x, 5, raw = TRUE)3 -0.1581306 0.0833390 -1.897 0.0608 .
## poly(x, 5, raw = TRUE)4 0.0172687 0.0092052 1.876 0.0637 .
## poly(x, 5, raw = TRUE)5 -0.0006328 0.0003663 -1.727 0.0874 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4617 on 95 degrees of freedom
## Multiple R-squared: 0.6944, Adjusted R-squared: 0.6784
## F-statistic: 43.18 on 5 and 95 DF, p-value: < 2.2e-16
# We see that when raw is set to TRUE, the p-values of the individual terms grow dramatically: none clears the 0.05 threshold, and the 3rd, 4th, and 5th order terms are only marginally significant at the 0.1 level. The reason is that the raw powers of x are highly correlated, so they split the explanatory credit and their standard errors are inflated. The overall fit (R-squared, F-statistic, fitted values) is identical to the orthogonal version.
# One takeaway: with raw = TRUE, be cautious about interpreting the individual p-values at all, since every polynomial term is highly correlated with the others; use the orthogonal coding (or a joint test of the higher-order terms) to judge whether the relationship goes beyond linear.
# Note that we can also specify a polynomial regression with the I() ("as-is") function; the result is identical to poly() with raw = TRUE (i.e., the base variable and its higher-order terms are allowed to be correlated)
# Inside a formula, I() tells R to treat x^2, x^3, ... as ordinary arithmetic rather than as formula operators
poly.5I = lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5), data = df)
summary(poly.5I)
##
## Call:
## lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.15647 -0.31216 -0.02439 0.31324 1.03605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3101481 0.2534042 1.224 0.2240
## x -0.2465389 0.5209432 -0.473 0.6371
## I(x^2) 0.4994234 0.3268684 1.528 0.1299
## I(x^3) -0.1581306 0.0833390 -1.897 0.0608 .
## I(x^4) 0.0172687 0.0092052 1.876 0.0637 .
## I(x^5) -0.0006328 0.0003663 -1.727 0.0874 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4617 on 95 degrees of freedom
## Multiple R-squared: 0.6944, Adjusted R-squared: 0.6784
## F-statistic: 43.18 on 5 and 95 DF, p-value: < 2.2e-16
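# A quick sanity check (a sketch added for illustration): the orthogonal-poly, raw-poly, and
# I() specifications parameterize the same degree-5 fit, so their fitted values coincide.
all.equal(fitted(poly.5), fitted(poly.5T))  # TRUE (up to numerical tolerance)
all.equal(fitted(poly.5I), fitted(poly.5T)) # TRUE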
# "Overfitting" the non-linearity may not be a good thing as it creates too many inflection points and substantive interpretation will be difficult
# Run a 25-degree polynomial regression
poly.25 = lm(y ~ poly(x, 25), data = df)
summary(poly.25)
##
## Call:
## lm(formula = y ~ poly(x, 25), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22857 -0.21895 0.01087 0.20265 1.11554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17831 0.04534 3.933 0.000186 ***
## poly(x, 25)1 -5.48263 0.45569 -12.032 < 2e-16 ***
## poly(x, 25)2 -3.17422 0.45569 -6.966 1.08e-09 ***
## poly(x, 25)3 2.17617 0.45569 4.776 8.67e-06 ***
## poly(x, 25)4 0.72057 0.45569 1.581 0.118019
## poly(x, 25)5 -0.79754 0.45569 -1.750 0.084172 .
## poly(x, 25)6 -0.95432 0.45569 -2.094 0.039616 *
## poly(x, 25)7 0.01844 0.45569 0.040 0.967821
## poly(x, 25)8 -0.62784 0.45569 -1.378 0.172365
## poly(x, 25)9 -0.06985 0.45569 -0.153 0.878583
## poly(x, 25)10 0.02506 0.45569 0.055 0.956296
## poly(x, 25)11 -0.19167 0.45569 -0.421 0.675240
## poly(x, 25)12 -0.59639 0.45569 -1.309 0.194608
## poly(x, 25)13 0.61533 0.45569 1.350 0.180965
## poly(x, 25)14 -0.64837 0.45569 -1.423 0.158925
## poly(x, 25)15 -0.01123 0.45569 -0.025 0.980403
## poly(x, 25)16 1.10816 0.45569 2.432 0.017408 *
## poly(x, 25)17 0.14043 0.45569 0.308 0.758813
## poly(x, 25)18 0.29925 0.45569 0.657 0.513386
## poly(x, 25)19 0.24567 0.45569 0.539 0.591405
## poly(x, 25)20 0.38016 0.45569 0.834 0.406778
## poly(x, 25)21 0.37230 0.45569 0.817 0.416507
## poly(x, 25)22 0.21186 0.45569 0.465 0.643330
## poly(x, 25)23 -0.03766 0.45569 -0.083 0.934347
## poly(x, 25)24 -0.63116 0.45569 -1.385 0.170136
## poly(x, 25)25 -0.22477 0.45569 -0.493 0.623275
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4557 on 75 degrees of freedom
## Multiple R-squared: 0.765, Adjusted R-squared: 0.6867
## F-statistic: 9.767 on 25 and 75 DF, p-value: 6.881e-15
plot(df)
lines(xr, predict(poly.25), col = "red")
# Is the polynomial regression robust to small perturbation in x values?
plot(df)
lines(xr, predict(poly.25), col = "red", lty = 1) # Set color to "red", line type (lty) to solid line lty = 1
yrm = yr # Copy the y vector so we can perturb it without touching the original
yrm[31] = yr[31] - 2 # Subtract 2 from the 31st observation
new_poly.25 = lm(yrm ~ poly(xr, 25), data = df) # Regress the new y data on the existing x variable
lines(xr, predict(new_poly.25), col = "blue", lty = 2)
# Set color to "blue", line type to dashed line lty = 2
# Add a legend at the coordinate (5.5, 2.0)
legend(5.5, 2.0, legend=c("25-degree Poly", "New 25-degree Poly"), col=c("red", "blue"), lty=1:2, cex=0.8)
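# Optional sketch: put a number on how much each curve moved after perturbing one observation.
# new_poly.5 below is a new object created only for this comparison.
new_poly.5 = lm(yrm ~ poly(xr, 5))
max(abs(predict(poly.25) - predict(new_poly.25))) # shift of the degree-25 fit
max(abs(predict(poly.5) - predict(new_poly.5)))   # shift of the degree-5 fit, typically far smaller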
Polynomial regression, train-test split, model-fitting, and plotting.
# load required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # For train-test split
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(MASS) # Where the Boston data is stored
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
theme_set(theme_classic()) # Set ggplot2 plotting theme
data("Boston") # Alternatively, you can use "data("Boston", package = "MASS")"
# Train-test split, partitioning on the dependent variable "medv" (createDataPartition stratifies on the outcome)
set.seed(123)
training.samples <- Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE) # 80-20 train-test split
training <- Boston[training.samples, ]
testing <- Boston[-training.samples, ]
ggplot(training, aes(lstat, medv) ) + # lstat on the x-axis, medv on the y-axis
geom_point(alpha = 0.2) + # Using alpha to adjust transparency level
labs(x = "% of lower status population", y = "Median value of owner-occupied homes", title = "Scatter plot of the medv vs. lstat") +
stat_smooth(col = "red") # Smoothing function
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
# Fit a linear model between medv and lstat
lm.fit <- lm(medv ~ lstat, data = training)
# Make predictions
lm.pred <- lm.fit %>% predict(testing) # The %>% pipe comes from magrittr (loaded via dplyr/tidyverse); lm.fit enters as the first argument of predict()
# Or you can use
# lm.pred <- predict(lm.fit, testing)
# Model performance: assessing the values predicted from the testing data using the model trained on the training data against the actual values of the testing data
data.frame(
RMSE = RMSE(lm.pred, testing$medv), # Calculate the RMSE between the predicted and actual values
R2 = R2(lm.pred, testing$medv) # Compute R^2 of the model on the testing data
)
## RMSE R2
## 1 6.503817 0.513163
# Plot the predicted regression line on the bivariate plot
ggplot(training, aes(lstat, medv) ) +
geom_point(alpha = 0.2) + # Using alpha to adjust transparency level
labs(x = "% of lower status population (medv)", y = "Median value of owner-occupied homes (lstat)", title = "Predicted medv as a function of lstat") +
stat_smooth(method = lm, formula = y ~ x, col = "red") # Enter your model specification in the formula argument
## Polynomial regression of degree 2. There are two ways to specify it; both are straightforward
poly.2 <- lm(medv ~ poly(lstat, 2, raw = TRUE), data = training)
# poly.2 <- lm(medv ~ lstat + I(lstat^2), data = training)
poly.2 %>% summary()
##
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3654 -3.8250 -0.6439 2.2733 25.2922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.573638 0.964887 44.12 <2e-16 ***
## poly(lstat, 2, raw = TRUE)1 -2.267309 0.135846 -16.69 <2e-16 ***
## poly(lstat, 2, raw = TRUE)2 0.041198 0.004095 10.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.501 on 404 degrees of freedom
## Multiple R-squared: 0.6418, Adjusted R-squared: 0.64
## F-statistic: 361.9 on 2 and 404 DF, p-value: < 2.2e-16
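# Optional sketch: evaluate the degree-2 model on the held-out test set, mirroring the earlier
# evaluation of the linear fit (the exact numbers depend on your train-test split).
poly2.pred <- poly.2 %>% predict(testing)
data.frame(
  RMSE = RMSE(poly2.pred, testing$medv),
  R2 = R2(poly2.pred, testing$medv)
)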
## Polynomial regression of degree 10
lm(medv ~ poly(lstat, 10, raw = TRUE), data = training) %>%
summary()
##
## Call:
## lm(formula = medv ~ poly(lstat, 10, raw = TRUE), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8345 -3.0331 -0.7352 2.2464 26.6078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.961e+01 3.343e+01 1.185 0.237
## poly(lstat, 10, raw = TRUE)1 1.753e+01 3.309e+01 0.530 0.596
## poly(lstat, 10, raw = TRUE)2 -1.020e+01 1.330e+01 -0.767 0.444
## poly(lstat, 10, raw = TRUE)3 2.184e+00 2.876e+00 0.759 0.448
## poly(lstat, 10, raw = TRUE)4 -2.534e-01 3.735e-01 -0.678 0.498
## poly(lstat, 10, raw = TRUE)5 1.784e-02 3.076e-02 0.580 0.562
## poly(lstat, 10, raw = TRUE)6 -7.993e-04 1.641e-03 -0.487 0.626
## poly(lstat, 10, raw = TRUE)7 2.297e-05 5.646e-05 0.407 0.684
## poly(lstat, 10, raw = TRUE)8 -4.116e-07 1.208e-06 -0.341 0.733
## poly(lstat, 10, raw = TRUE)9 4.203e-09 1.459e-08 0.288 0.773
## poly(lstat, 10, raw = TRUE)10 -1.875e-11 7.597e-11 -0.247 0.805
##
## Residual standard error: 5.192 on 396 degrees of freedom
## Multiple R-squared: 0.6872, Adjusted R-squared: 0.6793
## F-statistic: 86.98 on 10 and 396 DF, p-value: < 2.2e-16
## Polynomial regression of degree 5
lm(medv ~ poly(lstat, 5, raw = TRUE), data = training) %>%
summary()
##
## Call:
## lm(formula = medv ~ poly(lstat, 5, raw = TRUE), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1519 -3.1235 -0.5927 2.0962 27.1286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.765e+01 4.273e+00 15.831 < 2e-16 ***
## poly(lstat, 5, raw = TRUE)1 -1.177e+01 1.786e+00 -6.588 1.40e-10 ***
## poly(lstat, 5, raw = TRUE)2 1.220e+00 2.580e-01 4.727 3.16e-06 ***
## poly(lstat, 5, raw = TRUE)3 -6.385e-02 1.644e-02 -3.884 0.000120 ***
## poly(lstat, 5, raw = TRUE)4 1.577e-03 4.714e-04 3.345 0.000901 ***
## poly(lstat, 5, raw = TRUE)5 -1.459e-05 4.954e-06 -2.945 0.003421 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.205 on 401 degrees of freedom
## Multiple R-squared: 0.6816, Adjusted R-squared: 0.6777
## F-statistic: 171.7 on 5 and 401 DF, p-value: < 2.2e-16
# Plot the predicted regression line on the bivariate plot
ggplot(training, aes(lstat, medv) ) +
geom_point(alpha = 0.2) +
labs(x = "% of lower status population (medv)", y = "Median value of owner-occupied homes (lstat)", title = "Predicted medv as a function of lstat") +
stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE), col = "red")
# Log transformation of X: just apply the log() function to x inside the lm() formula
lm.trans <- lm(medv ~ log(lstat), data = training)
# Make predictions
trans.pred <- lm.trans %>% predict(testing)
# Model performance
data.frame(
RMSE = RMSE(trans.pred, testing$medv),
R2 = R2(trans.pred, testing$medv)
)
## RMSE R2
## 1 5.467124 0.6570091
# Plot the predicted regression line on the bivariate plot
ggplot(training, aes(lstat, medv) ) +
geom_point(alpha = 0.5) + # set color transparency level
labs(x = "% of lower status population (medv)", y = "Median value of owner-occupied homes (lstat)", title = "Predicted medv as a function of log(lstat)") +
stat_smooth(method = lm, formula = y ~ log(x), col = "red")
# Compare the performance of polynomial regression of degree 5 against log-transformed function
poly.5 <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = training)
poly5.pred <- poly.5 %>% predict(testing)
data.frame(
RMSE = RMSE(poly5.pred, testing$medv),
R2 = R2(poly5.pred, testing$medv)
)
## RMSE R2
## 1 5.270374 0.6829474
# Now use what you just learned to compare model performance
# Step 1: combine the relevant statistics into a dataframe for ease of manipulation
comparison <- data.frame(
model = c("log-transformed", "Poly degree 5"),
RMSE = c(RMSE(trans.pred, testing$medv), RMSE(poly5.pred, testing$medv)),
R2 = c(R2(trans.pred, testing$medv), R2(poly5.pred, testing$medv))
)
# Step 2: print it into a grid table
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(grid)
grid.table(comparison)
## There is an easier way to do this: simply load the ggiraph and ggiraphExtra packages
##library(ggiraph)
##library(ggiraphExtra)
##library(dplyr)
# ggPredict(lm.trans, se=TRUE, interactive=TRUE)
If only a few data points are driving our estimate, meaning that the true relationship between \(X\) and \(Y\) is not seriously affected by non-linearity, then you might consider the following fixes.
## install and load required packages
# install.packages("olsrr")
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.4.3
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
library(MASS)
data(Boston)
lm.fit <- lm(medv ~ lstat, data = Boston)
# Cook's distance: a measure that combines the residual (Y - Xb) and the leverage (the i-th observation's influence on the fitted values Yhat)
ols_plot_cooksd_bar(lm.fit)
# Use Cook's distance chart to detect observations that strongly influence fitted values of the model
ols_plot_cooksd_chart(lm.fit)
# DFFITS measures the difference in the fitted value for each observation with and without that observation included in the fit
ols_plot_dffits(lm.fit)
# Influence plot from car package
library(car) # add method = "identify" to let you pinpoint suspicious data points
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
influencePlot(lm.fit, id = list(method = "identify"))
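# Optional sketch of one possible fix: drop observations whose Cook's distance exceeds the
# common 4/n rule of thumb and refit (the cutoff is a convention, not a hard rule).
cd <- cooks.distance(lm.fit)
keep <- cd <= 4 / nrow(Boston)
lm.refit <- lm(medv ~ lstat, data = Boston[keep, ])
coef(lm.fit)
coef(lm.refit) # compare the slope on lstat before and after dropping influential points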
Set up a spline function and compare model fit with different numbers of knots.
## load required package splines
# install.packages("splines")
library(splines)
attach(df) # Attach the data to your current R session
# Set up the knots and boundary, then generate a spline basis for x. The "knots" argument sets the interior breakpoints (where the fitted curve is allowed to change shape)
# "degree": degree of the piecewise polynomial (the default, 3, gives a cubic spline)
B = bs(x, knots = c(3), Boundary.knots = c(0, 10), degree = 3) # One interior knot at x = 3, boundary knots at x = 0 and x = 10
spline.fit = lm(y ~ B)
plot(df) # We first generate a base plot and then overlay the splines of predicted values on the base plot
lines(x[x <= 3], predict(spline.fit)[x <= 3], col = "red") # Segment to the left of the knot at x = 3
lines(x[x >= 3], predict(spline.fit)[x >= 3], col = "blue") # Segment to the right of the knot
# The prediction obtained with this spline can be compared to regressions on subsets of the data (the dashed lines, lty = 2)
smaller3.fit = lm(y ~ x, subset = x<=3) # on localized subset of data
larger3.fit = lm(y ~ x, subset = x>=3)
lines(x[x <= 3], predict(smaller3.fit)[x <= 3], col = "red", lty = 2) # lty = 2 means dashed line
lines(x[x >= 3], predict(larger3.fit)[x >= 3], col = "blue", lty = 2)
# An easier way is to simply do the following:
# Specify a bs() spline function inside a lm() function: bs() function generates a matrix to represent the family of piecewise polynomials with the specified interior knots and degree, evaluated at x
spline3.fit = lm(y ~ bs(x, knots = c(3), Boundary.knots = c(0,10), degree=3))
# We can set up more knots and include confidence intervals around the spline
B = bs(x, knots = 1:9, Boundary.knots = c(0, 10), degree = 3) # Here we first generate the spline basis and then wrap it inside the lm() call (note: pre-computing the basis this way can make prediction on new data awkward, so building bs() directly inside the formula is usually safer)
plot(df)
spline9.fit = lm(y ~ B)
lines(x, predict(spline9.fit), col = "red")
spline9.pred = predict(spline9.fit, interval = "confidence") # Add confidence interval
# The prediction matrix "spline9.pred" contains three columns: the fitted value (yhat), the lower CI bound, and the upper CI bound; enter spline9.pred[1, 1:3] to see for yourself
spline9.pred[1, 1:3]
## fit lwr upr
## 0.01830168 -0.71480678 0.75141015
# We can then plot the data and overlay the fitted spline together with its 95% CI band (lower and upper bounds)
plot(df, col = "white")
polygon(c(x, rev(x)), c(spline9.pred[, 2], rev(spline9.pred[, 3])), col = "grey", border = NA) # Specify confidence intervals, rev() means the reverse of something
points(df) # Adding data points to the base plot
lines(x, spline9.pred[, 1], col = "red") # spline9.pred[,1] are predicted values
abline(v = c(0, 1, 2, 3, 5, 8, 10), lty = 2, col = "grey") # "v" indicates the positions to place vertical lines
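# Optional sketch: compare the 1-knot and 9-knot spline fits directly. The knot sets are
# nested (3 is among 1:9, with the same degree and boundary knots), so an F-test applies;
# AIC is another quick check (lower is better).
anova(spline3.fit, spline9.fit)
AIC(spline3.fit, spline9.fit)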
detach(df) # Remember to detach the dataframe after you finish this analysis
Spline, LOWESS, kernel, GAM.
## Spline and smoothing spline
library(ISLR)
data(Wage)
library(splines)
attach(Wage)
# Build a fine grid over the observed range of age
agelims = range(age)
# range() returns a vector containing the minimum ([1]) and maximum ([2]) of a variable
age.grid = seq(from = agelims[1], to = agelims[2])
spline.fit = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
# Plotting
plot(Wage$age, Wage$wage, col = "grey")
lines(age.grid, predict(spline.fit, list(age = age.grid)), col = "red", lwd = 2)
abline(v = c(25, 40, 60), lty = 2, col = "lightblue")
# Alternatively, you can fit a smoothing spline with smooth.spline(), which does not require choosing knots: smoothness is controlled by a penalty, set here through the "df" option
# df is the desired equivalent (effective) number of degrees of freedom, which governs how wiggly the fitted curve is allowed to be
smooth.spline.fit = smooth.spline(age, wage, df = 16) # 16 effective degrees of freedom
lines(smooth.spline.fit, col = "green", lwd = 2)
# Aren't they close?
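# Optional sketch: instead of eyeballing it, compare the two curves numerically on the
# common age grid.
bs.pred <- predict(spline.fit, list(age = age.grid))
ss.pred <- predict(smooth.spline.fit, x = age.grid)$y
summary(bs.pred - ss.pred) # distribution of the gap between the two fitted curves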
## LOWESS smooth with different spans (alpha value, though the function uses "f" to represent alpha)
lowess10 <- lowess(age, wage, f = 0.10) # 10% smoothing span
lowess25 <- lowess(age, wage, f = 0.25) # 25% smoothing span
lowess50 <- lowess(age, wage, f = 0.50) # 50% smoothing span
# Plotting
plot(age, wage, main="Lowess Smoothing with different spans", xlab="Age", ylab="Wage", cex = 0.3) # Generate base plot and then overlay LOWESS lines on top
lines(lowess10, col="red")
lines(lowess25, col="green")
lines(lowess50, col="blue")
## Kernel smoothers: a wider bandwidth (h) spreads the kernel weights over more data points and therefore produces a smoother, flatter curve
kernel01 <- ksmooth(age, wage, kernel = "normal", bandwidth = 0.1,
range.x = range(age))
kernel2 <- ksmooth(age, wage, kernel = "normal", bandwidth = 2,
range.x = range(age))
kernel10 <- ksmooth(age, wage, kernel = "normal", bandwidth = 10,
range.x = range(age))
# Plotting
# x11(width = 4, height = 6) # Set up plotting environment if needed
plot(age, wage, main="Kernel Smoothing with different bandwidths", xlab="Age", ylab="Wage", cex = 0.3)
lines(kernel01, col="red")
lines(kernel2, col="green")
lines(kernel10, col="blue")
detach(Wage)
A generalized additive model (GAM) is a generalized linear model in which the response variable (\(Y\)) is related, through a link function, to a sum of unknown smooth functions of the predictor variables.
## GAM function
# Require necessary package: mgcv
# install.packages("mgcv")
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# Fit a GAM with wage as the dependent variable and a smooth function of "age" as the explanatory variable
gam.fit <- gam(wage ~ s(age), data = Wage)
summary(gam.fit)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## wage ~ s(age)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7282 153.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 5.298 6.399 44.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0864 Deviance explained = 8.8%
## GCV = 1594.2 Scale est. = 1590.9 n = 3000
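# Optional sketch: check whether the smooth term earns its flexibility by comparing against a
# model with a strictly linear age effect (lin.fit is a new object created only for this
# comparison; lower AIC suggests a better fit-complexity trade-off).
lin.fit <- gam(wage ~ age, data = Wage)
AIC(lin.fit, gam.fit)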
# Plotting
plot(gam.fit, residuals = TRUE, main = "Gam function of Age to Wage")
# Improving aesthetics...
maxage <- max(Wage$age) # Get the maximum value of age
minage <- min(Wage$age) # Get the minimum value of age
age.seq <- seq(minage, maxage, length=300) # Create a fine grid of 300 age values
age.seq <- data.frame(age = age.seq) # Convert to a dataframe for predict()
gam.pred <- predict(gam.fit, newdata = age.seq, se.fit = TRUE) # Predict from the fitted model over the age grid, keeping standard errors
# Generate predicted values plus lower and upper 95% CI bounds
Age <- age.seq$age # The age grid
fit <- gam.pred$fit # Predicted values
fit.upper95 <- fit + 1.96 * gam.pred$se.fit
fit.lower95 <- fit - 1.96 * gam.pred$se.fit
plot(Age, fit, type = "l", lwd=3, xlim=c(18, 80), ylim=c(60, 120), # Set the range of values for x-and y-axis
main="GAM function of Age to Wage", # main title
ylab="Predicted Wage") # y-axis title
polygon(c(Age, rev(Age)),
c(fit.lower95, rev(fit.upper95)), col = "grey",
border=NA)
lines(Age, fit, lwd=2, col = "red")
pred.orig <- predict(gam.fit)
partial.resids <- pred.orig + residuals(gam.fit)
# Overlay the partial residuals as semi-transparent points
points(Wage$age, partial.resids, pch = 16, col = rgb(0, 0, 1, 0.25))
## Using the gam function for logit estimation; note how the binary dependent variable is created on the fly with I(wage > 250)
logit.gam <- gam(I(wage > 250) ~ s(age), data = Wage, family = binomial)
summary(logit.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## I(wage > 250) ~ s(age)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8932 0.1704 -22.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(age) 4.319 5.301 15.36 0.0124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.00608 Deviance explained = 4.06%
## UBRE = -0.76284 Scale est. = 1 n = 3000
# You can also include s.e. estimate in the plot:
# plot(logit.gam, se = TRUE) # Not shown here
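# Optional sketch: plot the fitted probability of wage > 250 over the age grid built earlier
# (age.seq), using type = "response" to move from the logit scale to the probability scale.
logit.prob <- predict(logit.gam, newdata = age.seq, type = "response")
plot(age.seq$age, logit.prob, type = "l", col = "red",
     xlab = "Age", ylab = "Pr(wage > 250)", main = "GAM logit fit of P(wage > 250) by age")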