ACTL Content
Polynomial Regression
- Polynomial regression is a global smoothing method, where the coefficients can easily be estimated by least squares
- Polynomial regression is a special case of linear regression
- Being a global method, it can be too flexible and unstable, especially near the boundaries.
- We often don't use degrees greater than 4 \((p>4)\)
\[ y_i = \beta_0+\beta_1x_i+\beta_2x_i^2+...+\beta_px_i^p+\epsilon_i\]
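- For illustration, a cubic polynomial could be fit with lm() and poly(); this sketch assumes the diabetes data used in the later examples is already loaded:
# Sketch: degree-3 polynomial regression fitted by least squares (diabetes data assumed loaded)
model.poly <- lm(log(C.Peptide) ~ poly(Age, degree = 3), data = diabetes)
age.grid <- seq(from = min(diabetes$Age), to = max(diabetes$Age), by = 0.1)
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(age.grid, predict(model.poly, newdata = data.frame(Age = age.grid)), col = 2, lwd = 2)
title('Degree-3 Polynomial Regression')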
Step Functions
- Step functions have a local structure and partition \(X\) into bins and fit different constants to each bin.
- Construct \(K+1\) new variables for \(K\) cutpoints in the range of \(X\)
\[ C_0(x) = \mathbb{I}(X<c_1),\quad C_1(x) = \mathbb{I}(c_1\leq X < c_2),\quad ...,\quad C_K(x)=\mathbb{I}(c_K\leq X) \]
- Only one \(C_k(x)\) can be non-zero for each \(X\)
  - They act like dummy variables
\[ y_i = \beta_0 + \beta_1C_1(x_i) + ... + \beta_KC_K(x_i) + \epsilon_i\]
- \(\beta_j\) represents the average increase in response relative to the baseline bin \(X<c_1\) (whose mean is \(\beta_0\))
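- As a sketch, a step function fit can be obtained with cut(), which constructs the bin indicators; the cutpoints 5 and 10 below are arbitrary illustrative choices:
# Sketch: step-function regression via cut(); breaks c(5, 10) are hypothetical cutpoints
model.step <- lm(log(C.Peptide) ~ cut(Age, breaks = c(-Inf, 5, 10, Inf)), data = diabetes)
age.grid <- seq(from = min(diabetes$Age), to = max(diabetes$Age), by = 0.1)
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(age.grid, predict(model.step, newdata = data.frame(Age = age.grid)), col = 2, lwd = 2)
title('Step Function Fit')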
Scatterplot Smoothing
- Scatterplot smoothing is a non-parametric method with a single predictor.
- The scatterplot points are treated only as points on a plane, without regard to an underlying probabilistic model
- Assumes the \(x_i\)'s are distinct and ordered
\[ y_i = f(x_i) + \epsilon_i\]
Running Mean
- Running mean smoother considers a symmetric neighbourhood of \(2k\) neighbours (\(k\) neighbours on either side).
- If there are not \(k\) observations on either side, use as many as are available.
- \(k\) is the smoothing parameter that needs to be tuned
- The higher the \(k\), the lower the flexibility: low variance but high bias
- The lower the \(k\), the higher the flexibility: high variance but low bias
- Notation:
\[ \{x_j:j\in N^S_k(x_i)\},\quad\text{where }n_{ik}\text{ is the number of indices in } N^S_k(x_i)\]
- Therefore our predictive smoother is:
\[ \hat{f}(x_i)=\frac{\sum_{j\in N^S_k(x_i)}y_j}{n_{ik}}\]
- The running mean encounters two common problems:
  - 'Boundary bias' near the end points, as neighbourhoods may be very asymmetric in these regions
  - The fit is not very smooth, as all observations inside the neighbourhood receive equal weight while those outside receive zero. Point weights change abruptly as points enter/exit the neighbourhood when moving along \(x\).
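- A minimal hand-rolled sketch of the running mean formula above (the function name running.mean and the choice k = 5 are arbitrary):
# Sketch: running mean smoother over a symmetric neighbourhood of k points on each side
running.mean <- function(x, y, k) {
  ord <- order(x)                        # assume distinct x, then sort
  x <- x[ord]; y <- y[ord]
  n <- length(x)
  fhat <- numeric(n)
  for (i in 1:n) {
    nbhd <- max(1, i - k):min(n, i + k)  # truncate the neighbourhood at the boundaries
    fhat[i] <- mean(y[nbhd])
  }
  list(x = x, y = fhat)
}
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(running.mean(diabetes$Age, log(diabetes$C.Peptide), k = 5), col = 2)
title('Running Mean Smooth')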
Running Line
- Running line smoother fits a linear regression model locally to the data in the symmetric neighbourhood \(N^S_k(x_i)\)
- Slope in the local linear regression captures the trend near the boundary to reduce boundary bias
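- A sketch of the running line idea, reusing the same neighbourhood structure but fitting a local line (running.line is a made-up helper, k = 5 arbitrary):
# Sketch: running line smoother - fit a local linear regression in each symmetric neighbourhood
running.line <- function(x, y, k) {
  ord <- order(x)
  x <- x[ord]; y <- y[ord]
  n <- length(x)
  fhat <- numeric(n)
  for (i in 1:n) {
    nbhd <- max(1, i - k):min(n, i + k)
    fit <- lm(y[nbhd] ~ x[nbhd])
    fhat[i] <- fit$coefficients[1] + fit$coefficients[2] * x[i]  # evaluate the local line at x_i
  }
  list(x = x, y = fhat)
}
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(running.line(diabetes$Age, log(diabetes$C.Peptide), k = 5), col = 4)
title('Running Line Smooth')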
Loess Algorithm
The loess algorithm builds on the local regression idea, but notes that smoother fits are possible if weights are assigned to the observations based on their distance from \(x_0\).
- A weighted least squares criterion is minimised, where weights decrease as the distance from \(x_0\) to \(x_i\) increases.
\[ \sum_i w_i(x_0)(y_i-\beta_0(x_0)-\beta_1(x_0)x_i)^2\]
Loess algorithm is an example of this, and is as follows:
- Pick a point \(x_0\) and find its \(k\) nearest values, creating the (not necessarily symmetric) neighbourhood \(N_k(x_0)\)
- Calculate the greatest distance within this neighbourhood:
\[ \Delta(x_0) = \max_{i\in N_k(x_0)}|x_0-x_i| \]
- Assign weights to each point as:
\[ W\left(\frac{|x_0-x_i|}{\Delta(x_0)}\right) \] \[ W(u) = \begin{cases} (1-u^3)^3,\quad 0\leq u\leq 1 \\ 0,\quad\text{otherwise} \end{cases}\]
- Calculate the weighted least squares line within the neighbourhood \(N_k(x_0)\) and take the fitted value at \(x_0\) as \(\hat{f}(x_0)\). Which means to minimise this criterion:
\[ \sum_i w_i(x_0)(y_i-\beta_0(x_0)-\beta_1(x_0)x_i)^2\]
- Repeat for every desired value of \(x_0\)
We can implement this algorithm in R:
- Note that the default value for the smoothing parameter is span = 2/3, which uses 2/3 of the dataset as each neighbourhood. This is not usually sensible and we can change it via the span argument.
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(loess.smooth(x = diabetes$Age, y = log(diabetes$C.Peptide), span = 0.5),
col = 2)
title('Loess Smooth')
Kernel Smoothing
- Kernel smoothing is an extension of the running mean smoother, where we form weighted averages rather than using equal weights. The weights depend on the distance of \(x_i\) from \(x_0\).
- \(h\) is a smoothing parameter called the 'bandwidth'
  - The higher the \(h\), the less flexible the fit
  - The lower the \(h\), the more flexible the fit
\[ \hat{f}(x_0)=\sum_iw_i(x_0)y_i,\quad\text{where }w_i(x_0)=\frac{K\left(\frac{x_i-x_0}{h}\right)}{\sum_jK\left(\frac{x_j-x_0}{h}\right)}\]
- The kernel function \(K(t)\) can be any symmetric function decreasing in \(|t|\), which ensures the weights decrease as the distance of \(x_i\) from \(x_0\) increases, at a rate determined by \(h\)
  - Note that \(\sum_iw_i(x_0) = 1\)
- Common kernels:
  - Standard Normal kernel: \(K(t) = \frac{1}{\sqrt{2\pi}}e^{-t^2/2}\)
  - Epanechnikov kernel: \(K(t) = \frac{3}{4}(1-t^2)\) for \(|t|\leq 1\), 0 otherwise
  - Box kernel: \(K(t) = \frac{1}{2}\) for \(|t|\leq 1\), 0 otherwise
  - Triangular kernel: \(K(t) = 1-|t|\) for \(|t|\leq 1\), 0 otherwise
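- To make the weight formula above concrete, a hand-rolled sketch with the Normal kernel (note this h is not on the same scale as the bandwidth argument of ksmooth() used below):
# Sketch: kernel smoother with Gaussian weights w_i(x0) = K((x_i - x0)/h) / sum_j K((x_j - x0)/h)
kernel.smooth <- function(x, y, h, x.grid = sort(x)) {
  fhat <- sapply(x.grid, function(x0) {
    w <- dnorm((x - x0) / h)   # standard Normal kernel
    sum(w * y) / sum(w)        # weights normalised to sum to 1
  })
  list(x = x.grid, y = fhat)
}
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(kernel.smooth(diabetes$Age, log(diabetes$C.Peptide), h = 2), col = 4)
title('Hand-rolled Kernel Smooth')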
- We can also do this in R with the built-in ksmooth() function:
- We can set the kernel to either 'box' or 'normal', and can set the bandwidth parameter.
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(ksmooth(diabetes$Age, log(diabetes$C.Peptide),
bandwidth = 5, kernel = 'normal'), col = 3)
title('Kernel Smooth')

plot(diabetes$Age, log(diabetes$C.Peptide))
lines(ksmooth(diabetes$Age, log(diabetes$C.Peptide),
bandwidth = 2, kernel = 'normal'), col = 3)
title('Kernel Smooth')
Regression Splines
- Regression splines represent the fit as a piecewise polynomial, with the regions that define the pieces separated by a sequence of knots/breakpoints.
- This offers a compromise between explicit local fits (mean/line smoothers) and global fits (polynomial regression)
- To create a smoothly connected spline, we use basis functions built from the positive-part function. The following is a generalisation of the spline basis, known as the truncated power basis of degree \(p\).
- This would create the feature matrix \(\mathbf{X}\) for the model.
- \(\kappa\) is usually called a knot, which is the breakpoint for the spline basis.
\[ 1, x, ..., x^p, (x-\kappa_1)^p_+, ..., (x-\kappa_K)^p_+\]
- Therefore the general form for the spline model is:
- Due to this form, it can be fit using least squares like a usual linear regression.
\[ f(x) = \beta_0+\beta_1x+ ... +\beta_px^p+\sum^K_{k=1}\beta_{pk}(x-\kappa_k)^p_+\]
- The linear spline model does not have a continuous first derivative, therefore it has sharp corners at the knots
- A quadratic spline has a continuous first derivative, a cubic spline has continuous first and second derivatives, etc.
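- As a sketch, the truncated power basis can be built by hand and fitted by least squares (trunc.power.basis is a made-up helper; the bs() function used below is the practical alternative):
# Sketch: construct the truncated power basis of degree p and fit by ordinary least squares
trunc.power.basis <- function(x, p, knots) {
  X <- outer(x, 1:p, '^')                                    # columns x, x^2, ..., x^p
  for (kappa in knots) X <- cbind(X, pmax(x - kappa, 0)^p)   # (x - kappa_k)_+^p columns
  X
}
knots <- quantile(diabetes$Age, probs = c(0.25, 0.5, 0.75))
X <- trunc.power.basis(diabetes$Age, p = 3, knots = knots)
model.tpb <- lm(log(diabetes$C.Peptide) ~ X)                 # the intercept supplies the '1' basis term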
Cubic Spline Regression
- We often use cubic spline regression: higher orders offer little improvement, and the cubic fit looks smooth to the naked eye at the knot points.
- Has continuous first and second derivatives
- The model is the cubic case of the general model:
- Assume ordered observations \(a = x_0 < x_1 < ... < x_n < x_{n+1}=b\)
- The natural cubic spline is linear in the end intervals \([a,x_1],[x_n,b]\)
\[ y_i = \theta_0+\theta_1x_i+\theta_2x_i^2+\theta_3x_i^3+\sum^K_{j=1}\beta_j(x_i-\kappa_j)^3_++\epsilon_i\]
- We can fit regression splines in R using the regular lm() function:
  - Note that here we specify the knot points as quantiles of the predictor.
library(splines)
knots <- quantile(diabetes$Age,
p = c(0.25, 0.5,0.75))
model.linear <- lm(log(C.Peptide)~bs(Age, degree = 1, knots = knots),
data = diabetes)
model.quadratic <- lm(log(C.Peptide)~bs(Age, degree = 2, knots = knots),
data = diabetes)
model.cubic <-lm(log(C.Peptide)~bs(Age, degree = 3, knots = knots),
data = diabetes)
age.grid <- seq(from =min(diabetes$Age), to =max(diabetes$Age), 0.1)
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(age.grid, predict(model.linear, newdata = list(Age = age.grid)), col = 2, lwd = 2)
lines(age.grid, predict(model.quadratic, newdata = list(Age = age.grid)), col = 'blue', lwd = 2)
lines(age.grid, predict(model.cubic, newdata = list(Age = age.grid)), col = 'green', lwd = 2)
Penalised Spline Regression
- The roughness of a spline fit comes from having too many knots.
- One way to overcome this is to fit with all the knots but constrain their influence. This is a similar approach to lasso/ridge regression.
- The coefficients of the spline basis functions are shrunk towards 0; this can be contrasted with explicit knot selection.
- Note that the ordinary least squares solution of a linear spline (in matrix notation) is the minimiser of:
\[ ||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2,\quad \boldsymbol{\beta} = [\beta_0,\beta_1,\underbrace{\beta_{11}, ..., \beta_{1K}}_{\text{knot coefficients}}]\]
- A useful tool is the matrix \(\mathbf{D}\):
  - This assumes one intercept, one predictor coefficient, and \(K\) knot coefficients
\[ \mathbf{D} = \begin{bmatrix}\mathbf{0}_{2\times 2} & \mathbf{0}_{2\times K}\\ \mathbf{0}_{K\times 2} & \mathbf{I}_{K\times K}\end{bmatrix} \]
- Common constraints on the knot coefficients \(\beta_{1k}\) are:
  - A good choice of \(C\) will lead to a smoother fit. This plays a similar role to \(\lambda\) for smoothers
\[ \max_k|\beta_{1k}|<C \]
\[ \sum_k|\beta_{1k}|<C\]
\[ \sum_k\beta_{1k}^2<C\]
- The third constraint is the easiest to work with.
- This constraint can be imposed by minimising \(||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2\) subject to \(\boldsymbol\beta^\top\mathbf{D}\boldsymbol{\beta}\leq C\). This is equivalent to finding the \(\boldsymbol\beta\) that minimises the following for some \(\lambda>0\):
\[ ||\mathbf{y}-\mathbf{X}\boldsymbol{\beta}||^2 +\lambda^2\boldsymbol\beta^\top\mathbf{D}\boldsymbol{\beta}\]
- The solution to this is:
\[ \hat{\boldsymbol\beta}_\lambda = (\mathbf{X}^\top\mathbf{X}+\lambda^2\mathbf{D})^{-1}\mathbf{X}^\top\mathbf{y}\]
We note that \(\lambda^2\boldsymbol\beta^\top\mathbf{D}\boldsymbol{\beta}\) is called the roughness penalty as it penalises fits that are too rough.
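- A sketch of the penalised linear spline fit, computing \(\hat{\boldsymbol\beta}_\lambda\) directly from the formula above (the knot grid and \(\lambda = 2\) are arbitrary choices):
# Sketch: penalised linear spline, beta_hat = (X'X + lambda^2 D)^{-1} X'y
x <- diabetes$Age
y <- log(diabetes$C.Peptide)
knots <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))                 # many knots, then shrink them
X <- cbind(1, x, sapply(knots, function(kappa) pmax(x - kappa, 0)))   # basis 1, x, (x - kappa_k)_+
K <- length(knots)
D <- diag(c(0, 0, rep(1, K)))                                         # penalise only the knot coefficients
lambda <- 2
beta.hat <- solve(t(X) %*% X + lambda^2 * D, t(X) %*% y)
ord <- order(x)
plot(x, y)
lines(x[ord], (X %*% beta.hat)[ord], col = 2, lwd = 2)
title('Penalised Linear Spline')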
We can also consider other penalties
- Quadratic integral penalty, where
  - \(q+1\leq p\)
  - \(x\in [a,b]\)
\[ \lambda^{2p}\int^b_{a}\{f^{(q+1)}(x)\}^2dx\]
Smoothing Spline
- One approach to creating a smoother is to minimise the RSS (residuals) subject to a roughness penalty.
- \(\lambda\) is chosen to determine how much penalty we want
- Without such a penalty \((\lambda = 0)\), the minimiser of the RSS is a simple interpolation of data.
- As \(\lambda\rightarrow 0\) the fit gets more flexible
- If we have too high a penalty \((\lambda\rightarrow\infty)\), the minimiser would be the linear least squares fit, since any curvature is penalised infinitely.
- As \(\lambda\rightarrow\infty\) the fit gets less flexible
\[ \sum_i(y_i-f(x_i))^2 + \lambda\int f^{''}(x)^2 dx \]
- The solution to this minimisation problem is the natural cubic spline with knots at \(x_1,...,x_n\)
- This is a shrunken version of the basis-expansion cubic spline, with effective degrees of freedom determined by the severity of the penalty \(\lambda\)
- We can fit cubic smoothing splines in R:
- By default, R chooses \(\lambda\) automatically using (generalised) cross-validation
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(smooth.spline(diabetes$Age, log(diabetes$C.Peptide)),
col = 'blue', lwd = 2)
Linear Smoothers
- A scatterplot smoother maps \(\mathbf{y}\) to a function \(\hat{f}(x)\). This mapping can be written as \(g(\mathbf{y}|\mathbf{x})\)
- A linear smoother is one for which the following holds (all of the smoothers we have looked at are linear):
\[ g(a\mathbf{y}_1+b\mathbf{y}_2|\mathbf{x}) = ag(\mathbf{y}_1|\mathbf{x})+bg(\mathbf{y}_2|\mathbf{x})\]
- For linear smoothers, there exists a smoothing matrix \(\mathbf{S}\) such that:
\[ \hat{f} = \mathbf{S}\mathbf{y} \]
  - This serves the same purpose as the hat matrix in linear regression
  - Each row contains the weights applied to the responses
- Expected value
\[ \mathbb{E}[\hat{f}_\lambda] = \mathbf{S}_\lambda f\]
- Variance
\[ \mathbb{V}ar(\hat{f}_\lambda) = \sigma^2\mathbf{S}_\lambda\mathbf{S}_\lambda^\top\]
- Bias
\[ \mathbb{E}[y-\hat{f}_\lambda] = f - \mathbf{S}_\lambda f\]
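- As a sketch, the smoothing matrix of a Gaussian kernel smoother can be formed explicitly to check that \(\hat{f} = \mathbf{S}\mathbf{y}\) (h = 2 is an arbitrary choice):
# Sketch: build the smoothing matrix S of a Gaussian kernel smoother evaluated at the data points
x <- diabetes$Age
y <- log(diabetes$C.Peptide)
h <- 2
S <- t(sapply(x, function(x0) {
  w <- dnorm((x - x0) / h)
  w / sum(w)                    # row i holds the weights applied to the responses, summing to 1
}))
fhat <- S %*% y                 # linear smoother: f_hat = S y
sum(diag(S))                    # trace of S, used later as the effective degrees of freedom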
How to Choose the Smoothing Parameter
- Picking an appropriate smoothing parameter is paramount for smoothing implementation. All smoothing parameters will be denoted as \(\lambda\) for this section
- \(h\) for kernel, \(k\) for loess, \(\lambda\) for cubic smoothing splines
- To evaluate performance two common metrics are:
  - MSE for the training set: \(MSE(\lambda) = \frac{1}{n}\sum^n_{i=1}(y_i-\hat{f}_\lambda(x_i))^2\)
  - PSE for new data (\(Y^*_i\) is an imaginary new observation at \(x_i\)): \(PSE(\lambda) = \frac{1}{n}\sum^n_{i=1}\mathbb{E}[(Y^*_i-\hat{f}_\lambda(x_i))^2]\)
- We can estimate the PSE by using cross-validation:
\[ CV(\lambda) = \frac{1}{n}\sum^n_{i=1}(y_i-\hat{f}^{-i}_\lambda(x_i))^2\]
- Mathematically \(\mathbb{E}[CV(\lambda)]\approx PSE(\lambda)\) if \(\hat{f}^{-i}_\lambda(x_i)\approx\hat{f}_\lambda(x_i)\) as:
- Noting \(\sigma^2 = \mathbb{E}[(y_i-f(x_i))^2]\) is the irreducible error variance
\[ \mathbb{E}[(y_i-\hat{f}^{-i}_\lambda(x_i))^2] = \sigma^2 + \mathbb{E}[(\hat{f}^{-i}_\lambda(x_i)-f(x_i))^2]\]
\[ \mathbb{E}[(Y^*_i-\hat{f}_\lambda(x_i))^2] = \sigma^2 + \mathbb{E}[(\hat{f}_\lambda(x_i)-f(x_i))^2] \]
- Computation of the CV is computationally efficient for the cubic smoothing spline:
- Mathematically this holds because adding a new point that lies on the existing fit does not change the fit (an implication of the first equation below).
- This is basically LOOCV from other models.
- Noting \(\sum_jS_{ij}(\lambda)y_j = \hat{f}_\lambda(x_i)\)
\[\hat{f}^{-i}_\lambda(x_i) = \sum_{j\neq i}\mathbf{S}_{ij}(\lambda)y_j + \mathbf{S}_{ii}\hat{f}^{-i}_\lambda(x_i) \]
\[ RSS_{CV}(\lambda) = \sum^n_{i=1}(y_i-\hat{f}^{-i}_\lambda(x_i))^2 = \sum^n_{i=1}\left(\frac{y_i-\hat{f}_\lambda(x_i)}{1-\mathbf{S}_{ii}(\lambda)}\right)^2\]
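- A sketch of this shortcut using the leverage values returned by smooth.spline() (the same fit$lev used in the confidence band code later):
# Sketch: LOOCV shortcut for the cubic smoothing spline using the diagonal elements S_ii (fit$lev)
fit <- smooth.spline(diabetes$Age, log(diabetes$C.Peptide))
rss.cv <- sum(((fit$yin - fit$y) / (1 - fit$lev))^2)   # sum of squared deleted residuals
cv <- rss.cv / length(fit$yin)                         # CV(lambda), an estimate of the PSE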
- Generalised cross-validation is another way to approximate the PSE. In GCV the diagonal terms of the smoothing matrix \(\mathbf{S}_{ii}(\lambda)\) are replaced by the average of the diagonal, \(\frac{tr(\mathbf{S}_\lambda)}{n}\)
- This can have the statistical advantage of giving less weight to points which are influential in the smooth (having large \(\mathbf{S}_{ii}\))
- This can be implemented in R through a logical argument in the smooth.spline() function:
- cv = TRUE will result in ordinary (leave-one-out) CV being used in the fit rather than GCV (the default)
smooth.cv <- smooth.spline(diabetes$Age,
log(diabetes$C.Peptide), cv=T)
smooth.gcv <- smooth.spline(diabetes$Age,
log(diabetes$C.Peptide))
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(smooth.cv, col = 'red', lwd = 2)
lines(smooth.gcv, col = 'blue', lwd = 2, lty = 2)
- Another way to define the smoothness of the smoother is the degrees of freedom, which can be found from the trace of the smoothing matrix.
- The effective degrees of freedom can be interpreted as the amount of ‘coefficients’ in a regular fit
\[ tr(\mathbf{S}_\lambda) = DOF\]
- In R we can specify the degrees of freedom to be used for the smoothness, rather than having \(\lambda\) chosen by (G)CV
smooth.df5 <- smooth.spline(diabetes$Age,
log(diabetes$C.Peptide), df = 5)
smooth.df10 <- smooth.spline(diabetes$Age,
log(diabetes$C.Peptide), df = 10)
smooth.df20 <- smooth.spline(diabetes$Age,
log(diabetes$C.Peptide), df = 20)
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(smooth.df5, col = 'green', lwd = 2)
lines(smooth.df10, col = 'red', lty = 2, lwd = 2)
lines(smooth.df20, col = 'blue', lty = 3, lwd = 2)
Confidence Bands for Smoothers
- Utilising the covariance matrix of \(\hat{f}\) and some added assumptions, we can create approximate 95% confidence intervals
- This requires that the bias of \(\hat{f}\) is small, the errors are Gaussian, and we have an estimate \(\hat{\sigma}\) of \(\sigma\)
\[ Cov(\hat{f}) = \sigma^2\mathbf{S}_\lambda\mathbf{S}^\top_{\lambda}\]
- CI:
\[ \hat{f}(x_i)\pm 1.96\hat{\sigma}\sqrt{(\mathbf{S}_\lambda\mathbf{S}^\top_\lambda)_{ii}}\]
- Approximations for \(\hat{\sigma}^2\):
  - CV RSS: if the LOOCV shortcut formula holds we can use
\[ \hat{\sigma}^2 = \frac{1}{n}\sum^n_{i=1}\left(\frac{y_i-\hat{f}_\lambda(x_i)}{1-\mathbf{S}_{ii}(\lambda)}\right)^2 = \frac{RSS_{CV}(\lambda)}{n}\]
- In R we can create these confidence intervals:
- fit$lev gives the diagonal elements of the smoothing matrix \(\mathbf{S}_\lambda\)
- fit$yin gives the original responses
- fit$y gives the smoothed responses
fit <- smooth.spline(diabetes$Age,log(diabetes$C.Peptide))
res <- (fit$yin-fit$y)/(1-fit$lev)
sigma <- sqrt(mean(res^2))
upper <- fit$y+1.96*sigma*sqrt(fit$lev)
lower <- fit$y-1.96*sigma*sqrt(fit$lev)
plot(diabetes$Age, log(diabetes$C.Peptide))
lines(fit, col = 2)
lines(fit$x, upper, lty = 2, col = 3)
lines(fit$x, lower, lty = 2, col = 3)
Generalised Additive Models
- Generalising the smoothing methods to several predictors would give the model:
\[ y_i = f(x_{i1},...,x_{ik})+\epsilon_i\]
- However smoothing methods struggle in higher dimensions due to the curse of dimensionality
- As dimensionality increases, larger neighbourhoods and more data are needed to obtain a reasonable number of neighbours
- A possible solution is additive models
- Assumes enough structure to the mean response to beat the curse of dimensionality while maintaining interpretability
- A large limitation however is that it is based on the assumption of no interaction
\[ y_i = \beta_0+\sum^k_{j=1}f_j(x_{ij})+\epsilon_i\]
We can interpret the impact of each predictor on the mean response thanks to the additive structure:
- As there is no interaction, we can also understand the impact of each predictor on the mean response by plotting the smooth univariate functions \(f_j\)
- For an increase of \(\Delta\) in \(x_j\), the mean response changes by:
\[ f_j(x_j+\Delta) - f_j(x_j)\]
In R we can fit these additive models using the mgcv package:
- We can select whether we want each variable smoothed or linear and what smoothing technique to use
library(mgcv)
par(mfrow = c(1,2))
diabetes <- read.table('diabetes.txt', head = T)
diabetes.gam1 <- gam(log(C.Peptide) ~ s(Age) + s(Base.Deficit),
                     data = diabetes)
plot(diabetes.gam1)
- While by default additive models do not have interactions, we can attempt to capture some interactions by creating smooth univariate functions of products of covariates.
\[ y_i = f_1(x_{i1})+f_2(x_{i2})+f_{12}(x_{i1}x_{i2})+\epsilon_i\]
- In R this could be done as sketched below:
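- One possible sketch, creating the product covariate explicitly first (the variable name Age.BD is made up for illustration):
# Sketch: add a smooth of the product of two covariates as a crude interaction term
diabetes$Age.BD <- diabetes$Age * diabetes$Base.Deficit
diabetes.gam2 <- gam(log(C.Peptide) ~ s(Age) + s(Base.Deficit) + s(Age.BD),
                     data = diabetes)
plot(diabetes.gam2, pages = 1)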
Backfitting Algorithm
- GAM uses a backfitting algorithm to fit each univariate smoothed model
- This is done using transformed responses
\[ y_i = \sum_jf_j(x_{ij}) + \epsilon_i\]
\[ y_i-\sum_{j\neq k} f_j(x_{ij}) = f_k(x_{ik})+\epsilon_i \]
\[ z_i = f_k(x_{ik}) + \epsilon_i\]
- The algorithm makes an initial guess, often the linear model, and successively estimates each univariate form using scatterplot smoothers holding the other terms constant
- Form \(z_i = y_i-\sum_{l\neq j}\hat{f}_l(x_{il})\)
- Replace \(\hat{f}_j\) by scatterplot smooth of \(z_i\) against \(x_{ij}\)
- Repeat until convergence
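- A minimal backfitting sketch for two predictors, using cubic smoothing splines as the scatterplot smoother and a fixed number of iterations instead of a convergence check (backfit is a made-up helper):
# Sketch: backfitting with two predictors; each f_j is re-estimated from partial residuals
backfit <- function(x1, x2, y, n.iter = 20) {
  beta0 <- mean(y)
  f1 <- rep(0, length(y)); f2 <- rep(0, length(y))
  for (it in 1:n.iter) {
    z1 <- y - beta0 - f2                                 # partial residuals for x1
    f1 <- predict(smooth.spline(x1, z1), x1)$y
    f1 <- f1 - mean(f1)                                  # centre for identifiability
    z2 <- y - beta0 - f1                                 # partial residuals for x2
    f2 <- predict(smooth.spline(x2, z2), x2)$y
    f2 <- f2 - mean(f2)
  }
  list(beta0 = beta0, f1 = f1, f2 = f2)
}
fit.bf <- backfit(diabetes$Age, diabetes$Base.Deficit, log(diabetes$C.Peptide))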
Projection Pursuit Regression
- Extension of additive model that makes allowance for interactions
- \(\mathbf{x}_i = (x_{i1},...,x_{ik})\) is the vector of predictors
- \(\alpha_j\) are \(k\)-dimensional unit vectors (projection directions)
\[ y_i = \alpha_0 + \sum^M_{j=1}f_j(\alpha_j^\top\mathbf{x}_i)+\epsilon_i\]
- The goal is to find the best set of projections of the predictor vector: \(M\) one-dimensional projections of the original predictors are created, and the smooth univariate functions of these projections are added up
- Allows for interaction due to direction of projection
- PPR is an additive model of the derived features \(Z_j = \alpha^\top_j\mathbf{x}_i\)
- \(f_j\) varies only in the direction defined by the vector \(\alpha_j\)
- This can be implemented in R
- nterms is the number of projections \(M\) kept in the final model
- max.terms is the maximum number of projections to add (added one at a time); the model is then pruned back to the best nterms (\(=M\)) projections
- sm.method is the smoothing method used
- alpha stores the M projection directions, giving the weight of each predictor in each projection
diabetes.ppr <- ppr(log(C.Peptide)~Age+Base.Deficit,
                    nterms = 1, max.terms = 5,
                    sm.method = 'gcvspline', data = diabetes)
diabetes.ppr$alpha
##          Age Base.Deficit
##   -0.9938018    0.1111668
Generalised Additive Models
- Generalised additive models allow the distribution of the \(i\)th response to have non-linear dependence on the covariates \(x_i\)
- Responses are independent with density function from the exponential family
\[ g(\mu_i) = \sum^k_{j=1}f_j(x_{ij}) \]
The penalised least squares criterion can be generalised through the use of the likelihood function of the exponential-family response
- This can be proven by taking the normal example and rearranging
- This minimisation problem still results in a natural cubic spline, fitted via Fisher scoring
- Backfitting is used within each Fisher scoring step, with the final estimate being a cubic spline
- Smoothers other than the cubic spline can be used in the backfitting algorithm if desired
\[ -2\phi\ell(f)+\sum^k_{j=1}\lambda_{j}\int^{b_j}_{a_j}f^{''}_j(x_j)^2dx_j\]
This can be implemented in R:
spam <- read.table('spam2.txt', head = T, sep = ',')
spam.gam <- gam(isjunk~s(freq.excl)+s(freq.dollar),
data=spam[spam$freq.excl<10,], family = binomial)
summary(spam.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## isjunk ~ s(freq.excl) + s(freq.dollar)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.37295 0.07203 -5.178 2.25e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(freq.excl) 8.063 8.739 799.4 <2e-16 ***
## s(freq.dollar) 5.792 6.213 601.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.502 Deviance explained = 42.6%
## UBRE = -0.22356 Scale est. = 1 n = 4599