Basado en Biostatistics

knitr::opts_chunk$set(echo = TRUE)
library(fda)
## Warning: package 'fda' was built under R version 4.4.3
## Cargando paquete requerido: splines
## Cargando paquete requerido: fds
## Cargando paquete requerido: rainbow
## Cargando paquete requerido: MASS
## Warning: package 'MASS' was built under R version 4.4.3
## Cargando paquete requerido: pcaPP
## Cargando paquete requerido: RCurl
## Cargando paquete requerido: deSolve
## Warning: package 'deSolve' was built under R version 4.4.3
## 
## Adjuntando el paquete: 'fda'
## The following object is masked from 'package:graphics':
## 
##     matplot
library(refund)
## Warning: package 'refund' was built under R version 4.4.3

Overview

\(\circ\;\) Scatterplot smoothing using basis expansions

\(\;\)

Motivation: Canadian Weather Data

data(CanadianWeather)
attach(CanadianWeather)
y.precip=dailyAv[,,2]
l = which(place=="Vancouver") 
t.day = 1:365  
y=y.precip[,l]
plot(t.day, y, xlab='Day', ylab='Precipitation', type='b')
title('Average Daily Precipitation in Vancouver (BC)')

Motivation: Canadian Weather Data

plot(t.day, y, xlab='Day', ylab='Precipitation', type='b')
title('Average Daily Precipitation in Vancouver (BC)')
lines(lowess(t.day,y),col=4, lwd=3)
lines(smooth.spline(t.day, y),lwd=3,col=2)
legend(100,8, c("Lowess Smoothing", "B-spline (GCV)"), col=c(4,2), lwd=3)

Some R packages for FDA

The Smoothing Problem

Let \(y_i \in \mathbb{R}\) and \(x_i\in D\), \(D\) compact

The smoothing problem consists in estimating the function \(g(x)\) in the model

\[y_i(x_i) = g(x_i) + \epsilon_i(x_i), \mbox{ with } E[\epsilon_i(x_i)] = 0\]

Estimation is often carried out assuming \(g(x):D \rightarrow \mathbb{R}\), in a function space with specific properties, e.g. continuity and differentiability … smoothness

\(\;\)

Basis Functions

\[g(x)\approx \sum_{j=1}^k c_j B_j(x)\]

Examples of Basis Functions

x <- seq(0,1,length=1000)
par(mfrow=c(1,3))
# Monomial
mon.b <- create.monomial.basis(c(0,1),5)
B.mon <- eval.basis(x, mon.b)
matplot(x, B.mon, type='l', lwd=2)
title('Monomial basis')
# Fourier
fou.b <- create.fourier.basis(c(0,1),5)
B.fou <- eval.basis(x, fou.b)
matplot(x, B.fou, type='l', lwd=2)
title('Fourier basis')
# B-splines
spl.b <- create.bspline.basis(c(0,1),nbasis=5)
B.spl <- eval.basis(x, spl.b)
matplot(x, B.spl, type='l', lwd=2)
title('B-spline basis')

Monomial Basis

\[g(x) \approx a_0 + a_1 x + a_2 x^2 + \cdots + a_k x^{(k-1)}\] Proposition [Polynomial Approximations]

Let \(g\in C^\alpha[0,1]\), be the set of functions on \([0,1]\) with \(\alpha\) continuous derivatives. There exists a constant \(M < \infty\) that depends only on \(\alpha\), s.t. \(\forall g\in C^\alpha[0,1]\), and \(k\in \mathbb{N}\), \(\exists\) a polynomial \(P\) of degree \(k\), s.t.

\[||g - P||_\infty = \sup_{x\in D}|g(x) - P(x)| \leq M k^{-\alpha}||g||_{C^\alpha}\] - The uniform distance between a function \(g\) and the closest polynomial of degree \(k\) is of the order \(k^{-\alpha}\)

Monomial Basis

Advantages

Limitations

Splines

Definition [Spline Function] Given \(k\) knots \(a=t_0<t_1<\cdots<t_{k-1}=b\), a function \(f(x):[a,b]\rightarrow\mathbb{R}\) is a spline of order \(q\), if the restriction \(f_{[t_j, ,t_{j+1}]}(x)\) of \(f\) to any sub-interval \([t_j, t_j+1]\) is a polynomial of degree \(q-1\), and \(f\in C^{q-2}\), provided \(q \geq 2\).

B-Splines

There are many classes of spline functions. The class of \(B\)-splines is often used for computational convenience

x <- seq(0,1,length=1000)
par(mfrow=c(1,3))
# B-splines
spl.b <- create.bspline.basis(c(0,1), breaks=c(0,0.333,0.666,1), norder=2)
B.spl <- eval.basis(x, spl.b)
matplot(x, B.spl, type='l', lwd=2)
abline(v=c(0,0.333,0.666,1))
points(c(0,0.333,0.666,1), rep(0,4), pch=19, cex=2)
title('B-splines (Order 2)')
#
spl.b <- create.bspline.basis(c(0,1), breaks=c(0,0.333,0.666,1), norder=3)
B.spl <- eval.basis(x, spl.b)
matplot(x, B.spl, type='l', lwd=2)
abline(v=c(0,0.333,0.666,1))
points(c(0,0.333,0.666,1), rep(0,4), pch=19, cex=2)
title('B-spline (Order 3)')
#
spl.b <- create.bspline.basis(c(0,1), breaks=c(0,0.333,0.666,1), norder=4)
B.spl <- eval.basis(x, spl.b)
matplot(x, B.spl, type='l', lwd=2)
abline(v=c(0,0.333,0.666,1))
points(c(0,0.333,0.666,1), rep(0,4), pch=19, cex=2)
title('B-spline (Order 4)')

B-Splines Properties

Let \(\{B_j(x)\}_{j=1}^{J=(k+q)}\) be a set of B-spline basis functions with \(k\) knots or order \(q\). We have:

\(\;\)

B-Splines Properties

Proposition [Spline Approximations]

Let \(g\in C^\alpha[0,1]\), be the set of functions on \([0,1]\) with \(\alpha\) continuous derivatives. Consider a set of \(J\) spline basis functions of order \(q\geq\alpha>0\), s.t. \[g(x) \approx \sum_{j=1}^J\theta_j B_j(x) = \boldsymbol{\theta}^T B_J.\]

There exists a constant \(M(q,\alpha) < \infty\) that depends only on \(\alpha\) and \(q\), s.t. \(\forall g\in C^\alpha[0,1]\), we can find \(\boldsymbol{\theta}\in\mathbb{R}^J\), with \(||\boldsymbol{\theta}||_\infty < ||g||_{C^\alpha}\) ,s.t.

\[||g - \boldsymbol{\theta}^TB_J||_\infty \leq M(\alpha, q) J^{-\alpha}||g||_{C^\alpha}\] - The uniform distance between a function \(g\) and the closest spline approximation of order \(q\geq \alpha\) is of the order \(J^{-\alpha}\)

Smoothing with Splines

\[y(x) \approx \sum_{j=1}^J\theta_{j}B_j(x) + \epsilon(x) = \boldsymbol{\theta}^T{\bf B}_J(x) + \epsilon(x)\]

\[{\bf B}_J({\bf x}) = [1, B_1({\bf x}),\ldots,B_J(\bf x)]\]

\[\boldsymbol{\hat{\theta}} = \{{\bf B}_J({\bf x})^T{\bf B}_J({\bf x})\}^{-1}{\bf B}_J({\bf x})^T {\bf y}\]

Smoothing with Splines (LS)

\[\boldsymbol{\hat{\theta}} = \{{\bf B}_J({\bf x})^T{\bf B}_J({\bf x})\}^{-1}{\bf B}_J({\bf x})^T {\bf y}\]

Challenges

  1. What spline order \(q\)?

  2. How many knots \(k\)?

  3. Where to place knots?

Example: Canadian Weather Data

plot(t.day, y, xlab='Day', ylab='Precipitation')
title('Average Daily Precipitation in Vancouver (BC)')
#
# B-splines
x <- t.day
spl.b1 <- create.bspline.basis(c(1,365), nbasis=5, norder=2)
B1 <- eval.basis(x, spl.b1)
th1 <- solve(t(B1)%*%B1, t(B1)%*%y)
#
spl.b2 <- create.bspline.basis(c(0,365), nbasis=5, norder=4)
B2 <- eval.basis(x, spl.b2)
th2 <- solve(t(B2)%*%B2, t(B2)%*%y)
#
spl.b3 <- create.bspline.basis(c(0,365), nbasis=20, norder=4)
B3 <- eval.basis(x, spl.b3)
th3 <- solve(t(B3)%*%B3, t(B3)%*%y)
#
lines(t.day,B1%*%th1, col=4, lwd=3)
lines(t.day,B2%*%th2, col=3, lwd=3)
lines(t.day,B3%*%th3, col=2, lwd=3)
legend(100,8, c("(J=5, q=2)", "(J=5, q=4)", "(J=20, q=4)"), col=c(4,3,2), lwd=3)

Selecting the Number of Basis Functions

\(\;\)

Mean Square Error of \(\hat{g}_J(x) = \boldsymbol{\hat{\theta}}^TB_J(x)\),

\[MSE\{\hat{g}_J\} := E[\{\hat{g}_J - g\}^2] = [E\{\hat{g}_J - g\}]^2 + E[\{\hat{g}_J - E(\hat{g}_J)\}^2]\]

\(\;\)

\(\;\)

Estimation Using Roughness Penalties

\[PSSE_\lambda = ||{\bf y} - g({\bf x})||^2 + \lambda P(g)\] where \(P(g)\) measures ``roughness’’ of the function \(g\) and \(\lambda>0\) defines a trade off between goodness of fit (bias) and smoothness (variance)

Penalized Regression Splines

\(\;\)

Example: Canadian Weather

#
# B-splines
x <- t.day
spl.b <- create.bspline.basis(c(0,365), nbasis=365, norder=4)
B <- eval.basis(x, spl.b)
D <- bsplinepen(spl.b)
#
par(mfrow=c(1,3))
#
lambda <- 0.001
th <- solve(t(B)%*%B + lambda*D, t(B)%*%y)
plot(t.day, y, xlab='Day', ylab='Precipitation')
title('lambda=0.001')
lines(t.day,B%*%th, col=4, lwd=3)
#
lambda <- 10^2
th <- solve(t(B)%*%B + lambda*D, t(B)%*%y)
plot(t.day, y, xlab='Day', ylab='Precipitation')
title('lambda=10^2')
lines(t.day,B%*%th, col=3, lwd=3)
#
lambda <- 10^4
th <- solve(t(B)%*%B + lambda*D, t(B)%*%y)
plot(t.day, y, xlab='Day', ylab='Precipitation')
title('lambda=10^4')
lines(t.day,B%*%th, col=2, lwd=3)

Derivation of the penalty matrix \(D\)

The roughness penalty matrix \(D\) arises as follows:

$$ \[\begin{array}{lll} \int [g(x)'']^2 dx &=&\int\,[\boldsymbol{\theta}^T{\bf B}(x)][\boldsymbol{\theta}^T{\bf B}(x)]^Tdx\\[.2in] &=& \boldsymbol{\theta}^T\int {\bf B}(x){\bf B}(x)^T dx\;\boldsymbol{\theta}\\[.2in] &=& \boldsymbol{\theta}^T D \;\boldsymbol{\theta} \end{array}\]

Overview

\(\circ\;\) Scatterplot smoothing using basis expansions

\(\;\)

Motivation: Canadian Weather Data

data(CanadianWeather)
attach(CanadianWeather)
## The following objects are masked from CanadianWeather (pos = 5):
## 
##     coordinates, dailyAv, geogindex, monthlyPrecip, monthlyTemp, place,
##     province, region
y.precip=dailyAv[,,2]
l = which(place=="Vancouver") 
t.day = 1:365  
y=y.precip[,l]
plot(t.day, y, xlab='Day', ylab='Precipitation', type='b')
title('Average Daily Precipitation in Vancouver (BC)')

Smoothing with Splines

\[y(x) \approx \sum_{j=1}^J\theta_{j}B_j(x) + \epsilon(x) = \boldsymbol{\theta}^T{\bf B}_J(x) + \epsilon(x)\]

\[{\bf B}_J({\bf x}) = [1, B_1({\bf x}),\ldots,B_J(\bf x)]\]

\[\boldsymbol{\hat{\theta}} = \{{\bf B}_J({\bf x})^T{\bf B}_J({\bf x})\}^{-1}{\bf B}_J({\bf x})^T {\bf y}\]

Estimation Using Roughness Penalties

\[PSSE_\lambda = ||{\bf y} - g({\bf x})||^2 + \lambda P(g)\] where \(P(g)\) measures ``roughness’’ of the function \(g\) and \(\lambda>0\) defines a trade off between goodness of fit (bias) and smoothness (variance)

Penalized Regression Splines

Example: Canadian Weather

#
# B-splines
x <- t.day
spl.b <- create.bspline.basis(c(0,365), nbasis=365, norder=4)
B <- eval.basis(x, spl.b)
D <- bsplinepen(spl.b)
#
par(mfrow=c(1,3))
#
lambda <- 0.001
th <- solve(t(B)%*%B + lambda*D, t(B)%*%y)
plot(t.day, y, xlab='Day', ylab='Precipitation')
title('lambda=0.001')
lines(t.day,B%*%th, col=4, lwd=3)
#
lambda <- 10^2
th <- solve(t(B)%*%B + lambda*D, t(B)%*%y)
plot(t.day, y, xlab='Day', ylab='Precipitation')
title('lambda=10^2')
lines(t.day,B%*%th, col=3, lwd=3)
#
lambda <- 10^4
th <- solve(t(B)%*%B + lambda*D, t(B)%*%y)
plot(t.day, y, xlab='Day', ylab='Precipitation')
title('lambda=10^4')
lines(t.day,B%*%th, col=2, lwd=3)

Selection of Regularization \(\lambda\) through Cross Validation

Inference for Linear Smoothers

\[ \begin{array}{lllllll} \left[\begin{array}{c} \hat{g}_\lambda(x_1)\\ \vdots\\ \hat{g}_\lambda(x_n)\end{array}\right] & = & {\bf B}\hat{\boldsymbol{\theta}}_\lambda & = & {\bf B}\,[{\bf B}^T{\bf B} + \lambda D]^{-1}{\bf B}^T{\bf y} & = & S_\lambda\,{\bf y} \end{array} \]

\[\left[\{{\bf B}^T{\bf B} + \lambda D\}^{-1}{\bf B}^T {\bf B} \,\{{\bf B}^T{\bf B} + \lambda D\}^{-1}\right]^{-1/2}(\hat{\boldsymbol{\theta}}_\lambda - \boldsymbol{\theta})\rightarrow_d N(0, \sigma^2 I)\]

\[ \begin{array}{lll} E\{\hat{g}_\lambda(x)\} & = & S_\lambda(x)^TE({\bf y}) = S_\lambda(x)^Tg(x)\\[.1in] Var\{\hat{g}_\lambda(x)\} & = & Var\{S_\lambda(x)^T {\bf y}\} = \sigma^2 S_\lambda(x)^T S_\lambda(x) = \sigma^2||S(x)_\lambda||^2 \end{array} \]

Inference for Linear Smoothers

\[ \frac{\hat{g}_\lambda(x) - g(x) - b_\lambda(x)}{\sigma||S_\lambda(x)||}\rightarrow_d N(0,1) \]

\[ \frac{\hat{g}_\lambda(x) - g(x)}{\sigma||S_\lambda(x)||} = \frac{\hat{g}_\lambda(x) - \bar{g}_\lambda(x)}{\sigma||S_\lambda(x)||} + \frac{\bar{g}_\lambda(x) - g(x)}{\sigma||S_\lambda(x)||} = Z_n(x) + \frac{b_\lambda(x)}{\sigma||S_\lambda(x)||} \]

Smoothing as a Linear Mixed Model

\[\left\{1, x, \ldots,x^p,|x-\eta_1|^p_+,\ldots, |x-\eta_k|^p_+\right\}\]

x   <- seq(0, 1, length=500)
eta <- c(0.25,0.5,0.75)
sp1 <- function(x, eta, p){
  J <- p + length(eta) + 1
  B <- matrix(1, nrow=length(x), ncol=J)
  for(i in 2:(p+1)) B[,i] <- x^(i-1) 
  k <- 1
  for(i in (p+2):J){
    x1 <- (x-eta[k])
    x1[x1<0] <- 0.0
    B[,i] <- x1^p
    k <- k+1
  }
  return(B)
}
par(mfrow=c(1,3))
B1 <- sp1(x, eta, 1)
B2 <- sp1(x, eta, 2)
B3 <- sp1(x, eta, 3)
#
matplot(x, B1, type='l', xlab='x', ylab='B(x)', lwd=2)
abline(v=eta,, col='grey')
points(eta, rep(0,3), pch=19, cex=2)
title("Linear")
#
#
matplot(x, B2, type='l', xlab='x', ylab='B(x)', lwd=2)
abline(v=eta, col='grey')
points(eta, rep(0,3), pch=19, cex=2)
title("Quadratic")
#
#
matplot(x, B3, type='l', xlab='x', ylab='B(x)', lwd=2)
abline(v=eta, col='grey')
points(eta, rep(0,3), pch=19, cex=2)
title("Cubic")

#

Linear Mixed Model Representation

\[ \begin{array}{ll} {\bf y \mid b} & \sim \hspace{2in}\\[.2in] {\bf y} &\sim \end{array} \]

Estimation of \(g(x)\) through LMM

\[\hat{\boldsymbol{\beta}} = \{X^T\hat{V}^{-1}X\}X^T\hat{V}^{-1}{\bf y}\]

\[{\bf\tilde{b}} = \hat{\sigma}^2_b Z^T\hat{V}^{-1}({\bf y} - X\hat{\boldsymbol{\beta}})\]

\[\hat{\boldsymbol \theta} = [C^TC + \lambda D]C^T{\bf y}\]

Estimation of Variance Components

\[\ell_p(\alpha) = -\frac{1}{2}\log|V(\alpha)| - \frac{1}{2}({\bf y} - X^T\hat{\boldsymbol \beta}(\alpha))^TV(\alpha)^{-1}({\bf y} - X^T\hat{\boldsymbol \beta}(\alpha))\]

\[\ell_R(\alpha) = -\frac{1}{2}\log|X^TV(\alpha)^{-1}X| + \ell_p(\alpha)\]

Bayesian Interpretation

LMM and Uncertainty around \(\hat{g}(x)\)

\[Var\{\hat{g}(x)\mid {\bf b}\} = \sigma^2_\epsilon ||S_\lambda(x)||^2\]

\[Cov\left.\left(\begin{array}{c} \hat{\boldsymbol{\beta}}\\\tilde{\bf b}\end{array} \right. \mid b\right) = \sigma^2_\epsilon (C^TC + \lambda D) ^{-1}C^TC (CC+\lambda D)^{-1}\]

A Study of Bias

\[ \begin{array}{lll} E\{\hat{g} - g \mid {\bf b}\} &=& E\{X\hat{\boldsymbol{\beta}} - X\boldsymbol{\beta} + Z\tilde{\bf b} - Z{\bf b}\}\\[.1in] &=&X\{E(\hat{\boldsymbol \beta}\mid {\bf b}) - \boldsymbol{\beta}\} + Z\{E(\tilde{\bf b}\mid {\bf b}) - {\bf b}\}\\[.1in] &=&-\frac{\sigma^2_\epsilon}{\sigma^2_b} C\left(C^TC + \frac{\sigma^2_\epsilon}{\sigma^2_b}D\right)^{-1} \left[\begin{array}{c}0\\{\bf b}\end{array}\right] \end{array} \]

\[E(\hat{g} - g) = 0\]

\[Var(\hat{g} - g) = \sigma^2_\epsilon \,C\left(CC + \frac{\sigma^2_\epsilon}{\sigma^2_b}D\right)^{-1}C^T\]

Canadian Weather

y.precip=dailyAv[,,2]
l = which(place=="Vancouver") 
t.day = 1:365  
y=y.precip[,l]
par(mfrow=c(2,3))
plot(t.day, y, xlab='Day', ylab='Precipitation')
title('GCV')
fit1 <- gam(y~s(t.day), method='GCV.Cp')
lines(t.day, fit1$fitted.values, lwd=2, col=2)
plot(t.day, y, xlab='Day', ylab='Precipitation')
title('REML')
fit2 <- gam(y~s(t.day), method='REML')
lines(t.day, fit2$fitted.values, lwd=2, col=3)
plot(t.day, y, xlab='Day', ylab='Precipitation')
title('ML')
fit3 <- gam(y~s(t.day), method='ML')
lines(t.day, fit3$fitted.values, lwd=2, col=4)
#
plot(fit1, residuals=T)
plot(fit2, residuals=T)
plot(fit3, residuals=T)

THANK YOU