The use of Natural (restricted) spline regression model has been very popular to model non-linear effects of continuous covariates. Statistical software such as R, SAS, STATA and SPSS, et al all can be used to perform the natural spline regression. However, the output results by these software sometimes are quite ‘confusing’ therefore, if we can conduct the regression model ‘by hand’ then we probably can understand the meaning of the results of any statistical software.

Here, is the definition of a spline from the https://www.merriam-webster.com/dictionary/spline

1: a thin wood or metal strip used in building construction.

2: a key that is fixed to one of two connected mechanical parts and fits into a keyway in the other. also : a keyway for such a key.

3: a function that is defined on an interval, is used to approximate a given function, and is composed of pieces of simple functions defined on subintervals and joined at their endpoints with a suitable degree of smoothness.

Here is a spline that we may use in our daily life.

Spline from the Bunnings

For the illustration we will use the data set “triceps” which is included in the R package MultiKink developed by Chuang Wan (https://cran.r-project.org/web/packages/MultiKink/index.html). The data are derived from an anthropometric study of 892 females under 50 years in three Gambian villages in West Africa. There were 892 observations on the following 3 variables.

age: Age of respondents.
lntriceps: Log of the triceps skinfold thickness.
triceps: Triceps skinfold thichness

Here is a webpage that gives a very good introduction to piecewise linear regressions and spline regressions using the data set ‘triceps’ by Professor Armando Teixeira-Pinto from the Sydney University. https://bookdown.org/tpinto_home/Beyond-Linearity/piecewise-regression-and-splines.html

I will use the same data set to conduct the analysis.

Truncated power funcation and spline

When use splines, we need to partition the range of the continuous covariate \(x\) into smaller intervals, and those partition points are called knots. For piecewise continuous model (the lines connected between different intervals) we usually use the truncated power functions to fit the data. The truncated power function is defined as:

\[ (x-k_i)^d_{+} = \begin{cases} 0, & \text{ if } x < k_i \\ (x-k_i)^d, & \text{ if } x \geq k_i\\ \end{cases} \tag{1}\]

Where \(d=1,2,3,..\)

Using above truncated power function, we can specify a cubic spline regression model with five knots by the following formula:

\[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 (x-k_1)^3_{+} + \beta_5 (x-k_2)^3_{+} + \beta_6 (x-k_3)^3_{+} + \beta_7 (x-k_4)^3_{+} + \beta_8 (x-k_5)^3_{+} +\varepsilon \tag{2} \] Note, there is no ‘natural’ or ‘restricted’ mentioned for above model. It is just a cubic spline regression with continuous connection at each knot.

“A natural cubic splines adds additional constraints, namely that function is linear beyond the boundary knots.”\(^1\) Splines with this additional constraint are also known as “restricted” splines. We will discuss what is exactly meaning for the ‘natural’ and ‘restricted’ mathematically later.

Conduct an “un-natural” cubic spline regression

Let us use the euqation \((2)\) and the data set tricepts to conduct an ‘un-nutural’ cubic spline regression, i.e. there are no any restrictions on the equation \((2)\)

Scatter plot between age and triceps.

library(MultiKink) #for the data
library(ggplot2)   #for the plots
library(splines)   #will use this later
library(rms)       #Regression Modeling Strategies

data("triceps")
data_plot <- ggplot(triceps, aes(x=age, y=triceps)) + geom_point(alpha=0.55, color="black") +theme_minimal() 
data_plot

Conduct an un-natural cubic spline regression with knots at age 5, 10,20,30 and 40. Note this is just an usual linear regression with a \(x\) value transformed by truncated power power function at different intervals.

lm_unnatural<-lm(triceps~ age + I(age^2)+I(age^3) +I((age-5)^3*(age>=5)) +
               I((age-10)^3*(age >= 10)) +
               I((age-20)^3*(age >= 20)) +
               I((age-30)^3*(age >= 30)) +
               I((age-40)^3*(age >= 40)),
             data = triceps)
pred_unnatural<-predict(lm_unnatural)
summary(lm_unnatural)

## 
## Call:
## lm(formula = triceps ~ age + I(age^2) + I(age^3) + I((age - 5)^3 * 
##     (age >= 5)) + I((age - 10)^3 * (age >= 10)) + I((age - 20)^3 * 
##     (age >= 20)) + I((age - 30)^3 * (age >= 30)) + I((age - 40)^3 * 
##     (age >= 40)), data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5234  -1.6912  -0.2917   1.1356  23.0922 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.506279   1.255432   5.183 2.71e-07 ***
## age                            1.886000   1.251967   1.506   0.1323    
## I(age^2)                      -0.555512   0.334230  -1.662   0.0969 .  
## I(age^3)                       0.041351   0.026063   1.587   0.1130    
## I((age - 5)^3 * (age >= 5))   -0.039615   0.030956  -1.280   0.2010    
## I((age - 10)^3 * (age >= 10)) -0.006842   0.006664  -1.027   0.3048    
## I((age - 20)^3 * (age >= 20))  0.008044   0.002263   3.554   0.0004 ***
## I((age - 30)^3 * (age >= 30)) -0.005703   0.002317  -2.462   0.0140 *  
## I((age - 40)^3 * (age >= 40))  0.010290   0.004178   2.463   0.0140 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.743 on 883 degrees of freedom
## Multiple R-squared:  0.4261, Adjusted R-squared:  0.4209 
## F-statistic: 81.94 on 8 and 883 DF,  p-value: < 2.2e-16

Plot the regression model

data_plot +
geom_line(data=triceps, 
            aes(y = pred_unnatural, x=age), size = 1, col="blue")

We can see before the first knot (age<5) and after the last knot(age>40) the regression line is quite wiggly, and people think this is “un-natural”. By the way, the term of ‘natural’ spline may come from beam models in construction.

Conduct a ‘half-natural’ cubic spline regression

For a ‘full’ natural cubic spline regression, we will force the function linear beyond the boundary knots, i.e we will force the regression line before the first knot(age=5) and after the last knot (age=40) linear. For the ‘half natural cubic spline’ I mean we can just force one tail linear, for a easier part, we just force the regression line before knot (age=5) linear. We can do this just by dropping off the \(x^2\) and \(x^3\) terms in the equation \((2)\), then the regression model becomes

\[y = \beta_0 + \beta_1 x + \beta_4 (x-k_1)^3_{+} + \beta_5 (x-k_2)^3_{+} + \beta_6 (x-k_3)^3_{+} + \beta_7 (x-k_4)^3_{+} + \beta_8 (x-k_5)^3_{+} +\varepsilon \tag{3}\] We use the following R code to conduct the ‘half natural’ cubic spline regression,just by dropping off the \(x^2\) and \(x^3\) terms in the model \((2)\).

lm_half_natural<-lm(triceps~ age +I((age-5)^3*(age>=5)) +
               I((age-10)^3*(age >= 10)) +
               I((age-20)^3*(age >= 20)) +
               I((age-30)^3*(age >= 30)) +
               I((age-40)^3*(age >= 40)),
             data = triceps)
pred_half_natural<-predict(lm_half_natural)
summary(lm_half_natural)

## 
## Call:
## lm(formula = triceps ~ age + I((age - 5)^3 * (age >= 5)) + I((age - 
##     10)^3 * (age >= 10)) + I((age - 20)^3 * (age >= 20)) + I((age - 
##     30)^3 * (age >= 30)) + I((age - 40)^3 * (age >= 40)), data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5238  -1.7161  -0.3777   1.1569  23.0823 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    8.7317709  0.3676174  23.752  < 2e-16 ***
## age                           -0.3249270  0.0587533  -5.530 4.21e-08 ***
## I((age - 5)^3 * (age >= 5))    0.0064974  0.0007624   8.523  < 2e-16 ***
## I((age - 10)^3 * (age >= 10)) -0.0118709  0.0015436  -7.690 3.91e-14 ***
## I((age - 20)^3 * (age >= 20))  0.0083771  0.0016767   4.996 7.04e-07 ***
## I((age - 30)^3 * (age >= 30)) -0.0058053  0.0021305  -2.725  0.00656 ** 
## I((age - 40)^3 * (age >= 40))  0.0103644  0.0041005   2.528  0.01166 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.746 on 885 degrees of freedom
## Multiple R-squared:  0.4238, Adjusted R-squared:  0.4199 
## F-statistic: 108.5 on 6 and 885 DF,  p-value: < 2.2e-16

Compare the ‘un-natural’ spline regression with the ‘half-natural’ spline regession

data_plot +
geom_line(data=triceps, 
            aes(y = pred_unnatural, x=age), size = 2, col="blue")+
  geom_line(data=triceps, 
            aes(y = pred_half_natural, x=age), size = 1, col="orange")

Here, we can see the ‘half-natural’ cubic spline regression (orange) and the ‘un-natural’ cubic spline regression (blue) are only quite different before the knot (age=5), since we have forced the function before the knot (age=5) linear for the ‘half-natural’ cubic spline regression.

Conduct a ‘full-natural’ cubic spline regression

Here, the ‘full-natural’ cubic spline regression just means the natural cubic spline regression, I give the name “full-natural’ just for comparing with”un-natural” and “half-natural” spline regressions above.

We use rcs function from the rms package to conduct the natural spline regression

model.full_natural <- ols(triceps~rcs(age,parms=c(5,10,20,30,40)),data=triceps)
model.full_natural

## Linear Regression Model
## 
## ols(formula = triceps ~ rcs(age, parms = c(5, 10, 20, 30, 40)), 
##     data = triceps)
## 
##                 Model Likelihood    Discrimination    
##                       Ratio Test           Indexes    
## Obs     892    LR chi2    485.23    R2       0.420    
## sigma3.7558    d.f.            4    R2 adj   0.417    
## d.f.    887    Pr(> chi2) 0.0000    g        3.464    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -10.6288  -1.7162  -0.3569   1.1035  22.8625 
## 
## 
##           Coef     S.E.   t     Pr(>|t|)
## Intercept   8.6571 0.3658 23.66 <0.0001 
## age        -0.3043 0.0576 -5.28 <0.0001 
## age'        7.3465 0.8627  8.52 <0.0001 
## age''     -13.0582 1.6827 -7.76 <0.0001 
## age'''      7.7086 1.5005  5.14 <0.0001

Next, we put the “un-nature”, “half-nature” and “full-nature” cubic spline regession together.

pred_full_natural<-predict(model.full_natural)

data_plot +
geom_line(data=triceps, 
            aes(y = pred_unnatural, x=age), size = 2, col="blue") +
geom_line(data=triceps, 
          aes(y = pred_full_natural, x=age), size = 1, col="red") +
geom_line(data=triceps, 
            aes(y = pred_half_natural, x=age), size = 1, col="orange")

We can see for the full-natural cubic spline regression(red), after the last knot (age=40), the line is linear however the un-natural and half-natural are not.

Next we can get the final model function form by the following r code

Function(model.full_natural)

## function (age = NA) 
## {
##     8.6570578 - 0.30431985 * age + 0.0059971737 * pmax(age - 
##         5, 0)^3 - 0.010659745 * pmax(age - 10, 0)^3 + 0.0062927259 * 
##         pmax(age - 20, 0)^3 - 0.001596325 * pmax(age - 30, 0)^3 - 
##         3.3829677e-05 * pmax(age - 40, 0)^3
## }
## <environment: 0x000001fc25133ed0>

Finally, our final natural cubic spline regression model can be written as

\[\mathrm{E}(\mathrm{triceps}) = X\beta,~~\mathrm{where}\] \[\begin{array} \lefteqn{X\hat{\beta}=}\\ & & 8.657058 \\ & & -0.3043198 \mathrm{age}+0.005997174 (\mathrm{age}-5)_{+}^{3}-0.01065974 (\mathrm{age}-10)_{+}^{3} \\ & & +0.006292726 (\mathrm{age}-20)_{+}^{3}-0.001596325 (\mathrm{age}-30)_{+}^{3}-3.382968\!\times\!10^{-5}(\mathrm{age}-40)_{+}^{3} \\ \end{array} \tag{4}\]

\[(x)_{+}=x~\mathrm{if}~x > 0,~0~\mathrm{otherwise}\] Note, the above equation is produced by the function

latex(model.full_natural)

Noted, our final result, the equation \((4)\), seems quite different from the below natural cubic spline regression output except the intercept and the coefficient for the age variable. What are the age’, age’’ and age’’’?

## Linear Regression Model
## 
## ols(formula = triceps ~ rcs(age, parms = c(5, 10, 20, 30, 40)), 
##     data = triceps)
## 
##                 Model Likelihood    Discrimination    
##                       Ratio Test           Indexes    
## Obs     892    LR chi2    485.23    R2       0.420    
## sigma3.7558    d.f.            4    R2 adj   0.417    
## d.f.    887    Pr(> chi2) 0.0000    g        3.464    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -10.6288  -1.7162  -0.3569   1.1035  22.8625 
## 
## 
##           Coef     S.E.   t     Pr(>|t|)
## Intercept   8.6571 0.3658 23.66 <0.0001 
## age        -0.3043 0.0576 -5.28 <0.0001 
## age'        7.3465 0.8627  8.52 <0.0001 
## age''     -13.0582 1.6827 -7.76 <0.0001 
## age'''      7.7086 1.5005  5.14 <0.0001

To understand the output of the natural cubic spline regression from the rms package, we need to do some mathematical derivations and conduct some “hand calculations”

Mathematical derivations of the natual cubic splines

First let us write down a general form of cubic splines regression model with \(K\) knots (we do not count two ends of the line as knots). This formula can be found in the book “The Elements of Statistical Learning, Data Mining, Inference, and Prediction’ by Professor Hastie, et al.

\[f(x) = \sum_{j=0}^3 \beta_j x^j + \sum_{k=1}^K \theta_k (x - \xi_k)_{+}^{3}.\tag{5}\]

This is what I called “un-natural” spline regression model, since there were no any restrictions on the formula \((5)\).In the statistical learning book, a capital letter \(X\) was used, usually, a capital letter stands for a random variable in statistics, here I prefer to use a little letter \(x\) since for linear regressions people usually think the ‘randomness’ is not from the predictors.

For the natural spline we need to force the line before the first knot and line after the last knot linear (i.e. straight), these restrictions imply the following relationship:

For the first restriction, when \(x\) is less than the first knot, we need:

\[\beta_2=0 \text{ and } \beta_3=0 \tag{6}\] For the second restriction, when \(x\) is greater than the last knot, we also need \[\sum_{k=1}^K\theta_k=0 \text{ and } \sum_{k=1}^K\xi_k\theta_k=0 \tag{7}\]

The proofs of the above relationship are not very difficult.

From the equation \((5)\) when \(x<\xi_1\) we need the \(f(x)\) be linear, note, the ‘linear’ only means \(f(x)\) can be expressed as some combinations of \(x\) at power 0 or 1, i.e. there were no \(x^2\), \(x^3\)…, noted sometimes linear and non-linear can be in term of regression coefficients, here the ‘linear’ is just in term of \(x\).

When \(x<\xi_1\) we have

\[f(x)=\sum_{j=0}^3\beta_j x^j=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3 \tag{8}\]

therefore, by the linearity, both \(\beta_2\) and \(\beta_3\) have to be equal to zero.

Method 2.

Another way to prove the condition is to use some calculus properties, since \(f(x)\) needs to be linear in term of \(x\), therefore, the first derivative of \(f(x)\) will be a constant and the second derivative of \(f(x)\) will be 0, i.e if \(f(x)=kx+b\) then \(f^{'}(x)=k\) and \(f^{''}(x)=0\)

We can set \(f^{''}(x)=0\) then we obtain \(2\beta_2+6\beta_3x=0\) for any \(x<\xi_1\), therefore, \(\beta_2\) and \(\beta_3\) are both equal to 0.

For the second restriction, when \(x>\xi_K\),i.e when \(x\) is greater than the last knot, we can drop the \(+\) sign for the truncated function since when \(x>k_i\), \((x-k_i)^d_{+}=(x-k_i)^d\) then \(f(x)\) can be written as:

\[f(x) = \sum_{j=0}^3 \beta_j x^j + \sum_{k=1}^K \theta_k (x - \xi_k)^{3}\tag{9}\] For the proof of the last restriction, we can tediously expand \(f(x)\) and drop \(x^2\) and \(x^3\) terms and show \(\sum_{k=1}^K\theta_k=0 \text{ and } \sum_{k=1}^K\xi_k\theta_k=0\). Here, I will use the derivative method since it is a little bit conciser.

\[f''(x) = 6 \sum_{k=1}^K \theta_k (x-\xi_k)=0\Rightarrow (\sum_{k=1}^K\theta_k)x-\sum_{k=1}^K\theta_k\xi_k=0, \; \forall x>\xi_K\] Since \(\forall x\) that \(x>\xi_K\), we have \(f''(x)=0\), therefore, both
\(\sum_{k=1}^K\theta_k\) and \(\sum_{k=1}^K\theta_k\xi_k\) have to be 0. i.e.

\[\sum_{k=1}^K\theta_k=0 \text{ and } \sum_{k=1}^K\theta_k\xi_k=0\] \(\square\)

Because we put restrictions\((6)\) and \((7)\) on the model \((5)\) we have to develop a new representation of the linear regression model. Next, I will show to obtain these spline “basis functions” for the natural cubic spline regression.

The concept of the basis function can be found from professor Hastie’s book https://hastie.su.domains/ISLR2/ISLRv2_website.pdf on page 294.

Let us see what will happen when we put restrictions \((6)\) and \((7)\) on \((5)\)

Because of \((6)\), we can easily see the first part of \((5)\) becomes \(\beta_0+\beta_1x\)

Let see what will happen when we put the restriction \((7)\) on \((5)\)

\[\sum_{k=1}^K \theta_k (x - \xi_k)_{+}^{3}=\sum_{k=1}^{K-1}\theta_k (x - \xi_k)_{+}^{3}+\theta_K(x-\xi_K)_{+}^3 \tag{10}\] From \(\sum_{k=1}^K\theta_k=0\) in \((7)\) we can infer that \[\sum_{k=1}^{K-1}\theta_k +\theta_K=0\Rightarrow \theta_K=-\sum_{k=1}^{K-1}\theta_k \]

We can replace \(\theta_K\) in \((10)\) with \(-\sum_{k=1}^{K-1}\theta_k\) and we get:

\[\begin{align*}\sum_{k=1}^K \theta_k (x - \xi_k)_{+}^{3}&=\sum_{k=1}^{K-1}\theta_k (x - \xi_k)_{+}^{3}-\sum_{k=1}^{K-1}\theta_k(x-\xi_K)_{+}^3\\ &=\sum_{k=1}^{K-1}\theta_k \left [(x - \xi_k)_{+}^{3}-(x-\xi_K)_{+}^3\right ]\tag{11} \end{align*}\]

Notice, for the restriction \((7)\) we have not use the condition \(\sum_{k=1}^K\xi_k\theta_k=0\) yet, now we will use this condition.

\[\sum_{k=1}^K\theta_k=0 \text{ and } \sum_{k=1}^K\theta_k\xi_k=0\Rightarrow (\sum_{k=1}^K\theta_k)\xi_K-\sum_{k=1}^K\theta_k\xi_k=0 \tag{12}\],note what we did here is just \(0*\xi_K-0=0\)

From the \((12)\)

\[\begin{align*}(\sum_{k=1}^K\theta_k)\xi_K-\sum_{k=1}^K\theta_k\xi_k=0&\Rightarrow\sum_{k=1}^K\theta_k(\xi_K-\xi_k)=0\\ &\Rightarrow \sum_{k=1}^{K-1}\theta_k(\xi_K-\xi_k)+\theta_K(\xi_K-\xi_K)=0\\&\Rightarrow \sum_{k=1}^{K-1}\theta_k(\xi_K-\xi_k)=0\\ &\Rightarrow \sum_{k=1}^{K-2}\theta_k(\xi_K-\xi_k)+\theta_{K-1}(\xi_K-\xi_{K-1})=0\\ &\Rightarrow \theta_{K-1}=-\sum_{k=1}^{K-2}\theta_k\frac{\xi_K-\xi_k}{\xi_K-\xi_{K-1}} \tag{13} \end{align*}\]

Combine the \((13)\) and \((11)\)

\[\begin{align*}\sum_{k=1}^K \theta_k (x - \xi_k)_{+}^{3}&=\sum_{k=1}^{K-1}\theta_k \left [(x - \xi_k)_{+}^{3}-(x-\xi_K)_{+}^3\right]\\ &=\sum_{k=1}^{K-2}\theta_k \left [(x - \xi_k)_{+}^{3}-(x-\xi_K)_{+}^3\right]+\theta_{K-1}\left[ (x-\xi_{K-1})_{+}^3-(x-\xi_K)_{+}^3\right] (\text{ replace } \theta_{K-1} \text{ from the equation 13})\\ &=\sum_{k=1}^{K-2}\theta_k \left [(x - \xi_k)_{+}^{3}-(x-\xi_K)_{+}^3\right]-\sum_{k=1}^{K-2}\theta_k\frac{\xi_K-\xi_k}{\xi_K-\xi_{K-1}}\left[ (x-\xi_{K-1})_{+}^3-(x-\xi_K)_{+}^3\right] \tag{14} \end{align*}\]

Define a function \[ d_k(x) = \frac{(x-\xi_k)^3_+ - (x-\xi_K)_+^3}{\xi_K - \xi_k} \tag{15} \] From the equation \((15)\) we can get

\[d_{K-1}(x) = \frac{(x-\xi_{K-1})^3_+ - (x-\xi_K)_+^3}{\xi_K - \xi_{K-1}} \tag{16}\]

Replace truncated functions in \((14)\) by \(d_k(x)\) and \(d_{K-1}(x)\) and we get

\[\begin{align*}\sum_{k=1}^K \theta_k (x - \xi_k)_{+}^{3}&=\sum_{k=1}^{K-1}\theta_k \left [(x - \xi_k)_{+}^{3}-(x-\xi_K)_{+}^3\right]\\ &=\sum_{k=1}^{K-2}\theta_k \left [(x - \xi_k)_{+}^{3}-(x-\xi_K)_{+}^3\right]-\sum_{k=1}^{K-2}\theta_k\frac{\xi_K-\xi_k}{\xi_K-\xi_{K-1}}\left[ (x-\xi_{K-1})_{+}^3-(x-\xi_K)_{+}^3\right]\\ &=\sum_{k=1}^{K-2} \theta_k (\xi_K - \xi_k) d_k(x) - \sum_{k=1}^{K-2} \theta_k \frac{\xi_K - \xi_k}{\xi_K - \xi_{K-1}} (\xi_K - \xi_{K-1}) d_{K-1}(x)\\ &=\sum_{k=1}^{K-2} \theta_k (\xi_K - \xi_k) \left [d_k(x) - d_{K-1}(x)\right] \tag{17} \end{align*}\]

Put \((17)\) and \((6)\) back to \((5)\) we get the final natural cubic spline regression model.

\[f(x)=\beta_0+\beta_1x+\sum_{k=1}^{K-2} \theta_k (\xi_K - \xi_k) \left [d_k(x) - d_{K-1}(x)\right] \tag{18} \] Notice, in \((18)\) \((\xi_K - \xi_k) \left [d_k(x) - d_{K-1}(x)\right]\) is a function of \(x\) we may write it as

\[s_k(x)=(\xi_K - \xi_k) \left [d_k(x) - d_{K-1}(x)\right] \tag{19}\]

It is the basis function of the natural spline regression, now \(f(x)\) can be treated as a regular linear regression model, we can calculate the coefficients by usual methods such as ordinary least square methods or maximum likelihood methods.

Conduct the natural spline regression ‘by hand’

Since we know that a natural cubic spline regression model can be represented by \((18)\), now we can perform the regression ‘by hand’. When I say ‘by hand’, I mean we can conduct the regression without using special functions from R packages such as “splines” and ‘rms’ or functions from other statistical software and we even can conduct the regression model by using Excel or paper and pencils.

Next, I will show how to conduct the regression model step by step without using special R or other software’s build in functions, and I will compare the results from our calculations with results from R packages such as “splines” and “rms”.

The followings are the first 100 data records of the triceps data set . I will do calculations on data number 93 and 95 manually, then I will write a function to calculate basis function values for each data record.

head(triceps,100)

##       age lntriceps triceps
## 1   12.05  1.223776     3.4
## 2    9.91  1.386294     4.0
## 3   10.04  1.435084     4.2
## 4   11.49  1.435084     4.2
## 5   10.12  1.481605     4.4
## 6   11.87  1.481605     4.4
## 7   10.91  1.568616     4.8
## 8   10.03  1.568616     4.8
## 9   11.02  1.568616     4.8
## 10  10.00  1.609438     5.0
## 11  11.36  1.609438     5.0
## 12  11.38  1.609438     5.0
## 13  11.01  1.648659     5.2
## 14   9.99  1.648659     5.2
## 15   9.96  1.648659     5.2
## 16  11.93  1.648659     5.2
## 17  11.26  1.648659     5.2
## 18  10.08  1.648659     5.2
## 19   9.99  1.686399     5.4
## 20  12.09  1.686399     5.4
## 21  10.40  1.686399     5.4
## 22  10.00  1.686399     5.4
## 23  11.01  1.686399     5.4
## 24  11.85  1.686399     5.4
## 25  11.04  1.686399     5.4
## 26  10.40  1.686399     5.4
## 27  11.40  1.704748     5.5
## 28  11.67  1.722767     5.6
## 29  11.61  1.722767     5.6
## 30  10.04  1.722767     5.6
## 31   9.93  1.722767     5.6
## 32  10.65  1.722767     5.6
## 33  12.14  1.757858     5.8
## 34  10.51  1.757858     5.8
## 35  10.57  1.757858     5.8
## 36  10.50  1.757858     5.8
## 37  12.20  1.757858     5.8
## 38  11.98  1.791759     6.0
## 39  10.64  1.791759     6.0
## 40  11.60  1.791759     6.0
## 41  10.31  1.791759     6.0
## 42  10.60  1.791759     6.0
## 43  11.31  1.791759     6.0
## 44  10.37  1.791759     6.0
## 45  11.61  1.824549     6.2
## 46  11.38  1.824549     6.2
## 47  10.44  1.824549     6.2
## 48  11.20  1.824549     6.2
## 49  11.99  1.856298     6.4
## 50  11.44  1.856298     6.4
## 51  10.13  1.856298     6.4
## 52  10.48  1.887070     6.6
## 53  11.44  1.887070     6.6
## 54  11.40  1.887070     6.6
## 55  11.46  1.887070     6.6
## 56  11.89  1.916923     6.8
## 57  11.34  1.916923     6.8
## 58  12.20  1.916923     6.8
## 59  10.65  1.945910     7.0
## 60  10.50  1.945910     7.0
## 61  10.52  1.945910     7.0
## 62  11.37  1.945910     7.0
## 63  11.16  1.945910     7.0
## 64  11.98  1.945910     7.0
## 65  11.13  1.945910     7.0
## 66  10.69  1.945910     7.0
## 67  10.49  1.974081     7.2
## 68  11.74  1.974081     7.2
## 69  12.13  2.001480     7.4
## 70  10.27  2.001480     7.4
## 71   9.82  2.079442     8.0
## 72  12.00  2.079442     8.0
## 73  11.37  2.128232     8.4
## 74  11.52  2.151762     8.6
## 75  12.01  2.174752     8.8
## 76   9.91  2.174752     8.8
## 77  11.85  2.197225     9.0
## 78  12.19  2.197225     9.0
## 79  12.22  2.282382     9.8
## 80  11.57  2.282382     9.8
## 81  11.24  2.302585    10.0
## 82  11.33  2.322388    10.2
## 83  10.89  2.322388    10.2
## 84  10.55  2.564949    13.0
## 85  10.45  2.653242    14.2
## 86  11.28  2.772589    16.0
## 87  36.51  2.240710     9.4
## 88   2.98  2.104134     8.2
## 89  31.72  3.054001    21.2
## 90   4.26  2.261763     9.6
## 91  24.37  2.322388    10.2
## 92  19.50  2.580217    13.2
## 93  40.72  2.610070    13.6
## 94  17.74  2.501436    12.2
## 95  17.92  1.916923     6.8
## 96  18.07  1.856298     6.4
## 97  14.62  2.639057    14.0
## 98   7.78  1.791759     6.0
## 99  18.12  2.415914    11.2
## 100  5.49  2.001480     7.4

From the equation \((18)\) and \((19)\) we can see except \(x\) we have \(K-2\) basis functions in the model,i.e. from \(k=1\) to \(K-2\) basis functions. For triceps data set, we set five knots (K=5) at age 5,10,20,30,40, therefore we have 3 (5-2) spline basis. We write them as:

\[s_1=(\xi_5-\xi_1)\left[d_1(x)-d_4(x)\right] \tag{20}\] \[s_2=(\xi_5-\xi_2)\left[d_2(x)-d_4(x)\right]\tag{21}\]

\[s_3=(\xi_5-\xi_3)\left[d_3(x)-d_4(x)\right]\tag{22}\] From the equation \((15)\)

\[d_1(x)= \frac{(x-\xi_1)^3_+ - (x-\xi_5)_+^3}{\xi_5 - \xi_1}\]

\[d_2(x)= \frac{(x-\xi_2)^3_+ - (x-\xi_5)_+^3}{\xi_5 - \xi_1}\]

\[d_3(x)= \frac{(x-\xi_3)^3_+ - (x-\xi_5)_+^3}{\xi_5 - \xi_1}\] \[d_4(x)= \frac{(x-\xi_4)^3_+ - (x-\xi_5)_+^3}{\xi_5 - \xi_1}\]

Now, let us look at the data number 93 whose age=40.72, from the above equations we get:

\[d_1=\frac{(40.72-5)_+^3 - (40.72-40)_+^3}{40 - 5}=1302.155\]

\[d_2=\frac{(40.72-10)_+^3 - (40.72-40)_+^3}{40 - 5}=966.3552\]

\[d_3=\frac{(40.72-20)_+^3 - (40.72-40)_+^3}{40 - 5}=444.755\]

\[d_4=\frac{(40.72-30)_+^3 - (40.72-40)_+^3}{40 - 5}=123.1552\] Put \(d_1,d_2,d_3,d_4\) back to \((20),(21), (22)\) and we get: \[s_1=(40-5)*(1302.155-123.1552)=41265\] \[s_2=(40-10)*(966.3552-123.1552)=25296\] \[s_3=(40-20)*(444.755-123.1552)=6432\] Notice the values of these basis functions are quite large, usually we can scale these values smaller, we scale these values by dividing them by \((\xi_5-\xi_1)^2\), as has been done by rms package in R. When we scale continuous predictor it will not affect model fit and prediction, however it does affect the coefficient of the predictor variable.

\[s_1=\frac{41265}{(40-5)^2}=33.68571 \tag{23}\] \[s_2=\frac{25296}{(40-5)^2}=20.6498 \tag{24}\] \[s_3=\frac{6432}{(40-5)^2}=5.250612 \tag{25}\]

We continue the calculations for the data number 95, i.e for age=17.92.

Note, for age=17.92, there are only two extra basis functions, \(s_1\) and \(s_2\) need to be calculated, since \(s_3\) will be 0.

\[d_1=\frac{(17.92-5)_+^3 - (17.92-40)_+^3}{40 - 5}=61.61969\] Note, \((17.92-40)_+^3=0\)

\[d_2=\frac{(17.92-10)_+^3 - (17.92-40)_+^3}{40 - 5}=16.55977\] \(d_3\) and \(d_4\) are all zeros for age=17.92.

\[s_1=(40-5)*(61.61969-0)=2156.689\] \[s_2=(40-10)*(16.55977-0)=496.7931\] Scale \(s_1\) and \(s_2\) by \((\xi_5-\xi_1)^2\)

\[s_1=\frac{2156.689}{(40-5)^2}=1.760563\] \[s_2=\frac{496.7931}{(40-5)^2}= 0.4055454\]

I will stop manual calculations here, next I will write a R function to calculate the values of basis functions for each data record in the triceps data set.

sk<-function(x,xi_k){
  xi_1<-5 #first knot
  xi_K_1<-30 #second to the last knot
  xi_K<-40   #last knot
  dk<-((x-xi_k)^3*(x>=xi_k)-(x-xi_K)^3*(x>=xi_K))/(xi_K-xi_k)
  dK_1<-((x-xi_K_1)^3*(x>=xi_K_1)-(x-xi_K)^3*(x>=xi_K))/(xi_K-xi_K_1)
  sk<-(xi_K-xi_k)*(dk-dK_1)/(xi_K-xi_1)^2 # scaled sk
  return(sk)
  }

We use the above function to calculate the scaled values of basis functions of \(s_1, s_2\) and \(s_3\).

s1<-sk(triceps$age,5)
s2<-sk(triceps$age,10)
s3<-sk(triceps$age,20)
triceps_new<-cbind(triceps,s1,s2,s3)
head(triceps_new,100)

##       age lntriceps triceps           s1           s2         s3
## 1   12.05  1.223776     3.4 2.860430e-01 7.032757e-03 0.00000000
## 2    9.91  1.386294     4.0 9.662919e-02 0.000000e+00 0.00000000
## 3   10.04  1.435084     4.2 1.045094e-01 5.224475e-08 0.00000000
## 4   11.49  1.435084     4.2 2.231505e-01 2.700365e-03 0.00000000
## 5   10.12  1.481605     4.4 1.095655e-01 1.410608e-06 0.00000000
## 6   11.87  1.481605     4.4 2.646879e-01 5.338124e-03 0.00000000
## 7   10.91  1.568616     4.8 1.685102e-01 6.151597e-04 0.00000000
## 8   10.03  1.568616     4.8 1.038886e-01 2.204023e-08 0.00000000
## 9   11.02  1.568616     4.8 1.780957e-01 8.662934e-04 0.00000000
## 10  10.00  1.609438     5.0 1.020408e-01 0.000000e+00 0.00000000
## 11  11.36  1.609438     5.0 2.100077e-01 2.053432e-03 0.00000000
## 12  11.38  1.609438     5.0 2.119952e-01 2.145365e-03 0.00000000
## 13  11.01  1.648659     5.2 1.772097e-01 8.410626e-04 0.00000000
## 14   9.99  1.648659     5.2 1.014298e-01 0.000000e+00 0.00000000
## 15   9.96  1.648659     5.2 9.961138e-02 0.000000e+00 0.00000000
## 16  11.93  1.648659     5.2 2.716838e-01 5.868621e-03 0.00000000
## 17  11.26  1.648659     5.2 2.002567e-01 1.632961e-03 0.00000000
## 18  10.08  1.648659     5.2 1.070176e-01 4.179580e-07 0.00000000
## 19   9.99  1.686399     5.4 1.014298e-01 0.000000e+00 0.00000000
## 20  12.09  1.686399     5.4 2.909395e-01 7.452515e-03 0.00000000
## 21  10.40  1.686399     5.4 1.285420e-01 5.224475e-05 0.00000000
## 22  10.00  1.686399     5.4 1.020408e-01 0.000000e+00 0.00000000
## 23  11.01  1.686399     5.4 1.772097e-01 8.410626e-04 0.00000000
## 24  11.85  1.686399     5.4 2.623830e-01 5.168677e-03 0.00000000
## 25  11.04  1.686399     5.4 1.798766e-01 9.182562e-04 0.00000000
## 26  10.40  1.686399     5.4 1.285420e-01 5.224475e-05 0.00000000
## 27  11.40  1.704748     5.5 2.139951e-01 2.239998e-03 0.00000000
## 28  11.67  1.722767     5.6 2.422375e-01 3.802011e-03 0.00000000
## 29  11.61  1.722767     5.6 2.357590e-01 3.406758e-03 0.00000000
## 30  10.04  1.722767     5.6 1.045094e-01 5.224475e-08 0.00000000
## 31   9.93  1.722767     5.6 9.781484e-02 0.000000e+00 0.00000000
## 32  10.65  1.722767     5.6 1.472344e-01 2.241833e-04 0.00000000
## 33  12.14  1.757858     5.8 2.971383e-01 8.000285e-03 0.00000000
## 34  10.51  1.757858     5.8 1.365585e-01 1.082867e-04 0.00000000
## 35  10.57  1.757858     5.8 1.410683e-01 1.511777e-04 0.00000000
## 36  10.50  1.757858     5.8 1.358163e-01 1.020408e-04 0.00000000
## 37  12.20  1.757858     5.8 3.046922e-01 8.692243e-03 0.00000000
## 38  11.98  1.791759     6.0 2.776068e-01 6.336642e-03 0.00000000
## 39  10.64  1.791759     6.0 1.464540e-01 2.139954e-04 0.00000000
## 40  11.60  1.791759     6.0 2.346907e-01 3.343676e-03 0.00000000
## 41  10.31  1.791759     6.0 1.222215e-01 2.431928e-05 0.00000000
## 42  10.60  1.791759     6.0 1.433600e-01 1.763269e-04 0.00000000
## 43  11.31  1.791759     6.0 2.050936e-01 1.835178e-03 0.00000000
## 44  10.37  1.791759     6.0 1.264115e-01 4.134935e-05 0.00000000
## 45  11.61  1.824549     6.2 2.357590e-01 3.406758e-03 0.00000000
## 46  11.38  1.824549     6.2 2.119952e-01 2.145365e-03 0.00000000
## 47  10.44  1.824549     6.2 1.314197e-01 6.953776e-05 0.00000000
## 48  11.20  1.824549     6.2 1.945535e-01 1.410612e-03 0.00000000
## 49  11.99  1.856298     6.4 2.788017e-01 6.433140e-03 0.00000000
## 50  11.44  1.856298     6.4 2.180326e-01 2.437536e-03 0.00000000
## 51  10.13  1.856298     6.4 1.102087e-01 1.793474e-06 0.00000000
## 52  10.48  1.887070     6.6 1.343400e-01 9.027893e-05 0.00000000
## 53  11.44  1.887070     6.6 2.180326e-01 2.437536e-03 0.00000000
## 54  11.40  1.887070     6.6 2.139951e-01 2.239998e-03 0.00000000
## 55  11.46  1.887070     6.6 2.200703e-01 2.540519e-03 0.00000000
## 56  11.89  1.916923     6.8 2.670064e-01 5.511243e-03 0.00000000
## 57  11.34  1.916923     6.8 2.080328e-01 1.964167e-03 0.00000000
## 58  12.20  1.916923     6.8 3.046922e-01 8.692243e-03 0.00000000
## 59  10.65  1.945910     7.0 1.472344e-01 2.241833e-04 0.00000000
## 60  10.50  1.945910     7.0 1.358163e-01 1.020408e-04 0.00000000
## 61  10.52  1.945910     7.0 1.373034e-01 1.147823e-04 0.00000000
## 62  11.37  1.945910     7.0 2.109999e-01 2.099063e-03 0.00000000
## 63  11.16  1.945910     7.0 1.908121e-01 1.274200e-03 0.00000000
## 64  11.98  1.945910     7.0 2.776068e-01 6.336642e-03 0.00000000
## 65  11.13  1.945910     7.0 1.880379e-01 1.177875e-03 0.00000000
## 66  10.69  1.945910     7.0 1.503836e-01 2.681701e-04 0.00000000
## 67  10.49  1.974081     7.2 1.350768e-01 9.603987e-05 0.00000000
## 68  11.74  1.974081     7.2 2.499445e-01 4.300426e-03 0.00000000
## 69  12.13  2.001480     7.4 2.958915e-01 7.888652e-03 0.00000000
## 70  10.27  2.001480     7.4 1.194802e-01 1.606784e-05 0.00000000
## 71   9.82  2.079442     8.0 9.141236e-02 0.000000e+00 0.00000000
## 72  12.00  2.079442     8.0 2.800000e-01 6.530612e-03 0.00000000
## 73  11.37  2.128232     8.4 2.109999e-01 2.099063e-03 0.00000000
## 74  11.52  2.151762     8.6 2.262595e-01 2.866785e-03 0.00000000
## 75  12.01  2.174752     8.8 2.812017e-01 6.629064e-03 0.00000000
## 76   9.91  2.174752     8.8 9.662919e-02 0.000000e+00 0.00000000
## 77  11.85  2.197225     9.0 2.623830e-01 5.168677e-03 0.00000000
## 78  12.19  2.197225     9.0 3.034244e-01 8.574247e-03 0.00000000
## 79  12.22  2.282382     9.8 3.072384e-01 8.931471e-03 0.00000000
## 80  11.57  2.282382     9.8 2.315048e-01 3.159094e-03 0.00000000
## 81  11.24  2.302585    10.0 1.983433e-01 1.556427e-03 0.00000000
## 82  11.33  2.322388    10.2 2.070499e-01 1.920520e-03 0.00000000
## 83  10.89  2.322388    10.2 1.668053e-01 5.754856e-04 0.00000000
## 84  10.55  2.564949    13.0 1.395542e-01 1.358165e-04 0.00000000
## 85  10.45  2.653242    14.2 1.321458e-01 7.438766e-05 0.00000000
## 86  11.28  2.772589    16.0 2.021821e-01 1.711960e-03 0.00000000
## 87  36.51  2.240710     9.4 2.475103e+01 1.453307e+01 3.22327151
## 88   2.98  2.104134     8.2 0.000000e+00 0.000000e+00 0.00000000
## 89  31.72  3.054001    21.2 1.555850e+01 8.352103e+00 1.30584758
## 90   4.26  2.261763     9.6 0.000000e+00 0.000000e+00 0.00000000
## 91  24.37  2.322388    10.2 5.932706e+00 2.422335e+00 0.06812531
## 92  19.50  2.580217    13.2 2.488673e+00 6.998980e-01 0.00000000
## 93  40.72  2.610070    13.6 3.368572e+01 2.064980e+01 5.25061284
## 94  17.74  2.501436    12.2 1.687999e+00 3.785182e-01 0.00000000
## 95  17.92  1.916923     6.8 1.760563e+00 4.055454e-01 0.00000000
## 96  18.07  1.856298     6.4 1.822597e+00 4.290268e-01 0.00000000
## 97  14.62  2.639057    14.0 7.267568e-01 8.049887e-02 0.00000000
## 98   7.78  1.791759     6.0 1.753874e-02 0.000000e+00 0.00000000
## 99  18.12  2.415914    11.2 1.843595e+00 4.370510e-01 0.00000000
## 100  5.49  2.001480     7.4 9.603987e-05 0.000000e+00 0.00000000

Next let us run a linear regression on triceps variable, the predictor variables are age, s1,s2 and s3, let us see what we can get from the linear regression model.

spline_by_hand<-lm(triceps~age+s1+s2+s3,data=triceps)
summary(spline_by_hand)

## 
## Call:
## lm(formula = triceps ~ age + s1 + s2 + s3, data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6288  -1.7162  -0.3569   1.1035  22.8625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.6571     0.3658  23.664  < 2e-16 ***
## age          -0.3043     0.0576  -5.283 1.60e-07 ***
## s1            7.3465     0.8627   8.516  < 2e-16 ***
## s2          -13.0582     1.6827  -7.760 2.33e-14 ***
## s3            7.7086     1.5005   5.137 3.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.756 on 887 degrees of freedom
## Multiple R-squared:  0.4196, Adjusted R-squared:  0.4169 
## F-statistic: 160.3 on 4 and 887 DF,  p-value: < 2.2e-16

Note, the results are exactly the same as using the rms package, I copy the rms results here for comparison.

## Linear Regression Model
## 
## ols(formula = triceps ~ rcs(age, parms = c(5, 10, 20, 30, 40)), 
##     data = triceps)
## 
##                 Model Likelihood    Discrimination    
##                       Ratio Test           Indexes    
## Obs     892    LR chi2    485.23    R2       0.420    
## sigma3.7558    d.f.            4    R2 adj   0.417    
## d.f.    887    Pr(> chi2) 0.0000    g        3.464    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -10.6288  -1.7162  -0.3569   1.1035  22.8625 
## 
## 
##           Coef     S.E.   t     Pr(>|t|)
## Intercept   8.6571 0.3658 23.66 <0.0001 
## age        -0.3043 0.0576 -5.28 <0.0001 
## age'        7.3465 0.8627  8.52 <0.0001 
## age''     -13.0582 1.6827 -7.76 <0.0001 
## age'''      7.7086 1.5005  5.14 <0.0001

Now we can understand that age’ stands for \(s_1\) basis function, age’’ stands for \(s_2\) function, and age’’’ stands for \(s_3\) function.

Next, I will show how to get the equation \((4)\),i.e

\[\begin{array} \lefteqn{X\hat{\beta}=}\\ & & 8.657058 \\ & & -0.3043198 \mathrm{age}+0.005997174 (\mathrm{age}-5)_{+}^{3}-0.01065974 (\mathrm{age}-10)_{+}^{3} \\ & & +0.006292726 (\mathrm{age}-20)_{+}^{3}-0.001596325 (\mathrm{age}-30)_{+}^{3}-3.382968\!\times\!10^{-5}(\mathrm{age}-40)_{+}^{3} \\ \end{array} \tag{4}\]

To understand the equation \((4)\) we need to look at the equation \((17)\), when we restrict two tails of the regression to be linear, we have

\[\sum_{k=1}^K \theta_k (x - \xi_k)_{+}^{3}=\sum_{k=1}^{K-2} \theta_k (\xi_K - \xi_k) \left [d_k(x) - d_{K-1}(x)\right]\tag{17}\]

From the equation \((17)\) we can see from \(k\) to \(K-2\) the coefficients of \((x -\xi_k)_{+}^{3}\) and \((\xi_K - \xi_k) \left [d_k(x) -d_{K-1}(x)\right]\) are the same, however, remember when we do the calculations for the equations \((23)\),\((24)\) and \((25)\) we scale the basis functions (divided by \(\xi_K-\xi_1)^2\)(i.e. divided by \(35^2\)). We know for a linear regression model if we divide a continuous predictor by a constant \(c\) then the coefficient of the predictor will be increased by \(c\) times. Since we divided the basis function values by \(35^2\) in our calculations for the regression model, we deliberately increased the coefficient by \(35^2\) times, therefore, for the original age scale, we need to divide the coefficients from the regression model by \(35^2\). Therefore,

\[\theta_1=\frac{7.3465378}{35^2}=0.005997174\] \[\theta_2=\frac{-13.05819}{35^2}=-0.01065974\] \[\theta_3=\frac{7.708589}{35^2}=0.006292726 \] Now we obtained three coefficients by hand for the natural cubic spline regression, next we need to calculate \(\theta_4\) and \(\theta_5\), these two coefficients can be calculated by the equation \((13)\) and the equation \((7)\).

From the equation \((13)\) \[\theta_4=-(\theta_1*\frac{40-5}{40-30}+\theta_2*\frac{40-10}{40-30}+\theta_2*\frac{40-20}{40-30})=-0.001596325 \] From the equation \((7)\)

\[\theta_5=-(\theta_1+\theta_2+\theta_3+\theta_4)=-3.382968\times10^{-5}\] Now, we finished our hand calculations for the natural cubic spline regression model. Hopefully, we can understand the meaning of the statistical analysis results.

Comparing natural cubic regression model results by the splines package and the rms package

Let us use R functions from the splines package to analyse the triceps data set.

splines.ns <- lm(triceps ~ ns(age, knots = c(5,10,20,30,40)), data=triceps)
summary(splines.ns)

## 
## Call:
## lm(formula = triceps ~ ns(age, knots = c(5, 10, 20, 30, 40)), 
##     data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4875  -1.6873  -0.3665   1.1146  22.8643 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                              8.3811     0.5219  16.059  < 2e-16 ***
## ns(age, knots = c(5, 10, 20, 30, 40))1  -3.5592     0.6712  -5.303 1.44e-07 ***
## ns(age, knots = c(5, 10, 20, 30, 40))2   5.7803     1.0379   5.569 3.39e-08 ***
## ns(age, knots = c(5, 10, 20, 30, 40))3   5.5118     0.9416   5.853 6.78e-09 ***
## ns(age, knots = c(5, 10, 20, 30, 40))4   6.9070     0.9050   7.632 5.99e-14 ***
## ns(age, knots = c(5, 10, 20, 30, 40))5   5.4136     1.3783   3.928 9.24e-05 ***
## ns(age, knots = c(5, 10, 20, 30, 40))6   6.6460     1.0829   6.137 1.27e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.759 on 885 degrees of freedom
## Multiple R-squared:  0.4199, Adjusted R-squared:  0.416 
## F-statistic: 106.8 on 6 and 885 DF,  p-value: < 2.2e-16

We got very different output(coefficients) comparing with the rms results and our hand calculations ! Are they really different? Let us do a comparison on the regression line.

pred_splines_ns<-predict(splines.ns)

data_plot +

geom_line(data=triceps, 
          aes(y = pred_full_natural, x=age), size = 2, col="red") +
geom_line(data=triceps, 
            aes(y = pred_splines_ns, x=age), size = 1, col="green")

The two lines are congruent! In fact, the two models are exactly the same, however, they used different basis functions, when I have time I will show what are those orthogonal basis functions used in the splines package. But I will stop here.

References

1.Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The elements of statistical learning: data mining, inference, and prediction. Vol. 2. New York: springer, 2009.

2.James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.

3.Harrell, Frank E. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Vol. 608. New York: springer, 2001.

4.Bates, M. D., B. Venables, and Maintainer R. Core Team. “Package ‘splines’.” R Version 2, no. 0 (2011).

5.Machine Learning for Biostatistics, Armando Teixeira-Pinto, https://bookdown.org/tpinto_home/Beyond-Linearity/piecewise-regression-and-splines.html

6.MultiKink: Estimation and Inference for Multi-Kink Quantile Regression, https://cran.r-project.org/web/packages/MultiKink/index.html

Conduct natural spline regression ‘by hand’.

Enwu Liu

2023-02-16