Using the splines package

V.Rajaraman

We are familiar with linear modeling. But often the relationship between the predictor and the outcome will be nonlinear. Fortunately, the lm() function in R is powerful enough to handle even this situation. Let us see an example.

n = 1000
set.seed(n)
x = rnorm(n)
y = 8 + 0.6 * x + 0.4 * x^2 + 0.2 * x^3 + rnorm(n, 0, 0.5)

We generated a vector x of 1000 normal random numbers. The dependent variable y has x, x2 and x3 components, plus some noise.

If we plot x and y, we can clearly see that the curve is anything but a straight line:

plot(x, y)

plot of chunk unnamed-chunk-2

Just for curiosity, let us fit an ordinary straight line between x and y:

lm1 = lm(y ~ x)
lm1
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##        8.36         1.13
plot(x, y)
abline(lm1, col = "red")

plot of chunk unnamed-chunk-3

Now we go for a nonlinear fit. the function bs() in the splines package fits a cubic B-spline curve by default. But you can change the degrees through its parameters. See ?bs for further details .

library(splines)
b = bs(x)
str(b)
##  bs [1:1000, 1:3] 0.387 0.443 0.322 0.226 0.421 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:3] "1" "2" "3"
##  - attr(*, "degree")= int 3
##  - attr(*, "knots")= num(0) 
##  - attr(*, "Boundary.knots")= num [1:2] -3.36 2.67
##  - attr(*, "intercept")= logi FALSE
##  - attr(*, "class")= chr [1:3] "bs" "basis" "matrix"

We now fit a linear model between y and the transformed variable b.

lm2 = lm(y ~ b)
lm2
## 
## Call:
## lm(formula = y ~ b)
## 
## Coefficients:
## (Intercept)           b1           b2           b3  
##       2.429       10.303       -0.565       14.049

The last action is to apply the linear model to the original variable to get the predictions:

pr = predict(lm2, newdata = data.frame(x))

Now you can see how the fit has dramatically improved:

plot(x, y)
points(x, pr, col = "red", pch = 19, cex = 0.5)

plot of chunk unnamed-chunk-7