Consider these data
set.seed(2016-6-18)
x<-rexp(1000)
y<-exp(-x/2)+rnorm(1000)/10
plot(y~x)
abline(lm(y~x),col="red",lwd=2)
Clearly, y is lower when x is higher, and the trend is flattening out. Back in the Stone Age, we used quadaratics to test for non-linearity in situations like this
m<-lm(y~x+I(x^2))
summary(m)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39610 -0.06346 -0.00200 0.06426 0.27974
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.972922 0.005583 174.27 <2e-16 ***
## x -0.401163 0.008017 -50.04 <2e-16 ***
## I(x^2) 0.044918 0.001986 22.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09958 on 997 degrees of freedom
## Multiple R-squared: 0.8476, Adjusted R-squared: 0.8473
## F-statistic: 2772 on 2 and 997 DF, p-value: < 2.2e-16
Using the quadratic for a test is actually ok-ish, though I’d prefer to compare against something more flexible, like a spline, unless a quadratic has some theoretical basis in context. (It obviously doesn’t here, because there is no context). What’s a problem is using the fitted quadratic as an estimate of \(E[Y|X=x]\).
plot(y~x)
quadratic<-bquote(.(coef(m)[1])+x*.(coef(m)[2])+x^2*.(coef(m)[3]))
do.call(curve,list(quadratic,add=TRUE,col="purple",lwd=2))
In particular, drawing the conclusion that \(E[Y|X=x]\) goes up again at the right extreme is completely without basis. The fitted curve does. It has to. It’s a quadratic.
In this case we know the true mean doesn’t go up again, it is monotone decreasing.
plot(y~x)
curve(exp(-x/2),col="blue",lwd=2,add=TRUE)
In this case you’ve seen the raw data, so you know the right-hand arm of the quadratic has nothing to support it. But suppose you saw
plot(y~x,type="n",ylim=c(-.4,1.4))
abline(lm(y~x),col="red",lwd=2)
do.call(curve,list(quadratic,add=TRUE,col="purple",lwd=2))
You’d know from the model that the quadratic was somehow better, and you might be willing to trust that the range of the plot matched the range of the data somehow. You’d have no way to know which parts of the quadratic were well-supported; whether there was really evidence of an increase at high \(x\), or just non-linearity at low \(x\).
Unfortunately, that sort of plot is standard practice in some areas, instead of
plot(y~x)
lines(supsmu(x,y),col="orange",lwd=2)