Recall the correlation \(R = \frac{SS_{xy}}{\sqrt{SS_{xx}SS_{yy}}}\).
\(R^2\) is called the coefficient of determination (or multiple R-squared).
Practical interpretation of \(R^2\): about \(100(R^2)\%\) of the total sum of squares of deviations of the sample y values about their mean can be explained by (or attributed to) using x to predict y in the linear model. More simply, \(100(R^2)\%\) of the variation in y is explained by the model.
x = c(0, 1, 2, 3)
y = c(8, 5, 7, 4)
fit = lm(y ~ x)
fit$coefficients
## (Intercept)           x 
##         7.5        -1.0
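For the example data, \(R^2\) can be computed directly from the sums of squares and checked against the value R reports (a quick sketch using the fit object above):

SSxy = sum((x - mean(x)) * (y - mean(y)))   # sum of cross-products
SSxx = sum((x - mean(x))^2)                 # sum of squares for x
SSyy = sum((y - mean(y))^2)                 # total sum of squares for y
R = SSxy / sqrt(SSxx * SSyy)                # correlation
R^2
## [1] 0.5
summary(fit)$r.squared                      # same value from the fitted model
## [1] 0.5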
The adjusted coefficient of determination is given by:
\[R^2_a = 1-\frac{n-1}{n-(k+1)} \Big(1-R^2\Big) = 1-\frac{n-1}{n-(k+1)} \Big(\frac{SSE}{SS_{yy}}\Big)\]
where \(n\) is the number of observations and \(k\) is the number of predictor variables, so that \(k+1\) is the number of estimated model parameters (for simple linear regression with a single x, \(k = 1\) and \(k+1 = 2\)). \(R^2_a\) takes into account, or adjusts for, both the sample size and the number of model parameters, penalizing models with more parameters. Since adding more and more parameters to a model typically pushes \(R^2\) closer and closer to one, some analysts prefer to report \(R^2_a\).
Will \(R^2_a\) be bigger or smaller than \(R^2\)? Why?
Find the \(R^2_a\) for the example data.
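One way to check the answer in R (a sketch using the fit object above, with n = 4 observations and k = 1 predictor):

n = length(y)                                # number of observations
k = 1                                        # number of predictor variables (just x)
R2 = summary(fit)$r.squared
R2_a = 1 - (n - 1) / (n - (k + 1)) * (1 - R2)
R2_a
## [1] 0.25
summary(fit)$adj.r.squared                   # built-in adjusted R-squared agrees
## [1] 0.25

Here \(R^2_a = 0.25\) is smaller than \(R^2 = 0.5\), as expected, since the factor \(\frac{n-1}{n-(k+1)} \ge 1\) inflates \(1-R^2\).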