In this, our first blog, we will cover some of the very basics of linear regression. There is a lot of literature and many resources online covering linear regression, but here we will use the basics to understand the difference between \(R^2\) and adjusted \(R^2\). Even analysts with no linear regression training will know that a higher \(R^2\) is better, and they have probably been told that adjusted \(R^2\) should be used. Even if not told, since adjusted \(R^2\) is smaller than \(R^2\), any conservative analyst will pick the former over the latter.
A quick Google search on \(R^2\) will show results stating that this metric should not be used in isolation, and the same applies to adjusted \(R^2\). But rather than starting to discuss scatter and residual plots, here we will dive into really understanding what these two numbers mean. I think it will be hard to forget what the numbers represent once this explanation is understood. So let’s start.
Because we are trying to understand how a model based on observations fits the real data, our first step is to build something that by definition we do not have when doing linear regression: the real model. If we had it, there would be no need to do linear regression analysis, since we would already have our answer. But here we will use it to explain how we try to fit an estimated model to this real one and what \(R^2\) tells us about how good our fit is.
Let’s assume we have the real data and thus the real model. That is, we know the real Y given our readings X. We will start with something simple so we can produce two-dimensional plots: only one set of readings X, so our model will have only two coefficients, an intercept and a coefficient for the reading.
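In equation form, the real model is simply \(Y = b_0 + b_1 X_1\), with the coefficients defined below.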
Real model coefficients:
b0<-100
b1<-8
Real readings:
X1<-c(12,25,9,21.86,32,56,12,13,45,22,45,55,66)
So from this we have a perfect model:
Y <- b0 + b1 * X1
plot(X1,Y)
m<-lm(Y ~ X1)
abline(coef(m),lty=5)
summary(m)
## Warning in summary.lm(m): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = Y ~ X1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.015e-13 -8.375e-15 1.105e-14 1.865e-14 3.701e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+02 1.936e-14 5.166e+15 <2e-16 ***
## X1 8.000e+00 5.244e-16 1.526e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.533e-14 on 11 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.328e+32 on 1 and 11 DF, p-value: < 2.2e-16
As expected we get a perfect fit, since the model is fitted to the real values of Y generated directly from our readings. From this model we can calculate the real values of Y:
(Y<-predict(m,data.frame(X1)))
## 1 2 3 4 5 6 7 8 9 10
## 196.00 300.00 172.00 274.88 356.00 548.00 196.00 204.00 460.00 276.00
## 11 12 13
## 460.00 540.00 628.00
To be clear, these represent the real values of Y that we do have in real life situations. What we have here, and don’t normally have, is the real model; that is what we want to estimate.
In real world situations we would not have this “perfect model”; that is the model we are trying to estimate with a regression. We do not have the model coefficients, and we don’t have the real readings X. Rather, we have observations of X, and we have the real values of Y. The task is to estimate the coefficients of the model that best fits Y. But since what we have are observations of X, and not the true values, these contain a certain amount of error. This can be error when observing X, or when setting X in an experiment. So let’s assume we have observations which include a certain amount of error; we will use uniform error around the true readings, which we will call noise.
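In symbols, each observation is the true reading plus an error term, \(x_1 = X_1 + \epsilon\), where here \(\epsilon\) is drawn from a uniform distribution between 1 and 20.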
noise<-runif(length(X1),1,20)
plot(noise)
We then add this noise to the true readings to get our observations.
x1<-X1+noise
plot(x1,Y)
Now we have a good dataset to work with. This dataset resembles what we would see in real life situations.
With our dataset in hand, let’s build a linear regression model and see how it looks.
m1<-lm(Y ~ x1)
plot(x1,Y)
abline(coef(m1),lty=5)
summary(m1)
##
## Call:
## lm(formula = Y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.398 -44.114 -7.985 46.604 82.067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.7985 44.5133 -0.647 0.531
## x1 8.9331 0.9735 9.176 1.73e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 55.25 on 11 degrees of freedom
## Multiple R-squared: 0.8844, Adjusted R-squared: 0.8739
## F-statistic: 84.2 on 1 and 11 DF, p-value: 1.734e-06
We can see what looks like a very good model. Looking at \(R^2\), it is close to 0.9, not bad. But this model isn’t a perfect fit; we estimate b0 and b1, which we can cheat and compare against the real model’s coefficients (remember, in real life these are not known), and we find some discrepancies. So maybe it makes sense to better understand \(R^2\) and not just blindly give a thumbs up to a model with good-looking, close-to-one values.
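One way to do that cheating explicitly is to print the estimated coefficients next to the real ones (the exact estimates will vary with the random noise draw):
coef(m1)  # estimated intercept and slope from the noisy observations
c(b0,b1)  # the real coefficients used to generate Y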
So let’s start by calculating the estimated values of Y from this model and comparing them with the real values. We will label the former \(\hat{Y}\) and the latter, as before, Y.
(Yhat<-predict(m1,data.frame(x1)))
## 1 2 3 4 5 6 7 8
## 183.1879 228.6173 223.4849 308.9275 273.9328 529.2881 242.9831 247.4024
## 9 10 11 12 13
## 413.3957 345.3976 467.9851 474.1639 672.1137
plot(Y,Yhat)
abline(0,1)
For clarity I plotted a line with zero intercept and slope of 1. If our model were perfect, all points would fall on this line.
So now let’s dive into \(R^2\). We are told it represents the percentage of the variation in the data that the model explains. So it’s obvious that it is a number between 0 and 1, and that the higher it is, the more our model “explains” the variation of the data. But what exactly does this mean?
Well, let’s start by calculating the “variation in the data”. We will call this TSS (total sum of squares), and it is basically how much the data changes around its mean. Because what we are trying to model is Y, we compute TSS using Y and how much it “moves” around its mean. Because we want the amount of variation around the mean, regardless of whether it is positive or negative, we take the sum of the squares of the differences between each value and the mean.
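Written as a formula, with \(\bar{Y}\) the mean of the real values: \[ TSS = \sum_{i=1}^{N} (Y_i - \bar{Y})^2 \]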
(TSS<-sum((Y-mean(Y))^2))
## [1] 290589.9
So we have our first component in the definition of \(R^2\), the total variation in the data. Now we need the amount of variation “explained” by the model. This value is basically the variation of the model’s predicted values around the mean of the real data, rather than the variation of the real data around its mean. Because we have already calculated the model’s predicted values \(\hat{Y}\), we can simply take the difference between these values and the mean value of Y. Again we take the sum of the squares so negative variation doesn’t cancel out positive variation. We will call this ESS (explained sum of squares).
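In formula form: \[ ESS = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2 \]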
(ESS<-sum((Yhat-mean(Y))^2))
## [1] 257011.8
So now the definition of \(R^2\) becomes easy to understand: it is the ratio of these two, \(R^2 = \frac{ESS}{TSS}\).
(R.squared<-ESS/TSS)
## [1] 0.8844486
Now \(R^2\) has a true, tangible meaning. We know that its value will be higher as more of the variation is explained by the model, which means that the variation of the predicted values around the mean of the real values accounts for a larger share of the total variation. It’s easy to see that this would not only be accomplished by a better model, but also by adding more predictor variables. This is because as we add predictors, ESS can grow while TSS stays fixed, giving us a higher \(R^2\) even though the model isn’t necessarily better.
Another way of looking at \(R^2\) is by realizing that the total variation has to be equal to the variation explained by the model plus the variation not explained. If we call the unexplained portion RSS (residual sum of squares), then simply TSS = ESS + RSS.
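Written as a formula, RSS is the sum of squared residuals, \[ RSS = \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2 \] which we compute as: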
(RSS<-sum((Y-Yhat)^2))
## [1] 33578.08
We can then see if our equality holds true:
(round(TSS,1))==(round(ESS+RSS,1))
## [1] TRUE
Then, substituting \(ESS = TSS - RSS\), we can rewrite \(R^2\) as \(1 - \frac{RSS}{TSS}\) and obtain the same value.
(R.squared = 1 - RSS/TSS)
## [1] 0.8844486
Now that we truly understand \(R^2\), what is the fanfare around adjusted \(R^2\)? Well, first we need to understand that adjusted \(R^2\) becomes useful when dealing with models with more than one predictor variable. So let’s take our idealized model and add two more X predictor variables.
X2<-c(10,16,4,8,35.98,65,21.34,13.77,54.33,21,76,35,56)
X3<-c(5,6.55,4.88,9.1,35,65.55,12.3,31.75,59.34,22,86,15,46.33)
And just as we did for the single predictor scenario, we will build our “perfect” model and then our “real life” dataset.
b0<-100
b1<-8
b2 <- 12
b3 <- 10
Y <- b0 + b1 * X1 + b2 * X2 + b3 * X3
noise<-runif(length(X1),1,20)
x1<-X1+noise
noise<-runif(length(X2),1,20)
x2<-X2+noise
noise<-runif(length(X3),1,20)
x3<-X3+noise
With our dataset in hand, we start building models, beginning with a single predictor, similar to the previous single predictor scenario.
m2<-lm(Y ~ x1)
summary(m2)
##
## Call:
## lm(formula = Y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -596.4 -155.7 -53.7 160.6 1085.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -84.603 320.324 -0.264 0.79657
## x1 25.906 6.747 3.840 0.00275 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 455.2 on 11 degrees of freedom
## Multiple R-squared: 0.5727, Adjusted R-squared: 0.5339
## F-statistic: 14.74 on 1 and 11 DF, p-value: 0.002747
(Yhat<-predict(m2,data.frame(x1)))
## 1 2 3 4 5 6 7
## 521.6835 1078.4327 393.0375 909.0787 978.2181 1678.1524 668.9105
## 8 9 10 11 12 13
## 407.6259 1544.7524 746.0053 1146.6424 1706.3766 1817.0040
(TSS<-sum((Y-mean(Y))^2))
## [1] 5333198
(ESS<-sum((Yhat-mean(Y))^2))
## [1] 3054420
(R.squared<-ESS/TSS)
## [1] 0.5727182
We have a single predictor model for data we know to have three predictors, so now we build a model with all three predictors.
m3<-lm(Y ~ x1+x2+x3)
summary(m3)
##
## Call:
## lm(formula = Y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.05 -77.96 25.60 41.66 106.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -228.611 64.451 -3.547 0.006245 **
## x1 9.821 1.907 5.149 0.000604 ***
## x2 11.013 2.888 3.814 0.004128 **
## x3 9.179 2.284 4.018 0.003026 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90.36 on 9 degrees of freedom
## Multiple R-squared: 0.9862, Adjusted R-squared: 0.9816
## F-statistic: 214.7 on 3 and 9 DF, p-value: 1.088e-08
(Yhat<-predict(m3,data.frame(x1,x2,x3)))
## 1 2 3 4 5 6 7
## 443.9556 658.1734 162.6045 442.6791 1096.1047 1957.9025 674.2762
## 8 9 10 11 12 13
## 647.7818 1849.4107 732.0528 2170.9362 1034.3452 1725.6972
(TSS<-sum((Y-mean(Y))^2))
## [1] 5333198
(ESS<-sum((Yhat-mean(Y))^2))
## [1] 5259721
(R.squared<-ESS/TSS)
## [1] 0.9862226
Even before we jump into adjusted \(R^2\), we notice our \(R^2\) has increased when using all three predictors. In this case it is obvious that this would happen: we know beforehand that our data has three predictors, so of course we would expect the three predictor model to be a much better fit. But in a real life situation we would not know this; we would not know how many predictors drive the data, and we might be tempted to conclude that this is surely a better model because \(R^2\) is much higher. Not so fast. It turns out that \(R^2\) always increases as we increase the number of predictor variables. Remember that \(R^2\) is the ratio ESS/TSS. As we add predictors, ESS increases while TSS stays the same, and thus \(R^2\) becomes larger.
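Before the fuller demonstration below, a minimal sanity check makes the point (z is a hypothetical column of pure noise, not part of the original example; exact values will vary from run to run):
z<-runif(length(x1),0,1)
summary(lm(Y ~ x1))$r.squared      # one real predictor
summary(lm(Y ~ x1 + z))$r.squared  # adding pure noise
The second value can never be smaller than the first.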
To see this with our own data, let’s change one of our predictors to just new random numbers and see if we again obtain a larger \(R^2\) than the single predictor model.
x2<-runif(length(X2),300,600)
m4<-lm(Y ~ x1+x2+x3)
summary(m4)
##
## Call:
## lm(formula = Y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -214.71 -126.77 0.65 87.00 172.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -264.18940 234.31136 -1.128 0.288686
## x1 14.23228 2.51706 5.654 0.000312 ***
## x2 0.04436 0.42544 0.104 0.919245
## x3 16.91713 1.71482 9.865 4.01e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 146.1 on 9 degrees of freedom
## Multiple R-squared: 0.964, Adjusted R-squared: 0.952
## F-statistic: 80.33 on 3 and 9 DF, p-value: 8.124e-07
Yhat<-predict(m4,data.frame(x1,x2,x3))
ESS<-sum((Yhat-mean(Y))^2)
(R.squared<-ESS/TSS)
## [1] 0.963998
Again we see a much larger \(R^2\) than our original model with one predictor. Is this a better model, then? How can it be, if we added a predictor with nothing but random values? How can we tell whether we actually have a better model when we start adding predictors? This is where we use adjusted \(R^2\), which we calculate with the formula:
\(\text{adjusted } R^2 = 1 - \frac{(1 - R^{2})(N-1)}{N-p-1}\)
where N is the sample size and p the number of predictors.
What adjusted \(R^2\) does is account for the number of predictors added to a model. Now calculating it for our last, three predictor model we get:
N<-length(x1)
p<-3
(R.squared.adjusted<-1-((1-R.squared)*(N-1))/(N-p-1))
## [1] 0.9519973
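For reuse, the calculation can be wrapped in a small helper function (a hypothetical name, not from the original code); applied to the same inputs it reproduces the value above:
adj.r2<-function(r.squared,N,p){1-((1-r.squared)*(N-1))/(N-p-1)}
adj.r2(R.squared,N,p)  # same value as computed above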
The value for adjusted \(R^2\) has decreased, as expected, since we know the predictor x2 is simply noise. But we are still looking at a larger value than our original model with only one predictor x1. This is because adjusted \(R^2\) accounts for the lack of contribution of x2, but keeps the effect of x3, which is still a valid predictor and is positively impacting our model. To see the full negative effect of x2 we need to build a model using only it and our original predictor x1.
m5<-lm(Y ~ x1+x2)
summary(m5)
##
## Call:
## lm(formula = Y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -560.98 -180.98 -68.86 188.87 1086.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.0953 756.4233 0.081 0.93722
## x1 25.4949 7.3147 3.485 0.00587 **
## x2 -0.2971 1.3827 -0.215 0.83419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 476.3 on 10 degrees of freedom
## Multiple R-squared: 0.5747, Adjusted R-squared: 0.4896
## F-statistic: 6.756 on 2 and 10 DF, p-value: 0.01392
Yhat<-predict(m5,data.frame(x1,x2))
ESS<-sum((Yhat-mean(Y))^2)
(R.squared<-ESS/TSS)
## [1] 0.5746817
The computed \(R^2\) for this model shows it is slightly better than the original single predictor model, even though we’ve added a predictor variable which we know is noise and has little to add to our estimation of Y. Adjusted \(R^2\) tells a different story.
p<-2
(R.squared.adjusted<-1-((1-R.squared)*(N-1))/(N-p-1))
## [1] 0.4896181
Here we can see that this model is in fact worse than the original one predictor model. Adjusted \(R^2\) has removed some of the effect of simply adding more predictors and shows this to be a less appropriate model for estimating Y.
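As a final cross-check (a small addition, not in the original code), the same comparison can be read directly from the model summaries:
summary(m2)$adj.r.squared  # one real predictor x1
summary(m5)$adj.r.squared  # x1 plus the pure-noise predictor x2
The second value, with the noise predictor added, is the smaller of the two.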