Some limitations of R squared

We will see, via example, that the R squared coefficient is not sufficient, or even necessary, for establishing clinical significance in a regression setting.

We will also see that, in multiple linear regression, the R squared can not tell us which regression variable is more important than the other.

Along the way, press the button that says “code” on the right hand side of the page, if you want to see the underlying R code.

To download the data and experiment with it yourself, a csv file for the first example can be found here, and a csv file for the second example can be found here.

R squared is not sufficient

Here we see an example where the R squared is high, but there is no clinical significance present.

First, we simulate the data. We define a large sample size (n = 1000). We simulate an error term with very low standard deviation (sd = 0.00001). We define x1 to be a continuous variable ranging from 0 to 1, and we simulate the regression relationship: \[y = 0.0002 \cdot x_1 + \text{error_term}\]

set.seed(0)
n <- 1000
sd <- 0.00001
error_term <- rnorm(n, sd = sd)
x1 <- seq(0,1, length.out = n)
y <- 0.0002*x1 + error_term

Now we define x2 to be a random normal variable that is completely independent from y.

x2 <- rnorm(n)

Next, we fit the linear model, searching for a true model of the form: \[y = \beta_1x_1 + \beta_2x_2\]

model <- lm(y ~ x1 + x2)

The R squard is greater than 0.97, indicating high explained variance.

cat(summary(model)$r.sq)

## 0.9709235

The confidence intervals show statistical significance for x1, but that there is no clinical significance (i.e., the upper and lower bounds of the confidence interval for x1 are both very close to 0). We rightfully see no statistical significance for x2.

confint(model)

##                     2.5 %       97.5 %
## (Intercept) -1.121399e-06 1.356833e-06
## x1           1.972968e-04 2.015873e-04
## x2          -7.191445e-07 4.807556e-07

Using the plot below, we can visualise what is happening. We can see that the estimated regression line (red) is a better fit to the data than the mean of the y values (dotted horizontal line), so R squared is high. However, by looking at the range of values on the y-axis, we can also see that the slope of the estimated regression line is very small, so there is no clinical significance.

plot(x1,y)
abline(a = model$coefficients[1], b = model$coefficients[2], col = "red")
abline(a = mean(y), b = 0, lty = 3)

Notice a second limitation of R squared: it does not tell us whether x1 or x2 contributed more to the overall fit. We can see that x2 contributed almost nothing (as it should), with the below plot

plot(x2, y)
abline(a = mean(y), b = 0, lty = 3)

Finally, it should be noted that, in the above, we have made the assumption that a slope as flat as 0.0002 is somehow ‘small’ and clinically irrelevant. However, there is no universal scale to reliably determine clinical significance. Domain expertise is always required to determine how strong a relationship should be before it is considered relevant. Similarly, in the material below, we make the assumption that a slope as steep as 2 should be considered clinically relevant.

R squared is not necessary

Here we see an example where the R squared is low, but there is a clear clinical significance present.

Above, we have seen a situation where a high R squared value did not indicate clinical significance. Can we have a situation where a strong clinical significance is present, but the R squared value is low?

Again, we define a large sample size (n = 1000). We simulate an error term with high standard deviation (sd = 2). We define x1 to be a continuous variable ranging from 0 to 1, and we simulate a regression relationship in which a one unit increase in y, leads on average to a one unit increase in x: \[y = x_1 + \text{error_term}\]

set.seed(0)
n <- 1000
sd <- 2
error_term <- rnorm(n, sd = sd)
x1 <- seq(0,1, length.out = n)
y <- x1 + error_term
write.csv(data.frame(y = y, x1 = x1), file = "data2.csv")

We fit a simple linear model, searching for a representation of the form: \[y = \beta_1 x_1\]

model <- lm(y ~ x1)

The R squared value is very low!

cat(summary(model)$r.sq)

## 0.01624086

However, thankfully, the confidence interval for x1 tells us there is a statistically and clinically significant effect of x1 on y. We are 95% confident that a one unit increase in y will result, on average, in an increase in x of somewhere between 0.458 and 1.316 units.

confint(model)

##                  2.5 %    97.5 %
## (Intercept) -0.2227677 0.2724902
## x1           0.4581606 1.3157578

Again, a plot will help us visualise what is happening. In the below plot, we see that the data is very noisy, and it would be difficult to determine the strength of the relationship, between x1 and y, by eye. This can be phrased another way: the regression line (in red) is not much better at explaining variation in the data than taking the mean of y (the dotted horizontal line), so the R squared is low. Despite this, as we have seen, our confidence interval is powerful enough to detect the statistically and clinically significant relationship between y and x1.

plot(x1,y)
abline(a = model$coefficients[1], b = model$coefficients[2], col = "red")
abline(a = mean(y), b = 0, lty = 3)

So, the low R squared might have convinced us that the model is weak, misleading us away from the important evidence, had we not looked at the confidence interval for x1.

Indeed, this is perhaps more akin to a real world scenario than the previous example, since real world data is often quite noisy. However, it should be noted that the strength of the clinical significance is not, on its own, enough to suggest an underlying causal mechanism between x1 and y. When a causal mechanism is of interest, a good experimental design is often needed to reduce unexplained variation, and to control for any relevant confounding variables.

Some limitations of R squared

Daniel Fryer

21/11/2019

R squared is not sufficient

R squared is not necessary