library(tidyr)
library(dplyr)
library(tibble)
library(TTR)
library(tidyquant)
library(broom)
library(corrr)
library(modelr)
library(hexbin)
library(Quandl)
library(ggplot2)
Note: These are my notes, so they are very rough…
(Mashed from: https://stats.stackexchange.com/questions/22718/what-is-the-difference-between-linear-regression-on-y-with-x-and-x-with-y/22721)
First, we draw 1,000 values from a normal distribution, y, with a mean of 5 and an SD of 1 (no seed is set, so exact values vary between runs):
y <- rnorm(1000, mean=5, sd=1)
Next, we purposely create a second variable, x, which is simply 5 times the value of y:
x = y * 5
By design, we have perfect correlation of x and y:
cor(x,y)
## [1] 1
cor(y,x)
## [1] 1
Now fit linear regressions in both directions:
lm.1 = lm(y~x)
coef(lm.1)
## (Intercept) x
## -4.493867e-15 2.000000e-01
lm.2 = lm(x~y)
coef(lm.2)
## (Intercept) y
## -3.235584e-14 5.000000e+00
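Note the two slopes: regressing y on x gives 0.2 (= 1/5), while regressing x on y gives 5. In general the slope of y on x is cov(x, y)/var(x) and the slope of x on y is cov(x, y)/var(y), so the product of the two slopes equals the squared correlation. A quick check (a sketch using the lm.1 and lm.2 fits above):
# The product of the two slopes equals the squared correlation
coef(lm.1)[2] * coef(lm.2)[2]  # 0.2 * 5 = 1
cor(x, y)^2                    # also 1, since the correlation is perfect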
Low R-squared values in multiple regression analysis? Is this bad?
In my regression analysis I found R-squared values from 2% to 15%. Can I include such low R-squared values in my research paper, or do R-squared values always have to be 70% or more? If anyone can refer me to books or journal articles about the validity of low R-squared values, it would be highly appreciated.
If you look at how R-squared is calculated, you will realize that this is really a “storm in a teacup” discussion. More formally, assume:
y1 = b0 + b1*x + 1*e
y2 = g0 + g1*x + 1000*u
(Source: “Low R-squared values in multiple regression analysis?”, https://www.researchgate.net/post/Low_R-squared_values_in_multiple_regression_analysis [accessed Aug 17, 2017].)
# Define variable x
x = seq(0, 100, by = 1)
# Parameters for linear models
b0 = 1; g0 = 1
b1 = 10; g1 = 10
# Normally distributed errors; use length(x) (= 101) so the error vectors match x
e = rnorm(n = length(x), mean = 0, sd = 1)
u = rnorm(n = length(x), mean = 0, sd = 1)
# Define linear models to test
y1 = b0 + b1*x + 1*e
y2 = g0 + g1*x + 1000*u # The same model, but the error is scaled up 1000-fold (still normally distributed)
# Define linear model
lm.1 = lm(y1~x)
# Find prediction interval
lm.1.pred = predict(lm.1, interval = "prediction", level = 0.95)
ggplot() +
  geom_point(aes(x = x, y = y1)) +
  geom_smooth(aes(x = x, y = y1), method = "lm") +
  geom_line(aes(x = x, y = lm.1.pred[, "lwr"]), linetype = "dotted", color = "red") +
  geom_line(aes(x = x, y = lm.1.pred[, "upr"]), linetype = "dotted", color = "red")
c(cor(y1,x), coef(lm.1)[2])
## x
## 0.9999939 10.0031588
This shows that the correlation is almost 1 and the estimated b1 is approximately 10, the true value.
# Define linear model
lm.2 = lm(y2~x)
# Find prediction interval (as_tibble replaces the deprecated as.tibble;
# the warning below just notes that we are predicting on the training data)
lm.2.pred = as_tibble(predict(lm.2, interval = "prediction", level = 0.95))
## Warning in predict.lm(lm.2, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses
ggplot() +
  geom_point(aes(x = x, y = y2)) +
  geom_smooth(aes(x = x, y = y2), method = "lm") +
  geom_line(aes(x = x, y = lm.2.pred$lwr), linetype = "dotted", color = "red") +
  geom_line(aes(x = x, y = lm.2.pred$upr), linetype = "dotted", color = "red") +
  ggtitle("Low R-squared, low p-value. Red dotted lines indicate the 95% prediction interval")
Note: So when might we see a significant p-value for the slope together with a low R-squared? Imagine a swarm of ant princesses and drones in a mating flight, slowly rising skyward. As a swarm they are widely spread out, but their center of mass is definitely moving upward.
This situation may also be relevant in finance, where the points might represent the returns of various equities. Each equity can have a very positive or very negative return, but with enough equities in the portfolio the individual “errors”, however large, should cancel out, leaving the tendency given by the regression line.
Individual observations fall with 95% probability within the prediction interval, which here is very wide. A low R-squared therefore makes the model of little use for predicting single cases; it is useful when the application involves many cases, so that the variance cancels out and the mean trend is what matters.
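As a rough check of that coverage claim (a sketch; the exact fraction varies with the random draw), we can count how many simulated points fall inside the interval:
# Fraction of observations inside the 95% prediction interval (expect ~0.95)
mean(y2 >= lm.2.pred$lwr & y2 <= lm.2.pred$upr)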
c(cor(y2,x), coef(lm.2)[2])
## x
## 0.2476061 9.3500932
With all the added noise, the estimated g1 (the value under x) is 9.35, while the true value is 10.
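Even though the point estimate misses 10, the 95% confidence interval for the slope should usually cover it. A quick check with base R's confint() (exact bounds vary with the random draw):
# 95% confidence interval for the lm.2 coefficients;
# the row for x should usually contain the true slope of 10
confint(lm.2, level = 0.95)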
summary(lm.1)
##
## Call:
## lm(formula = y1 ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27445 -0.72889 0.00134 0.74705 2.11668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.847741 0.204077 4.154 6.94e-05 ***
## x 10.003159 0.003526 2837.033 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.033 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.049e+06 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm.2)
##
## Call:
## lm(formula = y2 ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2811.03 -656.63 47.53 673.42 2655.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.975 212.824 0.418 0.6768
## x 9.350 3.677 2.543 0.0125 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1077 on 99 degrees of freedom
## Multiple R-squared: 0.06131, Adjusted R-squared: 0.05183
## F-statistic: 6.466 on 1 and 99 DF, p-value: 0.01254
Notice that the R-squared is very low, but the p-value is significant.
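This is less mysterious once you recall that, in simple regression with one predictor, R-squared is just the squared correlation between response and predictor, so the modest correlation of ~0.25 translates directly into an R-squared of ~0.06:
# In simple linear regression, R-squared equals the squared correlation
cor(y2, x)^2             # ~0.0613
summary(lm.2)$r.squared  # same value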
xs = scale(x)
y1s = scale(y1)
y2s = scale(y2)
lm.1s = lm(y1s~xs)
ggplot() +
  geom_point(aes(x = xs, y = y1s)) +
  geom_smooth(aes(x = xs, y = y1s), method = "lm")
c(cor(y1s,xs), coef(lm.1s)[2])
## xs
## 0.9999939 0.9999939
Notice that after standardizing the variables, the slope b1 equals the correlation.
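This follows from the identity b1 = r * sd(y)/sd(x): scaling sets both standard deviations to 1, so the slope collapses to the correlation. A check on the unscaled fit:
# Slope identity: b1 = r * sd(y) / sd(x)
cor(y1, x) * sd(y1) / sd(x)  # matches the slope below
coef(lm.1)[2]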
lm.2s = lm(y2s~xs)
ggplot() +
  geom_point(aes(x = xs, y = y2s)) +
  geom_smooth(aes(x = xs, y = y2s), method = "lm")
c(cor(y2s,xs), coef(lm.2s)[2])
## xs
## 0.2476061 0.2476061
summary(lm.2s)
##
## Call:
## lm(formula = y2s ~ xs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.54063 -0.59347 0.04296 0.60864 2.39984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.859e-17 9.689e-02 0.000 1.0000
## xs 2.476e-01 9.737e-02 2.543 0.0125 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9737 on 99 degrees of freedom
## Multiple R-squared: 0.06131, Adjusted R-squared: 0.05183
## F-statistic: 6.466 on 1 and 99 DF, p-value: 0.01254
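A final check (a sketch using the fits above): standardizing rescales the slope but leaves the fit quality untouched, so R-squared, the t statistic, and the p-value are identical for lm.2 and lm.2s:
# Standardizing changes the slope, not the fit quality
all.equal(summary(lm.2)$r.squared, summary(lm.2s)$r.squared)  # TRUE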