From a simple regression model, we can derive the coefficient of determination as follows:

\[ R^2 = 1 - \frac{SSE}{SSTO} = 1 - \frac{\sum{(y_i - \hat{y}_i)^2}}{\sum{(y_i - \bar{y})^2}} \] The coefficient of determination is commonly called the r-squared value.

In the simple case where the continuous dependent variable is \(Y\) and there is a single continuous predictor \(X\), the r-squared value is equal to the square of the correlation between \(X\) and \(Y\).
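To see why, recall that for least squares with an intercept, \(SSTO = SSE + SSR\), so \(R^2 = SSR/SSTO\). The fitted values of the simple regression are \(\hat{y}_i = \bar{y} + r \frac{s_y}{s_x}(x_i - \bar{x})\), where \(r\) is the correlation of \(X\) and \(Y\). Substituting,

\[ R^2 = \frac{\sum{(\hat{y}_i - \bar{y})^2}}{\sum{(y_i - \bar{y})^2}} = \frac{r^2 \frac{s^2_y}{s^2_x}\sum{(x_i - \bar{x})^2}}{\sum{(y_i - \bar{y})^2}} = r^2 \]

since \(\sum{(x_i - \bar{x})^2} = (n-1)s^2_x\) and \(\sum{(y_i - \bar{y})^2} = (n-1)s^2_y\).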

Now, if we have one continuous variable and one categorical variable, the usual (Pearson) correlation between them is not defined. However, we can use regression to obtain a numeric value that can be treated much like a correlation: fit a regression model with the continuous variable as the dependent variable and the categorical variable as the independent variable, and take the square root of the model's r-squared value.
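The whole recipe fits in a few lines. As a minimal sketch (the helper name cont_cat_assoc is made up for illustration; it is not a standard R function):

## association between a continuous y and a categorical x_cat:
## the square root of the r-squared from regressing y on the factor
cont_cat_assoc <- function(y, x_cat) {
  sqrt(summary(lm(y ~ factor(x_cat)))$r.squared)
}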

Let us see an example:

Make two correlated variables

## uniform predictor X on [1, 100]
set.seed(100)
X <- runif(500, 1, 100)

## Gaussian noise with mean 0 and sd 25
set.seed(200)
noise <- rnorm(500, 0, 25)

## Y is X plus noise, so X and Y are positively correlated
Y <- X + noise


## Correlation of X and Y
print(cor(X, Y))
## [1] 0.7446233

Make a regression model with X and Y

modelXY <- lm(Y ~ X)


## model summary
sumryXY <- summary(modelXY)

## r-sq of model
rsqXY <- sumryXY$r.squared

print(rsqXY)
## [1] 0.5544638
## square root of r-sq
print(sqrt(rsqXY))
## [1] 0.7446233

So, the square root of the r-squared value is exactly equal to the correlation coefficient between \(X\) and \(Y\) (the absolute value of the correlation in the case of a negative relationship).
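The negative-correlation caveat is easy to verify with a small sketch (Y_neg below is a new illustrative variable, not part of the example above):

## flip the sign of the relationship so the correlation is negative
Y_neg <- -X + noise

cor(X, Y_neg)                           ## negative value
sqrt(summary(lm(Y_neg ~ X))$r.squared)  ## equals abs(cor(X, Y_neg))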

Make a categorical X

## bin X into 5 equal-width intervals with arbitrary name labels
X_cat <- cut(X, breaks = 5, labels = c("Sam", "Pete",
                                       "Jon", "Tom", "Chris"))

summary(X_cat)
##   Sam  Pete   Jon   Tom Chris 
##    81   121    92   110    96

Make a regression model with Y and X_cat

modelX_catY <- lm(Y ~ X_cat)

## summary
sumryX_catY <- summary(modelX_catY)
print(sumryX_catY)
## 
## Call:
## lm(formula = Y ~ X_cat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -75.61 -16.11   0.18  16.07  95.03 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   10.799      2.780   3.885 0.000116 ***
## X_catPete     22.683      3.592   6.315    6e-10 ***
## X_catJon      44.376      3.812  11.642  < 2e-16 ***
## X_catTom      56.372      3.663  15.390  < 2e-16 ***
## X_catChris    79.375      3.774  21.029  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.02 on 495 degrees of freedom
## Multiple R-squared:  0.527,  Adjusted R-squared:  0.5231 
## F-statistic: 137.9 on 4 and 495 DF,  p-value: < 2.2e-16
## r-sq
rsqX_catY <- sumryX_catY$r.squared

print(rsqX_catY)
## [1] 0.526957
## square root of r-sq
print(sqrt(rsqX_catY))
## [1] 0.725918
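For intuition about the coefficient table above: lm() dummy-codes the factor, so the intercept is the mean of \(Y\) in the baseline group ("Sam"), and each X_cat coefficient is the difference between that group's mean and the baseline mean. A quick check from the data:

## group means of Y; the first one equals the intercept, and the
## differences from it reproduce the X_cat coefficients above
print(tapply(Y, X_cat, mean))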

Observe that the square root of the r-squared (0.7259) is close to, though not exactly equal to, the original correlation of X and Y (0.7446); binning X into five categories discards some information, so the two values agree only approximately.
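Incidentally, this square root has a classical name: for a one-way model like this, the r-squared equals the correlation ratio \(\eta^2\),

\[ R^2 = \eta^2 = \frac{\sum_k{n_k(\bar{y}_k - \bar{y})^2}}{\sum_i{(y_i - \bar{y})^2}} \]

where \(\bar{y}_k\) and \(n_k\) are the mean and size of group \(k\). So the value computed above is simply \(\eta\).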

–Riaz Khan
MS, Statistics
South Dakota State University