From a simple regression model, we can derive the coefficient of determination using the following:
\[ R^2 = 1 - \frac{SSE}{SSTO} = 1 - \frac{\sum_i{(y_i - \hat{y}_i)^2}}{\sum_i{(y_i - \bar{y})^2}} \] The coefficient of determination is commonly just called the r-squared value.
In the simple case where the continuous dependent variable is \(Y\) and there is just one continuous predictor \(X\), the r-squared value is equal to the square of the correlation between \(X\) and \(Y\).
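In symbols, writing \(r_{XY}\) for the correlation coefficient, the relationship is:
\[ R^2 = r_{XY}^2 \quad \Longrightarrow \quad |r_{XY}| = \sqrt{R^2} \]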
Now, if we have one continuous variable and one categorical variable, the usual (Pearson) correlation between them is not defined. However, we can use regression to come up with a numeric value that can be treated much like a correlation. To do so, we fit a regression model with the continuous variable as the dependent variable and the categorical variable as the independent variable; the model gives an r-squared value, and we take the square root of that value.
Let us see an example:
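The data behind this example is not shown in the original; a minimal, hypothetical setup of the same shape (500 observations of a continuous \(X\) and a linearly related \(Y\), matching the 495 + 4 degrees of freedom in the output further below) could look like this. A fresh simulation will, of course, give slightly different numbers than the ones printed here.
set.seed(2018)  ## hypothetical seed; the original data is not shown
X <- runif(500, min = 0, max = 100)  ## assumed continuous predictor
Y <- 10 + 0.8 * X + rnorm(500, sd = 25)  ## assumed linear trend plus noise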
modelXY <- lm(Y ~ X)
## model summary
sumryXY <- summary(modelXY)
## r-sq of model
rsqXY <- sumryXY$r.squared
print(rsqXY)
## [1] 0.5544638
## square root of r-sq
print(sqrt(rsqXY))
## [1] 0.7446233
So, the square root of the r-squared value is exactly equal to the correlation coefficient (its absolute value, in the case of negative correlation) between \(X\) and \(Y\).
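We can verify this claim directly (using whatever data \(X\) and \(Y\) hold, e.g., the hypothetical setup above):
print(abs(cor(X, Y)))
Next, let us bin \(X\) into five categories to create a categorical predictor: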
X_cat <- cut(X, breaks = 5, labels = c("Sam", "Pete",
"Jon", "Tom", "Chris"))
summary(X_cat)
## Sam Pete Jon Tom Chris
## 81 121 92 110 96
modelX_catY <- lm(Y ~ X_cat)
## summary
sumryX_catY <- summary(modelX_catY)
print(sumryX_catY)
##
## Call:
## lm(formula = Y ~ X_cat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.61 -16.11 0.18 16.07 95.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.799 2.780 3.885 0.000116 ***
## X_catPete 22.683 3.592 6.315 6e-10 ***
## X_catJon 44.376 3.812 11.642 < 2e-16 ***
## X_catTom 56.372 3.663 15.390 < 2e-16 ***
## X_catChris 79.375 3.774 21.029 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.02 on 495 degrees of freedom
## Multiple R-squared: 0.527, Adjusted R-squared: 0.5231
## F-statistic: 137.9 on 4 and 495 DF, p-value: < 2.2e-16
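Note that with a single categorical predictor, this model is one-way ANOVA in disguise: the intercept is the mean of the baseline group ("Sam"), and each coefficient is the difference between a group's mean and that baseline. The square root of the resulting r-squared is therefore the classical correlation ratio (eta). The group means can be checked directly:
print(tapply(Y, X_cat, mean))  ## group means; compare with intercept + coefficients above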
## r-sq
rsqX_catY <- sumryX_catY$r.squared
print(rsqX_catY)
## [1] 0.526957
## square root of r-sq
print(sqrt(rsqX_catY))
## [1] 0.725918
Observe that the square root of the r-squared (0.726) is close to, but slightly below, the original correlation between \(X\) and \(Y\) (0.745): binning \(X\) into five categories discards information within each bin, so some of the linear association is lost.
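As a hypothetical follow-up, finer binning discards less within-bin information, so the categorical model recovers more of the linear trend (although very fine bins eventually overfit):
X_cat20 <- cut(X, breaks = 20)  ## finer, unlabeled bins
rsq20 <- summary(lm(Y ~ X_cat20))$r.squared
print(sqrt(rsq20))  ## typically closer to abs(cor(X, Y)) than with 5 bins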
– Riaz Khan
MS, Statistics
South Dakota State University