In this post, I will show how the results of R's lm() function can be reproduced using matrix algebra. The dataset I will be using is Moneyball.
The variable TEAM_BATTING_HBP is excluded from the analysis, and missing values are filled in using k-NN imputation.
#options(scipen=999)
library(dplyr)
library(DMwR) # knnImputation
library(knitr) # Report display, table format
library(kableExtra)
#Load the Moneyball training data
bb.df <- read.csv("https://raw.githubusercontent.com/akulapa/Data621-Week05-Discussion/master/moneyball-training-data.csv", header= TRUE, stringsAsFactors = F)
#Derive singles from total hits
bb.df$TEAM_BATTING_1B = bb.df$TEAM_BATTING_H - bb.df$TEAM_BATTING_2B - bb.df$TEAM_BATTING_3B - bb.df$TEAM_BATTING_HR
#Keep the response and predictors; TEAM_BATTING_HBP is dropped
bb.df <- bb.df %>%
select(TARGET_WINS,TEAM_BATTING_1B,TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BATTING_HR,TEAM_BATTING_BB,
TEAM_BATTING_SO,TEAM_BASERUN_SB,TEAM_BASERUN_CS,TEAM_PITCHING_H,TEAM_PITCHING_HR,
TEAM_PITCHING_BB,TEAM_PITCHING_SO,TEAM_FIELDING_E,TEAM_FIELDING_DP)
#Fill in missing values using weighted-average k-NN imputation with k = 15
bb.df <- knnImputation(bb.df, 15, meth='weighAvg')
#Ensure every column is numeric after imputation
for(i in 1:ncol(bb.df)){
bb.df[,i] <- as.numeric(bb.df[,i])
}
head(bb.df,20) %>%
kable("html",caption = "Sample Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
scroll_box(width = "100%", height = "200px")
| TARGET_WINS | TEAM_BATTING_1B | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | 1199 | 194 | 39 | 13 | 143 | 842 | 116.1351 | 61.54372 | 9364 | 84 | 927 | 5456 | 1011 | 155.9764 |
| 70 | 908 | 219 | 22 | 190 | 685 | 1075 | 37.0000 | 28.00000 | 1347 | 191 | 689 | 1082 | 193 | 155.0000 |
| 86 | 973 | 232 | 35 | 137 | 602 | 917 | 46.0000 | 27.00000 | 1377 | 137 | 602 | 917 | 175 | 153.0000 |
| 70 | 1044 | 209 | 38 | 96 | 451 | 922 | 43.0000 | 30.00000 | 1396 | 97 | 454 | 928 | 164 | 156.0000 |
| 82 | 982 | 186 | 27 | 102 | 472 | 920 | 49.0000 | 39.00000 | 1297 | 102 | 472 | 920 | 138 | 168.0000 |
| 75 | 951 | 200 | 36 | 92 | 443 | 973 | 107.0000 | 59.00000 | 1279 | 92 | 443 | 973 | 123 | 149.0000 |
| 80 | 889 | 179 | 54 | 122 | 525 | 1062 | 80.0000 | 54.00000 | 1244 | 122 | 525 | 1062 | 136 | 186.0000 |
| 85 | 950 | 171 | 37 | 115 | 456 | 1027 | 40.0000 | 36.00000 | 1281 | 116 | 459 | 1033 | 112 | 136.0000 |
| 86 | 1040 | 197 | 40 | 114 | 447 | 922 | 69.0000 | 27.00000 | 1391 | 114 | 447 | 922 | 127 | 169.0000 |
| 76 | 944 | 213 | 18 | 96 | 441 | 827 | 72.0000 | 34.00000 | 1271 | 96 | 441 | 827 | 131 | 159.0000 |
| 78 | 1017 | 179 | 27 | 82 | 374 | 888 | 60.0000 | 39.00000 | 1364 | 86 | 391 | 928 | 119 | 141.0000 |
| 68 | 1043 | 203 | 31 | 95 | 509 | 801 | 119.0000 | 79.00000 | 1372 | 95 | 509 | 801 | 147 | 150.0000 |
| 72 | 1040 | 196 | 41 | 55 | 597 | 816 | 221.0000 | 109.00000 | 1340 | 55 | 601 | 821 | 185 | 165.0000 |
| 76 | 969 | 210 | 23 | 63 | 534 | 812 | 126.0000 | 80.00000 | 1265 | 63 | 534 | 812 | 150 | 139.0000 |
| 74 | 976 | 233 | 40 | 131 | 542 | 880 | 159.0000 | 89.00000 | 1380 | 131 | 542 | 880 | 147 | 137.0000 |
| 87 | 1055 | 226 | 28 | 108 | 539 | 682 | 86.0000 | 69.00000 | 1417 | 108 | 539 | 682 | 136 | 136.0000 |
| 88 | 1114 | 242 | 43 | 164 | 589 | 843 | 100.0000 | 53.00000 | 1563 | 164 | 589 | 843 | 135 | 172.0000 |
| 66 | 1082 | 239 | 32 | 107 | 546 | 900 | 92.0000 | 64.00000 | 1478 | 108 | 553 | 911 | 136 | 146.0000 |
| 75 | 1026 | 197 | 24 | 143 | 579 | 841 | 65.0000 | 49.00000 | 2047 | 211 | 853 | 1239 | 149 | 177.0000 |
| 93 | 1038 | 268 | 26 | 186 | 613 | 760 | 55.0000 | 53.00000 | 1518 | 186 | 613 | 760 | 106 | 171.0000 |
Regression model using the lm() function
bb.lm <- lm(TARGET_WINS ~ ., data = bb.df)
summary(bb.lm)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = bb.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.668 -8.486 0.097 8.273 61.574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.2580952 5.7825918 2.293 0.02195 *
## TEAM_BATTING_1B 0.0498149 0.0037128 13.417 < 2e-16 ***
## TEAM_BATTING_2B 0.0362604 0.0073414 4.939 8.42e-07 ***
## TEAM_BATTING_3B 0.0835683 0.0162111 5.155 2.76e-07 ***
## TEAM_BATTING_HR 0.1236577 0.0274046 4.512 6.74e-06 ***
## TEAM_BATTING_BB 0.0093864 0.0057680 1.627 0.10380
## TEAM_BATTING_SO -0.0107378 0.0025854 -4.153 3.40e-05 ***
## TEAM_BASERUN_SB 0.0089533 0.0054808 1.634 0.10249
## TEAM_BASERUN_CS 0.0907402 0.0164281 5.523 3.70e-08 ***
## TEAM_PITCHING_H -0.0004599 0.0003669 -1.254 0.21012
## TEAM_PITCHING_HR 0.0136050 0.0241301 0.564 0.57293
## TEAM_PITCHING_BB 0.0029975 0.0041308 0.726 0.46814
## TEAM_PITCHING_SO 0.0023862 0.0009146 2.609 0.00914 **
## TEAM_FIELDING_E -0.0223547 0.0023865 -9.367 < 2e-16 ***
## TEAM_FIELDING_DP -0.1005829 0.0139331 -7.219 7.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.94 on 2261 degrees of freedom
## Multiple R-squared: 0.3291, Adjusted R-squared: 0.325
## F-statistic: 79.23 on 14 and 2261 DF, p-value: < 2.2e-16
The model suggests that the variables TEAM_BATTING_BB, TEAM_BASERUN_SB, TEAM_PITCHING_H, TEAM_PITCHING_HR, and TEAM_PITCHING_BB are not statistically significant, as their p-values exceed the 0.05 critical value.
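The same list can be pulled straight from the summary object; a minimal sketch:
#Names of coefficients with p-values above the 0.05 critical value
coefTab <- summary(bb.lm)$coefficients
rownames(coefTab)[coefTab[, "Pr(>|t|)"] > 0.05]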
The regression function in matrix notation:
\[\left[\begin{array}{cc}y_1\\y_2\\y_3\\.\\.\\y_n\end{array}\right]=\left[\begin{array}{cc}1 & x_{11} & x_{12} & x_{13} & .& .& x_{1p}\\1 & x_{21} & x_{22} & x_{23} & .& .& x_{2p} \\1 & x_{31} & x_{32} & x_{33} & .& .& x_{3p} \\. & . & . & . & .& .& .\\. & . & . & . & .& .& . \\1 & x_{n1} & x_{n2} & x_{n3} & .& .& x_{np}\end{array}\right]\cdot\left[\begin{array}{cc}\beta_0\\\beta_1\\\beta_2\\.\\.\\\beta_p\end{array}\right] + \left[\begin{array}{cc}\epsilon_1\\\epsilon_2\\\epsilon_3\\.\\.\\\epsilon_n\end{array}\right]\]
Let, \[\mathbf{X} = \left[\begin{array}{cc}1 & x_{11} & x_{12} & x_{13} & .& .& x_{1p}\\1 & x_{21} & x_{22} & x_{23} & .& .& x_{2p} \\1 & x_{31} & x_{32} & x_{33} & .& .& x_{3p} \\. & . & . & . & .& .& .\\. & . & . & . & .& .& . \\1 & x_{n1} & x_{n2} & x_{n3} & .& .& x_{np}\end{array}\right], \mathbf{y} = \left[\begin{array}{cc}y_1\\y_2\\y_3\\.\\.\\y_n\end{array}\right], \beta = \left[\begin{array}{cc}\beta_0\\\beta_1\\\beta_2\\.\\.\\\beta_p\end{array}\right], \epsilon = \left[\begin{array}{cc}\epsilon_1\\\epsilon_2\\\epsilon_3\\.\\.\\\epsilon_n\end{array}\right]\]
General regression equation, \[\mathbf{y} = \mathbf{X}\cdot\beta + \epsilon\]
Minimizing the sum of squared residuals \(\epsilon^T \epsilon\) yields the normal equations \(\mathbf{X}^T \mathbf{X} \cdot \beta = \mathbf{X}^T \mathbf{y}\); solving for \(\beta\) gives the least-squares estimate,
\[\hat{\mathbf{\beta}} = \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}\]
where \(\mathbf{X}^T\) is the transpose of \(\mathbf{X}\). In R, the transpose is computed with the t() function and the inverse (\(^{-1}\)) with the solve() function.
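For readers new to these two base R functions, a toy illustration on a small 2x2 matrix:
A <- matrix(c(2, 1, 1, 3), nrow = 2)
t(A) #transpose of A
solve(A) #inverse of A
round(solve(A) %*% A) #recovers the 2x2 identity matrix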
The variance of the residuals, or mean squared error (\({\hat\sigma}^{2}\)), is
\[{\hat\sigma}^{2} = \frac{(\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})^T \cdot (\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})}{n - p - 1}\] where \(\mathbf{X} \cdot \hat{\mathbf{\beta}} = \hat{\mathbf{y}}\) is the vector of fitted values, \(n\) is the number of observations, and \(p\) is the number of predictors.
The estimated covariance matrix of the coefficient estimates is
\[\widehat{\text{var}}(\hat{\mathbf{\beta}}) = \hat{\sigma}^2 \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1}\]
The square roots of the diagonal entries of \(\widehat{\text{var}}(\hat{\mathbf{\beta}})\) are the standard errors of the coefficients.
The variable TARGET_WINS is the response (\(\mathbf{y}\)); the remaining variables are the predictors. The following code creates the vector y and the design matrix X.
#Extract 'y' values into a vector
yV <- as.matrix(bb.df)[, 1]
#Extract 'x' values into a matrix, prepending a column of 1s for the intercept
xM <- cbind(constant = 1, as.matrix(bb.df)[, -1])
Calculate the beta values, i.e. the coefficient estimates.
#Transpose of the X matrix
xMT <- t(xM)
#Inverse of X-transpose times X
xInv <- solve(xMT %*% xM)
#Beta values
betaVal <- xInv %*% xMT %*% yV
#Display estimates
df <- data.frame(`lm-Function` = coef(bb.lm), `Manual-Calc`= betaVal)
df %>%
kable("html",caption = "Beta Values") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| | lm.Function | Manual.Calc |
|---|---|---|
| (Intercept) | 13.2580952 | 13.2580952 |
| TEAM_BATTING_1B | 0.0498149 | 0.0498149 |
| TEAM_BATTING_2B | 0.0362604 | 0.0362604 |
| TEAM_BATTING_3B | 0.0835683 | 0.0835683 |
| TEAM_BATTING_HR | 0.1236577 | 0.1236577 |
| TEAM_BATTING_BB | 0.0093864 | 0.0093864 |
| TEAM_BATTING_SO | -0.0107378 | -0.0107378 |
| TEAM_BASERUN_SB | 0.0089533 | 0.0089533 |
| TEAM_BASERUN_CS | 0.0907402 | 0.0907402 |
| TEAM_PITCHING_H | -0.0004599 | -0.0004599 |
| TEAM_PITCHING_HR | 0.0136050 | 0.0136050 |
| TEAM_PITCHING_BB | 0.0029975 | 0.0029975 |
| TEAM_PITCHING_SO | 0.0023862 | 0.0023862 |
| TEAM_FIELDING_E | -0.0223547 | -0.0223547 |
| TEAM_FIELDING_DP | -0.1005829 | -0.1005829 |
There is no difference between the values obtained from the lm() function and the manual calculation.
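As an aside, lm() does not actually compute \(\left(\mathbf{X}^T \cdot \mathbf{X}\right)^{-1}\) explicitly; it solves the least-squares problem through a QR decomposition, which is numerically more stable. The same estimates can be obtained with base R's qr.solve(); a minimal sketch:
#Least-squares solution via QR decomposition
betaQR <- qr.solve(xM, yV)
#Should agree with the explicit-inverse calculation up to floating-point error
all.equal(as.numeric(betaQR), as.numeric(betaVal))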
Now let’s compute the standard errors.
#Get predicted values of y: X times the beta vector
predY <- xM %*% betaVal
#Get residuals, or epsilon values (y - y hat)
residM <- yV - predY
#Transpose of the residuals matrix
residMt <- t(residM)
#n = number of observations, p = number of predictors
n <- nrow(bb.df)
p <- ncol(bb.df) - 1
#Variance of the residuals: residual sum of squares divided by (n - p - 1)
varResid <- (residMt %*% residM) / (n - p - 1)
#Residual standard error: square root of the residual variance
#as.numeric() drops the 1x1 matrix dimensions so it multiplies cleanly below
resdSE <- as.numeric(sqrt(varResid))
#Diagonal of the inverse of X-transpose times X
diagXM <- diag(xInv)
#Standard errors of the beta values: sigma hat times the square roots of the diagonal
betaSE <- resdSE * sqrt(diagXM)
#Display estimates
df <- data.frame(`lm-Function` = summary(bb.lm)[["coefficients"]][, "Std. Error"], `Manual-Calc`= betaSE)
df %>%
kable("html",caption = "Standard Error Of Estimates") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| | lm.Function | Manual.Calc |
|---|---|---|
| (Intercept) | 5.7825918 | 5.7825918 |
| TEAM_BATTING_1B | 0.0037128 | 0.0037128 |
| TEAM_BATTING_2B | 0.0073414 | 0.0073414 |
| TEAM_BATTING_3B | 0.0162111 | 0.0162111 |
| TEAM_BATTING_HR | 0.0274046 | 0.0274046 |
| TEAM_BATTING_BB | 0.0057680 | 0.0057680 |
| TEAM_BATTING_SO | 0.0025854 | 0.0025854 |
| TEAM_BASERUN_SB | 0.0054808 | 0.0054808 |
| TEAM_BASERUN_CS | 0.0164281 | 0.0164281 |
| TEAM_PITCHING_H | 0.0003669 | 0.0003669 |
| TEAM_PITCHING_HR | 0.0241301 | 0.0241301 |
| TEAM_PITCHING_BB | 0.0041308 | 0.0041308 |
| TEAM_PITCHING_SO | 0.0009146 | 0.0009146 |
| TEAM_FIELDING_E | 0.0023865 | 0.0023865 |
| TEAM_FIELDING_DP | 0.0139331 | 0.0139331 |
There is no difference between the standard errors from the lm() function and the manual calculation.
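The intermediate quantities can also be cross-checked against lm()'s accessor functions; a minimal sketch:
#Manually computed fitted values and residuals should match lm()'s
all.equal(as.numeric(predY), as.numeric(fitted(bb.lm)))
all.equal(as.numeric(residM), as.numeric(residuals(bb.lm)))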
The next two columns of the lm() output are the t-values and p-values.
#t-value: beta value divided by the standard error of the estimate
tValue <- betaVal/betaSE
#Two-sided p-value from the t distribution with n - p - 1 residual degrees of freedom
pValue <- 2*pt(abs(tValue), df = n - p - 1, lower.tail = FALSE)
df <- data.frame(`lm-Function-t-value` = summary(bb.lm)[["coefficients"]][, "t value"], `Manual-Calc-t-value`=tValue, `lm-Function-p-value` = summary(bb.lm)[["coefficients"]][, "Pr(>|t|)"], `Manual-Calc-p-value` = pValue)
df %>%
kable("html",caption = "t-value And p-value") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| | lm.Function.t.value | Manual.Calc.t.value | lm.Function.p.value | Manual.Calc.p.value |
|---|---|---|---|---|
| (Intercept) | 2.2927600 | 2.2927600 | 0.0219533 | 0.0219533 |
| TEAM_BATTING_1B | 13.4169647 | 13.4169647 | 0.0000000 | 0.0000000 |
| TEAM_BATTING_2B | 4.9391451 | 4.9391451 | 0.0000008 | 0.0000008 |
| TEAM_BATTING_3B | 5.1549957 | 5.1549957 | 0.0000003 | 0.0000003 |
| TEAM_BATTING_HR | 4.5122994 | 4.5122994 | 0.0000067 | 0.0000067 |
| TEAM_BATTING_BB | 1.6273425 | 1.6273425 | 0.1038037 | 0.1038037 |
| TEAM_BATTING_SO | -4.1533315 | -4.1533315 | 0.0000340 | 0.0000340 |
| TEAM_BASERUN_SB | 1.6335772 | 1.6335772 | 0.1024869 | 0.1024869 |
| TEAM_BASERUN_CS | 5.5234878 | 5.5234878 | 0.0000000 | 0.0000000 |
| TEAM_PITCHING_H | -1.2535798 | -1.2535798 | 0.2101244 | 0.2101244 |
| TEAM_PITCHING_HR | 0.5638171 | 0.5638171 | 0.5729345 | 0.5729345 |
| TEAM_PITCHING_BB | 0.7256280 | 0.7256280 | 0.4681419 | 0.4681419 |
| TEAM_PITCHING_SO | 2.6090739 | 2.6090739 | 0.0091386 | 0.0091386 |
| TEAM_FIELDING_E | -9.3670760 | -9.3670760 | 0.0000000 | 0.0000000 |
| TEAM_FIELDING_DP | -7.2190003 | -7.2190003 | 0.0000000 | 0.0000000 |
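As an aside, 95% confidence intervals follow from the same estimates and standard errors via qt(); a minimal sketch, which should agree with confint(bb.lm):
#Critical t value at the 0.05 significance level
tCrit <- qt(0.975, df = n - p - 1)
#Lower and upper bounds of the 95% confidence intervals
ciLower <- betaVal - tCrit * betaSE
ciUpper <- betaVal + tCrit * betaSE
head(cbind(ciLower, ciUpper))
#Compare with lm()'s built-in intervals
head(confint(bb.lm))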
Next, compare the residual standard error from lm() with the manual value.
df <- data.frame(`lm-Function` = sigma(bb.lm), `Manual-Calc`= resdSE)
df %>%
kable("html",caption = "Residual standard error") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| lm.Function | Manual.Calc |
|---|---|
| 12.94206 | 12.94206 |
According to stats.stackexchange.com, there are many ways to obtain the adjusted R². The formula used by the lm() function is Wherry’s formula, \[\bar{R}^{2} = 1 - (1 - R^2)\frac{n-1}{n-p-1},\] whereas the calculation below uses McNemar’s formula, \[\bar{R}^{2} = 1 - (1 - R^2)\frac{n-1}{n-p},\] which explains the small difference in the adjusted R² values in the table that follows. The F-statistic is \[F = \frac{SS_{reg}/p}{SS_{res}/(n-p-1)}\]
#Residual sum of squares (errors)
resSS <- sum((yV - predY) ^ 2)
#Regression sum of squares
regSS <- sum((predY - mean(yV)) ^ 2)
#Total sum of squares
totalSS <- sum((yV - mean(yV)) ^ 2)
#R-squared
rSq <- 1 - (resSS/totalSS)
#McNemar's formula for adjusted R-squared; note lm() divides by (n - p - 1) instead of (n - p)
adjRSq <- 1 - (1 - rSq)*((n-1)/(n-p))
#F-statistic; 'p' (not 'p - 1') is the numerator df because 1 was already subtracted when declaring 'p'
FStat <- (regSS/(p))/(resSS/(n-p-1))
df <- data.frame(`lm-Function-r-sq.` = summary(bb.lm)$r.squared, `Manual-Calc-r-sq`= rSq, `lm-Function-adj.r-sq.` = summary(bb.lm)$adj.r.squared, `Manual-Calc-adj.r-sq`= adjRSq,`lm-Function-F-statistic` = summary(bb.lm)$fstatistic[1],`Manual-Calc-F-statistic`= FStat)
df %>%
kable("html",caption = "R-Square And Adjusted R-Square") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| | lm.Function.r.sq. | Manual.Calc.r.sq | lm.Function.adj.r.sq. | Manual.Calc.adj.r.sq | lm.Function.F.statistic | Manual.Calc.F.statistic |
|---|---|---|---|---|---|---|
| value | 0.329118 | 0.329118 | 0.3249639 | 0.3252624 | 79.22788 | 79.22788 |
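The p-value that lm() reports alongside the F-statistic can be reproduced with pf(); a minimal sketch:
#p-value of the overall F-test with p and (n - p - 1) degrees of freedom
pf(FStat, p, n - p - 1, lower.tail = FALSE)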
In conclusion, there is no difference between the results obtained with the lm() function and the manual matrix calculation, apart from the adjusted R², where McNemar’s formula was used in place of lm()’s. The matrix method provides insight into how the estimates, standard errors, and residuals are obtained; however, the lm() function is much faster at generating the output.