In this post, I will show how the results of the lm() function can be reproduced using matrix algebra. The dataset I will be using is Moneyball. The variable TEAM_BATTING_HBP is excluded from the analysis, and missing values are filled in using k-NN imputation.
#options(scipen=999)
library(dplyr)
library(DMwR)
library(knitr) # Report display, table format
library(kableExtra)
bb.df <- read.csv("https://raw.githubusercontent.com/akulapa/Data621-Week05-Discussion/master/moneyball-training-data.csv", header= TRUE, stringsAsFactors = F)
#Singles = total hits minus doubles, triples, and home runs
bb.df$TEAM_BATTING_1B = bb.df$TEAM_BATTING_H - bb.df$TEAM_BATTING_2B - bb.df$TEAM_BATTING_3B - bb.df$TEAM_BATTING_HR
#Drop TEAM_BATTING_HBP; keep the response and 14 predictors
bb.df <- bb.df %>%
select(TARGET_WINS,TEAM_BATTING_1B,TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BATTING_HR,TEAM_BATTING_BB,
TEAM_BATTING_SO,TEAM_BASERUN_SB,TEAM_BASERUN_CS,TEAM_PITCHING_H,TEAM_PITCHING_HR,
TEAM_PITCHING_BB,TEAM_PITCHING_SO,TEAM_FIELDING_E,TEAM_FIELDING_DP)
#Fill in missing values with weighted-average 15-nearest-neighbor imputation
bb.df <- knnImputation(bb.df, 15, meth='weighAvg')
#Make sure every column is stored as numeric
bb.df[] <- lapply(bb.df, function(col) as.numeric(as.character(col)))
head(bb.df,20) %>%
kable("html",caption = "Sample Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
scroll_box(width = "100%", height = "200px")
TARGET_WINS | TEAM_BATTING_1B | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | 1199 | 194 | 39 | 13 | 143 | 842 | 116.1351 | 61.54372 | 9364 | 84 | 927 | 5456 | 1011 | 155.9764 |
70 | 908 | 219 | 22 | 190 | 685 | 1075 | 37.0000 | 28.00000 | 1347 | 191 | 689 | 1082 | 193 | 155.0000 |
86 | 973 | 232 | 35 | 137 | 602 | 917 | 46.0000 | 27.00000 | 1377 | 137 | 602 | 917 | 175 | 153.0000 |
70 | 1044 | 209 | 38 | 96 | 451 | 922 | 43.0000 | 30.00000 | 1396 | 97 | 454 | 928 | 164 | 156.0000 |
82 | 982 | 186 | 27 | 102 | 472 | 920 | 49.0000 | 39.00000 | 1297 | 102 | 472 | 920 | 138 | 168.0000 |
75 | 951 | 200 | 36 | 92 | 443 | 973 | 107.0000 | 59.00000 | 1279 | 92 | 443 | 973 | 123 | 149.0000 |
80 | 889 | 179 | 54 | 122 | 525 | 1062 | 80.0000 | 54.00000 | 1244 | 122 | 525 | 1062 | 136 | 186.0000 |
85 | 950 | 171 | 37 | 115 | 456 | 1027 | 40.0000 | 36.00000 | 1281 | 116 | 459 | 1033 | 112 | 136.0000 |
86 | 1040 | 197 | 40 | 114 | 447 | 922 | 69.0000 | 27.00000 | 1391 | 114 | 447 | 922 | 127 | 169.0000 |
76 | 944 | 213 | 18 | 96 | 441 | 827 | 72.0000 | 34.00000 | 1271 | 96 | 441 | 827 | 131 | 159.0000 |
78 | 1017 | 179 | 27 | 82 | 374 | 888 | 60.0000 | 39.00000 | 1364 | 86 | 391 | 928 | 119 | 141.0000 |
68 | 1043 | 203 | 31 | 95 | 509 | 801 | 119.0000 | 79.00000 | 1372 | 95 | 509 | 801 | 147 | 150.0000 |
72 | 1040 | 196 | 41 | 55 | 597 | 816 | 221.0000 | 109.00000 | 1340 | 55 | 601 | 821 | 185 | 165.0000 |
76 | 969 | 210 | 23 | 63 | 534 | 812 | 126.0000 | 80.00000 | 1265 | 63 | 534 | 812 | 150 | 139.0000 |
74 | 976 | 233 | 40 | 131 | 542 | 880 | 159.0000 | 89.00000 | 1380 | 131 | 542 | 880 | 147 | 137.0000 |
87 | 1055 | 226 | 28 | 108 | 539 | 682 | 86.0000 | 69.00000 | 1417 | 108 | 539 | 682 | 136 | 136.0000 |
88 | 1114 | 242 | 43 | 164 | 589 | 843 | 100.0000 | 53.00000 | 1563 | 164 | 589 | 843 | 135 | 172.0000 |
66 | 1082 | 239 | 32 | 107 | 546 | 900 | 92.0000 | 64.00000 | 1478 | 108 | 553 | 911 | 136 | 146.0000 |
75 | 1026 | 197 | 24 | 143 | 579 | 841 | 65.0000 | 49.00000 | 2047 | 211 | 853 | 1239 | 149 | 177.0000 |
93 | 1038 | 268 | 26 | 186 | 613 | 760 | 55.0000 | 53.00000 | 1518 | 186 | 613 | 760 | 106 | 171.0000 |
Regression model using the lm() function
bb.lm <- lm(TARGET_WINS ~ ., data = bb.df)
summary(bb.lm)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = bb.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.668 -8.486 0.097 8.273 61.574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.2580952 5.7825918 2.293 0.02195 *
## TEAM_BATTING_1B 0.0498149 0.0037128 13.417 < 2e-16 ***
## TEAM_BATTING_2B 0.0362604 0.0073414 4.939 8.42e-07 ***
## TEAM_BATTING_3B 0.0835683 0.0162111 5.155 2.76e-07 ***
## TEAM_BATTING_HR 0.1236577 0.0274046 4.512 6.74e-06 ***
## TEAM_BATTING_BB 0.0093864 0.0057680 1.627 0.10380
## TEAM_BATTING_SO -0.0107378 0.0025854 -4.153 3.40e-05 ***
## TEAM_BASERUN_SB 0.0089533 0.0054808 1.634 0.10249
## TEAM_BASERUN_CS 0.0907402 0.0164281 5.523 3.70e-08 ***
## TEAM_PITCHING_H -0.0004599 0.0003669 -1.254 0.21012
## TEAM_PITCHING_HR 0.0136050 0.0241301 0.564 0.57293
## TEAM_PITCHING_BB 0.0029975 0.0041308 0.726 0.46814
## TEAM_PITCHING_SO 0.0023862 0.0009146 2.609 0.00914 **
## TEAM_FIELDING_E -0.0223547 0.0023865 -9.367 < 2e-16 ***
## TEAM_FIELDING_DP -0.1005829 0.0139331 -7.219 7.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.94 on 2261 degrees of freedom
## Multiple R-squared: 0.3291, Adjusted R-squared: 0.325
## F-statistic: 79.23 on 14 and 2261 DF, p-value: < 2.2e-16
The model suggests that the variables TEAM_BATTING_BB, TEAM_BASERUN_SB, TEAM_PITCHING_H, TEAM_PITCHING_HR, and TEAM_PITCHING_BB are not statistically significant, as their p-values are greater than the 0.05 significance level.
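As a quick programmatic check (a one-line sketch using the summary object created above), the non-significant terms can be listed directly:
#Names of coefficients with p-value above 0.05
names(which(summary(bb.lm)$coefficients[, "Pr(>|t|)"] > 0.05))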
The regression function in matrix notation:
\[\left[\begin{array}{cc}y_1\\y_2\\y_3\\.\\.\\y_n\end{array}\right]=\left[\begin{array}{cc}1 & x_{11} & x_{12} & x_{13} & .& .& x_{1p}\\1 & x_{21} & x_{22} & x_{23} & .& .& x_{2p} \\1 & x_{31} & x_{32} & x_{33} & .& .& x_{3p} \\. & . & . & . & .& .& .\\. & . & . & . & .& .& . \\1 & x_{n1} & x_{n2} & x_{n3} & .& .& x_{np}\end{array}\right]\cdot\left[\begin{array}{cc}\beta_0\\\beta_1\\\beta_2\\.\\.\\\beta_p\end{array}\right] + \left[\begin{array}{cc}\epsilon_1\\\epsilon_2\\\epsilon_3\\.\\.\\\epsilon_n\end{array}\right]\]
Let, \[\mathbf{X} = \left[\begin{array}{cc}1 & x_{11} & x_{12} & x_{13} & .& .& x_{1p}\\1 & x_{21} & x_{22} & x_{23} & .& .& x_{2p} \\1 & x_{31} & x_{32} & x_{33} & .& .& x_{3p} \\. & . & . & . & .& .& .\\. & . & . & . & .& .& . \\1 & x_{n1} & x_{n2} & x_{n3} & .& .& x_{np}\end{array}\right], \mathbf{y} = \left[\begin{array}{cc}y_1\\y_2\\y_3\\.\\.\\y_n\end{array}\right], \beta = \left[\begin{array}{cc}\beta_0\\\beta_1\\\beta_2\\.\\.\\\beta_p\end{array}\right], \epsilon = \left[\begin{array}{cc}\epsilon_1\\\epsilon_2\\\epsilon_3\\.\\.\\\epsilon_n\end{array}\right]\]
General regression equation, \[\mathbf{y} = \mathbf{X}\cdot\beta + \epsilon\]
Minimizing the residual sum of squares \(\epsilon^T \cdot \epsilon\) gives the least-squares estimate,
\[\hat{\mathbf{\beta}} = \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}\]
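This follows from setting the gradient of the residual sum of squares to zero, which yields the normal equations:
\[\mathbf{X}^T \cdot (\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}}) = 0 \implies \mathbf{X}^T \cdot \mathbf{X} \cdot \hat{\mathbf{\beta}} = \mathbf{X}^T \cdot \mathbf{y}\]
Solving for \(\hat{\mathbf{\beta}}\), assuming \(\mathbf{X}^T \cdot \mathbf{X}\) is invertible, gives the expression above.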
Here \(\mathbf{X}^T\) is the transpose of \(\mathbf{X}\), calculated with the t() function, and the inverse (\(^{-1}\)) is obtained with the solve() function.
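As a minimal illustration of these two functions (a toy 2x2 example, unrelated to the Moneyball data):
#Toy matrix
m <- matrix(c(2, 1, 1, 3), nrow = 2)
#Transpose
t(m)
#Inverse
solve(m)
#Product of inverse and original recovers the identity matrix (up to rounding)
solve(m) %*% m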
Variance of the residuals, or mean squared error (\({\hat\sigma}^{2}\)):
\[{\hat\sigma}^{2} = \frac{(\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})^T \cdot (\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})}{n - p - 1}\] Where \(\mathbf{X} \cdot \hat{\mathbf{\beta}} = \hat{\mathbf{y}}\)
The estimated covariance matrix of the coefficient estimates is
\[\widehat{\text{var}}(\hat{\mathbf{\beta}}) = \hat{\sigma}^2 \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1}\]
The square roots of the diagonal entries of \(\widehat{\text{var}}(\hat{\mathbf{\beta}})\) are the standard errors of the coefficients.
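As a cross-check against R itself, the same covariance matrix is available from the fitted model through the vcov() function; a one-line sketch:
#Standard errors straight from lm's estimated covariance matrix
sqrt(diag(vcov(bb.lm)))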
The variable TARGET_WINS is the response (\(\mathbf{y}\)) and the remaining variables are the predictors. The following code creates the vector y and the matrix x.
#Extract 'y' values into a vector
yV <- as.matrix(bb.df)[, 1]
#Extract 'x' values into a matrix, adding a leading column of 1s for the intercept
xM <- cbind(constant = 1, as.matrix(bb.df)[, -1])
Calculate the beta values, or estimates.
#Transpose of x matrix
xMT <- t(xM)
#Inverse of (x transpose times x)
xInv <- solve(xMT %*% xM)
#Beta values
betaVal <- xInv %*% xMT %*% yV
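#Note: an equivalent and more numerically stable approach is to solve the
#normal equations X'X b = X'y directly, without forming the inverse explicitly
betaVal2 <- solve(crossprod(xM), crossprod(xM, yV))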
#Display estimates
df <- data.frame(`lm-Function` = coef(bb.lm), `Manual-Calc`= betaVal)
df %>%
kable("html",caption = "Beta Values") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
 | lm.Function | Manual.Calc |
---|---|---|
(Intercept) | 13.2580952 | 13.2580952 |
TEAM_BATTING_1B | 0.0498149 | 0.0498149 |
TEAM_BATTING_2B | 0.0362604 | 0.0362604 |
TEAM_BATTING_3B | 0.0835683 | 0.0835683 |
TEAM_BATTING_HR | 0.1236577 | 0.1236577 |
TEAM_BATTING_BB | 0.0093864 | 0.0093864 |
TEAM_BATTING_SO | -0.0107378 | -0.0107378 |
TEAM_BASERUN_SB | 0.0089533 | 0.0089533 |
TEAM_BASERUN_CS | 0.0907402 | 0.0907402 |
TEAM_PITCHING_H | -0.0004599 | -0.0004599 |
TEAM_PITCHING_HR | 0.0136050 | 0.0136050 |
TEAM_PITCHING_BB | 0.0029975 | 0.0029975 |
TEAM_PITCHING_SO | 0.0023862 | 0.0023862 |
TEAM_FIELDING_E | -0.0223547 | -0.0223547 |
TEAM_FIELDING_DP | -0.1005829 | -0.1005829 |
There is no difference between the values obtained from the lm() function and the manual calculation. Now let's get the standard errors.
#Get predicted values of y: the matrix product of x and the beta values
predY <- as.vector(xM %*% betaVal)
#Get residuals, or epsilon values (y minus y hat)
residM <- yV - predY
#Transpose of residuals matrix
residMt <- t(residM)
n <- nrow(bb.df)
#Number of predictors (response column excluded)
p <- ncol(bb.df) - 1
#Variance of residuals: sum of squared residuals divided by (n - p - 1)
varResid <- (residMt %*% residM) / (n - p - 1)
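#Sanity check (optional): the manual residuals should match those from lm()
all.equal(as.vector(residM), unname(residuals(bb.lm)))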
#Residual standard error
#Sqrt is applied because varResid is a squared quantity;
#as.numeric converts the 1x1 matrix to a plain number
resdSE <- as.numeric(sqrt(varResid))
#Get the diagonal of the inverse of (x transpose times x)
diagXM <- diag(xInv)
#Standard errors of beta values
betaSE <- resdSE * sqrt(diagXM)
#Display estimates
df <- data.frame(`lm-Function` = summary(bb.lm)[["coefficients"]][, "Std. Error"], `Manual-Calc`= betaSE)
df %>%
kable("html",caption = "Standard Error Of Estimates") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
 | lm.Function | Manual.Calc |
---|---|---|
(Intercept) | 5.7825918 | 5.7825918 |
TEAM_BATTING_1B | 0.0037128 | 0.0037128 |
TEAM_BATTING_2B | 0.0073414 | 0.0073414 |
TEAM_BATTING_3B | 0.0162111 | 0.0162111 |
TEAM_BATTING_HR | 0.0274046 | 0.0274046 |
TEAM_BATTING_BB | 0.0057680 | 0.0057680 |
TEAM_BATTING_SO | 0.0025854 | 0.0025854 |
TEAM_BASERUN_SB | 0.0054808 | 0.0054808 |
TEAM_BASERUN_CS | 0.0164281 | 0.0164281 |
TEAM_PITCHING_H | 0.0003669 | 0.0003669 |
TEAM_PITCHING_HR | 0.0241301 | 0.0241301 |
TEAM_PITCHING_BB | 0.0041308 | 0.0041308 |
TEAM_PITCHING_SO | 0.0009146 | 0.0009146 |
TEAM_FIELDING_E | 0.0023865 | 0.0023865 |
TEAM_FIELDING_DP | 0.0139331 | 0.0139331 |
There is no difference between the standard errors from the lm() function and the manual calculation.
t-values and p-values
The next two columns of the lm() summary output are the t-value and the p-value.
#t-value: beta value divided by the standard error of the estimate
tValue <- betaVal/betaSE
#p-value: two-tailed probability from the t distribution with n - p - 1
#degrees of freedom, matching the residual degrees of freedom used by lm()
pValue <- 2*pt(abs(tValue), df = n - p - 1, lower.tail = FALSE)
df <- data.frame(`lm-Function-t-value` = summary(bb.lm)[["coefficients"]][, "t value"], `Manual-Calc-t-value`=tValue, `lm-Function-p-value` = summary(bb.lm)[["coefficients"]][, "Pr(>|t|)"], `Manual-Calc-p-value` = pValue)
df %>%
kable("html",caption = "t-value And p-value") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
 | lm.Function.t.value | Manual.Calc.t.value | lm.Function.p.value | Manual.Calc.p.value |
---|---|---|---|---|
(Intercept) | 2.2927600 | 2.2927600 | 0.0219533 | 0.0219533 |
TEAM_BATTING_1B | 13.4169647 | 13.4169647 | 0.0000000 | 0.0000000 |
TEAM_BATTING_2B | 4.9391451 | 4.9391451 | 0.0000008 | 0.0000008 |
TEAM_BATTING_3B | 5.1549957 | 5.1549957 | 0.0000003 | 0.0000003 |
TEAM_BATTING_HR | 4.5122994 | 4.5122994 | 0.0000067 | 0.0000067 |
TEAM_BATTING_BB | 1.6273425 | 1.6273425 | 0.1038037 | 0.1038037 |
TEAM_BATTING_SO | -4.1533315 | -4.1533315 | 0.0000340 | 0.0000340 |
TEAM_BASERUN_SB | 1.6335772 | 1.6335772 | 0.1024869 | 0.1024869 |
TEAM_BASERUN_CS | 5.5234878 | 5.5234878 | 0.0000000 | 0.0000000 |
TEAM_PITCHING_H | -1.2535798 | -1.2535798 | 0.2101244 | 0.2101244 |
TEAM_PITCHING_HR | 0.5638171 | 0.5638171 | 0.5729345 | 0.5729345 |
TEAM_PITCHING_BB | 0.7256280 | 0.7256280 | 0.4681419 | 0.4681419 |
TEAM_PITCHING_SO | 2.6090739 | 2.6090739 | 0.0091386 | 0.0091386 |
TEAM_FIELDING_E | -9.3670760 | -9.3670760 | 0.0000000 | 0.0000000 |
TEAM_FIELDING_DP | -7.2190003 | -7.2190003 | 0.0000000 | 0.0000000 |
Next, compare the residual standard error with the value reported by lm().
df <- data.frame(`lm-Function` = sigma(bb.lm), `Manual-Calc`= resdSE)
df %>%
kable("html",caption = "Residual standard error") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
lm.Function | Manual.Calc |
---|---|
12.94206 | 12.94206 |
According to stats.stackexchange.com, there are many ways to obtain the adjusted R-squared. The formula used by the lm() function (attributed there as McNemar's formula) is
\[R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\]
where n is the number of observations and p is the number of predictors. The formula for the F-statistic is
\[F = \frac{SS_{reg}/p}{SS_{res}/(n-p-1)}\]
#Residual sum of squares (errors)
resSS <- sum((yV - predY) ^ 2)
#Regression sum of squares
regSS <- sum((predY - mean(yV)) ^ 2)
#Total sum of squares
totalSS <- sum((yV - mean(yV)) ^ 2)
rSq <- 1 - (resSS/totalSS)
#Adjusted R-squared; the denominator is n - p - 1 because 'p' counts only
#the predictors, excluding the intercept
adjRSq <- 1 - (1 - rSq)*((n - 1)/(n - p - 1))
#F-statistic with p and (n - p - 1) degrees of freedom
FStat <- (regSS/p)/(resSS/(n - p - 1))
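#The model summary also reports a p-value for the F-statistic; it can be
#reproduced with pf(), using p and (n - p - 1) degrees of freedom
fPValue <- pf(FStat, p, n - p - 1, lower.tail = FALSE)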
df <- data.frame(`lm-Function-r-sq.` = summary(bb.lm)$r.squared, `Manual-Calc-r-sq`= rSq, `lm-Function-adj.r-sq.` = summary(bb.lm)$adj.r.squared, `Manual-Calc-adj.r-sq`= adjRSq,`lm-Function-F-statistic` = summary(bb.lm)$fstatistic[1],`Manual-Calc-F-statistic`= FStat)
df %>%
kable("html",caption = "R-Square And Adjusted R-Square") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
 | lm.Function.r.sq. | Manual.Calc.r.sq | lm.Function.adj.r.sq. | Manual.Calc.adj.r.sq | lm.Function.F.statistic | Manual.Calc.F.statistic |
---|---|---|---|---|---|---|
value | 0.329118 | 0.329118 | 0.3249639 | 0.3249639 | 79.22788 | 79.22788 |
In conclusion, there is no difference between the results obtained using the lm() function and manual calculation. The matrix method provides insight into how the estimates, standard errors, and residuals are obtained; however, the lm() function is much faster at generating output.