In this post, I will be showing how results of lm() function can be obtained using matrix algebra. Dataset I will be using is Moneyball.

Variable TEAM_BATTING_HBP is not used for analysis, and missing values are filled in using k-NN imputation.

#options(scipen=999)
library(dplyr)
library(DMwR)
library(knitr)    # Report display, table format
library(kableExtra)

bb.df <- read.csv("https://raw.githubusercontent.com/akulapa/Data621-Week05-Discussion/master/moneyball-training-data.csv", header= TRUE, stringsAsFactors = F)
bb.df$TEAM_BATTING_1B = bb.df$TEAM_BATTING_H - bb.df$TEAM_BATTING_2B - bb.df$TEAM_BATTING_3B - bb.df$TEAM_BATTING_HR

bb.df <- bb.df %>% 
  select(TARGET_WINS,TEAM_BATTING_1B,TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BATTING_HR,TEAM_BATTING_BB,
TEAM_BATTING_SO,TEAM_BASERUN_SB,TEAM_BASERUN_CS,TEAM_PITCHING_H,TEAM_PITCHING_HR,
TEAM_PITCHING_BB,TEAM_PITCHING_SO,TEAM_FIELDING_E,TEAM_FIELDING_DP)

bb.df <- knnImputation(bb.df, 15, meth='weighAvg')

for(i in 1:ncol(bb.df)){
  bb.df[is.na(bb.df[,i]), i] <- as.numeric(as.character(bb.df[is.na(bb.df[,i]), i]))
}

head(bb.df,20) %>% 
  kable("html",caption = "Sample Data") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
  scroll_box(width = "100%", height = "200px")
Sample Data
TARGET_WINS TEAM_BATTING_1B TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
39 1199 194 39 13 143 842 116.1351 61.54372 9364 84 927 5456 1011 155.9764
70 908 219 22 190 685 1075 37.0000 28.00000 1347 191 689 1082 193 155.0000
86 973 232 35 137 602 917 46.0000 27.00000 1377 137 602 917 175 153.0000
70 1044 209 38 96 451 922 43.0000 30.00000 1396 97 454 928 164 156.0000
82 982 186 27 102 472 920 49.0000 39.00000 1297 102 472 920 138 168.0000
75 951 200 36 92 443 973 107.0000 59.00000 1279 92 443 973 123 149.0000
80 889 179 54 122 525 1062 80.0000 54.00000 1244 122 525 1062 136 186.0000
85 950 171 37 115 456 1027 40.0000 36.00000 1281 116 459 1033 112 136.0000
86 1040 197 40 114 447 922 69.0000 27.00000 1391 114 447 922 127 169.0000
76 944 213 18 96 441 827 72.0000 34.00000 1271 96 441 827 131 159.0000
78 1017 179 27 82 374 888 60.0000 39.00000 1364 86 391 928 119 141.0000
68 1043 203 31 95 509 801 119.0000 79.00000 1372 95 509 801 147 150.0000
72 1040 196 41 55 597 816 221.0000 109.00000 1340 55 601 821 185 165.0000
76 969 210 23 63 534 812 126.0000 80.00000 1265 63 534 812 150 139.0000
74 976 233 40 131 542 880 159.0000 89.00000 1380 131 542 880 147 137.0000
87 1055 226 28 108 539 682 86.0000 69.00000 1417 108 539 682 136 136.0000
88 1114 242 43 164 589 843 100.0000 53.00000 1563 164 589 843 135 172.0000
66 1082 239 32 107 546 900 92.0000 64.00000 1478 108 553 911 136 146.0000
75 1026 197 24 143 579 841 65.0000 49.00000 2047 211 853 1239 149 177.0000
93 1038 268 26 186 613 760 55.0000 53.00000 1518 186 613 760 106 171.0000

Regression model using lm() function

bb.lm <- lm(TARGET_WINS ~ ., data = bb.df)

summary(bb.lm)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = bb.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.668  -8.486   0.097   8.273  61.574 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      13.2580952  5.7825918   2.293  0.02195 *  
## TEAM_BATTING_1B   0.0498149  0.0037128  13.417  < 2e-16 ***
## TEAM_BATTING_2B   0.0362604  0.0073414   4.939 8.42e-07 ***
## TEAM_BATTING_3B   0.0835683  0.0162111   5.155 2.76e-07 ***
## TEAM_BATTING_HR   0.1236577  0.0274046   4.512 6.74e-06 ***
## TEAM_BATTING_BB   0.0093864  0.0057680   1.627  0.10380    
## TEAM_BATTING_SO  -0.0107378  0.0025854  -4.153 3.40e-05 ***
## TEAM_BASERUN_SB   0.0089533  0.0054808   1.634  0.10249    
## TEAM_BASERUN_CS   0.0907402  0.0164281   5.523 3.70e-08 ***
## TEAM_PITCHING_H  -0.0004599  0.0003669  -1.254  0.21012    
## TEAM_PITCHING_HR  0.0136050  0.0241301   0.564  0.57293    
## TEAM_PITCHING_BB  0.0029975  0.0041308   0.726  0.46814    
## TEAM_PITCHING_SO  0.0023862  0.0009146   2.609  0.00914 ** 
## TEAM_FIELDING_E  -0.0223547  0.0023865  -9.367  < 2e-16 ***
## TEAM_FIELDING_DP -0.1005829  0.0139331  -7.219 7.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.94 on 2261 degrees of freedom
## Multiple R-squared:  0.3291, Adjusted R-squared:  0.325 
## F-statistic: 79.23 on 14 and 2261 DF,  p-value: < 2.2e-16

Model suggests, variables TEAM_BATTING_BB, TEAM_BASERUN_SB, TEAM_PITCHING_H, TEAM_PITCHING_HR and TEAM_PITCHING_BB are not satistically significant as p-value is greater than 0.05 critical value.

General Matrix Form

Regression function using martix notation,

\[\left[\begin{array}{cc}y_1\\y_2\\y_3\\.\\.\\y_n\end{array}\right]=\left[\begin{array}{cc}1 & x_{11} & x_{12} & x_{13} & .& .& x_{1p}\\1 & x_{21} & x_{22} & x_{23} & .& .& x_{2p} \\1 & x_{31} & x_{32} & x_{33} & .& .& x_{3p} \\. & . & . & . & .& .& .\\. & . & . & . & .& .& . \\1 & x_{n1} & x_{n2} & x_{n3} & .& .& x_{np}\end{array}\right]\cdot\left[\begin{array}{cc}\beta_0\\\beta_1\\\beta_2\\.\\.\\\beta_p\end{array}\right] + \left[\begin{array}{cc}\epsilon_1\\\epsilon_2\\\epsilon_3\\.\\.\\\epsilon_n\end{array}\right]\]

Let, \[\mathbf{X} = \left[\begin{array}{cc}1 & x_{11} & x_{12} & x_{13} & .& .& x_{1p}\\1 & x_{21} & x_{22} & x_{23} & .& .& x_{2p} \\1 & x_{31} & x_{32} & x_{33} & .& .& x_{3p} \\. & . & . & . & .& .& .\\. & . & . & . & .& .& . \\1 & x_{n1} & x_{n2} & x_{n3} & .& .& x_{np}\end{array}\right], \mathbf{y} = \left[\begin{array}{cc}y_1\\y_2\\y_3\\.\\.\\y_n\end{array}\right], \beta = \left[\begin{array}{cc}\beta_0\\\beta_1\\\beta_2\\.\\.\\\beta_p\end{array}\right], \epsilon = \left[\begin{array}{cc}\epsilon_1\\\epsilon_2\\\epsilon_3\\.\\.\\\epsilon_n\end{array}\right]\]

General regression equation, \[\mathbf{y} = \mathbf{X}\cdot\beta + \epsilon\]

Assuming there is no residuals(\(\epsilon\)),

\[\hat{\mathbf{\beta}} = \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}\]

Where, \(\mathbf{X}^T\), is transpose of \(\mathbf{X}\). It is calculated using t() function and inverse(\(^{-1}\)) is derived using solve() function.

Variance of the residuals or mean squared error(\({\hat\sigma}^{2}\)),

\[{\hat\sigma}^{2} = \frac{(\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})^T \cdot (\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})}{n - p - 1}\] Where \(\mathbf{X} \cdot \hat{\mathbf{\beta}} = \hat{\mathbf{y}}\)

Estimated covariances for linear regression coefficient estimates

\[\widehat{\text{var}}(\hat{\mathbf{\beta}}) = \hat{\sigma}^2 \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1}\]

Square root of the diagonal of \(\widehat{\text{var}}(\hat{\mathbf{\beta}})\) are standard errors for each coefficient.

Create Matrix

Variable TARGET_WINS is response variable(\(\mathbf{y}\)) and rest of the variables are independent variables. Following code creates vector y and matrix x.

#Extract 'y' values into vector
yV <- as.matrix(bb.df[,])[,1]

#Extract 'x' values into matrix
xM <- cbind(constant = 1, as.matrix(bb.df[,])[, -1])

Beta Values

Calculate beta values or estimates.

#Transpose of x matrix
xMT <- t(xM)

#Inverse of x and x transpose
xInv <- solve(t(xM)%*%xM)

#Beta values
betaVal <- xInv %*% t(xM) %*% yV 

#Display estimates
df <- data.frame(`lm-Function` = coef(bb.lm), `Manual-Calc`= betaVal)

df %>% 
  kable("html",caption = "Beta Values") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
Beta Values
lm.Function Manual.Calc
(Intercept) 13.2580952 13.2580952
TEAM_BATTING_1B 0.0498149 0.0498149
TEAM_BATTING_2B 0.0362604 0.0362604
TEAM_BATTING_3B 0.0835683 0.0835683
TEAM_BATTING_HR 0.1236577 0.1236577
TEAM_BATTING_BB 0.0093864 0.0093864
TEAM_BATTING_SO -0.0107378 -0.0107378
TEAM_BASERUN_SB 0.0089533 0.0089533
TEAM_BASERUN_CS 0.0907402 0.0907402
TEAM_PITCHING_H -0.0004599 -0.0004599
TEAM_PITCHING_HR 0.0136050 0.0136050
TEAM_PITCHING_BB 0.0029975 0.0029975
TEAM_PITCHING_SO 0.0023862 0.0023862
TEAM_FIELDING_E -0.0223547 -0.0223547
TEAM_FIELDING_DP -0.1005829 -0.1005829

There is no difference between values obtained from lm() function and manual calculation.

Standard Errors

Now let’s get Standard Errors.

#Convert rows to column, transpose Beta values
betValt <- t(betaVal)

#Get predicted values of y, x times beta values
predY <- (xM[,1]*betValt[,1] + xM[,2]*betValt[,2] + xM[,3]*betValt[,3] +  xM[,4]*betValt[,4] +
          xM[,5]*betValt[,5] + xM[,6]*betValt[,6] + xM[,7]*betValt[,7] +  xM[,8]*betValt[,8] +
          xM[,9]*betValt[,9] + xM[,10]*betValt[,10] +  xM[,11]*betValt[,11] + xM[,12]*betValt[,12] +
          xM[,13]*betValt[,13] + xM[,14]*betValt[,14] + xM[,15]*betValt[,15])

#Get Residuals or epsilon values(y - y hat)
residM <- yV - predY

#Transpose of residuals matrix
residMt <- t(residM)

n <- nrow(bb.df)
p <- ncol(bb.df) - 1

#Variance of Residuals, sum of squares of each value divided by (rows - columns - 2)
varResid <- (residMt %*% residM) / (n - p - 1)

#Residual Standard Error
#Sqrt is applied as varResid is square value.
#R-Squared value of LM model
resdSE <- sqrt(varResid)

#Get diagonal values of product of x and transpose of x
#it is basically squared value
diagXM <-diag(xInv)

#standard errors of beta values
betaSE <- resdSE * sqrt(diagXM)

#Display estimates
df <- data.frame(`lm-Function` = summary(bb.lm)[["coefficients"]][, "Std. Error"], `Manual-Calc`= betaSE)

df %>% 
  kable("html",caption = "Standard Error Of Estimates") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
Standard Error Of Estimates
lm.Function Manual.Calc
(Intercept) 5.7825918 5.7825918
TEAM_BATTING_1B 0.0037128 0.0037128
TEAM_BATTING_2B 0.0073414 0.0073414
TEAM_BATTING_3B 0.0162111 0.0162111
TEAM_BATTING_HR 0.0274046 0.0274046
TEAM_BATTING_BB 0.0057680 0.0057680
TEAM_BATTING_SO 0.0025854 0.0025854
TEAM_BASERUN_SB 0.0054808 0.0054808
TEAM_BASERUN_CS 0.0164281 0.0164281
TEAM_PITCHING_H 0.0003669 0.0003669
TEAM_PITCHING_HR 0.0241301 0.0241301
TEAM_PITCHING_BB 0.0041308 0.0041308
TEAM_PITCHING_SO 0.0009146 0.0009146
TEAM_FIELDING_E 0.0023865 0.0023865
TEAM_FIELDING_DP 0.0139331 0.0139331

There is no difference between Standard Errors of lm() function and manual calculation.

t-value And p-value's

And next two columns of lm() function are t-value and p-value's.

#beta value divided by standard error of the estimate
tValue <- betaVal/betaSE

#Probability value
pValue <- 2*pt(abs(betaVal/betaSE), df=n-p,lower.tail= FALSE)

df <- data.frame(`lm-Function-t-value` = summary(bb.lm)[["coefficients"]][, "t value"], `Manual-Calc-t-value`=tValue, `lm-Function-p-value` = summary(bb.lm)[["coefficients"]][, "Pr(>|t|)"], `Manual-Calc-p-value` = pValue)

df %>% 
  kable("html",caption = "t-value And p-value") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
t-value And p-value
lm.Function.t.value Manual.Calc.t.value lm.Function.p.value Manual.Calc.p.value
(Intercept) 2.2927600 2.2927600 0.0219533 0.0219532
TEAM_BATTING_1B 13.4169647 13.4169647 0.0000000 0.0000000
TEAM_BATTING_2B 4.9391451 4.9391451 0.0000008 0.0000008
TEAM_BATTING_3B 5.1549957 5.1549957 0.0000003 0.0000003
TEAM_BATTING_HR 4.5122994 4.5122994 0.0000067 0.0000067
TEAM_BATTING_BB 1.6273425 1.6273425 0.1038037 0.1038037
TEAM_BATTING_SO -4.1533315 -4.1533315 0.0000340 0.0000340
TEAM_BASERUN_SB 1.6335772 1.6335772 0.1024869 0.1024868
TEAM_BASERUN_CS 5.5234878 5.5234878 0.0000000 0.0000000
TEAM_PITCHING_H -1.2535798 -1.2535798 0.2101244 0.2101243
TEAM_PITCHING_HR 0.5638171 0.5638171 0.5729345 0.5729345
TEAM_PITCHING_BB 0.7256280 0.7256280 0.4681419 0.4681419
TEAM_PITCHING_SO 2.6090739 2.6090739 0.0091386 0.0091386
TEAM_FIELDING_E -9.3670760 -9.3670760 0.0000000 0.0000000
TEAM_FIELDING_DP -7.2190003 -7.2190003 0.0000000 0.0000000

Residual Standard Error

df <- data.frame(`lm-Function` = sigma(bb.lm), `Manual-Calc`= resdSE)

df %>% 
  kable("html",caption = "Residual standard error") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
Residual standard error
lm.Function Manual.Calc
12.94206 12.94206

R-squared And F-Statistic

According to stats.stackexchange.com, there are many ways to obtain adjusted R2.

Formula used by lm() function is, McNemar’s formula.


Formula for F-Statistic.

#Residual sum of squares(errors) 
resSS <- sum((yV - predY) ^ 2)

#Regression sum of squares
regSS <- sum((predY - mean(yV)) ^ 2)

#Total sum of squares
totalSS <- sum((yV - mean(yV)) ^ 2)
rSq <- 1 - (resSS/totalSS)

#McNemar's formula, adjusted Rsq
adjRSq <- 1 - (1 - rSq)*((n-1)/(n-p))

#'p' is used instead of 'p-1' because I have already substracted 1 while declaring 'p'
FStat <- (regSS/(p))/(resSS/(n-p-1))

df <- data.frame(`lm-Function-r-sq.` = summary(bb.lm)$r.squared, `Manual-Calc-r-sq`= rSq, `lm-Function-adj.r-sq.` = summary(bb.lm)$adj.r.squared, `Manual-Calc-adj.r-sq`= adjRSq,`lm-Function-F-statistic` = summary(bb.lm)$fstatistic[1],`Manual-Calc-F-statistic`=  FStat)

df %>% 
  kable("html",caption = "R-Square And Adjusted R-Square") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
R-Square And Adjusted R-Square
lm.Function.r.sq. Manual.Calc.r.sq lm.Function.adj.r.sq. Manual.Calc.adj.r.sq lm.Function.F.statistic Manual.Calc.F.statistic
value 0.329118 0.329118 0.3249639 0.3252624 79.22788 79.22788

Conclusion, there is no difference in results obtained using lm() function or manual calculation. Matrix method provides insights into how estimates, standard errors and residuals are obtained. However lm() function is much faster in generating output.


References