In this post, I will show how the results of R's lm() function can be reproduced using matrix algebra. The dataset I will be using is Moneyball.
The variable TEAM_BATTING_HBP is excluded from the analysis, and missing values are filled in using k-NN imputation.
#options(scipen=999)
library(dplyr)
library(DMwR) # knnImputation
library(knitr) # Report display, table format
library(kableExtra)
#Load the Moneyball training data
bb.df <- read.csv("https://raw.githubusercontent.com/akulapa/Data621-Week05-Discussion/master/moneyball-training-data.csv", header= TRUE, stringsAsFactors = F)
#Derive singles from total hits
bb.df$TEAM_BATTING_1B = bb.df$TEAM_BATTING_H - bb.df$TEAM_BATTING_2B - bb.df$TEAM_BATTING_3B - bb.df$TEAM_BATTING_HR
#Keep the response and predictors; TEAM_BATTING_HBP is dropped
bb.df <- bb.df %>%
select(TARGET_WINS,TEAM_BATTING_1B,TEAM_BATTING_2B,TEAM_BATTING_3B,TEAM_BATTING_HR,TEAM_BATTING_BB,
TEAM_BATTING_SO,TEAM_BASERUN_SB,TEAM_BASERUN_CS,TEAM_PITCHING_H,TEAM_PITCHING_HR,
TEAM_PITCHING_BB,TEAM_PITCHING_SO,TEAM_FIELDING_E,TEAM_FIELDING_DP)
#Fill in missing values using weighted-average k-NN imputation with k = 15
bb.df <- knnImputation(bb.df, 15, meth='weighAvg')
#Ensure every column is numeric after imputation
for(i in 1:ncol(bb.df)){
bb.df[,i] <- as.numeric(bb.df[,i])
}
head(bb.df,20) %>%
kable("html",caption = "Sample Data") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) %>%
scroll_box(width = "100%", height = "200px")
| TARGET_WINS | TEAM_BATTING_1B | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | 1199 | 194 | 39 | 13 | 143 | 842 | 116.1351 | 61.54372 | 9364 | 84 | 927 | 5456 | 1011 | 155.9764 |
| 70 | 908 | 219 | 22 | 190 | 685 | 1075 | 37.0000 | 28.00000 | 1347 | 191 | 689 | 1082 | 193 | 155.0000 |
| 86 | 973 | 232 | 35 | 137 | 602 | 917 | 46.0000 | 27.00000 | 1377 | 137 | 602 | 917 | 175 | 153.0000 |
| 70 | 1044 | 209 | 38 | 96 | 451 | 922 | 43.0000 | 30.00000 | 1396 | 97 | 454 | 928 | 164 | 156.0000 |
| 82 | 982 | 186 | 27 | 102 | 472 | 920 | 49.0000 | 39.00000 | 1297 | 102 | 472 | 920 | 138 | 168.0000 |
| 75 | 951 | 200 | 36 | 92 | 443 | 973 | 107.0000 | 59.00000 | 1279 | 92 | 443 | 973 | 123 | 149.0000 |
| 80 | 889 | 179 | 54 | 122 | 525 | 1062 | 80.0000 | 54.00000 | 1244 | 122 | 525 | 1062 | 136 | 186.0000 |
| 85 | 950 | 171 | 37 | 115 | 456 | 1027 | 40.0000 | 36.00000 | 1281 | 116 | 459 | 1033 | 112 | 136.0000 |
| 86 | 1040 | 197 | 40 | 114 | 447 | 922 | 69.0000 | 27.00000 | 1391 | 114 | 447 | 922 | 127 | 169.0000 |
| 76 | 944 | 213 | 18 | 96 | 441 | 827 | 72.0000 | 34.00000 | 1271 | 96 | 441 | 827 | 131 | 159.0000 |
| 78 | 1017 | 179 | 27 | 82 | 374 | 888 | 60.0000 | 39.00000 | 1364 | 86 | 391 | 928 | 119 | 141.0000 |
| 68 | 1043 | 203 | 31 | 95 | 509 | 801 | 119.0000 | 79.00000 | 1372 | 95 | 509 | 801 | 147 | 150.0000 |
| 72 | 1040 | 196 | 41 | 55 | 597 | 816 | 221.0000 | 109.00000 | 1340 | 55 | 601 | 821 | 185 | 165.0000 |
| 76 | 969 | 210 | 23 | 63 | 534 | 812 | 126.0000 | 80.00000 | 1265 | 63 | 534 | 812 | 150 | 139.0000 |
| 74 | 976 | 233 | 40 | 131 | 542 | 880 | 159.0000 | 89.00000 | 1380 | 131 | 542 | 880 | 147 | 137.0000 |
| 87 | 1055 | 226 | 28 | 108 | 539 | 682 | 86.0000 | 69.00000 | 1417 | 108 | 539 | 682 | 136 | 136.0000 |
| 88 | 1114 | 242 | 43 | 164 | 589 | 843 | 100.0000 | 53.00000 | 1563 | 164 | 589 | 843 | 135 | 172.0000 |
| 66 | 1082 | 239 | 32 | 107 | 546 | 900 | 92.0000 | 64.00000 | 1478 | 108 | 553 | 911 | 136 | 146.0000 |
| 75 | 1026 | 197 | 24 | 143 | 579 | 841 | 65.0000 | 49.00000 | 2047 | 211 | 853 | 1239 | 149 | 177.0000 |
| 93 | 1038 | 268 | 26 | 186 | 613 | 760 | 55.0000 | 53.00000 | 1518 | 186 | 613 | 760 | 106 | 171.0000 |
Regression model using the lm() function
bb.lm <- lm(TARGET_WINS ~ ., data = bb.df)
summary(bb.lm)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = bb.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.668 -8.486 0.097 8.273 61.574
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.2580952 5.7825918 2.293 0.02195 *
## TEAM_BATTING_1B 0.0498149 0.0037128 13.417 < 2e-16 ***
## TEAM_BATTING_2B 0.0362604 0.0073414 4.939 8.42e-07 ***
## TEAM_BATTING_3B 0.0835683 0.0162111 5.155 2.76e-07 ***
## TEAM_BATTING_HR 0.1236577 0.0274046 4.512 6.74e-06 ***
## TEAM_BATTING_BB 0.0093864 0.0057680 1.627 0.10380
## TEAM_BATTING_SO -0.0107378 0.0025854 -4.153 3.40e-05 ***
## TEAM_BASERUN_SB 0.0089533 0.0054808 1.634 0.10249
## TEAM_BASERUN_CS 0.0907402 0.0164281 5.523 3.70e-08 ***
## TEAM_PITCHING_H -0.0004599 0.0003669 -1.254 0.21012
## TEAM_PITCHING_HR 0.0136050 0.0241301 0.564 0.57293
## TEAM_PITCHING_BB 0.0029975 0.0041308 0.726 0.46814
## TEAM_PITCHING_SO 0.0023862 0.0009146 2.609 0.00914 **
## TEAM_FIELDING_E -0.0223547 0.0023865 -9.367 < 2e-16 ***
## TEAM_FIELDING_DP -0.1005829 0.0139331 -7.219 7.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.94 on 2261 degrees of freedom
## Multiple R-squared: 0.3291, Adjusted R-squared: 0.325
## F-statistic: 79.23 on 14 and 2261 DF, p-value: < 2.2e-16
The model suggests that the variables TEAM_BATTING_BB, TEAM_BASERUN_SB, TEAM_PITCHING_H, TEAM_PITCHING_HR, and TEAM_PITCHING_BB are not statistically significant, as their p-values exceed the 0.05 critical value.
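The same list can be pulled straight from the summary object; a minimal sketch:
#Names of coefficients with p-values above the 0.05 critical value
coefTab <- summary(bb.lm)$coefficients
rownames(coefTab)[coefTab[, "Pr(>|t|)"] > 0.05]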
The regression function in matrix notation:
\[\left[\begin{array}{cc}y_1\\y_2\\y_3\\.\\.\\y_n\end{array}\right]=\left[\begin{array}{cc}1 & x_{11} & x_{12} & x_{13} & .& .& x_{1p}\\1 & x_{21} & x_{22} & x_{23} & .& .& x_{2p} \\1 & x_{31} & x_{32} & x_{33} & .& .& x_{3p} \\. & . & . & . & .& .& .\\. & . & . & . & .& .& . \\1 & x_{n1} & x_{n2} & x_{n3} & .& .& x_{np}\end{array}\right]\cdot\left[\begin{array}{cc}\beta_0\\\beta_1\\\beta_2\\.\\.\\\beta_p\end{array}\right] + \left[\begin{array}{cc}\epsilon_1\\\epsilon_2\\\epsilon_3\\.\\.\\\epsilon_n\end{array}\right]\]
Let, \[\mathbf{X} = \left[\begin{array}{cc}1 & x_{11} & x_{12} & x_{13} & .& .& x_{1p}\\1 & x_{21} & x_{22} & x_{23} & .& .& x_{2p} \\1 & x_{31} & x_{32} & x_{33} & .& .& x_{3p} \\. & . & . & . & .& .& .\\. & . & . & . & .& .& . \\1 & x_{n1} & x_{n2} & x_{n3} & .& .& x_{np}\end{array}\right], \mathbf{y} = \left[\begin{array}{cc}y_1\\y_2\\y_3\\.\\.\\y_n\end{array}\right], \beta = \left[\begin{array}{cc}\beta_0\\\beta_1\\\beta_2\\.\\.\\\beta_p\end{array}\right], \epsilon = \left[\begin{array}{cc}\epsilon_1\\\epsilon_2\\\epsilon_3\\.\\.\\\epsilon_n\end{array}\right]\]
General regression equation, \[\mathbf{y} = \mathbf{X}\cdot\beta + \epsilon\]
Minimizing the sum of squared residuals \(\epsilon^T \epsilon\) yields the normal equations \(\mathbf{X}^T \mathbf{X} \cdot \beta = \mathbf{X}^T \mathbf{y}\); solving for \(\beta\) gives the least-squares estimate,
\[\hat{\mathbf{\beta}} = \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}\]
where \(\mathbf{X}^T\) is the transpose of \(\mathbf{X}\). In R, the transpose is computed with the t() function and the inverse (\(^{-1}\)) with the solve() function.
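For readers new to these two base R functions, a toy illustration on a small 2x2 matrix:
A <- matrix(c(2, 1, 1, 3), nrow = 2)
t(A) #transpose of A
solve(A) #inverse of A
round(solve(A) %*% A) #recovers the 2x2 identity matrix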
The variance of the residuals, or mean squared error (\({\hat\sigma}^{2}\)), is
\[{\hat\sigma}^{2} = \frac{(\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})^T \cdot (\mathbf{y} - \mathbf{X} \cdot \hat{\mathbf{\beta}})}{n - p - 1}\] where \(\mathbf{X} \cdot \hat{\mathbf{\beta}} = \hat{\mathbf{y}}\) is the vector of fitted values, \(n\) is the number of observations, and \(p\) is the number of predictors.
The estimated covariance matrix of the coefficient estimates is
\[\widehat{\text{var}}(\hat{\mathbf{\beta}}) = \hat{\sigma}^2 \left(\mathbf{X}^T \cdot \mathbf{X} \right)^{-1}\]
The square roots of the diagonal entries of \(\widehat{\text{var}}(\hat{\mathbf{\beta}})\) are the standard errors of the coefficients.
The variable TARGET_WINS is the response (\(\mathbf{y}\)); the remaining variables are the predictors. The following code creates the vector y and the design matrix X.
#Extract 'y' values into a vector
yV <- as.matrix(bb.df)[, 1]
#Extract 'x' values into a matrix, prepending a column of 1s for the intercept
xM <- cbind(constant = 1, as.matrix(bb.df)[, -1])
Calculate the beta values, i.e. the coefficient estimates.
#Transpose of the X matrix
xMT <- t(xM)
#Inverse of X-transpose times X
xInv <- solve(xMT %*% xM)
#Beta values
betaVal <- xInv %*% xMT %*% yV
#Display estimates
df <- data.frame(`lm-Function` = coef(bb.lm), `Manual-Calc`= betaVal)
df %>%
kable("html",caption = "Beta Values") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| | lm.Function | Manual.Calc |
|---|---|---|
| (Intercept) | 13.2580952 | 13.2580952 |
| TEAM_BATTING_1B | 0.0498149 | 0.0498149 |
| TEAM_BATTING_2B | 0.0362604 | 0.0362604 |
| TEAM_BATTING_3B | 0.0835683 | 0.0835683 |
| TEAM_BATTING_HR | 0.1236577 | 0.1236577 |
| TEAM_BATTING_BB | 0.0093864 | 0.0093864 |
| TEAM_BATTING_SO | -0.0107378 | -0.0107378 |
| TEAM_BASERUN_SB | 0.0089533 | 0.0089533 |
| TEAM_BASERUN_CS | 0.0907402 | 0.0907402 |
| TEAM_PITCHING_H | -0.0004599 | -0.0004599 |
| TEAM_PITCHING_HR | 0.0136050 | 0.0136050 |
| TEAM_PITCHING_BB | 0.0029975 | 0.0029975 |
| TEAM_PITCHING_SO | 0.0023862 | 0.0023862 |
| TEAM_FIELDING_E | -0.0223547 | -0.0223547 |
| TEAM_FIELDING_DP | -0.1005829 | -0.1005829 |
There is no difference between the values obtained from the lm() function and the manual calculation.
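As an aside, lm() does not actually compute \(\left(\mathbf{X}^T \cdot \mathbf{X}\right)^{-1}\) explicitly; it solves the least-squares problem through a QR decomposition, which is numerically more stable. The same estimates can be obtained with base R's qr.solve(); a minimal sketch:
#Least-squares solution via QR decomposition
betaQR <- qr.solve(xM, yV)
#Should agree with the explicit-inverse calculation up to floating-point error
all.equal(as.numeric(betaQR), as.numeric(betaVal))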
Now let’s compute the standard errors.
#Get predicted values of y: X times the beta vector
predY <- xM %*% betaVal
#Get residuals, or epsilon values (y - y hat)
residM <- yV - predY
#Transpose of the residuals matrix
residMt <- t(residM)
#n = number of observations, p = number of predictors
n <- nrow(bb.df)
p <- ncol(bb.df) - 1
#Variance of the residuals: residual sum of squares divided by (n - p - 1)
varResid <- (residMt %*% residM) / (n - p - 1)
#Residual standard error: square root of the residual variance
#as.numeric() drops the 1x1 matrix dimensions so it multiplies cleanly below
resdSE <- as.numeric(sqrt(varResid))
#Diagonal of the inverse of X-transpose times X
diagXM <- diag(xInv)
#Standard errors of the beta values: sigma hat times the square roots of the diagonal
betaSE <- resdSE * sqrt(diagXM)
#Display estimates
df <- data.frame(`lm-Function` = summary(bb.lm)[["coefficients"]][, "Std. Error"], `Manual-Calc`= betaSE)
df %>%
kable("html",caption = "Standard Error Of Estimates") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| | lm.Function | Manual.Calc |
|---|---|---|
| (Intercept) | 5.7825918 | 5.7825918 |
| TEAM_BATTING_1B | 0.0037128 | 0.0037128 |
| TEAM_BATTING_2B | 0.0073414 | 0.0073414 |
| TEAM_BATTING_3B | 0.0162111 | 0.0162111 |
| TEAM_BATTING_HR | 0.0274046 | 0.0274046 |
| TEAM_BATTING_BB | 0.0057680 | 0.0057680 |
| TEAM_BATTING_SO | 0.0025854 | 0.0025854 |
| TEAM_BASERUN_SB | 0.0054808 | 0.0054808 |
| TEAM_BASERUN_CS | 0.0164281 | 0.0164281 |
| TEAM_PITCHING_H | 0.0003669 | 0.0003669 |
| TEAM_PITCHING_HR | 0.0241301 | 0.0241301 |
| TEAM_PITCHING_BB | 0.0041308 | 0.0041308 |
| TEAM_PITCHING_SO | 0.0009146 | 0.0009146 |
| TEAM_FIELDING_E | 0.0023865 | 0.0023865 |
| TEAM_FIELDING_DP | 0.0139331 | 0.0139331 |
There is no difference between the standard errors from the lm() function and the manual calculation.
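The intermediate quantities can also be cross-checked against lm()'s accessor functions; a minimal sketch:
#Manually computed fitted values and residuals should match lm()'s
all.equal(as.numeric(predY), as.numeric(fitted(bb.lm)))
all.equal(as.numeric(residM), as.numeric(residuals(bb.lm)))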
The next two columns of the lm() output are the t-values and p-values.
#t-value: beta value divided by the standard error of the estimate
tValue <- betaVal/betaSE
#Two-sided p-value from the t distribution with n - p - 1 residual degrees of freedom
pValue <- 2*pt(abs(tValue), df = n - p - 1, lower.tail = FALSE)
df <- data.frame(`lm-Function-t-value` = summary(bb.lm)[["coefficients"]][, "t value"], `Manual-Calc-t-value`=tValue, `lm-Function-p-value` = summary(bb.lm)[["coefficients"]][, "Pr(>|t|)"], `Manual-Calc-p-value` = pValue)
df %>%
kable("html",caption = "t-value And p-value") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| | lm.Function.t.value | Manual.Calc.t.value | lm.Function.p.value | Manual.Calc.p.value |
|---|---|---|---|---|
| (Intercept) | 2.2927600 | 2.2927600 | 0.0219533 | 0.0219533 |
| TEAM_BATTING_1B | 13.4169647 | 13.4169647 | 0.0000000 | 0.0000000 |
| TEAM_BATTING_2B | 4.9391451 | 4.9391451 | 0.0000008 | 0.0000008 |
| TEAM_BATTING_3B | 5.1549957 | 5.1549957 | 0.0000003 | 0.0000003 |
| TEAM_BATTING_HR | 4.5122994 | 4.5122994 | 0.0000067 | 0.0000067 |
| TEAM_BATTING_BB | 1.6273425 | 1.6273425 | 0.1038037 | 0.1038037 |
| TEAM_BATTING_SO | -4.1533315 | -4.1533315 | 0.0000340 | 0.0000340 |
| TEAM_BASERUN_SB | 1.6335772 | 1.6335772 | 0.1024869 | 0.1024869 |
| TEAM_BASERUN_CS | 5.5234878 | 5.5234878 | 0.0000000 | 0.0000000 |
| TEAM_PITCHING_H | -1.2535798 | -1.2535798 | 0.2101244 | 0.2101244 |
| TEAM_PITCHING_HR | 0.5638171 | 0.5638171 | 0.5729345 | 0.5729345 |
| TEAM_PITCHING_BB | 0.7256280 | 0.7256280 | 0.4681419 | 0.4681419 |
| TEAM_PITCHING_SO | 2.6090739 | 2.6090739 | 0.0091386 | 0.0091386 |
| TEAM_FIELDING_E | -9.3670760 | -9.3670760 | 0.0000000 | 0.0000000 |
| TEAM_FIELDING_DP | -7.2190003 | -7.2190003 | 0.0000000 | 0.0000000 |
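As an aside, 95% confidence intervals follow from the same estimates and standard errors via qt(); a minimal sketch, which should agree with confint(bb.lm):
#Critical t value at the 0.05 significance level
tCrit <- qt(0.975, df = n - p - 1)
#Lower and upper bounds of the 95% confidence intervals
ciLower <- betaVal - tCrit * betaSE
ciUpper <- betaVal + tCrit * betaSE
head(cbind(ciLower, ciUpper))
#Compare with lm()'s built-in intervals
head(confint(bb.lm))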
Next, compare the residual standard error from lm() with the manual value.
df <- data.frame(`lm-Function` = sigma(bb.lm), `Manual-Calc`= resdSE)
df %>%
kable("html",caption = "Residual standard error") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| lm.Function | Manual.Calc |
|---|---|
| 12.94206 | 12.94206 |
According to stats.stackexchange.com, there are many ways to obtain the adjusted R². The formula used by the lm() function is Wherry’s formula, \[\bar{R}^{2} = 1 - (1 - R^2)\frac{n-1}{n-p-1},\] whereas the calculation below uses McNemar’s formula, \[\bar{R}^{2} = 1 - (1 - R^2)\frac{n-1}{n-p},\] which explains the small difference in the adjusted R² values in the table that follows. The F-statistic is \[F = \frac{SS_{reg}/p}{SS_{res}/(n-p-1)}\]
#Residual sum of squares (errors)
resSS <- sum((yV - predY) ^ 2)
#Regression sum of squares
regSS <- sum((predY - mean(yV)) ^ 2)
#Total sum of squares
totalSS <- sum((yV - mean(yV)) ^ 2)
#R-squared
rSq <- 1 - (resSS/totalSS)
#McNemar's formula for adjusted R-squared; note lm() divides by (n - p - 1) instead of (n - p)
adjRSq <- 1 - (1 - rSq)*((n-1)/(n-p))
#F-statistic; 'p' (not 'p - 1') is the numerator df because 1 was already subtracted when declaring 'p'
FStat <- (regSS/(p))/(resSS/(n-p-1))
df <- data.frame(`lm-Function-r-sq.` = summary(bb.lm)$r.squared, `Manual-Calc-r-sq`= rSq, `lm-Function-adj.r-sq.` = summary(bb.lm)$adj.r.squared, `Manual-Calc-adj.r-sq`= adjRSq,`lm-Function-F-statistic` = summary(bb.lm)$fstatistic[1],`Manual-Calc-F-statistic`= FStat)
df %>%
kable("html",caption = "R-Square And Adjusted R-Square") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)
| | lm.Function.r.sq. | Manual.Calc.r.sq | lm.Function.adj.r.sq. | Manual.Calc.adj.r.sq | lm.Function.F.statistic | Manual.Calc.F.statistic |
|---|---|---|---|---|---|---|
| value | 0.329118 | 0.329118 | 0.3249639 | 0.3252624 | 79.22788 | 79.22788 |
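The p-value that lm() reports alongside the F-statistic can be reproduced with pf(); a minimal sketch:
#p-value of the overall F-test with p and (n - p - 1) degrees of freedom
pf(FStat, p, n - p - 1, lower.tail = FALSE)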
In conclusion, there is no difference between the results obtained with the lm() function and the manual matrix calculation, apart from the adjusted R², where McNemar’s formula was used in place of lm()’s. The matrix method provides insight into how the estimates, standard errors, and residuals are obtained; however, the lm() function is much faster at generating the output.