As an undergraduate, I found regression to be a fuzzy concept. I completed a year's worth of psychology statistics, but in psychology ANOVA is king, and regression is reduced to five pages in a chapter shared with correlation. Thus, the computations surrounding regression were very much a black box to me. Encouraged by my adviser, I decided to take a class titled "Regression for Social and Behavioral Research" to find out what all the fuss surrounding regression was about. During the first day of class we went over the syllabus:
I went into shock, tuned out for the rest of class, and missed learning more funny-sounding new words like "probit" and "logit". After the first day, I was 9000 percent set on dropping the class, but to my dismay "walking for fitness" was already full, so I was stuck in a class where I would spend the first month or so doing matrix algebra. Little did I know, this class would be one of the best (if not the best) classes I completed as an undergraduate, and it spurred my interest in data analysis and statistics. My goal with this tutorial is simply to share some of the things I learned along the way and to unpack what is happening under the hood when you hit ctrl+enter on a line of code containing the function "lm". I'm assuming the reader has a working knowledge of the basic concepts surrounding regression, so I won't go over those in detail. The point of this post isn't to spell all this out in super math-y terms, but to provide a hands-on, applied example of the concepts. Throughout, I'll reference a few important equations, with links to the proofs behind them for the crazed.
Let's start by loading the "mtcars" dataset, which comes built into R and will serve as our data throughout. We can see mtcars is composed of various information about cars, including measures such as mpg (miles per gallon), hp (horsepower), cyl (cylinders), etc.
data("mtcars")
head(mtcars,5)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Let's select three of these variables: mpg, hp, and disp (displacement). Here, we will be predicting mpg (y) from hp (x1) and disp (x2). A quick glance at the pairs plot below suggests that mpg is inversely related to both hp and disp.
library(GGally)
y<-mtcars[,"mpg"]
x<-mtcars[,c("hp","disp")]
ggpairs(cbind(y,x))
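As a quick numeric check (a small addition on my part, nothing lm requires), cor() confirms the direction of those relationships; both correlations with mpg come out negative, in line with the plot.
cor(mtcars[,c("mpg","hp","disp")])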
The next step is to finish our x matrix, which currently consists of our two predictor variables, hp and disp. To do this, we simply need to add a column of ones (x0) corresponding to the intercept term (b0). We can accomplish this by padding the current x matrix with a vector of ones.
x0<-rep(1,nrow(x))
x<-as.matrix(cbind(x0,x))
head(x,5)
## x0 hp disp
## Mazda RX4 1 110 160
## Mazda RX4 Wag 1 110 160
## Datsun 710 1 93 108
## Hornet 4 Drive 1 110 258
## Hornet Sportabout 1 175 360
Now, the fun begins. We'll start by finding our beta coefficients. In order to do this, we need to take the inverse of X transpose X and multiply that by X transpose Y. Put more readably: \(b = (X^TX)^{-1}X^TY\). Below, we'll begin by creating our \(X^TX\) and \(X^TY\) matrices.
XtX<-t(x)%*%x
XtY<-t(x)%*%y
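As an aside, base R's crossprod() builds these same matrices in one step, avoiding the explicit transpose; it's just an equivalent (and slightly more efficient) way to write the two lines above.
# crossprod(x) is t(x) %*% x; crossprod(x, y) is t(x) %*% y
all.equal(crossprod(x), XtX)
all.equal(crossprod(x,y), XtY)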
It turns out the X'X matrix has some neat properties. On the diagonal you'll find the sums of squares of each column, and on the off-diagonals you'll find the cross-products between pairs of columns. As an example, we will calculate the sum of squares for hp and the cross-product between hp and disp. You can see these values also appear in the appropriate spots in the X'X matrix.
print(XtX)
## x0 hp disp
## x0 32.0 4694 7383.1
## hp 4694.0 834278 1291364.4
## disp 7383.1 1291364 2179627.5
sum(mtcars$hp*mtcars$hp)
## [1] 834278
sum(mtcars$hp*mtcars$disp)
## [1] 1291364
The X'Y matrix will take a bit of a different form, as it will always be a vector with one entry per column of the design matrix (one per predictor, plus one for the intercept), but you can still find the cross-products in it, as seen below with respect to mpg and hp.
print(XtY)
## [,1]
## x0 642.9
## hp 84362.7
## disp 128705.1
sum(mtcars$hp*mtcars$mpg)
## [1] 84362.7
Alright, time to calculate some regression coefficients. As mentioned earlier, \(b = (X^TX)^{-1}X^TY\). At this point, we have everything ready to plug into that formula, so we will go ahead and do so. Just to ensure that I'm not lying to you, we will compare our resulting calculation to that of lm.
betas<-solve(XtX)%*%XtY
betas
## [,1]
## x0 30.73590425
## hp -0.02484008
## disp -0.03034628
summary(lm(mpg ~ hp + disp,data=mtcars))$coefficients[1:3]
## [1] 30.73590425 -0.02484008 -0.03034628
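One quick design note, as an aside: explicitly inverting \(X^TX\) is fine for a small, well-conditioned problem like this one, but handing both matrices to solve() solves the normal equations directly and is the numerically safer habit.
# same coefficients, without forming the explicit inverse
all.equal(as.numeric(solve(XtX,XtY)), as.numeric(betas))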
Next, we are going to compute something a little funky known as the hat matrix, or projection matrix. This matrix maps the vector of our response variable (mpg) to the vector of predicted values. Its diagonal entries, known as leverages, describe how much influence each observed response value has on its own fitted value. To find the hat matrix we use the following formula: \(H = X(X^TX)^{-1}X^T\). Once we have calculated the hat matrix, we multiply it by y to find our predicted values. We will do so, and compare our results to lm, below:
hat_matrix<- x%*%solve(XtX)%*%t(x)
predicted<-hat_matrix%*%y
head(predicted,5)
## [,1]
## Mazda RX4 23.14809
## Mazda RX4 Wag 23.14809
## Datsun 710 25.14838
## Hornet 4 Drive 20.17416
## Hornet Sportabout 15.46423
lm(mpg ~ hp + disp,data=mtcars)$fitted.values[1:5]
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 23.14809 23.14809 25.14838 20.17416
## Hornet Sportabout
## 15.46423
all.equal(as.numeric(predicted),as.numeric(lm(mpg ~ hp + disp,data=mtcars)$fitted.values))
## [1] TRUE
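As a bonus (my own addition, not part of the original comparison), the leverages mentioned above really do sit on the diagonal of our hat matrix, and we can check them against the built-in hatvalues() function:
# leverages: the diagonal of the hat matrix
leverages<-diag(hat_matrix)
all.equal(as.numeric(leverages), as.numeric(hatvalues(lm(mpg ~ hp + disp,data=mtcars))))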
Now that we have our predicted values, we can easily find our residuals and again show that they are equal to the output produced by "lm":
residuals<-(y-predicted)
residuals[1:5]
## [1] -2.148091 -2.148091 -2.348379 1.225844 3.235770
lm(mpg ~ hp + disp,data=mtcars)$residuals[1:5]
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## -2.148091 -2.148091 -2.348379 1.225844
## Hornet Sportabout
## 3.235770
all.equal(as.numeric(residuals),as.numeric(lm(mpg ~ hp + disp,data=mtcars)$residuals))
## [1] TRUE
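A small aside: the residuals can equivalently be written as \((I - H)y\), where \(I - H\) is sometimes called the residual-maker matrix. A minimal sketch:
# M = I - H sends y straight to the residuals
M<-diag(nrow(x))-hat_matrix
all.equal(as.numeric(M%*%y), as.numeric(residuals))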
We can now use the residuals we've calculated to help us find the standard errors of our beta estimates. We can accomplish this by first finding the variance-covariance matrix of the regression coefficients. The formula to do so is: \(var(\hat{\beta}) = \sigma^2(X^TX)^{-1}\). To proceed we need to estimate \(\sigma^2\). To find \(\hat{\sigma}^2\) we can use the following formula: \(\hat{\sigma}^2 = \dfrac{1}{N-p}(r^Tr)\), where r is our vector of residuals, N is the number of observations, and p is the number of columns in the design matrix (intercept included). Once we have an estimate of the variance, we can plug it into the first formula to obtain our variance-covariance matrix. From there, we take the square root of its diagonal to obtain our standard errors. As usual, we will compare our calculations to "lm".
var_est <- (1/(nrow(x)-ncol(x)))*t(residuals)%*%residuals
ses<-sqrt(diag(as.numeric(var_est)*solve(XtX)))
print(ses)
## x0 hp disp
## 1.331566129 0.013385499 0.007404856
summary(lm(mpg ~ hp + disp,data=mtcars))$coefficients[4:6]
## [1] 1.331566129 0.013385499 0.007404856
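The intermediate variance-covariance matrix is worth a look in its own right; it should match what vcov() reports for the fitted model, up to the dimension names:
# full variance-covariance matrix of the coefficient estimates
vcov_est<-as.numeric(var_est)*solve(XtX)
all.equal(unname(vcov_est), unname(vcov(lm(mpg ~ hp + disp,data=mtcars))))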
Now, we can finish our calculations by obtaining t- and p-values. We can even calculate a few other things lm gives you, such as \(R^2\) and adjusted \(R^2\).
tvals<-betas/ses
print(tvals)
## [,1]
## x0 23.082522
## hp -1.855746
## disp -4.098159
pvals<-sapply(tvals, function (tval) 2*pt(abs(tval), df = nrow(x)-ncol(x), lower.tail = FALSE))
print(pvals)
## [1] 3.262507e-20 7.367905e-02 3.062678e-04
r2<-sum((predicted-mean(y))^2)/sum((y-mean(y))^2)
print(r2)
## [1] 0.7482402
adjusted_r2<-1-(((1-r2)*(nrow(x)-1))/(nrow(x)-(ncol(x)-1)-1))
print(adjusted_r2)
## [1] 0.7308774
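While we're at it, the overall F-statistic that lm reports can be recovered from \(R^2\) alone via \(F = \dfrac{R^2/(p-1)}{(1-R^2)/(N-p)}\), with \(p-1\) and \(N-p\) degrees of freedom; this should reproduce the 43.09 on 2 and 29 DF shown in the summary output below.
# overall F-statistic and its p-value
df1<-ncol(x)-1        # numerator df: number of predictors
df2<-nrow(x)-ncol(x)  # denominator df: residual df
f_stat<-(r2/df1)/((1-r2)/df2)
f_stat
pf(f_stat, df1, df2, lower.tail=FALSE)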
Finally, we can combine everything in one spot and compare it to the summary output of lm:
round(data.frame(b=betas,se=ses,t_value = tvals,p_value=pvals,r2=c(r2,rep(NA,2)),adj_r2=c(adjusted_r2,rep(NA,2))),6)
## b se t_value p_value r2 adj_r2
## x0 30.735904 1.331566 23.082522 0.000000 0.74824 0.730877
## hp -0.024840 0.013385 -1.855746 0.073679 NA NA
## disp -0.030346 0.007405 -4.098159 0.000306 NA NA
summary(lm(mpg ~ hp+disp,data=mtcars))
##
## Call:
## lm(formula = mpg ~ hp + disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7945 -2.3036 -0.8246 1.8582 6.9363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.735904 1.331566 23.083 < 2e-16 ***
## hp -0.024840 0.013385 -1.856 0.073679 .
## disp -0.030346 0.007405 -4.098 0.000306 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.127 on 29 degrees of freedom
## Multiple R-squared: 0.7482, Adjusted R-squared: 0.7309
## F-statistic: 43.09 on 2 and 29 DF, p-value: 2.062e-09
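One more check before moving on (my own addition): 95% confidence intervals fall out of the pieces we already have, as \(b \pm t_{0.975,\,N-p} \cdot se\), and should agree with confint().
# 95% CIs: estimate plus/minus the critical t value times the standard error
t_crit<-qt(0.975, df = nrow(x)-ncol(x))
cbind(betas - t_crit*ses, betas + t_crit*ses)
confint(lm(mpg ~ hp + disp,data=mtcars))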
What about incorporating an interaction? Doing so is pretty simple: we just need to create the interaction term in our design matrix and repeat the steps above. To make this a little cleaner, let's create a function that takes a y vector and a design matrix as input and returns something similar to the summary statement from lm.
my_lm<-function(y,x){
  # coefficients via the normal equations: b = (X'X)^-1 X'Y
  XtX<-t(x)%*%x
  XtY<-t(x)%*%y
  betas<-solve(XtX)%*%XtY
  # hat matrix maps the observed y onto the fitted values
  hat_matrix<- x%*%solve(XtX)%*%t(x)
  predicted<-hat_matrix%*%y
  residuals<-(y-predicted)
  # sigma^2 estimate, then standard errors from the variance-covariance matrix
  var_est <- (1/(nrow(x)-ncol(x)))*t(residuals)%*%residuals
  ses<-sqrt(diag(as.numeric(var_est)*solve(XtX)))
  # t-statistics, two-tailed p-values, R^2, and adjusted R^2
  tvals<-betas/ses
  pvals<-sapply(tvals, function (tval) 2*pt(abs(tval), df = nrow(x)-ncol(x), lower.tail = FALSE))
  r2<-sum((predicted-mean(y))^2)/sum((y-mean(y))^2)
  adjusted_r2<-1-(((1-r2)*(nrow(x)-1))/(nrow(x)-(ncol(x)-1)-1))
  return(round(data.frame(b=betas,se=ses,t_value = tvals,p_value=pvals,r2=c(r2,rep(NA,ncol(x)-1)),adj_r2=c(adjusted_r2,rep(NA,ncol(x)-1))),6))
}
# design matrix: intercept, hp, disp, and the hp*disp interaction term
x<-cbind(rep(1,nrow(mtcars)),mtcars[,c("hp","disp")],mtcars[,"hp"]*mtcars[,"disp"])
x<-as.matrix(x)
y<-mtcars[,"mpg"]
my_lm(y,x)
## b se t_value p_value
## rep(1, nrow(mtcars)) 39.674263 2.914172 13.614248 0.000000
## hp -0.097894 0.024745 -3.956175 0.000473
## disp -0.073373 0.014387 -5.100113 0.000021
## mtcars[, "hp"] * mtcars[, "disp"] 0.000290 0.000087 3.336151 0.002407
## r2 adj_r2
## rep(1, nrow(mtcars)) 0.819849 0.800548
## hp NA NA
## disp NA NA
## mtcars[, "hp"] * mtcars[, "disp"] NA NA
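For completeness, the same interaction model can be fit through lm's formula interface, where hp * disp expands to hp + disp + hp:disp; its coefficient table should line up with the my_lm output above.
summary(lm(mpg ~ hp * disp,data=mtcars))$coefficients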