How to open R and derive the OLS estimates using matrices
Hello and welcome to the wonderful and fantastic world of using R for econometrics! There are a number of software packages capable of doing econometric calculations but R is, in my humble opinion, the best by far. It’s got far more features than comparable software and is much easier to program in once you get the hang of it! I stress the word “once” because there is a bit of a fixed cost when it comes to picking up some of the programming but all it takes is time, so if you feel a little disheartened at the start of this please don’t panic, worry or feel discouraged. If you put the hours in you will pick it up. And that’s all R will cost you, your time, because it is open source software meaning it is free to download and use. The Rstudio program is also free to use for non-commercial activities. This won’t cost you a penny!
R and RstudioObviously the first thing you will need to do is download and install all of the software to your computer. The R program can be found at https://cran.r-project.org/. When you land on that page choose your OS (like Windows or Mac) and click through. There is an option to “install R for the first time” so just download the base package. Don’t try to be a hero and download anything non-default! Once you download R on your computer you will be able to use the software. However, we do not stop there! Why? Because the base R program is awful for writing code in and doing your analysis in. The Rstudio software provides a much more user friendly R experience, so let’s download that. Go to https://rstudio.com/products/rstudio/download/#download and download the free Rstudio desktop. Don’t worry about the priority support, I will do this for you (actually, maybe I should charge for this, might be a nice little nixer!). Again, just click through the download accepting the default settings.
Once you have both programs installed you are ready to go! The video below gives a really good overview of the download process and also a small description of the Rstudio architecture once the program is open. The lad likes his board games too!
Now, armed with both programs, open Rstudio and you are ready to start R-ing (yes, I’m making up a verb). In the first quadrant (top-left panel) you are going to write your code (script panel from now on). When the code is complete you will execute it in the console panel (immediately below). This tutorial is more about showing how R can be used to perform linear algebra calculations that you would never want to do by hand. In the next tutorial we will explore some of the functions you will be using for regression analyses.
The following is a simple vector of values that I made up from the top of my head. The <- arrow is the less than key followed by the hyphen and it means “assign to”. In this application we are assigning the values of 5,0,4,… to the vector x.
x <- c(5,0,4,0,6,5,8)
When you type the above code into your script file and run by either going to the “Run” button that appears in the top right panel of the script panel or by highlighting the code and holding down the “Ctrl” and “Return” keys. Either way, when you do this the following will appear in the console.
x
[1] 5 0 4 0 6 5 8
This is just telling you what is in the R object. Also note that the x vector also appears in the top right hand “Environment” panel. Good stuff, now we want to explore some of the matrix calculations you can do with R
Matrices are arrays of numbers generated by stacking vectors alongside each other. In econometrics these vectors are typically our data variables. Before we look at any data let me show you how we can calculate the inverse of the following matrix:
\[X = \begin{bmatrix} 14 & -4\\ -34 & 24 \end{bmatrix}\]
First we create two vectors called x1 and x2 and then join these together with the cbind function (stands for column bind) to create a matrix X.
x1 <- c(14,-34)
x2 <- c(-4, 24)
X <- cbind(x1,x2)
The code above is the code that you should be saving into your R script. Now let us review if X was saved correctly. The code below you do not need to save into your script because it is simply reviewing what is saved into the R session.
X
x1 x2
[1,] 14 -4
[2,] -34 24
Super! Now let me show you how to get the inverse of X with the solve() function.
solve(X)
[,1] [,2]
x1 0.12 0.02
x2 0.17 0.07
Remember that that any matrix either pre- or post-multiplied by its inverse should return the identity matrix with a diagonal of 1’s and all off-diagonal elements of 0’s. Let’s see this by first creating the inverted X as a new object and then multiplying by X. As you can see we are using %*% to perform matrix multiplication. The two per cent signs are used here because we want to distinguish the matrix multiplication operation from the scalar/regular multiplication operation, which is just *.
invX <- solve(X)
X %*% invX
[,1] [,2]
[1,] 1 -1.110223e-16
[2,] 0 1.000000e+00
Oooooh, close! Looks like a bit of a rounding issue here. Not being correct to the 16th decimal place is not going to mess up any analysis but maybe we can solve this problem with the `round()`` function.
round(X %*% invX, 3)
[,1] [,2]
[1,] 1 0
[2,] 0 1
That looks better! That is just a simple explanation of how you can use R to perform a matrix-inversion exercise which you would not be bothered performing by hand. How does this relate to the OLS model we are going to use again and again throughout the course? We explore this in the next section.
R Using MatricesThere is no excuse for shoehorning an obscure application involving my beloved Shamrock Rovers into my professional life. But I am going to do it anyway! The table below is from the game “Fifa20” it is the squad from the 2019 season. Here you can see data on each of 22 squad members attributes including their rating in the game.
| id | Name | Position | Preferred Foot | Height | Weight | Rating |
|---|---|---|---|---|---|---|
| 1 | J. Byrne | Midfielder | Right | 175 | 73 | 69 |
| 2 | G. Burke | Midfielder | Left | 175 | 72 | 66 |
| 3 | L. Grace | Defender | Right | 189 | 81 | 65 |
| 4 | G. Bolger | Midfielder | Right | 180 | 73 | 64 |
| 5 | A. McEneff | Midfielder | Right | 175 | 70 | 63 |
| 6 | R. Finn | Midfielder | Right | 185 | 71 | 63 |
| 7 | G. Cummins | Midfielder | Right | 178 | 75 | 63 |
| 8 | R. Lopes | Defender | Right | 186 | 80 | 62 |
| 9 | A. Greene | Forward | Left | 180 | 73 | 62 |
| 10 | D. Watts | Midfielder | Right | 178 | 68 | 61 |
| 11 | S. Kavanagh | Midfielder | Left | 174 | 62 | 61 |
| 12 | J. O’Brien | Defender | Right | 180 | 70 | 61 |
| 13 | B. Kavanagh | Midfielder | Left | 172 | 63 | 59 |
| 14 | D. Carr | Midfielder | Right | 180 | 77 | 59 |
| 15 | G. O’Neill | Midfielder | Right | 180 | 76 | 58 |
| 16 | E. Boyle | Defender | Right | 180 | 78 | 58 |
| 17 | O. Vojic | Midfielder | Right | 182 | 71 | 57 |
| 18 | N. Farrugia | Midfielder | Left | 178 | 76 | 56 |
| 19 | T. Oluwa | Midfielder | Right | 179 | 82 | 53 |
| 20 | S. Callan | Midfielder | Right | 182 | 80 | 48 |
Let’s start with a really simple regression model that looks at whether a player \(i\)’s weight is influenced by their height (taller players should be heavier): \[ weight_i = b_0 + b_1 height_i + e_i \] and we are going to estimate this using OLS with the formula \(\mathbf{b=(X'X)^{-1}X'y}\). Now we need to create the vector \(\mathbf{y}\) and the matrix \(\mathbf{X}\).
weight <- c(73, 72, 81, 73, 70, 71, 75, 80, 73, 68, 62, 70, 63, 77, 76, 78, 71, 76, 82, 80)
height <- c(175,175,189,180,175,185,178,186,180,178,174,180,172,180,180,180,182,178,179,182)
Remember it is very important that we create the \(\mathbf{X}\) matrix with a column of 1’s in the first column. This is the intercept/constant term (i.e. no variable because all the observations are the same). We can get a vector of 1’s using the rep command with the 20 observations.
X <- cbind(rep(1, 20), height)
We have both components needed to calculate the OLS coefficients. Now we just need to do the matrix operations. Here is a good summary of all the matrix operations in R: https://www.statmethods.net/advstats/matrix.html
XprimeX <- crossprod(X)
XprimeXinv <- solve(XprimeX)
XprimeXinvXprime <- XprimeXinv %*% t(X)
XprimeXinvXprimey <- XprimeXinvXprime %*% weight
XprimeXinvXprimey
[,1]
-76.9984885
height 0.8391778
Alternatively we could have crammed everything onto one line.
beta <- solve(crossprod(X)) %*% crossprod(X, weight)
beta
[,1]
-76.9984885
height 0.8391778
This means our intercept estimate is: \(b_0 \approx -77\) and the slope is: \(b_1 \approx 0.84\). Taller players are heavier, on average. This make sense. For every 1cm increase in height we expect the weight of the player to increase by 0.84kg. Also, we can use this to make predictions. For example, we expect Jack Byrne (BALLER!) who is 175cm to be: \(-77 + 0.84\times175 = 70kg\). He is 73kg, so the residual in this instance is: \(e_{Jacko} = 73-70=3kg\).
The above is a simple illustration of the various model components. What about if we wanted to calculate the full vector of predicted values (\(\mathbf{\hat{y} = Xb}\)) and residuals \(\mathbf{e = y - \hat{y}}\). This is reasonably straightforward.
y_hat <- X %*% beta
e <- weight - y_hat
And we can investigate what the first five observations in both vectors look like with the head function.
head(y_hat, 5)
[,1]
[1,] 69.85762
[2,] 69.85762
[3,] 81.60611
[4,] 74.05351
[5,] 69.85762
head(e, 5)
[,1]
[1,] 3.1423821
[2,] 2.1423821
[3,] -0.6061064
[4,] -1.0535067
[5,] 0.1423821
Note that the first observation corresponds to the predicted and residual value for Jack Bryne. In class we discussed how to estimate the standard errors for the regression coefficients using matrices. The crucial calculation in this process requires us to generate the model’s residual variance-covariance matrix: \[ Var(b) = \sigma_{e}^2 (\mathbf{X'X})^{-1} \] The \(\sigma^2_e\) is the residual sum of squares \(RSS\) divided by \(n-k = 20-2=18\) and we have already seen how the \(\mathbf{(X'X)}^{-1}\) matrix is calculated.
RSS <- t(e) %*% e
sigma <- RSS/(20-2)
Vcov <- as.numeric(sigma) * (solve(crossprod(X)))
We have to use the as.numeric() function because the \(RSS\) is a \(1 \times 1\) matrix and R does not automatically recognize that you are trying to multiply by a scalar value. The standard errors are embedded in the Vcov object in the above but we need to remember that these values are computed by taking the square root of the diagonal element.
se <- sqrt(diag(Vcov))
se
height
42.8796066 0.2389554
The standard error associated with the height variable is thus: 0.24. This is our measure of the coefficient’s precision and we can use it to perform \(t\)-tests and calculate confidence intervals. For example, the 95pc confidence interval for the slope is between 0.34 and 1.34. In other words we don’t know the true population value but we can say with 95pc certainty that this true coefficient lies on the interval between 0.34 and 1.34.
beta[2] - se[2]*qt(0.975, 18)
height
0.3371512
beta[2] + se[2]*qt(0.975, 18)
height
1.341204
I think the above example provides a very good overview (if I do say so myself!) on how R can be used to process and perform computations with matrices. In practice you will not need to do calculations like these because R has a whole series of useful functions that allow us to run regressions like this and calculate the standard errors and other information in one single command. Below I show how the data is gathered as a data.frame and then the same regression is performed with the lm() function. Note that you do not need to specify the intercept in the lm() function because R assumes that you want an intercept to be estimated by default.1
rovers <- data.frame(height, weight)
model1 <-lm(weight ~ height, data = rovers)
summary(model1)
Call:
lm(formula = weight ~ height, data = rovers)
Residuals:
Min 1Q Median 3Q Max
-7.2494 -4.1252 0.5269 2.9955 8.7857
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -76.9985 42.8796 -1.796 0.08935 .
height 0.8392 0.2390 3.512 0.00249 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.346 on 18 degrees of freedom
Multiple R-squared: 0.4066, Adjusted R-squared: 0.3736
F-statistic: 12.33 on 1 and 18 DF, p-value: 0.00249
We need to load ggplot2 and ggrepel for the plots and labels we are going to use.
library(ggplot2)
library(ggrepel)
# add labels
rovers$name <- c("J. Byrne","G. Burke","L. Grace","G. Bolger","A. McEneff","R. Finn",
"G. Cummins","R. Lopes","A. Greene","D. Watts","S. Kavanagh","J. O'Brien",
"B. Kavanagh","D. Carr","G. O'Neill","E. Boyle","O. Vojic","N. Farrugia",
"T. Oluwa","S. Callan")
ggplot(data = rovers, aes(x = height, y = weight, label = name)) +
theme_minimal(base_size = 20, base_family = "serif") +
geom_point(color="darkblue", alpha = 0.25, size = 2) +
geom_smooth(se= F, size = 1, col = "red", method="lm")+
geom_text_repel() +
xlab("Height (cm)") +
ylab("Weight (kg)")
You will need to explicitly specify that you do not want an intercept by adding a \(-1\) to the R formula in lm. Here, this would be lm(weight ~ -1 + height, data = rovers)↩︎