1. Suppose we have data on MFE students’ GMAT scores and their overall GPA in the program. We would like to see if GMAT scores predict academic performance (as measured by GPA), and so we fit a linear regression. R reports this output:
lm(formula = GPA ~ GMAT)
Coefficients:
Estimate Std Error t value p value
(Intercept) -1.601000 [hidden] -0.64 [hidden]
GMAT 0.006892 0.003531 [hidden] [hidden]
---
Standard Error of the Regression: 0.5945
Multiple R-squared: 0.05 Adjusted R-squared: 0.037
Overall F stat: 3.81 on 1 and 72 DF, p-value = 0.055
1a. [2 points] Calculate the standard error of the intercept.
\(t_{b_0} = \frac{b_0-0}{se_{b_0}} \hspace{1em} \Longrightarrow \hspace{1em} se_{b_0} = \frac{b_0}{t_{b_0}} = \frac{-1.601}{-0.64} = 2.501562\)
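As a quick check, the same arithmetic in R (the object names below are ours, not part of the exam output):
b0 <- -1.601000   # intercept estimate from the output above
t0 <- -0.64       # its reported t value
b0 / t0           # implied standard error: 2.501562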
Reference: Ch1 slide 86
1b. [2 points] Is the coefficient on GMAT statistically significantly different from zero at a 95% confidence level? For reference, the R code qt(p=0.025, df=72) returns the value -1.9934.
\(t_{b_1} = \frac{b_1-0}{se_{b_1}} = \frac{0.006892}{0.003531} = 1.951855\)
No, the coefficient on GMAT is not statistically significantly different from zero at the 95% confidence level because \(|t_{b_1}| < |t^*|\) (i.e., \(1.951855 < 1.9934\)).
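A quick R check of this test (the numbers are copied from the output above; the object names are ours):
b1 <- 0.006892                  # GMAT slope estimate
se1 <- 0.003531                 # its standard error
t1 <- b1 / se1                  # 1.951855
tcrit <- qt(p=0.025, df=72)     # -1.9934, as given in the question
abs(t1) < abs(tcrit)            # TRUE, so we fail to reject the null at the 5% level
2 * pt(q=-abs(t1), df=72)       # two-sided p-value, about 0.055 (matches the overall F test)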
Reference: Ch1 slides 89 and 99
1c. [2 points] Kate and Kalyan are two new MFE students not included in the regression. Kate’s GMAT score is 100 points higher than Kalyan’s. How much better do we expect Kate’s GPA to be?
\(\hat{GPA}_\text{Kalyan} = b_0 + b_1 \times GMAT_\text{Kalyan}\)
\(\hat{GPA}_\text{Kate} = b_0 + b_1 \times GMAT_\text{Kate}\)
\(\hat{GPA}_\text{Kate} - \hat{GPA}_\text{Kalyan} = b_1 \times (GMAT_\text{Kate} - GMAT_\text{Kalyan}) = 0.006892 \times 100 = 0.6892 \approx 0.7\)
From question 1b above, however, we see that \(b_1\) is not statistically significant at a 95% confidence level. While we can still calculate the difference in expected GPAs according to our model (i.e., Kate’s expected GPA is approximately 0.7 points higher than Kalyan’s expected GPA), we might worry about how precisely we are estimating the relationship between GMAT scores and GPAs, and that this lack of precision would translate into a wide prediction interval.
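A minimal R sketch of the point estimate (if the fitted model object and the two students’ GMAT scores were available, predict() with interval="prediction" would give the prediction interval mentioned above):
b1 <- 0.006892   # GMAT slope estimate from the output above
b1 * 100         # expected difference in GPA: 0.6892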
Reference: Ch1 slide 100
2a. [1 point] Suppose you are working in R. Your Global Environment has a dataframe named “DF”. DF has two columns: a column named “Y” and a column named “X”. Write R code to run a regression of Y on X and store the output in an object named “out”.
out <- lm(Y ~ X, data=DF)
Reference: Ch1 slide 20
2b. [2 points] Suppose your code from the last question worked. Write R code to create a scatterplot of the residuals from the regression (on the vertical axis) against the X variable (on the horizontal axis).
plot(x=DF$X, y=out$residuals)
Reference: Ch2 slide 10
2c. [2 points] Fill in the blanks:
qnorm(p=0.975) = \(1.959964 \approx 1.96\)
pt(q=0, df=37) = \(0.5\)
Reference: Ch1 slides 93 and 98
3a. [3 points] Let \(X\) be an \(n \times k\) matrix and \(e\) be an \(n\)-length vector of least squares residuals from a multiple regression. Show \(X'e=0\).
We will use the facts that \(e = y - \hat{y}\), that \(\hat{y} = Xb\), and that \(b=(X'X)^{-1}X'y\). We also use the property that \((X'X)(X'X)^{-1}=I\).
\[\begin{align} \hspace{4em} X'e &= X'(y - \hat{y}) &\\ &= X'(y - Xb) &\\ &= X'(y - X(X'X)^{-1}X'y) &\\ &= X'y - X'X(X'X)^{-1}X'y &\\ &= X'y - IX'y &\\ &= 0 \end{align}\]Note that \((X'X)^{-1} \ne X^{-1}X'^{-1}\). An inverse can only be taken of a non-singular square matrix and, in general, \(X\) does not have this property.
Alternatively, we can multiply out the matrices:
\[ X'e = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_{11} & x_{12} & \ldots & x_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{k1} & x_{k2} & \ldots & x_{kn} \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n e_i \\ \sum_{i=1}^n x_{1i} e_i \\ \vdots \\ \sum_{i=1}^n x_{ki} e_i \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]
Now to show \(X'e=0\), we need to show that each element of the resulting vector equals zero.
It is fine to state that \(\sum_i e_i = 0\) as “something you learned in class”, but you could also prove it:
\[\begin{align*} \hspace{4em} \sum_{i=1}^n e_i &= \sum_{i=1}^n (y_i - \hat{y}_i) &\\ &= \sum_{i=1}^n (y_i - b_0 - b_1x_i) &\\ &= \sum_{i=1}^n \left( y_i - (\bar{y} - b_1\bar{x}) - b_1x_i \right) &\\ &= \sum_{i=1}^n y_i - \sum_{i=1}^n \bar{y} + \sum_{i=1}^n b_1\bar{x} - \sum_{i=1}^n b_1x_i &\\ &= n\bar{y} - n\bar{y} + nb_1\bar{x} - nb_1\bar{x} &\\ &= 0 \end{align*}\]To show that \(\sum_i x_{ji}e_i = 0\) for an arbitrary \(j = 1, \ldots, k\), it is fine to state that \(corr(x_{ji}, e_i) = 0\) for all \(j\), which implies that \(cov(x_{ji},e_i) = 0\) for all \(j\). From this we find:
\[ cov(x_{ji}, e_i) = \tfrac{1}{n-1}\sum_i \left( x_{ji} - \bar{x}_j \right) \left( e_i - \bar{e} \right) = \tfrac{1}{n-1}\sum_i \left( x_{ji} - \bar{x}_j \right) e_i = \tfrac{1}{n-1}\left( \sum_i x_{ji} e_i - \bar{x}_j \sum_i e_i \right) = \tfrac{1}{n-1}\sum_i x_{ji} e_i = 0 \]
Alternatively, you could directly show that \(\sum_i x_{ji} e_i = 0\) for an arbitrary \(j\), but the proof with summation notation is longer and more involved than what we had in mind for an exam. It is easiest to notice that \(\sum_i x_{ji} e_i = 0\) is a first-order condition that results from minimizing the sum of squared errors:
\[ \frac{\partial}{\partial\beta_j}\sum_i \left( y_i - \beta_0 - \beta_1x_{1i} - \ldots - \beta_kx_{ki} \right)^2 = 0 \hspace{1em} \Longrightarrow \hspace{1em} \sum_i x_{ji} \left( y_i - b_0 - b_1x_{1i} - \ldots - b_kx_{ki} \right) = \sum_i x_{ji}e_i = 0 \]
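A numerical sanity check of \(X'e=0\) in R, using a small simulated dataset (the data, sample size, and object names are ours, not from the course):
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2*x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)      # n x (k+1) matrix whose first column is all ones
t(X) %*% fit$residuals      # every entry is zero up to numerical precision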
Reference: Ch3 slides 3–5 and basic linear algebra
3b. [3 points] You regress \(Y\) on \(X\) (e.g., lm(Y~X)). The \(X\) values and the residuals are shown below. Are these residuals consistent with a linear regression model or not? Why or why not?
\[ X = \begin{bmatrix} 2 \\ 4 \\ 1 \\ 3 \end{bmatrix} \hspace{3em} e = \begin{bmatrix} -1 \\ 2 \\ -2 \\ 1 \end{bmatrix}\]
No, these residuals are not consistent with linear regression. While \(\sum_{i=1}^4 e_i = 0\), which is consistent with linear regression, we find that \(cov(X,e) = \frac{1}{3}\sum_{i=1}^4(X_i - \bar{X})(e_i) = 2.33\), which is not consistent with linear regression because residuals from a linear regression have the property that \(cov(X,e) = 0\).
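The same check in R:
X <- c(2, 4, 1, 3)
e <- c(-1, 2, -2, 1)
sum(e)      # 0, as required of least squares residuals
cov(X, e)   # 2.33, which would be 0 if e were the residuals from lm(Y ~ X)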
Reference: Ch1 slides 35–38 and 109, Ch2 slide 9
4. [3 points] The rat dataset from the DataAnalytics package has three variables: \(y\), \(x_1\), and \(x_2\). Below are coefficient estimates for three regressions using data from the rat dataset.
First, a regression of \(y\) on \(x_1\) and \(x_2\):
Estimate Std. Error t value
(Intercept) 0.178357 0.227775 0.7830
x1 0.035349 0.151375 0.2335
x2 1.232600 2.041265 0.6038
Second, a regression of \(x_1\) on \(x_2\):
Estimate Std. Error t value
(Intercept) 1.18864 0.22378 5.3117
x2 6.74253 2.83235 2.3805
Third, a regression of \(x_2\) on \(x_1\):
Estimate Std. Error t value
(Intercept) 0.014504 0.026834 0.5405
x1 0.037080 0.015576 2.3805
Suppose you run the simple regression of \(y\) on \(x_1\). Calculate what the estimated slope coefficient would be.
\[ b_1^s = b_1^m + c_1 b_2^m = 0.035349 + 0.03708 \times 1.2326 = 0.08105 \]
In words, the estimated simple regression coefficient on \(x_1\) (\(b_1^s\)) equals \(x_1\)’s direct effect on \(y\) plus an indirect effect that operates through \(x_2\). The direct effect is \(x_1\)’s estimated multiple regression coefficient (\(b_1^m\)). The indirect effect is \(c_1 b_2^m\): a one-unit increase in \(x_1\) is associated with a \(c_1\)-unit change in \(x_2\), where \(c_1\) is the estimated slope coefficient from a regression of \(x_2\) on \(x_1\), and each unit of \(x_2\) moves \(y\) by \(b_2^m\), \(x_2\)’s estimated multiple regression coefficient.
Another way to see this is to “break” \(x_2\) into the part explained by \(x_1\) and a part not explained by \(x_1\) that I will call \(\nu\):
\[ x_2 = \gamma_0 + \gamma_1x_1 + \nu \]
Then,
\[\begin{align*} y &= \beta_0^m + \beta_1^m x_1 + \beta_2^m x_2 + \varepsilon &\\ &= \beta_0^m + \beta_1^m x_1 + \beta_2^m (\gamma_0 + \gamma_1x_1 + \nu) + \varepsilon &\\ &= (\beta_0^m + \beta_2^m\gamma_0) + (\beta_1^m + \beta_2^m\gamma_1)x_1 + (\varepsilon + \nu) &\\ &= \beta_0^s + \beta_1^sx_1 + \varepsilon' & \end{align*}\]So we can see theoretically that \(\beta_1^s = \beta_1^m + \beta_2^m\gamma_1\).
When we regress \(x_2\) on \(x_1\), we get \(x_2 = \hat{x}_2 + u = c_0 + c_1x_1 + u\), where each \(c_j\) is an estimate of \(\gamma_j\) and \(u\) is this regression’s residual. From the estimated multiple regression we have that:
\[\begin{align*} y &= \hat{y} + e &\\ &= (b_0^m + b_1^m x_1 + b_2^m x_2) + e &\\ &= b_0^m + b_1^m x_1 + b_2^m (c_0 + c_1x_1 + u) + e &\\ &= (b_0^m + b_2^mc_0) + (b_1^m + b_2^mc_1)x_1 + (b_2^mu + e) & \end{align*}\]This shows that \(b_1^s = b_1^m + b_2^mc_1\).
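If the rat data are loaded (assuming the DataAnalytics package is installed and provides the dataset as described in the question), the decomposition can be verified directly:
library(DataAnalytics)               # assumed available
data(rat)
multi <- lm(y ~ x1 + x2, data=rat)   # gives b1^m and b2^m
aux <- lm(x2 ~ x1, data=rat)         # gives c1
simple <- lm(y ~ x1, data=rat)       # gives b1^s
coef(multi)["x1"] + coef(multi)["x2"] * coef(aux)["x1"]   # should reproduce the simple slope
coef(simple)["x1"]                                        # approximately 0.08105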
Reference: Ch2 slides 37–49, but especially slide 49