The crime dataset from the R package Ecdat contains data on the number of reported crimes (variable reports) for the 90 counties of the state of North Carolina for the year 1984. The data also contain information on the number of employed police officers (variable police) and the tax revenue in millions of dollars (variable taxrev) for those counties.
Suppose we regress the number of reported crimes on the number of police officers and the tax revenue. R reports the following output, some of which has been hidden:
Multiple Regression Analysis:
3 regressors(including intercept) and 90 observations
lm(formula = reports ~ police + taxrev, data = crime)
Coefficients:
            Estimate  Std Error  t value  p value
(Intercept)  22.0700   [hidden]     4.57    0.000
police        1.0930     0.4792 [hidden] [hidden]
taxrev      [hidden]     0.1555 [hidden]    0.273
---
Standard Error of the Regression: [hidden]
Multiple R-squared: 0.0775 Adjusted R-squared: 0.056
Overall F stat: [hidden] on 2 and 87 DF, pvalue= 0.03
For reference, two R commands and their results are provided below:
qt(p = 0.273/2, df = 87)
## [1] -1.10316
qt(p = 0.025, df = 87)
## [1] -1.987608
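As a hedged aside on reading the two reference commands: qt() is base R's quantile function for Student's t distribution and pt() is the matching cumulative distribution function, so applying pt() to a qt() result recovers the original probability. The sketch below uses only base R; the object names (q_lower, p_back, q_upper) are illustrative and not part of the exam.

```r
# qt() returns the t quantile for a given lower-tail probability.
q_lower <- qt(p = 0.025, df = 87)   # matches the reference output above

# pt() is the CDF, so it inverts qt() and recovers the probability.
p_back <- pt(q_lower, df = 87)

# The t distribution is symmetric around zero, so the upper-tail
# quantile is the negative of the lower-tail quantile.
q_upper <- qt(p = 0.975, df = 87)

print(c(q_lower = q_lower, p_back = p_back, q_upper = q_upper))
```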
1a. [1.5 points] How do you interpret the coefficient on police? In other words, write an English sentence that explains what the coefficient on police means.
1b. [1 point] Is the coefficient on police statistically significantly different from zero at the 5% significance level (that is, at 95% confidence)? Why?
1c. [1.5 points] Calculate the estimated coefficient on taxrev.
1d. [2 points] Using the fact that the Regression Sum of Squares (often abbreviated \(SSR\)) is equal to 1940.2, calculate the standard error of the regression (often denoted \(s\)).
2a. [1.5 points] Suppose you have two objects in the R global environment: (1) the length-\(n\) vector \(Y\) and (2) the \(n \times k\) matrix \(X\). Write the R code that calculates the vector of least squares coefficient estimates. Store the calculated result in an object named \(b\).
2b. [1.5 points] What does the following R code do?
rt(n=100, df=10)
Use the following matrices to answer the two sub-parts of this question:
\[ (X'X)^{-1} = \begin{bmatrix} 0.2 & -0.1 \\ -0.1 & 0.5 \end{bmatrix} \qquad X'Y = \begin{bmatrix} 1 \\ 5 \end{bmatrix} \]
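For anyone who wants to check hand calculations afterwards, the two matrices can be entered in R roughly as follows. This is only a sketch: the object names XtX_inv and XtY are hypothetical, and the values are copied from the display above.

```r
# Hypothetical object names; values copied from the matrices above.
XtX_inv <- matrix(c( 0.2, -0.1,
                    -0.1,  0.5), nrow = 2, byrow = TRUE)
XtY <- matrix(c(1, 5), ncol = 1)

# Sanity checks before doing any arithmetic:
isSymmetric(XtX_inv)  # (X'X)^{-1} should be symmetric
dim(XtX_inv)          # 2 x 2
dim(XtY)              # 2 x 1, so XtX_inv and XtY are conformable
```

Filling the matrix with byrow = TRUE keeps the code layout identical to the printed matrix, which makes transcription errors easier to spot.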
3a. [2 points] Compute the least squares estimate vector \(\mathbf{b} = [b_0\ b_1]'\).
3b. [2 points] If \(s^2 = 3\), compute the standard errors of each of the least squares coefficients.
4a. [2 points] Let \(e_i\) be a least squares residual from the simple linear regression model. Verify that \(\sum_{i=1}^{n}e_i=0\).
4b. [1 point] Define the “hat matrix” \(P\) to be \(P = X(X'X)^{-1}X'\). Show that \(Py=\hat{y}\) where \(\hat{y}\) is the vector of least squares fitted values from the multiple regression model.
4c. [1 point] Define the “annihilator matrix” \(M\) to be \(M = I - P\) where \(P\) is defined above and \(I\) is an \(n \times n\) identity matrix. Show that \(My=e\) where \(e\) is the vector of least squares residuals from the multiple regression model.
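The definitions of \(P\) and \(M\) above can be checked numerically, though a numerical check is not a substitute for the algebraic argument the questions ask for. The sketch below assumes simulated data (the seed, sample size, and variable names are illustrative only) and compares against base R's lm.fit().

```r
set.seed(1)                      # simulated data, for illustration only
n <- 20
X <- cbind(1, rnorm(n))          # intercept column plus one regressor
y <- rnorm(n)

P <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix X (X'X)^{-1} X'
M <- diag(n) - P                          # annihilator matrix I - P

fit <- lm.fit(X, y)              # base R least squares fit

# Both comparisons should agree up to floating-point error.
same_fitted <- all.equal(as.vector(P %*% y), as.vector(fit$fitted.values))
same_resid  <- all.equal(as.vector(M %*% y), as.vector(fit$residuals))
print(c(same_fitted = isTRUE(same_fitted), same_resid = isTRUE(same_resid)))
```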
5a. [1.5 points] Geoff eats either 0, 1, or 2 powerbars each morning. He claims the number of powerbars he eats in the morning (\(X\)) positively affects his productivity during the day, measured in lines of R code written (\(Y\)). Geoff also collects some data on the amount of code he writes each day and he reports that \(\mathrm{var}(Y) = \mathrm{var}(Y \mid X = 1 \text{ bar}) = \mathrm{var}(Y \mid X = 2 \text{ bars})\). Do you agree with Geoff that his powerbar consumption affects his productivity? Why or why not?
5b. [1.5 points] A friend in the Berkeley MFE program tells you that his professor has invented a new procedure that is better than least squares for simple linear regression. This new procedure computes the slope of the fitted line using the formula below. Explain why your friend is wrong.
\[ b_1 = \frac{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})X_i}{\frac{1}{n} \sum_{i=1}^{n}(X_i - \bar{X})^2} \]