Determine whether each of the following six claims is true or false. Provide support or explanation for your answers; simply stating true or false without support will be given no credit. The subparts of this question are not related to each other.
1a) Let Model 1 be \(Y = \alpha_1 + \beta_1 X + \varepsilon\). Let Model 2 be \(Y = \alpha_2 + \beta_2 X + \delta_2 Z + \varepsilon\). Suppose we fit each model using least squares and suppose that cov\((X,Z)=0\) in the data. Our notation will be that \(b_j\) is the estimate of \(\beta_j\). I claim that the fitted coefficients on the \(X\) variable are equal, or in other words that \(b_1 = b_2\). Is this claim true or false?
TRUE. By the omitted-variable algebra, \(b_1 = b_2 + d_2\,g\), where \(d_2\) is the estimate of \(\delta_2\) and \(g\) is the fitted slope from regressing \(Z\) on \(X\). When cov\((X,Z)=0\) in the data, \(g = 0\), so there is no “indirect” effect of \(Z\) operating through \(X\) in Model 1, and \(b_1 = b_2\).
Reference: Ch2 slides 37–49, but especially slide 49
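A quick simulation sketch (variable names and the data-generating process are illustrative): residualize \(Z\) against \(X\) so the sample covariance is exactly zero, then confirm that the two fitted slopes agree.

set.seed(1)
n <- 200
x <- rnorm(n)
z <- resid(lm(rnorm(n) ~ x))       # residualizing forces cov(x, z) = 0 in the sample
y <- 1 + 2*x + 3*z + rnorm(n)
b1 <- coef(lm(y ~ x))["x"]         # Model 1 slope
b2 <- coef(lm(y ~ x + z))["x"]     # Model 2 slope
all.equal(unname(b1), unname(b2))  # TRUE: the fitted coefficients on x are equal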
1b) Suppose our dataset has variables \(Y\), \(X_1\), and \(X_2\). We find that cov\((X_1,X_2)=0.36\). I claim that it is possible that this dataset could exhibit perfect multicollinearity between \(X_1\) and \(X_2\). Is this claim true or false?
TRUE. Perfect multicollinearity occurs when one covariate can be written as an exact linear combination of the other covariates. With only 2 covariates, this is the same as \(|\text{corr}(X_1,X_2)|=1\). We know that:
\[ \text{corr}(X_1,X_2) = \frac{\text{cov}(X_1,X_2)}{\sigma_{X_1}\sigma_{X_2}} = \frac{0.36}{\sigma_{X_1}\sigma_{X_2}} \]
So if \(\sigma_{X_1}\sigma_{X_2} = 0.36\), then this dataset could exhibit perfect multicollinearity.
Reference: Ch5 slide 6
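As a small illustrative check (made-up numbers), we can construct a dataset where cov\((X_1,X_2)=0.36\) and the two covariates are perfectly collinear:

set.seed(1)
x1 <- rnorm(100)
x1 <- 0.6 * x1 / sd(x1)   # force sd(x1) = 0.6, so var(x1) = 0.36
x2 <- x1 + 1              # exact linear function of x1: perfect multicollinearity
cov(x1, x2)               # 0.36
cor(x1, x2)               # 1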
1c) Suppose our data results from a data generating process where the errors are heteroskedastic, but we assume they are homoskedastic. We then estimate the parameters of our linear model using OLS (ordinary least squares). I claim that the OLS estimator is unbiased in this situation. Is this claim true or false?
TRUE. Heteroskedasticity concerns the variance of the error terms; it affects the standard errors (and efficiency) of the estimator, but not the unbiasedness of the parameter estimates. The OLS estimator is \(b_\text{OLS} = (X'X)^{-1}X'y\) whether or not we assume (or find) heteroskedasticity, and it remains unbiased as long as the errors have conditional mean zero.
Note that this question was intended to be a query about the estimates of the \(\beta_j\) parameters of a linear regression model, but there is ambiguity in the question. Submitted answers that discussed the unbiasedness of, e.g., \(\hat{\sigma}\) will be given credit.
Reference: Ch5 Slide 56
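A simulation sketch (assumed data-generating process) that illustrates the point: the errors are heteroskedastic, yet the OLS slope estimates average out to the true value of 2.

set.seed(1)
b_ols <- replicate(2000, {
  x <- runif(200, 1, 10)
  e <- rnorm(200, sd = 0.5 * x)   # error standard deviation grows with x (heteroskedastic)
  y <- 1 + 2 * x + e
  coef(lm(y ~ x))["x"]
})
mean(b_ols)                       # close to 2: OLS remains unbiased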
1d) I claim the Box-Ljung test is one way to check for multicollinearity. Is this claim true or false?
FALSE. The Box-Ljung test is one way to check for autocorrelation.
Reference: Ch4 slide 16
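For reference, a minimal sketch (toy simulated series) of how the Box-Ljung test is run in R:

set.seed(1)
e <- arima.sim(model = list(ar = 0.5), n = 200)    # toy series with autocorrelation
Box.test(e, lag = 12, type = "Ljung-Box")          # H0: no autocorrelation up to lag 12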
1e) Suppose you use a bootstrap method to calculate the standard error of the slope parameter for a simple linear regression model. I claim that the bootstrap distribution is centered over the true parameter value \(\beta\). Is this claim true or false?
FALSE. The bootstrap distribution here will be centered over the parameter estimate (\(b\)), not the true parameter value (\(\beta\)).
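A minimal bootstrap sketch (simulated data with true slope \(\beta = 2\)): resample rows with replacement, refit, and note that the bootstrap distribution centers on the estimate \(b\), not necessarily on \(\beta\).

set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
b <- unname(coef(lm(y ~ x, data = dat))["x"])       # original slope estimate
boot_b <- replicate(1000, {
  idx <- sample(n, replace = TRUE)                  # resample rows with replacement
  coef(lm(y ~ x, data = dat[idx, ]))["x"]
})
sd(boot_b)                                          # bootstrap standard error of the slope
c(bootstrap_mean = mean(boot_b), estimate = b)      # centered over b, not the true beta = 2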
1f) I claim that all maximum likelihood estimators are unbiased. Is this claim true or false?
FALSE. One counter-example is the MLE of the error variance, \(\hat{\sigma}^2_\text{MLE} = e'e/n\), which is biased; the unbiased estimate is \(\hat{\sigma}^2_\text{OLS} = e'e/(n-k)\).
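A simulation sketch (assumed simple-regression DGP with \(\sigma^2 = 1\) and a deliberately small \(n\)) showing the bias of \(e'e/n\) relative to \(e'e/(n-k)\):

set.seed(1)
n <- 20; k <- 2
sims <- replicate(5000, {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)                    # true sigma^2 = 1
  e <- resid(lm(y ~ x))
  c(mle = sum(e^2) / n, unbiased = sum(e^2) / (n - k))
})
rowMeans(sims)                                 # mle averages about 0.9; unbiased averages about 1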
Consider the simple linear regression model: \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\) with \(\varepsilon_i \sim \text{i.i.d. } \mathcal{N}(0,\sigma^2)\). Suppose you estimated the parameters of this model using least squares with a dataset containing 1000 observations. Some calculations using the \(X\) matrix, \(Y\) vector, and vector of residuals (\(e\)) are provided below. Use that information to test the Null Hypothesis that \(\beta_1 = 5\) at a 95% confidence level. What do you conclude?
\[ [X'X]^{-1} = \begin{bmatrix} 0.5 & 0.1 \\ 0.1 & 3 \end{bmatrix} \hspace{3em} X'Y = \begin{bmatrix} -4 \\ 2 \end{bmatrix} \hspace{3em} e'e = 212.91 \]
We proceed by calculating \(b_1\), finding \(s_{b_1}\), and then performing the t-test.
\[ b = (X'X)^{-1}X'Y = \begin{bmatrix} 0.5 & 0.1 \\ 0.1 & 3 \end{bmatrix} \begin{bmatrix} -4 \\ 2 \end{bmatrix} = \begin{bmatrix} (0.5)(-4) + (0.1)(2) \\ (0.1)(-4) + (3)(2) \end{bmatrix} = \begin{bmatrix} -2+0.2 \\ -0.4+6 \end{bmatrix} = \begin{bmatrix} -1.8 \\ 5.6 \end{bmatrix}\]
\[ s^2(X'X)^{-1} = \frac{e'e}{N-2}(X'X)^{-1} = \frac{212.91}{998} \begin{bmatrix} 0.5 & 0.1 \\ 0.1 & 3 \end{bmatrix} \approx \begin{bmatrix} 0.1067 & 0.0213 \\ 0.0213 & 0.6400 \end{bmatrix} \]
\[ s_{b_1} = \sqrt{0.64} = 0.8 \]
\[ t_{b_1} = \frac{5.6 - 5}{0.8} = 0.75 \]
The critical value is not provided, but with 998 degrees of freedom the t distribution is essentially standard normal, so the critical value is approximately 1.96. Our calculated t-statistic (0.75) is well below that, so we fail to reject the Null Hypothesis that \(\beta_1 = 5\).
Reference: Ch1 Slide 86; Ch2 Slide 7; and Ch3 Slides 3, 12, and 25-26
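These calculations can be reproduced in R with a few matrix operations using the quantities given in the problem:

XtX_inv <- matrix(c(0.5, 0.1, 0.1, 3), nrow = 2)
XtY     <- c(-4, 2)
b       <- XtX_inv %*% XtY             # b = (-1.8, 5.6)
s2      <- 212.91 / (1000 - 2)         # s^2 = e'e / (N - 2)
se_b1   <- sqrt(s2 * XtX_inv[2, 2])    # 0.8
(b[2] - 5) / se_b1                     # t-statistic = 0.75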
Alternatively, you can do this with an F statistic.
\[ R = \begin{bmatrix} 0 & 1 \end{bmatrix} \hspace{2em} b = \begin{bmatrix} -1.8 \\ 5.6 \end{bmatrix} \hspace{2em} r = 5 \hspace{2em} q = 1 \hspace{2em} s^2 = \frac{e'e}{N-k} \]
\[ F = \frac{(Rb-r)'[R(X'X)^{-1}R']^{-1}(Rb-r)}{qs^2} = \frac{(5.6-5)\times(1/3)\times(5.6-5)}{212.91/998} = 0.5625 = 0.75^2 \]
To draw a conclusion, we also need the F critical value, which can be computed in R (below). Since \(F = 0.5625 < 3.85\), we again fail to reject the Null Hypothesis, consistent with the t-test:
qf(p=0.95, df1=1, df2=998)
## [1] 3.850793
Reference: Ch3 Slide 32
Use the following information to answer the questions below.
The Flat_Panel_TV dataset contains data on 70 televisions for sale. The data include the following variables:
Price – the price of the television in dollars
Size – the diagonal length in inches of the screen
Brand – one of LG, Panasonic, or Samsung
Type – either LED or Plasma
A summary of the data is as follows:
summary(Flat_Panel_TV)
## Price Size Brand Type
## Min. : 499.0 Min. :32.00 LG :21 LED :36
## 1st Qu.: 927.5 1st Qu.:46.00 Panasonic:17 Plasma:34
## Median :1335.5 Median :50.00 Samsung :32
## Mean :1423.2 Mean :49.57
## 3rd Qu.:1795.0 3rd Qu.:55.00
## Max. :4049.0 Max. :65.00
The code and abbreviated output for two different linear regressions with these data are provided below:
lmSumm(lm(log(Price) ~ log(Size) + Type + Brand, data=Flat_Panel_TV))
Coefficients:
Estimate Std Error t value p value
(Intercept) -0.91070 0.72990 -1.25 0.217
log(Size) 2.09100 0.19040 10.98 0.000
TypePlasma -0.25710 0.07322 -3.51 0.001
BrandPanasonic -0.03968 0.08850 -0.45 0.655
BrandSamsung 0.17450 0.06717 2.60 0.012
---
Standard Error of the Regression: 0.2388
Multiple R-squared: 0.712 Adjusted R-squared: 0.695
Overall F stat: 40.22 on 4 and 65 DF, pvalue= 0
lmSumm(lm(log(Price) ~ log(Size) + Type, data=Flat_Panel_TV))
Coefficients:
Estimate Std Error t value p value
(Intercept) -1.3040 0.73370 -1.78 0.08
log(Size) 2.2190 0.19160 11.58 0.00
TypePlasma -0.3254 0.06605 -4.93 0.00
---
Standard Error of the Regression: 0.2531
Multiple R-squared: 0.667 Adjusted R-squared: 0.657
Overall F stat: 67.1 on 2 and 67 DF, pvalue= 0
3a) According to the first regression, which brand sells the cheapest 56-inch Plasma TVs?
Panasonic. Size and type are the same across the three options, so only the brand dummies matter: LG is the baseline (coefficient 0), BrandPanasonic is \(-0.03968\), and BrandSamsung is \(0.17450\). Panasonic therefore has the lowest brand-specific intercept and hence the lowest predicted price.
Reference: Ch5 Slides 49-51
3b) How do you interpret the coefficient on log(Size) in the first regression?
In a log-log regression, the coefficient can be interpreted as an elasticity. Here we can say the price-size elasticity is approximately 2, holding TV type and brand constant. Equivalently, a one-percent change in TV size leads to an approximate two-percent change in price, holding TV type and brand constant.
Reference: Ch5 Slide 46
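As a quick numeric check of this interpretation using the fitted coefficient:

exp(2.091 * log(1.01))   # about 1.021: a 1% increase in Size raises predicted Price by about 2.1%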
3c) According to the first regression, what is the predicted price of a 65-inch Samsung LED TV?
The predicted price is $2,958.
\(\log(P) = -0.91070 + 2.091\log(65) - 0.2571(0) - 0.03968(0) + 0.1745(1) = 7.992\)
\(P = e^{7.992} = 2958\)
Reference: Ch2 Slide 35 and Ch5 Slides 41-42
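The same prediction can be sketched with predict(); the model object name fit1 is illustrative:

fit1  <- lm(log(Price) ~ log(Size) + Type + Brand, data = Flat_Panel_TV)
newtv <- data.frame(Size = 65, Type = "LED", Brand = "Samsung")
exp(predict(fit1, newdata = newtv))    # roughly $2,958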
3d) Test whether the set of brand dummy variables significantly improved the regression at a 95% confidence level. Note that qf(p=0.95, df1=2, df2=65) = 3.138.
\[ F = \frac{(R^2_\text{full} - R^2_\text{restricted})/k_2}{(1-R^2_\text{full})/(N-k_1-k_2-1)} = \frac{(0.712 - 0.667)/2}{(1-0.712)/65} = \frac{0.0225}{0.00443} = 5.08\]
Because \(F_\text{partial} > F_\text{critical}\) (\(5.08>3.138\)) the excluded variables are jointly statistically significant at the 95% confidence level and thus we can reject the (implied) null hypothesis that the brand variables do not improve the regression.
Reference: Ch2 Slide 31
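Equivalently, the partial F test can be obtained by comparing the two fitted models directly (object names are illustrative):

full       <- lm(log(Price) ~ log(Size) + Type + Brand, data = Flat_Panel_TV)
restricted <- lm(log(Price) ~ log(Size) + Type, data = Flat_Panel_TV)
anova(restricted, full)    # F on 2 and 65 df; compare the p-value to 0.05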
The mammals dataset has information on body weight (in kilograms) and brain weight (in grams) for 62 mammals. Suppose we regress brain weight on body weight and its square:
data(mammals, package="DataAnalytics")
mammals$bodywgt2 <- mammals$bodywgt^2
lmSumm(lm(brainwgt ~ bodywgt + bodywgt2, data=mammals))
Coefficients:
Estimate Std Error t value p value
(Intercept) 20.1600000 2.747e+01 0.73 0.466
bodywgt 2.1230000 1.179e-01 18.00 0.000
bodywgt2 -0.0001893 1.870e-05 -10.12 0.000
4a) Does this regression suggest that there is a nonlinear relationship between body weight and brain weight? Why or why not?
Yes, because the coefficient on the squared term is statistically significant.
Note however that we will never know the true relationship; it is possible that our particular dataset exhibits nonlinearity when in fact the true data generating process is linear.
It is also worth noting that the “small” coefficient on the squared term is not, by itself, informative about the (non)linearity; that coefficient must be put in the context of the range of values for mammal body weight in the dataset. While humans weigh roughly 75 kg, whales (which are included in the dataset) weigh approximately 6,000 kg, and at such a value the “small” coefficient has a not-so-small effect.
Reference: Ch2 Slide 14
4b) Write the formula that describes the expected change in brain weight for a small change in body weight, according to the fitted regression.
Denote brain weight as \(Y\) and body weight as \(X\). Then, according to our model:
\[ \frac{\partial \hat{Y}}{\partial X} = b_1 + 2b_2 X = 2.123 - 0.0003786X \]
Alternatively:
\[ \hat{Y}_1 = b_0 + b_1X + b_2X^2 \hspace{1em} \text{and} \hspace{1em} \hat{Y}_2 = b_0 + b_1(X+\delta) + b_2(X + \delta)^2 \]
Such that:
\[ \hat{Y}_2 - \hat{Y}_1 = b_1\delta + b_2\delta^2 + 2b_2\delta X \approx b_1\delta + 2b_2\delta X = (b_1 + 2b_2X)\delta\]
In either case, notice that the expected change in \(\hat{Y}\) depends on the value of \(X\): the slope changes as we move along a nonlinear curve.
Reference: Ch5 Slide 36
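Evaluating this formula at two very different body weights (75 kg and 6,000 kg, the values mentioned in 4a) shows how much the slope changes:

b1 <- 2.123
b2 <- -0.0001893
marginal_effect <- function(x) b1 + 2 * b2 * x   # d(brainwgt)/d(bodywgt) at bodywgt = x
marginal_effect(c(75, 6000))                     # about 2.09 at 75 kg vs about -0.15 at 6000 kg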
5a) Suppose we would like to forecast the log-GDP of the United States. We are debating whether to use a linear-trend model or a simple auto-regressive (i.e., an AR(1)) model. How could we decide which model is more appropriate?
One in-sample method is to estimate both models and then assess the autocorrelation of the residuals, either visually with the acf() function or statistically with the Box-Ljung test (see the sketch below). If one model has autocorrelation left in its residuals and the other does not, the model without residual autocorrelation is more appropriate. Notice that we cannot simply apply acf() to the “raw” series itself: an upward-trending series will show autocorrelation, but that does not help us distinguish between the two models.
One out-of-sample method is to fit each model on a training portion of the data and generate out-of-sample predictions for a hold-out portion. We can then score the models on the quality of their predictions (with, e.g., mean squared error); the model with the lower prediction error would be deemed more appropriate.
Reference: Ch4 Slides 32-34
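A sketch of the in-sample residual check, assuming lnGDP is a vector of log-GDP values and using back() (the lag helper from the DataAnalytics package used elsewhere in these solutions):

t_idx     <- seq_along(lnGDP)
trend_fit <- lm(lnGDP ~ t_idx)           # linear-trend model
ar1_fit   <- lm(lnGDP ~ back(lnGDP))     # AR(1) model
acf(resid(trend_fit))                    # visual check for leftover autocorrelation
acf(resid(ar1_fit))
Box.test(resid(trend_fit), lag = 12, type = "Ljung-Box")   # formal Box-Ljung tests
Box.test(resid(ar1_fit), lag = 12, type = "Ljung-Box")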
5b) Suppose instead we decide to use an ARIMA(1,1,0) model. The code below estimates the ARIMA(1,1,0) model on the natural log of monthly US GDP data for the 287 months ending on November 2018. Using the data below and the fitted ARIMA(1,1,0) model, what is the predicted GDP in December 2018? (Hint: not the log GDP)
Date ln(GDP)
2018-08-01 9.810481
2018-09-01 9.815965
2018-10-01 9.826152
2018-11-01 9.834753
lmSumm(lm(diff(lnGDP)~back(diff(lnGDP))))
Coefficients:
Estimate Std Error t value p value
(Intercept) 0.005 0.0006724 7.44 0
back(diff(lnGDP)) 0.360 0.0553200 6.51 0
\[\begin{align*} \log(Y_\text{Dec}) - \log(Y_\text{Nov}) &= 0.005 + 0.36 \times [ \log(Y_\text{Nov}) - \log(Y_\text{Oct}) ] &\\ &= 0.005 + 0.36(9.834753 - 9.826152) &\\ &= 0.00809636 & \end{align*}\]
\[\begin{align*} \log(Y_\text{Dec}) &= \log(Y_\text{Nov}) + 0.00809636 &\\ &= 9.834753 + 0.00809636 &\\ &= 9.842849 & \end{align*}\]
\[\begin{align*} Y_\text{Dec} &= e^{9.842849} &\\ &= 18823.27 & \end{align*}\]
Reference: Ch4 Slides 37-42
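The same arithmetic in R, using the intercept and AR coefficient from the fitted output above:

b0 <- 0.005
b1 <- 0.360
dlog_dec  <- b0 + b1 * (9.834753 - 9.826152)   # predicted change in log GDP for December
lnGDP_dec <- 9.834753 + dlog_dec               # predicted log GDP
exp(lnGDP_dec)                                 # predicted GDP of about 18,823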
6a) The complexity parameter (\(\lambda\)) in the LASSO model is usually determined by k-fold cross validation. What is k-fold cross validation? (Hint: one way to answer this question is to provide pseudo-code that outlines the steps of k-fold cross validation.)
Cross validation is a resampling technique in which we partition our data into \(k\) groups (folds). For each fold, we fit the model using the other \(k-1\) folds and then evaluate it on the held-out fold; for example, we might estimate the parameters using the data from \(k-1\) folds and then predict the outcome (\(\hat{Y}\)) for the \(k^\text{th}\) fold. Repeating this across all \(k\) folds produces an out-of-sample prediction for every observation, which can be scored (e.g., with mean squared error) for each candidate value of \(\lambda\); we pick the \(\lambda\) with the best score.
In pseudo-code:
k <- 10
fold <- sample(rep(1:k, length.out = nrow(dat)))  # randomly assign each row to one of k folds
yhat <- rep(NA, nrow(dat))
for(i in 1:k) {
  cv_group <- which(fold == i)    # rows held out in this fold
  # limit data to the other k-1 folds
  sub <- dat[-cv_group, ]
  # fit model (formula is a placeholder)
  out <- lm(y ~ ., data=sub)
  # predict on held-out fold
  yhat[cv_group] <- predict(out, newdata=dat[cv_group, ])
}
# calculate cross-validated mse
mean( (dat$y - yhat)^2 )
Reference: Ch 5 Slide 81
6b) Why does it make sense to standardize the \(X\) variables (i.e., scale each \(X_j\) variable so that it has unit variance) before using those variables in a LASSO model?
The scale of a coefficient is tied to the scale of the corresponding variable, so if we changed the units of a variable from, say, miles to feet (and changed nothing else), the estimated coefficient would rescale to exactly offset that change in units.
The LASSO imposes a penalty based on the sum of the absolute values of the \(\beta\) coefficients, so the magnitudes of those coefficients matter directly. The \(X\) variables are therefore typically standardized before fitting a LASSO model so that arbitrary units of measurement do not influence which coefficients the penalty shrinks toward zero.
Reference: Ch 5 Slide 79
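A minimal LASSO sketch tying 6a and 6b together, assuming a data frame dat with response y and using the glmnet package; cv.glmnet performs the k-fold cross-validation over \(\lambda\), and the predictors are standardized before fitting (glmnet can also do this internally via its standardize argument):

library(glmnet)
X <- scale(model.matrix(y ~ ., data = dat)[, -1])    # standardize each predictor (drop intercept column)
cvfit <- cv.glmnet(X, dat$y, alpha = 1, nfolds = 10) # alpha = 1 is the LASSO; 10-fold CV over lambda
cvfit$lambda.min                                     # lambda with the lowest cross-validated error
coef(cvfit, s = "lambda.min")                        # coefficients at that lambda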