Answer the following four subparts of Question 1 as “Yes” or “No” and provide support for your answer. A simple “Yes” or “No” without support will be given zero credit. All questions ask about simple linear regression models of the form \(Y = \beta_0 + \beta_1 X + \varepsilon\), with the usual assumptions about \(\varepsilon\). The subparts are not related to each other.
1a) You run a simple linear regression in R on 5000 data points and then you view the output with the summary() command. You find that the t-statistic for the estimated coefficient (\(b_1\)) on the explanatory variable is more than 1.96. Will the p-value reported by R for that t-statistic definitely be less than 0.025?
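For reference, R reports a two-sided p-value for each t-statistic in the summary() output; a minimal R sketch of that calculation (the t value and degrees of freedom here are taken from the question):

t_stat <- 1.96                # t-statistic reported by summary()
df <- 5000 - 2                # residual degrees of freedom in simple regression
2 * pt(abs(t_stat), df = df, lower.tail = FALSE)   # two-sided p-value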
1b) Your friend Joe runs a simple linear regression and finds that both the estimated intercept coefficient (\(b_0\)) and the estimated slope coefficient (\(b_1\)) on the explanatory variable are statistically significantly different from zero at a 99% confidence level. He concludes that the relationship between \(X\) and \(Y\) is linear. Should you agree with Joe’s reasoning?
1c) Alyssa runs a simple linear regression. In her data, the sample standard deviation of \(X\) is equal to the sample standard deviation of \(Y\) (i.e., \(s_X = s_Y\)). For her regression, is the slope coefficient equal to the coefficient of determination (i.e., does \(b_1 = R^2\))?
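For reference, in simple linear regression the slope estimate and the coefficient of determination are both functions of the sample correlation \(r_{XY}\):
\[ b_1 = r_{XY}\,\frac{s_Y}{s_X} \hspace{3em} R^2 = r_{XY}^2 \]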
1d) Suppose \(X\) can take on values from 0 to 100. In your dataset, you have \(n=1000\) observations and the average \(X\) value is 30 (i.e., \(\bar{X}=30\)). Your boss asks you to calculate a prediction interval for \(Y_\text{f}|X_\text{f}=29\). You need to do this quickly. You know \(s\), but not \(s_\text{pred}\). Can you use \(s\) as a good approximation of \(s_\text{pred}\) to calculate the requested prediction interval?
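For reference, the standard error of prediction in simple linear regression is
\[ s_\text{pred} = s\sqrt{1 + \frac{1}{n} + \frac{(X_\text{f} - \bar{X})^2}{(n-1)s_X^2}} \]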
Question 2 has six sub-parts; they are not related to each other. Provide the required calculation or short answer for each sub-part.
2a) Suppose you create a portfolio from two stocks. The weights of the stocks included in the portfolio are provided in the vector \(w\) and the covariance matrix of the stocks is provided in the matrix \(\Sigma\). Calculate the variance of the portfolio.
\[ w = \begin{bmatrix} 0.3 \\ 0.7 \end{bmatrix} \hspace{3em} \Sigma = \begin{bmatrix} 2 & 0.5 \\ 0.5 & 1 \end{bmatrix} \]
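A minimal R sketch of this calculation (the portfolio variance is the quadratic form \(w'\Sigma w\)):

w <- c(0.3, 0.7)                                # portfolio weights
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)    # covariance matrix of the stocks
t(w) %*% Sigma %*% w                            # portfolio variance w' Sigma w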
2b) Suppose your dataset has 100 observations and 6 variables (the dependent variable \(Y\) and 5 independent variables \(X_1\) through \(X_5\)). You regress \(Y\) on the 5 independent variables and find an \(R^2\) value of 0.66. Calculate the overall F-statistic for this regression.
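For reference, the overall F-statistic can be recovered from \(R^2\) alone; a minimal R sketch, with \(k\) the number of independent variables:

n <- 100; k <- 5; R2 <- 0.66
(R2 / k) / ((1 - R2) / (n - k - 1))   # F-statistic on k and n-k-1 degrees of freedom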
2c) Consider a stationary AR(1) model. Is the conditional mean equal to the unconditional mean? In other words, does \(\mathbb{E}[Y_t \vert Y_{t-1}] = \mathbb{E}[Y_t]\)? Why or why not?
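For reference, a stationary AR(1) model \(Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t\) with \(|\beta_1| < 1\) has
\[ \mathbb{E}[Y_t \mid Y_{t-1}] = \beta_0 + \beta_1 Y_{t-1} \hspace{3em} \mathbb{E}[Y_t] = \frac{\beta_0}{1 - \beta_1} \]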
2d) Consider the LASSO technique. Suppose you choose a very small value for the penalty parameter \(\lambda\). Will your estimated coefficients be similar to the OLS estimates of the same set of parameters, or will they be very different from the OLS estimates? Why?
2e) Assume you would like to model \(Y\) as a linear function of \(X\), but you are concerned about heteroskedasticity. In particular, you model \(Y_i = \beta_0 + \beta_1X_i + \varepsilon_i\) with \(\varepsilon_i\) independently but not identically distributed, i.e., \(\varepsilon_i \sim \mathcal{N}(0, \sigma_i^2)\). Let \(e\) denote the vector of residuals. Calculate the least squares coefficient estimates using the following information.
\[ [X'X]^{-1} = \begin{bmatrix} 0.5 & 0.1 \\ 0.1 & 3 \end{bmatrix} \hspace{2em} X'Y = \begin{bmatrix} -4 \\ 2 \end{bmatrix} \hspace{2em} e'e = 212.91 \]
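A minimal R sketch of the calculation \(b = [X'X]^{-1}X'Y\) using the quantities above:

XtX_inv <- matrix(c(0.5, 0.1, 0.1, 3), nrow = 2)   # [X'X]^{-1}
XtY <- c(-4, 2)                                     # X'Y
XtX_inv %*% XtY                                     # least squares estimates b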
2f) Assume you have fit a multiple linear regression model with several independent variables. Suppose the quantity of interest is \(\theta = \beta_3 / \beta_2\). Explain how to find a 95% confidence interval for \(\hat{\theta}\) using bootstrapping.
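A minimal R sketch of one way to implement this bootstrap, assuming a data frame dat with hypothetical column names y, x1, x2, x3:

B <- 10000
theta_boot <- numeric(B)
for (i in 1:B) {
  idx <- sample(nrow(dat), replace = TRUE)             # resample rows with replacement
  fit <- lm(y ~ x1 + x2 + x3, data = dat[idx, ])       # refit on the bootstrap sample
  theta_boot[i] <- coef(fit)["x3"] / coef(fit)["x2"]   # bootstrap draw of theta = b3/b2
}
quantile(theta_boot, probs = c(0.025, 0.975))          # percentile 95% confidence interval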
The newfood dataset in the DataAnalytics package contains 72 observations about a new food product. One observation is one grocery store. The variables in the dataset are:
sales – average monthly unit sales of the newfood product at each grocery store
price – price in cents of the newfood product at each grocery store
income – average income of families that shop at each grocery store
city – an identifier indicating in which of 4 cities the grocery store is located
Use the information in the following two regressions to answer the four questions below.
data(newfood, package="DataAnalytics")
out1 <- lm(sales ~ price + income + I(income^2) + as.factor(city), data=newfood)
DataAnalytics::lmSumm(out1)
Coefficients:
Estimate Std Error t value p value
(Intercept) -2287.000 1401.000 -1.63 0.108
price -12.260 1.605 -7.64 0.000
income 809.700 404.800 2.00 0.050
I(income^2) -56.990 29.370 -1.94 0.057
as.factor(city)2 60.320 22.360 2.70 0.009
as.factor(city)3 -6.305 19.270 -0.33 0.745
as.factor(city)4 97.910 74.420 1.32 0.193
---
Standard Error of the Regression: 49.44
Multiple R-squared: 0.662 Adjusted R-squared: 0.619
out2 <- lm(sales ~ price + income + I(income^2), data=newfood)
DataAnalytics::lmSumm(out2)
Coefficients:
Estimate Std Error t value p value
(Intercept) 1248.00 735.10 1.70 0.094
price -12.87 1.54 -8.36 0.000
income -185.10 201.00 -0.92 0.360
I(income^2) 13.23 13.76 0.96 0.340
---
Standard Error of the Regression: 52.09
Multiple R-squared: 0.607 Adjusted R-squared: 0.577
3a) According to the first regression above (out1), at a 10% level of significance, does income have a non-linear effect on sales of the new food product? Why or why not?
3b) For given fixed values of price and income, which city has the highest average sales of the new food product?
3c) Test whether the set of city dummy variables significantly improved the regression at a 99% confidence level. Note the following:
qt(p=0.995, df=66) = 2.652
qf(p=0.990, df1=3, df2=65) = 4.098
qf(p=0.995, df1=3, df2=65) = 4.692
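A minimal R sketch of the partial F-statistic computed from the two \(R^2\) values reported above:

R2_ur <- 0.662; R2_r <- 0.607   # unrestricted (out1) and restricted (out2) R-squared
q <- 3; n <- 72; k <- 6          # q = city dummies dropped; k = regressors in out1
((R2_ur - R2_r) / q) / ((1 - R2_ur) / (n - k - 1))   # compare to qf(p=0.990, df1=3, df2=65)

Equivalently, anova(out2, out1) performs the same test directly.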
3d) What does the following R code test for? Given the result, what do you conclude?
lmtest::bptest(out1)
##
## studentized Breusch-Pagan test
##
## data: out1
## BP = 26.173, df = 8, p-value = 0.0009811
Question 4 has three subparts.
Assume the linear regression model in matrix form: \(Y = X\beta + \varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, \Sigma)\). Consider \(X\) as fixed or nonstochastic, and define the least squares estimator \(b = (X'X)^{-1}X'Y\).
4a) Show that \(b = \beta + (X'X)^{-1}X'\varepsilon\).
4b) Show that var\((b) = (X'X)^{-1}X' \Sigma X(X'X)^{-1}\).
4c) If we now assume homoskedasticity, specifically that \(\Sigma = \sigma^2 I\), show that var\((b) = \sigma^2 (X'X)^{-1}\).
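For reference, each derivation starts from the definition of \(b\) above together with the model equation; for 4b, the key identities are
\[ \text{var}(b) = \mathbb{E}\left[(b - \beta)(b - \beta)'\right] \hspace{3em} \mathbb{E}[\varepsilon\varepsilon'] = \Sigma \]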
The gtAuto dataset in the DataAnalytics package provides 91 observations of monthly US motor vehicle sales data (in millions of dollars). The data span the time period January 2004 through July 2011.
data(gtAuto, package="DataAnalytics")
library(DataAnalytics)   # provides back() and lmSumm()
gtAuto$diff <- gtAuto$sales - back(gtAuto$sales)
out <- lm(diff ~ back(diff), data=gtAuto)
lmSumm(out)
Coefficients:
Estimate Std Error t value p value
(Intercept) 73.1400 642.4000 0.11 0.910
back(diff) -0.2094 0.1046 -2.00 0.048
---
Standard Error of the Regression: 6060
Multiple R-squared: 0.044 Adjusted R-squared: 0.033
Overall F stat: 4.01 on 1 and 87 DF, pvalue= 0.048
5a) If the sales in June 2011 were 70749 and the sales in July 2011 were 69910, what are the predicted sales in August 2011 according to the fitted ARIMA(1,1,0) model above?
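A minimal R sketch of the one-step-ahead forecast mechanics implied by the fitted model, using the coefficient estimates from the output above:

b0 <- 73.14; b1 <- -0.2094   # estimates from lmSumm(out)
d_jul <- 69910 - 70749        # most recent observed first difference (July)
d_aug <- b0 + b1 * d_jul      # predicted change for August 2011
69910 + d_aug                 # predicted August 2011 sales level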
5b) The first six autocorrelations of the residuals are provided below. The p-value from a Box-Ljung test of these autocorrelations is \(4.5769 \times 10^{-5}\). What do these post-estimation analyses tell us?
cor(e_t, e_{t}) = 1
cor(e_t, e_{t-1}) = -0.023
cor(e_t, e_{t-2}) = -0.002
cor(e_t, e_{t-3}) = 0.067
cor(e_t, e_{t-4}) = -0.098
cor(e_t, e_{t-5}) = -0.073
cor(e_t, e_{t-6}) = -0.528
Suppose you fit an ARCH(1) model to 1000 days of S&P 500 returns using R’s optimizer optim(). The call to optim() returns the following maximum likelihood estimates (\(\hat{\theta}\)) and Hessian matrix (\(H\)). Use \(\hat{\theta}\) and \(H\) to answer the following two questions.
\[ \hat{\theta} = \begin{bmatrix} 1.037 \\ 0.613 \end{bmatrix} \hspace{3em} H = \begin{bmatrix} -231 & -117 \\ -117 & -267 \end{bmatrix} \]
6a) Calculate the variance-covariance matrix of the maximum likelihood estimates.
6b) Test whether the coefficients are each individually statistically significantly different from zero at the 95% confidence level. For reference, qnorm(p=0.975) = 1.9599.
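A minimal R sketch of both steps, using the standard maximum likelihood result that the variance-covariance matrix of \(\hat{\theta}\) is the inverse of the negative Hessian:

theta <- c(1.037, 0.613)
H <- matrix(c(-231, -117, -117, -267), nrow = 2)
V <- solve(-H)            # variance-covariance matrix of theta-hat
se <- sqrt(diag(V))       # standard errors of the estimates
theta / se                # z-statistics; compare absolute values to 1.9599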