Setup:
knitr::opts_chunk$set(echo = TRUE, fig.width = 6, fig.height = 5)
library(tidyverse)
library(readr)
echo = TRUE to make sure the R code is
displayed.In this homework problem, you will
Suppose a multiple linear regression model given below
\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} +\beta_3 X_{i3} +\beta_4 X_{i4} +\beta_5 X_{i5} + e_i, \] for \(i=1,2,3,4,5,6,7\) with \(e_i \stackrel{\mathrm{iid}}{\sim}\mbox{N}(0, \sigma^2)\).
Part 1: Let \[ Y=\begin{bmatrix}Y_1\\ \vdots\\ Y_7\end{bmatrix},\quad X=\begin{bmatrix} 1&X_{11}&X_{12}&X_{13}&X_{14}&X_{15}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 1&X_{71}&X_{72}&X_{73}&X_{74}&X_{75} \end{bmatrix},\quad \beta=\begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\\ \beta_3\\ \beta_4\\ \beta_5\end{bmatrix},\quad e=\begin{bmatrix}e_1\\ \vdots\\ e_7\end{bmatrix}. \] Then \(Y=X\beta+e\).
Part 2: \(Y\) is \(7\times 1\); \(X\) is \(7\times 6\) (intercept + 5 regressors); \(\beta\) is \(6\times 1\); \(e\) is \(7\times 1\).
Refer the Arlington data set of House price vs sqft, bed, bath, and col that we discussed in the class. A portion of the data set is given below:
Let \(Y\) be the price,
\(X_1\) be the sqft, \(X_2\) be the beds, \(X_3\) be the baths, and \(X_4\) be the col for this
multiple linear regression.
Part 1: \[ \mathbb{E}(Y\mid X)=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\beta_4 X_4,\qquad \mathrm{Var}(Y\mid X)=\sigma^2. \]
Part 2: With \(n\) Arlington homes, \[ Y=\begin{bmatrix}Y_1\\ \vdots\\ Y_n\end{bmatrix},\ X=\begin{bmatrix} 1&X_{11}&X_{12}&X_{13}&X_{14}\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ 1&X_{n1}&X_{n2}&X_{n3}&X_{n4} \end{bmatrix},\ \beta=\begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\\ \beta_3\\ \beta_4\end{bmatrix},\ e=\begin{bmatrix}e_1\\ \vdots\\ e_n\end{bmatrix}. \]
Part 3: \(Y\) is \(n\times 1\); \(X\) is \(n\times 5\) (intercept + 4 regressors); \(\beta\) is \(5\times 1\); \(e\) is \(n\times 1\).
This question is based on the National Hockey League (NHL) teams
introduced in our class. The following variable in the NHL data set. The
data set nhlfans3301 is available on Carmen.
| Variable | Description |
|---|---|
team |
Team location (city) |
country |
USA or Canada |
avgPr |
Average ticket price\(^*\) |
pgAtt |
Per game attendance |
tts |
Total ticket spending = avgPr * pgAtt * 41 |
estFans |
Estimated fans\(^{**}\) |
latitude |
Latitude of city |
We will be interested in using latitude and
estFans jointly to predict tts for our
following question. Also, for this question, we will use log(tts) and
log(estFans) for this question. Here log is scale is based on
natural logarithm.
Make a scatterplot matrix for the variables
latitude, log(estFans) and
log(tts). Use the plots to describe the pairwise
relationships between the variables.
Use R to fit a multiple linear regression model with
mean function \(E(\mathtt{log(tts)} \mid
\mathtt{log(estFans)}, \mathtt{latitude}) = \beta_0 + \beta_1
\mathtt{latitude} + \beta_2 \mathtt{log(estFans)}\) and variance
function \(\mbox{Var}(\mathtt{log(tts)} \mid
\mathtt{log(estFans)}, \mathtt{latitude}) = \sigma^2\). Use the
summary function to display the standard summary of the
fitted model.
Report the numeric values of the parameter estimates \(\hat{\beta}_j\), \(j = 0, 1, 2\), and \(\hat{\sigma}^2\).
Write the estimated mean function and the estimated varaicnce function fo this multiple linear regression.
Report the numeric value of the degrees of freedom associated with the fitted model. Say what formula you use to calculate this value.
Use R to compute and display the 3x3 matrix \(X^T X\), where X is the \(30 \times 3\) matrix used to fit the regression model.
Use R to compute and report the 3x3 matrix \((X^TX)^{-1}\).
Use the \((X^TX)^{-1}\) matrix
along with \(\hat{\sigma}^2\) to
compute the estimated covariance matrix for the estimated coefficient
vector, \(\widehat{\mbox{Var}}(\hat{\boldsymbol \beta} \mid
X)\). Show your code and computations. Verify your result by
comparing your answer to the result obtained using the vcov
function.
Compute and report the second leverage value, \(h_{22}\), for this MLR model in two ways:
(i) using the hatvalues function and (ii) by computing the
H matrix directly and reporting the second diagonal
element.
Make a plot of the residuals \(\hat{e_i}\) versus the fitted values. Does the plot suggest any problems with our assumptions about the mean function?
Make a plot of the standardized residuals \(r_i\) versus the fitted values. Does the plot suggest any problems with our assumptions about the variance function?
Make plots of (i) histogram and (ii) normal probability plot for the residuals \(\hat{e_i}\). Do the plots suggest any problems about the normality assumption?
Report the coefficient of determination for this multiple linear regression. please interpret it.
nhl <- tibble::tribble(
~team, ~tts, ~estFans, ~country, ~latitude, ~pop,
"Toronto", 184443912, 5090000, "Canada", 43.7, 9696000,
"Montreal", 125672544, 2280000, "Canada", 45.5017, 4714000,
"Vancouver", 131942920, 2260000, "Canada", 49.2827, 3526000,
"New York (Rangers)",112213392,1320000,"USA", 40.7127, 10994444,
"Philadelphia", 79797070, 1150000, "USA", 39.95, 7432000,
"Edmonton", 104940648, 1130000, "Canada", 53.5333, 1687000,
"Boston", 103703760, 1090000, "USA", 42.3601, 6272000,
"Calgary", 115097619, 1080000, "Canada", 51.0486, 1607000,
"Chicago", 155300046, 1050000, "USA", 41.8369, 9339000,
"Pittsburgh", 105334986, 850000, "USA", 40.4397, 3136000,
"Ottawa", 75556686, 780000, "Canada", 45.4214, 1488000,
"Detroit", 68972988, 740000, "USA", 42.3314, 4910000,
"Los Angeles", 74886500, 670000, "USA", 34.05, 10325944,
"Minnesota", 95932989, 620000, "USA", 44.9778, 4614000,
"New Jersey", 59161155, 580000, "USA", 40.7127, 4397778,
"New York (Islanders)",64126788,580000,"USA", 40.7127, 4397778,
"Winnipeg", 88161931, 560000, "Canada", 49.8994, 976000,
"Buffalo", 60180620, 530000, "USA", 42.9047, 1612000,
"Washington", 76739782, 500000, "USA", 38.9047, 5851000,
"Anaheim", 47736546, 390000, "USA", 33.8361, 6571056,
"San Jose", 65960882, 350000, "USA", 37.3382, 6705000,
"Colorado", 49741200, 330000, "USA", 39.7392, 933900,
"Dallas", 56196650, 320000, "USA", 32.7767, 6910000,
"St. Louis", 63868980, 320000, "USA", 38.6272, 3034000,
"Tampa Bay", 57108982, 280000, "USA", 27.7625, 4462000,
"Arizona", 35564425, 260000, "USA", 33.45, 4439000,
"Carolina", 33046656, 180000, "USA", 35.7806, 2830000,
"Columbus", 44516570, 180000, "USA", 39.9833, 2270000,
"Florida", 26788170, 170000, "USA", 25.7753, 3745000,
"Nashville", 53899092, 150000, "USA", 36.1667, 2456000
) %>%
mutate(log_tts = log(tts),
log_estFans = log(estFans))
pairs(~ latitude + log_estFans + log_tts, data = nhl,
pch = 19, col = "steelblue", main = "NHL Scatterplot Matrix")
fit <- lm(log_tts ~ latitude + log_estFans, data = nhl)
summary(fit)
##
## Call:
## lm(formula = log_tts ~ latitude + log_estFans, data = nhl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27427 -0.15533 -0.00669 0.13298 0.49315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.935854 0.615040 19.407 < 2e-16 ***
## latitude 0.014140 0.007546 1.874 0.0718 .
## log_estFans 0.421247 0.057502 7.326 7.02e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1975 on 27 degrees of freedom
## Multiple R-squared: 0.826, Adjusted R-squared: 0.8131
## F-statistic: 64.08 on 2 and 27 DF, p-value: 5.598e-11
b <- coef(fit); s2 <- summary(fit)$sigma^2
b0 <- round(b[1], 6); b1 <- round(b[2], 6); b2 <- round(b[3], 6); s2 <- round(s2, 6)
Part 4: The predicted (or fitted) value of the logarithm of total ticket spending (log(tts)) is equal to the intercept (b0) plus the coefficient on latitude (b1) times the latitude value plus the coefficient on the log of estimated fans (b2) times the log of estFans.
Part 5:
df <- df.residual(fit)
df
## [1] 27
X <- model.matrix(fit)
XtX <- t(X) %*% X
round(XtX, 2)
## (Intercept) latitude log_estFans
## (Intercept) 30.00 1209.52 398.80
## latitude 1209.52 49936.67 16177.62
## log_estFans 398.80 16177.62 5321.52
XtX_inv <- solve(XtX)
round(XtX_inv, 6)
## (Intercept) latitude log_estFans
## (Intercept) 9.698909 0.036492 -0.837780
## latitude 0.036492 0.001460 -0.007173
## log_estFans -0.837780 -0.007173 0.084777
V_manual <- s2 * XtX_inv
round(V_manual, 6)
## (Intercept) latitude log_estFans
## (Intercept) 0.378277 0.001423 -0.032675
## latitude 0.001423 0.000057 -0.000280
## log_estFans -0.032675 -0.000280 0.003306
vcov(fit)
## (Intercept) latitude log_estFans
## (Intercept) 0.378273779 1.423256e-03 -0.0326748360
## latitude 0.001423256 5.693712e-05 -0.0002797507
## log_estFans -0.032674836 -2.797507e-04 0.0033064553
h22a <- hatvalues(fit)[2]
H <- X %*% XtX_inv %*% t(X)
h22b <- H[2,2]
c(hatvalues = h22a, manual = h22b)
## hatvalues.2 manual
## 0.1261212 0.1261212
plot(fitted(fit), residuals(fit), pch=19, col="darkred",
xlab="Fitted", ylab="Residuals", main="Residuals vs Fitted")
abline(h=0, lty=2)
plot(fitted(fit), rstandard(fit), pch=19,
xlab="Fitted", ylab="Std. Residual", main="Std Residuals")
abline(h=c(-2,0,2), lty=2, col="gray")
hist(residuals(fit), main="Histogram of Residuals", col="lightblue", breaks=10)
qqnorm(residuals(fit)); qqline(residuals(fit), col="red")
R2 <- summary(fit)$r.squared