Setup:

knitr::opts_chunk$set(echo = TRUE, fig.width = 6, fig.height = 5)
library(tidyverse)
library(readr)


Instructions

In this homework problem, you will


Question 1

Suppose a multiple linear regression model given below

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} +\beta_3 X_{i3} +\beta_4 X_{i4} +\beta_5 X_{i5} + e_i, \] for \(i=1,2,3,4,5,6,7\) with \(e_i \stackrel{\mathrm{iid}}{\sim}\mbox{N}(0, \sigma^2)\).


Solutions to Question 1
  • Part 1: Let \[ Y=\begin{bmatrix}Y_1\\ \vdots\\ Y_7\end{bmatrix},\quad X=\begin{bmatrix} 1&X_{11}&X_{12}&X_{13}&X_{14}&X_{15}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 1&X_{71}&X_{72}&X_{73}&X_{74}&X_{75} \end{bmatrix},\quad \beta=\begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\\ \beta_3\\ \beta_4\\ \beta_5\end{bmatrix},\quad e=\begin{bmatrix}e_1\\ \vdots\\ e_7\end{bmatrix}. \] Then \(Y=X\beta+e\).

  • Part 2: \(Y\) is \(7\times 1\); \(X\) is \(7\times 6\) (intercept + 5 regressors); \(\beta\) is \(6\times 1\); \(e\) is \(7\times 1\).

Question 2

Refer the Arlington data set of House price vs sqft, bed, bath, and col that we discussed in the class. A portion of the data set is given below:

Let \(Y\) be the price, \(X_1\) be the sqft, \(X_2\) be the beds, \(X_3\) be the baths, and \(X_4\) be the col for this multiple linear regression.


Solutions to Question 2
  • Part 1: \[ \mathbb{E}(Y\mid X)=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\beta_4 X_4,\qquad \mathrm{Var}(Y\mid X)=\sigma^2. \]

  • Part 2: With \(n\) Arlington homes, \[ Y=\begin{bmatrix}Y_1\\ \vdots\\ Y_n\end{bmatrix},\ X=\begin{bmatrix} 1&X_{11}&X_{12}&X_{13}&X_{14}\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ 1&X_{n1}&X_{n2}&X_{n3}&X_{n4} \end{bmatrix},\ \beta=\begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\\ \beta_3\\ \beta_4\end{bmatrix},\ e=\begin{bmatrix}e_1\\ \vdots\\ e_n\end{bmatrix}. \]

  • Part 3: \(Y\) is \(n\times 1\); \(X\) is \(n\times 5\) (intercept + 4 regressors); \(\beta\) is \(5\times 1\); \(e\) is \(n\times 1\).

Question 3

This question is based on the National Hockey League (NHL) teams introduced in our class. The following variable in the NHL data set. The data set nhlfans3301 is available on Carmen.

Variable Description
team Team location (city)
country USA or Canada
avgPr Average ticket price\(^*\)
pgAtt Per game attendance
tts Total ticket spending = avgPr * pgAtt * 41
estFans Estimated fans\(^{**}\)
latitude Latitude of city

We will be interested in using latitude and estFans jointly to predict tts for our following question. Also, for this question, we will use log(tts) and log(estFans) for this question. Here log is scale is based on natural logarithm.

  1. Make a scatterplot matrix for the variables latitude, log(estFans) and log(tts). Use the plots to describe the pairwise relationships between the variables.

  2. Use R to fit a multiple linear regression model with mean function \(E(\mathtt{log(tts)} \mid \mathtt{log(estFans)}, \mathtt{latitude}) = \beta_0 + \beta_1 \mathtt{latitude} + \beta_2 \mathtt{log(estFans)}\) and variance function \(\mbox{Var}(\mathtt{log(tts)} \mid \mathtt{log(estFans)}, \mathtt{latitude}) = \sigma^2\). Use the summary function to display the standard summary of the fitted model.

  3. Report the numeric values of the parameter estimates \(\hat{\beta}_j\), \(j = 0, 1, 2\), and \(\hat{\sigma}^2\).

  4. Write the estimated mean function and the estimated varaicnce function fo this multiple linear regression.

  5. Report the numeric value of the degrees of freedom associated with the fitted model. Say what formula you use to calculate this value.

  6. Use R to compute and display the 3x3 matrix \(X^T X\), where X is the \(30 \times 3\) matrix used to fit the regression model.

  7. Use R to compute and report the 3x3 matrix \((X^TX)^{-1}\).

  8. Use the \((X^TX)^{-1}\) matrix along with \(\hat{\sigma}^2\) to compute the estimated covariance matrix for the estimated coefficient vector, \(\widehat{\mbox{Var}}(\hat{\boldsymbol \beta} \mid X)\). Show your code and computations. Verify your result by comparing your answer to the result obtained using the vcov function.

  9. Compute and report the second leverage value, \(h_{22}\), for this MLR model in two ways: (i) using the hatvalues function and (ii) by computing the H matrix directly and reporting the second diagonal element.

  10. Make a plot of the residuals \(\hat{e_i}\) versus the fitted values. Does the plot suggest any problems with our assumptions about the mean function?

  11. Make a plot of the standardized residuals \(r_i\) versus the fitted values. Does the plot suggest any problems with our assumptions about the variance function?

  12. Make plots of (i) histogram and (ii) normal probability plot for the residuals \(\hat{e_i}\). Do the plots suggest any problems about the normality assumption?

  13. Report the coefficient of determination for this multiple linear regression. please interpret it.

nhl <- tibble::tribble(
  ~team,               ~tts,      ~estFans, ~country, ~latitude,  ~pop,
  "Toronto",        184443912,  5090000, "Canada",    43.7,  9696000,
  "Montreal",       125672544,  2280000, "Canada",  45.5017,  4714000,
  "Vancouver",      131942920,  2260000, "Canada",  49.2827,  3526000,
  "New York (Rangers)",112213392,1320000,"USA",   40.7127, 10994444,
  "Philadelphia",    79797070, 1150000, "USA",      39.95,  7432000,
  "Edmonton",       104940648, 1130000, "Canada",  53.5333,  1687000,
  "Boston",         103703760, 1090000, "USA",    42.3601,  6272000,
  "Calgary",        115097619, 1080000, "Canada",  51.0486,  1607000,
  "Chicago",        155300046, 1050000, "USA",    41.8369,  9339000,
  "Pittsburgh",     105334986,  850000, "USA",    40.4397,  3136000,
  "Ottawa",          75556686,  780000, "Canada",  45.4214,  1488000,
  "Detroit",         68972988,  740000, "USA",    42.3314,  4910000,
  "Los Angeles",     74886500,  670000, "USA",      34.05, 10325944,
  "Minnesota",       95932989,  620000, "USA",    44.9778,  4614000,
  "New Jersey",      59161155,  580000, "USA",    40.7127,  4397778,
  "New York (Islanders)",64126788,580000,"USA",   40.7127,  4397778,
  "Winnipeg",        88161931,  560000, "Canada",  49.8994,   976000,
  "Buffalo",         60180620,  530000, "USA",    42.9047,  1612000,
  "Washington",      76739782,  500000, "USA",    38.9047,  5851000,
  "Anaheim",         47736546,  390000, "USA",    33.8361,  6571056,
  "San Jose",        65960882,  350000, "USA",    37.3382,  6705000,
  "Colorado",        49741200,  330000, "USA",    39.7392,   933900,
  "Dallas",          56196650,  320000, "USA",    32.7767,  6910000,
  "St. Louis",       63868980,  320000, "USA",    38.6272,  3034000,
  "Tampa Bay",       57108982,  280000, "USA",    27.7625,  4462000,
  "Arizona",         35564425,  260000, "USA",      33.45,  4439000,
  "Carolina",        33046656,  180000, "USA",    35.7806,  2830000,
  "Columbus",        44516570,  180000, "USA",    39.9833,  2270000,
  "Florida",         26788170,  170000, "USA",    25.7753,  3745000,
  "Nashville",       53899092,  150000, "USA",    36.1667,  2456000
) %>%
  mutate(log_tts = log(tts),
         log_estFans = log(estFans))


Solutions to Question 3
  • Part 1:
pairs(~ latitude + log_estFans + log_tts, data = nhl,
      pch = 19, col = "steelblue", main = "NHL Scatterplot Matrix")

  • Part 2:
fit <- lm(log_tts ~ latitude + log_estFans, data = nhl)
summary(fit)
## 
## Call:
## lm(formula = log_tts ~ latitude + log_estFans, data = nhl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27427 -0.15533 -0.00669  0.13298  0.49315 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.935854   0.615040  19.407  < 2e-16 ***
## latitude     0.014140   0.007546   1.874   0.0718 .  
## log_estFans  0.421247   0.057502   7.326 7.02e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1975 on 27 degrees of freedom
## Multiple R-squared:  0.826,  Adjusted R-squared:  0.8131 
## F-statistic: 64.08 on 2 and 27 DF,  p-value: 5.598e-11
  • Part 3:
b <- coef(fit); s2 <- summary(fit)$sigma^2
b0 <- round(b[1], 6); b1 <- round(b[2], 6); b2 <- round(b[3], 6); s2 <- round(s2, 6)
  • Part 4: The predicted (or fitted) value of the logarithm of total ticket spending (log(tts)) is equal to the intercept (b0) plus the coefficient on latitude (b1) times the latitude value plus the coefficient on the log of estimated fans (b2) times the log of estFans.

  • Part 5:

df <- df.residual(fit)
df
## [1] 27
  • Part 6:
X <- model.matrix(fit)
XtX <- t(X) %*% X
round(XtX, 2)
##             (Intercept) latitude log_estFans
## (Intercept)       30.00  1209.52      398.80
## latitude        1209.52 49936.67    16177.62
## log_estFans      398.80 16177.62     5321.52
  • Part 7:
XtX_inv <- solve(XtX)
round(XtX_inv, 6)
##             (Intercept)  latitude log_estFans
## (Intercept)    9.698909  0.036492   -0.837780
## latitude       0.036492  0.001460   -0.007173
## log_estFans   -0.837780 -0.007173    0.084777
  • Part 8:
V_manual <- s2 * XtX_inv
round(V_manual, 6)
##             (Intercept)  latitude log_estFans
## (Intercept)    0.378277  0.001423   -0.032675
## latitude       0.001423  0.000057   -0.000280
## log_estFans   -0.032675 -0.000280    0.003306
vcov(fit) 
##              (Intercept)      latitude   log_estFans
## (Intercept)  0.378273779  1.423256e-03 -0.0326748360
## latitude     0.001423256  5.693712e-05 -0.0002797507
## log_estFans -0.032674836 -2.797507e-04  0.0033064553
  • Part 9:
h22a <- hatvalues(fit)[2]
H <- X %*% XtX_inv %*% t(X)
h22b <- H[2,2]
c(hatvalues = h22a, manual = h22b)
## hatvalues.2      manual 
##   0.1261212   0.1261212
  • Part 10:
plot(fitted(fit), residuals(fit), pch=19, col="darkred",
     xlab="Fitted", ylab="Residuals", main="Residuals vs Fitted")
abline(h=0, lty=2)

  • Part 11:
plot(fitted(fit), rstandard(fit), pch=19,
     xlab="Fitted", ylab="Std. Residual", main="Std Residuals")
abline(h=c(-2,0,2), lty=2, col="gray")

  • Part 12:
hist(residuals(fit), main="Histogram of Residuals", col="lightblue", breaks=10)

qqnorm(residuals(fit)); qqline(residuals(fit), col="red")

  • Part 13:
R2 <- summary(fit)$r.squared