Stat 3301: Homework 5

Setup:

knitr::opts_chunk$set(echo = TRUE, fig.width = 6, fig.height = 5)
library(tidyverse)
library(readr)

Instructions

Replace “FirstName LastName (name.n)” above with your information.
Provide your solutions below in the spaces marked “Solution:”.
Include any R code that you use to answer the questions; if a numeric answer is required, show how you calculated it in R. I have set the global option echo = TRUE to make sure the R code is displayed.
Knit this document to HTML and upload both the HTML file and your completed Rmd file to Carmen
Make sure your solutions are clean and easy-to-read by
- formatting all plots to be appropriately sized, with appropriate axis labels.
- only including R code that is necessary to answer the questions below.
- only including R output that is necessary to answer the questions below (avoiding lengthy output).
- providing short written answers explaining your work, and writing in complete sentences.

In this homework problem, you will

Multiple linear regression parameter estimation, constructing confidence interval

Question 1

Suppose a multiple linear regression model given below

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} +\beta_3 X_{i3} +\beta_4 X_{i4} +\beta_5 X_{i5} + e_i, \] for \(i=1,2,3,4,5,6,7\) with \(e_i \stackrel{\mathrm{iid}}{\sim}\mbox{N}(0, \sigma^2)\).

Solutions to Question 1

Part 1: Let \[ Y=\begin{bmatrix}Y_1\\ \vdots\\ Y_7\end{bmatrix},\quad X=\begin{bmatrix} 1&X_{11}&X_{12}&X_{13}&X_{14}&X_{15}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 1&X_{71}&X_{72}&X_{73}&X_{74}&X_{75} \end{bmatrix},\quad \beta=\begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\\ \beta_3\\ \beta_4\\ \beta_5\end{bmatrix},\quad e=\begin{bmatrix}e_1\\ \vdots\\ e_7\end{bmatrix}. \] Then \(Y=X\beta+e\).
Part 2: \(Y\) is \(7\times 1\); \(X\) is \(7\times 6\) (intercept + 5 regressors); \(\beta\) is \(6\times 1\); \(e\) is \(7\times 1\).

Question 2

Refer the Arlington data set of House price vs sqft, bed, bath, and col that we discussed in the class. A portion of the data set is given below:

Let \(Y\) be the price, \(X_1\) be the sqft, \(X_2\) be the beds, \(X_3\) be the baths, and \(X_4\) be the col for this multiple linear regression.

Solutions to Question 2

Part 1: \[ \mathbb{E}(Y\mid X)=\beta_0+\beta_1 X_1+\beta_2 X_2+\beta_3 X_3+\beta_4 X_4,\qquad \mathrm{Var}(Y\mid X)=\sigma^2. \]
Part 2: With \(n\) Arlington homes, \[ Y=\begin{bmatrix}Y_1\\ \vdots\\ Y_n\end{bmatrix},\ X=\begin{bmatrix} 1&X_{11}&X_{12}&X_{13}&X_{14}\\ \vdots&\vdots&\vdots&\vdots&\vdots\\ 1&X_{n1}&X_{n2}&X_{n3}&X_{n4} \end{bmatrix},\ \beta=\begin{bmatrix}\beta_0\\ \beta_1\\ \beta_2\\ \beta_3\\ \beta_4\end{bmatrix},\ e=\begin{bmatrix}e_1\\ \vdots\\ e_n\end{bmatrix}. \]
Part 3: \(Y\) is \(n\times 1\); \(X\) is \(n\times 5\) (intercept + 4 regressors); \(\beta\) is \(5\times 1\); \(e\) is \(n\times 1\).

Question 3

This question is based on the National Hockey League (NHL) teams introduced in our class. The following variable in the NHL data set. The data set nhlfans3301 is available on Carmen.

Variable	Description
`team`	Team location (city)
`country`	USA or Canada
`avgPr`	Average ticket price\(^*\)
`pgAtt`	Per game attendance
`tts`	Total ticket spending = `avgPr * pgAtt * 41`
`estFans`	Estimated fans\(^{**}\)
`latitude`	Latitude of city

We will be interested in using latitude and estFans jointly to predict tts for our following question. Also, for this question, we will use log(tts) and log(estFans) for this question. Here log is scale is based on natural logarithm.

Make a scatterplot matrix for the variables latitude, log(estFans) and log(tts). Use the plots to describe the pairwise relationships between the variables.
Use R to fit a multiple linear regression model with mean function \(E(\mathtt{log(tts)} \mid \mathtt{log(estFans)}, \mathtt{latitude}) = \beta_0 + \beta_1 \mathtt{latitude} + \beta_2 \mathtt{log(estFans)}\) and variance function \(\mbox{Var}(\mathtt{log(tts)} \mid \mathtt{log(estFans)}, \mathtt{latitude}) = \sigma^2\). Use the summary function to display the standard summary of the fitted model.
Report the numeric values of the parameter estimates \(\hat{\beta}_j\), \(j = 0, 1, 2\), and \(\hat{\sigma}^2\).
Write the estimated mean function and the estimated varaicnce function fo this multiple linear regression.
Report the numeric value of the degrees of freedom associated with the fitted model. Say what formula you use to calculate this value.
Use R to compute and display the 3x3 matrix \(X^T X\), where X is the \(30 \times 3\) matrix used to fit the regression model.
Use R to compute and report the 3x3 matrix \((X^TX)^{-1}\).
Use the \((X^TX)^{-1}\) matrix along with \(\hat{\sigma}^2\) to compute the estimated covariance matrix for the estimated coefficient vector, \(\widehat{\mbox{Var}}(\hat{\boldsymbol \beta} \mid X)\). Show your code and computations. Verify your result by comparing your answer to the result obtained using the vcov function.
Compute and report the second leverage value, \(h_{22}\), for this MLR model in two ways: (i) using the hatvalues function and (ii) by computing the H matrix directly and reporting the second diagonal element.
Make a plot of the residuals \(\hat{e_i}\) versus the fitted values. Does the plot suggest any problems with our assumptions about the mean function?
Make a plot of the standardized residuals \(r_i\) versus the fitted values. Does the plot suggest any problems with our assumptions about the variance function?
Make plots of (i) histogram and (ii) normal probability plot for the residuals \(\hat{e_i}\). Do the plots suggest any problems about the normality assumption?
Report the coefficient of determination for this multiple linear regression. please interpret it.

nhl <- tibble::tribble(
  ~team,               ~tts,      ~estFans, ~country, ~latitude,  ~pop,
  "Toronto",        184443912,  5090000, "Canada",    43.7,  9696000,
  "Montreal",       125672544,  2280000, "Canada",  45.5017,  4714000,
  "Vancouver",      131942920,  2260000, "Canada",  49.2827,  3526000,
  "New York (Rangers)",112213392,1320000,"USA",   40.7127, 10994444,
  "Philadelphia",    79797070, 1150000, "USA",      39.95,  7432000,
  "Edmonton",       104940648, 1130000, "Canada",  53.5333,  1687000,
  "Boston",         103703760, 1090000, "USA",    42.3601,  6272000,
  "Calgary",        115097619, 1080000, "Canada",  51.0486,  1607000,
  "Chicago",        155300046, 1050000, "USA",    41.8369,  9339000,
  "Pittsburgh",     105334986,  850000, "USA",    40.4397,  3136000,
  "Ottawa",          75556686,  780000, "Canada",  45.4214,  1488000,
  "Detroit",         68972988,  740000, "USA",    42.3314,  4910000,
  "Los Angeles",     74886500,  670000, "USA",      34.05, 10325944,
  "Minnesota",       95932989,  620000, "USA",    44.9778,  4614000,
  "New Jersey",      59161155,  580000, "USA",    40.7127,  4397778,
  "New York (Islanders)",64126788,580000,"USA",   40.7127,  4397778,
  "Winnipeg",        88161931,  560000, "Canada",  49.8994,   976000,
  "Buffalo",         60180620,  530000, "USA",    42.9047,  1612000,
  "Washington",      76739782,  500000, "USA",    38.9047,  5851000,
  "Anaheim",         47736546,  390000, "USA",    33.8361,  6571056,
  "San Jose",        65960882,  350000, "USA",    37.3382,  6705000,
  "Colorado",        49741200,  330000, "USA",    39.7392,   933900,
  "Dallas",          56196650,  320000, "USA",    32.7767,  6910000,
  "St. Louis",       63868980,  320000, "USA",    38.6272,  3034000,
  "Tampa Bay",       57108982,  280000, "USA",    27.7625,  4462000,
  "Arizona",         35564425,  260000, "USA",      33.45,  4439000,
  "Carolina",        33046656,  180000, "USA",    35.7806,  2830000,
  "Columbus",        44516570,  180000, "USA",    39.9833,  2270000,
  "Florida",         26788170,  170000, "USA",    25.7753,  3745000,
  "Nashville",       53899092,  150000, "USA",    36.1667,  2456000
) %>%
  mutate(log_tts = log(tts),
         log_estFans = log(estFans))

Solutions to Question 3

Part 1:

pairs(~ latitude + log_estFans + log_tts, data = nhl,
      pch = 19, col = "steelblue", main = "NHL Scatterplot Matrix")

Part 2:

fit <- lm(log_tts ~ latitude + log_estFans, data = nhl)
summary(fit)

## 
## Call:
## lm(formula = log_tts ~ latitude + log_estFans, data = nhl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.27427 -0.15533 -0.00669  0.13298  0.49315 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.935854   0.615040  19.407  < 2e-16 ***
## latitude     0.014140   0.007546   1.874   0.0718 .  
## log_estFans  0.421247   0.057502   7.326 7.02e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1975 on 27 degrees of freedom
## Multiple R-squared:  0.826,  Adjusted R-squared:  0.8131 
## F-statistic: 64.08 on 2 and 27 DF,  p-value: 5.598e-11

Part 3:

b <- coef(fit); s2 <- summary(fit)$sigma^2
b0 <- round(b[1], 6); b1 <- round(b[2], 6); b2 <- round(b[3], 6); s2 <- round(s2, 6)

Part 4: The predicted (or fitted) value of the logarithm of total ticket spending (log(tts)) is equal to the intercept (b0) plus the coefficient on latitude (b1) times the latitude value plus the coefficient on the log of estimated fans (b2) times the log of estFans.
Part 5:

df <- df.residual(fit)
df

## [1] 27

Part 6:

X <- model.matrix(fit)
XtX <- t(X) %*% X
round(XtX, 2)

##             (Intercept) latitude log_estFans
## (Intercept)       30.00  1209.52      398.80
## latitude        1209.52 49936.67    16177.62
## log_estFans      398.80 16177.62     5321.52

Part 7:

XtX_inv <- solve(XtX)
round(XtX_inv, 6)

##             (Intercept)  latitude log_estFans
## (Intercept)    9.698909  0.036492   -0.837780
## latitude       0.036492  0.001460   -0.007173
## log_estFans   -0.837780 -0.007173    0.084777

Part 8:

V_manual <- s2 * XtX_inv
round(V_manual, 6)

##             (Intercept)  latitude log_estFans
## (Intercept)    0.378277  0.001423   -0.032675
## latitude       0.001423  0.000057   -0.000280
## log_estFans   -0.032675 -0.000280    0.003306

vcov(fit)

##              (Intercept)      latitude   log_estFans
## (Intercept)  0.378273779  1.423256e-03 -0.0326748360
## latitude     0.001423256  5.693712e-05 -0.0002797507
## log_estFans -0.032674836 -2.797507e-04  0.0033064553

Part 9:

h22a <- hatvalues(fit)[2]
H <- X %*% XtX_inv %*% t(X)
h22b <- H[2,2]
c(hatvalues = h22a, manual = h22b)

## hatvalues.2      manual 
##   0.1261212   0.1261212

Part 10:

plot(fitted(fit), residuals(fit), pch=19, col="darkred",
     xlab="Fitted", ylab="Residuals", main="Residuals vs Fitted")
abline(h=0, lty=2)

Part 11:

plot(fitted(fit), rstandard(fit), pch=19,
     xlab="Fitted", ylab="Std. Residual", main="Std Residuals")
abline(h=c(-2,0,2), lty=2, col="gray")

Part 12:

hist(residuals(fit), main="Histogram of Residuals", col="lightblue", breaks=10)

qqnorm(residuals(fit)); qqline(residuals(fit), col="red")

Part 13:

R2 <- summary(fit)$r.squared