library(tidyverse)
library(ggplot2)BIOS 507 – Homework 2
Background
Problem 1
- Write the model matrix, X.
Answer:
Model Matrix X:
matrix_x <- matrix(c(1,1,1,1,1,8,4,0,-4,-8), nrow = 5, ncol = 2)
print(matrix_x) [,1] [,2]
[1,] 1 8
[2,] 1 4
[3,] 1 0
[4,] 1 -4
[5,] 1 -8
- Find Y′Y by hand.
Answer: Matrix Y’ is [1x5] and Matrix Y is [5x1]. Matrix Y’Y will be [1x1].
Element (1,1) = 7.8 * 7.8 + 9.0 * 9.0 + 10.2 * 10.2 + 11.0 * 11.0 + 11.7 * 11.7 = 60.84 + 81.0 + 104.04 + 121.0 + 136.89 = 503.77
Print Matrix Y′Y, calculated by hand:
matrix_p1q2 <- matrix(c(503.77), nrow = 1, ncol = 1)
print(matrix_p1q2) [,1]
[1,] 503.77
Check calculation using R:
matrix_y <- matrix(c(7.8, 9.0, 10.2, 11.0, 11.7), nrow = 5, ncol = 1)
print(matrix_y) [,1]
[1,] 7.8
[2,] 9.0
[3,] 10.2
[4,] 11.0
[5,] 11.7
matrix_y_t = t(matrix_y)
print(matrix_y_t) [,1] [,2] [,3] [,4] [,5]
[1,] 7.8 9 10.2 11 11.7
matrix_yty = matrix_y_t %*% matrix_y
print(matrix_yty) [,1]
[1,] 503.77
- Find X′X by hand, where X is the model matrix.
Answer: Matrix X′ is [2x5] and Matrix X is [5x2]. Matrix X′X will be [2x2].
Element (1,1) = 1*1 + 1*1 + 1*1 + 1*1 + 1*1 = 1 + 1 + 1 + 1 + 1 = 5
Element (1,2) = 1*8 + 1*4 + 1*0 + 1*(-4) + 1*(-8) = 8 + 4 + 1 - 4 - 8 = 0
Element (2,1) = 8*1 + 4*1 + 0*1 + (-4)*1 + (-8)*1 = 8 + 4 + 1 - 4 - 8 = 0
Element (2,2) = 8*8 + 4*4 + 0*0 + (-4)*(-4) + (-8)*(-8) = 64 + 16 + 0 + 16 + 84 = 160
Print Matrix X′X, calculated by hand:
matrix_xtx_byhand <- matrix(c(5,0,0,160), nrow = 2, ncol = 2)
print(matrix_xtx_byhand) [,1] [,2]
[1,] 5 0
[2,] 0 160
Check calculation using R:
print(matrix_x) [,1] [,2]
[1,] 1 8
[2,] 1 4
[3,] 1 0
[4,] 1 -4
[5,] 1 -8
matrix_x_t = t(matrix_x)
print(matrix_x_t) [,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 1 1
[2,] 8 4 0 -4 -8
matrix_xtx = matrix_x_t %*% matrix_x
print(matrix_xtx) [,1] [,2]
[1,] 5 0
[2,] 0 160
- Find X′Y by hand.
Answer: Matrix X′ is [2x5] and Matrix Y is [5x1]. Matrix X′Y will be [2x1].
Element (1,1) = 1*7.8 + 1*9.0 + 1*10.2 + 1*11.0 + 1*11.7 = 7.8 + 9.0 + 10.2 + 11.0 + 11.7 = 49.7
Element (2,1) = 8*7.8 + 4*9.0 + 0*10.2 + (-4)*11.0 + (-8)*11.7 = 62.4 + 36 + 0 - 44 - 93.6 = -39.2
Print Matrix X′Y, calculated by hand:
matrix_xty_byhand <- matrix(c(49.7, -39.2), nrow = 2, ncol = 1)
print(matrix_xty_byhand) [,1]
[1,] 49.7
[2,] -39.2
Check calculation using R:
matrix_xty = matrix_x_t %*% matrix_y
print(matrix_xty) [,1]
[1,] 49.7
[2,] -39.2
- Evaluate (X′X)−1 by hand. Note that this involves inverting a 2 × 2 matrix. You can find instructions on how to do this online (for example https://www.mathsisfun.com/algebra/matrix-inverse.html).
Answer: From #3, we already have Matrix X′X:
print(matrix_xtx) [,1] [,2]
[1,] 5 0
[2,] 0 160
To find the inverse, I need to switch elements (1,1) and (2,2), reverse the signs on elements (1,2) and (2,1), and multiply the matrix by 1/(ad-bc).
Calculate by hand:
#Step 1: Switch elements (1,1) and (2,2) and reverse the signs on elements (1,2) and (2,1):
matrix_xtx_flipped <- matrix(c(160, 0, 0 , 5), nrow = 2, ncol = 2)
print(matrix_xtx_flipped) [,1] [,2]
[1,] 160 0
[2,] 0 5
#Step 2: Multiply the flipped matrix by the determinant, 1/(ad-bc)
#[use the original a,b,c,d values, not the flipped ones]
detX = 1/(5*160 - 0*0) #= 1/800
#1/800 * Matrix X′X:
matrix_xtx_inverse <- matrix(c(160/800, 0, 0 , 5/800), nrow = 2, ncol = 2)
print(matrix_xtx_inverse) [,1] [,2]
[1,] 0.2 0.00000
[2,] 0.0 0.00625
Check calculation using R:
matrix_xtx_inverse_R = solve(matrix_xtx)
print(matrix_xtx_inverse_R) [,1] [,2]
[1,] 0.2 0.00000
[2,] 0.0 0.00625
Problem 2
- Write the assumed model.
Answer: Y_i = \beta_0 +\beta_1*X_i +\epsilon_i
Y_i: Returns for the second film.
X_i: Returns for the first film.
\beta_0: The y-intercept.
\beta_1: The slope coefficient.
\epsilon_i: The random error term.
- Write the model matrix (it will be 4 × 2.)
p2q2_ModelX = matrix(c(1,1,1,1,90,85,87,103), nrow = 4, ncol = 2)
print(p2q2_ModelX) [,1] [,2]
[1,] 1 90
[2,] 1 85
[3,] 1 87
[4,] 1 103
- Find the least squares solution for β0 and β1 using either SAS or R.
df <- data.frame(
first = c(90, 85, 87, 103),
second = c(120, 90, 87, 111)
)
model <- lm(second ~ first, data = df)
coef(model)(Intercept) first
-0.9606099 1.1283355
summary(model)
Call:
lm(formula = second ~ first, data = df)
Residuals:
1 2 3 4
19.410 -4.948 -10.205 -4.258
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.9606 105.5606 -0.009 0.994
first 1.1283 1.1534 0.978 0.431
Residual standard error: 16.18 on 2 degrees of freedom
Multiple R-squared: 0.3236, Adjusted R-squared: -0.01455
F-statistic: 0.957 on 1 and 2 DF, p-value: 0.4311
- How would you interpret β1?
Answer: \beta_1 is the expected change in the second film’s returns for every 1 million dollar increase in the first film’s returns. For every additional 1 million dollars the first movie earns, the second movie is predicted to earn approximately 1.1283 million. In general, it seems like if the first film does well, the second film will do even better in terms of dollars in returns. Notably, the p-value for \beta_1 is > 0.05 indicating that this relationship is not statistically significant for \alpha = 0.05.
- How would you interpret β0? Is this meaningful?
Answer: \beta_0 is the predicted returns for the second film if the first film earned zero dollars. \beta_0 anchors the line at around -960,000 dollars. In real-life, a movie that earns no money would almost certainly never get a sequel, and it’s not very helpful to say a movie that earned zero dollars at the box office would have a sequel earn negative ~960,000 dollars; this term is not very meaningful for real-life application because a movie predicted to lose money would not be made, but the term is useful mathematically as a starting point for our regression line.
Problem 3
Part A: Given that the problem specifies that “non-smoker” is the reference level, the correct model matrix is X_A because each row in X_A that corresponds to a “non-smoker” has a second-column (smoking status) value of 0. Likewise Matrix X_B is the incorrect model matrix because it uses “smoker” as the reference level by assigning a value of 1 to each “non-smoker” in the second column (smoking status).
Part B: Exposure level has three levels (low, medium, high), therefore two dummy variables are needed. In the Model Matrix X, Column 1 is the intercept, and all rows will have a value of 1. Column 2 represents medium exposure. Column 3 represents high exposure. Rows with zeros in both Columns 2 and 3 represent low exposure. Rows with a 1 in Column 2 and a 0 in Column 3 represent medium exposure. Rows with a 0 in Column 2 and a 1 in Column 3 represent high exposure.
model_p3q2 = matrix(c(1,1,1,1,1,1,1,1,1,1,1,1,
0,0,1,1,0,0,0,0,1,1,0,0,
0,0,0,0,1,1,0,0,0,0,1,1), nrow = 12, ncol = 3)
print(model_p3q2) [,1] [,2] [,3]
[1,] 1 0 0
[2,] 1 0 0
[3,] 1 1 0
[4,] 1 1 0
[5,] 1 0 1
[6,] 1 0 1
[7,] 1 0 0
[8,] 1 0 0
[9,] 1 1 0
[10,] 1 1 0
[11,] 1 0 1
[12,] 1 0 1