Regression Homework 3

Question 1

This problem continues to use the data set prostate. Please use head(prostate) to see the variables (columns) in this data frame. The data come from a study that examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical measures on 97 men who were about to receive a radical prostatectomy. We use lpsa, the log of PSA, as the response. The predictors are log cancer volume (lcavol), log prostate weight (lweight), age (age), log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).

library(faraway)
head(prostate)

(a) What are n and p for this model and data?

n represents the number of observations (rows) and p represents the number of predictors. There are 97 men in the study and 8 predictors (lcavol, lweight, age, lbph, svi, lcp, gleason, and pgg45), so n = 97 and p = 8.
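
As a quick sanity check of these counts (dim() returns the number of rows and columns of the data frame):

# Sanity check: 97 rows and 9 columns (8 predictors plus the response lpsa)
dim(prostate)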

(b) Obtain the design matrix X. Print only the first 3 rows.

design.matrix <- as.matrix(prostate[, c("lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")])

#adding a column of 1s for the intercept
design.matrix <- cbind(1, design.matrix)

head(design.matrix, 3)
##         lcavol lweight age      lbph svi      lcp gleason pgg45
## 1 1 -0.5798185  2.7695  50 -1.386294   0 -1.38629       6     0
## 2 1 -0.9942523  3.3196  58 -1.386294   0 -1.38629       6     0
## 3 1 -0.5108256  2.6912  74 -1.386294   0 -1.38629       7    20
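
An equivalent way to build the design matrix, sketched below, is to let model.matrix() construct it from the model formula; it adds the intercept column (named "(Intercept)") automatically. X is just an illustrative name here.

# Alternative sketch: let model.matrix() build X, intercept column included
X <- model.matrix(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45,
                  data = prostate)
head(X, 3)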

(c) Calculate the least squares estimator without using the lm() function

# the response variable
y <- as.matrix(prostate$lpsa)

# Computing the least squares estimator (beta-hat) using linear algebra 
beta_hat <- solve(t(design.matrix) %*% design.matrix) %*% t(design.matrix) %*% y

print(beta_hat)
##                 [,1]
##          0.669336698
## lcavol   0.587021826
## lweight  0.454467424
## age     -0.019637176
## lbph     0.107054031
## svi      0.766157326
## lcp     -0.105474263
## gleason  0.045141598
## pgg45    0.004525231
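
Explicitly inverting t(design.matrix) %*% design.matrix works fine here; an equivalent, slightly more numerically stable sketch solves the normal equations directly with crossprod(). The name beta_hat2 is just illustrative, and it should match beta_hat up to rounding.

# Solve (X'X) b = X'y without forming an explicit inverse
beta_hat2 <- solve(crossprod(design.matrix), crossprod(design.matrix, y))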

(d) Calculate the least squares estimator using lm()

model <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate)

# get the coefficients, aka get beta hat
print(coef(model))
##  (Intercept)       lcavol      lweight          age         lbph          svi 
##  0.669336698  0.587021826  0.454467424 -0.019637176  0.107054031  0.766157326 
##          lcp      gleason        pgg45 
## -0.105474263  0.045141598  0.004525231
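
Since lpsa is the only column of prostate not used as a predictor, the same fit can be written more compactly with the "." formula shorthand (a sketch; model_dot is just an illustrative name):

# Equivalent fit: "." means "all remaining columns of prostate as predictors"
model_dot <- lm(lpsa ~ ., data = prostate)
coef(model_dot)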

(e) Compare the results of (c) and (d)

The results of (c) and (d) are identical because both methods compute the same least squares estimator (beta hat). The matrix calculation in (c) is essentially what lm() does behind the scenes, although lm() actually solves the least squares problem with a QR decomposition rather than by explicitly inverting t(X) %*% X.
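
One way to confirm this numerically is to compare the two coefficient vectors with all.equal(), which should return TRUE (up to floating-point error):

# Both estimates should agree up to floating-point error; this should print TRUE
all.equal(as.numeric(beta_hat), as.numeric(coef(model)))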

(f) Report the sum of squared vertical distances from the best regression surface

# the vertical distances are the residuals, so this question is asking for the sum of squared residuals (SSE)

# Calculate the fitted values
fitted_values <- design.matrix %*% beta_hat

# Calculate the residuals
residuals <- y - fitted_values

# Calculate the sum of squared residuals
SSE <- sum(residuals^2)
print(SSE)
## [1] 44.16302
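
As a cross-check, the same quantity can be pulled straight from the lm fit, since deviance() of a linear model is its residual sum of squares:

# Cross-check: for a linear model, deviance() returns the residual sum of squares
deviance(model)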

Question 2

See the attached PDF.

Question 3

(a) Fit the MLR model using all 8 predictors. Report σ̂², i.e., the MSE s²_e. (5 pts)

model <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate)

#this gets sigma (the residual standard error) from the model and squares it
sigma_squared <- summary(model)$sigma^2

(sigma_squared)
## [1] 0.5018525
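
Equivalently, the MSE is SSE divided by the residual degrees of freedom, n - p - 1 = 97 - 8 - 1 = 88; a quick manual check:

# Manual check: MSE = SSE / (n - p - 1); df.residual(model) is 97 - 8 - 1 = 88
sum(resid(model)^2) / df.residual(model)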

(b) Write the R code which will extract the vector of fitted values and the residual vector. (10 pts)

fitted_values   <- fitted(model)
residual_values <- resid(model)
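
Because the response decomposes exactly into fitted values plus residuals, the following comparison should return TRUE:

# Consistency check: y = fitted values + residuals, so this should print TRUE
all.equal(unname(fitted_values + residual_values), prostate$lpsa)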

(c) Calculate and report SST, SSR and SSE. (5 pts)

SST <- sum((prostate$lpsa - mean(prostate$lpsa))^2)

SSR <- sum((fitted_values - mean(prostate$lpsa))^2)

SSE <- sum(residuals^2)   # residuals here is the residual vector computed in Question 1(f)

cat("\n--- Sum of Squares ---\n")
## 
## --- Sum of Squares ---
cat("SST (Total Sum of Squares):", SST, "\n")
## SST (Total Sum of Squares): 127.9176
cat("SSR (Sum of Squares Regression):", SSR, "\n")
## SSR (Sum of Squares Regression): 83.75456
cat("SSE (Sum of Squares Error):", SSE, "\n")
## SSE (Sum of Squares Error): 44.16302
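
Because the model includes an intercept, these satisfy the decomposition SST = SSR + SSE, which can be verified directly:

# With an intercept in the model, SST = SSR + SSE; this should print TRUE
all.equal(SST, SSR + SSE)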

(d) Calculate and report the R2 using the sums of squares you obtained in Part (c). (5 pts)

R2 <- 1 - (SSE / SST)
(R2)
## [1] 0.6547541
#this is also the same as SSR/SST
(SSR/SST)
## [1] 0.6547541
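
As a cross-check, this matches the R2 reported by summary():

# Cross-check against the R-squared computed by summary.lm()
summary(model)$r.squared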

(e) Calculate and report the inner product of the residual vector with each of 1, x·1, x·2, …, x·8. (10 pts)

# Calculate inner products
inner_products <- t(residuals) %*% design.matrix
print(inner_products)
##                         lcavol     lweight          age         lbph
## [1,] 9.734435e-13 1.954736e-12 3.78637e-12 6.016521e-11 4.147051e-13
##                svi          lcp      gleason        pgg45
## [1,] -1.474376e-13 4.339842e-13 5.223377e-12 2.095746e-11
# In exact arithmetic these inner products are all 0: the least squares residual vector is orthogonal to every column of the design matrix.
# The tiny nonzero values here are just floating-point rounding error (the largest is about 6.0e-11).
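
A compact way to summarize this orthogonality is the largest inner product in absolute value, which is effectively zero:

# Largest absolute inner product between the residuals and any column of X
max(abs(crossprod(design.matrix, residuals)))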

(f) Perform the SLR using only one predictor lcavol. Compare the coefficient of lcavol under the SLR with the coefficient of lcavol you found in Part (a). (10 pts)

# simple linear regression model
slr_model <- lm(lpsa ~ lcavol, data = prostate)
slr_coef <- coef(slr_model)["lcavol"]

# MLR model (already fitted in Part (a); just extracting its lcavol coefficient)
mlr_coef <- coef(model)["lcavol"]

cat("The SLR Coefficient for lcavol: ",slr_coef, "\n")
## The SLR Coefficient for lcavol:  0.7193201
cat("The MLR Coefficient for lcavol: ",mlr_coef)
## The MLR Coefficient for lcavol:  0.5870218

In the SLR, the coefficient for lcavol (0.7193201) describes the marginal relationship between lcavol and the response lpsa, ignoring all other variables. In the MLR, the coefficient for lcavol (0.5870218) is a partial effect: it measures the association between lcavol and lpsa after adjusting for the other seven predictors. The two differ because lcavol is correlated with some of the other predictors, so part of the association that the SLR attributes to lcavol is instead absorbed by those other predictors in the MLR.
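
How much the coefficient changes depends on how strongly lcavol is correlated with the other predictors, which can be inspected directly (a sketch that just prints the pairwise correlations):

# Correlations of lcavol with the other seven predictors, rounded to 2 decimals
round(cor(prostate$lcavol,
          prostate[, c("lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")]), 2)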