This problem continues to use the data set prostate. Please use head(prostate) to see the variables (columns) in this data frame. The data come from a study that examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical measures on 97 men who were about to receive a radical prostatectomy. We use lpsa, the log of PSA, as the response. The predictors are log cancer volume (lcavol), log prostate weight (lweight), age, log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).
library(faraway)
head(prostate)
Here n is the number of observations (rows) and p is the number of predictors. The study observes 97 men and uses 8 predictors (lcavol, lweight, age, lbph, svi, lcp, gleason, and pgg45), so n = 97 and p = 8.
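As a quick sanity check (a small sketch using base R's dim(), nrow(), and ncol()), the dimensions can be read off the data frame directly:
# Sanity check: 97 rows, 9 columns (8 predictors plus the response lpsa)
dim(prostate)
n <- nrow(prostate)       # 97
p <- ncol(prostate) - 1   # 8 predictors (lpsa is the response)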
# Build the design matrix X from the eight predictors
design.matrix <- as.matrix(prostate[, c("lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45")])
# Add a column of 1s for the intercept
design.matrix <- cbind(1, design.matrix)
head(design.matrix, 3)
## lcavol lweight age lbph svi lcp gleason pgg45
## 1 1 -0.5798185 2.7695 50 -1.386294 0 -1.38629 6 0
## 2 1 -0.9942523 3.3196 58 -1.386294 0 -1.38629 6 0
## 3 1 -0.5108256 2.6912 74 -1.386294 0 -1.38629 7 20
# the response variable
y <- as.matrix(prostate$lpsa)
# Least squares estimator via the normal equations: beta_hat = (X'X)^(-1) X'y
beta_hat <- solve(t(design.matrix) %*% design.matrix) %*% t(design.matrix) %*% y
print(beta_hat)
## [,1]
## 0.669336698
## lcavol 0.587021826
## lweight 0.454467424
## age -0.019637176
## lbph 0.107054031
## svi 0.766157326
## lcp -0.105474263
## gleason 0.045141598
## pgg45 0.004525231
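As a side note, an equivalent and numerically friendlier way to compute the same estimate is to solve the normal equations X'X b = X'y directly instead of explicitly inverting X'X. A short sketch using base R's crossprod() (beta_hat_alt is just an illustrative name):
# Solve X'X beta = X'y without forming the explicit inverse
beta_hat_alt <- solve(crossprod(design.matrix), crossprod(design.matrix, y))
all.equal(c(beta_hat), c(beta_hat_alt))  # should be TRUE up to rounding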
model <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate)
# Extract the coefficients (beta hat) from the fitted model
print(coef(model))
## (Intercept) lcavol lweight age lbph svi
## 0.669336698 0.587021826 0.454467424 -0.019637176 0.107054031 0.766157326
## lcp gleason pgg45
## -0.105474263 0.045141598 0.004525231
The results of (c) and (d) are identical because both methods compute the same least squares estimator, beta hat. The matrix computation in (c) is essentially what lm() does behind the scenes; internally lm() uses a QR decomposition rather than explicitly inverting X'X, which is numerically more stable, but it yields the same estimate.
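A quick check (a sketch; all.equal() compares numbers up to floating-point tolerance) confirms the two estimates agree:
# Compare the hand-computed estimate with lm()'s coefficients
all.equal(unname(c(beta_hat)), unname(coef(model)))  # should be TRUE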
# The vertical distances are the residuals, so the question is asking for the sum of squared residuals (SSE)
# Calculate the fitted values
fitted_values <- design.matrix %*% beta_hat
# Calculate the residuals
residuals <- y - fitted_values
# Calculate the sum of squared residuals
SSE <- sum(residuals^2)
print(SSE)
## [1] 44.16302
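As a cross-check (a sketch using a built-in lm() accessor), deviance() on a fitted lm object returns the residual sum of squares and should match the value above:
# deviance() on an lm fit is the residual sum of squares
deviance(model)  # should also be about 44.163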
The work for this part is provided on an attached PDF.
model <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45, data = prostate)
#this gets sigma (the residual standard error) from the model and squares it
sigma_squared <- summary(model)$sigma^2
(sigma_squared)
## [1] 0.5018525
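This matches the manual formula sigma^2 = SSE / (n - p - 1), since 44.16302 / (97 - 8 - 1) = 44.16302 / 88 = 0.5018525. A one-line check, reusing the SSE computed above:
# Manual check: sigma^2 = SSE / (n - p - 1)
SSE / (nrow(prostate) - 8 - 1)  # 0.5018525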
fitted_values <- fitted(model)
residual_values <- resid(model)
SST <- sum((prostate$lpsa - mean(prostate$lpsa))^2)
SSR <- sum((fitted_values - mean(prostate$lpsa))^2)
SSE <- sum(residual_values^2)
cat("\n--- Sum of Squares ---\n")
##
## --- Sum of Squares ---
cat("SST (Total Sum of Squares):", SST, "\n")
## SST (Total Sum of Squares): 127.9176
cat("SSR (Sum of Squares Regression):", SSR, "\n")
## SSR (Sum of Squares Regression): 83.75456
cat("SSE (Sum of Squares Error):", SSE, "\n")
## SSE (Sum of Squares Error): 44.16302
R2 <- 1 - (SSE / SST)
(R2)
## [1] 0.6547541
# Equivalently, R^2 = SSR / SST
(SSR/SST)
## [1] 0.6547541
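Both match lm()'s built-in R^2, and the decomposition SST = SSR + SSE holds exactly for least squares with an intercept. A quick check (sketch):
# Cross-check against summary() and the sum-of-squares decomposition
summary(model)$r.squared   # should be 0.6547541
all.equal(SST, SSR + SSE)  # should be TRUE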
# Inner products of the residual vector with each column of the design matrix
inner_products <- t(residuals) %*% design.matrix
print(inner_products)
## lcavol lweight age lbph
## [1,] 9.734435e-13 1.954736e-12 3.78637e-12 6.016521e-11 4.147051e-13
## svi lcp gleason pgg45
## [1,] -1.474376e-13 4.339842e-13 5.223377e-12 2.095746e-11
# These values should be exactly 0: the residual vector is orthogonal to every column of X.
# They differ from 0 only by floating-point rounding, and all are extremely small
# (the largest is about 6.0e-11), so the orthogonality holds numerically.
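The orthogonality follows directly from the normal equations: X'X beta_hat = X'y implies X'(y - X beta_hat) = X'e = 0. A one-line check of how far the computed values are from exact zero:
# Largest absolute inner product: zero up to machine precision
max(abs(inner_products))  # about 6e-11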
# simple linear regression model
slr_model <- lm(lpsa ~ lcavol, data = prostate)
slr_coef <- coef(slr_model)["lcavol"]
# MLR model (already created above; just extracting the coefficient)
mlr_coef <- coef(model)["lcavol"]
cat("The SLR Coefficient for lcavol: ",slr_coef, "\n")
## The SLR Coefficient for lcavol: 0.7193201
cat("The MLR Coefficient for lcavol: ",mlr_coef)
## The MLR Coefficient for lcavol: 0.5870218
In SLR, the coefficient for lcavol (0.7193201) measures the total (marginal) association between lcavol and lpsa, ignoring all other variables. In MLR, the coefficient for lcavol (0.5870218) measures the association after adjusting for the other predictors, i.e., holding them fixed. The two differ because lcavol is correlated with other predictors (such as svi and lcp) that are themselves associated with lpsa, so in the simple model part of their effect is attributed to lcavol, inflating its slope. Neither coefficient is wrong; they answer different questions. The MLR coefficient is the appropriate one when we want the effect of lcavol with the other clinical measures held constant.
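As supporting evidence for this adjustment story, a short sketch (the specific predictors chosen here are illustrative) shows that lcavol is correlated with several of the other predictors, which is why the SLR slope absorbs part of their association with lpsa:
# lcavol's correlation with other predictors that also relate to lpsa
round(cor(prostate$lcavol, prostate[, c("svi", "lcp", "gleason", "pgg45")]), 2)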