I will work through Linear Models with R (Faraway), ch. 3, problem set 1 (p. 48), applying the chapter's inference techniques to the prostate data set.
library(faraway)
library(tidyverse)
library(ellipse)
data(prostate)
For the prostate data, fit a model with lpsa as the response and the other variables as predictors:
l.model <- lm(lpsa ~ ., data = prostate)
summary(l.model)
Call:
lm(formula = lpsa ~ ., data = prostate)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7331 -0.3713 -0.0170  0.4141  1.6381 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.669337   1.296387   0.516  0.60693    
lcavol       0.587022   0.087920   6.677 2.11e-09 ***
lweight      0.454467   0.170012   2.673  0.00896 ** 
age         -0.019637   0.011173  -1.758  0.08229 .  
lbph         0.107054   0.058449   1.832  0.07040 .  
svi          0.766157   0.244309   3.136  0.00233 ** 
lcp         -0.105474   0.091013  -1.159  0.24964    
gleason      0.045142   0.157465   0.287  0.77503    
pgg45        0.004525   0.004421   1.024  0.30886    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7084 on 88 degrees of freedom
Multiple R-squared:  0.6548, Adjusted R-squared:  0.6234 
F-statistic: 20.86 on 8 and 88 DF,  p-value: < 2.2e-16
stats::confint(object = l.model, parm = c("age"), level = .90)
           5 %         95 %
age -0.0382102 -0.001064151
stats::confint(object = l.model, parm = c("age"), level = .95)
          2.5 %      97.5 %
age -0.04184062 0.002566267
The 90% interval for age excludes zero, but widening it to 95% pulls zero inside (the upper bound moves from -0.0011 to 0.0026). This is consistent with the regression summary, where the p-value for age (0.08229) falls between 0.05 and 0.10. At the conventional 0.05 level, then, we cannot call age a significant predictor.
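To make the duality between the interval and the t test explicit, we can reproduce the age interval and p-value by hand from the coefficient table (a minimal sketch using only base R; the 88 residual degrees of freedom come from the summary above):
est <- coef(summary(l.model))['age', 'Estimate']
se  <- coef(summary(l.model))['age', 'Std. Error']
df  <- df.residual(l.model)                # 88
est + c(-1, 1) * qt(0.975, df) * se        # matches confint() at the 95% level
2 * pt(-abs(est / se), df)                 # matches the ~0.0823 p-value in the summary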
graphics::plot(ellipse(l.model, c('age', 'lbph'), level = 0.95), type = "l",
               main = '95% Joint Confidence Region')  # level is an argument to ellipse(), not plot()
graphics::points(0, 0, pch = 21)  # the origin (both coefficients zero)
graphics::points(coef(l.model)["age"], coef(l.model)["lbph"], pch = 19)  # the point estimate
graphics::abline(v = confint(l.model)['age', ], lty = 2)   # marginal 95% CI for age
graphics::abline(h = confint(l.model)['lbph', ], lty = 2)  # marginal 95% CI for lbph
We can see from the plot that the origin lies within the joint confidence region; therefore we do not reject, at the 5% level, the null hypothesis that the age and lbph coefficients are both zero.
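The same joint hypothesis can be checked numerically with an F test, dropping both predictors and comparing against the full model (a short sketch using base R's anova(); since the two procedures are equivalent, its p-value should agree with the graphical conclusion):
# H0: the age and lbph coefficients are both zero
reduced <- update(l.model, . ~ . - age - lbph)
anova(reduced, l.model)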
set.seed(315)
# observed t statistic for age from the original fit
t_value <- summary(l.model) %>% coef() %>% .['age', 't value']
# reference p 41
# shuffle the response, refit the full model, and keep the t statistic for age
permutation_test <- function(nreps) {
  map_dbl(1:nreps, ~ lm(sample(lpsa) ~ ., data = prostate) %>%
            summary() %>%
            coef() %>%
            .['age', 't value'])
}
# estimated p-value: proportion of permuted |t| values exceeding the observed |t|
mean(abs(permutation_test(4000)) > abs(t_value))
[1] 0.07975
mean(abs(permutation_test(8000)) > abs(t_value))
[1] 0.0845
Note from question 1a that the normal-theory p-value for age is ~0.0823, and both permutation estimates land close to it. As expected, computing more random permutations (here 4,000 versus 8,000) reduces the variability in the estimate of the p-value. Recall: “Since the permutation test does not depend on the assumption of normality, we might regard it as superior to the normal theory based value. (p 41)”
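That variability claim can be quantified: the Monte Carlo standard error of a permutation p-value is roughly sqrt(p(1 - p) / nreps), so doubling nreps shrinks it by a factor of sqrt(2) (a back-of-the-envelope sketch, plugging in the normal-theory p ≈ 0.0823):
p_hat <- 0.0823                      # normal-theory p-value for age
sqrt(p_hat * (1 - p_hat) / 4000)     # ~0.0043 with 4,000 permutations
sqrt(p_hat * (1 - p_hat) / 8000)     # ~0.0031 with 8,000 permutations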
l.model2 <- update(l.model, . ~ lcavol + lweight + svi)
summary(l.model2)
Call:
lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.72964 -0.45764  0.02812  0.46403  1.57013 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.26809    0.54350  -0.493  0.62298    
lcavol       0.55164    0.07467   7.388  6.3e-11 ***
lweight      0.50854    0.15017   3.386  0.00104 ** 
svi          0.66616    0.20978   3.176  0.00203 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7168 on 93 degrees of freedom
Multiple R-squared:  0.6264, Adjusted R-squared:  0.6144 
F-statistic: 51.99 on 3 and 93 DF,  p-value: < 2.2e-16
anova(l.model, l.model2)
Analysis of Variance Table

Model 1: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + 
    pgg45
Model 2: lpsa ~ lcavol + lweight + svi
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     88 44.163                           
2     93 47.785 -5   -3.6218 1.4434 0.2167
From the ANOVA we can see that the five extra predictors in model 1 do not provide a significantly better fit (Pr(>F) = 0.2167). The RSS column does show a slightly worse fit for model 2 (47.785 versus 44.163), but that small increase is the expected cost of dropping five predictors, and the F test says the loss is not significant. Comparing the two model summaries tells a similar story: the adjusted R-squared barely drops (0.6234 to 0.6144), so the smaller model 2 is the preferable, more parsimonious choice.
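As a quick check on that summary-based comparison, we can put the usual criteria side by side (a minimal sketch using base R; AIC should come out slightly lower for l.model2, echoing the parsimony argument):
# adjusted R-squared and AIC for the full and reduced fits
c(full = summary(l.model)$adj.r.squared, reduced = summary(l.model2)$adj.r.squared)
AIC(l.model, l.model2)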