Summary

I will work through problem set 1 from Linear Models with R (Faraway), ch. 3, p. 48, applying the inference techniques covered there to the prostate data set.

library(faraway)    # provides the prostate data set
library(tidyverse)  # %>% pipe and purrr::map_dbl used below
library(ellipse)    # joint confidence regions for lm fits
data(prostate)

Problem 1A

For the prostate data, fit a model with lpsa as the response and the other variables as predictors:

  1. Compute 90% and 95% CIs for the parameter associated with age. Using just these intervals, what could we have deduced about the p-value for age in the regression summary?

Solution 1A

l.model <- lm(lpsa ~ ., data = prostate)
summary(l.model)

Call:
lm(formula = lpsa ~ ., data = prostate)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7331 -0.3713 -0.0170  0.4141  1.6381 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.669337   1.296387   0.516  0.60693    
lcavol       0.587022   0.087920   6.677 2.11e-09 ***
lweight      0.454467   0.170012   2.673  0.00896 ** 
age         -0.019637   0.011173  -1.758  0.08229 .  
lbph         0.107054   0.058449   1.832  0.07040 .  
svi          0.766157   0.244309   3.136  0.00233 ** 
lcp         -0.105474   0.091013  -1.159  0.24964    
gleason      0.045142   0.157465   0.287  0.77503    
pgg45        0.004525   0.004421   1.024  0.30886    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7084 on 88 degrees of freedom
Multiple R-squared:  0.6548,    Adjusted R-squared:  0.6234 
F-statistic: 20.86 on 8 and 88 DF,  p-value: < 2.2e-16
stats::confint(object = l.model, parm = c("age"), level = .90)
           5 %         95 %
age -0.0382102 -0.001064151
stats::confint(object = l.model, parm = c("age"), level = .95)
          2.5 %      97.5 %
age -0.04184062 0.002566267

Since the 90% interval excludes zero while the 95% interval contains it, from the intervals alone we could have deduced that the p-value for age lies between .05 and .10 (the regression summary confirms this: p ≈ .0823). Therefore, at the 5% significance level we cannot call age a significant predictor.
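As a quick sanity check (a minimal sketch reusing l.model from above), we can pull the p-value for age directly out of the coefficient table and confirm that it falls between .05 and .10:

# Extract Pr(>|t|) for age from the summary's coefficient matrix
p_age <- summary(l.model)$coefficients["age", "Pr(>|t|)"]
p_age > 0.05 & p_age < 0.10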

Problem 1B

  1. Compute and display a 95% joint confidence region for the parameters associated with age and lbph. Plot the origin on this display. The location of the origin on the display tells us the outcome for a certain hypothesis test. State that test and its outcome.

Solution 1B

graphics::plot(ellipse(l.model, which = c('age', 'lbph'), level = 0.95),
                type = "l",
                main = '95% Joint Confidence Region')

graphics::points(0, 0, pch = 21)                                         # the origin
graphics::points(coef(l.model)["age"], coef(l.model)["lbph"], pch = 19)  # the point estimate

graphics::abline(v = confint(l.model)['age', ], lty = 2)   # marginal 95% CI for age
graphics::abline(h = confint(l.model)['lbph', ], lty = 2)  # marginal 95% CI for lbph

The location of the origin answers the joint test of H0: β_age = β_lbph = 0. Since the origin lies inside the 95% joint confidence region, we fail to reject that null hypothesis at the 5% level.
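As a cross-check (a minimal sketch reusing l.model), the same joint hypothesis can be tested with an F-test comparing the full model against one with age and lbph removed; consistent with the plot, its p-value should exceed .05:

# Drop age and lbph together and compare against the full model
l.model.joint <- update(l.model, . ~ . - age - lbph)
anova(l.model.joint, l.model)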

Problem 1C

  1. In the text, we made a permutation test corresponding to the F-test for the significance of all the predictors. Execute the permutation test corresponding to the t-test for age in the model. (Hint: summary(g)$coef[4,3] gets you the t-statistic you need if the model is called g.)

Solution 1C

set.seed(315)
t_value <- summary(l.model) %>% coef() %>% .['age', 't value']

# reference p 41: permute the response, refit the full model, and collect
# the t-statistic for age from each permuted fit
permutation_test <- function(nreps) {
    map_dbl(1:nreps, ~ lm(sample(lpsa) ~ ., data = prostate) %>%
            summary() %>%
            coef() %>%
            .['age', 't value'])
}

mean(abs(permutation_test(4000)) > abs(t_value))
[1] 0.07975
mean(abs(permutation_test(8000)) > abs(t_value))
[1] 0.0845

Note from Solution 1A that the p-value for age is ~.0823, and both permutation estimates land close to it. As expected, computing more random permutations (here going from 4K to 8K) reduces the variability in the estimate of the p-value. Recall, “Since the permutation test does not depend on the assumption of normality, we might regard it as superior to the normal theory based value” (p 41).
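To put a number on that variability (a back-of-the-envelope sketch, not from the text): the Monte Carlo standard error of a permutation p-value estimate is approximately sqrt(p(1 - p)/nreps), so doubling nreps shrinks it by a factor of about sqrt(2):

# Approximate standard error of the estimated p-value at each nreps,
# plugging in the 8000-rep estimate for p
p_hat <- 0.0845
sqrt(p_hat * (1 - p_hat) / c(4000, 8000))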

Problem 1D

  1. Remove all the predictors that are not significant at the 5% level. Test this model against the original model. Which model is preferred?

Solution 1D

l.model2 <- update(l.model, . ~ lcavol + lweight + svi)
summary(l.model2)

Call:
lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.72964 -0.45764  0.02812  0.46403  1.57013 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.26809    0.54350  -0.493  0.62298    
lcavol       0.55164    0.07467   7.388  6.3e-11 ***
lweight      0.50854    0.15017   3.386  0.00104 ** 
svi          0.66616    0.20978   3.176  0.00203 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7168 on 93 degrees of freedom
Multiple R-squared:  0.6264,    Adjusted R-squared:  0.6144 
F-statistic: 51.99 on 3 and 93 DF,  p-value: < 2.2e-16
anova(l.model, l.model2)
Analysis of Variance Table

Model 1: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + 
    pgg45
Model 2: lpsa ~ lcavol + lweight + svi
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     88 44.163                           
2     93 47.785 -5   -3.6218 1.4434 0.2167

The ANOVA compares the reduced model (model 2) against the full model (model 1). The F-test p-value of 0.2167 (Pr(>F) column) shows that the five dropped predictors do not significantly improve the fit: the RSS (Sum of Sq column) rises only slightly, from 44.16 to 47.79. Since the extra predictors add nothing significant, the smaller model 2 is preferred. Comparing the two model summaries points the same way: adjusted R-squared barely drops (0.6234 to 0.6144) despite using five fewer predictors.
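As a final sketch, the F statistic in the table can be reproduced by hand from the two residual sums of squares and their degrees of freedom:

# F = ((RSS_reduced - RSS_full) / (df_reduced - df_full)) / (RSS_full / df_full)
rss_full <- 44.163; df_full <- 88
rss_red  <- 47.785; df_red  <- 93
((rss_red - rss_full) / (df_red - df_full)) / (rss_full / df_full)  # ~1.4434, matching the ANOVA table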